Track word count over time


#1

I am writing my thesis using some LaTeX packages in Atom.
To keep track of my progress, I would like to monitor the number of words, characters, etc. I have written since I started the project.
Is there a package that can do that for me?


#2

Many.


#3

Rather none.
None seem to track word count etc. over time.
I want to be able to plot my progress; i.e. word-count by day or similar.


#4

Wow, looks like it is the hello world of packages… :slight_smile: Some authors haven’t yet found out how to change the default description!

@Stingery It is an interesting idea, and like @DamnedScholar I initially didn’t understand your request. It could be a good task for one of these beginner package writers. It would need to save, daily, how many words there are in a given file (deciding when to save is one of the hard parts; perhaps regularly until the day is done). It would need some kind of database, or a well-structured text document. It should keep track of file renaming, at least when you do that inside Atom. It should be able to show a progress chart, or perhaps something like the GitHub contribution checkerboard.
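
To make the "well-structured text document" idea concrete, here is a minimal sketch in Python (not an Atom package). It assumes a CSV log with one row per day and a very naive definition of a word; the names `record_daily_count`, `count_words` and the log layout are made up for illustration.

```python
import csv
import datetime
import os
import re

WORD = re.compile(r"\w+")  # assumption: a "word" is any run of letters or digits


def count_words(text):
    """Return the number of word-like tokens in text."""
    return len(WORD.findall(text))


def record_daily_count(source_path, log_path, today=None):
    """Store one row per day (date, word count) in a CSV log.

    Running it several times on the same day overwrites that day's row,
    so the last run of the day wins.
    """
    today = today or datetime.date.today().isoformat()
    with open(source_path, encoding="utf-8") as src:
        total = count_words(src.read())

    # Load the existing log, if any, into a {date: count} mapping.
    rows = {}
    if os.path.exists(log_path):
        with open(log_path, newline="", encoding="utf-8") as log:
            for day, words in csv.reader(log):
                rows[day] = int(words)

    rows[today] = total
    with open(log_path, "w", newline="", encoding="utf-8") as log:
        csv.writer(log).writerows(sorted(rows.items()))
    return rows
```

Calling it regularly (from a scheduler, or on every save) sidesteps the "when" problem, since only the last value of each day is kept. It does not handle file renaming.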

Or perhaps it could be an external tool that uses your Git history (if you use Git) to compute the evolution of your work.
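
A rough sketch of that Git-based approach, assuming `git` is on the PATH, the file path is given relative to the repository root, and the file was never deleted or renamed along the way. The function names are hypothetical, and the word definition is again deliberately naive.

```python
import re
import subprocess

WORD = re.compile(r"\w+")  # assumption: a "word" is any run of letters or digits


def parse_log(log_output):
    """Turn 'hash date' lines from git log into (hash, date) pairs."""
    return [tuple(line.split()) for line in log_output.splitlines() if line.strip()]


def word_count_history(repo_dir, file_path):
    """Yield (date, word count) for each commit touching file_path, oldest first."""
    log = subprocess.run(
        ["git", "log", "--reverse", "--format=%H %ad", "--date=short", "--", file_path],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    for commit, date in parse_log(log):
        # 'git show <commit>:<path>' prints the file as it was in that commit.
        blob = subprocess.run(
            ["git", "show", f"{commit}:{file_path}"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        ).stdout
        yield date, len(WORD.findall(blob))
```

Something like `list(word_count_history(".", "thesis.tex"))` would then give one data point per commit, which only reflects real progress if you commit regularly.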


#5

This gets the work done in Python:

wordsFile="yourfile.foo"
statFile="stats.json"

import re
rule=re.compile(r"[a-zA-Z0-9]")

wordCount=0

with open(wordsFile,'r') as fichier:
    text=fichier.read()
lines=text.split('\n')
for line in lines:
    words=line.split(' ')
    for word in words:
        if rule.search(word):
            wordCount+=1

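import json
try:
    with open(statFile,'r') as fichier:
        dic=json.load(fichier)
        dic['#lastCount']  # raises KeyError if the key is missing
except (OSError,ValueError,KeyError):  # stat file missing, not valid JSON, or lacking '#lastCount'
    dic={}
    dic['#lastCount']=0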

import time
timeString=time.strftime('%c')

dic[timeString]={}
dic[timeString]['totalWords']=wordCount
dic[timeString]['newWords']=wordCount-dic['#lastCount']
dic['#lastCount']=wordCount

with open(statFile,'w') as fichier:  # open for writing so the updated stats are saved
    json.dump(dic,fichier,indent=2)

This needs you to install a Python interpreter (WinPython’s the best on the web) and, if on Windows, to have the interpreter in your PATH. Then launch the file any time you want a stat, or have a batch job launch it regularly for you.

OR maybe some good fella’s gonna turn this, or anything that works, into an Atom package, and your topic will thus be fully answered :smile: :sunny:
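
Since the original request was to plot progress by day, here is a hedged sketch that reads a stat file shaped like the one the script above writes (keys produced by `time.strftime('%c')`, plus `#lastCount`) and renders a plain-text bar chart. It assumes the stat file is read under the same locale it was written with, because `%c` is locale-dependent; `words_per_day` and `text_chart` are made-up names.

```python
import datetime


def words_per_day(stats):
    """Map each ISO date to the last 'totalWords' value recorded that day."""
    entries = []
    for key, value in stats.items():
        if key.startswith("#"):
            continue  # skip bookkeeping keys such as '#lastCount'
        stamp = datetime.datetime.strptime(key, "%c")
        entries.append((stamp, value["totalWords"]))
    daily = {}
    for stamp, total in sorted(entries):
        daily[stamp.date().isoformat()] = total  # later entries overwrite earlier ones
    return daily


def text_chart(daily, width=40):
    """Return one bar of '#' per day, scaled to the largest count."""
    peak = max(daily.values(), default=0) or 1
    return "\n".join(
        f"{day} {'#' * (total * width // peak)} {total}"
        for day, total in sorted(daily.items())
    )
```

To use it, load `stats.json` with `json.load` and print `text_chart(words_per_day(stats))`. The same `daily` mapping could be fed to a real plotting library instead.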


#7

i’ll try to get into package making soon but here are some milestones if someone has the keys in hand

  • https://atom.io/packages/count-word seems to be a complete stat collector
  • the said package could cooly be a fork of the package below
  • could add a “create stat file for this script file” button on the count-word popup
  • some file stored in the Atom folders could store which statFile to link with which scriptFile
  • settings could tell whether to update the existing statFile each time a file is edited, and even create a stat for each opened file

if i find time before anybody that would be a nice training for atom package creation
:heart_decoration: keep coding


#8

Out of curiosity, I had a look at this package, and as I feared from the screenshot, it uses a method a bit too simplistic for counting words, just getting the number of items separated by some form of spaces.
Won’t work in French, where some punctuation signs are isolated, like

« Je suis déçu ! »

and even with some form of English — some signs can be isolated.
Plus the example is given on an HTML sample, and ideally it should exclude the markup… (but I admit it is hard)
Even Markdown would generate wrong word count, with the titles or similar.

Not to bash this specific package, I guess most of those listed use a similar simplistic method…
Hey, I am not a Pythonista, but from what I understand, you use a similar method too… :slight_smile:

Oh well, I suppose it depends on the definition of “word”, which differs between programs anyway.
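
To illustrate the difference, here is a small Python sketch comparing whitespace splitting with a slightly less naive count. It assumes a word is a run of Unicode letters or digits and strips HTML tags with a crude regex (not a real parser); `count_words` is just an illustrative name.

```python
import re

TAG = re.compile(r"<[^>]+>")  # crude HTML tag matcher, not a real parser
WORD = re.compile(r"\w+")     # Unicode-aware in Python 3, so it matches 'déçu'


def count_words(text):
    """Count runs of letters/digits after replacing HTML tags with spaces."""
    return len(WORD.findall(TAG.sub(" ", text)))


french = "« Je suis déçu ! »"
naive = len(french.split())    # counts «, ! and » as words
better = count_words(french)   # counts only Je, suis, déçu
```

This still has its own quirks: an apostrophe splits aujourd'hui into two tokens, and a lone underscore counts as a word, so it is one possible definition rather than the right one.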


#9

the python script i posted filters the words with the regex

rule=re.compile(r"[a-zA-Z0-9]")

In other words, a sequence of characters is counted as a word only if it contains at least one lowercase letter a-z, one uppercase letter A-Z, or one digit 0-9. You can define your own rule with the help of the wonderful website https://regex101.com/
If you want to count bidule.truc as 2 separate words, you can add levels to the for-splitting ladder like this:

for line in lines:
    words=line.split(' ')
    for word in words:
        pointWords=word.split('.')
        for pointWord in pointWords:
            .....
            if rule.search(pointWord):#here is where the regex filters words
                wordCount+=1

I don’t know about the Atom packages, but if someone develops one that fits your question, adding a custom regex and custom splitting could be interesting too.
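
One way the custom regex and custom splitting could be exposed is sketched below. Instead of nesting one loop per separator, it assumes all separator characters are passed as a single string and split on in one go; the function name and its parameters are hypothetical.

```python
import re


def count_words(text, rule=r"[a-zA-Z0-9]", separators=" \n"):
    """Split text on any separator character, then count pieces matching rule.

    separators must be a non-empty string of single characters.
    """
    splitter = re.compile("[" + re.escape(separators) + "]+")
    matcher = re.compile(rule)
    return sum(1 for piece in splitter.split(text) if matcher.search(piece))


default_count = count_words("bidule.truc machin")               # '.' is not a separator
dotted_count = count_words("bidule.truc machin", separators=" \n.")  # '.' also splits
```

With the default separators bidule.truc stays one word; adding '.' to the separators makes it two, without another level in the loop ladder.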


#10

Great! I only saw the split(' '), and missed the additional rule. Which can be, for most cases, r"[a-zA-Z0-9'-]" but indeed this will vary with language, specific needs, etc.


#11

This would count ’ and - as full words, meaning that
John said ’ hello ’ - Mary did ’ nt answer
would be 11 words, which is fine if that’s what you want.

As it is used, the rule says a word is valid AS SOON as one of its characters matches, so didn’t counts as 1 word even though ’ is not in the regex.
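
A quick check of both claims, as a sketch that mirrors the script's split-on-space logic. It uses straight apostrophes, because the typographic ’ shown in the posts is a different character and would not match `'` in the regex.

```python
import re


def count(text, pattern):
    """Count space-separated pieces containing at least one match of pattern."""
    rule = re.compile(pattern)
    return sum(1 for word in text.split(" ") if rule.search(word))


sentence = "John said ' hello ' - Mary did ' nt answer"
original = count(sentence, r"[a-zA-Z0-9]")    # isolated ' and - are ignored
extended = count(sentence, r"[a-zA-Z0-9'-]")  # isolated ' and - now count as words
contraction = count("didn't", r"[a-zA-Z0-9]")  # one piece, it contains letters
```

Here `original` is 7, `extended` is 11 and `contraction` is 1, because `search` only needs one matching character anywhere in the piece.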


#12

Yeah, the rule would have to be refined to count a dash only between two letters (in-line, back-end, etc.).
Single quote is rarely isolated: O’Reilly, didn’t, 70’s, aujourd’hui, etc. Now, perhaps people might want to count didn’t, couldn’t as two words (did not, could not), but I don’t know if there is a consensus on that rule.
AFAIK, English people use double quotes for… quotes, and stick them to words, unlike French: “Hello” – Mary didn’t answer. Note I also used a long dash… :wink:
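
A possible refinement along those lines, as a sketch only: match whole words directly rather than filtering split pieces, and accept a hyphen or an apostrophe (straight or typographic) only when it sits between two letter/digit runs. This keeps didn’t as one word, which is just one of the conventions discussed above.

```python
import re

# One or more letters/digits, optionally followed by groups of
# (apostrophe or hyphen) + letters/digits. [^\W_] means "\w without underscore".
WORD = re.compile(r"[^\W_]+(?:['’-][^\W_]+)*")


def count_words(text):
    """Count words, treating inner hyphens and apostrophes as part of the word."""
    return len(WORD.findall(text))


hyphenated = count_words("in-line, back-end")
apostrophes = count_words("O’Reilly, didn’t, 70’s, aujourd’hui")
quoted = count_words("“Hello” – Mary didn’t answer")
isolated = count_words("John said ' hello ' - Mary")
```

Isolated quotes and dashes no longer count, while in-line and O’Reilly each count once.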