What is the cause of discrepancies among different word counters?
I created a word counter in Ruby as a little exercise in learning Ruby.
I've used the word counters on JavaScriptKit.com and WordCountTool.com as well as the one in OpenOffice Writer.
Some text produced the following results:
OpenOffice: 458 words
JavaScriptKit: 453 words
WordCountTool: 455 words
Mine: 461 words
My question is this: Why do the counts differ for the same exact excerpt across all counters?
What problems in a script might cause an inaccurate, but still close, count?
What are some ways I could improve upon my script so that it's more accurate?
You're really asking for a definition of a "word", which for counting purposes could mean very different things. Let's take your original post as an example.
The simplest counting tool would be:
text.split.count #=> 111
Yet what if you had written "Why do the counts differ/change for the same[...]"? Well, clearly "differ/change" is two words, so we should probably count forward slashes as word delimiters. In fact, forgetting to put a space between a full stop and the next word doesn't make them the same word, so let's include full stops as delimiters too. I can't be bothered to check whether something is a URL, though, so each of the websites you mention (JavaScriptKit.com, for example) will have to count as two words:
text.split(/[\s\.\/\?]+/).count #=> 113
OK, that's cool, but numbers are not technically words, and if they were spoken, 458 would be "four hundred and fifty eight", which is actually 5 words. So let's discount them too:
text.split(/[\s\.\/\?0-9]+/).count #=> 109
You get the idea. The results you got only differed by 8 words - so clearly their definitions of a word are not all that different. But word counts are only ever a rough guide, so don't worry about the discrepancies.
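To make the comparison concrete, here is a minimal sketch that runs the three rules above against one short string. The sample sentence and the names `sample`, `delimiters` and `count_words` are made up for illustration; they are not taken from any of the counters mentioned in the question.

```ruby
# Hypothetical sample text, chosen so that each rule gives a different count.
sample = "Why do the counts differ/change? See example.com for 458 reasons."

# Each rule describes what separates one word from the next.
delimiters = {
  whitespace_only:  /\s+/,           # same result as text.split
  plus_punctuation: /[\s.\/?]+/,     # also split on . / ?
  plus_digits:      /[\s.\/?0-9]+/   # also treat digits as separators
}

def count_words(text, delimiter)
  # reject(&:empty?) drops the empty string that split returns
  # when the text begins with a delimiter.
  text.split(delimiter).reject(&:empty?).size
end

delimiters.each do |name, pattern|
  puts "#{name}: #{count_words(sample, pattern)}"
end
# whitespace_only: 10
# plus_punctuation: 12
# plus_digits: 11
```

Three reasonable rules, three different totals for a single sentence, which is the same kind of spread you saw across the four counters.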
You'll get different results depending on what the author of the word counter has decided counts as a "word". Depending on the counter, certain types of punctuation may be treated as word separators, as well as whitespace, newlines, etc.
Some info from the Wikipedia article on word count: http://en.wikipedia.org/wiki/Word_count
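As a rough illustration of how much the choice of separators matters, the sketch below counts the same sentence twice: once treating every non-letter as a separator, and once allowing apostrophes and hyphens inside a word. Both patterns are assumptions made up for this example (and handle ASCII letters only); neither is known to be what any particular tool uses.

```ruby
# Two hypothetical definitions of a word, written as what a word looks like
# rather than as what separates words.
strict  = /[A-Za-z]+/                      # letters only; anything else separates
lenient = /[A-Za-z]+(?:['-][A-Za-z]+)*/    # also keeps internal ' and -

text = "It's a well-known problem, isn't it?"

puts text.scan(strict).size    # 9: It, s, a, well, known, problem, isn, t, it
puts text.scan(lenient).size   # 6: It's, a, well-known, problem, isn't, it
```

Neither answer is wrong; the two patterns simply disagree about whether "well-known" and "isn't" are one word or two.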
Different word counting programs may give varying results, depending on the definition of "word"