
What is the cause of discrepancies among different word counters?

I created a word counter in Ruby as a little exercise in learning Ruby.

I've used the word counters on JavaScriptKit.com and WordCountTool.com as well as the one in Open Office Writer.

Some text produced the following results:

OpenOffice: 458 words
JavaScriptKit: 453 words
WordCountTool: 455 words
Mine: 461 words

My question is this: Why do the counts differ for the same exact excerpt across all counters?

What are problems in a script that might cause an inaccurate, but still close count?

What are some ways I could improve upon my script so that it's more accurate?


You're really asking for a definition of a "word", which for counting purposes could mean very different things. Let's take your original post as an example.

The most simplistic counting tool would be:

text.split.count                      #=> 111

Yet what if you had put "Why do the counts differ/change for the same[...]"? Well, clearly "differ/change" is two words, so we should probably count forward slashes as word delimiters. In fact, just because I forgot to put a space between a full stop and the next word, doesn't make them the same word, so let's include full stops as delimiters too. Yet I can't be bothered to check whether it's a URL, so those websites you mention will have to count as two words:

text.split(/[\s\.\/\?]+/).count       #=> 113

Ok, that's cool, but actually numbers are not technically words - and if they were spoken, 458 would be "four hundred and fifty eight" which is actually 5 words. So let's discount them too

text.split(/[\s\.\/\?0-9]+/).count    #=> 109

You get the idea. The results you got only differed by 8 words - so clearly their definitions of a word are not all that different. But word counts are only ever a rough guide, so don't worry about the discrepancies.
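
To see the effect side by side, here is a rough sketch that runs the same three definitions over one string. The sample text and the names SAMPLE, DEFINITIONS and word_counts are made up for illustration; they are not from your script or from any of the tools you mention.

```ruby
# Hypothetical sample text; not the excerpt from the question.
SAMPLE = "Why do the counts differ/change? Mine says 461 words.See JavaScriptKit.com"

# Each lambda is one possible definition of "word", mirroring the three
# one-liners above.
DEFINITIONS = {
  whitespace:  ->(text) { text.split },                   # whitespace only
  punctuation: ->(text) { text.split(/[\s.\/?]+/) },      # plus . / ?
  no_digits:   ->(text) { text.split(/[\s.\/?0-9]+/) }    # plus digits
}

# Returns a hash of definition name => word count for the given text.
def word_counts(text)
  DEFINITIONS.transform_values do |splitter|
    # A leading delimiter produces an empty first element, so drop empties.
    splitter.call(text).reject(&:empty?).count
  end
end

puts word_counts(SAMPLE).inspect
```

For this sample the three definitions give 10, 13 and 12 words respectively, which is the same kind of small spread you saw between the tools.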


You'll get different results depending on what the author of the word counter has decided a 'word' is. Depending on the counter, certain types of punctuation could be classed as word separators, as well as whitespace, newlines, etc.
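
One way to make that decision explicit in your own script is to match what you consider a word instead of splitting on separators. This is only a sketch under my own assumptions: WORD_PATTERN and count_words are hypothetical names, and the pattern (letters, with apostrophes or hyphens allowed inside a word) is just one possible definition, not the one any of those tools is known to use.

```ruby
# Assumed definition: a word is a run of letters, optionally joined by
# internal apostrophes or hyphens (so "isn't" and "well-known" are one word).
WORD_PATTERN = /[[:alpha:]]+(?:['-][[:alpha:]]+)*/

# Counts the non-overlapping matches of the pattern in the text.
# Pass a different pattern to try a different definition of "word".
def count_words(text, pattern = WORD_PATTERN)
  text.scan(pattern).count
end

sentence = "It's a well-known fact: 458 isn't a word."
puts count_words(sentence)          # letters-only definition
puts count_words(sentence, /\S+/)   # any run of non-whitespace
```

On that sentence the first call counts 7 words and the second counts 8, because the second definition also counts "458".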

Some info from the Wikipedia article on word count (http://en.wikipedia.org/wiki/Word_count):

Different word counting programs may give varying results, depending on the definition of "word"
