Reliably getting a character count for .doc files
What's a reliable way to automatically count the characters and/or words in a .doc or .docx file?
The only real requirement is a reasonably accurate and reasonably reliable count.
It needs to work with documents co开发者_如何学Cntaining something other than Latin script, so counting characters is good enough for most cases. The count does not necessarily need to match Word's, but the closer the better. Since there are a gazillion different apps that can generate .doc files, it's okay to fail to count anything, but this case needs to be catchable so we're aware that a count may be inaccurate. For all other cases the count must be, say, at least 99% accurate at least 99% of the time.I'm open as to the involved technologies, but something that can run on a *NIX command line would be greatly preferred.
Is there a reasonable solution for this?
Here's a link to some Linux word-to-text converters.
For example you could use
antiword file.doc | wc
to do the counting.
Edit:
This link shows that AbiWord has a command-line interface, that you could use to convert the .docx format to .txt and then count the words using "wc". AbiWord does support the docx format
Mac OS X has support for reading word files built into the system frameworks, so if you have that, it's easy. MacRuby sample:
NSSpellChecker.sharedSpellChecker.countWordsInString(NSAttributedString.alloc.initWithURL(fileURL, documentAttributes:nil), language:nil)
More portably — though it gives up support for docx — you could simply get Antiword and do antiword | wc -w
.
Microsoft has published a specification for the Office binary file formats. Parsing a .DOC file doesn't look trivial, but with some care you should be able to get a dependable, repeatable result. I have no idea how closely it'll match with what Word shows -- that will probably depend (at least partly) on how you define "word" -- for example, whether you consider a group of digits a "word" or not. It probably won't take a lot to figure out how Word treats cases like that, so getting a close match shouldn't be terribly difficult.
If you consider online applications as a solution, yes, there is a solution.
This not so pretty (regarding the design) site offers both word and character count: http://allworldphone.com/count-words-characters.htm
I don't think there is a limit, and it shouldn't be a problem to just copy/paste the contents of your documents into the corresponding textarea and see the result.
Regarding the 100% or 99% accuracy, you could test it with a few (i.e. 20-50 words) by counting them yourself first.
I hope this helps. Regards. Chris
精彩评论