How do I count the number of words (text) in an HTML source
I have some HTML documents for which I need to return the number of words in the document. This count should only include actual text (so no HTML tags, e.g. html, br, etc.).
Any ideas how to do this? Naturally, I would prefer to re-use some code.
Thanks,
Assaf
Strip out the HTML tags and get the text content; you can reuse Jsoup for this.
Then read the text line by line, keeping a
Map<String, Integer> wordToCountMap
and updating the map for each word as you read through.
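The map-based counting described above could be sketched roughly like this. It is only an illustration under a few assumptions: the tags have already been stripped (for example via Jsoup), words are simply whitespace-separated tokens, and case is ignored. The class name `WordFrequency` and the methods `count` and `total` are hypothetical names, not part of any library.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class WordFrequency {
    // Hypothetical helper: maps each whitespace-separated token to its number of occurrences.
    // Assumes `text` is plain text with the HTML tags already removed.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> wordToCountMap = new HashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty or blank input produces a single empty token
            }
            // Lower-casing is a design choice so that "The" and "the" share one entry
            wordToCountMap.merge(token.toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return wordToCountMap;
    }

    // Total number of words is the sum of all per-word counts
    static int total(Map<String, Integer> wordToCountMap) {
        int sum = 0;
        for (int n : wordToCountMap.values()) {
            sum += n;
        }
        return sum;
    }
}
```

Note that punctuation stays attached to the tokens here ("York!" and "York" would be different keys), so this is only a starting point if you need per-word counts rather than a single total.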
Solution with jsoup
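// requires: import org.jsoup.Jsoup;
private int countWords(String html) {
    org.jsoup.nodes.Document dom = Jsoup.parse(html);
    // text() returns the visible text with the tags removed
    String text = dom.text().trim();
    // "".split(...) yields one empty token, so handle empty documents explicitly
    if (text.isEmpty()) return 0;
    return text.split("\\s+").length;
}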
I would add an extra step to Jigar's answer:
- Parse out the document text using JSoup, Jericho or Dom4j.
- Tokenise the resulting text. This depends on your definition of a "word". It is unlikely to be as simple as splitting on whitespace, and you'll need to deal with punctuation etc. So take a look at the various tokenisers available, e.g. from the Lucene or Stanford NLP projects. Here are some simple examples you will encounter:
  "Today I'm going to New York!"
  Is "I'm" one word or two? What about "New York"?
  "We applied two meta-filters in the analysis"
  Is "meta-filters" one word or two?
  And what about badly formatted text, e.g. a missing space at the end of a sentence:
  "So we went there.And on arrival..."
  Tokenising is tricky...
- Iterate through your tokens and count them up, e.g. using a HashMap.
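As a rough illustration of the tokenise-then-count steps, here is a sketch that uses the JDK's built-in `java.text.BreakIterator` instead of a Lucene or Stanford NLP tokeniser (a swapped-in technique, chosen only because it needs no extra dependency). It assumes the text has already been extracted from the HTML, and it treats a segment as a word only if it contains a letter or digit. The class name `TokenCounter` and its methods are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

class TokenCounter {
    // Hypothetical helper: counts word-like segments found by the JDK word BreakIterator.
    // Assumes `text` is plain text with the HTML tags already removed.
    static int countWords(String text) {
        BreakIterator boundaries = BreakIterator.getWordInstance(Locale.ENGLISH);
        boundaries.setText(text);
        int count = 0;
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE; start = end, end = boundaries.next()) {
            // Segments made only of spaces or punctuation are not counted as words
            if (containsLetterOrDigit(text, start, end)) {
                count++;
            }
        }
        return count;
    }

    private static boolean containsLetterOrDigit(String text, int start, int end) {
        for (int i = start; i < end; i++) {
            if (Character.isLetterOrDigit(text.charAt(i))) {
                return true;
            }
        }
        return false;
    }
}
```

For example, `TokenCounter.countWords("Hello, world!")` counts two words, because the comma, space and exclamation mark are skipped. How cases such as "I'm", "meta-filters" or "there.And" are segmented depends on the iterator's rules, so check them against your own definition of a word before relying on the result.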