find the top words, relative to all documents

I have 100,000+ text documents. I'd like to find a way to answer this (somewhat ambiguous) question:

For a given subset of documents, what are the n most frequent words - relative to the full set of documents?

I'd like to present trends, e.g. a word cloud showing something like "these are the topics that are especially hot in the given date range". (Yes, I know this is an oversimplification: words != topics, etc.)

It seems that I could possibly calculate something like tf-idf values for all words in all documents, and then do some number crunching, but I don't want to reinvent any wheels here.
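The tf-idf idea above can be sketched without any library: count a word's frequency within the subset, then weight it by how rare the word is across the full corpus, so corpus-wide common words drop out. This is a minimal illustrative sketch, not Lucene's scoring; the function name and tokenization (lowercase whitespace split) are assumptions.

```python
import math
from collections import Counter

def top_relative_words(subset_docs, all_docs, n=10):
    """Rank words frequent in the subset but rare in the full corpus,
    using a simple tf-idf-style score. Illustrative only."""
    # Document frequency over the full corpus
    df = Counter()
    for doc in all_docs:
        df.update(set(doc.lower().split()))
    total = len(all_docs)

    # Term frequency within the subset
    tf = Counter()
    for doc in subset_docs:
        tf.update(doc.lower().split())

    # tf * idf: a word appearing in every document gets idf = 0
    scores = {
        word: count * math.log(total / df[word])
        for word, count in tf.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

A word like "the" that occurs in every document gets idf = log(1) = 0 and falls to the bottom, which is exactly the "relative to all documents" behavior asked for.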

I'm planning on possibly using Lucene or Solr for indexing the documents. Would they help me with this question - how? Or would you recommend some other tools in addition / instead?


This should work: http://lucene.apache.org/java/3_1_0/api/contrib-misc/org/apache/lucene/misc/HighFreqTerms.html

This Stack Overflow question also covers term frequencies in general with Lucene.

If you were not using Lucene already, the operation you are talking about is a classic introductory problem for Hadoop (the "word count" problem).
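For intuition, the classic Hadoop word-count job boils down to a map phase emitting (word, 1) pairs and a reduce phase summing them per word. A toy single-process sketch of that shape (function names are illustrative, not the Hadoop API):

```python
from collections import Counter
from itertools import chain

def map_phase(doc):
    # Mapper: emit a (word, 1) pair for every token in the document
    return [(w, 1) for w in doc.lower().split()]

def reduce_phase(pairs):
    # Reducer: sum the counts for each word
    counts = Counter()
    for word, one in pairs:
        counts[word] += one
    return counts

docs = ["hot topic", "hot news"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
counts = reduce_phase(pairs)
```

Hadoop distributes the same two phases across machines, grouping pairs by key between map and reduce.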

