Find the top words, relative to all documents
I have 100,000+ text documents. I'd like to find a way to answer this (somewhat ambiguous) question:
For a given subset of documents, what are the n most frequent words, relative to the full set of documents?
I'd like to present trends, e.g. a word cloud showing something like "these are the topics that are especially hot in the given date range". (Yes, I know this is an oversimplification: words != topics, etc.)
It seems that I could possibly calculate something like tf-idf values for all words in all documents, and then do some number crunching, but I don't want to reinvent any wheels here.
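To make the "number crunching" concrete, here is a minimal sketch of one way to score words by how over-represented they are in a subset compared to the full corpus. It uses a simple log-ratio of relative frequencies as an illustrative stand-in for a full tf-idf computation; the function name and the whitespace tokenization are assumptions, not anything from a particular library:

```python
from collections import Counter
import math

def distinctive_words(subset_docs, all_docs, n=10):
    """Rank words by how over-represented they are in the subset
    relative to the full corpus, using a log-ratio of relative
    frequencies (an illustrative stand-in for tf-idf)."""
    # Naive whitespace tokenization; a real system would use an analyzer.
    subset_counts = Counter(w for doc in subset_docs for w in doc.lower().split())
    total_counts = Counter(w for doc in all_docs for w in doc.lower().split())
    subset_total = sum(subset_counts.values())
    corpus_total = sum(total_counts.values())
    # log( P(word | subset) / P(word | corpus) ): positive means the
    # word is "hotter" in the subset than in the corpus overall.
    scores = {
        w: math.log((c / subset_total) / (total_counts[w] / corpus_total))
        for w, c in subset_counts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]

docs = ["apple banana apple", "banana cherry", "apple cherry date", "date date egg"]
subset = ["apple banana apple", "banana cherry"]
print(distinctive_words(subset, docs, n=2))  # → ['banana', 'apple']
```

In practice you would pull the term counts from the Lucene/Solr index rather than re-tokenizing the raw text, but the scoring step would look much the same.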
I'm planning on possibly using Lucene or Solr for indexing the documents. Would they help me with this question - how? Or would you recommend some other tools in addition / instead?
This should work: http://lucene.apache.org/java/3_1_0/api/contrib-misc/org/apache/lucene/misc/HighFreqTerms.html
This Stack Overflow question also covers term frequencies in general with Lucene.
If you were not using Lucene already, the operation you are talking about is a classic introductory problem for Hadoop (the "word count" problem).
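For reference, the "word count" problem mentioned above boils down to a map step that emits (word, 1) pairs and a reduce step that sums them. A toy single-machine version (plain Python, not Hadoop) looks like this:

```python
from collections import Counter

def word_count(docs):
    """Toy single-machine version of the canonical MapReduce word count:
    map each document to (word, 1) pairs, then reduce by summing."""
    pairs = [(w, 1) for doc in docs for w in doc.lower().split()]  # map phase
    counts = Counter()
    for w, c in pairs:  # reduce phase: sum counts per word
        counts[w] += c
    return counts

print(word_count(["to be or not to be"]).most_common(2))
```

On 100,000+ documents a single Counter over the whole corpus is usually still feasible in memory; Hadoop only becomes necessary when the corpus no longer fits on one machine.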