How to count term frequency for a set of documents?
I have a Lucene index with the following documents:
doc1 := { caldari, jita, shield, planet }
doc2 := { gallente, dodixie, armor, planet }
doc3 := { amarr, laser, armor, planet }
doc4 := { minmatar, rens, space }
doc5 := { jove, space, secret, planet }
So these 5 documents use 14 different terms:
[ caldari, jita, shield, planet, gallente, dodixie, armor, amarr, laser, minmatar, rens, jove, space, secret ]
the frequency of each term:
[ 1, 1, 1, 4, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1 ]
for easy reading:
[ caldari:1, jita:1, shield:1, planet:4, gallente:1, dodixie:1,
armor:2, amarr:1, laser:1, minmatar:1, rens:1, jove:1, space:2, secret:1 ]
What I want to know now is: how do I obtain the term frequency vector for a set of documents?
for example:
Set<Documents> docs := [ doc2, doc3 ]
termFrequencies = magicFunction(docs);
System.out.println( termFrequencies );
would result in the output:
[ caldari:0, jita:0, shield:0, planet:2, gallente:1, dodixie:1,
armor:2, amarr:1, laser:1, minmatar:0, rens:0, jove:0, space:0, secret:0 ]
remove all zeros:
[ planet:2, gallente:1, dodixie:1, armor:2, amarr:1, laser:1 ]
Notice that the result vector contains only the term frequencies of the given set of documents, NOT the overall frequencies of the whole index! The term 'planet' occurs 4 times in the whole index, but the source set of documents contains it only 2 times.
A naive implementation would be to just iterate over all documents in the docs set, create a map, and count each term.
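A minimal sketch of that naive approach, assuming each document is already available as its analyzed terms (the List<String[]> representation and the method name are just for illustration):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Naive counting: walk every document in the set and tally each term.
    public static Map<String, Integer> magicFunction(List<String[]> docs) {
        Map<String, Integer> termFrequencies = new HashMap<String, Integer>();
        for (String[] terms : docs) {
            for (String term : terms) {
                Integer old = termFrequencies.get(term);
                termFrequencies.put(term, old == null ? 1 : old + 1);
            }
        }
        return termFrequencies;
    }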
But I need a solution that would also work with a document set size of 100,000 or 500,000.
Is there a feature in Lucene I can use to obtain this term vector? If there is no such feature, what would a data structure look like that one could build at index time to obtain such a term vector easily and quickly?
I'm not a Lucene expert, so I'm sorry if the solution is obvious or trivial.
Maybe worth mentioning: the solution should be fast enough for a web application, applied to client search queries.
Go here: http://lucene.apache.org/java/3_0_1/api/core/index.html and check this method:
org.apache.lucene.index.IndexReader.getTermFreqVectors(int docNumber);
You will have to know the document id. This is an internal Lucene id, and it usually changes on every index update (that has deletes :-)).
I believe there is a similar method for Lucene 2.x.x.
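A minimal sketch of how those per-document vectors could be summed over a document set. This assumes the fields were indexed with term vectors enabled (Field.TermVector.YES), otherwise getTermFreqVectors returns null; the method name sumTermFreqs is just illustrative:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;

    // Sums the stored term vectors of the given internal document ids.
    public static Map<String, Integer> sumTermFreqs(IndexReader reader, int[] docIds)
            throws IOException {
        Map<String, Integer> freqs = new HashMap<String, Integer>();
        for (int docId : docIds) {
            TermFreqVector[] vectors = reader.getTermFreqVectors(docId);
            if (vectors == null) {
                continue; // no term vectors stored for this document
            }
            for (TermFreqVector vector : vectors) {
                String[] terms = vector.getTerms();
                int[] counts = vector.getTermFrequencies();
                for (int i = 0; i < terms.length; i++) {
                    Integer old = freqs.get(terms[i]);
                    freqs.put(terms[i], old == null ? counts[i] : old + counts[i]);
                }
            }
        }
        return freqs;
    }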
I don't know Lucene; however, your naive implementation will scale, provided you don't read the entire document into memory at one time (i.e. use an on-line parser). English text is about 83% redundant, so your biggest document will have a map with about 85,000 entries in it. Use one map per thread (and one thread per file, pooled obviously) and you will scale just fine.
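A rough sketch of that one-map-per-thread idea: each pooled task counts one document into its own local map, and the partial maps are merged at the end (the List<String[]> input is an assumption for illustration):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // One local map per task: no shared state while counting,
    // one cheap merge at the end.
    public static Map<String, Integer> countInParallel(List<String[]> docs)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<Future<Map<String, Integer>>> partials =
                new ArrayList<Future<Map<String, Integer>>>();
        for (final String[] terms : docs) {
            partials.add(pool.submit(new Callable<Map<String, Integer>>() {
                public Map<String, Integer> call() {
                    Map<String, Integer> local = new HashMap<String, Integer>();
                    for (String term : terms) {
                        Integer old = local.get(term);
                        local.put(term, old == null ? 1 : old + 1);
                    }
                    return local;
                }
            }));
        }
        Map<String, Integer> merged = new HashMap<String, Integer>();
        for (Future<Map<String, Integer>> partial : partials) {
            for (Map.Entry<String, Integer> e : partial.get().entrySet()) {
                Integer old = merged.get(e.getKey());
                merged.put(e.getKey(), old == null ? e.getValue() : old + e.getValue());
            }
        }
        pool.shutdown();
        return merged;
    }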
Update: if your term list does not change frequently, you might try building a search tree out of the characters in your term list, or building a perfect hash function (http://www.gnu.org/software/gperf/) to speed up file parsing (mapping from search terms to target strings). Probably just a big HashMap would perform about as well.