How to count term frequency for a set of documents?
I have a Lucene index with the following documents:
doc1 := { caldari, jita, shield, planet }
doc2 := { gallente, dodixie, armor, planet }
doc3 := { amarr, laser, armor, planet }
doc4 := { minmatar, rens, space }
doc5 := { jove, space, secret, planet }
So these 5 documents use 14 different terms:
[ caldari, jita, shield, planet, gallente, dodixie, armor, amarr, laser, minmatar, rens, jove, space, secret ]
the frequency of each term:
[ 1, 1, 1, 4, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1 ]
for easy reading:
[ caldari:1, jita:1, shield:1, planet:4, gallente:1, dodixie:1,
armor:2, amarr:1, laser:1, minmatar:1, rens:1, jove:1, space:2, secret:1 ]
What I want to know now is: how do I obtain the term frequency vector for a set of documents?
for example:
Set<Documents> docs := [ doc2, doc3 ]
termFrequencies = magicFunction(docs);
System.out.println( termFrequencies );
would result in the output:
[ caldari:0, jita:0, shield:0, planet:2, gallente:1, dodixie:1,
armor:2, amarr:1, laser:1, minmatar:0, rens:0, jove:0, space:0, secret:0 ]
remove all zeros:
[ planet:2, gallente:1, dodixie:1, armor:2, amarr:1, laser:1 ]
Notice that the result vector contains only the term frequencies of the given set of documents, NOT the overall frequencies of the whole index! The term 'planet' occurs 4 times in the whole index, but the source set of documents contains it only 2 times.
A naive implementation would be to just iterate over all documents in the docs set, create a map, and count each term.
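A minimal sketch of that naive approach, assuming each document is already available as its analyzed terms (the List<String[]> representation and the method name are just for illustration):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Naive counting: walk every document in the set and tally each term.
    public static Map<String, Integer> magicFunction(List<String[]> docs) {
        Map<String, Integer> termFrequencies = new HashMap<String, Integer>();
        for (String[] terms : docs) {
            for (String term : terms) {
                Integer old = termFrequencies.get(term);
                termFrequencies.put(term, old == null ? 1 : old + 1);
            }
        }
        return termFrequencies;
    }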
But I need a solution that would also work with a document set size of 100,000 or 500,000.
Is there a feature in Lucene I can use to obtain this term vector? If there is no such feature, what would a data structure look like that one could build at index time to obtain such a term vector easily and quickly?
I'm not a Lucene expert, so I'm sorry if the solution is obvious or trivial.
Maybe worth mentioning: the solution should be fast enough for a web application, applied to client search queries.
Go here: http://lucene.apache.org/java/3_0_1/api/core/index.html and check this method:
org.apache.lucene.index.IndexReader.getTermFreqVectors(int docNumber);
You will have to know the document id. This is an internal Lucene id, and it usually changes on every index update (that has deletes :-)).
I believe there is a similar method for Lucene 2.x.x.
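A minimal sketch of how those per-document vectors could be summed over a document set. This assumes the fields were indexed with term vectors enabled (Field.TermVector.YES), otherwise getTermFreqVectors returns null; the method name sumTermFreqs is just illustrative:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;

    // Sums the stored term vectors of the given internal document ids.
    public static Map<String, Integer> sumTermFreqs(IndexReader reader, int[] docIds)
            throws IOException {
        Map<String, Integer> freqs = new HashMap<String, Integer>();
        for (int docId : docIds) {
            TermFreqVector[] vectors = reader.getTermFreqVectors(docId);
            if (vectors == null) {
                continue; // no term vectors stored for this document
            }
            for (TermFreqVector vector : vectors) {
                String[] terms = vector.getTerms();
                int[] counts = vector.getTermFrequencies();
                for (int i = 0; i < terms.length; i++) {
                    Integer old = freqs.get(terms[i]);
                    freqs.put(terms[i], old == null ? counts[i] : old + counts[i]);
                }
            }
        }
        return freqs;
    }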
I don't know Lucene; however, your naive implementation will scale, provided you don't read the entire document into memory at one time (i.e. use an on-line parser). English text is about 83% redundant, so your biggest document will have a map with about 85,000 entries in it. Use one map per thread (and one thread per file, pooled obviously) and you will scale just fine.
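A rough sketch of that one-map-per-thread idea: each pooled task counts one document into its own local map, and the partial maps are merged at the end (the List<String[]> input is an assumption for illustration):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // One local map per task: no shared state while counting,
    // one cheap merge at the end.
    public static Map<String, Integer> countInParallel(List<String[]> docs)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<Future<Map<String, Integer>>> partials =
                new ArrayList<Future<Map<String, Integer>>>();
        for (final String[] terms : docs) {
            partials.add(pool.submit(new Callable<Map<String, Integer>>() {
                public Map<String, Integer> call() {
                    Map<String, Integer> local = new HashMap<String, Integer>();
                    for (String term : terms) {
                        Integer old = local.get(term);
                        local.put(term, old == null ? 1 : old + 1);
                    }
                    return local;
                }
            }));
        }
        Map<String, Integer> merged = new HashMap<String, Integer>();
        for (Future<Map<String, Integer>> partial : partials) {
            for (Map.Entry<String, Integer> e : partial.get().entrySet()) {
                Integer old = merged.get(e.getKey());
                merged.put(e.getKey(), old == null ? e.getValue() : old + e.getValue());
            }
        }
        pool.shutdown();
        return merged;
    }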
Update: if your term list does not change frequently, you might try building a search tree out of the characters in your term list, or building a perfect hash function (http://www.gnu.org/software/gperf/) to speed up file parsing (mapping from search terms to target strings). Probably just a big HashMap would perform about as well.