Lucene 3 iterating over all hits

2023-01-08 01:24

I'm in the process of updating a tool that uses a Lucene index. As part of this update we are moving from Lucene 2.0.0 to 3.0.2. For the most part this has been entirely straightforward. However, in one instance I can't seem to find a straightforward conversion.

Basically I have a simple query and I need to iterate over all hits. In Lucene 2 this was simple, e.g.:

Hits hits = indexSearcher.search(query);
for(int i=0 ; i<hits.length() ; i++){
  // Process hit
}

In Lucene 3 the API for IndexSearcher has changed significantly and although I can bash together something that works, it is only by getting the top X documents and making sure that X is sufficiently large.

While the number of hits (in my case) is typically between zero and ten, there are anomalous situations where they could number much higher. Having a fixed limit therefore feels wrong. Furthermore, setting the limit really high causes an OutOfMemoryError, which suggests that space for all X possible hits is allocated immediately. As this operation is carried out a lot, something reasonably efficient is desired.

Edit:

Currently I've got the following to work:

TopDocs hits = indexSearcher.search(query, MAX_HITS);
// scoreDocs holds at most MAX_HITS entries, even when totalHits is larger
for (int i = 0; i < hits.scoreDocs.length; i++) {
   // Process hits.scoreDocs[i]
}

This works fine except that

a) what if there are more hits than MAX_HITS?

and

b) if MAX_HITS is large then I'm wasting memory as room for each hit is allocated before the search is performed.

As most of the time there will only be a few hits, I don't mind doing follow-up searches to get the subsequent hits, but I can't seem to find a way to do that.

IndexSearcher has a method docFreq(Term). Invoking it does not seem to have a performance penalty and its output is then a suitable input parameter for the number of documents to get.

E.g.

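int freq = indexSearcher.docFreq(new Term(FIELD, value));
// docFreq can be 0 for an unknown term; keep the limit positive
TopDocs hits = indexSearcher.search(query, Math.max(1, freq));
for (int i = 0; i < hits.scoreDocs.length; i++) {
   // Process hits.scoreDocs[i]
}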

This works because my query is essentially a TermQuery. If it was a more complex query then this wouldn't be suitable.

@Kris - I ran into this issue as well, and this worked for me:

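// First search: only the hit count is needed, so ask for a single result
TopDocs tp = indexSearcher.search(query, 1);

// Second search: ask for exactly as many results as were counted (at least 1)
TopDocs hits = indexSearcher.search(query, Math.max(1, tp.totalHits));
for (int i = 0; i < hits.scoreDocs.length; i++) {
   // Process hits.scoreDocs[i]
}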

According to Uwe in the link below, tp.totalHits "... will still count all hits, but return only 1."

Full details are in this thread from the Apache Lucene java-user mailing list archive: http://www.gossamer-threads.com/lists/lucene/java-user/95032
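
Since most queries here return ten hits or fewer, the two searches above can often be reduced to one. The following is a minimal sketch of that idea rather than Lucene code: `Page` and `LimitedSearch` are hypothetical stand-ins for `TopDocs` and `indexSearcher.search(query, n)`, and `INITIAL_LIMIT` is an assumed starting size. It asks for a small number of hits first and searches again only when the reported total is larger than what came back.

```java
// Hypothetical stand-in for Lucene's TopDocs: the total number of matches
// plus the (possibly truncated) list of returned document ids.
class Page {
    final int totalHits;
    final int[] docIds;

    Page(int totalHits, int[] docIds) {
        this.totalHits = totalHits;
        this.docIds = docIds;
    }
}

// Hypothetical stand-in for indexSearcher.search(query, n).
interface LimitedSearch {
    Page search(int limit);
}

class AllHits {
    // Assumed starting size; the question says ten hits or fewer is typical.
    static final int INITIAL_LIMIT = 10;

    // Returns every matching doc id, searching a second time only when
    // the first page was truncated.
    static int[] fetchAll(LimitedSearch searcher) {
        Page page = searcher.search(INITIAL_LIMIT);
        if (page.totalHits > page.docIds.length) {
            // The first search already counted all matches, so ask for exactly that many.
            page = searcher.search(page.totalHits);
        }
        return page.docIds;
    }

    // Fake index for demonstration: documents 0..matchCount-1 all match.
    static LimitedSearch fakeIndex(final int matchCount) {
        return new LimitedSearch() {
            public Page search(int limit) {
                if (limit <= 0) {
                    throw new IllegalArgumentException("limit must be > 0");
                }
                int n = Math.min(limit, matchCount);
                int[] ids = new int[n];
                for (int i = 0; i < n; i++) {
                    ids[i] = i;
                }
                return new Page(matchCount, ids);
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(fetchAll(fakeIndex(3)).length);   // prints 3
        System.out.println(fetchAll(fakeIndex(250)).length); // prints 250
    }
}
```

With a real searcher the same shape would apply to `TopDocs.totalHits` and `TopDocs.scoreDocs`. If the index can change between the two searches, the second page may still be shorter than the new total, so the loop over the results should be bounded by the array length.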

Why don't you use Searcher.search(Query query, int n)? You can specify the number of results you want back, and you can use the TopDocs object that is returned to iterate through the results.

Using Hits to process long result sets was a bad idea, because in the background the Hits object would run more searches to fill in results that it didn't already have.

TopDocs only contains ids and scores, so you shouldn't have a memory problem even for large n.

How about using numDocs() from the IndexReader as the maximum number of results?

Do watch out for the edge case of zero documents in the index though...
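
The zero-document case matters because the search call expects a positive limit; as far as I can tell, Lucene 3.0.x rejects a limit of zero with an IllegalArgumentException, but treat that as an assumption and check your version. A minimal sketch of the clamp, where `HitLimit` and `fromNumDocs` are hypothetical names and the argument is assumed to come from `IndexReader.numDocs()`:

```java
// Minimal sketch: choose a result limit from the index size.
class HitLimit {
    // numDocs is assumed to come from IndexReader.numDocs().
    // An empty index would give 0, so clamp to 1 to keep the limit positive.
    static int fromNumDocs(int numDocs) {
        return Math.max(1, numDocs);
    }

    public static void main(String[] args) {
        System.out.println(fromNumDocs(0));    // prints 1
        System.out.println(fromNumDocs(5000)); // prints 5000
    }
}
```

The result would then be passed as the second argument to search(query, n).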

Hope this helps.
