Lucene 3 iterating over all hits
I'm in the process of updating a tool that uses a Lucene index. As part of this update we are moving from Lucene 2.0.0 to 3.0.2. For the most part this has been entirely straightforward. However, in one instance I can't seem to find a straightforward conversion.
Basically I have a simple query and I need to iterate over all hits. In Lucene 2 this was simple, e.g.:
Hits hits = indexSearcher.search(query);
for(int i=0 ; i<hits.length() ; i++){
// Process hit
}
In Lucene 3 the API for IndexSearcher has changed significantly, and although I can bash together something that works, it is only by getting the top X documents and making sure that X is sufficiently large.
While the number of hits (in my case) is typically between zero and ten, there are anomalous situations where they could number much higher. Having a fixed limit therefore feels wrong. Furthermore, setting the limit really high causes an OutOfMemoryError (OOME), which means that space for all X possible hits is allocated immediately. As this operation is carried out a lot, something reasonably efficient is desired.
Edit:
Currently I've got the following to work:
TopDocs hits = indexSearcher.search(query, MAX_HITS);
for (int i=0 ; i<hits.totalHits ; i++) {
// Process hit
}
This works fine except that:
a) what if there are more hits than MAX_HITS?
b) if MAX_HITS is large then I'm wasting memory, as room for each hit is allocated before the search is performed.
As most of the time there will only be a few hits, I don't mind doing follow-up searches to get the subsequent hits, but I can't seem to find a way to do that.
IndexSearcher has a method docFreq(Term). Invoking it does not seem to have a performance penalty, and its output is then a suitable input parameter for the number of documents to get.
E.g.
int freq = searcher.docFreq(new Term(FIELD, value));
TopDocs hits = indexSearcher.search(query, freq);
for (int i=0 ; i<hits.totalHits ; i++) {
// Process hit
}
This works because my query is essentially a TermQuery. If it were a more complex query then this wouldn't be suitable.
@Kris - I ran into this issue as well, and this worked for me. Try this:
TopDocs tp = indexSearcher.search(query, 1); // first pass: only tp.totalHits is used
TopDocs hits = indexSearcher.search(query, tp.totalHits);
for (int i=0 ; i<hits.totalHits ; i++) {
// Process hit
}
According to Uwe in the thread linked below, tp.totalHits ".. will still count all hits, but return only 1."
See full details in link from java-user lucene apache mail archives - http://www.gossamer-threads.com/lists/lucene/java-user/95032
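The count-then-fetch idea above can be combined with a small first guess, so that the second search only runs when the first one was truncated. The following is a minimal sketch of that control flow, not Lucene code: `Page` and `HitSource` are hypothetical stand-ins for `TopDocs` (its `totalHits` and the returned doc ids) and for `indexSearcher.search(query, n)`, and `AllHits.collectAll` is an illustrative name. It also loops over the hits actually returned rather than `totalHits`, and never asks for zero hits.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for TopDocs: the total match count plus the doc ids actually returned. */
final class Page {
    final int totalHits;
    final int[] docIds;

    Page(int totalHits, int[] docIds) {
        this.totalHits = totalHits;
        this.docIds = docIds;
    }
}

/** Hypothetical stand-in for indexSearcher.search(query, n); returns at most n hits. */
interface HitSource {
    Page search(int n);
}

final class AllHits {
    private AllHits() {}

    /**
     * Fetches every hit: one search with a small initial limit, and a second
     * search sized from totalHits only when the first one was truncated.
     */
    static List<Integer> collectAll(HitSource source, int initialLimit) {
        int limit = Math.max(1, initialLimit); // never ask for zero hits
        Page page = source.search(limit);
        if (page.totalHits > page.docIds.length) {
            // First pass was truncated; totalHits says how many slots are needed.
            page = source.search(page.totalHits);
        }
        List<Integer> ids = new ArrayList<Integer>(page.docIds.length);
        // Loop over what was returned, not totalHits, so a short page cannot overrun.
        for (int docId : page.docIds) {
            ids.add(docId);
        }
        return ids;
    }
}
```

With an initial limit of around ten (the typical hit count mentioned in the question), the common case costs a single search and only the anomalous large result sets pay for a second one.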
Why don't you use Searcher.search(Query query, int n)? You can specify the number of results you want back, and you can use the TopDocs object that is returned to iterate through the results.
Using Hits to process long result sets was a bad idea, because in the background the hits object would run more searches to fill in results that it didn't already have.
TopDocs only contains ids and scores, so you shouldn't have a memory problem even for large n.
How about using numDocs() from the IndexReader as the maximum number of results?
Do watch out for the edge case of zero documents in the index though...
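One way to guard that edge case is to clamp the limit before searching. This is a minimal sketch, assuming the search call needs a limit of at least one (whether a limit of zero is accepted may depend on the Lucene version); `SearchLimit.safeLimit` is a hypothetical helper name, and the document count is passed in as a plain int, for example the value returned by the reader's numDocs().

```java
final class SearchLimit {
    private SearchLimit() {}

    /** Clamps a document count to at least 1 so it can be used as a result limit. */
    static int safeLimit(int numDocs) {
        return Math.max(1, numDocs);
    }
}
```

The clamped value would then be passed as n to search(query, n); an empty index simply yields zero hits.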
Hope this helps,