Lucene: Iterate all entries

I have a Lucene index that I would like to iterate over (for a one-time evaluation at the current stage of development). I have 4 documents, each with a few hundred thousand up to a million entries, which I want to iterate over to count the number of words in each entry (~2-10) and calculate the frequency distribution.

What I am doing at the moment is this:

    for (int i = 0; i < reader.maxDoc(); i++) {
        if (reader.isDeleted(i))
            continue;

        Document doc = reader.document(i);
        Field text = doc.getField("myDocName#1");

        String content = text.stringValue();

        int wordLen = countNumberOfWords(content);
        // store
    }

So far, it is iterating over something. Debugging confirms that it is at least operating on the terms stored in the document, but for some reason it only processes a small part of the stored terms. What am I doing wrong? I simply want to iterate over all documents and everything that is stored in them.
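For reference, the `countNumberOfWords` helper and the "store" step are not shown in the loop above. A minimal sketch of what they might look like, assuming words are separated by whitespace and the distribution is simply "word count per entry -> number of entries" (the class name `WordCountDistribution` and its methods are hypothetical, not part of Lucene):

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class: not part of Lucene, only an illustration.
class WordCountDistribution {
    // word count per entry -> number of entries with that word count
    private final Map<Integer, Integer> histogram = new TreeMap<>();

    // Assumption: a "word" is any run of non-whitespace characters.
    static int countNumberOfWords(String content) {
        if (content == null) {
            return 0;
        }
        String trimmed = content.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    // The "store" step: bump the bucket for this entry's word count.
    void record(String content) {
        histogram.merge(countNumberOfWords(content), 1, Integer::sum);
    }

    Map<Integer, Integer> histogram() {
        return histogram;
    }
}
```

Inside the loop, `record(content)` would replace the `// store` comment, and `histogram()` would be read once after the loop finishes.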


First, you need to make sure you index with term vectors enabled:

doc.add(new Field(TITLE, page.getTitle(), Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));

Then you can use IndexReader.getTermFreqVector to count terms:

TopDocs res = indexSearcher.search(YOUR_QUERY, null, 1000);

// iterate over the documents in res (loop omitted for brevity)

reader.getTermFreqVector(res.scoreDocs[i].doc, YOUR_FIELD, new TermVectorMapper() {
    @Override
    public void map(String termval, int freq, TermVectorOffsetInfo[] offsets, int[] positions) {
        // increment the frequency count of termval by freq
        freqs.increment(termval, freq);
    }

    @Override
    public void setExpectations(String field, int numTerms, boolean storeOffsets, boolean storePositions) {
        // nothing to set up for a simple frequency count
    }
});
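
The `freqs` object used in the mapper is not defined in the snippet. A minimal sketch of one possible implementation, assuming all it needs is an `increment(term, freq)` method backed by a `HashMap` (the class name `TermFrequencyCounter` is hypothetical and not a Lucene type):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the undefined `freqs` object; not part of Lucene.
class TermFrequencyCounter {
    private final Map<String, Integer> counts = new HashMap<>();

    // Add freq to the running total for this term.
    void increment(String term, int freq) {
        counts.merge(term, freq, Integer::sum);
    }

    // Total frequency seen so far for a term, or 0 if it was never seen.
    int get(String term) {
        return counts.getOrDefault(term, 0);
    }

    // Read-only view of all accumulated totals.
    Map<String, Integer> asMap() {
        return Collections.unmodifiableMap(counts);
    }
}
```

With this, `freqs` would be declared as `final TermFrequencyCounter freqs = new TermFrequencyCounter();` before the call, so that the anonymous mapper class can capture it.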