
Reindexing with Lucene / deleting a term from the index

I hope you can help me. Here is my problem:

Edit: On second thought, if there is a way to delete a term from the index, that would solve my problem. Is there a way to do that? If there is, there is no need to read the rest of the question. Thanks!

Here is what I intend to do:

1 - Index some files while removing the standard stopwords.
2 - Afterwards, count the document frequency (df) of every term and remove the terms that have df < 2.

How I'm doing it:

1 - I index the files using IndexWriter, removing the standard stopwords.
2 - I count the df of every term and add the terms with df < 2 to the stopword list.
3 - Then I index the texts again using IndexWriter, this time with the new stopword list.

What's really happening:

The first time I index, it goes as planned. The problem appears when I try to index a second time. The result becomes pretty unpredictable:

1) If I run the program once, only the standard stopwords are removed, even though the stopword list contains the new words.

2) If I run the program a second time, the terms with df < 2 are removed.

I print the terms in the index twice: once after indexing for the first time, and once after indexing for the second time.

When I run the program a second time, the terms with df < 2 already appear removed in the first print. They shouldn't, because I only add the terms with df < 2 to the stopword list when indexing for the second time.

Maybe my explanation was a bit confusing; please tell me if anything is unclear.

I hope you guys can help me. Thank you very much!


When indexing documents for the second time, make sure to delete the first instance of each document; otherwise you will inflate the dfs of all terms. You can delete documents by an external id field: create a Term with field=idfield and value=externalId, then use IndexWriter's deleteDocuments(Term) to remove the old instance, and then add the new one. I don't think there is a way to delete terms explicitly; they are derived from the documents.
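To make the df inflation concrete, here is a minimal sketch using only the Java standard library. It is a toy in-memory model, not Lucene: the class TinyIndex and its methods are hypothetical names invented for illustration. In Lucene itself, IndexWriter.updateDocument(Term, document) performs the same delete-then-add in a single call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy stand-in for an index (NOT Lucene); every name here is hypothetical.
final class TinyIndex {
    // One stored document: its external id plus its distinct terms.
    private static final class Doc {
        final String externalId;
        final Set<String> terms;

        Doc(String externalId, Set<String> terms) {
            this.externalId = externalId;
            this.terms = terms;
        }
    }

    private final List<Doc> docs = new ArrayList<>();

    // Appends a document without checking whether the id is already stored.
    public void add(String externalId, String text) {
        String[] tokens = text.toLowerCase().trim().split("\\s+");
        docs.add(new Doc(externalId, new HashSet<>(Arrays.asList(tokens))));
    }

    // Removes every stored instance that carries this external id.
    public void deleteByExternalId(String externalId) {
        docs.removeIf(d -> d.externalId.equals(externalId));
    }

    // Delete-then-add, so a re-indexed document replaces its old instance.
    public void update(String externalId, String text) {
        deleteByExternalId(externalId);
        add(externalId, text);
    }

    // Document frequency: number of stored documents containing the term.
    public int docFreq(String term) {
        int n = 0;
        for (Doc d : docs) {
            if (d.terms.contains(term)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        TinyIndex naive = new TinyIndex();
        naive.add("doc1", "lucene index");
        naive.add("doc1", "lucene index"); // second pass without deleting
        System.out.println(naive.docFreq("lucene")); // prints 2 (inflated)

        TinyIndex careful = new TinyIndex();
        careful.add("doc1", "lucene index");
        careful.update("doc1", "lucene index"); // second pass replaces doc1
        System.out.println(careful.docFreq("lucene")); // prints 1
    }
}
```

With the naive second pass, a term that really occurs in one document reports df = 2, so a "df < 2" filter would miss it.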

As an optimization, you might consider the following:

1. Index all documents.
2. Find all terms with df = 1.
3. Remove all documents containing any such term, keeping track of their external document ids.
4. Add the terms to your stop list.
5. Re-index only the previously removed documents.
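The five steps above can be sketched as follows. This is an in-memory illustration under simplifying assumptions (whitespace tokenization, a map from external id to term set standing in for the index), not Lucene code; RareTermPruner and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// In-memory sketch of the five-step workflow; all names are hypothetical.
final class RareTermPruner {
    // Splits text into lowercase terms and drops stop words.
    public static Set<String> analyze(String text, Set<String> stopWords) {
        Set<String> terms = new TreeSet<>();
        for (String t : text.toLowerCase().trim().split("\\s+")) {
            if (!t.isEmpty() && !stopWords.contains(t)) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Counts, for each term, how many indexed documents contain it.
    public static Map<String, Integer> docFreqs(Map<String, Set<String>> index) {
        Map<String, Integer> df = new TreeMap<>();
        for (Set<String> terms : index.values()) {
            for (String t : terms) {
                df.merge(t, 1, Integer::sum);
            }
        }
        return df;
    }

    // Runs the five steps. Extends stopWords in place and returns the
    // external ids of the documents that had to be re-indexed.
    public static Set<String> prune(Map<String, String> texts,
                                    Map<String, Set<String>> index,
                                    Set<String> stopWords) {
        // 1. Index all documents.
        for (Map.Entry<String, String> e : texts.entrySet()) {
            index.put(e.getKey(), analyze(e.getValue(), stopWords));
        }
        // 2. Find all terms with df = 1.
        Set<String> rare = new TreeSet<>();
        for (Map.Entry<String, Integer> e : docFreqs(index).entrySet()) {
            if (e.getValue() == 1) {
                rare.add(e.getKey());
            }
        }
        // 3. Remove documents containing a rare term, remembering their ids.
        Set<String> removed = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : index.entrySet()) {
            if (!Collections.disjoint(e.getValue(), rare)) {
                removed.add(e.getKey());
            }
        }
        index.keySet().removeAll(removed);
        // 4. Add the rare terms to the stop list.
        stopWords.addAll(rare);
        // 5. Re-index only the previously removed documents.
        for (String id : removed) {
            index.put(id, analyze(texts.get(id), stopWords));
        }
        return removed;
    }

    public static void main(String[] args) {
        Map<String, String> texts = new LinkedHashMap<>();
        texts.put("d1", "the quick fox");
        texts.put("d2", "the lazy fox");
        texts.put("d3", "the fox");
        Set<String> stopWords = new HashSet<>(Arrays.asList("the"));
        Map<String, Set<String>> index = new TreeMap<>();

        Set<String> reindexed = prune(texts, index, stopWords);
        System.out.println(reindexed);       // prints [d1, d2]
        System.out.println(docFreqs(index)); // prints {fox=3}
    }
}
```

Document d3 is never touched after the first pass, which is the point of the optimization: only documents that actually contain a rare term pay the cost of re-indexing.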

Of course, first you have to think carefully about the use case for removing these terms:

1. Why does it matter if they occur in the index?
2. What happens if you update the index later and add a new document that raises the df of some term from 1 to 2? You wouldn't be able to index on that term, since it would already be in the stop list.
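The second concern can be illustrated with a small standard-library sketch (again a toy model with hypothetical names, not Lucene). Once a term has been moved to the stop list because its df was 1, a later document containing it is filtered at indexing time, so the index never sees the term's df reach 2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy illustration of the stop-list pitfall; all names are hypothetical.
final class StopListPitfall {
    // Counts the documents that still contain the term after stop words
    // have been filtered out, i.e. the df the index would report.
    public static int indexedDocFreq(List<String> docs, Set<String> stopWords, String term) {
        int n = 0;
        for (String doc : docs) {
            String[] tokens = doc.toLowerCase().trim().split("\\s+");
            Set<String> terms = new HashSet<>(Arrays.asList(tokens));
            terms.removeAll(stopWords);
            if (terms.contains(term)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>(Arrays.asList("apple zebra", "apple pie"));
        Set<String> stopWords = new HashSet<>();

        // "zebra" occurs in one document only, so the pruning pass stop-lists it.
        stopWords.add("zebra");

        // Later, a new document arrives that also contains "zebra".
        docs.add("zebra crossing");

        // df as the index sees it, with "zebra" filtered out:
        System.out.println(indexedDocFreq(docs, stopWords, "zebra")); // prints 0
        // df the term would have without the stop list:
        System.out.println(indexedDocFreq(docs, Collections.<String>emptySet(), "zebra")); // prints 2
    }
}
```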
