Best way to create a Lucene index with a frequently updated field, and filter results by that field
I use Lucene to index and search my documents. I currently have 800k documents indexed in Lucene. Each document has these fields:
Id: a numeric field that identifies the document
Name: a text field, stored and analyzed
Description: a text field, stored and analyzed like Name
Availability: a numeric field used to filter results. This field can change frequently, every day.
My question is: what is the best way to filter by availability?
1 - Add this information to the index and use a Lucene filter. With this approach I have to re-add the document (delete and add, because Lucene 3.0.2 cannot change a field of an indexed document in place) every time "availability" changes. What is the cost of reindexing?
2 - Don't add this information to the index, and filter the results with a DB select. This approach requires a lot of selects, because I have to look up every id in the database to check its availability.
3 - Create a separate index with only id and availability. I don't know whether this is a good solution, but I could keep one index with the static information and another with the information that is updated frequently. I think that is better than updating the whole document just because one field changed.
I would stay away from 2. If you can handle the search in Lucene alone, instead of Lucene plus DB, do it. I have this case in my project (Lucene search + DB search), but only because there is no way around it.
Internally, an update consists of:
delete the doc
insert new doc (with new field).
I would try approach 1 first, as it is the simplest. If the performance is good enough, stick with it; if not, look for ways to optimize it or try 3.
Answer from the Lucene mailing list:
How often is "frequently"? How many updates do you expect to do in a day? And how quickly must those updates be reflected in the search results?
800K documents isn't all that many. I'd go with the simple approach first and monitor the results, and only then go to a more complex solution if you see a problem arising. Just update (delete/add) the documents when the value changes.
Well, the cost to reindex is just about what the cost to index it originally was. The old version of the document is marked deleted and the new one is added, so it's essentially the same cost as indexing a new document. This leaves some gaps in your index (the deleted docs are still in there), but the next optimize will compact them.
From which you may infer that optimizing is the expensive part. I'd do that, say, once daily (or even weekly).
HTH Erick
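To make the reply above concrete, here is a minimal toy model in plain Java of the behaviour it describes: an update marks the old copy deleted and appends a new one, and a later optimize compacts the gaps. This is not Lucene code and does not reflect Lucene's actual data structures; the class ToyIndex and all of its methods are hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of an append-only index. Hypothetical; NOT Lucene's implementation. */
final class ToyIndex {
    /** One physical document slot; 'deleted' marks a stale copy. */
    private static final class Slot {
        final int id;
        final int availability;
        boolean deleted;

        Slot(int id, int availability) {
            this.id = id;
            this.availability = availability;
        }
    }

    private final List<Slot> slots = new ArrayList<>();           // physical storage
    private final Map<Integer, Slot> liveById = new HashMap<>();  // current copy per id

    /** Add or update: mark the old copy deleted (if any), then append a new copy. */
    void update(int id, int availability) {
        Slot old = liveById.get(id);
        if (old != null) {
            old.deleted = true; // the stale copy stays in storage until optimize()
        }
        Slot fresh = new Slot(id, availability);
        slots.add(fresh);
        liveById.put(id, fresh);
    }

    /** Slots physically held, including copies marked deleted. */
    int totalSlots() {
        return slots.size();
    }

    /** Documents still visible to a search. */
    int liveDocs() {
        return liveById.size();
    }

    /** Current availability for an id, or -1 if the id is unknown. */
    int availabilityOf(int id) {
        Slot slot = liveById.get(id);
        return slot == null ? -1 : slot.availability;
    }

    /** Rough analogue of an optimize: scan all slots and drop the deleted ones. */
    void optimize() {
        slots.removeIf(slot -> slot.deleted);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.update(1, 5);
        index.update(2, 0);
        index.update(1, 3); // availability of document 1 changed
        System.out.println(index.totalSlots() + " slots, " + index.liveDocs() + " live");
        index.optimize();
        System.out.println(index.totalSlots() + " slots, " + index.liveDocs() + " live");
    }
}
```

In this model each update does a constant amount of work, while optimize() has to walk every slot, which mirrors the point that compaction is the expensive step and can be run on a schedule rather than after every change.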