Efficient keyword search in small texts
I have many small texts (let's say about 500 words each) and two databases with roughly 10,000 keyword entries each.
I now want to process every text and find out which of the keywords stored in the two databases are contained in it.
Does any of you have a good approach for doing this efficiently?
I wanted to process every text and index it (perhaps with Lucene) before matching the database keywords against it, but I don't really know whether Lucene is the right tool for this.
Lucene is exactly the right tool for this task.
One way to achieve your goal would be to use a RAMDirectory to index each text and then get a TermEnum from the index using the IndexReader. You can then match those terms against the keywords in your DB.
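A minimal sketch of this approach, assuming the Lucene 3.x API (where IndexReader.terms() exposes a TermEnum); the field name "content" and the helper class are made up for illustration, and the DB keywords are assumed to be single, lowercased tokens so they line up with StandardAnalyzer's output:

```java
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class TermEnumMatcher {

    // Returns the subset of dbKeywords that occur in the given text.
    public static Set<String> matchKeywords(String text, Set<String> dbKeywords) throws Exception {
        RAMDirectory dir = new RAMDirectory();

        // Index the single text in memory.
        IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
        IndexWriter writer = new IndexWriter(dir, cfg);
        Document doc = new Document();
        doc.add(new Field("content", text, Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // Walk the term dictionary and intersect it with the keyword set.
        Set<String> found = new HashSet<String>();
        IndexReader reader = IndexReader.open(dir);
        TermEnum terms = reader.terms();
        while (terms.next()) {
            String term = terms.term().text();
            if (dbKeywords.contains(term)) {
                found.add(term);
            }
        }
        terms.close();
        reader.close();
        return found;
    }
}
```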
Another approach would be to index every text as a Lucene document and then iterate over your keywords and get the TermDocs for each term => all texts that contain the current term/keyword.
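Under the same Lucene 3.x assumption, the second approach might look like this; again the field name and class are illustrative:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.store.Directory;

public class KeywordLookup {

    // Maps each keyword to the Lucene doc ids of the texts containing it.
    public static Map<String, List<Integer>> textsPerKeyword(Directory index, List<String> keywords)
            throws Exception {
        Map<String, List<Integer>> result = new HashMap<String, List<Integer>>();
        IndexReader reader = IndexReader.open(index);
        try {
            for (String keyword : keywords) {
                // Keywords are assumed to be single, lowercased tokens matching the analyzer's output.
                TermDocs docs = reader.termDocs(new Term("content", keyword));
                List<Integer> ids = new ArrayList<Integer>();
                while (docs.next()) {
                    ids.add(docs.doc());
                }
                docs.close();
                if (!ids.isEmpty()) {
                    result.put(keyword, ids);
                }
            }
        } finally {
            reader.close();
        }
        return result;
    }
}
```

The difference between the two sketches is mainly one of direction: the first checks a single text at a time against the keyword set, while the second builds one index over all texts and answers, per keyword, which texts contain it.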
Your text needs to be indexed in some manner in order to search against it. You have two options:
1) Load your texts into a MySQL DB and make the field/column full-text searchable
2) As you say, index with Lucene.
Then read your keywords into a list, loop over them, and query against Lucene/MySQL.
Given that your data sets are not large, I would go with MySQL - it'll be much faster to set up.
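If you take the MySQL route, a minimal JDBC sketch could look like the following; the table name texts, its columns, and the connection details are assumptions, and the content column needs a FULLTEXT index (MyISAM, or InnoDB from MySQL 5.6 on):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class FullTextLookup {

    // Assumed schema: CREATE TABLE texts (id INT PRIMARY KEY, content TEXT,
    //                                     FULLTEXT KEY ft_content (content));
    public static List<Integer> textsContaining(Connection conn, String keyword) throws Exception {
        String sql = "SELECT id FROM texts WHERE MATCH(content) AGAINST(?)";
        PreparedStatement stmt = conn.prepareStatement(sql);
        stmt.setString(1, keyword);
        ResultSet rs = stmt.executeQuery();
        List<Integer> ids = new ArrayList<Integer>();
        while (rs.next()) {
            ids.add(rs.getInt("id"));
        }
        rs.close();
        stmt.close();
        return ids;
    }

    public static void main(String[] args) throws Exception {
        // Connection details are placeholders.
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/mydb", "user", "password");
        for (String keyword : new String[] { "example", "keyword" }) {
            System.out.println(keyword + " -> " + textsContaining(conn, keyword));
        }
        conn.close();
    }
}
```

Note that MySQL's natural-language full-text search ignores stopwords and words shorter than ft_min_word_len, so very short keywords may need separate handling.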