Search engine that combines indexed text with user-generated tags
I need a customizable search engine that combines normal indexing of unstructured HTML documents with user-generated tags for each document of a web application. I already have an algorithm that assigns a score to each tag; I'd like to integrate the weights of a document's tags with the search engine's indexing system.
The most mature open source framework for handling your problem is definitely Lucene. Whether you use Lucene in its native form or through an abstraction layer such as Solr, as @steen has mentioned, is up to you. Either way, the basic idea is simple.
1- Prepare your source documents for indexing. You could use Tika or any native XML parser. (By "prepare" I mean that you need to split each document into individual fields.)
2- As far as I understand, you do not need any special analyzer; the StandardAnalyzer that comes packaged with Lucene should do. Just make sure the fields are analyzed with norms enabled when indexing (in older Lucene APIs such as 3.x, that is Field.Index.ANALYZED rather than Field.Index.ANALYZED_NO_NORMS).
3- You need norms because that is where Lucene stores index-time boosts. With norms enabled, you can specify a weight for each field while indexing.
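To make the idea of per-field weights concrete, here is a minimal sketch in plain Python. It is not Lucene's scoring formula, only a toy inverted index under stated assumptions: the names `ToyIndex` and `FIELD_BOOSTS` and the boost values are hypothetical, and the tag scores stand in for the output of your own tag-scoring algorithm. Note also that a real Lucene norm is stored per field per document, not per tag, so treat this as a conceptual illustration only.

```python
import math
from collections import defaultdict

# Hypothetical per-field weights: a match in "tags" counts more than in "body".
FIELD_BOOSTS = {"body": 1.0, "tags": 3.0}


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real analyzer."""
    return text.lower().split()


class ToyIndex:
    def __init__(self):
        # term -> doc_id -> accumulated weight for that term in that document
        self.postings = defaultdict(lambda: defaultdict(float))
        self.doc_ids = set()

    def add(self, doc_id, body, tag_scores):
        """Index body text plus user tags, each tag carrying its own score."""
        self.doc_ids.add(doc_id)
        for term in tokenize(body):
            self.postings[term][doc_id] += FIELD_BOOSTS["body"]
        for tag, score in tag_scores.items():
            for term in tokenize(tag):
                self.postings[term][doc_id] += FIELD_BOOSTS["tags"] * score

    def search(self, query):
        """Return (doc_id, score) pairs, best match first."""
        totals = defaultdict(float)
        for term in tokenize(query):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            # Rarer terms contribute more (a simple inverse document frequency).
            idf = 1.0 + math.log(len(self.doc_ids) / len(docs))
            for doc_id, weight in docs.items():
                totals[doc_id] += weight * idf
        return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


index = ToyIndex()
index.add("a", "lucene is a search library", {"search": 0.9})
index.add("b", "search search search tips", {"cooking": 0.8})
results = index.search("search")
```

Here document "a" mentions the query term only once in its body, but its highly scored "search" tag lifts it above document "b", which repeats the term three times in plain text.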
For someone not familiar with Lucene, all of this may look confusing. I suggest the book Lucene in Action for a deeper understanding of Lucene.
I would definitely go with Solr. You will have to customize a bit to get HTML indexed:
- First off, you will need to think about which elements of the HTML page should go into specific Solr fields. You indicate that the HTML in question is 'unstructured', but if the pages share any common traits at all, you would benefit from storing these in separate fields in your index.
- You should take a look at the Tika HtmlParser, which works very well together with Solr.
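As a rough illustration of splitting a page into fields before indexing, the sketch below uses Python's standard `html.parser` module rather than Tika. The field names `title` and `body` and the helper `extract_fields` are assumptions made for this example; a real setup would map whatever common traits your pages share onto your own Solr schema.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collect <title> text and other visible text into separate fields."""

    SKIP = {"script", "style"}  # content that should never be indexed

    def __init__(self):
        super().__init__()
        self.fields = {"title": [], "body": []}
        self._stack = []  # currently open tags

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates sloppy markup.
        if tag in self._stack:
            while self._stack and self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or any(t in self.SKIP for t in self._stack):
            return
        field = "title" if "title" in self._stack else "body"
        self.fields[field].append(text)


def extract_fields(html):
    """Return a dict mapping field name to the text found for that field."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return {name: " ".join(parts) for name, parts in parser.fields.items()}


page = (
    "<html><head><title>Lucene tips</title><style>p{}</style></head>"
    "<body><h1>Intro</h1><p>Boost your tags</p>"
    "<script>var x=1;</script></body></html>"
)
fields = extract_fields(page)
```

Each resulting field can then be sent to the index separately, which is what makes per-field weighting possible later on.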
On the issue of making user-generated tags provide extra semantic value for the indexed pages, I would suggest reading the Solr Relevancy FAQ for information on how to do index-time boosting of fields.