How to index forum discussions for search?

For a discussion forum, does it work better to index each entry inside a discussion thread as a separate Lucene document, or to simply concatenate all entries within a discussion into one big block of text and index the whole discussion thread as a single Lucene document?


It depends on what kind of search capabilities you are looking for. For example, if you want users to be able to search for keywords that occurred in threads on a particular date, then you must index each entry as a separate document with a date field (a NumericField, searchable using a NumericRangeFilter).
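To make the date-restricted search concrete, here is a minimal sketch in plain Java. It does not use the Lucene API; it only models the idea that each entry carries its own date, so a keyword match can be restricted to a date range. The names ForumEntry, EntryDateSearch and search are made up for illustration.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical per-entry record: one of these would correspond to one
// indexed document, with the date kept as its own field.
class ForumEntry {
    final String threadId;
    final String entryId;
    final LocalDate postedOn;
    final String text;

    ForumEntry(String threadId, String entryId, LocalDate postedOn, String text) {
        this.threadId = threadId;
        this.entryId = entryId;
        this.postedOn = postedOn;
        this.text = text;
    }
}

class EntryDateSearch {
    // Returns the entries whose text contains the keyword (case-insensitive)
    // and whose date lies within [from, to], both ends inclusive.
    static List<ForumEntry> search(List<ForumEntry> entries, String keyword,
                                   LocalDate from, LocalDate to) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        List<ForumEntry> matches = new ArrayList<>();
        for (ForumEntry entry : entries) {
            boolean inRange = !entry.postedOn.isBefore(from) && !entry.postedOn.isAfter(to);
            if (inRange && entry.text.toLowerCase(Locale.ROOT).contains(needle)) {
                matches.add(entry);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<ForumEntry> entries = Arrays.asList(
            new ForumEntry("t1", "e1", LocalDate.of(2010, 9, 1), "Lucene indexing tips"),
            new ForumEntry("t1", "e2", LocalDate.of(2010, 9, 20), "More on Lucene scoring"),
            new ForumEntry("t2", "e3", LocalDate.of(2010, 9, 20), "Unrelated post"));
        for (ForumEntry hit : search(entries, "lucene",
                LocalDate.of(2010, 9, 15), LocalDate.of(2010, 9, 30))) {
            System.out.println(hit.threadId + "/" + hit.entryId + " posted " + hit.postedOn);
        }
    }
}
```

If the thread were stored as one concatenated document, there would be only one date to filter on, so this kind of per-entry restriction would not be possible.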

Indexing every entry as a separate document also lets Lucene score each entry individually, which helps in retrieving the most relevant entries (rather than threads) in response to a query. Additionally, you can add the thread topic as a separate field on each entry document (at the cost of a little more space).

Concatenating all entries is not a good idea if you want to point the user to the exact entry of interest. As to your concern (in the comment on Ryan's answer) about returning multiple entries from the same thread, you can add a thread id to each entry while indexing. Then, when displaying results, you can show only one entry per thread id (for example, the entry with the highest score, displayed along with the thread topic).
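The "one entry per thread id" step can be done after the search returns, roughly like this. This is a sketch in plain Java under the assumption that each hit has already been read into a small holder object; EntryHit, ThreadCollapser and collapseByThread are hypothetical names, not Lucene classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical holder for one search hit; in a real application these values
// would come from the stored fields and score of each matching entry document.
class EntryHit {
    final String threadId;
    final String entryId;
    final float score;

    EntryHit(String threadId, String entryId, float score) {
        this.threadId = threadId;
        this.entryId = entryId;
        this.score = score;
    }
}

class ThreadCollapser {
    // Keeps only the highest-scoring hit for each thread id and returns the
    // survivors ordered by descending score.
    static List<EntryHit> collapseByThread(List<EntryHit> hits) {
        Map<String, EntryHit> bestPerThread = new LinkedHashMap<>();
        for (EntryHit hit : hits) {
            EntryHit current = bestPerThread.get(hit.threadId);
            if (current == null || hit.score > current.score) {
                bestPerThread.put(hit.threadId, hit);
            }
        }
        List<EntryHit> collapsed = new ArrayList<>(bestPerThread.values());
        collapsed.sort(Comparator.comparingDouble((EntryHit h) -> h.score).reversed());
        return collapsed;
    }

    public static void main(String[] args) {
        List<EntryHit> hits = Arrays.asList(
            new EntryHit("t1", "e1", 0.9f),
            new EntryHit("t1", "e2", 0.7f),
            new EntryHit("t2", "e3", 0.8f),
            new EntryHit("t2", "e4", 0.95f));
        for (EntryHit hit : collapseByThread(hits)) {
            System.out.println(hit.threadId + " -> " + hit.entryId);
        }
    }
}
```

One caveat with collapsing on the client side: if you only fetch the top N hits, several of them may belong to the same thread, so you may need to fetch more than N to fill a page of distinct threads.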


If you concatenate all entries within a discussion, you run into the problem that you cannot pinpoint the exact entry you want to retrieve.

Lucene should be able to quickly index and search each entry (post/thread/whatever). Mashing them all together just seems like overkill.


If you decide to index them separately, you can use Solr, which is about to support search result collapsing:

http://www.lucidimagination.com/blog/2010/09/16/2446/


I would prefer to index each entry separately. It makes the design more flexible, since your system should have some kind of topic entity to group the entries in the same thread. Another issue with concatenation is that the whole thread would need to be re-indexed every time a new entry is posted, which has a performance impact.
