Solr schema for article->paragraph structure

I want to index some articles and show the paragraph number in the search results. So I guess the Solr schema should look like this:

article_id, paragraph_number, paragraph_content

Therefore, I need to parse each article first, extract its paragraphs, and index them one by one.

I'm worried about the performance, since one article can contain 100 paragraphs.

Any suggestions?


It is better to do the heavy lifting at index time rather than search time. So parsing the paragraphs out of the document when you index is probably the right way to go.

How many articles do you have? It really shouldn't be a problem to strip paragraphs (we do much more complex pre-processing than that).
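A minimal sketch of that index-time step, assuming paragraphs are separated by blank lines. The field names `article_id`, `paragraph_number` and `paragraph_content` come from the question; the `id` key format and the helper names `split_paragraphs` and `article_to_docs` are hypothetical, and sending the documents to Solr is left out.

```python
import re


def split_paragraphs(text):
    """Split text on blank lines and drop empty chunks.

    Assumption: a paragraph boundary is one or more blank lines.
    """
    chunks = re.split(r"\n\s*\n", text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


def article_to_docs(article_id, text):
    """Build one Solr-style document (a plain dict) per paragraph."""
    docs = []
    for number, content in enumerate(split_paragraphs(text), start=1):
        docs.append({
            # Hypothetical unique key: article id plus paragraph number.
            "id": f"{article_id}_{number}",
            "article_id": article_id,
            "paragraph_number": number,
            "paragraph_content": content,
        })
    return docs
```

An article with 100 paragraphs simply becomes 100 small documents, and they can all go to Solr in a single update request rather than one request per paragraph.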


If you only need to match individual paragraphs against the fulltext query (as opposed to filters etc.), you could also do this with highlighting: split the article into paragraphs, prefix each one with its paragraph number, and index the paragraphs as multiple values of a single field in a single document. At search time, highlight that field with a full match (e.g. a fragment size of -1) and no decoration of the highlight. What you get back is the paragraph that matched the fulltext query, prefixed by its paragraph number, which you'd probably want to pull back out.

Not sure if this fits your use case exactly, but it might be an interesting approach to try. I do something similar to identify photos whose captions match the fulltext query, to display next to article search results.
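The prefixing and the "pull back out" step might look roughly like this. The `|` separator, the field name `paragraphs` and the helper names are hypothetical. The request parameters are Solr's standard highlighting parameters, but the values are assumptions to check against your Solr version: the sketch uses `hl.fragsize=0`, which the Solr documentation describes as "do not fragment", rather than the -1 mentioned above.

```python
import re


def to_multivalued(paragraphs):
    """Prefix each paragraph with its 1-based number, e.g. '3|text'.

    The result is meant to be indexed as the values of one multi-valued
    field in a single article document.
    """
    return [f"{number}|{text}" for number, text in enumerate(paragraphs, start=1)]


# Assumed query parameters for returning whole, undecorated values.
HIGHLIGHT_PARAMS = {
    "hl": "true",
    "hl.fl": "paragraphs",   # hypothetical multi-valued field name
    "hl.fragsize": "0",      # assumption: 0 disables fragmenting
    "hl.simple.pre": "",     # no markup before a matched term
    "hl.simple.post": "",    # no markup after a matched term
}

_PREFIX = re.compile(r"^(\d+)\|(.*)$", re.DOTALL)


def parse_snippet(snippet):
    """Split a returned snippet into (paragraph_number, paragraph_text)."""
    match = _PREFIX.match(snippet)
    if match is None:
        raise ValueError("snippet does not start with a paragraph number prefix")
    return int(match.group(1)), match.group(2)
```

One thing to watch with this layout: depending on the field's analyzer, the numeric prefix may itself be indexed as a term, so a query for a bare number could match it.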
