
What is the formatting of Solr Cell/Tika output, and how do I fix it?

I am using Solr to index DOC, DOCX and PDF files. I enabled stored for the text field and inspected what was indexed. Here's the result from a sample DOC file:

, a mobile user interface (UI) software development company, based in Cambridge, UK. After integrating the company, Qualcomm re-branded their interface markup language and its accompanying integrated development environment (IDE) as HYPERLINK "http://en.wikipedia.org/w/index.php?title=UiOne&action=edit&redlink=1" *\o "UiOne (page does not exist)" uiOne** . In March 2009, Qualcomm informed their Cambridge engineering staff, mostly from the division working on HYPERLINK "http://en.wikipedia.org

The DOC file contains material from Wikipedia. I captured the full output at http://pastebin.com/8FL9eHJv

So Solr Cell/Tika inserts its own formatting, and that formatting shows up in the search output. How can I fix this so that the search results (text snippets) do not contain the formatting?

Googling around tells me that Tika has several output formats, so is that the approach? Or is there a plugin that can filter the text before the results are rendered?

Relevant details: My configuration is close to stock. My upload command is a Python variation of

curl "http://localhost:8983/solr/update/extract?literal.id=doc-qualcomm&commit=true" -F "myfile=@11qualcomm.doc"

My schema.xml http://pastebin.com/VLz2uuDQ

My SolrConfig.xml http://pastebin.com/X2J2jj64


Are you asking about the extra hyperlink items in the search results? If so, try updating the extract request handler in your solrconfig.xml to include:

<str name="captureAttr">false</str>
<str name="fmap.a">ignored_</str>
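To try these two settings on a single upload before editing solrconfig.xml, they can probably be passed as request parameters alongside literal.id. The following is only a sketch of building such a URL in Python: the helper name build_extract_url is made up for illustration, the base URL is copied from the curl command in the question, and it assumes the schema has an ignored_* dynamic field for the mapped content to land in. Whether this removes the HYPERLINK text depends on whether that text comes from anchor elements or from the body text Tika extracts.

```python
from urllib.parse import urlencode

# Base URL taken from the curl command in the question (an assumption
# about where the extract handler is mounted).
EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def build_extract_url(doc_id, base_url=EXTRACT_URL):
    """Build an extract URL that carries the two suggested settings.

    build_extract_url is a hypothetical helper, not part of any Solr client.
    """
    params = {
        "literal.id": doc_id,     # same literal field as in the question
        "commit": "true",
        "captureAttr": "false",   # do not index element attributes
        "fmap.a": "ignored_",     # map <a> content to an ignored_ field
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_extract_url("doc-qualcomm"))
```

The resulting URL can be used in place of the one in the curl command, with the file still sent as the multipart body.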