What is the formatting in Solr Cell/Tika output, and how do I fix it?
I am using Solr to index DOC, DOCX and PDF files. I enabled stored on the text field and inspected what was indexed. Here's the result from a sample DOC file:
, a mobile user interface (UI) software development company, based in Cambridge, UK. After integrating the company, Qualcomm re-branded their interface markup language and its accompanying integrated development environment (IDE) as HYPERLINK "http://en.wikipedia.org/w/index.php?title=UiOne&action=edit&redlink=1" *\o "UiOne (page does not exist)" uiOne** . In March 2009, Qualcomm informed their Cambridge engineering staff, mostly from the division working on HYPERLINK "http://en.wikipedia.org
The DOC contains material from Wikipedia. I captured the full output at http://pastebin.com/8FL9eHJv
So Solr Cell/Tika inserts its own formatting, and that formatting shows up in the search output. How can I fix this so that the search results (text snippets) do not contain the formatting?
Googling around tells me that Tika has several output formats, so is that the approach? Or is there a plugin that can filter the text before the results are rendered?
Relevant details: my configuration is close to stock. My upload command is a Python variation of
curl "http://localhost:8983/solr/update/extract?literal.id=doc-qualcomm&commit=true" -F "myfile=@11qualcomm.doc"
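For reference, a rough Python equivalent of that command could look like the sketch below. The actual script is not shown here, so the function names are hypothetical. It also differs from the curl call in one respect: it sends the file as the raw request body with an explicit Content-Type instead of as a multipart form field, which the extract handler also accepts.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Same endpoint as the curl command above.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def build_extract_url(doc_id: str, commit: bool = True) -> str:
    """Build the extract URL with literal.id and commit parameters."""
    params = {"literal.id": doc_id, "commit": "true" if commit else "false"}
    return f"{SOLR_EXTRACT_URL}?{urlencode(params)}"


def upload(path: str, doc_id: str, content_type: str = "application/msword") -> int:
    """POST the file as the raw request body and return the HTTP status code.

    Assumption: the default content type fits a .doc file; pass a different
    one for DOCX or PDF.
    """
    with open(path, "rb") as fh:
        body = fh.read()
    request = Request(
        build_extract_url(doc_id),
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urlopen(request) as response:
        return response.status
```

Usage would be along the lines of `upload("11qualcomm.doc", "doc-qualcomm")` against a running Solr instance.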
My schema.xml http://pastebin.com/VLz2uuDQ
My SolrConfig.xml http://pastebin.com/X2J2jj64
Are you asking about the extra hyperlink items in the search results? If so, try updating the extract request handler in your solrconfig.xml to
<str name="captureAttr">false</str><str name="fmap.a">ignored_</str>
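If the `HYPERLINK "..."` field codes still appear in the stored text after that change, a possible fallback is to strip them on the client side before displaying snippets. The following is only a sketch: the pattern is inferred from the sample output in the question, not from any Tika or Solr specification, and `strip_hyperlink_fields` is a hypothetical helper name.

```python
import re

# Assumed shape, based on the sample in the question:
#   HYPERLINK "<url>" \o "<tooltip>" <visible link text>
# The \o "<tooltip>" switch is treated as optional.
FIELD_CODE = re.compile(r'HYPERLINK\s+"[^"]*"(?:\s+\\o\s+"[^"]*")?\s*')


def strip_hyperlink_fields(text: str) -> str:
    """Remove Word HYPERLINK field codes, keeping the visible link text."""
    return FIELD_CODE.sub("", text)


if __name__ == "__main__":
    sample = (
        'environment (IDE) as HYPERLINK "http://example.org/UiOne" '
        '\\o "UiOne (page does not exist)" uiOne.'
    )
    print(strip_hyperlink_fields(sample))
```

A field code that is cut off mid-URL, as at the end of the sample above, has no closing quote and is left untouched by this pattern.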