What is the formatting in Solr Cell/Tika output, and how do I fix it?

Asked 2023-03-21 23:48

I am using Solr to index DOC, DOCX and PDF files. I set the text field to stored so that I could inspect what was extracted. Here is the result from a sample DOC file:

, a mobile user interface (UI) software development company, based in Cambridge, UK. After integrating the company, Qualcomm re-branded their interface markup language and its accompanying integrated development environment (IDE) as HYPERLINK "http://en.wikipedia.org/w/index.php?title=UiOne&action=edit&redlink=1" *\o "UiOne (page does not exist)" uiOne** . In March 2009, Qualcomm informed their Cambridge engineering staff, mostly from the division working on HYPERLINK "http://en.wikipedia.org

The DOC file contains material from Wikipedia. I captured the full output at http://pastebin.com/8FL9eHJv

So Solr Cell/Tika inserts its own formatting, and that formatting shows up in the search output. How can I fix this so that the search results (text snippets) do not contain the formatting?

Googling around tells me that Tika has several output formats, so is that the right approach? Or is there a plugin that can filter the text before the results are rendered?

Relevant details: my configuration is close to stock. My upload command is a Python variation of

curl "http://localhost:8983/solr/update/extract?literal.id=doc-qualcomm&commit=true" -F "myfile=@11qualcomm.doc"

My schema.xml: http://pastebin.com/VLz2uuDQ

My solrconfig.xml: http://pastebin.com/X2J2jj64

Are you asking about the extra hyperlink items in the search results? If so, try updating the extract request handler in your solrconfig.xml to:

<str name="captureAttr">false</str>
<str name="fmap.a">ignored_</str>
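If you would rather try this on a single upload before editing solrconfig.xml, the same two settings can probably be passed as request parameters. This assumes they sit under the handler's "defaults" section, as in the stock configuration, so request parameters override them. Below is a minimal Python sketch that only builds the URL. The helper name build_extract_url is made up for illustration, and the base URL and document id are copied from the curl command in the question.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, commit=True):
    """Build a Solr Cell extract URL that overrides link capturing per request.

    Hypothetical helper for illustration; it is not part of Solr or any client library.
    """
    params = {
        "literal.id": doc_id,                      # document id, as in the question
        "commit": "true" if commit else "false",   # commit right after the upload
        "captureAttr": "false",                    # do not index element attributes
        "fmap.a": "ignored_",                      # map <a> content to the ignored_ field
    }
    return f"{base_url}/update/extract?{urlencode(params)}"


url = build_extract_url("http://localhost:8983/solr", "doc-qualcomm")
print(url)
```

You can drop the printed URL into the curl command from the question, keeping the -F "myfile=@11qualcomm.doc" part unchanged. If the snippets look clean after re-indexing the document, move the two settings into solrconfig.xml.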
