Using Solr CELL's ExtractingRequestHandler to index/extract files from package formats

2023-01-21 17:41 问答作者：

Can you use ExtractingRequestHandler and Tika with any of the compressed file formats (zip, tar, gz, etc) to extract the content out for indexing?

I am sending solr the archived.tar file using curl. curl " http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=body_texts&commit=true" -H 'Content-type:application/octet-stream' --data-binary "@/home/archived.tar" The result I get when I query the document is that the file names inside the archive are indexed as the "body_texts", but the content of those files is not extracted or included. This is not the behavior I expected. Ref: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#article.tika.example. When I send 1 of the actual documents inside the archive using the same curl command the extracted content is then stored in the "body_texts" field. Am I missing a step for the compressed files?

I have added all the extraction dependencies as indicated by mat in http://outoftime.lighthouseapp.com/projects/20339/tickets/98-solr-cell and am able to successfully extract data from MS Word, PDF, HTML documents.

I'm using the following library versions. Solr 1.40, Solr Cell 1.4.1, with Tika Core 0.4

Given everything I have read this version of Tika should support extracting data from all files with开发者_StackOverflowin a compressed file. Any help or suggestions would be appreciated.

The short answer: Solr Cell 1.4.1 and Tika Core 0.6.

The long answer: After a lot of headaches I was able to get this working. I'll answer it for both people using solr directly and for people using solr with the Ruby library sunspot (which was my problem).

Here was what I did: I used this https://github.com/tomasc/sunspot_cell plugin to extend sunspot and give it the attachment feature. (Ignore this step if you're not using ruby/sunspot)

v1.4.1 works for individual files but not with compressed files, so I had to explore a bit. I downloaded the v1.4.1 codebase from http://lucene.apache.org/solr/ and grabbed the dist/apache-solr-cell-1.4.1.jar then I had to pull down the Tika libraries from the 1.5 branch http://svn.apache.org/viewvc/lucene/solr/branches/branch-1.5-dev/contrib/extraction/lib/.

You can download each individually, or you can use svn to checkout the branch by

svn co http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.5-dev

Or just checkout the library folder:

svn co http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.5-dev/contrib/extraction/lib/

继续阅读：apache-tika full-text-search solr solr-cell

Using Solr CELL's ExtractingRequestHandler to index/extract files from package formats

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？