Solr + Heritrix

2022-12-09 22:15 Q&A author:

How can I integrate Solr with Heritrix?

I want to archive a site using Heritrix and then index and search the archive locally using Solr.

Thanks

The problem with using Solr to index is that it is a straight text index (which may be fine if you are only crawling an internal website and don't care about 'pagerank').

Using Nutch will give you a much better index, however, as it does use pagerank.

NutchWAX

If, however, you are dead set on using Heritrix and would like pagerank-based search results, you could use NutchWAX (Nutch Web Archive eXtensions) to index Heritrix's output (that's what the makers of Heritrix are doing).

NutchWAX is intended for web archives but can also be used to create a search engine of the live web (in fact that is easier, as you aren't dragging years' worth of data along during each rebuild of the index).

Solr

If you do want to use Heritrix+Solr to create a search website, you should probably replace the "ARCWriter" processor in Heritrix with a custom processor that submits the contents of the page to Solr.

The Solr end is just an XML file posted via HTTP and is dead simple.
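To illustrate how small the Solr side is, here is a minimal Python sketch that builds an <add> message and posts it over HTTP. The field names (id, title, text) are assumptions and have to match whatever your Solr schema defines; the update URL is likewise only an example.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Build a Solr <add> update message from a list of {field: value} dicts."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="utf-8")


def post_to_solr(update_url, payload):
    """POST an XML update message (bytes) to Solr and return the HTTP status."""
    request = urllib.request.Request(
        update_url,
        data=payload,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Example usage (assumed endpoint and field names, needs a running Solr):
#   url = "http://localhost:8983/solr/update"
#   page = {"id": "http://example.com/", "title": "Example", "text": "Page body"}
#   post_to_solr(url, build_add_xml([page]))
#   post_to_solr(url, b"<commit/>")  # make the new documents searchable
```

A custom Heritrix processor would do the equivalent in Java for each fetched page; the message format stays the same.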

The Heritrix end is a little bit more complicated, but the Developer's Manual will get you started on writing a Processor for Heritrix 1.x. If you are using the as-yet-unstable 3.x or the discontinued 2.x, you'll need to do a little more legwork, as the documentation isn't there yet.

There is a section in the Solr 1.4 Enterprise Search book about using Heritrix and Solr together. Basically, use Heritrix to crawl, and then in a separate process parse the archive files and add them to Solr. While you lose out on things like the page rank scores that Nutch provides, it does simplify things because your crawler and your search engine are separate tools.

This is basically the approach that Mauricio uses, storing data into MySQL as an intermediate step. We published all the source for the book on an Amazon EC2 AMI; look for "solrbook". Also, the support site at Packt (http://www.packtpub.com/solr-1-4-enterprise-search-server) will let you download the sample.
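As a rough idea of what the separate parsing step involves, the Python sketch below walks an uncompressed ARC stream and turns each HTTP capture into a dict that could be sent to Solr. This is not the code from the book. It assumes the ARC v1 record header layout (URL, IP address, archive date, content type, length, separated by single spaces), and the output field names (id, crawl_date, content) are hypothetical.

```python
import io


def iter_arc_records(stream):
    """Yield (header, body) pairs from an uncompressed ARC v1 byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank line separating two records
        # Assumed header: URL IP-address Archive-date Content-type Archive-length
        url, ip, date, content_type, length = line.decode("utf-8", "replace").split()
        body = stream.read(int(length))
        yield {"url": url, "date": date, "content_type": content_type}, body


def to_solr_doc(header, body):
    """Turn one HTTP capture into a Solr document dict, or None to skip it."""
    if not header["url"].startswith("http"):
        return None  # e.g. the filedesc:// record at the start of the file
    # The stored body is the raw HTTP response; drop the response headers.
    _, _, payload = body.partition(b"\r\n\r\n")
    return {
        "id": header["url"],
        "crawl_date": header["date"],
        "content": payload.decode("utf-8", "replace"),
    }
```

For compressed .arc.gz files, opening the file with gzip.open(path, "rb") and passing the result to iter_arc_records should work in the same way, since Python's gzip module reads concatenated members as one stream. A real indexer would also want to strip the HTML markup and honour each page's character encoding before sending the text to Solr.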

For the same purpose I used YouSeer.

First download YouSeer.jar and then run:

java -jar YouSeer.jar http://localhost:8983/solr/update /cygdrive/d/arcs /cached 3 0

It internally uses the ArcReader to read documents and then uploads them to Solr. The YouSeer code is fairly simple, and I had to modify it a bit for my purposes.

According to this message, yes:

It is pretty easy to add custom writers to Heritrix. We write our crawls to MySQL and then ingest into Solr from there. It would not be hard to write a Heritrix writer that writes directly to Solr however.

-- Sean Timm

Or you might want to use Nutch instead; more work has been done towards integrating it with Solr:

http://wiki.apache.org/nutch/RunningNutchAndSolr
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
