Simple Nutch 1.3/Solr index explanation

Posted: 2023-04-05 03:48

After much searching, it doesn't seem like there's any straightforward explanation of how to use Nutch 1.3 with Solr.

I have a Solr index with other content in it that I'll be using on a website for search.

I'd like to add Nutch results to the index, which will add external sites to the website's search.

All of this is working just fine.

The question is, how do you refresh the index? Do you have to delete all of the Nutch results from Solr first, or does Nutch take care of that? Does Nutch remove results that are no longer valid from the Solr index?

Shell scripts with no documentation or explanation of what they are doing haven't been helpful with answering these questions.

The Nutch schema defines id (= url) as the unique key. If you re-crawl a URL, the document will be replaced in the Solr index when Nutch posts the data to Solr.
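To make the unique-key behaviour concrete, here is a toy model in Python. It is not Solr or Nutch code: `ToyIndex` and the example.com URLs are made up for illustration, and the only thing it assumes is what is stated above, namely that the id (the URL) is the unique key.

```python
# Toy model of an index with a unique key. This is NOT Solr or Nutch code;
# ToyIndex is a hypothetical stand-in used only to show the semantics.
class ToyIndex:
    def __init__(self):
        self.docs = {}  # unique key (the URL) -> document fields

    def add(self, doc):
        # Adding a document whose id already exists overwrites the old copy.
        self.docs[doc["id"]] = doc


index = ToyIndex()
index.add({"id": "http://example.com/a", "title": "Old title"})
index.add({"id": "http://example.com/b", "title": "Other page"})

# Re-crawl: the same URL is posted again with fresh content.
index.add({"id": "http://example.com/a", "title": "New title"})

print(len(index.docs))                              # still two documents
print(index.docs["http://example.com/a"]["title"])  # the newer copy wins
```

So re-posting a URL does not create a duplicate, which is why you don't need to wipe the Nutch documents before re-indexing pages that still exist.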

Well, you need to implement incremental crawling in Nutch, and how you do that depends on your application. Some people want to recrawl every day, others every 3 months. In any case, the maximum interval is 90 days.

The general idea is to delete crawl segments that are older than your maximum recrawl interval, since they will be redundant by then, and then produce a fresh solrindex for use in Solr.
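As a rough sketch of the "find the old segments" step, assuming your segment directories carry the timestamp names Nutch gives them (for example 20110718123456) and that 90 days is your cutoff: the function name `stale_segments` and the sample names below are made up for illustration, and this only picks out candidates, it does not delete anything.

```python
from datetime import datetime, timedelta

# Assumption: segment directories are named by creation time, yyyyMMddHHmmss.
SEGMENT_NAME_FORMAT = "%Y%m%d%H%M%S"


def stale_segments(segment_names, now, max_age_days=90):
    """Return the segment names older than max_age_days, oldest first."""
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for name in segment_names:
        try:
            created = datetime.strptime(name, SEGMENT_NAME_FORMAT)
        except ValueError:
            continue  # not a timestamp-named segment, leave it alone
        if created < cutoff:
            stale.append(name)
    return sorted(stale)


# Hypothetical directory listing of a crawl's segments folder.
names = ["20110101120000", "20110601120000", "notes.txt"]
print(stale_segments(names, now=datetime(2011, 7, 1)))
```

With these sample names only the January segment is past the cutoff. In a real script you would list the segments directory, remove the returned directories, and then rebuild the Solr index from what is left.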

I'm afraid you have to do that yourself with scripting. One day I may put some of the scripts I wrote for this on the wiki, but they are not ready to publish as they stand.

Try Lucidworks' enterprise Solr for testing/prototyping, which has a web crawler built in.

http://www.lucidimagination.com/products/lucidworks-search-platform/enterprise

It'll give you a feel for the whole Lucene stack. It has a MUCH better interface than any other Java software I've ever used. It's a joy to use.
