Simple Nutch 1.3/Solr index explanation
After much searching, it doesn't seem like there's any straightforward explanation of how to use Nutch 1.3 with Solr.
I have a Solr index with other content in it that I'll be using on a website for search.
I'd like to add Nutch results to the index, which will add external sites to the website's search.
All of this is working just fine.
The question is, how do you freshen the index? Do you have to delete all of the Nutch results from Solr first? Or does Nutch take care of that? Does Nutch remove results that are no longer valid from the Solr index?
Shell scripts with no documentation or explanation of what they are doing haven't been helpful with answering these questions.
The Nutch schema defines id (= url) as the unique key. If you re-crawl a URL, its document will be replaced in the Solr index when Nutch posts the data to Solr.
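To make the unique-key behaviour concrete, here is a toy sketch in Python. It is not Solr or Nutch client code: the dict and the post_documents helper are made up for illustration, and only mimic "a document with an existing id replaces the old one".

```python
# Toy model of an index whose unique key is "id" (the page URL).
# Hypothetical illustration only; no Solr or Nutch API is used here.

def post_documents(index, docs):
    """Add docs to the index; a doc whose id already exists replaces the old one."""
    for doc in docs:
        index[doc["id"]] = doc
    return index

index = {}

# First crawl: two pages are posted.
post_documents(index, [
    {"id": "http://example.com/a", "title": "A, first crawl"},
    {"id": "http://example.com/b", "title": "B, first crawl"},
])

# Re-crawl: only page A is posted again, with new content.
post_documents(index, [
    {"id": "http://example.com/a", "title": "A, second crawl"},
])
```

In this model the entry for page A is overwritten rather than duplicated, so you don't need to delete it first. The entry for page B is still there, though: overwriting by key on its own does nothing about documents that are simply never posted again.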
Well, you need to implement incremental crawling in Nutch, and how you do that depends on your application. Some people want to recrawl every day, others every 3 months. The max is 90 days in any case.
The general idea is to delete crawl segments that are older than your maximum recrawl interval, since they will be redundant by then, and then produce a fresh solrindex for use in Solr.
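As a rough sketch of the "find the old segments" part, the Python below picks out segment names older than a cutoff. It assumes your segment directories use Nutch's usual timestamp names (yyyyMMddHHmmss, e.g. 20110718114601) and takes the 90-day figure from above as the default. The function name expired_segments is made up for this example, and actually deleting the directories and re-running the indexing step is left to your own script.

```python
from datetime import datetime, timedelta

def expired_segments(segment_names, now, max_age_days=90):
    """Return the segment names whose timestamp is older than max_age_days.

    Assumes names look like 20110718114601 (yyyyMMddHHmmss);
    anything that does not parse as such a timestamp is ignored.
    """
    cutoff = now - timedelta(days=max_age_days)
    expired = []
    for name in segment_names:
        try:
            created = datetime.strptime(name, "%Y%m%d%H%M%S")
        except ValueError:
            continue  # not a timestamp-named segment; leave it alone
        if created < cutoff:
            expired.append(name)
    return sorted(expired)

# Example: names as they might appear under a segments directory.
names = ["20110101120000", "20110620093000", "20110718114601", "notes"]
old = expired_segments(names, now=datetime(2011, 7, 20))
```

Passing `now` in explicitly (rather than calling datetime.now() inside) makes it easy to dry-run the selection before anything gets deleted.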
I'm afraid you have to do that yourself with scripting. One day I may put some scripts I wrote for that on the wiki, but they are not ready to publish as they stand.
Try Lucidworks' enterprise Solr for testing/prototyping, which has a web crawler built in.
http://www.lucidimagination.com/products/lucidworks-search-platform/enterprise
It'll give you a feel for the whole Lucene stack. It has a MUCH better interface than any other Java software I've ever used. It's a joy to use.