
Comparison of Nutch vs. Heritrix

I want to choose one of the two for building a crawling framework for specific web sites. This is not an internet-wide crawl, and I am not building a search index; I am only interested in scraping specific pages from those sites.

Could somebody please detail the pros and cons of each? Thanks, Nayn


Your main task is to scrape specific pages from a web site.

Nutch: open-source web-search software, built on Lucene Java.

Heritrix: the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

Since you are not building a search index, I think Heritrix, which is a crawler rather than search software, is a much better fit for your project than Nutch.

Learning a framework or library is a valuable exercise, but it takes time. Since your task is not very complex, it may be less painful to write a simple crawler from scratch in Java.
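To give an idea of what "a simple crawler from scratch" could look like, here is a minimal sketch, not a production implementation. It assumes Java 11+ (for `java.net.http.HttpClient`), stays on the start URL's host, and finds links with a naive regular expression instead of a real HTML parser. The class name `SimpleSiteCrawler`, the methods `extractLinks` and `crawl`, the page limit, and the one-second delay are all made up for illustration. It also ignores robots.txt and redirects, which you would probably want to handle for a real site.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleSiteCrawler {
    // Naive href matcher (assumption: good enough for a sketch).
    // It stops at '#', so fragments are dropped and pure "#anchor" links are ignored.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    // Returns the links in html, resolved against base, that stay on base's host.
    public static List<URI> extractLinks(String html, URI base) {
        List<URI> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                URI resolved = base.resolve(m.group(1).trim());
                String host = base.getHost();
                if (host != null && host.equalsIgnoreCase(resolved.getHost())) {
                    links.add(resolved);
                }
            } catch (IllegalArgumentException e) {
                // Malformed href: skip it.
            }
        }
        return links;
    }

    // Breadth-first crawl from start, visiting at most maxPages URLs.
    public static Set<URI> crawl(URI start, int maxPages) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        Set<URI> visited = new HashSet<>();
        Deque<URI> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            URI url = queue.poll();
            if (!visited.add(url)) {
                continue; // already fetched
            }
            HttpRequest request = HttpRequest.newBuilder(url).GET().build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 200) {
                continue; // redirects and errors are simply skipped here
            }
            // Scrape response.body() here for the data you need.
            for (URI link : extractLinks(response.body(), url)) {
                if (!visited.contains(link)) {
                    queue.add(link);
                }
            }
            Thread.sleep(1000); // be polite to the server
        }
        return visited;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java SimpleSiteCrawler <start-url>
        for (URI page : crawl(URI.create(args[0]), 20)) {
            System.out.println(page);
        }
    }
}
```

If the pages you care about follow a known URL pattern, you could filter inside `extractLinks` so that only matching links are queued, which keeps the crawl small.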
