Give a comparison of Nutch vs. Heritrix
I want to select one of the above for building a crawling framework for specific web sites. This is not an internet-wide crawl. I am not building a search index; rather, I am interested in scraping specific pages from those sites.
Could somebody please detail the pros and cons of each? Thanks, Nayn
Your main task is to scrape specific pages from the web site.
Nutch: open-source web-search software, built on Lucene Java.
Heritrix: the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
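Nutch is geared toward building a search index, which you say you do not need, while Heritrix is focused on crawling itself. So I think Heritrix is a better fit than Nutch for your project.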
Learning a framework/library is a valuable exercise, but it takes some time. Since your task is not very complex, it may be less painful to write a simple crawler from scratch in Java.
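To give an idea of what "a simple crawler from scratch" could look like, here is a minimal sketch, assuming Java 11+ (for `java.net.http.HttpClient`). The class name `SimpleCrawler`, the method names, the page limit, and the one-second delay are all my own illustrative choices, not anything from Nutch or Heritrix. Link extraction uses a naive regex on `href` attributes, which is a simplification; a real HTML parser would be more robust. It also ignores robots.txt, which you should handle for a real site.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleCrawler {
    // Naive href matcher (assumption: good enough for a sketch).
    // The capture stops at a quote or '#', so fragments are dropped.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    /** Returns absolute http(s) links in html that are on the same host as base. */
    public static List<URI> extractSameHostLinks(URI base, String html) {
        List<URI> links = new ArrayList<>();
        if (base.getHost() == null) {
            return links;
        }
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                // Resolve relative links against the page they were found on.
                URI link = base.resolve(m.group(1).trim());
                String scheme = link.getScheme();
                boolean isHttp = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
                if (isHttp && base.getHost().equalsIgnoreCase(link.getHost())) {
                    links.add(link);
                }
            } catch (IllegalArgumentException e) {
                // Malformed href: skip it.
            }
        }
        return links;
    }

    /** Breadth-first crawl from start, staying on one host, visiting at most maxPages URLs. */
    public static Set<URI> crawl(URI start, int maxPages) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        Set<URI> visited = new HashSet<>();
        Deque<URI> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            URI url = queue.poll();
            if (!visited.add(url)) {
                continue; // already fetched
            }
            HttpRequest request = HttpRequest.newBuilder(url).GET().build();
            HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 200) {
                continue;
            }
            // This is where the page-specific scraping would go.
            System.out.println(url + " (" + response.body().length() + " chars)");
            for (URI link : extractSameHostLinks(url, response.body())) {
                if (!visited.contains(link)) {
                    queue.add(link);
                }
            }
            Thread.sleep(1000); // politeness delay between requests
        }
        return visited;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.out.println("usage: java SimpleCrawler <start-url>");
            return;
        }
        crawl(URI.create(args[0]), 50);
    }
}
```

You would run it with something like `java SimpleCrawler.java http://example.com/`. If you only need a known set of pages, you can drop the link-following part entirely and just loop over a list of URLs.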