How can I instruct Nutch to treat page #1 as belonging to one core and page #2 to a different core (both pages are from the same domain)?
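One way to get there (a sketch, not a definitive answer: it assumes the two page sets can be split into separate seed lists, and that Solr cores named core1 and core2 already exist) is to keep two crawl spaces and index each one into its own core with solrindex:

    # crawl each URL set into its own crawl directory...
    bin/nutch crawl urls-core1 -dir crawl-core1 -depth 3
    bin/nutch crawl urls-core2 -dir crawl-core2 -depth 3

    # ...then point solrindex at a different core for each crawl
    bin/nutch solrindex http://localhost:8983/solr/core1 crawl-core1/crawldb crawl-core1/linkdb crawl-core1/segments/*
    bin/nutch solrindex http://localhost:8983/solr/core2 crawl-core2/crawldb crawl-core2/linkdb crawl-core2/segments/*

Each crawl space can carry its own regex-urlfilter.txt, which is how the two page sets stay apart even though they live on the same domain.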
What can I possibly do with Hadoop and Nutch used as a search engine? I know that Nutch is used to build a web crawler, but I'm not seeing the full picture. Can I use MapReduce with Nutch and …
I read the source of org.apache.nutch.parse.ParseUtil.runParser(Parser p, Content content). Do these two method calls do the same thing:
After much searching, it doesn't seem like there's any straightforward explanation of how to use Nutch 1.3 with Solr.
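For what it's worth, the 1.3 setup boils down to a short sequence; this is a sketch with placeholder paths (urls/ is a seed-list directory, $SOLR_HOME the Solr install):

    # 1. crawl from a seed list
    bin/nutch crawl urls -dir crawl -depth 3 -topN 50

    # 2. give Solr the field definitions Nutch expects, then restart Solr
    cp conf/schema.xml $SOLR_HOME/example/solr/conf/

    # 3. push the crawl results into Solr
    bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*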
The scene: I have indexed many websites using Nutch and Solr. I've implemented result grouping by site. My results output includes the page title, highlight snippets, and URL. My issue is with the page …
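As context for the grouping part: in Solr 3.3+ result grouping is driven by plain query parameters; a minimal sketch, assuming the site field that Nutch's schema.xml defines is populated:

    # one group per site, up to 3 hits each, with highlight snippets
    http://localhost:8983/solr/select?q=hadoop&group=true&group.field=site&group.limit=3&hl=true&hl.fl=content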
I am a newbie to Nutch and Hadoop, trying to follow the tutorial at http://wiki.apache.org/nutch/NutchHadoopTutorial.
Hi, I am trying to run Apache Nutch 1.2 on Amazon's EMR. To do this I specify an input directory from S3. I get the following error: …
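Without the error text it's hard to say more, but on Hadoop of that vintage an S3 input is usually addressed through the s3n:// scheme; a sketch, with my-bucket as a placeholder:

    # read the seed list straight from S3 (native S3 filesystem)
    bin/nutch inject crawl/crawldb s3n://my-bucket/urls/

    # credentials go in core-site.xml if not embedded in the URI:
    #   fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey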
I would like to know how to make Nutch crawl not only the domain that I specified, but also the directory path within the domain that I specified. I know that you can configure this information …
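The place to express that is the URL filter; a sketch of conf/crawl-urlfilter.txt (regex-urlfilter.txt takes the same syntax), with www.example.com/mydir as a placeholder:

    # accept only pages under the chosen directory path...
    +^http://www\.example\.com/mydir/
    # ...and reject everything else (keep this as the last rule)
    -.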
I have a lot of HTML files on my hard disk and want to index them with Nutch, but as far as I know, Nutch only takes URLs and indexes those pages and the pages linked from them.
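One way around that, sketched here with a placeholder path, is to crawl the files through Nutch's file: protocol instead of HTTP:

    # 1. conf/regex-urlfilter.txt ships with a rule that skips file: URLs;
    #    drop "file" from it so they pass:
    -^(ftp|mailto):

    # 2. conf/nutch-site.xml: make sure protocol-file appears in the
    #    plugin.includes property (alongside the HTML parser plugin)

    # 3. seed list: point at the local files
    file:///home/me/html-dump/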
I'm new to Nutch and not really sure what is going on here. I run Nutch and it crawls my website, but it seems to ignore URLs that contain query strings. I've commented out the filter in the crawl-urlfilter.txt …
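For reference, the stock filter file contains a rule that drops exactly these URLs; make sure you edited the file your command actually reads (the one-step crawl tool in 1.0-1.2 reads crawl-urlfilter.txt, the step-by-step tools read regex-urlfilter.txt):

    # this default rule rejects any URL containing ?, =, etc.,
    # i.e. every query-string URL:
    -[?*!@=]
    # comment it out, or narrow it so ? and = survive:
    -[*!@]

Also note that URLs rejected by an earlier run never entered the crawldb, so re-inject or start a fresh crawl after changing the filter.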