Nutch issues with crawling a website where the URLs differ only in the parameters passed
I am using Nutch to crawl websites, and strangely, for one of my websites the Nutch crawl returns only two URLs: the home page URL (http://mysite.com/) and one other.
The URLs on my website are basically of this format:
http://mysite.com/index.php?main_page=index&params=12
http://mysite.com/index.php?main_page=index&category=tub&param=17
i.e. the URLs differ only in the parameters appended to the URL (the part "http://mysite.com/index.php?" is common to all URLs).
Is Nutch unable to crawl such websites?
What Nutch settings should I change in order to crawl such websites?
I got the issue fixed. It had everything to do with the URL filter, which was set as:
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
I commented out this filter and Nutch crawled all URLs :)
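To illustrate why that rule blocked my URLs, here is a rough Python sketch of how I understand the regex URL filter to behave: rules are checked in order, the first pattern found anywhere in the URL decides (+ accepts, - rejects), and a URL that matches no rule is dropped. The function name is_accepted and the two rule lists are my own illustration, not Nutch code, and the trailing "+." accept-everything rule is assumed from the usual default filter file.

```python
import re

def is_accepted(url, rules):
    """Return True if the first matching rule is an accept ('+') rule.

    Hypothetical helper, not part of Nutch. Each rule is a (sign, pattern)
    pair; the pattern may match anywhere in the URL.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # No rule matched: treat the URL as rejected.
    return False

# Assumed filter with the rule from my answer still active.
default_rules = [
    ("-", r"[?*!@=]"),  # skip URLs containing probable query characters
    ("+", r"."),        # accept anything else
]

# Same filter with the query-character rule commented out.
relaxed_rules = [
    ("+", r"."),
]

urls = [
    "http://mysite.com/",
    "http://mysite.com/index.php?main_page=index&params=12",
    "http://mysite.com/index.php?main_page=index&category=tub&param=17",
]

for url in urls:
    print(url, is_accepted(url, default_rules), is_accepted(url, relaxed_rules))
```

With the rule active, only the home page passes, because every other URL contains "?" and "=". With the rule removed, all three pass. Keep in mind that accepting every query string can let a crawler wander into a very large number of near-duplicate pages, so a narrower rule may be a better fit for some sites.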