Nutch issues with crawling a website where the URLs differ only in the parameters passed

I am using Nutch to crawl websites and, strangely, for one of my websites the Nutch crawl returns only two URLs: the home page URL (http://mysite.com/) and one other.

The URLs on my website are basically of this format:

http://mysite.com/index.php?main_page=index&params=12

http://mysite.com/index.php?main_page=index&category=tub&param=17

i.e. the URLs differ only in the parameters appended to the URL (the part "http://mysite.com/index.php?" is common to all URLs).

Is Nutch unable to crawl such websites?

What Nutch settings should I change in order to crawl such websites?


I got the issue fixed. It had everything to do with this rule in the URL filter:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

Every URL on my site contains "?" and "=", so this rule was rejecting them. I commented out this filter and Nutch crawled all the URLs :)
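To see why that one rule blocks these URLs, here is a rough Python sketch of how a first-match regex URL filter behaves. It assumes, as far as I understand Nutch's regex filter, that rules are tried top to bottom, that the first matching pattern decides (+ accepts, - rejects), and that a URL matching no rule is dropped. The `accepts` function and the two rule lists are made up for illustration and are not Nutch code.

```python
import re

def accepts(rules, url):
    """Return True if the first rule whose pattern is found in url is a '+' rule.

    rules is a list of (sign, pattern) pairs, tried in order.
    A URL that matches no rule is rejected (assumed behaviour).
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False

# Hypothetical rule set with the "probable queries" rule active,
# followed by a catch-all accept rule.
RULES_WITH_QUERY_SKIP = [
    ("-", r"[?*!@=]"),  # reject URLs containing ?, *, !, @ or =
    ("+", r"."),        # accept anything else
]

# The same rule set with the first rule commented out.
RULES_WITHOUT_QUERY_SKIP = [
    ("+", r"."),
]

urls = [
    "http://mysite.com/",
    "http://mysite.com/index.php?main_page=index&params=12",
    "http://mysite.com/index.php?main_page=index&category=tub&param=17",
]

for url in urls:
    print(url,
          accepts(RULES_WITH_QUERY_SKIP, url),
          accepts(RULES_WITHOUT_QUERY_SKIP, url))
```

With the rule active, only the home page gets through in this sketch, because the other two URLs contain "?" and "=". With the rule removed, all three are accepted.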
