
Nutch 1.2 - Why won't Nutch crawl URLs with query strings?

I'm new to Nutch and not really sure what is going on here. When I run Nutch it crawls my website, but it seems to ignore URLs that contain query strings. I've commented out the filter in the crawl-urlfilter.txt file, so it looks like this now:

# skip urls with these characters
#-[]

#skip urls with slash delimited segment that repeats 3+ times
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/

So I think I've effectively removed every filter, which should tell Nutch to accept all URLs it finds on my website.

Does anyone have any suggestions? Is this a bug in Nutch 1.2? Should I upgrade to 1.3, and would that fix the issue? Or am I doing something wrong?


See my previous question here: Adding URL parameter to Nutch/Solr index and search results

The first 'Edit' should answer your question.


# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

You have to comment it out or modify it like this:

# skip URLs containing certain characters as probable queries, etc.
-[*!@]
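To see why the default rule skips query-string URLs, here is a minimal sketch (plain Python `re` rather than Nutch itself, with hypothetical example URLs) of how the two character-class rules classify a URL. The default rule rejects any URL containing `?`, `*`, `!`, `@`, or `=`, while the modified rule drops `?` and `=` from the class so query strings pass through:

```python
import re

# Default Nutch rule: -[?*!@=] rejects URLs containing any of ? * ! @ =
default_rule = re.compile(r"[?*!@=]")
# Modified rule: -[*!@] still rejects * ! @, but allows ? and = (query strings)
modified_rule = re.compile(r"[*!@]")

urls = [
    "http://example.com/page",        # plain URL, hypothetical
    "http://example.com/page?id=42",  # URL with a query string, hypothetical
]

for url in urls:
    blocked_default = bool(default_rule.search(url))
    blocked_modified = bool(modified_rule.search(url))
    print(url,
          "| default:", "skip" if blocked_default else "crawl",
          "| modified:", "skip" if blocked_modified else "crawl")
```

Under the default rule the query-string URL is skipped because it contains `?` and `=`; under the modified rule both URLs are crawled.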


By default, crawlers don't follow links with query strings, to avoid spam and fake search engines.

