Nutch 1.2 - Why won't Nutch crawl URLs with query strings?

2023-03-28 07:19

I'm new to Nutch and not really sure what is going on here. When I run Nutch it crawls my website, but it seems to ignore URLs that contain query strings. I've commented out the filters in the crawl-urlfilter.txt file, so it looks like this now:

# skip urls with these characters
#-[]

#skip urls with slash delimited segment that repeats 3+ times
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/

So I think I've effectively removed every filter, which should tell Nutch to accept all the URLs it finds on my website.

Does anyone have any suggestions? Is this a bug in Nutch 1.2? Should I upgrade to 1.3, and would that fix the issue I am having? Or am I doing something wrong?

See my previous question here: "Adding URL parameter to Nutch/Solr index and search results".

The first 'Edit' should answer your question.

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

You have to comment it out or modify it as follows:

# skip URLs containing certain characters as probable queries, etc.
-[*!@]

By default, crawlers shouldn't follow links with query strings, to avoid spam and fake search engines.
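To see why removing `?` and `=` from that rule makes the difference, here is a minimal Python sketch of how I understand the regex URL filter to evaluate its rules: lines are read top to bottom, the first rule whose regex is found anywhere in the URL decides (`+` accepts, `-` rejects), and a URL that matches no rule is dropped. This is not Nutch code. The function names, the example URL, and the catch-all `+.` rule are assumptions made for illustration; your own file will normally end with an accept rule for your domain instead.

```python
import re


def load_rules(text):
    """Parse filter lines into (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first rule whose regex is found anywhere in the URL decides."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False  # no rule matched: the URL is dropped


# The rule quoted above, followed by a hypothetical catch-all accept rule.
DEFAULT = """
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
+.
"""

# The modified rule: '?' and '=' no longer cause a rejection.
MODIFIED = """
# skip URLs containing certain characters as probable queries, etc.
-[*!@]
+.
"""

url = "http://example.com/page?id=1"  # hypothetical URL with a query string
print(accepts(load_rules(DEFAULT), url))   # False
print(accepts(load_rules(MODIFIED), url))  # True
```

The last branch of `accepts` is also worth noting: under this model, a file in which every rule is commented out accepts nothing, so you still need at least one `+` rule that matches your site.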
