
How to omit JavaScript and comments using nutch crawl?

I am a newbie at this, trying to use Nutch 1.2 to fetch a site. I'm working with Nutch from a Linux console only, as I don't need anything else. My command looks like this:

bin/nutch crawl urls -dir crawled -depth 3
where the folder urls is where I keep my links, and the results go to the folder crawled. When I want to see the results, I type:
bin/nutch readseg -dump crawled/segments/20110401113805 /home/nutch/dumpfiles
This works fine, but I get a lot of broken links. I do not want Nutch to follow JavaScript links, only regular links. Could anyone give me a hint on how to do that? I've tried editing conf/crawl-urlfilter.txt with no results. I might have typed the wrong commands!

Any help appreciated!


Beware: there are two different filter files, one for the one-stop crawl command and the other for the step-by-step commands. Beyond that, just build a regex that matches the URLs you want to skip, add a minus in front of it, and you should be done.
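A minimal sketch of what such skip rules could look like, assuming the Nutch 1.2 defaults where the one-stop crawl command reads conf/crawl-urlfilter.txt and the step-by-step commands read conf/regex-urlfilter.txt (the exact patterns are illustrative, not from the original post):

```
# skip javascript: pseudo-URLs and .js resources (minus = reject)
-^javascript:
-\.js$

# accept everything else (keep a catch-all accept rule last)
+.
```

Rules are applied top to bottom and the first match wins, so the skip lines must come before the catch-all `+.`.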
