How to omit JavaScript links using Nutch crawl?
I am a newbie at this, trying to use Nutch 1.2 to fetch a site. I'm using only a Linux console to work with Nutch, as I don't need anything else. My command looks like this:

bin/nutch crawl urls -dir crawled -depth 3

where the folder urls is where I have my links, and I do get the results in the folder crawled.

And when I would like to see the results, I type:

bin/nutch readseg -dump crawled/segments/20110401113805 /home/nutch/dumpfiles

This works fine, but I get a lot of broken links. I do not want Nutch to follow JavaScript links, only regular links. Could anyone give me a hint on how to do that?

I've tried to edit conf/crawl-urlfilter.txt, with no results. I might have typed the patterns wrong!

Any help appreciated!
Beware: there are two different filter files, one for the one-stop crawl command and another for the step-by-step commands. For the rest, just build a regex that matches the URLs you want to skip, add a minus sign in front of it, and you should be done.
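As an illustration, a minimal sketch of what such exclusion entries might look like in conf/crawl-urlfilter.txt. The exact patterns are assumptions about how JavaScript links appear on the crawled site; adjust them to match your own URLs:

```
# Skip .js files and javascript: pseudo-links (leading minus = exclude)
-\.js$
-^javascript:

# Accept anything else (keep this as the last rule)
+.
```

Rules are applied top to bottom, and the first matching line decides, so the exclusions must come before the catch-all `+.` line.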