Crawl website using wget and limit total number of crawled links
I want to learn more about crawlers by playing around with the wget tool. I'm interested in crawling my department's website, and finding the first 100 links on that site. So far, the command below is what I have. How do I limit the crawler to stop after 100 links?
wget -r -o output.txt -l 0 -t 1 --spider -w 5 -A html -e robots=on "http://www.example.com"
You can't. wget doesn't support this, so if you want something like this you would have to write a tool yourself.
You could fetch the main file, parse the links manually, and fetch them one by one with a limit of 100 items. But it's not something that wget supports.
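A minimal sketch of that fetch-parse-limit approach, in Python rather than wget. The names `LinkParser`, `extract_links`, `fetch_page` and `crawl` are made up for illustration, and the sketch assumes you only want links on the starting host. It does not honor robots.txt or wait between requests, so you would need to add both before pointing it at a real site.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute URLs (fragments stripped) found in html, in page order."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.hrefs]


def fetch_page(url):
    """Download one page as text (hypothetical helper: no retries, no robots.txt)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, limit=100, fetch=fetch_page):
    """Breadth-first crawl of one host; stop once `limit` distinct links are found."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    found = []
    queue = deque([start_url])
    while queue and len(found) < limit:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            # urllib's URLError and HTTPError are OSError subclasses; skip bad pages.
            continue
        for link in extract_links(html, url):
            if urlparse(link).netloc != host or link in seen:
                continue
            seen.add(link)
            found.append(link)
            queue.append(link)
            if len(found) >= limit:
                return found
    return found
```

Calling `crawl("http://www.example.com", limit=100)` would return a list of at most 100 URLs. The `fetch` parameter exists only so the function can be tried against canned pages without touching the network.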
You could take a look at HTTrack for website crawling too; it has quite a few extra options for this: http://www.httrack.com/
- Create a FIFO (mknod /tmp/httpipe p)
- Fork
- In the child, run:
wget --spider -r -l 1 http://myurl --output-file /tmp/httpipe
- In the parent, read /tmp/httpipe line by line
- Parse each line:
=~ m{^\-\-\d\d:\d\d:\d\d\-\- http://$self->{http_server}:$self->{tcport}/(.*)$}, print $1
- Count the lines; after 100 lines just close the file. That breaks the pipe, so wget stops on its next write.
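The same idea can be sketched in Python without a named FIFO, by reading wget's log straight from its stderr (where wget writes the log by default) and stopping the process once enough URLs have been seen. `REQUEST_LINE`, `first_urls` and `crawl_with_wget` are hypothetical names. The pattern assumes request lines of the form `--2024-01-01 12:00:00--  http://...` (older releases print only the time), so check it against the output of your own wget version.

```python
import re
import subprocess

# Request lines in wget's log, e.g. "--2024-01-01 12:00:00--  http://host/path".
# Older wget releases print only the time: "--12:00:00--  http://host/path".
REQUEST_LINE = re.compile(r"^--(?:\d{4}-\d\d-\d\d )?\d\d:\d\d:\d\d--\s+(\S+)")


def first_urls(lines, limit=100):
    """Yield up to `limit` distinct URLs from an iterable of wget log lines."""
    if limit <= 0:
        return
    seen = set()
    for line in lines:
        match = REQUEST_LINE.match(line)
        if not match or match.group(1) in seen:
            # Not a request line, or a URL wget has already requested once.
            continue
        seen.add(match.group(1))
        yield match.group(1)
        if len(seen) >= limit:
            return


def crawl_with_wget(url, limit=100):
    """Run wget in spider mode and stop it after `limit` distinct URLs."""
    cmd = ["wget", "--spider", "-r", "-l", "1", url]
    with subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, text=True
    ) as proc:
        urls = list(first_urls(proc.stderr, limit))
        # Stop wget explicitly instead of waiting for it to hit a broken pipe.
        proc.terminate()
    return urls
```

`crawl_with_wget("http://myurl", 100)` would then return the first 100 distinct URLs wget requested. Duplicates are dropped because spider mode can request the same URL more than once.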