Web crawling and robots.txt - II
Similar scenario to one of my previous questions:
Using wget, I type the following to pull down images from a site (sub-folder):
wget -r -A.jpg http://www.abc.com/images/
I get two images from the above command - Img1, Img2.
The index.php file in http://www.abc.com/images/ refers only to Img2.jpg (I checked the source). If I key in http://www.abc.com/images/Img4.jpg or http://www.abc.com/images/Img5.jpg, I get two separate images, but these images are not downloaded by wget.
How should I go about retrieving the entire set of images under http://www.abc.com/images/?
Not exactly sure what you want, but try this:
wget --recursive --accept=gif,jpg,png http://www.abc.com
This will:
- Create a directory called www.abc.com\
- Crawl all pages on www.abc.com
- Save all .GIF, .JPG or .PNG files inside the corresponding directories under www.abc.com\
You can then delete all directories except the one you're interested in, namely, www.abc.com\images\
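If you only want to keep that sub-folder once the crawl finishes, a minimal cleanup sketch (assuming a Unix-like shell; www.abc.com/ is the directory wget created):
mv www.abc.com/images ./images   # keep the folder you care about
rm -rf www.abc.com               # discard the rest of the crawl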
Crawling all pages is a time-consuming operation, but it is probably the only way to make sure that you get all images referenced by any of the pages on www.abc.com. There is no other way to detect which images are present inside http://abc.com/images/ unless the server allows directory browsing.
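That said, since you already know that unlinked files such as Img4.jpg and Img5.jpg exist, you could skip crawling and probe the directory directly. A rough sketch, assuming the files follow an Img<N>.jpg naming pattern (the upper bound of 100 is a guess):
# Try Img1.jpg through Img100.jpg; wget saves nothing for names that 404.
for i in $(seq 1 100); do
  wget -q "http://www.abc.com/images/Img$i.jpg"
done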