What's the best way to crawl a batch of URLs for a specific HTML element and retrieve the image?

I'm looking to crawl ~100 webpages that share the same structure, but the image I need has a different filename on each page.

The image tag is located at:

#content div.artwork img.artwork

and I need the src URL of that element so I can download the image.

Any ideas? I have the URLs in a .txt file, and I'm on a Mac OS X box.


I am not sure how you can run a selector-style query against the raw files, but a Perl regex might do the job just as well:

for url in `cat urls.txt`; do wget -q -O- "$url"; done \
  | perl -nle 'print $1 if /<img[^>]*class="artwork"[^>]*src="([^"]+)"/' \
  | xargs -n1 wget

Note that perl's -n flag processes input line by line, so this assumes each <img> tag sits on a single line and that the class attribute appears before src. If the src values are relative, you'll also need to prepend each page's base URL before the final wget.
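If you do want a selector-style query, Mac OS X ships with xmllint (part of libxml2), which can evaluate the equivalent XPath against the fetched HTML. Here's a minimal sketch, assuming each page has exactly one matching img and that your xmllint build supports --xpath (libxml2 2.7.7 or later):

while read -r url; do
  # fetch the page quietly, parse it as HTML, and pull out the src attribute
  wget -q -O- "$url" \
    | xmllint --html --xpath \
        'string(//*[@id="content"]//div[@class="artwork"]//img[@class="artwork"]/@src)' \
        - 2>/dev/null
  echo    # string() prints no trailing newline, so add one per page
done < urls.txt | xargs -n1 wget

One caveat: the @class tests here are exact attribute matches, unlike CSS class selectors which match a single token. If those elements carry extra classes you'd need the usual contains(concat(' ', normalize-space(@class), ' '), ' artwork ') workaround, and as above, relative src values need the base URL prepended before the final wget.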