Hosting an HTML page of images on localhost, accessing it with a web crawler, and downloading the images
I have created a web crawler in Python that accesses a web page and downloads the images from it. The crawler code is:
    # ImageDownloader.py
    # Finds and downloads all images from any given URL.
    import urllib2
    import re
    from os.path import basename
    from urlparse import urlsplit

    url = "http://www.yahoo.com"
    urlContent = urllib2.urlopen(url).read()
    # HTML image tag: <img src="url" alt="some_text"/>
    imgUrls = re.findall('img.*?src="(.*?)"', urlContent)
    # download all images
    for imgUrl in imgUrls:
        try:
            imgData = urllib2.urlopen(imgUrl).read()
            fileName = basename(urlsplit(imgUrl)[2])
            output = open(fileName, 'wb')
            output.write(imgData)
            output.close()
        except:
            pass
I have to show a demo in class, so I built a simple web page with some images and hosted it on localhost, but the crawler I created is not accessing the HTML page and not downloading the images. Can anyone help me access the HTML page on localhost from the crawler?
You need to point your script at localhost, not at "www.yahoo.com".
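For instance, if you serve your demo directory with SimpleHTTPServer (http.server on Python 3), the page will typically reference its images with relative src values, which urllib2 cannot fetch directly; resolve them against the page URL first. A minimal sketch (the port, page name, and src values are assumptions; use whatever your local server actually serves):

```python
# Resolve relative image URLs against the page URL before fetching them.
try:
    from urlparse import urljoin          # Python 2 module name
except ImportError:
    from urllib.parse import urljoin      # moved here in Python 3

# Assumed local setup: e.g. `python -m SimpleHTTPServer 8000` in the demo dir.
url = "http://localhost:8000/index.html"

# Typical src values from a hand-written page: relative, root-relative, absolute.
img_srcs = ["images/cat.png", "/static/dog.jpg", "http://localhost:8000/abs.png"]

# urljoin leaves absolute URLs alone and resolves the rest against the page URL.
img_urls = [urljoin(url, src) for src in img_srcs]
for u in img_urls:
    print(u)
```

With the resolved URLs in hand, the download loop from the question works unchanged.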
With that said, there are a number of things you could do to improve this program:
- Never blindly catch an exception and then do nothing. Let the exception propagate upwards, or do something useful with it.
- For simple scripts like this, create a function that does your work and call it from an if __name__ == '__main__': block.
- Instead of using regexes to find images, you could use BeautifulSoup, which would add some structure to your program, but this might not be needed.
- It is quite common for images to be included through CSS, so it might be worth looking there as well.
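As a sketch of the structured approach, here is the same img-collecting idea using the standard library's HTML parser instead of BeautifulSoup (so nothing extra to install; the class lives in HTMLParser on Python 2 and html.parser on Python 3). The sample HTML string is made up for illustration:

```python
try:
    from HTMLParser import HTMLParser     # Python 2
except ImportError:
    from html.parser import HTMLParser    # Python 3

class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, in document order."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes;
        # self-closing tags like <img .../> also end up here by default.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.srcs.append(value)

html_doc = ('<html><body>'
            '<img src="a.png"><p>hi</p><img alt="x" src="b.jpg"/>'
            '</body></html>')
parser = ImgCollector()
parser.feed(html_doc)
print(parser.srcs)
```

Unlike the regex, this only looks at real img tags and their parsed attributes, so attribute order, quoting style, and stray "img" text elsewhere in the page don't matter.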
Your regex should be img.+?src="(.+?)". Once I changed that, findall() returned a nice list of image URLs and flooded my directory with images.
That stated, do heed @knutin's advice, especially using BeautifulSoup instead of regex. While the regex here works, it likely won't be robust enough to handle all HTML you throw at it. I've been doing some HTML scraping myself recently (nothing as easy as images) and it's been an absolute breeze.
Fork this on GitHub and add something like a check that the link ends in ".jpg", ".png", etc. to find pics and download them :) https://github.com/mouuff/MouCrawler/blob/master/moucrawler.py