Hosting an HTML page of images on localhost, accessing it with a web crawler, and downloading the images
I have created a web crawler in Python that accesses a web page and downloads the images from it. The crawler code is:
    # ImageDownloader.py
    # Finds and downloads all images from any given URL.
    import urllib2
    import re
    from os.path import basename
    from urlparse import urlsplit

    url = "http://www.yahoo.com"
    urlContent = urllib2.urlopen(url).read()
    # HTML image tag: <img src="url" alt="some_text"/>
    imgUrls = re.findall('img.*?src="(.*?)"', urlContent)
    # download all images
    for imgUrl in imgUrls:
        try:
            imgData = urllib2.urlopen(imgUrl).read()
            fileName = basename(urlsplit(imgUrl)[2])
            output = open(fileName, 'wb')
            output.write(imgData)
            output.close()
        except:
            pass
I have to show a demo in class, so I built a simple web page with some images and hosted it on localhost, but the crawler I created is not accessing the HTML page and not downloading the images. Can anyone help me access the HTML page on localhost from the crawler?
You need to point your script at localhost, not at "www.yahoo.com".
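For instance, if you serve your demo directory with SimpleHTTPServer (http.server on Python 3), the page will typically reference its images with relative src values, which urllib2 cannot fetch directly; resolve them against the page URL first. A minimal sketch (the port, page name, and src values are assumptions; use whatever your local server actually serves):

```python
# Resolve relative image URLs against the page URL before fetching them.
try:
    from urlparse import urljoin          # Python 2 module name
except ImportError:
    from urllib.parse import urljoin      # moved here in Python 3

# Assumed local setup: e.g. `python -m SimpleHTTPServer 8000` in the demo dir.
url = "http://localhost:8000/index.html"

# Typical src values from a hand-written page: relative, root-relative, absolute.
img_srcs = ["images/cat.png", "/static/dog.jpg", "http://localhost:8000/abs.png"]

# urljoin leaves absolute URLs alone and resolves the rest against the page URL.
img_urls = [urljoin(url, src) for src in img_srcs]
for u in img_urls:
    print(u)
```

With the resolved URLs in hand, the download loop from the question works unchanged.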
With that said, there are a number of things you could do to improve this program:
- Never blindly catch an exception and then do nothing. Let the exception propagate upwards, or do something useful with it.
- For simple scripts like this, create a function that does your work and call it from an if __name__ == '__main__': block.
- Instead of using regexes to find images, you could use BeautifulSoup, which would add some structure to your program, but this might not be needed.
- It is quite common for images to be included through CSS, so it might be worth looking there as well.
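As a sketch of the structured approach, here is the same img-collecting idea using the standard library's HTML parser instead of BeautifulSoup (so nothing extra to install; the class lives in HTMLParser on Python 2 and html.parser on Python 3). The sample HTML string is made up for illustration:

```python
try:
    from HTMLParser import HTMLParser     # Python 2
except ImportError:
    from html.parser import HTMLParser    # Python 3

class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, in document order."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes;
        # self-closing tags like <img .../> also end up here by default.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.srcs.append(value)

html_doc = ('<html><body>'
            '<img src="a.png"><p>hi</p><img alt="x" src="b.jpg"/>'
            '</body></html>')
parser = ImgCollector()
parser.feed(html_doc)
print(parser.srcs)
```

Unlike the regex, this only looks at real img tags and their parsed attributes, so attribute order, quoting style, and stray "img" text elsewhere in the page don't matter.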
Your regex should be img.+?src="(.+?)". Once I changed that, findall() returned a nice list of image URLs and flooded my directory with images.
That stated, do heed @knutin's advice, especially using BeautifulSoup instead of regex. While the regex here works, it likely won't be robust enough to handle all HTML you throw at it. I've been doing some HTML scraping myself recently (nothing as easy as images) and it's been an absolute breeze.
Fork this on GitHub and add something like a check that the link ends in ".jpg", ".png", etc. to find pics and download them :) https://github.com/mouuff/MouCrawler/blob/master/moucrawler.py