How to retrieve a webpage in python, including any images
I'm trying to retrieve the source of a webpage, including any images. At the moment I have this:
import urllib
page = urllib.urlretrieve('http://127.0.0.1/myurl.php', 'urlgot.php')
print open('urlgot.php').read()
which retrieves the source fine, but I also need to download any linked images.
I was thinking I could write a regular expression that searches the downloaded source for img src or similar; however, I was wondering if there is a urllib function that would retrieve the images as well, similar to this wget command:
wget -r --no-parent http://127.0.0.1/myurl.php
I don't want to use the os module to shell out to wget, as I want the script to run on all systems. For the same reason I can't use any third-party modules either.
Any help is much appreciated! Thanks
Don't use a regex when there is a perfectly good parser built into Python:
from urllib.request import urlretrieve  # Py2: from urllib import urlretrieve
from html.parser import HTMLParser      # Py2: from HTMLParser import HTMLParser

base_url = 'http://127.0.0.1/'

class ImgParser(HTMLParser):
    def __init__(self, *args, **kwargs):
        self.downloads = []
        HTMLParser.__init__(self, *args, **kwargs)

    def handle_starttag(self, tag, attrs):
        # collect the src attribute of every <img> tag encountered
        if tag == 'img':
            for attr in attrs:
                if attr[0] == 'src':
                    self.downloads.append(attr[1])

parser = ImgParser()
with open('test.html') as f:
    # instead you could feed it the original url obj directly
    parser.feed(f.read())

for path in parser.downloads:
    url = base_url + path
    print(url)
    urlretrieve(url, path)
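If you want to skip the temporary file, here is a minimal sketch of feeding the parser straight from the response and resolving each src against the page URL with urljoin, so both relative and absolute paths work (the URL is the one from the question; the rest is illustrative):

from urllib.request import urlopen, urlretrieve  # Py2: urllib2.urlopen / urllib.urlretrieve
from urllib.parse import urljoin                 # Py2: from urlparse import urljoin
import os

page_url = 'http://127.0.0.1/myurl.php'

parser = ImgParser()
# decode the raw bytes before feeding them to the parser
parser.feed(urlopen(page_url).read().decode('utf-8', errors='replace'))

for src in parser.downloads:
    img_url = urljoin(page_url, src)             # handles relative and absolute src values
    filename = os.path.basename(src) or 'image'  # save next to the script
    print(img_url)
    urlretrieve(img_url, filename)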
Use BeautifulSoup to parse the returned HTML and search for image links. You might also need to recursively fetch frames and iframes.
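A rough sketch of that approach, assuming BeautifulSoup is acceptable despite the no-third-party-modules constraint (pip install beautifulsoup4; the URL is the asker's example and the rest is illustrative):

from urllib.request import urlopen, urlretrieve
from urllib.parse import urljoin
from bs4 import BeautifulSoup

page_url = 'http://127.0.0.1/myurl.php'
soup = BeautifulSoup(urlopen(page_url).read(), 'html.parser')

# find every <img> that actually has a src attribute;
# frames/iframes would need the same treatment recursively
for img in soup.find_all('img', src=True):
    img_url = urljoin(page_url, img['src'])
    filename = img_url.rsplit('/', 1)[-1] or 'image'
    print(img_url)
    urlretrieve(img_url, filename)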