Get a list of the absolute paths of all the images in a page using BeautifulSoup
Could someone show me how to get a list of absolute paths for all the images in a webpage using BeautifulSoup?
It's simple to get all the images. I'm doing this:
page_images = [image["src"] for image in soup.findAll("img")]
...but I'm having difficulties getting the absolute paths. Any help?
Thank you.
You will have to resolve each path against the URL of the page after extracting it. This can be done with urlparse.urljoin (urllib.parse.urljoin in Python 3). For example:
>>> urlparse.urljoin("http://google.com/some/path/", "../../img/icon.png")
'http://google.com/img/icon.png'
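Putting that together with the list comprehension from the question, a minimal sketch might look like the following. It is written for Python 3 (urllib.parse) and uses the bs4 method name find_all; the function name absolute_image_urls and the base_url parameter are my own, and base_url is assumed to be the URL the page was fetched from.

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin


def absolute_image_urls(soup, base_url):
    """Return the absolute URL of every <img> in an already-parsed soup.

    `soup` is assumed to be a BeautifulSoup object; `base_url` is assumed
    to be the URL the page was downloaded from (both names are illustrative).
    """
    # src=True skips <img> tags that have no src attribute at all,
    # which would otherwise raise a KeyError on img["src"].
    return [urljoin(base_url, img["src"])
            for img in soup.find_all("img", src=True)]


# Example usage (assumes beautifulsoup4 is installed and `html` holds the page source):
#   from bs4 import BeautifulSoup
#   soup = BeautifulSoup(html, "html.parser")
#   page_images = absolute_image_urls(soup, "http://google.com/some/path/")
```

urljoin leaves a src that is already absolute untouched, so it should be safe to run over a mix of relative and absolute paths. This sketch ignores any <base href> tag in the page.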
This does not use BeautifulSoup, but lxml + pyquery, which I find more elegant (and well maintained):
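import pyquery
from urlparse import urljoin  # Python 2; in Python 3: from urllib.parse import urljoin

def make_images_absolute(self):
    # pyquery exposes the current element to the callback as `this`
    self('img').each(lambda: self(this).attr('src',
        urljoin(self.base_url, self(this).attr('src'))))

url = "http://lwn.net"
pq = pyquery.PyQuery(url)
# src values as they appear in the page
for i in pq("img"):
    print(i.attrib["src"])
make_images_absolute(pq)
# src values after resolving against the page URL
for i in pq("img"):
    print(i.attrib["src"])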