Get a list of the absolute paths of all the images in a page using BeautifulSoup

Could someone show me how to get a list of absolute paths for all the images in a webpage using BeautifulSoup?

It's simple to get all the images. I'm doing this:

page_images = [image["src"] for image in soup.findAll("img")]

...but I'm having difficulties getting the absolute paths. Any help?

Thank you.


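You will have to normalize the paths after getting them. This can be done using urlparse.urljoin (urllib.parse.urljoin in Python 3), which resolves a relative path against the URL of the page it came from. For example: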

>>> urlparse.urljoin("http://google.com/some/path/", "../../img/icon.png")
'http://google.com/img/icon.png'
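
Applied to the list comprehension from the question, this could look roughly like the sketch below. It assumes Python 3 (hence urllib.parse instead of urlparse). The helper name absolute_image_urls, the page URL, and the sample src values are made up for illustration; in real use the list would come from something like [image.get("src") for image in soup.findAll("img")].

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin


def absolute_image_urls(page_url, srcs):
    """Resolve each image src against the URL the page was fetched from."""
    # Skip missing or empty src values; urljoin returns
    # already-absolute URLs unchanged.
    return [urljoin(page_url, src) for src in srcs if src]


# Hypothetical inputs standing in for the values scraped from the page.
page_url = "http://google.com/some/path/"
srcs = ["../../img/icon.png", "/logo.gif", "http://example.com/a.jpg", None]

page_images = absolute_image_urls(page_url, srcs)
print(page_images)
```

Using image.get("src") rather than image["src"] when building the list avoids a KeyError on img tags that have no src attribute; the helper then simply drops those entries.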


This doesn't use BeautifulSoup, but lxml + pyquery, which I find more elegant (and well maintained):

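import pyquery
from urlparse import urljoin  # Python 3: from urllib.parse import urljoin

def make_images_absolute(self):
    # pyquery binds `this` to the current element while the callback runs
    self('img').each(lambda: self(this).attr('src',
           urljoin(self.base_url, self(this).attr('src'))))

url = "http://lwn.net"
pq = pyquery.PyQuery(url)
# src values as written in the page
for i in pq("img"):
    print(i.attrib["src"])
make_images_absolute(pq)
# src values after resolving them against the page URL
for i in pq("img"):
    print(i.attrib["src"])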