BeautifulSoup getting href [duplicate]
I have the following soup:
<a href="some_url">next</a>
<span class="class">...</span>
From this I want to extract the href, "some_url".
I can do it if I only have one tag, but here there are two tags. I can also get the text 'next', but that's not what I want.
Also, is there a good description of the API somewhere, with examples? I'm using the standard documentation, but I'm looking for something a little more organized.
You can use find_all in the following way to find every <a> element that has an href attribute, and print each one:
# Python 2 (BeautifulSoup 3)
from BeautifulSoup import BeautifulSoup
html = '''<a href="some_url">next</a>
<span class="class"><a href="another_url">later</a></span>'''
soup = BeautifulSoup(html)
# BeautifulSoup 3 spells this method findAll; find_all exists only in version 4
for a in soup.findAll('a', href=True):
    print "Found the URL:", a['href']
# The output would be:
# Found the URL: some_url
# Found the URL: another_url
# Python 3 (BeautifulSoup 4)
from bs4 import BeautifulSoup
html = '''<a href="https://some_url.com">next</a>
<span class="class">
<a href="https://some_other_url.com">another_url</a></span>'''
# Naming the parser explicitly avoids the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')
# href=True matches only tags that actually have an href attribute
for a in soup.find_all('a', href=True):
    print("Found the URL:", a['href'])
# The output would be:
# Found the URL: https://some_url.com
# Found the URL: https://some_other_url.com
Note that if you're using an older version of BeautifulSoup (before version 4), the name of this method is findAll. In version 4, BeautifulSoup's method names were changed to be PEP 8 compliant, so you should use find_all instead.
If you want all tags with an href, you can omit the name parameter:
href_tags = soup.find_all(href=True)
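To make the difference concrete, here is a minimal sketch, assuming BeautifulSoup 4 and the built-in html.parser. The <link> tag and the href-less <a> tag in the sample markup are not from the question; they are added only to show what href=True does and does not match. It also shows one way to get just the first link's href with find, which may be closer to what the question asks for.

```python
from bs4 import BeautifulSoup

# Illustrative markup: the <link> tag and the href-less <a> are assumptions
# added for this example, not part of the original question.
html = '''<link href="style.css" rel="stylesheet">
<a href="some_url">next</a>
<span class="class"><a href="another_url">later</a></span>
<a name="anchor_without_href">skipped</a>'''

soup = BeautifulSoup(html, "html.parser")

# Every tag that carries an href attribute, regardless of tag name.
href_tags = soup.find_all(href=True)
pairs = [(tag.name, tag["href"]) for tag in href_tags]

# Only the first <a> with an href; find() returns None when nothing matches.
first_link = soup.find("a", href=True)
first_href = first_link["href"] if first_link is not None else None

print(pairs)
print(first_href)
```

Omitting the name picks up the <link> tag as well as both <a> tags, while the <a> without an href is skipped in either case.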