Extracting a tag value in BeautifulSoup when unable to match by position or attributes
I'm using BeautifulSoup to scrape a web page and I'm a little stuck on a small problem. Here's a snippet of HTML from the page:
<span style="font-family: arial;"><span style="font-weight: bold;">Artist:</span> M.I.A.<br>
</span>
Once I've got the soup, how can I find this tag and get the artist name, i.e. M.I.A.?
I cannot match the tag by its style attribute, as the same style is used in a dozen places on the page. I don't know the exact location of the span tag either, as it changes position from page to page, so I can't match by position. The artist name changes, but the title span structure is always the same.
I would only like to extract the artist name (the M.I.A. bit).
BeautifulSoup 3 is kind of dead, since the SGMLParser it is built on is deprecated. I suggest you use the lxml library instead; it even has XPath support.
from lxml import html

text = '''
<span style="font-family: arial;">
<span style="font-weight: bold;">Artist:</span>M.I.A.<br>
</span>
'''
doc = html.fromstring(text)
# Join the text nodes of the parent span and strip the surrounding newlines.
print(''.join(doc.xpath("//span/span[text()='Artist:']/../text()")).strip())
This XPath expression means "find the span tag that is inside another span tag and contains the text 'Artist:', and grab all the text of the parent containing tag". It prints M.I.A., as one would expect.
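If the parent span ever holds other text besides the artist name, a narrower expression may be safer. The following is only a sketch, assuming the value always sits directly after the label span: it selects just the first text node following the label. The helper name value_after_label and its parameters are made up for illustration; passing the label as an XPath variable ($label) is a feature of lxml's xpath() method.

```python
from lxml import html


def value_after_label(page_source, label):
    """Return the text right after <span>label</span>, or None if absent.

    Hypothetical helper, not part of lxml.
    """
    doc = html.fromstring(page_source)
    # $label is bound from the keyword argument, so quotes inside the
    # label cannot break the expression.
    nodes = doc.xpath(
        "//span[text()=$label]/following-sibling::text()[1]",
        label=label,
    )
    # The text node keeps its surrounding whitespace, so strip it.
    return nodes[0].strip() if nodes else None


snippet = '''
<span style="font-family: arial;"><span style="font-weight: bold;">Artist:</span> M.I.A.<br>
</span>
'''
print(value_after_label(snippet, "Artist:"))
```

Returning None when the label is missing lets the caller skip pages that have no such field instead of failing on an empty result.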