Python library to do jQuery-like text extraction?
I've got HTML that contains entries like this:
<div class="entry">
  <h3 class="foo">
    <a href="http://www.example.com/blog-entry-slug"
       rel="bookmark">Blog Entry</a>
  </h3>
  ...
</div>
and I would like to extract the text "Blog Entry" (and a number of other attributes, so I'm looking for a generic answer).
In jQuery, I would do
$('.entry a[rel=bookmark]').text()
The closest I've been able to get in Python is:
from BeautifulSoup import BeautifulSoup
import soupselect as soup
rawsoup = BeautifulSoup(open('fname.html').read())
for entry in rawsoup.findAll('div', 'entry'):
    print soup.select(entry, 'a[rel=bookmark]')[0].string.strip()
soupselect is from http://code.google.com/p/soupselect/.
However, unlike jQuery, soupselect doesn't understand the full CSS3 selector syntax. Is there such a beast in Python?
You might want to take a look at lxml's CSSSelector class which tries to implement CSS selectors as described in the w3c specification. As a side note, many folks recommend lxml for parsing HTML/XML over BeautifulSoup now, for performance and other reasons.
I think lxml's CSSSelector uses XPath for element selection, but you might want to check the documentation for yourself. Here's your example with lxml:
>>> from lxml.cssselect import CSSSelector
>>> from lxml.html import fromstring
>>> html = '<div class="entry"><h3 class="foo"><a href="http://www.example.com/blog-entry-slug" rel="bookmark">Blog Entry</a></h3></div>'
>>> h = fromstring(html)
>>> sel = CSSSelector("a[rel=bookmark]")
>>> [e.text for e in sel(h)]
['Blog Entry']
You might also want to have a look at pyquery, a jQuery-like library for Python.
It's really very easy in BeautifulSoup using keyword arguments.
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('''<div class="entry">
... <h3 class="foo">
... <a href="http://www.example.com/blog-entry-slug"
... rel="bookmark">Blog Entry</a>
... </h3>
... ...
... </div>
... ''')
>>> soup.find('div', 'entry').find(rel='bookmark').text
u'Blog Entry'
Alternately,
>>> for entry in soup('div', 'entry'):
...     for bookmark in entry(rel='bookmark'):
...         print bookmark.text
...
Blog Entry
You can also use attrs to effect a selector of .entry rather than div.entry:
>>> for entry in soup(attrs={'class': 'entry'}):
...     for bookmark in entry(rel='bookmark'):
...         print bookmark.text
...
Blog Entry
(Note that calling the soup, or part of the soup, is equivalent to .findAll().)
As a list comprehension, that's [b.text for e in soup('div', 'entry') for b in e(rel='bookmark')] (which produces [u'Blog Entry']).
If you want real CSS3 selectors, I'm not aware of any such thing for BeautifulSoup. All of it (or, if not quite all, almost all) can be done with simple nesting, conditions and regular expressions (you could just as well use entry(rel=re.compile('^bookmark$'))). If you want something like that, consider it your next project! It could be useful for flattening code and making it more understandable to web people.
BeautifulSoup allows (basic) CSS selectors: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
However, the docs point to lxml (http://lxml.de/) if you need more elaborate CSS selectors.
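As a rough sketch of what that looks like for the markup in the question, assuming BeautifulSoup 4 is installed as the bs4 package (note this is Python 3 and the bs4 import, unlike the older BeautifulSoup 3 examples above):

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question.
html = """
<div class="entry">
  <h3 class="foo">
    <a href="http://www.example.com/blog-entry-slug"
       rel="bookmark">Blog Entry</a>
  </h3>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector string and returns a list of matching tags.
matches = soup.select(".entry a[rel=bookmark]")

titles = [a.get_text(strip=True) for a in matches]
links = [a["href"] for a in matches]

print(titles)
print(links)
```

select() always returns a list, so it works the same way whether the page has one entry or many.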