Python library to do jQuery-like text extraction?

I've got HTML that contains entries like this:

<div class="entry">
  <h3 class="foo">
    <a href="http://www.example.com/blog-entry-slug"
    rel="bookmark">Blog Entry</a>
  </h3>
  ...
</div>

and I would like to extract the text "Blog Entry" (and a number of other attributes, so I'm looking for a generic answer).

In jQuery, I would do

$('.entry a[rel=bookmark]').text()

The closest I've been able to get in Python is:

from BeautifulSoup import BeautifulSoup
import soupselect as soup

rawsoup = BeautifulSoup(open('fname.html').read())

for entry in rawsoup.findAll('div', 'entry'):
    print soup.select(entry, 'a[rel=bookmark]')[0].string.strip()

soupselect is from http://code.google.com/p/soupselect/.

Unlike jQuery, however, soupselect doesn't understand the full CSS3 selector syntax. Is there such a beast in Python?


You might want to take a look at lxml's CSSSelector class, which tries to implement CSS selectors as described in the W3C specification. As a side note, many folks now recommend lxml over BeautifulSoup for parsing HTML/XML, for performance and other reasons.

lxml's CSSSelector translates the CSS selector into an XPath expression for element selection, but you might want to check the documentation for the details. Here's your example with lxml:

>>> from lxml.cssselect import CSSSelector
>>> from lxml.html import fromstring
>>> html = '<div class="entry"><h3 class="foo"><a href="http://www.example.com/blog-entry-slug" rel="bookmark">Blog Entry</a></h3></div>'
>>> h = fromstring(html)
>>> sel = CSSSelector("a[rel=bookmark]")
>>> [e.text for e in sel(h)]
['Blog Entry']
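
Since the selector ends up as XPath anyway, it can help to see what a hand-written equivalent looks like. The sketch below is only an approximation of `.entry a[rel=bookmark]` written by hand, not the exact expression lxml generates, and the variable names are made up for illustration. It also pulls the href, since the question asks about other attributes too.

```python
from lxml.html import fromstring

html = (
    '<div class="entry"><h3 class="foo">'
    '<a href="http://www.example.com/blog-entry-slug" rel="bookmark">Blog Entry</a>'
    '</h3></div>'
)
doc = fromstring(html)

# Hand-written XPath, roughly equivalent to the CSS selector
# ".entry a[rel=bookmark]". The concat/contains test mimics how CSS
# matches one class name inside a space-separated class attribute.
xpath = (
    '//*[contains(concat(" ", normalize-space(@class), " "), " entry ")]'
    '//a[@rel="bookmark"]'
)

# Collect (text, href) pairs for every matching link.
links = [(a.text_content().strip(), a.get("href")) for a in doc.xpath(xpath)]
print(links)
```

This version needs only lxml itself, which may be convenient if the separate cssselect package is not installed.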


You might also want to have a look at pyquery, a jQuery-like library for Python. It is available on PyPI.


With BeautifulSoup, it's really very easy using keyword arguments.

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('''<div class="entry">
...   <h3 class="foo">
...     <a href="http://www.example.com/blog-entry-slug"
...     rel="bookmark">Blog Entry</a>
...   </h3>
...   ...
... </div>
... ''')
>>> soup.find('div', 'entry').find(rel='bookmark').text
u'Blog Entry'

Alternately,

>>> for entry in soup('div', 'entry'):
...     for bookmark in entry(rel='bookmark'):
...         print bookmark.text
...
Blog Entry

You can also use attrs to effect a selector of .entry rather than div.entry:

>>> for entry in soup(attrs={'class': 'entry'}):
...     for bookmark in entry(rel='bookmark'):
...         print bookmark.text
...
Blog Entry

(Note that calling the soup, or a part of the soup, is equivalent to calling .findAll() on it.)

As a list comprehension, that's [b.text for e in soup('div', 'entry') for b in e(rel='bookmark')] (produces [u'Blog Entry']).

If you want real CSS3 selectors, I'm not aware of any such thing for BeautifulSoup. Almost everything they offer (if not quite all of it) can be done with simple nesting, conditions and regular expressions (you could just as well use entry(rel=re.compile('^bookmark$'))). If you want a real selector engine, consider it your next project! It could be useful for flattening code and making it more understandable to web people.
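
Here is a rough sketch of the "conditions and regular expressions" approach. It is an assumption-laden translation: it uses the newer bs4 package on Python 3 rather than the BeautifulSoup 3 import shown above, and the helper name `is_bookmark_in_entry` is made up for illustration.

```python
import re
from bs4 import BeautifulSoup

html = """
<div class="entry">
  <h3 class="foo">
    <a href="http://www.example.com/blog-entry-slug"
    rel="bookmark">Blog Entry</a>
  </h3>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Regular expression on an attribute value: roughly what the CSS
# selector a[href^="http://www.example.com/"] would match.
by_prefix = soup.find_all("a", href=re.compile(r"^http://www\.example\.com/"))

# Arbitrary condition as a function (hypothetical helper): an <a> whose
# rel contains "bookmark" and which has an ancestor with class "entry".
def is_bookmark_in_entry(tag):
    return (
        tag.name == "a"
        and "bookmark" in tag.get("rel", [])
        and tag.find_parent(class_="entry") is not None
    )

by_condition = soup.find_all(is_bookmark_in_entry)
texts = [a.get_text(strip=True) for a in by_condition]
print(texts)
```

A function filter is more verbose than a selector string, but it can express conditions that have no selector equivalent at all.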


BeautifulSoup 4 supports (basic) CSS selectors: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors

For more elaborate CSS selectors, though, the docs point to lxml (http://lxml.de/).
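
As a small sketch of what that looks like for the markup in the question, assuming the bs4 package on Python 3 and the standard library's html.parser backend:

```python
from bs4 import BeautifulSoup

html = """
<div class="entry">
  <h3 class="foo">
    <a href="http://www.example.com/blog-entry-slug"
    rel="bookmark">Blog Entry</a>
  </h3>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector string and returns a list of matching tags.
results = [
    (a.get_text(strip=True), a["href"])
    for a in soup.select(".entry a[rel=bookmark]")
]
print(results)
```

Each result is a regular tag object, so attributes such as href are read with the usual dictionary-style access.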
