Python library to do jQuery-like text extraction?

2023-01-30 11:44 Q&A author:

I've got HTML that contains entries like this:

<div class="entry">
  <h3 class="foo">
    <a href="http://www.example.com/blog-entry-slug"
    rel="bookmark">Blog Entry</a>
  </h3>
  ...
</div>

and I would like to extract the text "Blog Entry" (and a number of other attributes, so I'm looking for a generic answer).

In jQuery, I would do

$('.entry a[rel=bookmark]').text()

The closest I've been able to get in Python is:

from BeautifulSoup import BeautifulSoup
import soupselect as soup

rawsoup = BeautifulSoup(open('fname.html').read())

for entry in rawsoup.findAll('div', 'entry'):
    print soup.select(entry, 'a[rel=bookmark]')[0].string.strip()

soupselect is from http://code.google.com/p/soupselect/.

However, soupselect doesn't understand the full CSS3 selector syntax the way jQuery does. Is there such a beast in Python?

You might want to take a look at lxml's CSSSelector class, which tries to implement CSS selectors as described in the W3C specification. As a side note, many folks now recommend lxml over BeautifulSoup for parsing HTML/XML, for performance and other reasons.

lxml's CSSSelector translates the CSS selector into an XPath expression for element selection, but you might want to check the documentation for yourself. Here's your example with lxml:

>>> from lxml.cssselect import CSSSelector
>>> from lxml.html import fromstring
>>> html = '<div class="entry"><h3 class="foo"><a href="http://www.example.com/blog-entry-slug" rel="bookmark">Blog Entry</a></h3></div>'
>>> h = fromstring(html)
>>> sel = CSSSelector("a[rel=bookmark]")
>>> [e.text for e in sel(h)]
['Blog Entry']

You might also want to have a look at pyquery, a jQuery-like library for Python.

With BeautifulSoup itself, it's really very easy using keyword arguments.

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('''<div class="entry">
...   <h3 class="foo">
...     <a href="http://www.example.com/blog-entry-slug"
...     rel="bookmark">Blog Entry</a>
...   </h3>
...   ...
... </div>
... ''')
>>> soup.find('div', 'entry').find(rel='bookmark').text
u'Blog Entry'

Alternately,

>>> for entry in soup('div', 'entry'):
...     for bookmark in entry(rel='bookmark'):
...         print bookmark.text
...
Blog Entry

You can also use attrs to effect a selector of .entry rather than div.entry:

>>> for entry in soup(attrs={'class': 'entry'}):
...     for bookmark in entry(rel='bookmark'):
...         print bookmark.text
...
Blog Entry

(Note calling the soup or part of the soup is equivalent to .findAll().)

As a list comprehension, that's [b.text for e in soup('div', 'entry') for b in e(rel='bookmark')] (produces [u'Blog Entry']).

If you want real CSS3 selectors, I'm not aware of any such thing for BeautifulSoup. All (or if not quite, almost all) of it can be done with simple nesting, conditions and regular expressions (you could just as well use entry(rel=re.compile('^bookmark$'))). If you want something like that, consider it your next project! It could be useful for flattening code and making it more understandable to web people.

BeautifulSoup allows (basic) CSS selectors: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors

However, the docs point to lxml (http://lxml.de/) if you need more elaborate CSS selectors.
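
As a minimal sketch of those basic CSS selectors, assuming the bs4 package (BeautifulSoup 4) is installed, the example from the question could be written with select(). The built-in "html.parser" backend is used here only to avoid an extra dependency; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question.
html = """<div class="entry">
  <h3 class="foo">
    <a href="http://www.example.com/blog-entry-slug"
    rel="bookmark">Blog Entry</a>
  </h3>
  ...
</div>"""

soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns a list of matching tags.
links = soup.select('.entry a[rel="bookmark"]')

# Collect the link text and the href attribute of every match.
texts = [a.get_text(strip=True) for a in links]
hrefs = [a["href"] for a in links]

print(texts)
print(hrefs)
```

This stays close to the jQuery one-liner in the question, though select() always returns a list, so you index or loop over the result rather than calling a method on it directly.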
