html5lib/lxml examples for BeautifulSoup users?

2023-01-16 01:50 问答作者：

I'm trying to wean myself from BeautifulSoup, which I love but seems to be (aggressively) unsupported. I'm trying to work with html5lib and lxml, but I can't seem to figure out how to use the "find" and "findall" operators.

By looking at the docs for html5lib, I came up with this for a test program:

import cStringIO

f = cStringIO.StringIO(开发者_如何转开发)
f.write("""
  <html>
    <body>
      <table>
       <tr>
          <td>one</td>
          <td>1</td>
       </tr>
       <tr>
          <td>two</td>
          <td>2</td
       </tr>
      </table>
    </body>
  </html>
  """)
f.seek(0)

import html5lib
from html5lib import treebuilders
from lxml import etree  # why?

parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("lxml"))
etree_document = parser.parse(f)

root = etree_document.getroot()

root.find(".//tr")

But this returns None. I noticed that if I do a etree.tostring(root) I get all my data back, but all my tags are prefaced by html (e.g. <html:table>). But root.find(".//html:tr") throws a KeyError.

Can someone put me back on the right track?

You can turn off namespaces with this command: etree_document = html5lib.parse(t, treebuilder="lxml", namespaceHTMLElements=False)

In general, use lxml.html for HTML. Then you don't need to worry about generating your own parser & worrying about namespaces.

>>> import lxml.html as l
>>> doc = """
...    <html><body>
...    <table>
...      <tr>
...        <td>one</td>
...        <td>1</td>
...      </tr>
...      <tr>
...        <td>two</td>
...        <td>2</td
...      </tr>
...    </table>
...    </body></html>"""
>>> doc = l.document_fromstring(doc)
>>> doc.finall('.//tr')
[<Element tr at ...>, <Element tr at ...>] #doctest: +ELLIPSIS

FYI, lxml.html also allows you to use CSS selectors, which I find is an easier syntax.

>>> doc.cssselect('tr')
[<Element tr at ...>, <Element tr at ...>] #doctest: +ELLIPSIS

It appears that using the "lxml" html5lib TreeBuilder causes html5lib to build the tree in the XHTML namespace -- which makes sense, as lxml is an XML library, and XHTML is how one represents HTML as XML. You can use lxml's qname syntax with the find() method to do something like:

root.find('.//{http://www.w3.org/1999/xhtml}tr')

Or you can use lxml's full XPath functions to do something like:

root.xpath('.//html:tr', namespaces={'html': 'http://www.w3.org/1999/xhtml'})

The lxml documentation has more information on how it uses XML namespaces.

I realize that this is an old question, but I came here in a quest for information I didn't find in any other one place. I was trying to scrape something with BeautifulSoup but it was choking on some chunky html. The default html parser is apparently less loose than some others that are available. One often preferred parser is lxml, which I believe produces the same parsing as expected for browsers. BeautifulSoup allows you to specify lxml as the source parser, but using it requires a little bit of work.

First, you need html5lib AND you must also install lxml. While html5lib is prepared to use lxml (and some other libraries), the two do not come packaged together. [for Windows users, even though I don't like fussing with Win dependencies to the extent that I usually get libraries by making a copy in the same directory as my project, I strongly recommend using pip for this; pretty painless; I think you need administrator access.]

Then you need to write something like this:

import urllib2
from bs4 import BeautifulSoup
import html5lib
from html5lib import sanitizer
from html5lib import treebuilders
from lxml import etree

url = 'http://...'

content = urllib2.urlopen(url)
parser = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer,
                             tree=treebuilders.getTreeBuilder("lxml"),
                             namespaceHTMLElements=False)
htmlData = parser.parse(content)
htmlStr = etree.tostring(htmlData)

soup = BeautifulSoup(htmlStr, "lxml")

Then enjoy your beautiful soup!

Note the namespaceHTMLElements=false option on the parser. This is important because lxml is intended for XML as opposed to just HTML. Because of that, it will label all the tags it provides as belonging to the HTML namespace. The tags will look like (for example)

<html:li>

and BeautifulSoup will not work well.

Try:

root.find('.//{http://www.w3.org/1999/xhtml}tr')

You have to specify the namespace rather than the namespace prefix (html:tr). For more information, see the lxml docs, particularly the section:

Tutorial - Namespaces

继续阅读：html5lib lxml python

html5lib/lxml examples for BeautifulSoup users?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？