How do I parse HTML which includes named ISO-8859-1 entities with Python?

In short: minidom does not accept named ISO-8859-1 entities such as &aacute;. What is an appropriate resolution?

Here's code which illustrates my situation:

sample = """
  <html>
    <body>
      <h1>Un ejemplo</h1>
      <p>Me llamo Juan Fulano y Hern&aacute;ndez.</p>
    </body>
  </html>
"""
sample2 = sample.replace("&aacute;", "&#225;")

import xml.dom.minidom

dom2 = xml.dom.minidom.parseString(sample2)  # works: numeric character reference
dom = xml.dom.minidom.parseString(sample)    # fails: named entity &aacute;

Briefly: when the HTML includes 'á' and similar characters expressed as named entities, minidom complains:

... xml.parsers.expat.ExpatError: undefined entity ...

How should I respond? Do I

  • Replace named entities with corresponding literal constants?
  • Use a parser other than minidom? Which?
  • Somehow (with an encoding assignment?) convince minidom that these named entities are cool?

Convincing the author of the (X)HTML to eschew named entities is not feasible.


xml.dom.minidom is an XML parser, not an HTML parser. It therefore recognizes only the five entities predefined by XML (&quot;, &amp;, &lt;, &gt; and &apos;) and none of HTML's named entities such as &aacute;.
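
If you want to keep minidom, one option is to generalize the sample.replace call from the question and rewrite every HTML named entity as a numeric character reference before parsing. The following is only a minimal sketch, assuming Python 3 and input that is otherwise well-formed XML. The helper name named_to_numeric is made up for illustration; html.entities.name2codepoint is the standard library's table of HTML entity names.

```python
import re
import xml.dom.minidom
from html.entities import name2codepoint

# The five entities predefined by XML; the XML parser already handles these.
XML_PREDEFINED = {"quot", "amp", "lt", "gt", "apos"}

def named_to_numeric(markup):
    """Rewrite HTML named entities (e.g. &aacute;) as numeric references (&#225;)."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML entities and unknown names untouched
        return "&#%d;" % name2codepoint[name]
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, markup)

sample = "<p>Me llamo Juan Fulano y Hern&aacute;ndez.</p>"
dom = xml.dom.minidom.parseString(named_to_numeric(sample))
text = dom.documentElement.firstChild.data
print(text)
```

Unknown entity names are deliberately left alone, so minidom will still report them instead of the helper silently dropping them.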

Try BeautifulSoup.
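
A rough sketch of what that could look like, assuming the third-party bs4 package (BeautifulSoup 4) is installed and using Python's built-in html.parser as its backend. The sample string is the one from the question.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

sample = """
  <html>
    <body>
      <h1>Un ejemplo</h1>
      <p>Me llamo Juan Fulano y Hern&aacute;ndez.</p>
    </body>
  </html>
"""

# "html.parser" selects the standard library's HTML parser as the backend,
# which knows HTML named entities such as &aacute;.
soup = BeautifulSoup(sample, "html.parser")
heading = soup.h1.get_text()
paragraph = soup.p.get_text()
print(heading)
print(paragraph)
```

Because this is an HTML parser rather than an XML parser, the named entity is decoded to the literal character and no preprocessing of the markup is needed.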
