开发者

How to parse broken XML in Python?

A sever I can't influence sends very broken XML.

Specifically, a Unicode WHITE STAR would get encoded as UTF-8 (E2 98 86) and then translated using a Latin-1 to HTML entity table. What I get is â 98 86 (9 bytes) in a file that's declared as utf-8 with no DTD.

I couldn't configure W3C tidy in a way that doesn't garble this irreversibly. I only found how to make lxml skip it silently. SAX uses Expat, which cannot recover after encounter开发者_如何学Going this. I'd like to avoid BeautifulSoup for speed reasons.

What else is there?


BeautifulSoup is your best bet in this case. I suggest profiling before ruling out BeautifulSoup altogether.


Maybe something like:

import htmlentitydefs as ents
from lxml import etree  # or maybe 'html' , if the input is still more broken
def repl_ent(m): 
     return ents.entitydefs[m.group()[1:-1]]
goodxml = re.sub( '&\w+;', repl_ent, badxml )
etree.fromstring( goodxml )
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜