Gracefully recover from parse error in expat
XML is supposed to be strict, and so there are some Unicode characters which aren't allowed in XML. However, I'm trying to work with RSS feeds which often contain these characters anyway, and I'd like to either avoid parse errors from invalid characters or recover gracefully from them and present the document anyway.
See an example here (on March 21 anyway): http://feeds.feedburner.com/chrisblattman
What's the recommended way to handle unicode in the XML feed? Detect the characters and substitute in null bytes, edit the parser, or some other method?
Looks like that RSS feed contained a form feed character (\x0c),
which is illegal per the XML 1.0 spec.
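If you want to confirm where a feed breaks before deciding how to clean it, expat's exception carries the error code and position. The following is only a minimal sketch, assuming Python 3; `find_parse_error` and the sample document are made up for illustration and are not taken from the feed above.

```python
from xml.parsers import expat

def find_parse_error(data):
    """Return None if data parses cleanly, else (message, line, column).

    Hypothetical helper for illustration only.
    """
    parser = expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
    except expat.ExpatError as err:
        # ExpatError exposes the numeric code plus the line and column
        return (expat.ErrorString(err.code), err.lineno, err.offset)
    return None

# Made-up sample with a stray \x0c inside the title element
sample = b'<rss><title>bad \x0c char</title></rss>'
print(find_parse_error(sample))
```

This only reports the first error; expat stops at that point, so it helps with diagnosis rather than recovery.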
My advice is to filter out the illegal characters before passing the data to expat, rather than attempting to catch errors and recover. Here is a routine to filter out the Unicode characters which are illegal. I tested it on your chrisblattman.xml
RSS feed:
import re
from xml.parsers import expat
# illegal XML 1.0 character ranges
# See http://www.w3.org/TR/REC-xml/#charsets
XML_ILLEGALS = u'|'.join(u'[%s-%s]' % (s, e) for s, e in [
    (u'\u0000', u'\u0008'),  # null and other C0 controls
    (u'\u000B', u'\u000C'),  # vertical tab and form feed
    (u'\u000E', u'\u001F'),  # remaining C0 controls (shift out through unit separator)
    (u'\u007F', u'\u009F'),  # DEL and C1 controls (legal in XML 1.0 but discouraged)
    (u'\uD800', u'\uDFFF'),  # high and low surrogate areas
    (u'\uFDD0', u'\uFDDF'),  # noncharacters, not permitted for interchange
    (u'\uFFFE', u'\uFFFF'),  # noncharacters (U+FFFE is a byte-swapped BOM)
])
RE_SANITIZE_XML = re.compile(XML_ILLEGALS, re.M | re.U)
# decode, filter illegals out, then encode back to utf-8
with open('chrisblattman.xml', 'rb') as f:
    data = f.read().decode('utf-8')
data = RE_SANITIZE_XML.sub(u'', data).encode('utf-8')
pr = expat.ParserCreate('utf-8')
pr.Parse(data, True)  # True marks this as the final chunk of input
Update: Here is a Wikipedia page about XML character validity. My regexp above filters out the C1 control range, but you may want to allow those characters depending on your application.
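To make the C1 choice explicit, the filter can be split into "illegal" and "legal but discouraged" sets. This is a rough sketch, assuming Python 3 (where `re` understands `\uXXXX` escapes in str patterns); `sanitize_xml` and its `strip_discouraged` flag are names invented here, and the ranges follow my reading of the Char production and the compatibility note in the XML 1.0 spec.

```python
import re

# Outside the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL = r'\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff'
# Legal but discouraged: DEL and C1 controls (except NEL, U+0085),
# plus the U+FDD0..U+FDEF noncharacters
_DISCOURAGED = r'\x7f-\x84\x86-\x9f\ufdd0-\ufdef'

RE_ILLEGAL = re.compile('[' + _ILLEGAL + ']')
RE_STRICT = re.compile('[' + _ILLEGAL + _DISCOURAGED + ']')

def sanitize_xml(text, strip_discouraged=False):
    """Drop characters XML 1.0 forbids; optionally drop discouraged ones too."""
    pattern = RE_STRICT if strip_discouraged else RE_ILLEGAL
    return pattern.sub('', text)

print(repr(sanitize_xml('a\x0cb')))  # form feed is always removed
print(repr(sanitize_xml('a\x80b', strip_discouraged=True)))  # C1 removed on request
```

Tab, newline and carriage return are left alone in both modes, since the spec allows them.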
You may try Beautiful Soup, which can parse HTML/XML documents even if they are not well formed.