Python expat can't parse an XML file with bad characters. How to work around it?
I'm trying to parse an XML file (OSM data) with expat, and there are lines with some Unicode characters that expat can't parse:
<tag k="name"
v="абвгдежзиклмнопр�?туфхцчшщьыъ�?ю�?�?БВГДЕЖЗИКЛМ�?ОПРСТУФХЦЧШЩЬЫЪЭЮЯ" />
<tag k="name" v="Cin\x8e? Rex" />
(XML file encoding in the opening line is "UTF-8")
The file is quite old, and there must have been errors in it. In modern files I don't see UTF-8 errors, and they parse fine. But if my program runs into a broken character, what workaround can I use? Is it possible to chain the bz2 codec (I parse a compressed file) with the utf-8 codec so that broken characters are ignored or replaced with "?"?
I'm not sure whether the '�' characters were introduced by copy-pasting the string here, but if they are in the original data, then the problem seems to lie with the generator, which introduced \uFFFD characters. That character is:
"used to replace an incoming character whose value is unknown or unrepresentable in Unicode"
Cited from: http://www.fileformat.info/info/unicode/char/fffd/index.htm
A workaround? Just an idea that you could extend:
    from xml.parsers.expat import ExpatError, ErrorString, errors

    # xp is the expat parser, f the open input file, buf_size the read size
    good = True
    buf = None
    while True:
        if good:
            buf = f.read(buf_size)
        else:
            # try again with the cleaned buffer
            pass
        try:
            xp.Parse(buf, len(buf) == 0)
            if len(buf) == 0:
                break
            good = True
        except ExpatError:
            if ErrorString(xp.ErrorCode) == errors.XML_ERROR_BAD_CHAR_REF:
                # look at xp.ErrorByteIndex (or nearby)
                # for 0xEF 0xBF 0xBD (UTF-8 replacement char) and remove it
                good = False
            else:
                # other errors: handle them here, or re-raise
                raise
Alternatively, clean the input buffer before passing it to the parser, taking care of corner cases such as a partial multi-byte sequence at the end of the buffer. I can't recall whether Python's expat lets you assign a custom error handler; if it does, that would be easier.
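The buffer-cleaning variant could be sketched roughly as follows. This is only a sketch, assuming Python 3; parse_lenient and its buf_size parameter are hypothetical names, and the sample data in the demo is made up for illustration. It uses the standard library's incremental UTF-8 decoder with errors="replace", which holds back a partial multi-byte sequence at the end of a chunk and turns undecodable bytes into U+FFFD before expat sees them.

```python
import bz2
import codecs
import io
import xml.parsers.expat


def parse_lenient(fileobj, parser, buf_size=64 * 1024):
    """Feed fileobj to an expat parser, replacing undecodable bytes."""
    # The incremental decoder keeps a multi-byte sequence that is split
    # across two reads until the rest arrives; invalid bytes become U+FFFD.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    while True:
        chunk = fileobj.read(buf_size)
        final = not chunk
        text = decoder.decode(chunk, final)
        # Re-encode so that expat always receives well-formed UTF-8 bytes.
        parser.Parse(text.encode("utf-8"), final)
        if final:
            break


# Demo on an in-memory bz2 stream containing one stray byte (0x8E).
raw = ('<?xml version="1.0" encoding="UTF-8"?>'
       '<osm><tag k="name" v="абв"/>').encode("utf-8")
raw += b'<tag k="name" v="Cin\x8e Rex"/></osm>'

names = []


def start_element(name, attrs):
    # Collect the values of all <tag k="name" v="..."/> elements.
    if name == "tag" and attrs.get("k") == "name":
        names.append(attrs["v"])


parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
# A tiny buf_size forces multi-byte characters to straddle chunk boundaries.
with bz2.open(io.BytesIO(bz2.compress(raw)), "rb") as f:
    parse_lenient(f, parser, buf_size=7)
```

To get "?" instead of U+FFFD, text.replace("\ufffd", "?") could be applied before re-encoding. Note that this only repairs invalid byte sequences; it does not help with input that is valid UTF-8 but otherwise malformed XML.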
If I strip the '�' characters from your sample, it is processed fine. \xd1 does not fail.
OSM data?