Replace numeric character references in XML document using Python

I am struggling with the following issue: I have an XML string that contains the following tag and I want to convert this, using cElementTree, to a valid XML document:

<tag>#55296;#57136;#55296;#57149;#55296;#57139;#55296;#57136;#55296;#57151;#55296;
#57154;#55296;#57136;</tag>

but each # sign is preceded by an & sign, and hence the output looks like: �����������&#57154;��

This is a Unicode string and the encoding is UTF-8. I want to discard these numeric character references because they are not legal in an XML document (see Parser error using Perl XML::DOM module, "reference to invalid character number").

I have tried different regular expressions to match these numeric character references. For example, I have tried the following (Python) regex:

RE_NUMERIC_CHARACTER = re.compile('&#[\d{1,5}]+;')

This does work in a regular Python session, but as soon as I use the same regex in my code it doesn't work, presumably because those numeric character references have already been interpreted (and are shown as boxes or question marks).

I have also tried the unescape function from http://effbot.org/zone/re-sub.htm but that does not work either.

Thus: how can I match, using a regular expression in Python, these numeric character references and create a valid XML document?


Eurgh. You've got surrogates (UTF-16 code units in the range D800-DFFF), which some fool has incorrectly encoded individually instead of using a pair of code units for a single character. It would be ideal to replace this mess with what it should look like:

<tag>&#66352;&#66365;&#66355;&#66352;&#66367;&#66370;&#66352;</tag>

Or, just as valid, in literal characters (if you've got a font that can display the Gothic alphabet):

<tag>𐌰𐌽𐌳𐌰𐌿𐍂𐌰</tag>
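
One way to get from the broken form to the corrected one is to merge each adjacent high/low surrogate reference into a single reference before parsing. The following is only a sketch, assuming Python 3, decimal references only (as in the question), and that the two halves of a pair sit directly next to each other with nothing in between. The names REF, is_high, is_low and repair_surrogate_refs are made up for illustration, and xml.etree.ElementTree stands in for cElementTree.

```python
import re
import xml.etree.ElementTree as ET

# One decimal numeric character reference, e.g. "&#55296;".
REF = re.compile(r'&#(\d+);')


def is_high(cp):
    # High (leading) surrogates: U+D800..U+DBFF
    return 0xD800 <= cp <= 0xDBFF


def is_low(cp):
    # Low (trailing) surrogates: U+DC00..U+DFFF
    return 0xDC00 <= cp <= 0xDFFF


def repair_surrogate_refs(text):
    """Merge adjacent high/low surrogate references into one reference."""
    out, pos = [], 0
    while True:
        m = REF.search(text, pos)
        if m is None:
            out.append(text[pos:])
            return ''.join(out)
        out.append(text[pos:m.start()])
        high = int(m.group(1))
        # Look for a second reference starting exactly where this one ends.
        nxt = REF.match(text, m.end())
        if is_high(high) and nxt and is_low(int(nxt.group(1))):
            low = int(nxt.group(1))
            # Standard UTF-16 decoding of a surrogate pair.
            cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
            out.append('&#%d;' % cp)
            pos = nxt.end()
        else:
            # Anything else (including a lone surrogate) is left untouched.
            out.append(m.group(0))
            pos = m.end()


broken = ('<tag>&#55296;&#57136;&#55296;&#57149;&#55296;&#57139;'
          '&#55296;&#57136;&#55296;&#57151;&#55296;&#57154;'
          '&#55296;&#57136;</tag>')
fixed = repair_surrogate_refs(broken)
root = ET.fromstring(fixed)  # now parses; root.text holds 7 Gothic letters
```

Note that this leaves unpaired surrogate references in place, so a document containing those would still be rejected by the parser. If you really do want to discard them as the question asks, the else branch could skip appending whenever the value falls in the D800-DFFF range.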