Replace numeric character references in XML document using Python
I am struggling with the following issue: I have an XML string that contains the following tag and I want to convert this, using cElementTree, to a valid XML document:
<tag>#55296;#57136;#55296;#57149;#55296;#57139;#55296;#57136;#55296;#57151;#55296;
#57154;#55296;#57136;</tag>
but each # sign is preceded by a & sign and hence the output looks like: 开发者_如何学运维57154;
This is a unicode string and the encoding is UTF-8. I want to discard these numeric character references because they are not legal XML in a valid XML document (see Parser error using Perl XML::DOM module, "reference to invalid character number")
I have tried different regular expression to match these numeric character references. For example, I have tried the following (Python) regex:
RE_NUMERIC_CHARACTER = re.compile('&#[\d{1,5}]+;')
This does work in regular python session but as soon as I use the same regex in my code then it doesn't work, presumably because those numeric characters have been interpreted (and are shown as boxes or question marks).
I have also tried the unescape function from http://effbot.org/zone/re-sub.htm but that does not work either.
Thus: how can I match, using a regular expression in Python, these numeric character references and create a valid XML document?
Eurgh. You've got surrogates (UTF-16 code units in the range D800-DFFF), which some fool has incorrectly encoded individually instead of using a pair of code units for a single character. It would be ideal to replace this mess with what it should look like:
<tag>𐌰𐌽𐌳𐌰𐌿𐍂𐌰</tag>
Or, just as valid, in literal characters (if you've got a font that can display the Gothic alphabet):
<tag>
精彩评论