Cleaning an XML file in Python before parsing

I'm using minidom to parse an XML file, and it threw an error indicating that the data is not well formed. I figured out that some of the pages have characters like ไอเฟล &, causing the parser to hiccup. Is there an easy way to clean the file before I start parsing it? Right now I'm using a regular expression to throw away anything that isn't an alphanumeric character or one of the </> characters, but it isn't quite working.


Try

xmltext = re.sub(u"[^\x20-\x7f]+",u"",xmltext)

This removes every character outside the 0x20-0x7F range.

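You may start from \x01 if you want to keep control characters such as tabs and line breaks:

xmltext = re.sub(u"[^\x01-\x7f]+",u"",xmltext)

Note that below \x20, XML 1.0 only allows tab (\x09), line feed (\x0a) and carriage return (\x0d), so any other control character kept by this pattern can still make the parser fail.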
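A minimal sketch of this approach, assuming Python 3 and input that has already been decoded to a string. The helper name strip_non_ascii is hypothetical. It keeps tab, line feed, carriage return and printable ASCII only. Escaping a bare & is an extra step not in the answer above; it is included because the & in the question is enough on its own to make the document not well formed.

```python
import re
from xml.dom import minidom

# Anything outside tab, LF, CR and printable ASCII (0x20-0x7E) is dropped.
_NOT_ALLOWED = re.compile(r"[^\x09\x0a\x0d\x20-\x7e]+")
# An ampersand that does not start an entity or character reference.
_BARE_AMP = re.compile(r"&(?![A-Za-z_][\w.-]*;|#[0-9]+;|#x[0-9A-Fa-f]+;)")

def strip_non_ascii(xmltext):
    """Hypothetical helper: remove non-ASCII characters, escape stray '&'."""
    cleaned = _NOT_ALLOWED.sub("", xmltext)
    return _BARE_AMP.sub("&amp;", cleaned)

# Made-up sample containing Thai characters, a bare '&' and valid entities.
sample = "<page><title>Eiffel \u0e44\u0e2d & Tower &lt;1889&gt;</title></page>"
doc = minidom.parseString(strip_non_ascii(sample))
print(doc.documentElement.toxml())
```

Keep in mind that this discards the non-ASCII text for good, so it only fits cases where that data is not needed.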


Take a look at µTidyLib, a Python wrapper to TidyLib.


If you do need the data with the strange characters, you could convert them to codes the XML parser can understand instead of just stripping them.

You could have a look at the unicodedata module, especially the normalize function.

I haven't used it myself, so I can't tell you all that much, but you could ask again here on SO if you decide you're going to convert and keep that data.

>>> import unicodedata
>>> unicodedata.normalize("NFKD" , u"ไภเฟล &")
u'a\u03001\u201ea\u0300 \u0327 a\u03001\u20aca\u0300 \u0327Y\u0308a\u0300 \u0327\xa5 &'
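One way to get "codes the XML parser can understand" is to turn each non-ASCII character into a numeric character reference. This is a sketch of a different technique from normalize, assuming Python 3 and a string that was decoded with the right encoding; to_char_refs is a hypothetical helper name, while "xmlcharrefreplace" is a standard error handler of str.encode.

```python
from xml.dom import minidom

def to_char_refs(xmltext):
    """Replace every non-ASCII character with a reference such as &#3652;."""
    return xmltext.encode("ascii", "xmlcharrefreplace").decode("ascii")

# Made-up sample: five Thai characters inside a title element.
sample = "<title>\u0e44\u0e2d\u0e40\u0e1f\u0e25</title>"
encoded = to_char_refs(sample)
print(encoded)

# The parser turns the references back into the original characters.
doc = minidom.parseString(encoded)
print(doc.documentElement.toxml())
```

This keeps the data but does nothing about a bare &, which still has to be escaped as &amp; separately.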


It looks like you're dealing with data that was saved in some encoding but is being read "as if" it were ASCII. An XML file should normally be UTF-8, and expat (the underlying parser used by minidom) should handle that, so it looks like something is wrong in that part of the processing chain. Instead of focusing on "cleaning up", I'd first try to make sure the encoding is correct and correctly recognized. Maybe a broken XML declaration? Can you edit your question to show the first few lines of the file, especially the <?xml ... declaration at the very start?
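A rough way to check this is to read the file as bytes, look at the declaration, and decode explicitly. The sketch below assumes Python 3; the helper names and the cp1252 fallback are assumptions for illustration, not something known about the asker's files.

```python
import re

# Matches an encoding name inside a leading <?xml ... ?> declaration.
_DECL = re.compile(rb'^<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')

def declared_encoding(raw, default="utf-8"):
    """Return the encoding named in the XML declaration, or a default."""
    match = _DECL.match(raw)
    return match.group(1).decode("ascii") if match else default

def decode_xml_bytes(raw, fallback="cp1252"):
    """Hypothetical helper: decode with the declared encoding, else fall back."""
    try:
        return raw.decode(declared_encoding(raw))
    except (UnicodeDecodeError, LookupError):
        # The declaration is wrong or missing; the fallback is only a guess.
        return raw.decode(fallback, errors="replace")

# Made-up sample: declared as UTF-8 but actually containing a cp1252 byte.
raw = b'<?xml version="1.0" encoding="utf-8"?><t>caf\xe9</t>'
print(declared_encoding(raw))
print(decode_xml_bytes(raw))
```

If the declared encoding fails to decode the bytes, that mismatch is the thing to fix at the source, rather than stripping characters afterwards.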


I'd throw out all non-ASCII characters, which can be identified by having the 8th bit (0x80) set (128 .. 255, or 0x80 .. 0xff in hex).

  • You could read in the file into a Python string named old_str

  • Then perform a filter call in conjunction with a lambda expression that keeps only code points below 128:

    new_str = "".join(filter(lambda x: ord(x) < 128, old_str))
    
  • Parse new_str
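The three steps above can be put together roughly as follows. This is a sketch assuming Python 3; parse_ascii_only is a hypothetical helper, and reading the file as latin-1 is a choice made here so that every byte maps to exactly one character and the 8th-bit test stays valid.

```python
from xml.dom import minidom

def parse_ascii_only(path):
    """Hypothetical helper: read a file, drop non-ASCII characters, parse."""
    # 1. Read the file; latin-1 maps each byte to one character and never fails.
    with open(path, encoding="latin-1") as handle:
        old_str = handle.read()
    # 2. Keep only characters whose code point is below 128 (8th bit clear).
    new_str = "".join(filter(lambda ch: ord(ch) < 128, old_str))
    # 3. Parse the cleaned text.
    return minidom.parseString(new_str)
```

A call such as parse_ascii_only("pages.xml") (the file name is made up) returns a minidom Document. A stray & in the file still raises a parse error, because it is an ASCII character and passes the filter.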

Many ways exist to accomplish stripping non-ASCII characters from a string.

This question might be related: How to check if a string in Python is in ASCII?
