cElementTree invalid encoding problem
I'm encoding challenged, so this is probably simple, but I'm stuck.
I'm trying to parse an XML file emailed to my app through App Engine's new receive mail functionality. First, I just pasted the XML into the body of the message, and it parsed fine with cElementTree. Then I changed to sending it as an attachment, and parsing it with cElementTree produces this error:
SyntaxError: not well-formed (invalid token): line 3, column 10
I've output the XML from both emailing in the body and as an attachment, and they look the same to me. I assume pasting it in the box is changing the encoding in a way that attaching the file is not, but I don't know how to fix it.
The first few lines look like this:
<?xml version="1.0" standalone="yes"?>
<gpx xmlns="http://www.topografix.com/GPX/1/0" version="1.0" creator="TopoFusion 2.85" xmlns:TopoFusion="http://www.TopoFusion.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.topografix.com/GPX/1/0 http://www.topografix.com/GPX/1/0/gpx.xsd http://www.TopoFusion.com http://www.TopoFusion.com/topofusion.xsd">
<name><![CDATA[Pacific Crest Trail section K hike 4]]></name><desc><![CDATA[Pacific Crest Trail section K hike 4. Five Lakes to Old Highway 40 near Donner. As described in Day Hikes on the PCT California edition by George & Patricia Semb. See pages 150-152 for access and exit trailheads. GPS data provided by the USFS]]></desc><author><![CDATA[MikeOnTheTrail]]></author><email><![CDATA[michaelonthetrail@yahoo.com]]></email><url><![CDATA[http://www.pcta.org]]></url>
<urlname><![CDATA[Pacific Crest Trail Association Homepage]]></urlname>
<time>2006-07-08T02:16:05Z</time>
Edited to add more info:
I have a GPX file that's a few thousand lines. If I paste it into the body of the message I can parse it correctly, like so:
gpxcontent = message.bodies(content_type='text/plain')
for x in gpxcontent:
    gpxcontent = x[1].decode()
for event, elem in ET.iterparse(StringIO.StringIO(gpxcontent), events=("start", "start-ns")):
If I instead send it as an attachment using Gmail, and then extract it like so:
if isinstance(message.attachments, tuple):
    attachments = [message.attachments]
gpxcontent = attachments[0][3].decode()
for event, elem in ET.iterparse(StringIO.StringIO(gpxcontent), events=("start", "start-ns")):
I get the error above. Line 3 column 10 seems to be the start of ![CDATA on the third line.
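One way to check what the parser is actually choking on is to catch the error and print the offending line with a caret under the reported column. This is only a minimal sketch, assuming Python 3's xml.etree.ElementTree, where ParseError carries a `position` tuple; `show_parse_error` and the shortened `sample` string are hypothetical names made up for illustration, not part of any library or of my real file.

```python
import xml.etree.ElementTree as ET


def show_parse_error(xml_text):
    """Return a short report of where parsing fails, or None if it parses."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # position is (line, column); the line is 1-based, the column 0-based.
        line_no, column = err.position
        lines = xml_text.splitlines()
        bad_line = lines[line_no - 1] if 0 < line_no <= len(lines) else ""
        # Error message, the offending line, then a caret under the column.
        return "%s\n%s\n%s^" % (err, bad_line, " " * column)
    return None


# Shortened stand-in for decoded attachment text (an assumption, not real data).
sample = (
    '<?xml version="1.0" standalone="yes"?>\n'
    "<gpx>\n"
    "<name><![cdata[hike 4]]></name>\n"
    "</gpx>"
)
print(show_parse_error(sample))
```

Running the same helper on the text taken from the message body and on the text taken from the attachment should show whether the two really are identical at the reported position.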
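Ah, never mind. There's a bug in App Engine that calls lower() on all attachments when you decode them. That turned <![CDATA[ into <![cdata[, and because XML is case-sensitive the lowercased marker is not a valid CDATA section start, which is why the parser failed at that point on line 3.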
Here's a link to the bug report: http://code.google.com/p/googleappengine/issues/detail?id=2289#c2
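The effect is easy to reproduce outside App Engine. The sketch below assumes Python 3's xml.etree.ElementTree rather than the cElementTree module used above, and it simulates the bug with a plain str.lower() call on one element from my file; it does not call any App Engine API.

```python
import xml.etree.ElementTree as ET

original = "<name><![CDATA[Pacific Crest Trail section K hike 4]]></name>"
# Simulates what the buggy decode did to the attachment text.
lowered = original.lower()

# The intact markup parses, and the CDATA content becomes the element text.
parsed_text = ET.fromstring(original).text
print("original parses:", parsed_text)

# The lowercased markup is rejected because "<![cdata[" does not start a
# CDATA section; XML markup is case-sensitive.
try:
    ET.fromstring(lowered)
    lowered_parses = True
except ET.ParseError as err:
    lowered_parses = False
    print("lowered fails:", err)
```

So the encoding was never the problem. Until the bug is fixed, anything case-sensitive in an attachment (CDATA markers, mixed-case element names, text content) will be mangled by the decode step.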