Why does printing to a utf-8 file fail?
So I ran into a problem this afternoon, I was able to solve it, but I don't quite understand why it worked.
this is related to a problem I had the other week: python check if utf-8 string is uppercase
basically, the following will not work:
#!/usr/bin/python
import codecs
from lxml import etree
outFile = codecs.open('test.xml', 'w', 'utf-8') #cannot use codecs.open()
root = etree.Element('root')
sect = etree.SubElement(root,'sect')
words = ( u'\u041c\u041e\u0421\u041a\u0412\u0410', # capital of Russia, all uppercase
u'R\xc9SUM\xc9', # RESUME with accents
u'R\xe9sum\xe9', # Resume with accents
u'R\xe9SUM\xe9', ) # ReSUMe with accents
for word in words:
print word
if word.encode('utf8').decode('utf8').isupper(): #.isupper won't function on utf8
title = etree.SubElement(sect,'title')
title.text = word
else:
item = etree.SubElement(sect,'item')
item.text = word
print>>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')
it fails with the following:
Traceback (most recent call last):
File "./temp.py", line 25, in print >>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8') File "/usr/lib/python2.7/codecs.py", line 691, in write return self.writer.write(data) File "/usr/lib/python2.7/codecs.py", line 351, in write data, consumed = self.encode(object, self.errors) UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 66: ordinal not in range(128)
but if I 开发者_Python百科open the new file without codecs.open('test.xml', 'w', 'utf-8')
and instead use
outFile = open('test.xml', 'w')
it works perfectly.
So whats happening??
since
encoding='utf-8'
is specified inetree.tostring()
is it encoding the file again?if I leave
codecs.open()
and removeencoding='utf-8'
the file then becomes an ascii file. Why? becuaseetree.tostring()
has a default encoding of ascii I persume?but
etree.tostring()
is simply being written to stdout, and is then redirect to a file that was created as a utf-8 file??- is
print>>
not workings as I expect?outFile.write(etree.tostring())
behaves the same way.
- is
Basically, why wouldn't this work? what is going on here. It might be trivial, but I am obviously a bit confused and have a desire to figure out why my solution works,
You've opened the file with UTF-8 encoding, which means that it expects Unicode strings.
tostring is encoding to UTF-8 (in the form of bytestrings, str), which you're writing to the file.
Because the file is expecting Unicode, it's decoding the bytestrings to Unicode using the default ASCII encoding so that it can then encode the Unicode to UTF-8.
Unfortunately, the bytestrings aren't ASCII.
EDIT: The best advice to avoid this kind of problem is to use Unicode internally, decoding on input and encoding on output.
Using print>>outFile
is a little strange. I don't have lxml
installed, but the built-in xml.etree
library is similar (but doesn't support pretty_print
). Wrap the root
Element in an ElementTree and use the write method.
Also, if you using a # coding
line to declare the encoding of the source file, you can use readable Unicode strings instead of escape codes:
#!/usr/bin/python
# coding: utf8
import codecs
from xml.etree import ElementTree as etree
root = etree.Element(u'root')
sect = etree.SubElement(root,u'sect')
words = [u'МОСКВА',u'RÉSUMÉ',u'Résumé',u'RéSUMé']
for word in words:
print word
if word.isupper():
title = etree.SubElement(sect,u'title')
title.text = word
else:
item = etree.SubElement(sect,u'item')
item.text = word
tree = etree.ElementTree(root)
tree.write('text.xml',xml_declaration=True,encoding='utf-8')
In addition to MRABs answer some lines of code:
import codecs
from lxml import etree
root = etree.Element('root')
sect = etree.SubElement(root,'sect')
# do some other xml building here
with codecs.open('test.xml', 'w', encoding='utf-8') as f:
f.write(etree.tostring(root, encoding=unicode))
精彩评论