Python libxml2: parsing XML that contains Chinese characters
I ran into encoding problems when using libxml2 in Python to parse XML containing Chinese characters:
# coding=utf8
import libxml2

def output(data):
    doc = libxml2.parseMemory(data, len(data))
    ctxt = doc.xpathNewContext()
    res_rslt = ctxt.xpathEval("/r/e/attribute::Name")
    print res_rslt[0]

data = '''<r><e RoleID="3247" Name="中文"></e></r>'''
output(data)
The output is
Name="&#x4E2D;&#x6587;"
while I'm expecting
Name="中文"
How can I get that output?
With lxml, things are easier and they work. It is a Pythonic binding for the libxml2 library and works wonderfully.
>>> from lxml import etree
>>> x = etree.fromstring('''<r><e RoleID="3247" Name="中文"></e></r>''')
>>> name = x[0].get('Name')
>>> print name
中文
And yes, XPath is also supported; see the lxml documentation.
As for your program, have a look at this:
# -*- coding: utf-8 -*-
import libxml2

def output(data):
    doc = libxml2.parseDoc(data)
    ctxt = doc.xpathNewContext()
    res_rslt = ctxt.xpathEval("/r/e/attribute::Name")
    return res_rslt[0]

data = u'''<?xml version="1.0" encoding="UTF-8"?><r><e RoleID="3247" Name="中文"></e></r>'''.encode("UTF-8")
print output(data)
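The difference in the program above is the XML declaration with encoding="UTF-8". As a rough illustration of why an output encoding matters, here is a sketch that uses the standard library's xml.etree.ElementTree in Python 3 rather than libxml2 (that swap is my assumption for the sake of a self-contained example, and libxml2's internals may differ). It shows that the same element is written with numeric character references when serialized as ASCII and with the literal characters when serialized as UTF-8.

```python
import xml.etree.ElementTree as ET

# Parse the document from the question (Python 3 str input).
root = ET.fromstring('<r><e RoleID="3247" Name="中文"></e></r>')
elem = root.find("e")

# Default serialization targets US-ASCII, so non-ASCII characters
# are written as numeric character references.
ascii_bytes = ET.tostring(elem)

# Serializing as UTF-8 keeps the characters themselves.
utf8_bytes = ET.tostring(elem, encoding="utf-8")

print(ascii_bytes)
print(utf8_bytes.decode("utf-8"))
```

Both byte strings describe the same attribute value; only the way it is written out differs.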
My answer to these sorts of things always seems to be "use Beautiful Soup". And I always get upvoted for it, too (which shows, I think, that others agree with me that it's good).
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(u'''<r><e RoleID="3247" Name="中文"></e></r>''')
>>> print soup.r.e['name']
中文
The thing is that libxml2 is converting those characters into the proper XML entities, which is correct for XML. Beautiful Soup has no such notion of needing to be correct, so it just gives you what you want.
(Note in this case that using either u'...' or '...' will work; I just put it as a unicode string because it feels better that way. Whatever you do, Beautiful Soup gives you Unicode.)
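To back up the point that the entity form is still correct XML, here is a small sketch in Python 3 using only the standard library (xml.etree.ElementTree and html.unescape are my choice for illustration, not something the answers above use). It parses the attribute once with literal characters and once with numeric character references and checks that both give the same value. It also shows one possible way to turn an already serialized string back into readable text.

```python
import xml.etree.ElementTree as ET
from html import unescape

# The same document, written with literal characters and with
# numeric character references.
literal = ET.fromstring('<r><e RoleID="3247" Name="中文"></e></r>')
escaped = ET.fromstring('<r><e RoleID="3247" Name="&#x4E2D;&#x6587;"></e></r>')

name_literal = literal.find("e").get("Name")
name_escaped = escaped.find("e").get("Name")

# Any XML parser resolves the references, so the values are identical.
print(name_literal == name_escaped)

# If all you have is the serialized text, the references can be decoded.
decoded = unescape('Name="&#x4E2D;&#x6587;"')
print(decoded)
```

In other words, the escaped output only looks different; the data a consumer gets after parsing is the same.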