Parsing XML that contains Chinese characters with libxml2 in Python

I ran into an encoding problem when using libxml2 in Python to parse XML that contains Chinese characters:

# coding=utf8
import libxml2

def output(data):
  doc = libxml2.parseMemory(data, len(data))
  ctxt = doc.xpathNewContext()
  res_rslt = ctxt.xpathEval("/r/e/attribute::Name")
  print res_rslt[0]

data =  '''<r><e RoleID="3247" Name="中文"></e></r>'''

output(data)

The output is:

Name="&#x4E2D;&#x6587;"

while I was expecting:

Name="中文"

How can I get that output?


With lxml this is easier, and it just works. lxml is a Pythonic binding for the libxml2 library, and it works wonderfully.

>>> from lxml import etree
>>> x = etree.fromstring('''<r><e RoleID="3247" Name="中文"></e></r>''')
>>> name = x[0].get('Name')
>>> print name
中文

And yes, XPath is supported as well; see the lxml documentation.
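
To illustrate the XPath support, here is a minimal sketch that runs the attribute query from the question through lxml. It is written for Python 3 (the session above is Python 2) and assumes lxml is installed; the variable names are only illustrative.

```python
from lxml import etree

# Same document as in the question.
root = etree.fromstring('<r><e RoleID="3247" Name="中文"></e></r>')

# An attribute XPath yields string-like results holding the attribute
# values, not serialized attribute nodes such as Name="...".
names = root.xpath("/r/e/@Name")
print(names[0])
```

Because the result is the attribute value itself rather than a serialized node, no escaping step is involved when it is printed.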

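As for your program, try this version, which adds an XML declaration naming the encoding, encodes the data as UTF-8, and parses it with parseDoc: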

# -*- coding: utf-8 -*-

import libxml2

def output(data):
  doc = libxml2.parseDoc(data)
  ctxt = doc.xpathNewContext()
  res_rslt = ctxt.xpathEval("/r/e/attribute::Name")
  return res_rslt[0]

data =  u'''<?xml version="1.0" encoding="UTF-8"?><r><e RoleID="3247" Name="中文"></e></r>'''.encode("UTF-8")

print output(data)


My answer to this sort of thing always seems to be "use Beautiful Soup". I always get upvoted for it, too, which suggests that others agree with me that it's good.

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(u'''<r><e RoleID="3247" Name="中文"></e></r>''')
>>> print soup.r.e['name']
中文

The thing is that libxml2 is converting those characters into numeric character references (&#x4E2D; and &#x6587;), which is correct for XML. Beautiful Soup has no such notion of needing to be correct, so it just gives you what you want.

(Note that in this case either u'...' or '...' will work; I used a unicode string because it feels better that way. Whatever you pass in, Beautiful Soup gives you Unicode.)
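
On the point that the escaped form is correct XML: the sketch below shows that the character references and the literal characters parse to the same attribute value, and that escaping only appears when a node is serialized to an ASCII-only encoding. It uses the standard library's xml.etree.ElementTree on Python 3 as a stand-in, not libxml2, so it illustrates the XML rule rather than libxml2's own behaviour.

```python
import xml.etree.ElementTree as ET

escaped = '<r><e RoleID="3247" Name="&#x4E2D;&#x6587;"></e></r>'
literal = '<r><e RoleID="3247" Name="中文"></e></r>'

# Parsing resolves character references, so both documents carry
# the same attribute value.
name_escaped = ET.fromstring(escaped).find("e").get("Name")
name_literal = ET.fromstring(literal).find("e").get("Name")
print(name_escaped == name_literal)

# Serializing is where the two spellings diverge. ElementTree's default
# output encoding is US-ASCII, so non-ASCII characters are written as
# character references; a Unicode-capable target keeps them literal.
root = ET.fromstring(literal)
as_ascii = ET.tostring(root)                      # bytes
as_text = ET.tostring(root, encoding="unicode")   # str
print(as_ascii)
print(as_text)
```

In other words, the escaped output in the question is a serialization choice, and the parsed value is the same either way.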
