How to Convert Extended ASCII to HTML Entity Names in Python?
I'm currently doing this to replace extended-ascii characters with their HTML-entity-number equivalents:
s.encode('ascii', 'xmlcharrefreplace')
What I would like to do is convert to the HTML-entity-name equivalent (i.e. &copy; instead of &#169;). This small program below shows what I'm trying to do that is failing. Is there a way to do this, aside from doing a find/replace?
#coding=latin-1
def convertEntities(s):
    return s.encode('ascii', 'xmlcharrefreplace')

ok = 'ascii: !@#$%^&*()<>'
not_ok = u'extended-ascii: ©®°±¼'
ok_expected = ok
not_ok_expected = u'extended-ascii: &copy;&reg;&deg;&plusmn;&frac14;'
ok_2 = convertEntities(ok)
not_ok_2 = convertEntities(not_ok)

if ok_2 == ok_expected:
    print 'ascii worked'
else:
    print 'ascii failed: "%s"' % ok_2

if not_ok_2 == not_ok_expected:
    print 'extended-ascii worked'
else:
    print 'extended-ascii failed: "%s"' % not_ok_2
Is htmlentitydefs what you want?
import htmlentitydefs
htmlentitydefs.codepoint2name.get(ord(c),c)
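To make that lookup concrete, here is a minimal sketch. It assumes Python 3, where htmlentitydefs was renamed to html.entities, and the helper name entity_name_or_char is made up for illustration. Note that codepoint2name holds bare names such as copy, without the surrounding & and ;.

```python
# Python 3 sketch: html.entities is the Python 3 name of htmlentitydefs.
from html.entities import codepoint2name


def entity_name_or_char(c):
    """Return the bare entity name for c, or c itself if no name exists.

    Hypothetical helper, named here only for illustration.
    """
    return codepoint2name.get(ord(c), c)


print(entity_name_or_char('\u00a9'))  # copy
print(entity_name_or_char('a'))       # a
```

Be aware that the table also contains entries for &, <, > and ", so a plain lookup will map those ASCII characters too.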
Edit: Others have mentioned htmlentitydefs, which I never knew about. It would work with my code this way:
from htmlentitydefs import entitydefs as symbols
for tag, val in symbols.iteritems():
    mystr = mystr.replace("&{0};".format(tag), val)
And that should work.
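One caveat: entitydefs maps entity names to their replacement text, so the loop above turns &name; sequences into characters, which is the reverse of what the question asks for. If that direction is what you need, a single-pass sketch along these lines avoids decoding the same text twice (for example &amp;copy; ending up as a copyright sign). It assumes Python 3's html.entities, and decode_named_entities is a hypothetical helper name.

```python
import re
from html.entities import name2codepoint  # Python 3 name of htmlentitydefs

# Matches a named entity reference such as &copy; and captures the bare name.
_NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def decode_named_entities(text):
    """Replace known named entities with their characters in one pass.

    Unknown names are left untouched. Hypothetical helper for illustration.
    """
    def _replace(match):
        codepoint = name2codepoint.get(match.group(1))
        if codepoint is None:
            return match.group(0)
        return chr(codepoint)

    return _NAMED_ENTITY.sub(_replace, text)


print(decode_named_entities('&copy; 2010, &amp;copy; stays escaped once'))
```

Because re.sub scans the string once, text produced by one replacement is never re-examined, unlike a chain of str.replace calls.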
I'm not sure how directly, but I think the htmlentitydefs module will be of use. An example can be found here.
Update: This is the solution I'm going with, with a small fix to check that htmlentitydefs.codepoint2name contains a mapping for the character we have.
import htmlentitydefs

def convertEntities(s):
    return ''.join([getEntity(c) for c in s])

def getEntity(c):
    ord_c = ord(c)
    # Only touch non-ASCII characters that have a named entity
    if ord_c > 127 and ord_c in htmlentitydefs.codepoint2name:
        return "&%s;" % htmlentitydefs.codepoint2name[ord_c]
    return c
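For anyone on Python 3, the same per-character idea might look roughly like this. It is only a sketch: it assumes html.entities (the Python 3 name of htmlentitydefs), and the snake_case function names are mine, not part of any library.

```python
from html.entities import codepoint2name  # Python 3 name of htmlentitydefs


def get_entity(c):
    """Return '&name;' for a non-ASCII character with a named entity, else c."""
    code = ord(c)
    if code > 127 and code in codepoint2name:
        return "&%s;" % codepoint2name[code]
    return c


def convert_entities(s):
    """Map every character of s through get_entity and join the results."""
    return ''.join(get_entity(c) for c in s)


print(convert_entities('extended-ascii: \u00a9\u00ae\u00b0\u00b1\u00bc'))
# extended-ascii: &copy;&reg;&deg;&plusmn;&frac14;
```

As in the original, a non-ASCII character with no named entity is returned unchanged, so the result is not guaranteed to be pure ASCII.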
Are you sure that you don't want the conversion to be reversible? Your ok_expected string indicates you don't want existing & characters escaped, so the conversion will be one way. The code below assumes that & should be escaped, but just remove the cgi.escape call if you really don't want that.
Anyway, I'd combine your original approach with a regular expression substitution: do the encoding as before and then just fix up the numeric entities. That way you don't end up mapping every single character through your getEntity function.
#coding=latin-1
import cgi
import re
import htmlentitydefs

def replace_entity(match):
    # Swap a numeric reference for its named form when one exists
    c = int(match.group(1))
    name = htmlentitydefs.codepoint2name.get(c, None)
    if name:
        return "&%s;" % name
    return match.group(0)

def convertEntities(s):
    s = cgi.escape(s) # Remove if you want ok_expected to pass!
    s = s.encode('ascii', 'xmlcharrefreplace')
    s = re.sub(r"&#([0-9]+);", replace_entity, s)
    return s

ok = 'ascii: !@#$%^&*()<>'
not_ok = u'extended-ascii: ©®°±¼'
ok_expected = ok
not_ok_expected = u'extended-ascii: &copy;&reg;&deg;&plusmn;&frac14;'
ok_2 = convertEntities(ok)
not_ok_2 = convertEntities(not_ok)

if ok_2 == ok_expected:
    print 'ascii worked'
else:
    print 'ascii failed: "%s"' % ok_2

if not_ok_2 == not_ok_expected:
    print 'extended-ascii worked'
else:
    print 'extended-ascii failed: "%s"' % not_ok_2
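The code above is Python 2. A rough Python 3 equivalent of the same encode-then-fix-up approach could look like the sketch below. Two things are assumptions on my part: html.escape stands in for cgi.escape (which was removed in Python 3.8), and the escape_markup flag is a made-up parameter replacing the "remove this line" comment.

```python
import html
import re
from html.entities import codepoint2name  # Python 3 name of htmlentitydefs


def _replace_numeric(match):
    """Turn '&#169;' into '&copy;' when a name exists, else keep it as is."""
    name = codepoint2name.get(int(match.group(1)))
    if name:
        return "&%s;" % name
    return match.group(0)


def convert_entities(s, escape_markup=True):
    """Return an ASCII-only str with named entities where available.

    escape_markup is a hypothetical flag: when True, &, < and > are escaped
    first so that the conversion stays reversible.
    """
    if escape_markup:
        s = html.escape(s, quote=False)
    # encode() returns bytes in Python 3, so decode back to str afterwards.
    ascii_text = s.encode('ascii', 'xmlcharrefreplace').decode('ascii')
    return re.sub(r"&#([0-9]+);", _replace_numeric, ascii_text)


print(convert_entities('extended-ascii: \u00a9\u00ae\u00b0\u00b1\u00bc'))
# extended-ascii: &copy;&reg;&deg;&plusmn;&frac14;
```

Characters without a named entity keep their numeric reference, so the output is always plain ASCII. With escape_markup=False, any numeric reference already present in the input is rewritten as well.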