Encode unicode chars to HTML entities in Python, excluding tags

As you may know, for an email to be valid in many clients, all non-ASCII Unicode characters must be encoded. I would like to automate this encoding in a Python script.

Obviously tags need to be excluded from conversion, otherwise the HTML won't work. Doing this is really the complicated part: to be sure of success it is necessary to use a parsing package like lxml or BeautifulSoup.

As far as I know, neither of those two packages supports converting to numbered Unicode entities such as &#x6F22; (漢).

Any help would be really invaluable, I've been banging my head against this wall all day!


I’ve had a similar problem. In my case it was always enough to run the following expression on the raw text; it converts hex entities to decimal entities, which are then parsed just fine:

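>>> import re
>>> from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)
>>> hex_entity_pat = re.compile(r'&#x([0-9a-fA-F]+);')  # match hex digits only, so int(..., 16) cannot fail
>>> hex_entity_fix = lambda x: hex_entity_pat.sub(lambda m: '&#%d;' % int(m.group(1), 16), x) # convert hex to dec entities
>>> BeautifulSoup(hex_entity_fix("<b>&#x6F22;</b>"), convertEntities=BeautifulSoup.ALL_ENTITIES)
<b>漢</b>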
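
For anyone on Python 3, where the BeautifulSoup 3 module used above is not available, the hex-to-decimal rewrite itself can be sketched with just the standard library. The function name `hex_to_decimal_entities` is made up for illustration, and this only covers the entity conversion step, not the parsing that follows it.

```python
import re

# Matches hexadecimal character references such as &#x6F22; (x or X, hex digits only).
HEX_ENTITY = re.compile(r"&#[xX]([0-9A-Fa-f]+);")


def hex_to_decimal_entities(text: str) -> str:
    """Rewrite every hex character reference as its decimal equivalent."""
    return HEX_ENTITY.sub(lambda m: "&#{};".format(int(m.group(1), 16)), text)


print(hex_to_decimal_entities("<b>&#x6F22;</b>"))  # <b>&#28450;</b>
```

Anything that is not a well-formed hex reference, including the tags, is left exactly as it was.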


I’m assuming that your emails are in HTML, not plain text. I think you are looking for this:

some_unicode_string.encode('ascii', errors='xmlcharrefreplace')

But maybe you can do this some other way. How do you generate the HTML in the first place?
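
To make the `xmlcharrefreplace` suggestion concrete, here is a minimal Python 3 sketch. It assumes the HTML is already a `str`, and it relies on tag and attribute names being plain ASCII, so only the non-ASCII characters are rewritten. Both function names are hypothetical. The second function is an extra, not part of the suggestion above: it uses a regex to emit the hexadecimal form mentioned in the question.

```python
import re


def encode_non_ascii(html_text: str) -> str:
    """Replace each non-ASCII character with a decimal reference like &#28450;."""
    return html_text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


def encode_non_ascii_hex(html_text: str) -> str:
    """Same idea, but emit hexadecimal references like &#x6F22;."""
    return re.sub(
        r"[^\x00-\x7F]",
        lambda m: "&#x{:X};".format(ord(m.group())),
        html_text,
    )


print(encode_non_ascii("<b>漢</b>"))      # <b>&#28450;</b>
print(encode_non_ascii_hex("<b>漢</b>"))  # <b>&#x6F22;</b>
```

Because `<`, `>`, `&` and the tag names are ASCII, the markup passes through untouched and no parser is needed for this step. Non-ASCII text inside attribute values is encoded too, which HTML allows.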
