Encode unicode chars to HTML entities in Python, excluding tags
As you may know, for an HTML email to display correctly in many clients, all non-ASCII Unicode characters must be encoded as entities. I would like to automate this encoding in a Python script.
Obviously tags need to be excluded from conversion, otherwise the HTML won't work. Doing this is really the complicated part - to be sure of success it is necessary to use a parsing package like lxml or BeautifulSoup.
As far as I know, neither of those two packages supports converting to numeric character references such as &#x6F22; (漢).
Any help would be really invaluable, I've been banging my head against this wall all day!
I’ve had a similar problem; however, it was always enough to run the following expression on the raw text. It just converts hex entities to decimal entities, which are then parsed just fine:
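>>> import re
>>> from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 API (Python 2)
>>> hex_entity_pat = re.compile(r'&#x([0-9a-fA-F]+);')  # match only valid hex digits
>>> hex_entity_fix = lambda x: hex_entity_pat.sub(lambda m: '&#%d;' % int(m.group(1), 16), x) # convert hex to dec entities
>>> BeautifulSoup(hex_entity_fix("<b>&#x6F22;</b>"), convertEntities=BeautifulSoup.ALL_ENTITIES)
<b>漢</b>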
I’m assuming that your emails are in HTML, not plain text. I think you are looking for this:
some_unicode_string.encode('ascii', errors='xmlcharrefreplace')
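To illustrate why this leaves the tags alone: markup is plain ASCII, so only the non-ASCII text gets replaced. Here is a minimal sketch, assuming Python 3; the helper names `to_decimal_entities` and `to_hex_entities` are made up for this example, and the regex-based hex variant is just one way to get the `&#x...;` form the question asks about.

```python
import re

def to_decimal_entities(html):
    # Hypothetical helper: ASCII (including tags) passes through untouched;
    # every non-ASCII character becomes a decimal reference like &#28450;
    return html.encode('ascii', errors='xmlcharrefreplace').decode('ascii')

def to_hex_entities(html):
    # Hypothetical helper: same idea, but emits hexadecimal references
    return re.sub(r'[^\x00-\x7f]', lambda m: '&#x%X;' % ord(m.group()), html)

print(to_decimal_entities('<b>漢</b>'))  # <b>&#28450;</b>
print(to_hex_entities('<b>漢</b>'))      # <b>&#x6F22;</b>
```

Non-ASCII characters inside attribute values are encoded too, which is still valid HTML.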
But maybe you can do this some other way. How do you generate the HTML in the first place?