Decoding html content and HTMLParser

I'm creating a subclass based on 'HTMLParser' to pull out HTML content. Whenever I have character refs such as

'&#160;' '&#38;' '&#8211;' '&#8230;'

I'd like to replace them with their English counterparts of

' ' (space), '&', '-', '...', and so on.

What's the best way to convert some of the simple character refs into their correct representation?

My text is similar to:

Some text goes here&#38;after that, 6:30 pm&#8211;8:45pm and maybe 
something like &#8230;

I would like to convert this to:

Some text goes here & after that, 6:30 pm-8:45pm and maybe 
something like ...


Your question has two parts. The easy part is decoding the HTML entities. The easiest way to do that is to grab this undocumented but long-stable method from the HTMLParser module:

>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape('a &lt; &eacute; &ndash; &hellip;')
u'a < é – …'
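On Python 3 the same decoding is available as the documented function html.unescape (added in 3.4), and HTMLParser itself decodes character references before calling handle_data when convert_charrefs is true, which is the default. The following is a minimal sketch under those assumptions; the TextExtractor class and extract_text helper are hypothetical names used only for illustration, not part of the original answer.

```python
import html
from html.parser import HTMLParser

# html.unescape handles named and numeric character references alike.
decoded = html.unescape("a &lt; &eacute; &ndash; &hellip;")


class TextExtractor(HTMLParser):
    """Hypothetical subclass that collects the text content of a document."""

    def __init__(self):
        # convert_charrefs=True (the default) means handle_data receives
        # text with character references already decoded.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return "".join(parser.chunks)


if __name__ == "__main__":
    print(decoded)
    print(extract_text("<p>6:30 pm&#8211;8:45pm and maybe something like &#8230;</p>"))
```

With this approach the subclass does not need its own handle_charref or handle_entityref methods.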

The second part, converting Unicode characters to ASCII lookalikes, is trickier and also quite questionable. I would try to retain the Unicode en-dash ‘–’ and similar typographical niceties, rather than convert them down to characters like the plain hyphen and straight quotes. Unless your application can't handle non-ASCII characters at all, you should aim to keep them as they are, along with all other Unicode characters.

The specific case of the U+2026 ellipsis character is potentially different because it's a ‘compatibility character’, included in Unicode only for lossless round-tripping to other encodings that feature it. Preferably you'd just type three dots, and let the font's glyph combination logic work out exactly how to draw it.

If you want to just replace compatibility characters (like this one, explicit ligatures, the Japanese fullwidth numbers, and a handful of other oddities), you could try normalising your string to Normalization Form KC (NFKC):

>>> import unicodedata
>>> unicodedata.normalize('NFKC', u'a < é – …')
u'a < é – ...'

(Take care, though: some other characters that you might have wanted to keep are also compatibility characters, including ‘²’.)
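To make the effect of NFKC concrete, here is a small Python 3 sketch. The sample string is made up for illustration: it mixes an accented letter and an en-dash, which survive, with an ellipsis, a superscript two, an 'fi' ligature and a fullwidth digit, which are all compatibility characters and get folded.

```python
import unicodedata

# Illustrative sample (an assumption, not from the question):
# e-acute, en-dash, ellipsis, superscript two, 'fi' ligature, fullwidth '1'.
sample = "a < \u00e9 \u2013 \u2026 x\u00b2 \ufb01 \uff11"

folded = unicodedata.normalize("NFKC", sample)

if __name__ == "__main__":
    print(folded)
```

The accented letter and the en-dash come through unchanged, while the superscript two becomes a plain '2', which may or may not be what you want.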

The next step would be to turn letters with diacriticals into plain letters, which you could do by normalising to NFKD instead and then removing all characters that have the ‘combining’ character class from the string. That would give you plain ASCII for the previously-accented Latin letters, albeit in a way that is not linguistically correct for many languages. If that's all you care about you could encode straight to ASCII:

>>> unicodedata.normalize('NFKD', u'a < é – …').encode('us-ascii', 'ignore')
'a < e  ...'
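Encoding with 'ignore' also silently drops every remaining non-ASCII character, such as the en-dash above. If you only want to strip the diacriticals and keep everything else, the "remove combining characters" step can be sketched in Python 3 as follows. The helper name strip_accents is hypothetical, and the check relies on unicodedata.combining, which returns a non-zero canonical combining class for the usual Latin accent marks.

```python
import unicodedata


def strip_accents(text):
    """Decompose to NFKD, then drop characters with a non-zero combining class."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_accents("caf\u00e9 \u2013 \u2026"))
```

Here the en-dash is retained, the ellipsis becomes three dots (because of the compatibility decomposition), and the accent is removed.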

Anything further you might do would have to be ad-hoc as there is no accepted standard for folding strings down to ASCII. Windows has one implementation, as does Lucene (ASCIIFoldingFilter). The results are pretty variable.
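As one example of such an ad-hoc approach, a hand-written translation table can cover the handful of characters mentioned in the question. This is only a sketch in Python 3: the table name, the helper name and the particular choice of replacements are assumptions for illustration, not an accepted standard.

```python
# Hand-picked fallbacks; extend or change to suit your own data.
ASCII_FALLBACKS = str.maketrans({
    "\u00a0": " ",    # no-break space
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2026": "...",  # horizontal ellipsis
})


def fold_punctuation(text):
    """Replace the characters listed in ASCII_FALLBACKS; leave the rest alone."""
    return text.translate(ASCII_FALLBACKS)


if __name__ == "__main__":
    print(fold_punctuation("6:30 pm\u20138:45pm and maybe\u00a0something like \u2026"))
```

Characters not listed in the table, including accented letters, pass through untouched, so this can be combined with the normalisation steps above if needed.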
