Python Markdown module choking on unicode conversion, utf-8
I'm using the markdown module from web2py to handle marked-up text. The problem is that people are submitting text with smart quotes, special characters, etc., and I need to replace those with their equivalents.
I have text like this: '\n\r\nThe Colonels face paled a bit. \xe2\x80\x9cBut, then \xe2\x80" excuse my boldness, sir \xe2\x80" our going to Uvar now'
How do I ensure that calling unicode(txt, 'utf-8'), as markdown does internally on the text, will not throw an error? The fancy quotes that word processors insert are the usual cause, but many other characters seem to cause problems too.
The \xe2\x80\x9c is U+201C LEFT DOUBLE QUOTATION MARK (a "smart quote") when decoded as UTF-8. The two occurrences of \xe2\x80" are not valid UTF-8 sequences, and the presence there of a " (a "dumb" quote) is suspicious. You appear to have a mangling problem or an encoding problem, or both. We need to sort that out before we get to the task of replacing e.g. smart quotes by dumb quotes.
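To see the difference between the two byte sequences concretely, here is a minimal sketch. It uses a shortened excerpt of the bytes from the question, and first_utf8_error is a hypothetical helper name introduced only for this illustration, not part of markdown or web2py. It reports where decoding fails instead of letting the exception propagate.

```python
# -*- coding: utf-8 -*-
# Byte strings taken from the question (shortened excerpt).
# b'' and u'' literals are used so the sketch runs on Python 2.6+ and 3.3+.
smart_quote = b'\xe2\x80\x9c'
mangled = b'But, then \xe2\x80" excuse my boldness'

# A complete three-byte sequence decodes cleanly to U+201C.
decoded_quote = smart_quote.decode('utf-8')


def first_utf8_error(raw):
    """Return (start, reason) for the first invalid UTF-8 span, or None.

    Hypothetical diagnostic helper: it only locates the problem,
    it does not repair the data.
    """
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError as exc:
        return exc.start, exc.reason
    return None


# The truncated sequence \xe2\x80 followed by an ASCII quote is rejected.
problem = first_utf8_error(mangled)
if problem is not None:
    start, reason = problem
    # Show a few bytes around the failure to help trace where mangling occurred.
    context = repr(mangled[max(0, start - 5):start + 5])
```

Running a check like this on the raw input at each stage (as received from the browser, as stored, as handed to markdown) should show which step turns a valid three-byte sequence into a broken one.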
Exactly how are "people submitting stuff"? What transformations has it gone through before markdown does unicode(txt, 'utf-8')?