Strip out special characters
I pull some data from an HTML page with a list of products, and for some text it looks like this:
Organicâ„¢
In the HTML page, when I look at that same text, I can see it's supposed to read Organic with the TM (Trade Mark) symbol after it. Why does it look like the above?
My main question is: how can I get rid of the TM, @ and Copyright symbols so I am just left with a clean name of the product?
Thanks all for any help
Your page has the wrong character set declared (or no character set declared at all).
View the source HTML and see if in the head section there is a tag like <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
If there's no such tag, or the tag is there but the charset bit is missing, you haven't declared a character set. If the tag is there and the charset bit is present, the declared character set is wrong.
Looking at the specific example you gave, it looks like the text is in UTF-8 but is being displayed as latin-1 (or the closely related Windows-1252).
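To see why a wrong charset produces exactly this kind of garbage, here is a minimal Python sketch. It assumes the page bytes are UTF-8 and that whatever reads them uses Windows-1252 (`cp1252`); that codec is a guess based on the sample text, not something confirmed in the question.

```python
# The trademark sign U+2122 is three bytes in UTF-8: E2 84 A2.
original = "Organic\u2122"
utf8_bytes = original.encode("utf-8")

# Decoding those bytes with the wrong codec (assumed here to be cp1252)
# turns each byte into its own character, giving the garbled text.
garbled = utf8_bytes.decode("cp1252")

# Undoing the wrong step and decoding as UTF-8 recovers the intended string.
repaired = garbled.encode("cp1252").decode("utf-8")

print(repr(garbled))
print(repaired == original)
```

The round trip only works if the wrong codec is guessed correctly, so fixing the declared charset at the source is the more reliable route.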
It's an encoding issue: there's a mismatch between your HTML page encoding and your output device encoding.
You'll have to make them consistent. The best approach is to have your working environment in UTF-8, and to convert all external data into UTF-8.
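Once the text is decoded correctly, stripping the symbols is straightforward. The following is a rough Python sketch, not a definitive solution: the helper names `to_text` and `clean_product_name` and the sample bytes are made up for illustration, and including the registered sign alongside TM, @ and copyright is an assumption about what should be removed.

```python
import re

# Symbols to drop: trademark, registered (assumed), copyright, and "@".
UNWANTED = "\u2122\u00ae\u00a9@"

def to_text(raw: bytes, declared_charset: str = "utf-8") -> str:
    """Decode fetched bytes once, using the page's actual charset."""
    return raw.decode(declared_charset)

def clean_product_name(name: str) -> str:
    """Remove the unwanted symbols and collapse leftover whitespace."""
    stripped = name.translate({ord(ch): None for ch in UNWANTED})
    return re.sub(r"\s+", " ", stripped).strip()

# Hypothetical sample: UTF-8 bytes as they might arrive from the page.
raw = b"Organic\xe2\x84\xa2 Oats \xc2\xa9 2010"
print(clean_product_name(to_text(raw)))
```

Decoding first matters here: if the bytes are read with the wrong charset, the trademark sign never appears as a single character and there is nothing clean to strip.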