ANSI, ASCII, Unicode and encoding confusion with Python
I was happily using BeautifulSoup and I'm also using a text file as input parameters of my Python script.
I then came across the famous "UnicodeEncodeError" error.
I've been reading questions here at SO but I'm still confused.
What does ASCII have to do with all of this? Which encoding should I use in my text editor (Notepad++)? ANSI? UTF-8? Decoding a string to ASCII doesn't always seem to work (I'm guessing the string coming from BeautifulSoup is in a different encoding). How do I fix this?
Anyway, any help and clarification would be greatly appreciated.
Thanks!
Edit: BeautifulSoup's docs say that it only uses unicode, but I'm still getting Unicode errors :(
  File "C:\Python26\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u300d' in position 3: character maps to <undefined>
ANSI is not a character encoding (in common parlance it refers to certain escape sequences, though it is of course also the acronym for the American National Standards Institute). You can set the encoding in Notepad++ (and check which encoding you're using); hopefully UTF-8, because that is a universal encoding (it can represent any Unicode code point). You build unicode from your UTF-8 encoded text with an explicit decode method call, or you read the file as unicode with codecs.open (both require you to specify the encoding name; again, hopefully 'utf8').
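A minimal sketch of the two routes described above, written so that it runs on Python 2.6+ and Python 3 alike. The file name params.txt and the sample text are made up for illustration, and io.open is used here in place of codecs.open as an assumption on my part; both take an encoding name and return unicode text.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical sample text containing a non-ASCII character (U+300D).
sample = u"value\u300d"

# Write the sample out as UTF-8 bytes so there is a file to read back.
path = os.path.join(tempfile.mkdtemp(), "params.txt")
with io.open(path, "wb") as f:
    f.write(sample.encode("utf-8"))

# Route 1: read the raw bytes, then decode them explicitly.
with io.open(path, "rb") as f:
    raw = f.read()
text_from_decode = raw.decode("utf-8")

# Route 2: name the encoding when opening and let the file object decode.
with io.open(path, "r", encoding="utf-8") as f:
    text_from_open = f.read()

print(text_from_decode == text_from_open)
```

Either way, the encoding name has to match what the editor actually saved; if Notepad++ wrote the file in some other encoding, 'utf-8' here would be the wrong name.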
What does ASCII got to do with all of these?
Python has no way to find out which encoding was used to store the text, so it assumes ASCII by default. However, ASCII defines only the first 128 characters, so anything outside that range results in a decode error (which is actually a good thing, since it stops you from passing incorrectly decoded strings around).
Most of the time your string will be in UTF-8, since it's the most common way to encode Unicode, so it's usually safe to call s.decode('utf-8') on str type strings (or use a unicode(s, 'utf-8') call).
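To make the difference concrete, here is a small sketch with a made-up sample string. It is written to run on both Python 2 and Python 3; the unicode(s, 'utf-8') spelling exists only on Python 2, so it is left out.

```python
# -*- coding: utf-8 -*-
# UTF-8 bytes for u"caf\u00e9"; on Python 2 this value is a plain str.
s = u"caf\u00e9".encode("utf-8")

# ASCII covers only code points 0-127, so the two bytes that encode
# the accented letter make an ASCII decode fail.
try:
    s.decode("ascii")
    ascii_worked = True
except UnicodeDecodeError:
    ascii_worked = False

# Naming the real encoding gives back the original text.
text = s.decode("utf-8")

print(ascii_worked)  # False
print(len(s))        # 5 (bytes)
print(len(text))     # 4 (characters)
```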
If you don't know in advance what encoding the text has, and it provides no encoding metadata, you can try using the chardet module.
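If installing chardet is not an option, a much cruder stand-in is to try a short list of candidate encodings in order. The sketch below is not chardet and does not use its API; the helper name decode_with_fallback and the candidate list are assumptions chosen for illustration.

```python
# -*- coding: utf-8 -*-

def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Return (text, encoding_name) for the first candidate that decodes.

    Hypothetical helper: it only proves that the bytes are *valid* in
    the chosen encoding, not that the encoding is the right one.
    """
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


# UTF-8 bytes decode with the first candidate.
utf8_result = decode_with_fallback(u"caf\u00e9".encode("utf-8"))

# A lone 0xE9 byte is not valid UTF-8, so the helper falls through
# to cp1252, where 0xE9 is the accented letter.
cp1252_result = decode_with_fallback(b"caf\xe9")
```

Order matters: UTF-8 goes first because it rejects most byte sequences that are not really UTF-8, whereas single-byte code pages accept almost anything.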
BeautifulSoup can output its result in different encodings and ways, so you just need to specify that you want unicode there.
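Note that the traceback in the question is an encode error rather than a decode error: a unicode string is being converted to bytes with the cp437 codec, which has no mapping for u'\u300d'. Where that conversion is triggered is not visible in the truncated traceback, so the sketch below only reproduces the codec behaviour itself and shows two possible ways around it.

```python
# -*- coding: utf-8 -*-
ch = u"\u300d"  # the character named in the traceback

# cp437 has no slot for this character, so a strict encode fails.
try:
    ch.encode("cp437")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# UTF-8 can represent any code point.
as_utf8 = ch.encode("utf-8")

# If the target really has to be cp437, unmappable characters can be
# substituted instead of raising.
as_cp437 = ch.encode("cp437", "replace")

print(encodable)  # False
```

The 'replace' handler loses information (the character becomes a question mark), so it is only a fit for display purposes, not for data that has to round-trip.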
As of now (2014-01-23), there still seem to be a lot of recent/unresolved bug reports and discussions for Notepad++ (NPP) regarding the use of ANSI as a Notepad++ encoding term.
PROBLEM
Google: notepad++ ansi encoding
Results:
#4095 "ANSI as UTF-8" Misleading
#124 ansi encoding and german letters
The encoding that Notepad++ just calls “ANSI”, does anyone know what to call it for Ruby?
Notepad++ Forum - Search discussion: ANSI encoding
SOLUTION
The following NPP forum discussion seems to point to the best solution for me.
See Encoding detection, ANSI (Windows 1252) vs. UTF-8 (w/o BOM)
In Preferences -> New Document -> Encoding, under "UTF-8 without BOM", there is an option called "Apply to opened ANSI files".
I CHECKED that option, as OPPOSED to the author, who UNchecked it.
Then I begin my Python script as follows.
#!/usr/bin/python
# -*- coding: utf-8 -*-