Python: what does "...".encode("utf8") fix?
I wante开发者_开发百科d to url encode a python string and got exceptions with hebrew strings.
I couldn't fix it and started doing some guess oriented programming.
Finally, doing mystr = mystr.encode("utf8")
before sending it to the url encoder saved the day.
Can somebody explain what happened? What does .encode("utf8") do? My original string was a unicode string anyways (i.e. prefixed by a u).
My original string was a unicode string anyways (i.e. prefixed by a u)
...which is the problem. It wasn't a "string", as such, but a "Unicode object". It contains a sequence of Unicode code points. These code points must, of course, have some internal representation that Python knows about, but whatever that is is abstracted away and they're shown as those \uXXXX
entities when you print repr(my_u_str)
.
To get a sequence of bytes that another program can understand, you need to take that sequence of Unicode code points and encode it. You need to decide on the encoding, because there are plenty to choose from. UTF8 and UTF16 are common choices. ASCII could be too, if it fits. u"abc".encode('ascii')
works just fine.
Do my_u_str = u"\u2119ython"
and then type(my_u_str)
and type(my_u_str.encode('utf8'))
to see the difference in types: The first is <type 'unicode'>
and the second is <type 'str'>
. (Under Python 2.5 and 2.6, anyway).
Things are different in Python 3, but since I rarely use it I'd be talking out of my hat if I tried to say anything authoritative about it.
You original string was a unicode object containing raw Unicode code points, after encoding it as UTF-8 it is a normal byte string that contains UTF-8 encoded data.
The URL encoder seems to expect a byte string, so that it can URL-encode one byte after another and doesn't have to deal with Unicode code points. When you give it a unicode object, it tries to convert it to a byte string using some default encoding, probably ASCII. For Hebrew characters that cannot be represented as ASCII, this will lead to errors.
What does .encode("utf8") do?
It depends on which version of Python you're using:
- In Python 3.x, it converts a
str
object (encoded in UTF-16 or UTF-32) into abytes
object containing the UTF-8 representation of the string. - In Python 2.x, it converts a
unicode
object into astr
object encoded in UTF-8. Butstr
has anencode
method too, and writing'...'.encode('UTF-8')
is equivalent to writing'...'.decode('ascii').encode('UTF-8')
.
Since you mentioned the "u" prefix, you must be using 2.x. If you don't require any 2.x-only libraries, I'd recommend switching to 3.x, which has a nice clear distinction between text and binary data.
Dive into Python 3 has a good explanation of the issue.
Can somebody explain what happened?
It would help if you told us what the error message was.
The urllib.quote
function expects a str
object. It also happens to work with unicode
objects that contain only ASCII characters, but not when they contain Hebrew letters.
In Python 3.x, urllib.parse.quote
accepts both str
(=Python 2.x unicode
) and bytes
objects. Strings are automatically encoded in UTF-8.
"...".encode("utf-8") transforms the in-memory representation of the string into an UTF-8 -encoded string.
url encoder likely expected a bytestring, that is, string representation where each character is represented with a single byte.
It returns a UTF-8 encoded version of the Unicode string, mystr. It is important to realize that UTF-8 is simply 1 way of encoding Unicode. Python can work with many other encodings (eg. mystr.encode("utf32") or even mystr.encode("ascii")).
The link that balpha posted explains it all. In short:
The fact that your string was prefixed with "u" just means it's composed of Unicode characters (or code points). UTF-8 is an encoding of this string into a sequence of bytes.
精彩评论