Python: what does "...".encode("utf8") fix?

2023-01-08 14:54 问答作者：

I wante开发者_开发百科d to url encode a python string and got exceptions with hebrew strings. I couldn't fix it and started doing some guess oriented programming. Finally, doing mystr = mystr.encode("utf8") before sending it to the url encoder saved the day.

Can somebody explain what happened? What does .encode("utf8") do? My original string was a unicode string anyways (i.e. prefixed by a u).

My original string was a unicode string anyways (i.e. prefixed by a u)

...which is the problem. It wasn't a "string", as such, but a "Unicode object". It contains a sequence of Unicode code points. These code points must, of course, have some internal representation that Python knows about, but whatever that is is abstracted away and they're shown as those \uXXXX entities when you print repr(my_u_str).

To get a sequence of bytes that another program can understand, you need to take that sequence of Unicode code points and encode it. You need to decide on the encoding, because there are plenty to choose from. UTF8 and UTF16 are common choices. ASCII could be too, if it fits. u"abc".encode('ascii') works just fine.

Do my_u_str = u"\u2119ython" and then type(my_u_str) and type(my_u_str.encode('utf8')) to see the difference in types: The first is <type 'unicode'> and the second is <type 'str'>. (Under Python 2.5 and 2.6, anyway).

Things are different in Python 3, but since I rarely use it I'd be talking out of my hat if I tried to say anything authoritative about it.

You original string was a unicode object containing raw Unicode code points, after encoding it as UTF-8 it is a normal byte string that contains UTF-8 encoded data.

The URL encoder seems to expect a byte string, so that it can URL-encode one byte after another and doesn't have to deal with Unicode code points. When you give it a unicode object, it tries to convert it to a byte string using some default encoding, probably ASCII. For Hebrew characters that cannot be represented as ASCII, this will lead to errors.

What does .encode("utf8") do?

It depends on which version of Python you're using:

In Python 3.x, it converts a str object (encoded in UTF-16 or UTF-32) into a bytes object containing the UTF-8 representation of the string.
In Python 2.x, it converts a unicode object into a str object encoded in UTF-8. But str has an encode method too, and writing '...'.encode('UTF-8') is equivalent to writing '...'.decode('ascii').encode('UTF-8').

Since you mentioned the "u" prefix, you must be using 2.x. If you don't require any 2.x-only libraries, I'd recommend switching to 3.x, which has a nice clear distinction between text and binary data.

Dive into Python 3 has a good explanation of the issue.

Can somebody explain what happened?

It would help if you told us what the error message was.

The urllib.quote function expects a str object. It also happens to work with unicode objects that contain only ASCII characters, but not when they contain Hebrew letters.

In Python 3.x, urllib.parse.quote accepts both str (=Python 2.x unicode) and bytes objects. Strings are automatically encoded in UTF-8.

"...".encode("utf-8") transforms the in-memory representation of the string into an UTF-8 -encoded string.

url encoder likely expected a bytestring, that is, string representation where each character is represented with a single byte.

It returns a UTF-8 encoded version of the Unicode string, mystr. It is important to realize that UTF-8 is simply 1 way of encoding Unicode. Python can work with many other encodings (eg. mystr.encode("utf32") or even mystr.encode("ascii")).

The link that balpha posted explains it all. In short:

The fact that your string was prefixed with "u" just means it's composed of Unicode characters (or code points). UTF-8 is an encoding of this string into a sequence of bytes.

继续阅读：python python-2.x unicode utf-8

Python: what does "...".encode("utf8") fix?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？