Is there any harm to encoding (with the same encoding format) a string multiple times? (in Python)

2023-01-29 21:43 Q&A author:

Is there any harm in encoding a string multiple times in Python, with the same encoding format (e.g., UTF-8)?

I have a function that uses another function to get a string from a document, and then serializes the string. Currently, the only user of the second function (the one which gets the string from the document) is the first function.

This might change in the future, and someone might decide to use it in another serialization (or similar) function without encoding its result to UTF-8 first. I'm wondering if it's safe to always return a UTF-8 encoded string from it (at the moment, this string will also be re-.encode()'d by the serialization function). My testing indicates this isn't a problem, but I figured I'd ask.

Thank you!

You can't encode multiple times, it doesn't work.

>>> s = u"ä".encode('latin1')
>>> s = s.encode('latin1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

See, you get "'ascii' codec can't decode". What the encode method on a Python 2 byte string does is first decode the string to Unicode, and then encode it again with the given encoding. It decodes with the default encoding, which is ASCII unless it has been changed.

That behavior is unexpected and gone in Python 3, btw, where bytes doesn't have an encode method and strings don't have a decode method.

So you simply can't encode it multiple times, and of course that's because encoding an encoded string simply doesn't make any sense. Encoding is converting from Unicode to a binary representation, and you can't further encode a binary representation.
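For comparison, here is a minimal sketch of the same experiment under Python 3 (the variable names are only illustrative). There the second call never reaches a codec at all, because bytes objects simply have no encode method:

```python
text = "ä"
data = text.encode("latin-1")    # str -> bytes, a single byte 0xe4

try:
    data.encode("latin-1")       # bytes has no .encode() in Python 3
    double_encode_error = None
except AttributeError as exc:
    double_encode_error = exc

print(type(data).__name__, data)
print(double_encode_error)
```

So in Python 3 the mistake shows up as an AttributeError instead of a confusing UnicodeDecodeError.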

Unless the string is pure ASCII, yes, it can cause harm (and if it's pure ASCII, you don't need to worry about UTF-8):

>>> a
u'a \xd7 b'
>>> a.encode("utf-8")
'a \xc3\x97 b'
>>> a.encode("utf-8").encode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)

It's good practice to treat byte sequences and text as two different things. In Python 3, they are different things: bytes objects have the decode() method, and string (unicode) objects have an encode() method.

In general, you should only call encode on unicode objects and only call decode on byte string objects (str in 2.x).

encode encodes a Unicode object into a given encoding (stored as a string). decode decodes a given encoding back into a Unicode object.

The existence of string.encode and unicode.decode in 2.x should be treated as a historical artifact.
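If a function really has to accept both text and already-encoded data, one option is to check the type explicitly rather than calling encode blindly. This is only a sketch in Python 3: the name to_utf8 is hypothetical, and it assumes that any bytes passed in are already UTF-8, which the code itself cannot verify.

```python
def to_utf8(value):
    """Return UTF-8 bytes for text; pass existing bytes through unchanged.

    Assumption: bytes given to this helper are already UTF-8 encoded.
    """
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        return value.encode("utf-8")
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


once = to_utf8("a \u00d7 b")
twice = to_utf8(once)            # second call is a no-op, not a re-encode
print(once, twice)
```

For the situation in the question, it is probably simpler to have the document-reading function return text and let each serializer encode exactly once.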

Well, if you have a stream of bytes that is UTF-8-encoded text, interpret it as a string encoded in something else, and then re-encode it as UTF-8, then you have a problem.

If you read it as UTF-8 again (since you cannot treat bytes as text without an encoding, certainly), then you have Unicode, which, when written as UTF-8 again will look the same as before.
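A short Python 3 sketch of that harmless round trip, using an arbitrary sample string:

```python
original = "Fööbär".encode("utf-8")   # bytes as they would sit in a file
text = original.decode("utf-8")       # read with the correct encoding
written = text.encode("utf-8")        # write it out as UTF-8 again

print(written == original)
```

Decoding and encoding with the same codec gives back the same bytes.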

Just be careful not to mess around with the encodings too much. A common error is to interpret UTF-8 encoded text as Latin-1, thereby turning Fööbär into FÃ¶Ã¶bÃ¤r, which of course won't turn back into the original when it is encoded again.
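The Latin-1 mix-up can be reproduced in a few lines of Python 3. The variable names are illustrative, and the repair step at the end only works because the wrong codec is known to be Latin-1 in this example:

```python
raw = "Fööbär".encode("utf-8")        # correct UTF-8 bytes
misread = raw.decode("latin-1")       # wrong codec: every byte becomes a character
damaged = misread.encode("utf-8")     # the mojibake is now stored as valid UTF-8

print(misread)
print(len(raw), len(damaged))

# Undo the damage by reversing the wrong step, then decoding correctly.
repaired = damaged.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired)
```

Note that no step raises an error: Latin-1 maps every byte to some character, so the corruption goes unnoticed until someone looks at the text.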

Note the difference between text (the actual thing you care about) and the encoded text, which is just a bunch of bytes plus the knowledge of how to turn them into text again. If you treat the latter as the former, problems arise. If you convert properly from one representation to the other, it's fine.
