Is there any harm to encoding (with the same encoding format) a string multiple times? (in Python)
Is there any harm in encoding a string multiple times in Python with the same encoding (e.g., UTF-8)?
I have a function that uses another function to get a string from a document and then serializes the string. Currently, the only user of the second function (the one which gets the string from the document) is the first function.
This might change in the future, and someone might decide to use it in another serialization (or similar) function without encoding its result to UTF-8 first. I'm wondering if it's safe to always return a UTF-8 encoded string from it (this string will also be re-.encode()'d by the serialization function, at the moment). My testing indicates this isn't a problem, but I figured I'd ask.
Thank you!
You can't encode multiple times, it doesn't work.
>>> s = u"ä".encode('latin1')
>>> s = s.encode('latin1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
See, you get "ascii codec can't decode". What the encode method on a byte string does is first decode the string to Unicode and then encode it again with the given encoding. It decodes with the system default encoding, which by default is ASCII.
That behavior is unexpected and is gone in Python 3, btw, where bytes don't have an encode method and strings don't have a decode method.
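The hidden decode step can be spelled out explicitly. Here is a minimal sketch, written for Python 3 (where the implicit decode no longer exists), that mimics what Python 2 did behind the scenes. The helper name `py2_style_encode` is made up for illustration and is not part of any library:

```python
def py2_style_encode(byte_string, encoding, default_encoding="ascii"):
    """Mimic Python 2's str.encode(): decode implicitly, then encode.

    Hypothetical helper for illustration only. Python 2 used its default
    encoding (ASCII unless reconfigured) for the hidden decode.
    """
    text = byte_string.decode(default_encoding)  # the hidden step that fails
    return text.encode(encoding)


encoded = "\xe4".encode("latin1")  # the character "ä" as Latin-1 bytes

try:
    py2_style_encode(encoded, "latin1")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print(exc)

# Pure-ASCII bytes survive the hidden decode, so nothing goes wrong there.
ascii_result = py2_style_encode(b"abc", "latin1")
```

This also suggests why testing with ASCII-only data can make double encoding look harmless: the failure only shows up once a non-ASCII byte is present.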
So you simply can't encode it multiple times, and of course that's because encoding an encoded string simply doesn't make any sense. Encoding is converting from Unicode to a binary representation, and you can't further encode a binary representation.
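For comparison, a short sketch of how Python 3 enforces this separation (variable names are arbitrary): the second encode is not merely an error at runtime on some inputs, the method does not exist at all.

```python
text = "\xe4"                 # the character "ä" as text (str)
data = text.encode("latin1")  # the same character as bytes

# In Python 3, bytes has no encode() and str has no decode().
bytes_can_encode = hasattr(data, "encode")
str_can_decode = hasattr(text, "decode")

try:
    data.encode("latin1")
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```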
Yes, it can cause harm unless the string is pure ASCII (and if it's pure ASCII, you don't need to worry about UTF-8):
>>> a
u'a \xd7 b'
>>> a.encode("utf-8")
'a \xc3\x97 b'
>>> a.encode("utf-8").encode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)
It's good practice to treat byte sequences and text as two different things. In Python 3, they are different things: bytes objects have the decode() method, and string (Unicode) objects have an encode() method.
In general, you should only call encode on unicode objects and only call decode on string objects.
encode encodes a Unicode object into a given encoding (stored as a string). decode decodes a given encoding back into a Unicode object.
The existence of string.encode and unicode.decode in 2.x should be treated as a historical artifact.
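Applied to the situation in the question, one way to follow this advice is to have the inner function always return text and to encode only once, where the data leaves the program. The sketch below is written for Python 3. The names `get_text` and `serialize`, and the dict-shaped `document`, are assumptions for illustration, not the asker's actual code:

```python
def get_text(document):
    """Return the document's string as text (str), never as encoded bytes.

    `document` is assumed to be a dict with a "body" entry purely for
    illustration; the real lookup is whatever the original function does.
    """
    value = document["body"]
    if isinstance(value, bytes):
        # Decode exactly once, at the point where bytes enter the program.
        value = value.decode("utf-8")
    return value


def serialize(document):
    """Encode exactly once, at the point where text leaves the program."""
    return get_text(document).encode("utf-8")


doc_from_text = {"body": "a \xd7 b"}
doc_from_bytes = {"body": b"a \xc3\x97 b"}
```

With this split, any future caller of `get_text` gets text and can choose its own encoding, so nobody has to remember whether the result was already encoded.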
Well, if you have a stream of bytes that is UTF-8-encoded text, interpret it as a string encoded in something else, and then re-encode it as UTF-8, then you have a problem.
If you read it as UTF-8 again (since you cannot treat bytes as text without an encoding, certainly), then you have Unicode, which, when written as UTF-8 again will look the same as before.
Just be careful not to mess around with the encodings too much. A common error is to interpret UTF-8 encoded text as Latin 1, thereby turning Fööbär into FÃ¶Ã¶bÃ¤r, which then of course stays that way.
Note the difference between text (the actual thing you care about) and encoded text, which is just a bunch of bytes plus the knowledge of how to turn them into text again. If you treat the latter as the former, problems arise. If you convert properly from one representation to the other, it's fine.
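The Latin 1 mix-up described above can be reproduced in a few lines. This is a small Python 3 sketch with arbitrary variable names; escape sequences are used so the example does not depend on the source file's own encoding:

```python
original = "F\xf6\xf6b\xe4r"  # the text "Fööbär"
utf8_bytes = original.encode("utf-8")

# Correct round trip: decode with the encoding that was used to encode.
round_tripped = utf8_bytes.decode("utf-8")

# The mistake: read UTF-8 bytes as Latin 1, then write them out as UTF-8.
misread = utf8_bytes.decode("latin-1")
double_encoded = misread.encode("utf-8")

print(round_tripped)
print(misread)
```

Because Latin 1 maps every byte to a character, the misread never raises an error, which is why this kind of damage tends to go unnoticed until someone looks at the output.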