How to use split with UTF-8 encoding?
I use the built-in split function and I have a problem:
>>> data = "test, ąśżźć, test2"
>>> splitted_data = data.split(",")
>>> print splitted_data
['test', ' \xc4\x85\xc5\x9b\xc5\xbc\xc5\xba\xc4\x87', ' test2']
Why does this happen? What should I do to prevent it?
Python 2.7.1
That's purely the output you get from str.__repr__ (calling repr() on a string). The \xc4 etc. are just the actual bytes the string is stored as. When you print it, it's still the same text:
>>> data = "test, ąśżźć, test2"
>>> data
'test, \xc4\x85\xc5\x9b\xc5\xbc\xc5\xba\xc4\x87, test2'
>>> print data
test, ąśżźć, test2
list.__str__ and list.__repr__ use the representation of each string, but if you access an item inside the list, it's still correct:
>>> splitted_data = data.split(",")
>>> splitted_data
['test', ' \xc4\x85\xc5\x9b\xc5\xbc\xc5\xba\xc4\x87', ' test2']
>>> print splitted_data[1]
 ąśżźć
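A minimal sketch of the same point, written so it should run on both Python 2.7 and Python 3 (the names parts, list_display and item_display are only illustrative, and the bytes are spelled with escapes so the example does not depend on the source-file encoding):

```python
# -*- coding: utf-8 -*-
# UTF-8 bytes for "test, ąśżźć, test2".
data = b"test, \xc4\x85\xc5\x9b\xc5\xbc\xc5\xba\xc4\x87, test2"
parts = data.split(b",")

# split() does not alter the bytes: joining the pieces restores the input.
assert b",".join(parts) == data

# Displaying a list calls repr() on every item, which is where the
# \x.. escapes come from. The item itself is unchanged.
list_display = repr(parts)
item_display = repr(parts[1])
assert item_display in list_display
```

So the escapes belong to the list's display, not to the data that split returned.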
While your snippet works (the escapes are just how repr works), you shouldn't treat bytestrings as text. Decode first, operate later.
data = u"test, ąśżźć, test2" # or "test, ąśżźć, test2".decode('utf-8')
split_data = data.split(u",")
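To illustrate "decode first, operate later", here is a short sketch that assumes the input arrives as UTF-8 bytes (for example read from a file or socket). The names raw, text, fields and encoded are hypothetical, and the code is meant to run on both Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
# Assumed input: UTF-8 encoded bytes for "test, ąśżźć, test2".
raw = b"test, \xc4\x85\xc5\x9b\xc5\xbc\xc5\xba\xc4\x87, test2"

# 1. Decode once, at the boundary: bytes -> text.
text = raw.decode("utf-8")

# 2. Operate on text: split on the comma and trim surrounding spaces.
fields = [f.strip() for f in text.split(u",")]

# On text, len() counts characters, not bytes.
assert len(fields[1]) == 5

# 3. Encode again only when writing the result back out.
encoded = [f.encode("utf-8") for f in fields]
assert len(encoded[1]) == 10
```

The difference between 5 and 10 is the reason to decode: on a bytestring, slicing or counting works on bytes and can cut a multi-byte character in half.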
You are looking at the repr() of the strings, not at what print shows for them:
>>> data = "test, åäö, test2"
>>> data
'test, \xe5\xe4\xf6, test2'
>>> data.split()[1]
'\xe5\xe4\xf6,'
>>> print data.split()[1]
åäö,
As everyone else has said, there is nothing wrong with your procedure. Your expectations are not met because printing a list shows the repr() of each contained string, not the strings themselves. Compare with the following:
>>> data = "test, ąśżźć, test2"
>>> a,b,c = data.split(",")
>>> print a,b,c
test  ąśżźć  test2
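If you want the whole list on one line without unpacking it, one option is to join the items into a single string before displaying it. A small sketch, assuming the data has already been decoded to text (parts and line are illustrative names):

```python
# -*- coding: utf-8 -*-
# Text (unicode) version of "test, ąśżźć, test2".
data = u"test, \u0105\u015b\u017c\u017a\u0107, test2"

# Split on the comma and drop the space that follows each comma.
parts = [p.strip() for p in data.split(u",")]

# Joining gives one string, so displaying it does not go through the
# repr() of each list item.
line = u" ".join(parts)
assert line == u"test \u0105\u015b\u017c\u017a\u0107 test2"
```

Printing line then shows the characters themselves, provided the terminal encoding can represent them.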