Python and Unicode: How everything should be Unicode
Forgive if this a long a question:
I have been programming in Python for around six months. Self taught, starting with the Python tutorial and then SO and then just using Google for stuff.
Here is the sad part: No one told me all strings should be Unicode. No, I am not lying or making this up, but where does the tutorial mention it? And most examples also I see just make use of byte strings
, instead of Unicode strings.
I was just browsing and came across this question on SO, which says how every string in Python should be a Unicode string. This pretty much made me cry!
I read that every string in Python 3.0 is Unicode by default, so my questions are for 2.x:
Should I do a:
print u'Some text'
or justprint 'Text'
?Everything should be Unicode, does this mean, like say I have a
tuple
:t = ('First', 'Second'), it should be t = (u'First', u'Second')?
I read that I can do a
from __future__ import unicode_literals
and then every string will be a Unicode string, but should I do this inside a container also?When reading/ writing to a file, I should use the
codecs
module. Right? Or should I just use the standard way or reading/ writing andencode
ordecode
where required?If I get the s开发者_开发问答tring from say
raw_input()
, should I convert that to Unicode also?
What is the common approach to handling all of the above issues in 2.x? The from __future__ import unicode_literals
statement?
Sorry for being a such a noob, but this changes what I have been doing for a long time and so clearly I am confused.
The "always use Unicode" suggestion is primarily to make the transition to Python 3 easier. If you have a lot of non-Unicode string access in your code, it'll take more work to port it.
Also, you shouldn't have to decide on a case-by-case basis whether a string should be stored as Unicode or not. You shouldn't have to change the types of your strings and their very syntax just because you changed their contents, either.
It's also easy to use the wrong string type, leading to code that mostly works, or code which works in Linux but not in Windows, or in one locale but not another. For example, for c in "漢字"
in a UTF-8 locale will iterate over each UTF-8 byte (all six of them), not over each character; whether that breaks things depends on what you do with them.
In principle, nothing should break if you use Unicode strings, but things may break if you use regular strings when you shouldn't.
In practice, however, it's a pain to use Unicode strings everywhere in Python 2. codecs.open
doesn't pick the correct locale automatically; this fails:
codecs.open("blar.txt", "w").write(u"漢字")
The real answer is:
import locale, codecs
lang, encoding = locale.getdefaultlocale()
codecs.open("blar.txt", "w", encoding).write(u"漢字")
... which is cumbersome, forcing people to make helper functions just to open files. codecs.open
should be using the encoding from locale
automatically when one isn't specified; Python's failure to make such a simple operation convenient is one of the reasons people generally don't use Unicode everywhere.
Finally, note that Unicode strings are even more critical in Windows in some cases. For example, if you're in a Western locale and you have a file named "漢字", you must use a Unicode string to access it, eg. os.stat(u"漢字")
. It's impossible to access it with a non-Unicode string; it just won't see the file.
So, in principle I'd say the Unicode string recommendation is reasonable, but with the caveat that I don't generally even follow it myself.
No, not every string "should be Unicode". Within your Python code, you know if the string literals needs to be Unicode or not, so it doesn't make any sense to make every string literal into a Unicode literal.
But there are cases where you should use Unicode. For example, if you have arbitrary input that is text, use Unicode for it. You will sooner or later find a non-american using it, and he want to wrîte têxt ås hé is üsed tö. And you'll get problems in that case unless your input and output happen to use the same encoding, which you can't be sure of.
So in short, no, strings shouldn't be Unicode. Text should be. But YMMV.
Specifically:
No need to use Unicode here. You know if that string is ASCII or not.
Depends if you need to merge those strings with Unicode or not.
Both ways work. But do not encode decode "when required". Decode ASAP, encode as late as possible. Using codecs work well (or io, from Python 2.7).
Yeah.
IMHO (my simple rules):
Should I do a:
print u'Some text' or just print 'Text'
?Everything should be Unicode, does this mean, like say I have a tuple:
t = ('First', 'Second'), it should be t = (u'First', u'Second')
?
Well, I use unicode literals only when I have some char above ASCII 128:
print 'New York', u'São Paulo'
t = ('New York', u'São Paulo')
- When reading/ writing to a file, I should use the codecs module. Right? Or should I just use the standard way or reading/ writing and encode or decode where required?
If you expect unicode text, use codecs.
- If I get the string from say raw_input(), should I convert that to Unicode also?
Only if you expect unicode text that may get transfered to another system with distinct default encoding (including databases).
EDITED (about mixing unicode and byte strings):
>>> print 'New York', 'to', u'São Paulo'
New York to São Paulo
>>> print 'New York' + ' to ' + u'São Paulo'
New York to São Paulo
>>> print "Côte d'Azur" + ' to ' + u'São Paulo'
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)
>>> print "Côte d'Azur".decode('utf-8') + ' to ' + u'São Paulo'
Côte d'Azur to São Paulo
So if you mix a byte string that contains utf-8 (or other non ascii char) with unicode text without explicit conversion, you will have trouble, because default assumes ascii. The other way arround seems to be safe. If you follow the rule of writing every string containing non-ascii as an unicode literal, you should be OK.
DISCLAIMER: I live in Brazil where people speak Portuguese, a language with lots of non-ascii chars. My default encoding is always set to 'utf-8'. Your mileage may vary in English/ascii systems.
I’m just adding my personal opinion here. Not as long and elaborate at the other answers, but maybe it can help, too.
print u'Some text'
or justprint 'Text'
?
I’d indeed prefer the first. If you know that you only have Unicode strings, you have one invariant more. Various other languages (C, C++, Perl, PHP, Ruby, Lua, …) sometimes encounter painful problems because of their lack of separation between code unit sequences and integer sequences. I find the approach of strict distinction between them that exists in .NET, Java, Python etc. quite a bit cleaner.
Everything should be Unicode, does this mean, like say I have a tuple:
t = ('First', 'Second')
, it should bet = (u'First', u'Second')?
Yes.
I read that I can do a
from __future__ import unicode_literals
and then every string will be a Unicode string, but should I do this inside a container also?
Yes. Future statements apply only to the file where they’re used, so you can use them without interfering with other modules. I generally import all futures in Python 2.x modules to make the transition to 3.x easier.
When reading/ writing to a file, I should use the
codecs
module. Right? Or should I just use the standard way or reading/ writing and encode or decode where required?
You should use the codecs
module because that makes it impossible (or at least very hard) to accidentally write differently-encoded representations to a single file. It is also the way Python 3.x works when you open a file in text mode.
If I get the string from say
raw_input()
, should I convert that to Unicode also?
I’d say yes to this too: In most cases it’s easier to deal with only one encoding, so I recommend converting to Python Unicode strings as early as possible.
What is the common approach to handling all of the above issues in 2.x? The
from __future__ import unicode_literals
statement?
I don’t know what the common approach is, but I use that statement all the time. I have encountered only very few issues with this approach, and most of them are related to bugs in external libraries—i.e., NumPy sometimes requires byte strings without documenting that.
The fact that you were writing Python code for 6 months before encountering anything about Unicode means that the Python 2.x ASCII default for strings didn't cause you any problems. Certainly for a beginner to try to grasp the idea of Unicode/code points/encoding in itself is a hard issue to tackle; therefore, most tutorials naturally bypass it until you get more of a grounding in the fundamentals. That's why in a book like Dive Into Python, Unicode is only mentioned in later chapters.
If you need to support Unicode in your application, I suggest looking at Kumar McMillan's PyCon 2008 talk on Unicode for a list of best practices. It should answer your remaining questions.
1/2) Personally I've never heard of "always use unicode". That seems pretty stupid to me. I guess I understand if you plan to support other languages that need unicode support. But other than that I wouldn't do that, it seems like more of a pain than it's worth.
3) I would just read/write the standard way and encode when necessary.
精彩评论