Some utf8 chars allowed in python source, some not

2023-01-11 22:12 Asked by:

I've noticed that I cannot use all Unicode characters in my Python source code.

While

def 价(何):

is perfectly all right (albeit nonsensical [probably?]),

def N(N₀, t, λ) -> 'N(t)':

this isn't allowed (the subscript zero that is).

I also can't use some other characters, most of which I recognise as something other than letters (mathematical operators, for example). I always thought that if I just stuck to the rules I know, i.e. composing names from letters and numbers, with a letter as the first character, all would be okay. Now, the subscript zero is clearly a 'number', so my impression was wrong.

I know I should avoid using special characters. However, the function definition above (the exponential decay one, that is) seems to me perfectly reasonable, because it will never change, and it so elegantly conveys all the information needed for another programmer to use it.

My question, therefore: exactly which characters are allowed and which aren't? And where?

Edit

All right, I seem not to have been clear enough. I am using Python 3, so there is no need to declare the encoding of the source file. I thought that was apparent from the fact that my Chinese function definition works.

My question concerns why some characters are allowed there, while others aren't. The subscript zero raises an error, invalid character in identifier, but the blackboard bold zero works. Both equally special I'd say.

I'd like to know if there are any general rules that apply not just to my situation, there must be. It seems that my error is not an accident.

Edit 2:

The answer, courtesy of Beau Martínez, pointing me to the language reference, where I should have looked in the first place:

http://www.dcl.hpi.uni-potsdam.de/home/loewis/table-3131.html

It appears that the allowed characters were all chosen deliberately.

As per the language reference, Python 3 allows a large variety of characters as identifiers.

That subscript zero looks like a number, but Python does not treat it as a digit: in identifiers, only characters in the Unicode category Nd (decimal digits, such as 0 through 9) count as digits. The subscript zero is in category No (other number), which is not allowed in identifiers at all, whereas a Greek letter such as Phi is an ordinary letter and is fine.

Importantly, how easily can you type those characters with your keyboard? I don't want to pull up the character map every time I have to call your functions, for example. Calling it "maximum_decay_rate" or something much more intuitive to any user, not just a Physics major, makes your code more readable.

If you say it isn't allowed, it's probably because you haven't specified the character encoding for your source file. It can be specified by having # -*- coding: utf-8 -*- (or whichever encoding applies) at the beginning of your source file.

Tell Python what the proper encoding is:

https://www.python.org/dev/peps/pep-0263/

Either...

# -*- coding: utf-8 -*-

# coding=utf-8
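As a rough illustration of what the declaration does, the sketch below uses the standard library's tokenize.detect_encoding to report which encoding Python would assume for a piece of source. The helper name declared_encoding and the two sample sources are made up for this example. It also shows that without any declaration Python 3 assumes UTF-8, which is why the Chinese function definition from the question works as-is.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding Python would assume for this source (hypothetical helper)."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# Sample source with a PEP 263 declaration; \xe9 is 'é' encoded in Latin-1.
with_cookie = b"# -*- coding: latin-1 -*-\nname = '\xe9'\n"
# Sample source without a declaration, saved as UTF-8.
without_cookie = "def 价(何):\n    return 何\n".encode("utf-8")

print(declared_encoding(with_cookie))     # iso-8859-1
print(declared_encoding(without_cookie))  # utf-8

# compile() accepts bytes and honours the declaration when decoding them.
namespace = {}
exec(compile(with_cookie, "<example>", "exec"), namespace)
print(namespace["name"])                  # é
```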

As far as what characters are actually allowed in variable names, typically the restriction is alphabetic characters, digits, and underscores.

The "subscript zero" is not actually a digit, it's a subscript.

Each Unicode character has specific 'properties', to be found in the Unicode Character Database, and for our purpose the General Category property is the most important. It lets us partition all the characters into large groups:

letters (L)
numbers (N)
marks (M)
punctuations (P)
symbols (S)
separators (Z)
other (C)
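The General Category of any character can be looked up with the standard unicodedata module. The following is a small sketch; the choice of sample characters is mine, picked to cover the characters from the question plus one example from several of the groups above.

```python
import unicodedata

# Sample characters: letters, digits in two guises, the subscript zero,
# a combining accent, an underscore, a math operator and a space.
samples = ["A", "λ", "价", "0", "\U0001D7D8", "\u2080", "\u0301", "_", "+", " "]

categories = {ch: unicodedata.category(ch) for ch in samples}
for ch, cat in categories.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {cat}")
```

Of the two zeros from the question, the double-struck one (U+1D7D8) is reported as Nd, while the subscript one (U+2080) is reported as No.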

The groups have subgroups; for example, Lu is Uppercase_Letter. According to the Python Language Reference (3.4.1), one should first normalize the sequence of characters into NFKC form (which in practice means replacing compatibility characters with their plain equivalents, for example changing subscript 0 into normal 0, and composing base characters with their diacritics). Then, the start of the identifier should be either an underscore or a letter (the whole Letter group plus Nl, the letterlike numbers), plus a few other letterlike symbols.

It gets much more interesting when we look at the characters that are allowed as a continuation of the identifier. Additionally, we can use:

Decimal_Numbers (Nd), which are in fact digits from 0 to 9, but in many guises, for example MATHEMATICAL MONOSPACE DIGIT NINE, which is character \U0001D7FF (all together 70 characters)
most marks (M), with the exception of enclosing marks (Me); here we have all the diacritics (accents)
all characters from subgroup Pc (connector punctuation), so not only the underscore, but also various ties (10 characters)
some additional digit-like characters (for example Ethiopic digits 1 to 9)
middle dots (2 characters)
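The normalization step can be tried out with unicodedata.normalize. This is only a sketch: the sample characters and the variable names are my own choice, and the last part relies on the parser storing identifiers in their NFKC form, so a name written with a double-struck zero ends up under its plain spelling.

```python
import unicodedata

# NFKC replaces compatibility characters and composes base letter + accent.
nfkc_subscript = unicodedata.normalize("NFKC", "\u2080")         # SUBSCRIPT ZERO
nfkc_double_struck = unicodedata.normalize("NFKC", "\U0001D7D8")  # DOUBLE-STRUCK DIGIT ZERO
nfkc_ligature = unicodedata.normalize("NFKC", "\uFB01")           # LATIN SMALL LIGATURE FI
nfkc_accent = unicodedata.normalize("NFKC", "e\u0301")            # e + COMBINING ACUTE ACCENT

print(nfkc_subscript, nfkc_double_struck, nfkc_ligature, nfkc_accent)  # 0 0 fi é

# An identifier written with the double-struck zero is stored as plain "N0".
namespace = {}
exec("N\U0001D7D8 = 5", namespace)
print("N0" in namespace)  # True
```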

As seen from the above, N with a subscript 0 should be accepted as an identifier. When I tried to paste it from Word, both IDLE and Wing 101 inserted the normalized forms into the editor (i.e. N0). I suspect the author of the question tried to use a subscript character that could not be properly normalized.
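The guess above can be checked directly with str.isidentifier and compile. The sketch below is only what I would expect from the CPython versions I am aware of: the normalized name is a valid identifier, but the raw subscript character is rejected before normalization can help, which matches the error reported in the question. The variable names are made up for the example.

```python
import unicodedata

name = "N\u2080"  # N followed by SUBSCRIPT ZERO, as in the question
normalized = unicodedata.normalize("NFKC", name)

print(normalized)                 # N0
print(normalized.isidentifier())  # True
print(name.isidentifier())        # False

# Compiling source that uses the raw name fails with a SyntaxError.
rejected = False
try:
    compile(name + " = 1", "<example>", "exec")
except SyntaxError as exc:
    rejected = True
    print("SyntaxError:", exc.msg)
```

The exact wording of the error message differs between Python versions, so the example prints it rather than assuming it.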
