
Some utf8 chars allowed in python source, some not

I've noticed that I cannot use all Unicode characters in my Python source code.

While

def 价(何):

is perfectly all right (albeit nonsensical [probably?]),

def N(N₀, t, λ) -> 'N(t)':

isn't allowed (because of the subscript zero, that is).

I also can't use some other characters, most of which I recognise as something other than letters (mathematical operators, for example). I always thought that if I just stuck to the rules I know, i.e. composing names from letters and numbers, with a letter as the first character, all would be okay. Now, the subscript zero is clearly a 'number', so my impression was wrong.

I know I should avoid using special characters. However, the function definition above (the exponential decay one, that is) seems to me perfectly reasonable, because it will never change, and it so elegantly conveys all the information needed for another programmer to use it.

My question, therefore: exactly which characters are allowed and which aren't? And where?

Edit

All right, I seem not to have been clear enough. I am using Python 3, so there is no need to declare the encoding of the source file. I thought that was apparent from the fact that my Chinese function definition works.

My question concerns why some characters are allowed there, while others aren't. The subscript zero raises an error, invalid character in identifier, but the blackboard bold zero works. Both equally special I'd say.

I'd like to know if there are any general rules that apply not just to my situation, there must be. It seems that my error is not an accident.

Edit 2:

The answer, courtesy of Beau Martínez, who pointed me to the language reference, where I should have looked in the first place:

http://www.dcl.hpi.uni-potsdam.de/home/loewis/table-3131.html It appears the characters that are allowed are all chosen.


As per the language reference, Python 3 allows a large variety of characters in identifiers.

That subscript zero looks like a digit, but Python does not treat it as one in identifiers. Only characters in the Unicode category Nd (decimal digits, such as 0 through 9) count as digits there. The subscript zero is in category No (other number), which is not allowed in an identifier at all, unlike, for example, a Greek letter such as Phi.

Importantly, how easily can you type those characters with your keyboard? I don't want to pull up the character map every time I have to call your functions, for example. Calling it "maximum_decay_rate" or something much more intuitive to any user, not just a Physics major, makes your code more readable.

If Python says a character isn't allowed, check the character encoding of your source file first. Python 2 needs a declaration such as # -*- coding: utf-8 -*- (or whichever encoding you use) at the beginning of the file; Python 3 assumes UTF-8 by default, so there the declaration is only needed for other encodings.


Tell Python what the proper encoding is:

https://www.python.org/dev/peps/pep-0263/

Either...

# -*- coding: utf-8 -*-

or

# coding=utf-8

As for which characters are actually allowed in variable names, the restriction is typically letters, digits, and underscores.

The "subscript zero" is not actually a decimal digit; it's a subscript (Unicode category No rather than Nd).
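To see how Python itself classifies the different "zeros" mentioned in this thread, a small sketch along these lines can help. It assumes Python 3 and uses only the standard unicodedata module; ZEROS and describe are illustrative names, not anything from the language reference.

```python
import unicodedata

# The three "zeros" discussed in the thread, written as escapes so that
# the snippet itself stays plain ASCII.
ZEROS = {
    "ascii zero": "0",
    "subscript zero": "\u2080",
    "double-struck zero": "\U0001D7D8",
}


def describe(ch):
    """Return (general category, isdecimal, isdigit) for one character."""
    return unicodedata.category(ch), ch.isdecimal(), ch.isdigit()


for label, ch in ZEROS.items():
    # unicodedata.name() gives the official Unicode character name.
    print(label, "|", unicodedata.name(ch), "|", describe(ch))
```

The subscript zero reports category No and is not a decimal, while the other two report Nd, which matches the difference in behaviour described in the question.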


Each Unicode character has specific 'properties', to be found in the Unicode Character Database, and for our purpose the properties from the so-called General Category are the most important. They partition all characters into large groups:

  • letters (L)
  • numbers (N)
  • marks (M)
  • punctuations (P)
  • symbols (S)
  • separators (Z)
  • other (C)

The groups have subgroups, for example Lu is Uppercase_Letter. According to the Python Language Reference (3.4.1), identifiers are normalized into NFKC form (which in practice means replacing compatibility characters with their plain equivalents, for example changing subscript 0 into normal 0, and composing letters with their diacritics). The start of the identifier should be either an underscore or a letter (the whole Letter group plus Nl, letterlike numbers), plus a few other letterlike symbols.

It gets much more interesting when we look at characters that are allowed as continuation of the identifier. Additionally, we can use:

  • Decimal_Numbers (Nd), which are in fact digits from 0 to 9, but in many guises, for example MATHEMATICAL MONOSPACE DIGIT NINE, which is character \U0001D7FF (all together 70 characters)
  • most marks (M), with the exception of enclosing marks (Me); here we have all the diacritics (accents)
  • all characters from subgroup Pc, punctuation connectors, so not only underscore, but also various ties (10 characters)
  • some additional digit-like characters (for example Ethiopic digits one to nine)
  • middle dots (2 characters)
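These start and continuation rules can be probed directly with a rough sketch like the one below. It assumes CPython 3 and relies on str.isidentifier() and unicodedata.category() from the standard library; first_invalid_char is a hypothetical helper written for this illustration, and the results depend on the Unicode database shipped with your interpreter.

```python
import unicodedata


def first_invalid_char(name):
    """Return (index, char, category) for the first character that keeps
    `name` from being an identifier, or None if every character passes.

    Note: an empty string and reserved keywords are not handled here.
    """
    for i, ch in enumerate(name):
        if i == 0:
            # The first character must be a valid identifier start.
            ok = ch.isidentifier()
        else:
            # Later characters are tested as a continuation of a
            # known-good start character.
            ok = ("a" + ch).isidentifier()
        if not ok:
            return i, ch, unicodedata.category(ch)
    return None


candidates = [
    "价",             # CJK letter (category Lo)
    "λ",              # Greek lowercase letter (Ll)
    "N\U0001D7D8",    # N + double-struck zero (Nd)
    "N\u2080",        # N + subscript zero (No)
    "a\u2212b",       # a + MINUS SIGN + b (Sm, a mathematical operator)
    "0N",             # a digit may continue a name but not start one
]

for name in candidates:
    print(repr(name), name.isidentifier(), first_invalid_char(name))
```

Reporting the category of the offending character makes it easier to relate a rejected name back to the groups listed above.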

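From the above, one might expect N with a subscript 0 to be accepted as an identifier, because NFKC normalization turns it into N0. When I tried to paste it from Word, both IDLE and Wing 101 inserted the normalized form into the editor (i.e. N0). The interpreter itself, however, checks the characters as typed: the subscript zero (U+2080) belongs to category No (Other_Number), which is not among the allowed groups, so the name is rejected with the "invalid character in identifier" error from the question. The double-struck zero is in Nd, which is why it works.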
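How normalization and validity interact can be checked with a short experiment. This is only a sketch, assuming CPython 3; nfkc and compiles are hypothetical helper names, and "<demo>" is just a placeholder filename for compile().

```python
import unicodedata

subscript_name = "N\u2080"           # N + SUBSCRIPT ZERO
double_struck_name = "N\U0001D7D8"   # N + MATHEMATICAL DOUBLE-STRUCK DIGIT ZERO


def nfkc(text):
    """Apply the normalization form that Python uses for identifiers."""
    return unicodedata.normalize("NFKC", text)


def compiles(source):
    """Return True if `source` is accepted by the compiler."""
    try:
        compile(source, "<demo>", "exec")
        return True
    except SyntaxError:
        return False


# Both spellings normalize to the same plain name...
print(nfkc(subscript_name), nfkc(double_struck_name))

# ...but only one of them is accepted as source code.
print(compiles(subscript_name + " = 1"))
print(compiles(double_struck_name + " = 1"))

# An accepted name is stored in its normalized form.
namespace = {}
exec(double_struck_name + " = 5", namespace)
print(sorted(key for key in namespace if not key.startswith("__")))
```

On the interpreters I would expect this to run on, the subscript spelling fails to compile even though its normalized form is N0, while the double-struck spelling compiles and ends up bound under the plain name N0.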
