An equivalent to string.ascii_letters for Unicode strings in Python 2.x?

In the "string" module of the standard library,

string.ascii_letters ## Same as string.ascii_lowercase + string.ascii_uppercase

is

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

Is there a similar constant which would include everything that is considered a letter in Unicode?


You can construct your own constant of Unicode upper and lower case letters with:

import unicodedata as ud
all_unicode = ''.join(unichr(i) for i in xrange(65536))
unicode_letters = ''.join(c for c in all_unicode
                          if ud.category(c)=='Lu' or ud.category(c)=='Ll')

This makes a string 2153 characters long (on a narrow Unicode Python build). For membership tests such as letter in unicode_letters, it would be faster to use a set instead:

unicode_letters = set(unicode_letters)
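
The snippet above keeps only the Lu and Ll categories. If the other letter categories (Lt, Lm, Lo) should count too, something like the following might work. It is only a sketch: the names LETTER_CATEGORIES and build_letter_set are made up for illustration, and the try/except shim is an assumption added so that the same code also runs on Python 3, where unichr and xrange no longer exist.

```python
import unicodedata as ud

try:
    _chr, _range = unichr, xrange   # Python 2
except NameError:
    _chr, _range = chr, range       # Python 3 equivalents

# All general categories that Unicode classifies as letters.
LETTER_CATEGORIES = frozenset(['Lu', 'Ll', 'Lt', 'Lm', 'Lo'])

def build_letter_set(limit=0x10000, categories=LETTER_CATEGORIES):
    """Return a frozenset of the characters below `limit` whose
    general category is one of `categories`."""
    return frozenset(c for c in (_chr(i) for i in _range(limit))
                     if ud.category(c) in categories)

# Covers the Basic Multilingual Plane, like the original snippet.
unicode_letters = build_letter_set()
```

The exact size of the set depends on the Unicode database shipped with your Python version, so don't rely on a particular count.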


There's no string, but you can check whether a character is a letter using the unicodedata module, in particular its category() function.

>>> import unicodedata
>>> unicodedata.category(u'a')
'Ll'
>>> unicodedata.category(u'A')
'Lu'
>>> unicodedata.category(u'5')
'Nd'
>>> unicodedata.category(u'ф') # Cyrillic f.
'Ll'
>>> unicodedata.category(u'٢') # Arabic-indic numeral for 2.
'Nd'

Ll means "letter, lowercase". Lu means "letter, uppercase". Nd means "number, decimal digit".
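
Since every letter category starts with "L", the check can be wrapped in a small helper. This is a minimal sketch; is_letter and only_letters are hypothetical names, not part of unicodedata. On Python 2 the arguments need to be unicode objects, not byte strings.

```python
import unicodedata

def is_letter(char):
    """True if the general category of `char` is a letter
    category (Lu, Ll, Lt, Lm or Lo)."""
    return unicodedata.category(char).startswith('L')

def only_letters(text):
    """True if every character of `text` is a letter."""
    return all(is_letter(c) for c in text)
```

For example, is_letter(u'ф') is true and is_letter(u'٢') is false, matching the categories shown above.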


That would be a pretty massive constant. Unicode currently covers over 100,000 different characters. So the answer is no.

The real question is why you need it. There might be some other way of solving your problem, with the unicodedata module, for example.

Update: You can download files with all Unicode code point names and other information from ftp://ftp.unicode.org/, and do loads of interesting stuff with that.
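
As a rough illustration of what can be done with those files, here is a sketch that assumes the file in question is UnicodeData.txt from the Unicode Character Database, where each line is a semicolon-separated record whose first three fields are the code point (hex), the character name and the general category. The function name letter_code_points is made up for this example, and SAMPLE holds three lines in that format so the code runs without a download.

```python
# Three records in UnicodeData.txt format, embedded for the example.
SAMPLE = u"""\
0035;DIGIT FIVE;Nd;0;EN;;5;5;5;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
"""

def letter_code_points(lines):
    """Yield (code_point, name) for every record whose general
    category (third field) starts with 'L'."""
    for line in lines:
        fields = line.strip().split(';')
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        if fields[2].startswith('L'):
            yield int(fields[0], 16), fields[1]

letters = list(letter_code_points(SAMPLE.splitlines()))
```

With the real file you would pass the open file object instead of SAMPLE.splitlines(). Note that large blocks such as the CJK ideographs are listed there only as a First/Last pair of records, so this sketch would see just the two endpoints of such a range.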


As mentioned in previous answers, the string would indeed be way too long. So, you have to target (a) specific language(s).
[EDIT: I realized it was the case for my original intended use, and for most uses, I guess. However, in the meantime, Mark Tolonen gave a good answer to the question as it was asked, so I chose his answer, although I used the following solution]

This is easily done with the "locale" module:

import locale
import string
code = 'fr_FR' ## Do NOT specify encoding (see below)
locale.setlocale(locale.LC_CTYPE, code)
encoding = locale.getlocale()[1]
letters = string.letters.decode(encoding)

with "letters" being a 117-character-long unicode string.

Apparently, string.letters is dependent on the default encoding for the selected language code, rather than on the language itself. Setting the locale to fr_FR or de_DE or es_ES will update string.letters to the same value (since they are all encoded in ISO8859-1 by default).

If you add an encoding to the language code (de_DE.UTF-8), the default encoding will be used instead for string.letters. That would cause a UnicodeDecodeError if you used the rest of the above code.
