Matching case sensitive unicode strings with regular expressions in Python
Suppose I want to match a lowercase letter followed by an uppercase letter, I could do something like
re.compile(r"[a-z][A-Z]")
Now I want to do the same thing for unicode strings, i.e. match something like 'aÅ' or 'yÜ'.
Tried
re.compile(r"[a-z][A-Z]", re.UNICODE)
开发者_运维百科but that does not work.
Any clues?
This is hard to do with Python regex because the current implementation doesn't support Unicode property shortcuts like \p{Lu}
and \p{Ll}
.
[A-Za-z]
will of course only match ASCII letters, regardless of whether the Unicode option is set or not.
So until the re
module is updated (or you install the regex
package currently in development), you either need to do it programmatically (iterate through the string and do char.islower()
/char.isupper()
on the characters), or specify all the unicode code points manually which probably isn't worth the effort...
精彩评论