开发者

Python regex \w doesn't match combining diacritics?

I have a UTF8 string with combining diacritics. I want to match it with the \w regex sequence. It matches characters that have accents, but not if there is a latin character with combining diacritics.

>>> re.match("a\w\w\wz", u"aoooz", re.UNICODE)
<_sre.SRE_Match object at 0xb7788f38>
>>> print u"ao\u00F3oz"
aoóoz
>>> re.match("a\w\w\wz", u"ao\u00F3oz", re.UNICODE)
<_sre.SRE_Match object at 0xb7788f38>
>>> re.match("a\w\w\wz", u"aoo\u0301oz", re.UNICODE)
>>> print u"aoo\u0301oz"
aóooz

(Looks like the SO markdown processer is having trouble with the combining diacritics in the above, but there is a ́ on the last line)

Is there anyway to match combining diacritics wi开发者_如何学JAVAth \w? I don't want to normalise the text because this text is from filename, and I don't want to have to do a whole 'file name unicode normalization' yet. This is Python 2.5.


I've just noticed a new "regex" package on pypi. (if I understand correctly, it is a test version of a new package that will someday replace the stdlib re package).

It seems to have (among other things) more possibilities with regard to unicode. For example, it supports \X, which is used to match a single grapheme (whether it uses combining or not). It also supports matching on unicode properties, blocks and scripts, so you can use \p{M} to refer to combining marks. The \X mentioned before is equivalent to \P{M}\p{M}* (a character that is NOT a combining mark, followed by zero or more combining marks).

Note that this makes \X more or less the unicode equivalent of ., not of \w, so in your case, \w\p{M}* is what you need.

It is (for now) a non-stdlib package, and I don't know how ready it is (and it doesn't come in a binary distribution), but you might want to give it a try, as it seems to be the easiest/most "correct" answer to your question. (otherwise, I think your down to explicitly using character ranges, as described in my comment to the previous answer).

See also this page with information on unicode regular expressions, that might also contain some useful information for you (and can serve as documentation for some of the things implemented in the regex package).


You can use unicodedata.normalize to compose the combining diacritics into one unicode character.

>>> import re
>>> from unicodedata import normalize
>>> re.match(u"a\w\w\wz", normalize("NFC", u"aoo\u0301oz"), re.UNICODE)
<_sre.SRE_Match object at 0x00BDCC60>

I know you said you didn't want to normalize, but I don't think there will be a problem with this solution, as you're only normalizing the string to match against, and do not have to change the filename itself or something.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜