Match letter in any language

2023-03-30 20:11 问答作者：

How can I match a letter from any language using a regex in python 3?

re.match([a-zA-Z]) will match the english language characters but I want all languages to be supported simultaneously.

I don't wish to match the ' in can't or underscores or any other type of formatting. I do wi开发者_运维知识库sh my regex to match: c, a, n, t, Å, é, and 中.

For Unicode regex work in Python, I very strongly recommend the following:

Use Matthew Barnett’s regex library instead of standard re, which is not really suitable for Unicode regular expressions.
Use only Python 3, never Python 2. You want all your strings to be Unicode strings.
Use only string literals with logical/abstract Unicode codepoints, not encoded byte strings.
Set your encoding on your streams and forget about it. If you find yourself ever manually calling .encode and such, you’re almost certainly doing something wrong.
Use only a wide build where code points and code units are the same, never ever ever a narrow one — which you might do well to consider deprecated for Unicode robustness.
Normalize all incoming strings to NFD on the way in and then NFC on the way out. Otherwise you can’t get reliable behavior.

Once you do this, you can safely write patterns that include \w or \p{script=Latin} or \p{alpha} and \p{lower} etc and know that these will all do what the Unicode Standard says they should. I explain all of this business of Python Unicode regex business in much more detail in this answer. The short story is to always use regex not re.

For general Unicode advice, I also have several talks from last OSCON about Unicode regular expressions, most of which apart from the 3rd talk alone is not about Python, but much of which is adaptable.

Finally, there’s always this answer to put the fear of God (or at least, of Unicode) in your heart.

What's wrong with using the \w special sequence?

# -*- coding: utf-8 -*-
import re
test = u"can't, Å, é, and 中ABC"
print re.findall('\w+', test, re.UNICODE)

You can match on

\p{L}

which matches any Unicode code point that represents a letter of a script. That is, assuming you actually have a Unicode-capable regex engine, which I really hope Python would have.

Build a match class of all the characters you want to match. This might become very, very large. No, there is no RegEx shorthand for "All Kanji" ;)

Maybe it is easier to match for what you do not want, but even then, this class would become extremely large.

import re

text = "can't, Å, é, and 中ABC"
print(re.findall('\w+', text))

This works in Python 3. But it also matches underscores. However this seems to do the job as I wish:

import regex

text = "can't, Å, é, and 中ABC _ sh_t"
print(regex.findall('\p{alpha}+', text))

For Portuguese language, use try this one:

[a-zA-ZÀ-ú ]+

As noted by others, it would be very difficult to keep the up-to-date database of all letters in all existing languages. But in most cases you don't actually need that and it can be perfectly fine for your code to begin by supporing just several chosen languages and adding others as needed.

The following simple code supports matching for Czech, German and Polish language. The character sets can be easily obtained from Wikipedia.

import re

LANGS = [
    'ÁáČčĎďÉéĚěÍíŇňÓóŘřŠšŤťÚúŮůÝýŽž',   # Czech
    'ÄäÖöÜüẞß',                         # German
    'ĄąĆćĘęŁłŃńÓóŚśŹźŻż',               # Polish
    ]

pattern = '[A-Za-z{langs}]'.format(langs=''.join(LANGS))
pattern = re.compile(pattern)
result = pattern.findall('Žluťoučký kůň')

print(result)

# ['Ž', 'l', 'u', 'ť', 'o', 'u', 'č', 'k', 'ý', 'k', 'ů', 'ň']

继续阅读：python regex unicode

Match letter in any language

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？