Multi-language input validation with UTF-8 encoding

2023-01-27 14:20 问答作者：

To check a user input english name is valid, I would usually match t开发者_如何转开发he input against regular expression such as [A-Za-z]. But how can I do this if multi-language(like Chinese, Japanese etc.) support is required with utf8 encoding?

You can approximate the Unicode derived property \p{Alphabetic} pretty succintly with [\pL\pM\p{Nl}] if your language doensn’t support a proper Alphabetic property directly.

Don’t use Java’s \p{Alpha}, because that’s ASCII-only.

But then you’ll notice that you’ve failed to account for dashes (\p{Pd} or DashPunctuation works, but that does not include most of the hyphens!), apostrophes (usually but not always one of U+27, U+2BC, U+2019, or U+FF07), comma, or full stop/period.

You probably had better include \p{Pc} ConnectorPunctuation, just in case.

If you have the Unicode derived property \p{Diacritic}, you should use that, too, because it includes things like the mid-dot needed for geminated L’s in Catalan and the non-combining forms of diacritic marks which people sometimes use.

But then you’ll find people who use ordinal numbers in their names in ways that \p{Nl} (LetterNumber) doesn’t accomodate, so you throw \p{Nd} (DecimalNumber) or even all of \pN (Number) into the mix.

Then you realize that Asian names often require the use of ZWJ or ZWNJ to be written correctly in their scripts, so then you have to add U+200D and U+200C to the mix, which are both \p{Cf} (Format) characters and indeed also JoinControl ones.

By the time you’re done looking up the various Unicode properties for the various and many exotic characters that keep cropping up — or when you think you’re done, rather — you’re almost certain to conclude that you would do a much better job at this if you simply allowed them to use whatever Unicode characters for their name that they wish, as the link Tim cites advises. Yes, you’ll get a few jokers putting in things like “əɯɐuʇƨɐ⅂ əɯɐuʇƨɹᴉℲ”, but that just goes with the territory, and you can’t preclude silly names in any reasonable way.

Think about whether you really need to validate the user's name. Maybe you should let users call themselves whatever they want.

You certainly should never use [A-Za-z], because some people have names with apostrophes or hyphens. It can be quite insulting to prevent someone from using their real name just because it doesn't follow your arbitrary rules for what a name should look like.

In PHP I use this nasty hack:

 setlocale(LC_ALL, 'de_DE');
 preg_match('/^[[:alpha:]]+$/', $name);

That includes "Umlauts" (i.e. 'ä','ö' and the like) plus accented vowels (è,í,etc.). But it falls short to validate for Cyrillic (Russia, Bulgaria, ...) or Chinese characters...

继续阅读：internationalization regex unicode utf-8 validation

Multi-language input validation with UTF-8 encoding

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？