Regex for names with special characters (Unicode)

2023-03-04 06:58

Okay, I have read about regex all day now, and I still don't understand it properly. What I'm trying to do is validate a name, but the functions I can find for this on the internet only use [a-zA-Z], leaving out characters that I need to accept too.

I basically need a regex that checks that the name is at least two words and that it does not contain numbers or special characters like !"#¤%&/()=...; the words can, however, contain characters like æ, é, Â and so on.

An example of an accepted name would be: "John Elkjærd" or "André Svenson"

A non-accepted name would be: "Hans", "H4nn3 Andersen" or "Martin Henriksen!"

If it matters, I use the JavaScript .match() function client side and want to use PHP's preg_replace() only "in negative" server side (removing non-matching characters).

Any help would be much appreciated.

Update:

Okay, thanks to Alix Axel's answer I have the important part down: the server-side one.

But as the page from LightWing's answer suggests, I'm unable to find anything about Unicode support in JavaScript, so I ended up with half a solution for the client side, just checking for at least two words and a minimum of 5 characters, like this:

if ((name.match(/\S+/g) || []).length >= minWords && name.length >= 5) {
  //valid
}

An alternative would be to specify all the Unicode characters as suggested in shifty's answer, which I might end up doing along with the solution above, but it is a bit impractical.

Try the following regular expression:

^(?:[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s?)+$

In PHP this translates to:

if (preg_match('~^(?:[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s?)+$~u', $name) > 0)
{
    // valid
}

You should read it like this:

^   # start of subject
    (?:     # match this:
        [           # match a:
            \p{L}       # Unicode letter, or
            \p{Mn}      # Unicode accents, or
            \p{Pd}      # Unicode hyphens, or
            \'          # single quote, or
            \x{2019}    # single quote (alternative)
        ]+              # one or more times
        \s          # any kind of space
        [           # match a:
            \p{L}       # Unicode letter, or
            \p{Mn}      # Unicode accents, or
            \p{Pd}      # Unicode hyphens, or
            \'          # single quote, or
            \x{2019}    # single quote (alternative)
        ]+              # one or more times
        \s?         # any kind of space (0 or 1 times)
    )+      # one or more times
$   # end of subject

I honestly don't know how to port this to JavaScript; I'm not even sure JavaScript supports Unicode properties. In PHP PCRE, though, this seems to work flawlessly on IDEOne.com:

$names = array
(
    'Alix',
    'André Svenson',
    'H4nn3 Andersen',
    'Hans',
    'John Elkjærd',
    'Kristoffer la Cour',
    'Marco d\'Almeida',
    'Martin Henriksen!',
);

foreach ($names as $name)
{
    echo sprintf('%s is %s' . "\n", $name, (preg_match('~^(?:[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s?)+$~u', $name) > 0) ? 'valid' : 'invalid');
}

I'm sorry I can't help you with the JavaScript part, but someone here probably will.

Validates:

John Elkjærd
André Svenson
Marco d'Almeida
Kristoffer la Cour

Invalidates:

Hans
H4nn3 Andersen
Martin Henriksen!
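
For the client side, the same pattern can be sketched in JavaScript, assuming an engine that supports ES2018 Unicode property escapes (the u flag is required for \p{...}). The helper name isValidName is made up for illustration; \x{2019} is written as \u2019, and the quote is left unescaped because the u flag rejects unnecessary escapes such as \'.

```javascript
// Sketch of a JavaScript port of the PHP pattern above.
// Assumption: the engine supports ES2018 Unicode property escapes.
const NAME_PATTERN =
  /^(?:[\p{L}\p{Mn}\p{Pd}'\u2019]+\s[\p{L}\p{Mn}\p{Pd}'\u2019]+\s?)+$/u;

// isValidName is a hypothetical helper, not part of any library.
function isValidName(name) {
  return NAME_PATTERN.test(name);
}

const names = [
  'Alix',
  'André Svenson',
  'H4nn3 Andersen',
  'Hans',
  'John Elkjærd',
  'Kristoffer la Cour',
  "Marco d'Almeida",
  'Martin Henriksen!',
];

for (const name of names) {
  console.log(`${name} is ${isValidName(name) ? 'valid' : 'invalid'}`);
}
```

On such an engine this should classify the sample names the same way as the PHP version.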

To replace invalid characters, though I'm not sure why you need this, you just need to change it slightly:

$name = preg_replace('~[^\p{L}\p{Mn}\p{Pd}\'\x{2019}\s]~u', '', $name);

Examples:

H4nn3 Andersen -> Hnn Andersen
Martin Henriksen! -> Martin Henriksen
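
If the same clean-up is wanted client side, a rough JavaScript equivalent could look like the sketch below. It assumes an ES2018+ engine, and stripInvalidNameChars is a made-up helper name.

```javascript
// Sketch: remove every character that is not a letter, combining mark,
// dash, apostrophe (ASCII or U+2019), or whitespace.
// Assumption: the engine supports \p{...} with the "u" flag (ES2018+).
function stripInvalidNameChars(name) {
  return name.replace(/[^\p{L}\p{Mn}\p{Pd}'\u2019\s]/gu, '');
}

console.log(stripInvalidNameChars('H4nn3 Andersen'));    // Hnn Andersen
console.log(stripInvalidNameChars('Martin Henriksen!')); // Martin Henriksen
```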

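Note that you always need to use the u modifier, so that PCRE treats both the pattern and the subject as UTF-8.

JavaScript is trickier: at the time of writing, its regex syntax did not support Unicode character properties (ES2018 later added \p{...} escapes, available with the u flag). A pragmatic solution for engines without that support is to match letters like this: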

[a-zA-Z\xC0-\uFFFF]

This allows letters in all languages and excludes numbers and all the special (non-letter) characters commonly found on keyboards. It is imperfect because it also allows unicode special symbols which are not letters, e.g. emoticons, snowman and so on. However, since these symbols are typically not available on keyboards I don't think they will be entered by accident. So depending on your requirements it may be an acceptable solution.
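
A quick way to see both the benefit and the imperfection is to run the class against a few inputs. This is only an illustrative sketch; the constant name ROUGH_LETTERS is made up.

```javascript
// Sketch: the pragmatic "anything from U+00C0 upwards" letter class.
const ROUGH_LETTERS = /^[a-zA-Z\xC0-\uFFFF]+$/;

console.log(ROUGH_LETTERS.test('Elkjærd')); // true: æ (U+00E6) is in range
console.log(ROUGH_LETTERS.test('H4nn3'));   // false: digits are excluded
console.log(ROUGH_LETTERS.test('Hans!'));   // false: ASCII punctuation is excluded
console.log(ROUGH_LETTERS.test('\u2603'));  // true: the snowman is not a letter but is accepted
console.log(ROUGH_LETTERS.test('\u00D7'));  // true: so is the multiplication sign
```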

Visit this page: Unicode Characters in Regular Expression

You can add the allowed special characters to the regex.

Example:

[a-zA-ZßöäüÖÄÜæé]+

EDIT:

Not the best solution, but this would give a result if there are at least two words.

[a-zA-ZßöäüÖÄÜæé]+\s[a-zA-ZßöäüÖÄÜæé]+
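
One thing to watch with this pattern: without anchors it only has to match somewhere in the string, so trailing junk still passes. The sketch below contrasts it with an anchored variant, which is an adaptation and not part of the original suggestion; it accepts exactly two words.

```javascript
// Sketch: the explicit-list pattern, unanchored (as above) and anchored.
const unanchored = /[a-zA-ZßöäüÖÄÜæé]+\s[a-zA-ZßöäüÖÄÜæé]+/;
const anchored = /^[a-zA-ZßöäüÖÄÜæé]+\s[a-zA-ZßöäüÖÄÜæé]+$/;

console.log(unanchored.test('Martin Henriksen!')); // true: the "!" is simply ignored
console.log(anchored.test('Martin Henriksen!'));   // false
console.log(anchored.test('John Elkjærd'));        // true
console.log(anchored.test('Hans'));                // false: only one word
```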

Here's an optimization over the fantastic answer by @Alix above. It removes the need to define the character class twice, and allows for easier definition of any number of required words.

^(?:[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+(?:$|\s+)){2,}$

It can be broken down as follows:

^         # start
  (?:       # non-capturing group
    [         # match a:
      \p{L}     # Unicode letter, or
      \p{Mn}    # Unicode accents, or
      \p{Pd}    # Unicode hyphens, or
      \'        # single quote, or
      \x{2019}  # single quote (alternative)
    ]+        # one or more times
    (?:       # non-capturing group
      $         # either end-of-string
    |         # or
      \s+       # one or more spaces
    )         # end of group
  ){2,}     # two or more times
$         # end-of-string

Essentially, it is saying to find a word as defined by the character class, then either find one or more spaces or an end of a line. The {2,} at the end tells it that a minimum of two words must be found for a match to succeed. This ensures the OP's "Hans" example will not match.
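
Because the word count lives in a single quantifier, it is easy to parameterize. The following is a minimal JavaScript sketch, assuming an ES2018+ engine; buildNameRegex is a hypothetical helper name.

```javascript
// Sketch: build the "N or more words" pattern for a given minimum.
// Assumption: the engine supports \p{...} with the "u" flag (ES2018+).
function buildNameRegex(minWords) {
  if (!Number.isInteger(minWords) || minWords < 1) {
    throw new RangeError('minWords must be a positive integer');
  }
  // One word: letters, combining marks, dashes, or apostrophes.
  const word = "[\\p{L}\\p{Mn}\\p{Pd}'\\u2019]+";
  // Each word must be followed by whitespace or the end of the string.
  return new RegExp(`^(?:${word}(?:$|\\s+)){${minWords},}$`, 'u');
}

const twoWords = buildNameRegex(2);
console.log(twoWords.test('John Elkjærd'));       // true
console.log(twoWords.test('Kristoffer la Cour')); // true
console.log(twoWords.test('Hans'));               // false
```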

Lastly, since I found this question while looking for a similar solution for Ruby, here is the regular expression as it can be used in Ruby 1.9+:

\A(?:[\p{L}\p{Mn}\p{Pd}\'\u2019]+(?:\Z|\s+)){2,}\Z

The primary changes are using \A and \Z for beginning and end of string (instead of line) and Ruby's Unicode character notation.

When checking your input string you could

trim() it to remove leading/trailing whitespace
match against [^\w\s] to detect non-word, non-whitespace characters
match against \s+ to get the number of word separators, which equals the number of words - 1.

However, I'm not sure whether the \w shorthand includes accented characters; that depends on the regex engine. In JavaScript, \w matches only [A-Za-z0-9_], so accented letters are not covered, while digits and the underscore are.
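
To check how this behaves in JavaScript specifically, here is a small sketch. In JavaScript \w covers only [A-Za-z0-9_], so it treats accented letters as non-word characters and lets digits through. countWords is a made-up helper; after trimming, the number of words is the number of \s+ separators plus one, which is what splitting on \s+ yields.

```javascript
// Sketch: what the \w-based check does in JavaScript.
console.log(/[^\w\s]/.test('André Svenson'));  // true: é is flagged as a non-word character
console.log(/[^\w\s]/.test('H4nn3 Andersen')); // false: digits count as word characters

// countWords is a hypothetical helper: trim, then split on runs of whitespace.
function countWords(input) {
  const trimmed = input.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

console.log(countWords('  John   Elkjærd ')); // 2
console.log(countWords('Hans'));              // 1
```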

This is the JS regex that I use for fancy names composed of at most 3 words (1 to 60 chars), separated by a space, single quote, or minus sign:

^([a-zA-Z\xC0-\uFFFF]{1,60}[ \-\']{0,1}){1,3}$
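
A few sample inputs show what this pattern accepts. This is an illustrative sketch only, and FANCY_NAME is a made-up constant name. Note that a single word passes, and that the {1,60} bound applies to each repetition of the group, so up to 180 letters in total can get through.

```javascript
// Sketch: exercising the "max 3 words" pattern.
const FANCY_NAME = /^([a-zA-Z\xC0-\uFFFF]{1,60}[ \-\']{0,1}){1,3}$/;

console.log(FANCY_NAME.test('John Elkjærd'));           // true
console.log(FANCY_NAME.test("Marco d'Almeida"));        // true: "Marco ", "d'", "Almeida"
console.log(FANCY_NAME.test('Hans'));                   // true: one word is enough
console.log(FANCY_NAME.test('H4nn3 Andersen'));         // false: digits are excluded
console.log(FANCY_NAME.test('Jean-Pierre de la Cour')); // false: more than three parts
console.log(FANCY_NAME.test('a'.repeat(61)));           // true: split across two repetitions
```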
