Sanitize an international user name

2023-03-21 08:54

In a web chat feature, users enter a name into a form. The names may be in international alphabets, but special characters should be removed from the input string, where "special" means characters not likely to be part of a person's name.

I don't know personal name conventions from around the world, so I thought I'd use PCRE's implementation of Unicode character properties. Here's the regex I came up with to remove special characters:

/[\v\t\pC\pS\p{Zl}\p{Zp}\p{Pe}\p{Pf}\p{Pi}\p{Po}\p{Ps}\p{Me}\p{No}]/u

Jan Goyvaerts has a handy list of these properties.

What would you do to best meet that requirement? It doesn't need to use regex.

EDIT: I copied the list of Unicode character properties below and struck out the ones that would be disallowed:

\p{L} or \p{Letter}: any kind of letter from any language.
- \p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
- \p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
- \p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
- \p{L&} or \p{Letter&}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
- \p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
- \p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
- \p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
- \p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
- ~~\p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).~~
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
- \p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
- ~~\p{Zl} or \p{Line_Separator}: line separator character U+2028.~~
- ~~\p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.~~
~~\p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.~~
- ~~\p{Sm} or \p{Math_Symbol}: any mathematical symbol.~~
- ~~\p{Sc} or \p{Currency_Symbol}: any currency sign.~~
- ~~\p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.~~
- ~~\p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.~~
\p{N} or \p{Number}: any kind of numeric character in any script.
- \p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
- \p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
- ~~\p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0..9 (excluding numbers from ideographic scripts).~~
\p{P} or \p{Punctuation}: any kind of punctuation character.
- \p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
- \p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
- ~~\p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.~~
- ~~\p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.~~
- ~~\p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.~~
- ~~\p{Pf} or \p{Final_Punctuation}: any kind of closing quote.~~
- ~~\p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.~~
~~\p{C} or \p{Other}: invisible control characters and unused code points.~~
- ~~\p{Cc} or \p{Control}: an ASCII 0x00..0x1F or Latin-1 0x80..0x9F control character.~~
- ~~\p{Cf} or \p{Format}: invisible formatting indicator.~~
- ~~\p{Co} or \p{Private_Use}: any code point reserved for private use.~~
- ~~\p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.~~
- ~~\p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.~~
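
Since the requirement doesn't need regex, the disallowed categories above can also be checked one character at a time. The following is a minimal sketch in Python using the standard `unicodedata` module, not a definitive implementation; the names `is_disallowed` and `sanitize_name` are made up for illustration. The explicit `\v` and `\t` from the regex need no separate handling here, because the characters they match already fall in Cc, Zl, or Zp.

```python
import unicodedata

# Two-letter general categories the regex removes, matched exactly.
DISALLOWED_EXACT = {"Zl", "Zp", "Pe", "Pf", "Pi", "Po", "Ps", "Me", "No"}
# Major classes removed as a whole (\pC and \pS), matched on the first letter.
DISALLOWED_MAJOR = {"C", "S"}


def is_disallowed(ch):
    """Return True if ch is in a category the regex would strip."""
    cat = unicodedata.category(ch)
    return cat in DISALLOWED_EXACT or cat[0] in DISALLOWED_MAJOR


def sanitize_name(name):
    """Drop every disallowed character and keep the rest unchanged."""
    return "".join(ch for ch in name if not is_disallowed(ch))


if __name__ == "__main__":
    for raw in ["Anne-Marie", "O'Brien", "<b>Bob</b>"]:
        print(repr(raw), "->", repr(sanitize_name(raw)))
```

Note that the ASCII apostrophe (U+0027) is in Po, so `sanitize_name("O'Brien")` returns `"OBrien"`. Whether that is acceptable depends on the names you expect.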

It depends on your requirements. With regex, it's hard to be certain that what counts as punctuation in one language isn't a valid character in another. If you're just trying to keep accidentally entered text out of your database, I'd run that regex in JavaScript and, if it finds anything that doesn't look like part of a name, ask the user whether they're sure they entered the correct information. The user can then choose to submit anyway or to correct the name. This makes users double-check their input only in the small number of cases where there is a high likelihood that they entered non-text. The vast majority of users are not bothered, and the small minority with unusual names are not left unable to enter them properly because your code removed characters.
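
The confirm-instead-of-strip idea can be sketched as follows. The answer suggests doing the check in the browser with JavaScript; this sketch shows the same decision logic in Python and is only an illustration. It assumes that "doesn't look like characters" means the same categories the question's regex removes, and the function names and the `"accept"`/`"confirm"` return values are hypothetical.

```python
import unicodedata

# Assumption: "suspicious" means the categories from the question's regex.
SUSPICIOUS_EXACT = {"Zl", "Zp", "Pe", "Pf", "Pi", "Po", "Ps", "Me", "No"}
SUSPICIOUS_MAJOR = {"C", "S"}


def suspicious_characters(name):
    """Return the suspicious characters in name, in order, without repeats."""
    found = []
    for ch in name:
        cat = unicodedata.category(ch)
        flagged = cat in SUSPICIOUS_EXACT or cat[0] in SUSPICIOUS_MAJOR
        if flagged and ch not in found:
            found.append(ch)
    return found


def review_name(name, user_confirmed=False):
    """Decide what the form should do next; the name itself is never altered.

    Returns "accept" if the name looks clean or the user already confirmed it,
    otherwise "confirm" so the UI can ask the user to double-check.
    """
    if user_confirmed or not suspicious_characters(name):
        return "accept"
    return "confirm"
```

With this sketch, a name such as `"O'Brien"` first yields `"confirm"`, and the same name resubmitted with `user_confirmed=True` yields `"accept"` with the apostrophe intact.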

This seems like the best all-around approach to me. You're already storing Unicode, so nothing should break if a user enters something you think is punctuation but actually isn't, and the chances of a user deliberately entering punctuation with malicious intent seem low (why would someone do that?). Additionally, you could apply a separate regex on the server side with standard punctuation [,.!? etc...] that you don't want to allow under any circumstances.
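
A server-side hard block of this kind could look like the sketch below. It lists only the four characters named above; the rest of the list is left open in the answer, so the set here is deliberately incomplete, and `HARD_BLOCKED` and `has_hard_blocked` are hypothetical names.

```python
# Only the four characters the answer names explicitly. Extend this set
# deliberately; it is not meant to be a complete list.
HARD_BLOCKED = frozenset(",.!?")


def has_hard_blocked(name):
    """Return True if name contains any character that is never allowed."""
    return any(ch in HARD_BLOCKED for ch in name)
```

Unlike the client-side confirmation step, this check gives the user no way to override it, so the server would reject the submission whenever it returns `True`.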

Finally, you could add a CAPTCHA as well to block spam bots trying to enter bad names maliciously.
