regex to filter all but whitelisted characters from a multi-language string
I am trying to clean up a string coming from a search box on a multi-language site.
Normally I would use a regex like:
$allowed = "-+?!,.;:\w\s";
$txt_search = preg_replace("/[^" . $allowed . "]?(.*?)[^" . $allowed . "]?/iu", "$1", $_GET['txt_search']);
and that works fine for English texts.
However, now I need to do the same when the texts entered can be in any language (Russian now, Chinese in the future).
How can I clean up the string while preserving "normal texts" in the original language?
I thought about switching to a blacklist (although I'd rather not...), but at the moment the regex just completely destroys all the original input.
You can use [\p{L}\p{N}] (any Unicode letter or number)
instead of \w; see http://www.php.net/manual/en/regexp.reference.unicode.php
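As a minimal sketch of that suggestion, assuming the goal is simply to delete every character outside the whitelist: the function name clean_search is hypothetical, and the underscore is added to the class only because \w in the question also matched it.

```php
<?php
// Hypothetical helper; the whitelist mirrors the one from the question,
// with \w swapped for Unicode letters (\p{L}), numbers (\p{N}) and "_".
function clean_search(string $input): string
{
    // The hyphen comes first so it is treated as a literal inside the class.
    $allowed = '-+?!,.;:\s\p{L}\p{N}_';

    // Delete every run of characters that is NOT in the whitelist.
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $result = preg_replace('/[^' . $allowed . ']+/u', '', $input);

    // preg_replace() returns null on error, for example on invalid UTF-8.
    return $result ?? '';
}

echo clean_search('Привет, мир! <script>'), "\n"; // Привет, мир! script
echo clean_search('你好 world #1'), "\n";          // 你好 world 1
```

Deleting the disallowed characters directly avoids the capture group from the original pattern, and the same class works for Russian and Chinese input.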
It is a common problem that Russian letters are not recognised by the \w pattern, so you can use
$allowed = "-+?!,.;:\w\sа-я";
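A rough sketch of how that whitelist could be applied, under a couple of assumptions: the function name clean_search_ru is hypothetical, and ё is added explicitly because it sits at U+0451, outside the а-я range (U+0430 to U+044F). The /i modifier together with /u is relied on to cover the uppercase letters.

```php
<?php
// Hypothetical helper illustrating the explicit Cyrillic range.
function clean_search_ru(string $input): string
{
    // а-я covers U+0430..U+044F; ё (U+0451) is listed separately.
    $allowed = '-+?!,.;:\w\sа-яё';

    // /u: treat pattern and subject as UTF-8; /i: match uppercase as well.
    $result = preg_replace('/[^' . $allowed . ']+/iu', '', $input);

    // null signals a PCRE error such as malformed UTF-8 input.
    return $result ?? '';
}

echo clean_search_ru('Привет <b>'), "\n"; // Привет b
```

This only helps with Russian; other scripts such as Chinese would each need their own range, which is where Unicode property classes are more general.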