regex to filter all but whitelisted characters from a multi-language string

2022-12-23 03:17

I am trying to clean up a string coming from a search box on a multi-language site.

Normally I would use a regex like:

$allowed = "-+?!,.;:\w\s";
$txt_search = preg_replace("/[^" . $allowed . "]?(.*?)[^" . $allowed . "]?/iu", "$1", $_GET['txt_search']);

and that works fine for English texts.

However, now I need to do the same when the texts entered can be in any language (Russian now, Chinese in the future).

How can I clean up the string while preserving "normal texts" in the original language?

I thought about switching to a blacklist (although I'd rather not...) but at the moment the regex just completely destroys all original input.

You can use [\p{L}\p{N}] instead of \w (inside $allowed, write just \p{L}\p{N}, since it is already within a character class); see http://www.php.net/manual/en/regexp.reference.unicode.php
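As a rough sketch of that suggestion, the whitelist could be applied as below. The function name clean_search is hypothetical, and the pattern is a simplified variant of the one in the question (it just deletes runs of disallowed characters), not the asker's exact regex. Note that \w also matches the underscore, which \p{L}\p{N} does not, so add _ to the list if it should survive.

```php
<?php
// Hypothetical helper (not from the original post): keeps letters and digits
// of any script, whitespace, and the punctuation listed in $allowed.
function clean_search(string $input): string
{
    // \p{L} = any letter, \p{N} = any number. Single quotes keep the
    // backslashes intact for PCRE. The leading "-" is a literal hyphen.
    $allowed = '-+?!,.;:\p{L}\p{N}\s';

    // The u modifier makes PCRE treat pattern and subject as UTF-8.
    $cleaned = preg_replace('/[^' . $allowed . ']+/u', '', $input);

    // preg_replace returns null on failure, e.g. when the input is not valid UTF-8.
    return $cleaned ?? '';
}

echo clean_search('Привет, мир! <script>'), "\n";
echo clean_search('你好, world #1'), "\n";
```

The input has to be valid UTF-8 for the u modifier to work; if the search string arrives in another encoding, it would need converting first.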

It is a common problem that Russian letters are not recognised by the \w pattern, so you can add the Cyrillic range explicitly:

$allowed = "-+?!,.;:\w\sа-я";
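One caveat with an explicit range, shown here as a small sketch: а-я covers U+0430 to U+044F, and the letter ё (U+0451) lies outside it, so it has to be listed separately. The helper name strip_disallowed is hypothetical, and \w is deliberately left out of the class so that only the range itself is being tested (whether \w matches Cyrillic depends on the PCRE flags in use). The source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper (not from the original post): removes every character
// that is not in the given character-class body.
function strip_disallowed(string $input, string $allowed): string
{
    // i = case-insensitive, u = UTF-8 mode, so а-я also covers А-Я.
    return preg_replace('/[^' . $allowed . ']+/iu', '', $input) ?? '';
}

$range_only = '-+?!,.;:\sа-я';   // ё and Ё are NOT inside а-я
$with_yo    = '-+?!,.;:\sа-яё';  // ё added explicitly; i also covers Ё

echo strip_disallowed('Ёлка и ёж', $range_only), "\n"; // ё/Ё are stripped
echo strip_disallowed('Ёлка и ёж', $with_yo), "\n";    // string is kept whole
```

This approach only scales to the scripts you list by hand, which is why the Unicode property classes are the more general option once further languages such as Chinese are added.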
