Regex to strip out everything but words and numbers (and Latin chars)
I'm trying to clean a POST string used in an AJAX request (sanitizing it before a DB query) so that it allows only alphanumeric characters, spaces (one between words, not multiple), hyphens ("-"), and Latin characters like "ç" and "é", but without success. Can anyone help or point me in the right direction?
This is the regex I'm using so far:
$string = preg_replace('/^[a-z0-9 àáâãäåçèéêëìíîïðñòóôõöøùúû-]+$/', '', mb_strtolower(utf8_encode($_POST['q'])));
Thank you.
$regEx = '/^[^\w\p{L}-]+$/iu';
\w matches alphanumerics (and the underscore).
\p{L} matches a single Unicode code point in the 'Letters' category (see the Unicode Categories section here).
- at the end of the character class matches a literal hyphen.
^ inside the character class negates it, so the regex matches the opposite of the class (anything you do not specify).
+ outside the character class says match 1 or more characters.
^ and $ outside the character class cause the engine to accept only matches that start at the beginning of a line and go until the end of the line. With preg_replace this means nothing is removed unless the whole string consists of disallowed characters, so drop the anchors to strip such characters wherever they occur.
After the pattern, the i modifier says ignore case, and the u modifier tells the pattern-matching engine that we're going to be sending UTF-8 data its way. The g modifier originally present has been removed since it's not necessary in PHP (global matching instead depends on which matching function is called).
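To make the effect of the anchors concrete, here is a minimal sketch comparing the anchored pattern with an unanchored variant. The sample input and variable names are made up for illustration, the space added to the allowed set is my assumption based on the question, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical sample input, not taken from the original post.
$input = 'Olá, mundo! ça-va? 123';

// Anchored pattern: only matches when the WHOLE string is made of
// disallowed characters, so this input comes back unchanged.
$anchored = preg_replace('/^[^\w\p{L}-]+$/iu', '', $input);

// Unanchored variant (assumption: a space is added to the allowed set):
// removes every run of disallowed characters wherever it occurs.
$unanchored = preg_replace('/[^\w\p{L} -]+/u', '', $input);

var_dump($anchored === $input); // bool(true)
echo $unanchored, "\n";         // Olá mundo ça-va 123
```

Since \p{L} accepts letters from every script, this variant is more permissive than an explicit list of Latin characters; whether that is acceptable depends on what the search field is meant to allow.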
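$string = mb_strtolower(utf8_encode($_POST['q']), 'UTF-8');
// PHP has no g modifier; preg_replace already replaces every match. The u modifier is needed for the UTF-8 characters.
$string = preg_replace('/[^a-z0-9 àáâãäåçèéêëìíîïðñòóôõöøùúû-]+/u', '', $string);
$string = preg_replace('/ +/', ' ', $string);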
Why not just use mysql_real_escape_string?
$string = preg_replace('/[^a-z0-9 àáâãäåçèéêëìíîïðñòóôõöøùúû\-]/u', '', mb_strtolower(utf8_encode($_POST['q']), 'UTF-8'));
$string = preg_replace( '/ +/', ' ', $string );
This should do the trick. Note that:
- the character class is negated by putting ^ right after the opening bracket, inside the character class
- you need the u flag when either the pattern or the subject contains Unicode (UTF-8) text
- it's better to specify the character set explicitly in mb_* functions, because otherwise they fall back on your system defaults, which may not be UTF-8
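- the hyphen is escaped (\- instead of -) so that it is unambiguously literal; an unescaped - at the very end of a character class is also treated as a literal, so the escape is a safeguard rather than a requirement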
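Putting those steps together, a rough sketch of a reusable helper might look like the following. The function name sanitize_query is hypothetical, and the sketch assumes the input is already valid UTF-8, so utf8_encode is left out (it converts from ISO-8859-1 and would garble input that is already UTF-8). It also trims leading and trailing spaces, which is my addition. The mbstring extension is required.

```php
<?php
// Hypothetical helper name; the allowed-character list is the one used above.
function sanitize_query(string $raw): string
{
    // Assumption: $raw is already valid UTF-8. Lowercase with an explicit charset.
    $s = mb_strtolower($raw, 'UTF-8');

    // Remove every character outside the whitelist.
    // preg_replace returns null on error (e.g. invalid UTF-8), hence the fallback.
    $s = preg_replace('/[^a-z0-9 àáâãäåçèéêëìíîïðñòóôõöøùúû\-]/u', '', $s) ?? '';

    // Collapse runs of spaces into one, then trim both ends (trim is an addition).
    $s = preg_replace('/ +/', ' ', $s) ?? '';

    return trim($s);
}

echo sanitize_query('  Crème   BRÛLÉE!!  no-1 '), "\n"; // prints: crème brûlée no-1
```

This only narrows what the search term can contain; the value should still be escaped or bound as a parameter when it goes into the query.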