PHP: How to check if a UTF-8 string is present in another string?
I really struggled the whole night to figure this out, but no luck. :(
In a form, the user inputs a word, and I need to check that the input contains no characters outside a preset table of characters:
abggwdḍefkkwhḤƐxqijlmnurṚɣsṢctṬwyzẒ
So far I'm using code I found on php.net:
$temp_s = mb_convert_encoding($post['word'],'UTF-16','UTF-8');
$temp_a = str_split($temp_s,4);
$temp_a_len = count($temp_a);
for($i=0; $i<$temp_a_len; $i++){
    $temp_a[$i] = mb_convert_encoding($temp_a[$i],'UTF-8','UTF-16');
    $pos = stripos( mb_strtolower($allowed), mb_strtolower($temp_a[$i]) );
    if($pos === false){
        echo '- '. mb_strtolower($temp_a[$i]) .' -is not allowed in '.mb_strtolower($allowed);
        return false;
    }
}
What am I doing wrong? If I submit the character ḍ, it outputs:
- ḍ -is not allowed in abggwdḌefkkwhḤƐxqijlmnurṚɣsṢctṬwyzẒ
UPDATE: Another question: how can I allow both the uppercase and lowercase versions of the characters in the $allowed list?
As simple as:
$unwanted = 'abggwdḍefkkwhḤƐxqijlmnurṚɣsṢctṬwyzẒ';
$badText = 'Foo baṚ Baz';
$goodText = '345235';
if (preg_match_all("/[$unwanted]/u", $badText, $matches)) {
    echo "Bad text is bad, invalid characters: " . join(', ', $matches[0]) . PHP_EOL;
}
if (preg_match_all("/[$unwanted]/u", $goodText, $matches)) {
    echo "Good text is bad, invalid characters: " . join(', ', $matches[0]) . PHP_EOL;
}
Note that your source code needs to be saved in UTF-8 and the input needs to be UTF-8 as well.
I'm really questioning the use of a UTF-8 blacklist though, since there are hundreds of thousands of code points. Blacklisting some of them seems like a useless uphill battle. If you disallow "Ṛ", why would you accept "Ŗ" or any of the other "R"-like characters? Catching them all is rather futile. Think about implementing a whitelist instead. (That is, if I'm understanding what you're trying to do at all. It's not really clear.)
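If the character table is actually meant as the set of allowed characters, the whitelist idea might look roughly like the sketch below. The function name isAllowedWord is made up for illustration, and the sketch assumes both the source file and the input are UTF-8. The i modifier combined with u is what lets "Ḍ" match "ḍ", which may also cover the upper/lowercase question from the update.

```php
<?php
// Hypothetical helper (not from the original post): returns true when every
// character of $word appears in $allowed, ignoring case. Assumes UTF-8 input.
function isAllowedWord(string $word, string $allowed): bool
{
    // preg_quote() escapes anything that is special inside a pattern,
    // including "-", "]" and "^", which matter inside a character class.
    $class = preg_quote($allowed, '/');

    // u = treat pattern and subject as UTF-8, i = case-insensitive.
    // \A and \z anchor to the very start and end of the subject.
    return preg_match('/\A[' . $class . ']+\z/ui', $word) === 1;
}

$allowed = 'abggwdḍefkkwhḤƐxqijlmnurṚɣsṢctṬwyzẒ';

var_dump(isAllowedWord('ḍar', $allowed));  // every character is in the list
var_dump(isAllowedWord('Ḍar', $allowed));  // uppercase variant of a listed character
var_dump(isAllowedWord('dar!', $allowed)); // "!" is not in the list
```

An empty string is rejected here because of the + quantifier; change it to * if empty input should pass.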
Note that characters could be decomposed, which would mean they won't match your expression. For example, ü can be the single character ü (U+00FC) or the sequence ü (U+0075 U+0308, which is u followed by a combining ¨). You should normalize characters to NFC (Canonical Decomposition followed by Canonical Composition), which means that any form of ü will be normalized to U+00FC. In PHP you do this with:
$badText = Normalizer::normalize($badText, Normalizer::FORM_C);
Unfortunately, the Normalizer class is not installed everywhere by default; it is part of the intl extension.
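Since the class may be missing, a guarded call is one option. The following is only a sketch: toNfc is a hypothetical helper name, the fallback of returning the input unchanged is an assumption about what you would want, and the "\u{...}" escapes need PHP 7 or later.

```php
<?php
// Hypothetical helper (not from the original post): normalize to NFC when the
// intl extension is loaded, otherwise hand back the input unchanged.
function toNfc(string $text): string
{
    if (!class_exists('Normalizer')) {
        return $text; // intl is not available, nothing we can do here
    }

    $normalized = Normalizer::normalize($text, Normalizer::FORM_C);

    // normalize() returns false on failure, e.g. for invalid UTF-8.
    return $normalized === false ? $text : $normalized;
}

$decomposed = "u\u{0308}"; // u followed by a combining diaeresis
$composed   = "\u{00FC}";  // precomposed ü

// True when intl is loaded; false when the fallback path is taken.
var_dump(toNfc($decomposed) === $composed);
```

Normalizing both the input and the character list before comparing keeps the two in the same form.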
The code you posted doesn't actually seem to give me any errors, but here is a shorter version. Maybe see if this does what you want.
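$input = 'ḍwhat';
$allowed = mb_strtolower('ḍabggwdefkkwhḤƐxqijlmnurṚɣsṢctṬwyzẒ');
// preg_split with an empty pattern and the u modifier splits per UTF-8 character
foreach (preg_split('//u', $input) as $c) {
    // skip the empty strings preg_split yields at the start and end
    if (mb_strlen($c) !== 0 && mb_strpos($allowed, mb_strtolower($c)) === FALSE) {
        echo '-' . $c . '- is not allowed in ' . $allowed;
        return false;
    }
}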
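The only thing I'd suggest is trying your original code with str_split($temp_s,2); instead, since 4 isn't always going to work and most UTF-16 characters are 2 bytes. Both can still break, though: characters outside the Basic Multilingual Plane take 4 bytes in UTF-16, so no fixed chunk size is safe.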
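A way to avoid the fixed-size chunking altogether is to split the UTF-8 string directly into characters. This is a sketch, not the code from php.net: utf8Chars is a made-up name, mb_str_split only exists from PHP 7.4 on, and the preg_split branch is the fallback for older versions.

```php
<?php
// Hypothetical helper (not from the original post): split a UTF-8 string into
// an array of characters without converting to UTF-16 first.
function utf8Chars(string $text): array
{
    // mb_str_split() is available from PHP 7.4 with mbstring loaded.
    if (function_exists('mb_str_split')) {
        return mb_str_split($text, 1, 'UTF-8');
    }

    // Fallback: an empty pattern with the u modifier splits between characters;
    // PREG_SPLIT_NO_EMPTY drops the empty strings at both ends.
    $parts = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

    return $parts === false ? [] : $parts; // false means invalid UTF-8
}

// "\u{1E0D}" is the precomposed ḍ, written as an escape to avoid ambiguity.
$chars = utf8Chars("\u{1E0D}what");
echo count($chars), PHP_EOL;
```

Each element can then be checked against the allowed list with mb_strpos, as in the shorter version above.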
There was no problem when I ran it in Eclipse:
$allowed = 'ḍabggwdefkkwhḤƐxqijlmnurṚɣsṢctṬwyzẒ';
$temp_s = mb_convert_encoding('ḍ','UTF-16','UTF-8');
$temp_a = str_split($temp_s,4);
$temp_a_len = count($temp_a);
for($i=0; $i<$temp_a_len; $i++){
    $temp_a[$i] = mb_convert_encoding($temp_a[$i],'UTF-8','UTF-16');
    $pos = stripos( mb_strtolower($allowed), mb_strtolower($temp_a[$i]) );
    if($pos === false){
        echo '- '. mb_strtolower($temp_a[$i]) .' -is not allowed in '.mb_strtolower($allowed);
    }
}