PHP How to check if a utf-8 string is present in another string?

2023-04-03 12:26

I really struggled the whole night to figure this out.. but :(

In a form, the user inputs a word and I need to check that the input doesn't contain characters outside a preset table of characters:

abggwdḍefkkwhḤƐxqijlmnurṚɣsṢctṬwyzẒ

So far I'm using code I found on php.net:

$temp_s = mb_convert_encoding($post['word'],'UTF-16','UTF-8');
$temp_a = str_split($temp_s,4);
$temp_a_len = count($temp_a);

for($i=0; $i<$temp_a_len; $i++){
    $temp_a[$i] = mb_convert_encoding($temp_a[$i],'UTF-8','UTF-16');

    $pos = stripos( mb_strtolower($allowed),  mb_strtolower($temp_a[$i]) );
    if($pos === false){
        echo '- '. mb_strtolower($temp_a[$i]) .' -is not allowed in '.mb_strtolower($allowed);
        return false;
    } 
}

What am I doing wrong? If I submit the character ḍ, it outputs:

- ḍ -is not allowed in abggwdḌefkkwhḤƐxqijlmnurṚɣsṢctṬwyzẒ

UPDATE: Another question: how can I allow both the uppercase and lowercase versions of the characters in the $allowed list?

As simple as:

$unwanted = 'abggwdḍefkkwhḤƐxqijlmnurṚɣsṢctṬwyzẒ';
$badText  = 'Foo baṚ Baz';
$goodText = '345235';

if (preg_match_all("/[$unwanted]/u", $badText, $matches)) {
    echo "Bad text is bad, invalid characters: " . join(', ', $matches[0]) . PHP_EOL;
}

if (preg_match_all("/[$unwanted]/u", $goodText, $matches)) {
    echo "Good text is bad, invalid characters: " . join(', ', $matches[0]) . PHP_EOL;
}

Note that your source code needs to be saved in UTF-8 and the input needs to be UTF-8 as well.
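If you want to verify the second condition before matching, a minimal sketch could look like this. The helper name isValidUtf8 is made up for illustration; mb_check_encoding comes from the mbstring extension, which is assumed to be loaded.

```php
<?php
/**
 * Hypothetical helper: true when $input is well-formed UTF-8.
 * Assumes the mbstring extension is available.
 */
function isValidUtf8(string $input): bool
{
    return mb_check_encoding($input, 'UTF-8');
}

var_dump(isValidUtf8("\u{1E0D}"));  // a complete UTF-8 encoded character
var_dump(isValidUtf8("\xE1\xB8"));  // a truncated multi-byte sequence
```

Rejecting malformed input early matters here because preg_match with the u modifier fails on invalid UTF-8 rather than simply reporting "no match".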

I'm really questioning the use of a UTF-8 blacklist though, since there are hundreds of thousands of code points. Blacklisting parts of them seems like a useless uphill battle. If you disallowed "Ṛ", why would you accept "Ŗ" or any of the other variants of "R"-like characters? Catching them all is rather futile. Think about implementing a whitelist instead. (That is, if I'm understanding what you're trying to do at all. It's not really clear.)
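For what it's worth, a whitelist check could be sketched roughly like this. It assumes the character table from the question is the set you want to accept; the function name isAllowedWord is invented for the example. The i modifier also covers the uppercase/lowercase part of the question.

```php
<?php
// Assumption: this is the set of characters the form should accept
// (copied from the question).
$allowed = 'abggwdḍefkkwhḤƐxqijlmnurṚɣsṢctṬwyzẒ';

/**
 * Hypothetical helper: true if every character of $word is in $allowed,
 * ignoring case.
 */
function isAllowedWord(string $word, string $allowed): bool
{
    // preg_quote() escapes characters that are special in a pattern;
    // '/' is passed because it is the delimiter used below.
    $class = preg_quote($allowed, '/');

    // ^ ... \z anchors the whole string, "u" enables UTF-8 mode,
    // "i" makes the match case-insensitive.
    return preg_match('/^[' . $class . ']+\z/iu', $word) === 1;
}

var_dump(isAllowedWord('abc', $allowed));
var_dump(isAllowedWord('ABC', $allowed));
var_dump(isAllowedWord('abc9', $allowed));
```

Anchoring the pattern means a single character outside the table makes the whole word fail, which is the opposite of the blacklist approach of hunting for bad characters one by one.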

Note that characters could be decomposed, which would mean they won't match your expression. For example, ü can be the character ü (U+00FC) or ü (U+0075 U+0308, which is u followed by a combining ¨). You should normalize characters to NFC (Canonical Decomposition followed by Canonical Composition), which means that any form of ü will be normalized to U+00FC. In PHP you do this with:

$badText = Normalizer::normalize($badText, Normalizer::FORM_C);

The Normalizer class is part of the intl extension, which unfortunately is not installed everywhere by default.
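Since Normalizer ships with the intl extension, one option is to guard the call. This is only a sketch: the helper name toNfc is made up, and returning the text unchanged when intl is missing is an assumption about what you'd want, not the only reasonable fallback.

```php
<?php
/**
 * Hypothetical helper: return $text in NFC when the intl extension is
 * available, otherwise return it unchanged.
 */
function toNfc(string $text): string
{
    if (!class_exists('Normalizer')) {
        // Assumption: without intl we simply skip normalization.
        return $text;
    }

    $normalized = Normalizer::normalize($text, Normalizer::FORM_C);

    // normalize() returns false on failure (for example, invalid UTF-8).
    return $normalized === false ? $text : $normalized;
}

$decomposed = "u\u{0308}";  // u followed by a combining diaeresis
$composed   = "\u{00FC}";   // precomposed ü

// True when intl is loaded; false if the fallback path was taken.
var_dump(toNfc($decomposed) === $composed);
```

Normalize both the user input and the character table before comparing, otherwise a decomposed character on either side will still fail to match.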

The code you posted doesn't actually seem to give me any errors, but here is a shorter version. Maybe see if this does what you want.

$input = 'ḍwhat';

$allowed = mb_strtolower('ḍabggwdefkkwhḤƐxqijlmnurṚɣsṢctṬwyzẒ');

foreach (preg_split('//u', $input) as $c) {
  if (mb_strlen($c) !== 0 && mb_strpos($allowed, mb_strtolower($c)) === FALSE) {
    echo '-' . $c . '- is not allowed in ' . $allowed;
    return false;
  }
}

The only thing I'd suggest is trying your original code with str_split($temp_s,2); instead, since 4 isn't always going to work and most UTF-16 characters are 2 bytes. Both will potentially break though, because characters outside the Basic Multilingual Plane take 4 bytes in UTF-16.
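To avoid guessing a byte width at all, you can split on characters rather than bytes. A rough sketch, with a made-up helper name utf8Chars, assuming the input is valid UTF-8:

```php
<?php
/**
 * Hypothetical helper: split a UTF-8 string into an array of characters.
 */
function utf8Chars(string $text): array
{
    if (function_exists('mb_str_split')) {
        // mb_str_split() is available from PHP 7.4 (mbstring extension).
        return mb_str_split($text, 1, 'UTF-8');
    }

    // Fallback: an empty pattern with "u" splits between code points;
    // PREG_SPLIT_NO_EMPTY drops the empty leading and trailing pieces.
    return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(utf8Chars('ḍwhat'));
```

Either branch returns one array element per code point, so there is no round trip through UTF-16 and no fixed chunk size to get wrong.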

There was no problem when I ran it in Eclipse:

$allowed = 'ḍabggwdefkkwhḤƐxqijlmnurṚɣsṢctṬwyzẒ';
$temp_s = mb_convert_encoding('ḍ','UTF-16','UTF-8');
$temp_a = str_split($temp_s,4);
$temp_a_len = count($temp_a);

for($i=0; $i<$temp_a_len; $i++){
    $temp_a[$i] = mb_convert_encoding($temp_a[$i],'UTF-8','UTF-16');

    $pos = stripos( mb_strtolower($allowed),  mb_strtolower($temp_a[$i]) );
    if($pos === false){
        echo '- '. mb_strtolower($temp_a[$i]) .' -is not allowed in '.mb_strtolower($allowed);
    }
}
