What's this regex doing?

Asked 2023-03-29 06:35

I've found this regex in a script I'm customizing. Can someone tell me what it's doing?

function test($text) {
    $regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} ) | ./x';
    return preg_replace($regex, '$1', $text);
}

Inside of the capturing group there are four options:

[\x00-\x7F]
[\xC0-\xDF][\x80-\xBF]
[\xE0-\xEF][\x80-\xBF]{2}
[\xF0-\xF7][\x80-\xBF]{3}

If none of these patterns matches at a given location, the . outside the capturing group matches any single character instead (a single byte here, since the regex has no u modifier).

The preg_replace call will iterate over $text finding all non-overlapping matches, replacing each match with whatever was captured.

There are two possibilities here: either the entire match was inside the capturing group, so the replacement doesn't change $text, or the . at the end matched a single character and that character is removed from $text.

Here are some basic examples:

If a character in the range \xF8-\xFF appears in the text, it will always be removed
A character in \xC0-\xDF will be removed unless followed by a character in \x80-\xBF
A character in \xE0-\xEF will be removed unless followed by two characters in \x80-\xBF
A character in \xF0-\xF7 will be removed unless followed by three characters in \x80-\xBF
A character in \x80-\xBF will be removed unless it was matched as a part of one of the above cases
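
The cases above can be checked with a short script. This is only a sketch: strip_invalid_bytes is a hypothetical name for the function from the question, and the input strings are made-up samples. bin2hex is used so the surviving bytes are visible.

```php
<?php
// The function from the question, renamed for this sketch (hypothetical name).
function strip_invalid_bytes(string $text): string {
    $regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} ) | ./x';
    return preg_replace($regex, '$1', $text);
}

// Double-quoted strings so that \xNN escapes become raw bytes.
echo bin2hex(strip_invalid_bytes("a\xFFb")), "\n";     // \xFF can never start a sequence
echo bin2hex(strip_invalid_bytes("a\xC3\xA9b")), "\n"; // lead byte plus one continuation byte
echo bin2hex(strip_invalid_bytes("a\xC3b")), "\n";     // lead byte without its continuation byte
echo bin2hex(strip_invalid_bytes("a\x80b")), "\n";     // stray continuation byte
```

Only the second input should come back unchanged; in the other three the offending byte is dropped and "ab" remains.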

The purpose appears to be to "clean" UTF-8 encoded text. The part in the capturing group,

( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} )

...roughly matches a valid UTF-8 byte sequence, which may be one to four bytes long. The value of the first byte determines how long that particular byte sequence should be.
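
To illustrate why "roughly" matters, here is a small sketch comparing the group with a strict check. The $loose pattern is my own anchored variant of the capturing group (not part of the original script), and the strict check relies on preg_match with the u modifier, which returns false when the subject is not well-formed UTF-8. The sample byte strings are assumptions chosen to hit the gaps.

```php
<?php
// Capturing-group part of the regex from the question, anchored so it must
// cover the whole string (this anchored form is an assumption for the demo).
$loose = '/^(?: [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} )*$/xD';

$samples = [
    'overlong NUL'   => "\xC0\x80",         // two-byte encoding of U+0000
    'surrogate'      => "\xED\xA0\x80",     // U+D800, not allowed in UTF-8
    'above U+10FFFF' => "\xF5\x80\x80\x80", // lead byte beyond the Unicode range
];

foreach ($samples as $label => $bytes) {
    $passesLoose = preg_match($loose, $bytes) === 1;
    // With /u, PCRE validates the subject first and preg_match returns false on failure.
    $passesStrict = preg_match('//u', $bytes) === 1;
    printf("%-15s loose=%s strict=%s\n", $label,
        var_export($passesLoose, true), var_export($passesStrict, true));
}
```

All three samples have the right shape (a lead byte followed by the expected number of continuation bytes), so the loose pattern accepts them, while the strict check rejects them.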

Since the replacement is simply '$1', valid byte sequences will be plugged right back into the output. Any byte that's not matched by that part will instead be matched by the dot (.), and effectively removed.

The most important thing to know about this technique is that you should never have to use it. If you find invalid UTF-8 byte sequences in your UTF-8 encoded text, it means one of two things: it's not really UTF-8, or it's been corrupted. Instead of "cleaning" it, you should find out how it got dirty and fix that problem.
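
One way to act on that advice is to reject bad input loudly at the boundary instead of silently stripping bytes. The following is a minimal sketch, not a drop-in fix: require_utf8 is a hypothetical helper name, and throwing InvalidArgumentException is just one possible policy.

```php
<?php
// Hypothetical helper: returns $text unchanged if it is well-formed UTF-8,
// otherwise throws so the caller can trace where the bad data came from.
function require_utf8(string $text): string {
    // preg_match with the u modifier returns false (not 0) for malformed UTF-8.
    if (preg_match('//u', $text) !== 1) {
        throw new InvalidArgumentException('Input is not well-formed UTF-8');
    }
    return $text;
}

try {
    require_utf8("caf\xC3\xA9"); // "café" encoded as UTF-8, passes
    require_utf8("caf\xE9");     // "café" encoded as Latin-1, throws
} catch (InvalidArgumentException $e) {
    echo 'Rejected: ', $e->getMessage(), "\n";
}
```

In the second call the likely cause is a mislabelled Latin-1 string, which is the "it's not really UTF-8" case; converting from the real source encoding would preserve the é that the cleaning regex would throw away.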
