How to iterate a UTF-8 string in PHP?

2023-01-15 01:20 Q&A author:

How to iterate a UTF-8 string character by character using indexing?

When you index a UTF-8 string with the bracket operator, e.g. $str[0], you get single bytes, so a non-ASCII character is spread over 2 or more elements.

For example:

$str = "Kąt";
$str[0] = "K";
$str[1] = "�";
$str[2] = "�";
$str[3] = "t";

but I would like to have:

$str[0] = "K";
$str[1] = "ą";
$str[2] = "t";

This is possible with mb_substr, but it is extremely slow, e.g.:

mb_substr($str, 0, 1) = "K"
mb_substr($str, 1, 1) = "ą"
mb_substr($str, 2, 1) = "t"

Is there another way to iterate over the string character by character without using mb_substr?

Use preg_split. With the "u" modifier it treats the pattern and subject as UTF-8.

$chrArray = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
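A minimal usage sketch of the call above, assuming the input is valid UTF-8. preg_split returns false on failure (for example on malformed UTF-8 with the "u" modifier), so the sketch checks for that. The variable names are only illustrative.

```php
<?php
$str = "Kąt";

// Split on the empty pattern; the "u" modifier makes PCRE treat the
// subject as UTF-8, so each array element is one code point.
$chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

if ($chars === false) {
    // preg_split() returns false on failure, e.g. for malformed UTF-8.
    $chars = [];
}

foreach ($chars as $index => $char) {
    echo $index, ': ', $char, PHP_EOL;
}
```

Note that this builds the whole array in memory before the loop starts.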

preg_split will fail on very large strings with a memory exception, and mb_substr is indeed slow, so here is simple and effective code that you could use:

function nextchar($string, &$pointer){
    if(!isset($string[$pointer])) return false;
    $char = ord($string[$pointer]);
    if($char < 128){
        return $string[$pointer++];
    }else{
        if($char < 224){
            $bytes = 2;
        }elseif($char < 240){
            $bytes = 3;
        }else{
            $bytes = 4;
        }
        $str =  substr($string, $pointer, $bytes);
        $pointer += $bytes;
        return $str;
    }
}

I used this for looping through a multibyte string character by character. If I change it to the code below, the performance difference is huge:

function nextchar($string, &$pointer){
    if(!isset($string[$pointer])) return false;
    return mb_substr($string, $pointer++, 1, 'UTF-8');
}

Looping over a string 10000 times with the code below took 3 seconds with the first version and 13 seconds with the second:

function microtime_float(){
    list($usec, $sec) = explode(' ', microtime());
    return ((float)$usec + (float)$sec);
}

$source = 'árvíztűrő tükörfúrógépárvíztűrő tükörfúrógépárvíztűrő tükörfúrógépárvíztűrő tükörfúrógépárvíztűrő tükörfúrógép';

$t = Array(
    0 => microtime_float()
);

for($i = 0; $i < 10000; $i++){
    $pointer = 0;
    while(($chr = nextchar($source, $pointer)) !== false){
        //echo $chr;
    }
}

$t[] = microtime_float();

echo ($t[1] - $t[0]) . PHP_EOL . PHP_EOL;

In answer to comments posted by @Pekla and @Col. Shrapnel I have compared preg_split with mb_substr.

[Image: benchmark results for preg_split vs. mb_substr]

The image shows that preg_split took 1.2 s, while mb_substr took almost 25 s.

Here is the code of the functions:

function split_preg($str){
    return preg_split('//u', $str, -1);     
}

function split_mb($str){
    $length = mb_strlen($str);
    $chars = array();
    for ($i=0; $i<$length; $i++){
        $chars[] = mb_substr($str, $i, 1);
    }
    $chars[] = "";
    return $chars;
}

Using Lajos Meszaros' wonderful function as inspiration I created a multi-byte string iterator class.

// Multi-Byte String iterator class
class MbStrIterator implements Iterator
{
    private $iPos   = 0;
    private $iSize  = 0;
    private $sStr   = null;

    // Constructor
    public function __construct(/*string*/ $str)
    {
        // Save the string
        $this->sStr     = $str;

        // Calculate the size of the current character
        $this->calculateSize();
    }

    // Calculate size
    private function calculateSize() {

        // If we're done already
        if(!isset($this->sStr[$this->iPos])) {
            return;
        }

        // Get the character at the current position
        $iChar  = ord($this->sStr[$this->iPos]);

        // If it's a single byte, set it to one
        if($iChar < 128) {
            $this->iSize    = 1;
        }

        // Else, it's multi-byte
        else {

            // Figure out how long it is
            if($iChar < 224) {
                $this->iSize = 2;
            } else if($iChar < 240){
                $this->iSize = 3;
            } else if($iChar < 248){
                $this->iSize = 4;
            } else if($iChar < 252){
                $this->iSize = 5;
            } else {
                $this->iSize = 6;
            }
        }
    }

    // Current
    public function current() {

        // If we're done
        if(!isset($this->sStr[$this->iPos])) {
            return false;
        }

        // Else if we have one byte
        else if($this->iSize == 1) {
            return $this->sStr[$this->iPos];
        }

        // Else, it's multi-byte
        else {
            return substr($this->sStr, $this->iPos, $this->iSize);
        }
    }

    // Key
    public function key()
    {
        // Return the current position
        return $this->iPos;
    }

    // Next
    public function next()
    {
        // Increment the position by the current size and then recalculate
        $this->iPos += $this->iSize;
        $this->calculateSize();
    }

    // Rewind
    public function rewind()
    {
        // Reset the position and size
        $this->iPos     = 0;
        $this->calculateSize();
    }

    // Valid
    public function valid()
    {
        // Return if the current position is valid
        return isset($this->sStr[$this->iPos]);
    }
}

It can be used like so:

foreach(new MbStrIterator("Kąt") as $c) {
    echo "{$c}\n";
}

Which will output:

K
ą
t

Or, if you also want to know the position of each character's start byte:

foreach(new MbStrIterator("Kąt") as $i => $c) {
    echo "{$i}: {$c}\n";
}

Which will output:

0: K
1: ą
3: t

You could parse each byte of the string and determine whether it is a single (ASCII) character or the start of a multi-byte character:

The UTF-8 encoding is variable-width, with each character represented by 1 to 4 bytes. Each byte has 0–4 leading consecutive '1' bits followed by a '0' bit to indicate its type. 2 or more '1' bits indicates the first byte in a sequence of that many bytes.

You would walk through the string and, instead of increasing the position by 1, read the current character in full and then increase the position by that character's length in bytes.

The Wikipedia article has the interpretation table for each byte value [retrieved 2010-10-01]:

   0-127 Single-byte encoding (compatible with US-ASCII)
 128-191 Second, third, or fourth byte of a multi-byte sequence
 192-193 Overlong encoding: start of 2-byte sequence, 
         but would encode a code point ≤ 127
  ........
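The walk described above can be sketched with bit masks on the lead byte. This is an illustrative sketch, not code from the answer: the function names utf8SequenceLength and utf8Chars are made up here, the input is assumed to be valid UTF-8, and a stray continuation byte or invalid lead byte is simply returned as a one-byte element.

```php
<?php
// Hypothetical helper: length in bytes of the sequence that starts with
// the given lead byte, or 1 for anything that is not a multi-byte lead.
function utf8SequenceLength(int $byte): int
{
    if (($byte & 0x80) === 0x00) return 1; // 0xxxxxxx: ASCII
    if (($byte & 0xE0) === 0xC0) return 2; // 110xxxxx
    if (($byte & 0xF0) === 0xE0) return 3; // 1110xxxx
    if (($byte & 0xF8) === 0xF0) return 4; // 11110xxx
    return 1; // continuation byte (10xxxxxx) or invalid lead byte
}

// Hypothetical helper: walk the string, advancing by each character's
// byte length instead of by 1.
function utf8Chars(string $str): array
{
    $chars = [];
    $length = strlen($str); // byte length
    for ($pos = 0; $pos < $length; $pos += $size) {
        $size = utf8SequenceLength(ord($str[$pos]));
        $chars[] = substr($str, $pos, $size);
    }
    return $chars;
}

print_r(utf8Chars("Kąt"));
```

The sketch relies on strlen and substr counting bytes, so it assumes mbstring.func_overload is not in effect.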

I had the same issue as OP and I try to avoid regex in PHP since it fails or even crashes with long strings. I used Mészáros Lajos' answer with some changes since I have mbstring.func_overload set to 7.

function nextchar($string, &$pointer, &$asciiPointer){
    if(!isset($string[$asciiPointer])) return false;
    $char = ord($string[$asciiPointer]);
    if($char < 128){
        $pointer++;
        return $string[$asciiPointer++];
    }else{
        if($char < 224){
            $bytes = 2;
        }elseif($char < 240){
            $bytes = 3;
        }elseif($char < 248){
            $bytes = 4;
        }elseif($char < 252){
            $bytes = 5;
        }else{
            $bytes = 6;
        }
        $str =  substr($string, $pointer++, 1);
        $asciiPointer+= $bytes;
        return $str;
    }
}

With mbstring.func_overload set to 7, substr actually calls mb_substr, so substr returns the right value in this case. I had to add a second pointer: $pointer counts characters, while $asciiPointer counts bytes. The character position is used for substr (since it's actually mb_substr), while the byte position is used for reading a single byte in this fashion: $string[$index].

Obviously if PHP ever decides to fix the [] access to work properly with multi-byte values, this will fail. But also, this fix wouldn't be needed in the first place.

I think the most efficient solution would be to work through the string using mb_substr. In each iteration of the loop, mb_substr would be called twice (to find the next character and the remaining string). It would pass only the remaining string to the next iteration. This way, the main overhead in each iteration would be finding the next character (done twice), which takes only one to five or so operations, depending on the byte length of the character.

If this description is not clear, let me know and I'll provide a working PHP function.
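One possible reading of the description above, as a rough sketch rather than the answerer's own function: the name mbEachChar is hypothetical, and the mbstring extension is assumed to be available. Each iteration calls mb_substr twice, once for the next character and once for the remaining string. Whether this is actually faster than the other approaches would need measuring, since each iteration also copies the remaining string.

```php
<?php
// Hypothetical function name; one interpretation of the approach
// described in the text, not a benchmarked implementation.
function mbEachChar(string $str, string $encoding = 'UTF-8'): array
{
    $chars = [];
    while ($str !== '') {
        // First call: take the next character.
        $chars[] = mb_substr($str, 0, 1, $encoding);
        // Second call: keep only the remaining string for the next pass.
        $str = mb_substr($str, 1, null, $encoding);
    }
    return $chars;
}

print_r(mbEachChar('Kąt'));
```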

Since PHP 7.4 you can use mb_str_split.

https://www.php.net/manual/en/function.mb-str-split.php

$str = 'Kąt';
$chars = mb_str_split($str);
var_dump($chars);

array(3) {
  [0] =>
  string(1) "K"
  [1] =>
  string(2) "ą"
  [2] =>
  string(1) "t"
}
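To iterate directly, the result can be fed to foreach. A small sketch, assuming PHP 7.4+ with the mbstring extension; passing the encoding explicitly is a choice made here so the result does not depend on the configured internal encoding.

```php
<?php
$str = 'Kąt';

// mb_str_split(string, chunk length in characters, encoding)
$chars = mb_str_split($str, 1, 'UTF-8');

foreach ($chars as $index => $char) {
    // strlen() reports bytes, which shows the multi-byte character.
    echo $index, ' => ', $char, ' (', strlen($char), ' byte(s))', PHP_EOL;
}
```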
