How do I find the number of bytes within UTF-8 string with PHP?

2022-12-21 20:05 问答作者：

I have the following function from the php.net site to determine the # of bytes in an ASCII and UTF-8 string:

<?php 
/** 
 * Count the number of bytes of a given string. 
 * Input string is expected to be ASCII or UTF-8 encoded. 
 * Warning: the function doesn't return the number of chars 
 * in the string, but the number of bytes. 
 * 
 * @param string $str The string to compute number of bytes 
 * 
 * @return The length in bytes of the given string. 
 */ 
function strBytes($str) 
{ 
  // STRINGS ARE EXPECTED TO BE IN ASCII OR UTF-8 FORMAT 

  // Number of characters in string 
  $strlen_var = strlen($str); 

  // string bytes counter 
  $d = 0; 

 /* 
  * Iterate over every character in the string, 
  * escaping with a slash or encoding to UTF-8 where necessary 
  */ 
  for ($c = 0; $c < $strlen_var; ++$c) { 

      $ord_var_c = ord($str{$d}); 

      switch (true) { 
          case (($ord_var_c >= 0x20) && ($ord_var_c <= 0x7F)): 
              // characters U-00000000 - U-0000007F (same as ASCII) 
              $d++; 
              break; 

          case (($ord_var_c & 0xE0) == 0xC0): 
              // characters U-00000080 - U-000007FF, mask 110XXXXX 
              // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
              $d+=2; 
              break; 

          case (($ord_var_c & 0xF0) == 0xE0): 
              // characters U-00000800 - U-0000FFFF, mask 1110XXXX 
              // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
              $d+=3; 
              break; 

          case (($ord_var_c & 0xF8) == 0xF0): 
              // characters U-00010000 - U-001FFFFF, mask 11110XXX 
              // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
              $d+=4; 
              break; 

          case (($ord_var_c & 0xFC) == 0xF8): 
              // characters U-00200000 - U-03FFFFFF, mask 111110XX 
              // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
              $d+=5; 
              break; 

          case (($ord_var_c & 0xFE) == 0xFC): 
              // characters U-04000000 - U-7FFFFFFF, mask 1111110X 
              // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
              $d+=6; 
              break; 
          default: 
            $d++;    
      } 
  } 

  return $d; 
} 
?>

However when I try this with Russian (e.g. По своей природе компьютеры могут работать лишь с числами. И для того, чтобы они могли хранить в памяти буквы или другие символы, каждому такому символу должно быть поставлено в соответствие число.). It doesn't seem to return the correct number of bytes.

The switch statement is using the default condition. Any ideas why Russian characters would not be working as expected? Or would there be better options for this.

I am asking this as I need to shorten a UTF-8 string to a certain number of bytes. i.e. I can only send a max. of 169 bytes of JSON data to the iPhone APNS in my situation (excluding the other packet data).

Reference: PHP strlen - Manual (Paolo Comme开发者_如何学Gont on 10-Jan-2007 03:58)

I am asking this as I need to shorten a utf-8 string to a certain number of bytes.

mb_strcut() does exactly this, though you might not be able to tell from the barely comprehensible documentation.

strlen() returns the number of bytes.

Shortening a multibyte string to a certain number of bytes is a separate task. You will need to take care not to chop the string off in the middle of a multibyte sequence as you shorten it.

The other thing you need to handle is that when you put a string into json notation, it might need more bytes to represent it as json. For example, if your string contains a double quote character. It needs to be escaped, and the backslash character will add one byte. There's other characters that need to be escaped too. Point is, it can get larger. I assume the byte limit is on the total json payload, so you do need to account for the json syntax itself, as well as any escaping that json will impose on your string.

An unoptimized, kinda hacky way to do it is to chop the string, at say 5 bytes more than your limit, using substr(). Now use mb_strlen() to get number of characters, and mb_substr() to remove the last character. Now encode it as json, and measure the bytes via strlen(). Enter a loop, which keeps chopping off the last character using mb_substr(), encodes as json, and again measure bytes using strlen(). The loop terminates when the number of bytes is acceptable.

If you wish to find the byte length of a multi-byte string when you are using mbstring.func_overload 2 and UTF-8 strings, then you can use the following:

mb_strlen($utf8_string, 'latin1');

In PHP 5, mb_strlen should return the number of characters ; and strlen should return the number of bytes.

For instance, this portion of code :

$string = 'По своей природе компьютеры могут работать лишь с числами. И для того, чтобы они могли хранить в памяти буквы или другие символы, каждому такому символу должно быть поставлено в соответствие число';
echo mb_strlen($string, 'UTF-8') . '<br />';
echo strlen($string);

Should get you the following output :

196
359

As a sidenote : this is one the the things that PHP 6 will change : PHP 6 will be using Unicode by default, which means strlen should, in PHP 6, return the number of characters, and not a number of bytes anymore.

Count of Bytes <> String length!

to get the count of byte you can use (php4,5) strlen. to get the unicode string (utf8 encoded) length you can use mb_strlen (take care about function overloading from that extension) or you can simply count all bytes which do not have the 8th bit set.

8th bit means for this unicodechar is coming at least one more byte from input.

继续阅读：byte php string strlen utf-8

How do I find the number of bytes within UTF-8 string with PHP?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？