Should there be something like 'bytelen' (along with 'strlen')?

2022-12-22 17:16 Q&A author:

In my opinion the 'strlen' function should only return the number of characters in a string. Nothing else. And it does, whether it counts ASCII characters or Unicode characters. A character is a character, pointing to a given position on an ASCII table or a UTF-8 table. Nothing more.

If you would like to know, for whatever reason, the byte length of a string, then you should use a different function. I am a newbie in PHP scripting, so I have not found that function yet. (Should it be something like 'bytelen()'?)

mb_strlen() does what you're after: it counts characters, whereas strlen() counts bytes.

Yes, that would be the most logical design. However, PHP was not planned to support multibyte charsets from the beginning. Instead, it has evolved over the years in a rather chaotic manner. You've tagged your question as PHP 4, but even PHP 5 does not have decent Unicode support yet (and I don't think that will change in the near future).

There are a few reasons for this:

PHP is not a closed-source commercial product owned by a company with a centralized design controlled by enterprise rules.
PHP was released in 1995 as a personal project by someone who needed some functionality in his static home page; at that time, it had no need for Unicode support.
If you modify core functions like strlen(), you must do it in a way that doesn't break existing functionality. That's not easy. Writing a new, separate function is much easier.

Update

Sorry, I forgot the second part of your question. If you need to handle Unicode strings you have to use a separate set of functions:

http://es.php.net/manual/en/book.mbstring.php
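To make the difference concrete, here is a minimal sketch comparing a byte-oriented function with its mbstring counterpart. It assumes the mbstring extension is loaded and that the string is UTF-8; the sample string and variable names are only for illustration.

```php
<?php
// "Über" written as explicit UTF-8 bytes, so the file encoding doesn't matter
$s = "\xC3\x9Cber";

// substr() cuts by byte and can split a multibyte character in half
$byteCut = substr($s, 0, 1);

// mb_substr() cuts by character when told the encoding
$charCut = mb_substr($s, 0, 1, 'UTF-8');

// mb_strtolower() knows how to lowercase non-ASCII letters
$lower = mb_strtolower($s, 'UTF-8');

echo bin2hex($byteCut), "\n"; // c3   (half of "Ü")
echo bin2hex($charCut), "\n"; // c39c (the whole "Ü")
echo bin2hex($lower), "\n";   // c3bc626572 ("über")
```

Passing the encoding explicitly avoids depending on whatever internal encoding mbstring happens to be configured with.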

You might also find these chapters interesting:

http://es.php.net/manual/en/book.iconv.php
http://es.php.net/manual/en/book.unicode.php

Please take note of the PHP version required by each function you are planning to use; PHP 4 is pretty old.

If I'm not grossly misunderstanding you, then strlen() is your 'bytelen()', as alluded to in the other responses here.
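If you really want the name, a wrapper along these lines would do. This is only a sketch: bytelen() is a hypothetical helper, not a built-in. It prefers mb_strlen() with the '8bit' encoding when mbstring is available, which should keep counting bytes even on old setups where mbstring.func_overload has replaced strlen().

```php
<?php
// Hypothetical helper: 'bytelen' is not part of PHP itself
function bytelen($str)
{
    // '8bit' treats every byte as one unit, so this is a byte count
    if (function_exists('mb_strlen')) {
        return mb_strlen($str, '8bit');
    }
    // Without mbstring, strlen() cannot be overloaded and counts bytes
    return strlen($str);
}

$ascii = 'abc';
$utf8  = "h\xC3\xA9llo"; // "héllo" as explicit UTF-8 bytes

echo bytelen($ascii), "\n"; // 3
echo bytelen($utf8), "\n";  // 6
```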

strlen() itself has no support for UTF-8 or other multibyte character sets; if you want a strlen() that counts characters, you'll need mb_strlen().
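A quick way to see it for yourself, assuming the mbstring extension is loaded, the string is UTF-8, and strlen() has not been overloaded (the sample string is just an illustration):

```php
<?php
// "naïve €" written as explicit UTF-8 bytes
$s = "na\xC3\xAFve \xE2\x82\xAC";

$bytes = strlen($s);             // counts bytes
$chars = mb_strlen($s, 'UTF-8'); // counts characters (code points)

echo $bytes, "\n"; // 10
echo $chars, "\n"; // 7
```

The three extra bytes come from "ï" (two bytes) and "€" (three bytes).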

Pentium10's function strBytes($str), from glancing over it (not testing), looks like it would be a good alternative if you know your encoding is UTF-8 and you're stuck with a very old version of PHP 4 for some reason.

(And I do recommend taking a look at Álvaro G. Vicario's post for the reasons behind this behaviour. Proper, native UTF-8 support is due to come with PHP 6.)

    /**
     * Count the number of bytes of a given string. 
     * Input string is expected to be ASCII or UTF-8 encoded. 
     * Warning: the function doesn't return the number of chars 
     * in the string, but the number of bytes. 
     * 
     * @param string $str The string to compute number of bytes 
     * 
     * @return int The length in bytes of the given string.
     */ 
    function strBytes($str) 
    { 
      // STRINGS ARE EXPECTED TO BE IN ASCII OR UTF-8 FORMAT 

      // Loop bound: the character count if mbstring.func_overload has
      // replaced strlen(), otherwise this is already the byte count
      $strlen_var = strlen($str);

      // string bytes counter 
      $d = 0; 

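     /*
      * Walk the string one character at a time, advancing the byte
      * offset $d by the width announced by each UTF-8 lead byte
      */
      for ($c = 0; $c < $strlen_var; ++$c) {

          // Stop at the end of the string; without this check $d overshoots
          // whenever strlen() counts bytes rather than characters
          if (!isset($str[$d])) {
              break;
          }

          // The original $str{$d} syntax was removed in PHP 8
          $ord_var_c = ord($str[$d]);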

          switch (true) { 
              case (($ord_var_c >= 0x20) && ($ord_var_c <= 0x7F)): 
                  // characters U-00000000 - U-0000007F (same as ASCII) 
                  $d++; 
                  break; 

              case (($ord_var_c & 0xE0) == 0xC0): 
                  // characters U-00000080 - U-000007FF, mask 110XXXXX 
                  // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
                  $d+=2; 
                  break; 

              case (($ord_var_c & 0xF0) == 0xE0): 
                  // characters U-00000800 - U-0000FFFF, mask 1110XXXX 
                  // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
                  $d+=3; 
                  break; 

              case (($ord_var_c & 0xF8) == 0xF0): 
                  // characters U-00010000 - U-001FFFFF, mask 11110XXX 
                  // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
                  $d+=4; 
                  break; 

              case (($ord_var_c & 0xFC) == 0xF8): 
                  // characters U-00200000 - U-03FFFFFF, mask 111110XX 
                  // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
                  $d+=5; 
                  break; 

              case (($ord_var_c & 0xFE) == 0xFC): 
                  // characters U-04000000 - U-7FFFFFFF, mask 1111110X 
                  // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
                  $d+=6; 
                  break; 
              default:
                  // control characters and stray continuation bytes
                  $d++;
                  break;
          } 
      } 

      return $d; 
    }
