Truncate a UTF-8 string to fit a given byte count in PHP

Say we have a UTF-8 string $s and we need to shorten it so it can be stored in N bytes. Blindly truncating it to N bytes could mess it up. But decoding it to find the character boundaries is a drag. Is there a tidy way?

[Edit 20100414] In addition to S.Mark’s answer: mb_strcut(), I recently found another function to do the job: grapheme_extract($s, $n, GRAPHEME_EXTR_MAXBYTES); from the intl extension. Since intl is an ICU wrapper, I have a lot of confidence in it.


Edit: S.Mark's answer is actually better than mine - PHP has a (badly documented) builtin function that solves exactly this problem.

Original "back to the bits" answer follows:

  • Truncate to the desired byte count
  • If the last byte is a lead byte (starts with 110, 1110 or 11110 in binary), drop it as well
  • Otherwise, if the second-to-last byte starts with 1110 or 11110 (binary), drop the last 2 bytes
  • Otherwise, if the third-to-last byte starts with 11110 (binary), drop the last 3 bytes

This ensures that you don't have an incomplete character dangling at the end, which is the main thing that can go wrong when truncating UTF-8.
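A minimal sketch of this byte-level approach in PHP, assuming the input is valid UTF-8. The function name utf8_truncate_bytes is made up for illustration. It looks back over at most three bytes for a lead byte and drops the whole sequence if the cut left it incomplete.

```php
<?php
// Hypothetical helper, not a built-in. Assumes $s is valid UTF-8.
function utf8_truncate_bytes(string $s, int $maxBytes): string
{
    if (strlen($s) <= $maxBytes) {
        return $s; // already fits, nothing to cut
    }

    $cut = substr($s, 0, max(0, $maxBytes));
    $len = strlen($cut);

    // Walk back over at most three trailing bytes looking for a lead byte.
    for ($back = 1; $back <= 3 && $back <= $len; $back++) {
        $byte = ord($cut[$len - $back]);

        if (($byte & 0xC0) === 0x80) {
            continue; // 10xxxxxx: continuation byte, keep looking
        }

        // How many bytes the sequence starting at this byte needs.
        if ($byte >= 0xF0) {
            $need = 4;      // 11110xxx
        } elseif ($byte >= 0xE0) {
            $need = 3;      // 1110xxxx
        } elseif ($byte >= 0xC0) {
            $need = 2;      // 110xxxxx
        } else {
            $need = 1;      // 0xxxxxxx (ASCII)
        }

        // If the sequence was cut short, drop its partial bytes.
        return $need > $back ? substr($cut, 0, $len - $back) : $cut;
    }

    // Three continuation bytes in a row: a complete 4-byte character.
    return $cut;
}
```

For example, utf8_truncate_bytes("a\xc3\xa9", 2) keeps only "a", because the two-byte "é" does not fit in the remaining byte.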

Unfortunately (as Andrew reminds me in the comments) there are also cases where two separately encoded Unicode code points form a single character (basically, diacritics such as accents can be represented as a separate code point modifying the preceding letter).

Handling this kind of thing requires advanced Unicode-fu that PHP's basic string functions don't provide, and it may not even be possible in all cases (there are some weird scripts out there!), but fortunately it's relatively rare, at least for Latin-based languages.


I think you don't need to reinvent the wheel: you could just use mb_strcut and make sure the encoding is set to UTF-8 first.

mb_internal_encoding('UTF-8');
echo mb_strcut("\xc2\x80\xc2\x80", 0, 3); // from byte offset 0, cut at most 3 bytes

It returns

\xc2\x80

because in \xc2\x80\xc2 the trailing \xc2 is an incomplete character, so it is dropped.
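As a small usage sketch (requires the mbstring extension): the wrapper name truncate_utf8_bytes below is hypothetical. It passes the encoding as the fourth argument of mb_strcut(), so the result does not depend on mb_internal_encoding().

```php
<?php
// Hypothetical wrapper name; mb_strcut() itself is the real mbstring function.
function truncate_utf8_bytes(string $s, int $maxBytes): string
{
    // Start at byte offset 0 and keep at most $maxBytes bytes,
    // backing up to a character boundary if the cut would split a character.
    return mb_strcut($s, 0, $maxBytes, 'UTF-8');
}

// "é" (U+00E9) is two bytes in UTF-8 (C3 A9), so a 2-byte limit cannot include it.
echo bin2hex(truncate_utf8_bytes("h\u{00E9}llo", 2)), "\n"; // 68
echo bin2hex(truncate_utf8_bytes("h\u{00E9}llo", 3)), "\n"; // 68c3a9
```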


I coded up this simple function for this purpose; you need the mbstring extension, though.

function str_truncate($string, $bytes = null)
{
    if (isset($bytes) === true)
    {
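        // A string that fits in $bytes bytes has at most $bytes characters,
        // so cutting to that many characters first shortens the loop below.
        $string = mb_substr($string, 0, $bytes, 'UTF-8');

        // Drop one character at a time until the byte length fits.
        while (strlen($string) > $bytes)
        {
            $string = mb_substr($string, 0, -1, 'UTF-8');
        }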
    }

    return $string;
}

While this code also works, S.Mark's answer is obviously the way to go.


Here's a test for mb_strcut(). It doesn't prove that it does just what we're looking for but I find it pretty convincing.

<?php
ini_set('default_charset', 'UTF-8' );
$strs = array(
    'Iñtërnâtiônàlizætiøn',
    'החמאס: רוצים להשלים את עסקת שליט במהירות האפשרית',
    'ايران لا ترى تغييرا في الموقف الأمريكي',
    '独・米で死傷者を出した銃の乱射事件',
    '國會預算處公布驚人的赤字數據後',
    '이며 세계 경제 회복에 걸림돌이 되고 있다',
    'В дагестанском лесном массиве южнее села Какашура',
    'นายประสิทธิ์ รุ่งสะอาด ปลัดเทศบาล รักษาการแทนนายกเทศมนตรี ต.ท่าทองใหม่',
    'ભારતીય ટીમનો સુવર્ણ યુગ : કિવીઝમાં પણ કમાલ',
    'ཁམས་དཀར་མཛེས་ས་ཁུལ་དུ་རྒྱ་གཞུང་ལ་ཞི་བའི་ངོ་རྒོལ་',
    'Χιόνια, βροχές και θυελλώδεις άνεμοι συνθέτουν το',
    'Հայաստանում սկսվել է դատական համակարգի ձեւավորումը',
    'რუსეთი ასევე გეგმავს სამხედრო');
for ( $i = 10; $i <= 30; $i += 5 ) {
    foreach ($strs as $s) {
        $t = mb_strcut($s, 0, $i, 'UTF-8');
        print(
            sprintf('%3s%3s ', mb_strlen($t, 'UTF-8'), mb_strlen($t, 'latin1'))
            . ( mb_check_encoding($t, 'UTF-8') ? ' OK  ' : ' Bad ' )
            . $t . "\n");
    }
}
?>


In addition to S.Mark’s answer which was mb_strcut(), I recently found another function to do a similar job: grapheme_extract($s, $n, GRAPHEME_EXTR_MAXBYTES); from the intl extension.

The functionality is a bit different: the mb_strcut() documentation claims it cuts at the nearest UTF-8 character boundary, so it doesn't respect multi-character graphemes, whereas grapheme_extract() does. So depending on what you need, grapheme_extract() might be better (e.g. to display a string) or mb_strcut() might be better (e.g. for indexing). Anyway, just thought I'd mention it.

(And since intl is an ICU wrapper, I have a lot of confidence in it.)
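To illustrate the difference, here is a small sketch that needs both the mbstring and intl extensions. The test string is my own example: "e" followed by U+0301 (combining acute accent) is two code points and three bytes, but one grapheme.

```php
<?php
// "e" + U+0301 COMBINING ACUTE ACCENT, then "x": bytes 65 CC 81 78.
$s = "e\u{0301}x";

// With a 2-byte limit, mb_strcut() cuts at a code point boundary,
// so it keeps the bare "e" and drops the accent.
$byCodePoint = mb_strcut($s, 0, 2, 'UTF-8');

// grapheme_extract() works in whole grapheme clusters; the 3-byte
// cluster does not fit in 2 bytes, so the bare "e" is not returned.
$byGrapheme = grapheme_extract($s, 2, GRAPHEME_EXTR_MAXBYTES);

// With a 3-byte limit the whole cluster fits for both functions.
$fitsCodePoint = mb_strcut($s, 0, 3, 'UTF-8');
$fitsGrapheme  = grapheme_extract($s, 3, GRAPHEME_EXTR_MAXBYTES);

echo bin2hex($byCodePoint), "\n";   // 65
echo bin2hex($fitsCodePoint), "\n"; // 65cc81
```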


No. There is no way to do this other than decoding. The encoding scheme is pretty mechanical, however. See the pretty table in the Wikipedia article.

Edit: Michael Borgwardt shows us how to do it without decoding the whole string. Clever.
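One way to use that mechanical structure without decoding the whole string is to rely on the fact that every continuation byte has the form 10xxxxxx. The sketch below (hypothetical function name, valid UTF-8 input assumed) looks at the first byte that would be cut off and backs up until the cut no longer lands inside a character.

```php
<?php
// Hypothetical helper, not a built-in. Assumes $s is valid UTF-8.
function utf8_cut_at_boundary(string $s, int $maxBytes): string
{
    if (strlen($s) <= $maxBytes) {
        return $s; // already fits
    }

    $end = max(0, $maxBytes);

    // $s[$end] is the first byte that would be discarded. If it is a
    // continuation byte (10xxxxxx), the cut falls inside a character,
    // so move the cut point back to the start of that character.
    while ($end > 0 && (ord($s[$end]) & 0xC0) === 0x80) {
        $end--;
    }

    return substr($s, 0, $end);
}
```

With the three-byte "€" (E2 82 AC), utf8_cut_at_boundary("\xe2\x82\xacx", 2) returns an empty string, while a limit of 3 returns the whole character.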
