Truncate a UTF-8 string to fit a given byte count in PHP

2022-12-14 18:37

Say we have a UTF-8 string $s and we need to shorten it so it can be stored in N bytes. Blindly truncating it to N bytes could mess it up. But decoding it to find the character boundaries is a drag. Is there a tidy way?

[Edit 20100414] In addition to S.Mark's answer (mb_strcut()), I recently found another function to do the job: grapheme_extract($s, $n, GRAPHEME_EXTR_MAXBYTES); from the intl extension. Since intl is an ICU wrapper, I have a lot of confidence in it.

Edit: S.Mark's answer is actually better than mine - PHP has a (badly documented) builtin function that solves exactly this problem.

Original "back to the bits" answer follows:

Truncate to the desired byte count
If the last byte starts with 11 (binary), i.e. it is the lead byte of a multi-byte character, drop it as well
If the second-to-last byte starts with 111 (binary), i.e. 1110 or 11110, drop the last 2 bytes
If the third-to-last byte starts with 11110 (binary), drop the last 3 bytes

This ensures that you don't have an incomplete character dangling at the end, which is the main thing that can go wrong when truncating UTF-8.
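For what it's worth, here is a minimal sketch of the steps above using only substr() and ord(). The function name utf8_truncate_bytes is made up for illustration, and the sketch assumes the input is valid UTF-8 to begin with. It treats any lead byte (11xxxxxx) in the last position as incomplete, and likewise any 3- or 4-byte lead (111xxxxx) in the second-to-last position.

```php
<?php
// Hypothetical helper (name is an assumption, not from the original answer).
// Assumes $s is valid UTF-8; only the tail of the cut string is inspected.
function utf8_truncate_bytes(string $s, int $maxBytes): string
{
    if ($maxBytes <= 0) {
        return '';
    }

    // Step 1: blind cut at the byte limit.
    $cut = substr($s, 0, $maxBytes);
    $len = strlen($cut);

    // Step 2: last byte is a lead byte (11xxxxxx) with no continuation bytes.
    if ($len >= 1 && (ord($cut[$len - 1]) & 0xC0) === 0xC0) {
        return substr($cut, 0, $len - 1);
    }
    // Step 3: second-to-last byte leads a 3- or 4-byte sequence (111xxxxx),
    // but only one continuation byte survived the cut.
    if ($len >= 2 && (ord($cut[$len - 2]) & 0xE0) === 0xE0) {
        return substr($cut, 0, $len - 2);
    }
    // Step 4: third-to-last byte leads a 4-byte sequence (11110xxx),
    // but only two continuation bytes survived the cut.
    if ($len >= 3 && (ord($cut[$len - 3]) & 0xF8) === 0xF0) {
        return substr($cut, 0, $len - 3);
    }

    return $cut;
}

// "a" followed by the euro sign (E2 82 AC), cut at 1 to 4 bytes.
for ($n = 1; $n <= 4; $n++) {
    echo $n, ': ', bin2hex(utf8_truncate_bytes("a\xE2\x82\xAC", $n)), "\n";
}
```

Complete characters at the end are left alone, because a 2-byte lead in the second-to-last position or a 3-byte lead in the third-to-last position matches none of the checks.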

Unfortunately (as Andrew reminds me in the comments) there are also cases where two separately encoded Unicode code points form a single character (basically, diacritics such as accents can be represented as a separate code point modifying the preceding letter).

Handling this kind of thing requires advanced Unicode-Fu that PHP's plain string functions don't offer and that may not even be possible for all cases (there are some weird scripts out there!), but fortunately it's relatively rare, at least for Latin-based languages.
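To make the combining-mark problem concrete, here is a rough sketch that keeps only whole grapheme clusters within a byte limit. It leans on PCRE's \X escape (with the /u modifier), which matches one extended grapheme cluster. Using \X here is my own assumption rather than anything from the answer above, the function name truncate_graphemes is hypothetical, and \X may well not cover every script correctly.

```php
<?php
// Hypothetical helper (not from the original answers): keep whole grapheme
// clusters, as matched by PCRE's \X, while the result fits in $maxBytes.
function truncate_graphemes(string $s, int $maxBytes): string
{
    // With /u, the subject must be valid UTF-8; on invalid input
    // preg_match_all() returns false and this sketch gives up.
    if (preg_match_all('/\X/u', $s, $m) === false) {
        return '';
    }

    $out = '';
    foreach ($m[0] as $cluster) {
        // Stop before the first cluster that would exceed the limit.
        if (strlen($out) + strlen($cluster) > $maxBytes) {
            break;
        }
        $out .= $cluster;
    }

    return $out;
}

// "e" + U+0301 COMBINING ACUTE ACCENT (3 bytes), "t", precomposed "é" (2 bytes).
$s = "e\xCC\x81t\xC3\xA9";
for ($n = 1; $n <= 6; $n++) {
    echo $n, ': ', bin2hex(truncate_graphemes($s, $n)), "\n";
}
```

A cut at 1 or 2 bytes yields an empty string here, whereas a code-point-level cut would keep the bare "e" and silently lose its accent.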

I think you don't need to reinvent the wheel: you could just use mb_strcut(), making sure you set the encoding to UTF-8 first.

mb_internal_encoding('UTF-8');
echo mb_strcut("\xc2\x80\xc2\x80", 0, 3); // from byte offset 0, keep at most 3 bytes

It returns

\xc2\x80

because in \xc2\x80\xc2 the last byte is an incomplete character, so it is dropped.

I coded up this simple function for this purpose; you need the mbstring extension though.

function str_truncate($string, $bytes = null)
{
    if (isset($bytes) === true)
    {
        // to speed things up: $bytes characters is an upper bound,
        // since every character takes at least one byte
        $string = mb_substr($string, 0, $bytes, 'UTF-8');

        while (strlen($string) > $bytes)
        {
            $string = mb_substr($string, 0, -1, 'UTF-8');
        }
    }

    return $string;
}

While this code also works, S.Mark's answer is obviously the way to go.

Here's a test for mb_strcut(). It doesn't prove that it does just what we're looking for but I find it pretty convincing.

<?php
ini_set('default_charset', 'UTF-8' );
$strs = array(
    'Iñtërnâtiônàlizætiøn',
    'החמאס: רוצים להשלים את עסקת שליט במהירות האפשרית',
    'ايران لا ترى تغييرا في الموقف الأمريكي',
    '独・米で死傷者を出した銃の乱射事件',
    '國會預算處公布驚人的赤字數據後',
    '이며 세계 경제 회복에 걸림돌이 되고 있다',
    'В дагестанском лесном массиве южнее села Какашура',
    'นายประสิทธิ์ รุ่งสะอาด ปลัดเทศบาล รักษาการแทนนายกเทศมนตรี ต.ท่าทองใหม่',
    'ભારતીય ટીમનો સુવર્ણ યુગ : કિવીઝમાં પણ કમાલ',
    'ཁམས་དཀར་མཛེས་ས་ཁུལ་དུ་རྒྱ་གཞུང་ལ་ཞི་བའི་ངོ་རྒོལ་',
    'Χιόνια, βροχές και θυελλώδεις άνεμοι συνθέτουν το',
    'Հայաստանում սկսվել է դատական համակարգի ձեւավորումը',
    'რუსეთი ასევე გეგმავს სამხედრო');
for ( $i = 10; $i <= 30; $i += 5 ) {
    foreach ($strs as $s) {
        $t = mb_strcut($s, 0, $i, 'UTF-8');
        print(
            sprintf('%3s%3s ', mb_strlen($t, 'UTF-8'), mb_strlen($t, 'latin1'))
            . ( mb_check_encoding($t, 'UTF-8') ? ' OK  ' : ' Bad ' )
            . $t . "\n");
    }
}
?>

In addition to S.Mark's answer, which was mb_strcut(), I recently found another function to do a similar job: grapheme_extract($s, $n, GRAPHEME_EXTR_MAXBYTES); from the intl extension.

The functionality is a bit different: the mb_strcut() documentation says it cuts at the nearest UTF-8 character boundary, so it doesn't respect graphemes made of several code points, while grapheme_extract(), on the other hand, does. So depending on what you need, grapheme_extract() might be better (e.g. to display a string) or mb_strcut() might be better (e.g. for indexing). Anyway, just thought I'd mention it.

(And since intl is an ICU wrapper, I have a lot of confidence in it.)
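A quick way to see the difference for yourself is to run both functions on a string containing a combining accent. This is only a sketch: cut_both_ways is a made-up helper name, and the demo is guarded because it assumes both the mbstring and intl extensions are loaded.

```php
<?php
// Hypothetical helper: cut the same string to at most $n bytes with both
// functions so the results can be compared side by side.
function cut_both_ways(string $s, int $n): array
{
    return [
        'mb_strcut'        => mb_strcut($s, 0, $n, 'UTF-8'),
        'grapheme_extract' => grapheme_extract($s, $n, GRAPHEME_EXTR_MAXBYTES),
    ];
}

// "e" + U+0301 COMBINING ACUTE ACCENT, then "x": 4 bytes, 3 code points.
$s = "e\xCC\x81x";

// Only run the demo when both extensions are available.
if (extension_loaded('mbstring') && extension_loaded('intl')) {
    for ($n = 1; $n <= 4; $n++) {
        $r = cut_both_ways($s, $n);
        printf(
            "%d bytes: mb_strcut=%s grapheme_extract=%s\n",
            $n,
            bin2hex($r['mb_strcut']),
            bin2hex((string) $r['grapheme_extract'])
        );
    }
}
```

At a limit of 1 or 2 bytes, mb_strcut() should keep the bare "e", while grapheme_extract() should refuse to split the accented letter.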

~~No. There is no way to do this other than decoding.~~ The decoding is pretty mechanical, however. See the pretty table in the Wikipedia article on UTF-8.

Edit: Michael Borgwardt shows us how to do it without decoding the whole string. Clever.
