Anyone can explain the "same thing" issue of UTF-8?

2023-04-12 02:56 问答作者：

Quoted from here:

Security may also be impacted by a characteristic of several character encodings, including UTF-8: the "same thing" (as far as a user can tell) can be represented by several distinct character sequences. For instance, an e with acute accent can be represented by the precomposed U+00E9 E ACUTE character or by the canonicall开发者_开发技巧y equivalent sequence U+0065 U+0301 (E + COMBINING ACUTE). Even though UTF-8 provides a single byte sequence for each character sequence, the existence of multiple character sequences for "the same thing" may have security consequences whenever string matching, indexing,

Is this a hidden feature of UTF-8 that I've never tackled before?

This issue is not actually specific to UTF-8 at all. It happens with all encodings that can represent all (or at least most) Unicode codepoints.

The general idea of Unicode is to not provide so-called pre-composed characters (e.g. U+00E9 E ACUTE), instead they usually like to provide the base character (e.g. U+0065 LATIN SMALL LETTER E) and the combining character (e.g. U+0301 COMBINING ACUTE ACCENT). This has the advantage of not having to provide every possible combination as its own character.

Note: the U+xxxx notation is used to refer to unicode codepoints. It's the encoding-independent way to refer to Unicode characters.

However when Unicode was first designed an important goal was to have round-trip compatibility for existing, widely-used encodings, so some pre-composed characters were included (in fact most of the diacritic characters from the latin and related alphabets are included).

So yes (and tl;dr): in a correctly working Unicode-capable application U+00E9 should render the same way and be treated the same way as U+0065 followed by U+0301.

There's a non-trivial process called normalization that helps work with these differences by reducing a given string to one of four normal forms.

For example passing both strings (U+00E9 and U+0065 U+0301) will result in U+00E9 when using NFC and will result in U+0065 U+0301 when using NFD.

Very short and visualized example: the character "é" can either be represented using the Unicode code point U+00E9 (LATIN SMALL LETTER E WITH ACUTE, é), or the sequence U+0065 (LATIN SMALL LETTER E, e) followed by U+0301 (COMBINING ACUTE ACCENT, ´), which together look like this: é.

In UTF-8, é has the byte sequence xC3 xA9, while é has the byte sequence x65 xCC x81.

_{Note: Due to technical limitations this post does not contain the actual combination characters.}

Actually I don't understand what it means by :

"Even though UTF-8 provides a single byte sequence for each character sequence[...]"

What the quote wants to say is:

"Any given sequence of Unicode code points is mapped to one (and precisely one) sequence of bytes by the UTF-8 encoding." That is, UTF-8 is a bijection between sequences of (abstract) Unicode code points and bytes.

The problem, which the text wants to illustrate, is that there is no bijection between "letters of a text" (as commonly understood) and Unicode code points, because the same text can be represented by different sequences of Unicode code points (as explained in the example).

Actually, this has nothing to do with UTF-8 specifically; it is a fundamental property of Unicode: Many texts have more than one representations as Unicode code points. This is important to keep in mind when comparing texts expressed in Unicode (no matter in what encoding).

One (partial) solution to this is normalization. It defines various Normal forms for Unicode text, which are unique representations of a text.

继续阅读：unicode utf-8

Anyone can explain the "same thing" issue of UTF-8?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？