Unicode BOM for UTF-16LE vs UTF32-LE

2022-12-14 04:41 问答作者：

It seems like there's an ambiguity between th开发者_JS百科e Byte Order Marks used for UTF16-LE and UTF-32LE. In particular, consider a file that contains the following 8 bytes:

FF FE 00 00 00 00 00 00

How can I tell if this file contains:

The UTF16-LE BOM (FF FE) followed by 3 null characters; or
The UTF32-LE BOM (FF FE 00 00) followed by one null character?

Unicode BOMs are described here: http://unicode.org/faq/utf_bom.html#bom4 but there's no discussion of this ambiguity. Am I missing something?

As the name suggests, the BOM only tells you the byte order, not the encoding. You have to know what the encoding is first, then you can use the BOM to determine whether the least or most significant bytes are first for multibyte sequences.

A fortunate side-effect of the BOM is that you can also sometimes use it to guess the encoding if you don't know it, but that is not what it was designed for and it is no substitute for sending proper encoding information.

It is unambiguous. FF FE is for UTF-16LE, and FF FE 00 00 denotes UTF-32LE. There is no reason to think that FF FE 00 00 is possibly UTF-16LE because the UTFs were designed for text, and users shouldn't be using NUL characters in their text. After all, when was the last time you opened a hex editor and inserted a few bytes of 00 into a text document? ^_^

I have experienced the same problem like Edward. I agree with Dustin, usually one will not use null-characters in textfiles.

However i have created a file that contains all unicode characters. I have first used the utf-32le encoding, then a utf-32be encoding, a utf-16le and a utf-16be encoding as well as a utf-8 encoding.

When trying to re-encode the files to utf-8, i wanted to compare the result to the already existing utf-8 file. Because the first character in my files after the BOM is the null-character, i could not successfully detect the file with utf-16le BOM, it showed up as utf-32le BOM, because the bytes appeared exactly like Edward has described. The first character after the BOM FFFE is 0000, but the BOM detection found a BOM FFFE0000 and so, detected utf-32le instead of utf-16le whereby my first 0000-character was stolen and taken as part of the BOM.

So one should never use a null-character as first character of a file encoded with utf-16 little endian, because it will make the utf-16le and utf-32le BOM ambiguous.

To solve my problem, i will swap the first and second character. :-)

继续阅读：byte-order-mark character-encoding file-type unicode utf-16

Unicode BOM for UTF-16LE vs UTF32-LE

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？