default encoding for XML is UTF-8 or UTF-16?

2023-03-12 11:35

OpenTag FAQ states:

If no encoding declaration is present in the XML document (and no external encoding declaration mechanism such as the HTTP header is available), the assumed encoding of an XML document depends on the presence of the Byte-Order-Mark (BOM).

The BOM is a Unicode special marker placed at the top of the file that indicates its encoding. The BOM is optional for UTF-8.
First bytes        Encoding assumed
-----------------------------------------
EF BB BF           UTF-8
FE FF              UTF-16 (big-endian)
FF FE              UTF-16 (little-endian)
00 00 FE FF        UTF-32 (big-endian)
FF FE 00 00        UTF-32 (little-endian)
None of the above  UTF-8

Is there a dumbed-down explanation of the above paragraph?

You can use a declaration line like

<?xml version="1.0" encoding="iso-8859-1" ?>

to specify which encoding is used. If no encoding is specified, the file may start with a byte order mark (BOM). If a BOM for UTF-16 or UTF-32 is present, that encoding is used. Otherwise the encoding is UTF-8. (The BOM for UTF-8 is optional.)
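The rules above can be sketched in a few lines of Python. This is only an illustration, not a complete XML encoding detector: the name `sniff_xml_encoding` is made up for this example, the declaration is matched with a simple regular expression, and the declaration is only recognised when the file is in an ASCII-compatible encoding. Note that the 4-byte UTF-32 marks are tested before the 2-byte UTF-16 ones, because FF FE 00 00 starts with FF FE.

```python
import codecs
import re

# Longest BOMs first: the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Simplified pattern for <?xml ... encoding="..." ?> at the very start of the data.
_DECL = re.compile(rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def sniff_xml_encoding(data: bytes) -> str:
    """Hypothetical helper: guess the encoding of raw XML bytes."""
    # 1. An explicit declaration wins. A leading UTF-8 BOM is skipped first.
    head = data
    if head.startswith(codecs.BOM_UTF8):
        head = head[len(codecs.BOM_UTF8):]
    match = _DECL.match(head)
    if match:
        return match.group(1).decode("ascii").lower()
    # 2. No declaration found: fall back to the BOM, if any.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 3. Neither declaration nor BOM: assume UTF-8.
    return "utf-8"


print(sniff_xml_encoding(b'<?xml version="1.0" encoding="iso-8859-1" ?><a/>'))
print(sniff_xml_encoding(b"\xff\xfe<\x00a\x00/\x00>\x00"))
print(sniff_xml_encoding(b"<a/>"))
```

The three calls should report iso-8859-1, utf-16-le and utf-8 respectively. A real parser does more than this (for example, it can read a declaration that is itself written in UTF-16), so treat this as a reading aid only.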

Edit

The BOM is an invisible character, but there is no need to see it: applications take care of it automatically. When you use Windows Notepad, you can select the encoding when you save the file, and Notepad will automatically insert the BOM at the start of the file. When you later reopen the file, Notepad will recognise the BOM and use the proper encoding to read the file. There is never any need to modify the BOM yourself. If you did, the bytes that follow could be interpreted as different characters, so the text would no longer be the same.

I will try to explain with an example. Consider a text file containing just the characters "test". By default, Notepad uses ANSI encoding, and the text file looks like this when you view it in hex mode:

C:\>C:\gnuwin32\bin\hexdump -C test-ansi.txt
00000000  74 65 73 74                                       |test|
00000004

(As you can see, I am using hexdump from gnuwin32, but you can also use a hex editor like Frhed to see this.)

There is no BOM at the front of this file. One would not be possible, because the character used for the BOM does not exist in ANSI encoding. (Because there is no BOM, editors that don't support ANSI encoding would treat this file as UTF-8.)
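You can check this claim in Python. The sketch below assumes that "ANSI" means Windows code page 1252 (the usual Western European default; other locales use other code pages), and the variable names are just for illustration:

```python
bom_char = "\ufeff"  # U+FEFF, the character that serves as the byte order mark

# Try to encode the BOM character in the assumed "ANSI" code page.
try:
    bom_char.encode("cp1252")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(encodable)                 # False: cp1252 has no byte for U+FEFF
print("test".encode("cp1252"))   # b'test', the same 4 bytes as in the hexdump
```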

When I now save the file as UTF-8, you will see 3 extra bytes (the BOM) in front of "test":

C:\>C:\gnuwin32\bin\hexdump -C test-utf8.txt
00000000  ef bb bf 74 65 73 74                              |ï»¿test|
00000007

(If you opened this file with a text editor that does not support UTF-8, you would actually see the characters "ï»¿".)
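The same effect can be reproduced in Python by decoding the 7 bytes from the hexdump in different ways. Latin-1 stands in here for "an editor that does not know UTF-8"; that choice is an assumption for the sake of the example:

```python
data = b"\xef\xbb\xbftest"  # the 7 bytes of test-utf8.txt shown in the hexdump

as_latin1 = data.decode("latin-1")      # each byte becomes one character
as_utf8 = data.decode("utf-8")          # the BOM survives as U+FEFF
as_utf8_sig = data.decode("utf-8-sig")  # the BOM is recognised and dropped

print(as_latin1)       # ï»¿test
print(repr(as_utf8))   # '\ufefftest'
print(as_utf8_sig)     # test
```

Python's `utf-8-sig` codec is the convenient choice when reading files that may or may not start with a UTF-8 BOM, since it accepts both.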

Notepad can also save the file as "Unicode", which means UTF-16 little-endian (UTF-16LE):

C:\>C:\gnuwin32\bin\hexdump -C test-unicode.txt
00000000  ff fe 74 00 65 00 73 00  74 00                    |ÿþt.e.s.t.|
0000000a

And here is the version saved as "Unicode (big endian)" (UTF-16BE):

C:\>C:\gnuwin32\bin\hexdump -C test-unicode-big-endian.txt
00000000  fe ff 00 74 00 65 00 73  00 74                    |þÿ.t.e.s.t|
0000000a

Now consider a text file with the 4 Chinese characters "琀攀猀琀". When I save that as Unicode (big endian), the result looks like this:

C:\>C:\gnuwin32\bin\hexdump -C test2-unicode-big-endian.txt
00000000  fe ff 74 00 65 00 73 00  74 00                    |þÿt.e.s.t.|
0000000a

As you can see, the word "test" in UTF-16LE is stored as the same bytes as the word "琀攀猀琀" in UTF-16BE. But because the BOM is stored differently, you can tell whether the file contains "test" or "琀攀猀琀". Without a BOM you would have to guess.
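Here is the same ambiguity as a small Python sketch. The variable names are only for illustration; the byte values are the ones from the hexdumps:

```python
import codecs

# The 8 payload bytes that follow the BOM in both hexdumps.
payload = bytes.fromhex("7400650073007400")

# Without a BOM, the reader has to be told (or guess) the byte order.
print(payload.decode("utf-16-le"))  # test
print(payload.decode("utf-16-be"))  # 琀攀猀琀

# With a BOM, the generic "utf-16" codec picks the byte order itself
# and removes the BOM from the decoded text.
le_file = codecs.BOM_UTF16_LE + payload  # ff fe 74 00 65 00 ...
be_file = codecs.BOM_UTF16_BE + payload  # fe ff 74 00 65 00 ...

print(le_file.decode("utf-16"))  # test
print(be_file.decode("utf-16"))  # 琀攀猀琀
```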
