Is the default encoding for XML UTF-8 or UTF-16?

OpenTag FAQ states:

If no encoding declaration is present in the XML document (and no external encoding declaration mechanism such as the HTTP header is available), the assumed encoding of an XML document depends on the presence of the Byte-Order-Mark (BOM).

The BOM is a special Unicode marker placed at the top of the file that indicates its encoding. The BOM is optional for UTF-8.

First bytes        Encoding assumed
-----------------------------------------
EF BB BF           UTF-8
FE FF              UTF-16 (big-endian)
FF FE              UTF-16 (little-endian)
00 00 FE FF        UTF-32 (big-endian)
FF FE 00 00        UTF-32 (little-endian)
None of the above  UTF-8

Is there a dumbed-down explanation of the above paragraph?


You can add a line like

<?xml version="1.0" encoding="iso-8859-1" ?>

to specify which encoding is used. If the encoding is not specified, a byte order mark (BOM) may be present. If a BOM for either UTF-16 or UTF-32 is present, that encoding is used. Otherwise the encoding is UTF-8. (The BOM for UTF-8 is optional.)
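As a rough illustration of the rule above, here is a minimal Python sketch that looks only at the first bytes of a file. The function name `sniff_bom` is made up for this example, and it deliberately ignores the `encoding="..."` declaration and any HTTP header, so it is not a full XML encoding detector. Note that the UTF-32 checks come first, because the UTF-32 little-endian BOM (FF FE 00 00) starts with the same two bytes as the UTF-16 little-endian BOM (FF FE).

```python
import codecs

# Order matters: FF FE 00 00 (UTF-32 LE) begins with FF FE (UTF-16 LE),
# so the longer UTF-32 BOMs must be tested before the UTF-16 ones.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def sniff_bom(data: bytes) -> str:
    """Hypothetical helper: guess the encoding from a leading BOM only."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # "None of the above" in the table: fall back to UTF-8.
    return "utf-8"


if __name__ == "__main__":
    print(sniff_bom(b"\xef\xbb\xbftest"))        # UTF-8 with BOM
    print(sniff_bom(b"\xff\xfet\x00e\x00"))      # UTF-16 LE with BOM
    print(sniff_bom(b"<?xml version='1.0'?>"))   # no BOM
```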

Edit

The BOM is an invisible character, but there is no need to see it; applications take care of it automatically. When you use Windows Notepad, you can select the encoding when you save the file, and Notepad will automatically insert the BOM at the start of the file. When you later reopen the file, Notepad will recognise the BOM and use the proper encoding to read it. There is never a need for you to modify the BOM; if you did, the characters could be interpreted differently, so the text would no longer be the same.

I will try to explain with an example. Consider a text file with just the characters "test". By default Notepad uses ANSI encoding, and the text file looks like this when you view it in hex mode:

C:\>C:\gnuwin32\bin\hexdump -C test-ansi.txt
00000000  74 65 73 74                                       |test|
00000004

(As you see, I am using hexdump from gnuwin32, but you can also use a hex editor like Frhed to see this.)

There is no BOM at the start of this file. There could not be one, because the character used for the BOM does not exist in ANSI encoding. (Because there is no BOM, editors that don't support ANSI encoding would treat this file as UTF-8.)

When I now save the file as UTF-8, you will see 3 extra bytes (the BOM) in front of "test":

C:\>C:\gnuwin32\bin\hexdump -C test-utf8.txt
00000000  ef bb bf 74 65 73 74                              |test|
00000007

(If you opened this file with a text editor that does not support UTF-8, you would actually see those bytes as the characters "ï»¿".)

Notepad can also save the file as Unicode, which here means UTF-16 little-endian (UTF-16LE):

C:\>C:\gnuwin32\bin\hexdump -C test-unicode.txt
00000000  ff fe 74 00 65 00 73 00  74 00                    |ÿþt.e.s.t.|
0000000a

And here is the version saved as Unicode (big endian), which is UTF-16BE:

C:\>C:\gnuwin32\bin\hexdump -C test-unicode-big-endian.txt
00000000  fe ff 00 74 00 65 00 73  00 74                    |þÿ.t.e.s.t|
0000000a
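If you don't have Notepad or hexdump at hand, the four byte sequences above can be reproduced with a short Python sketch. It assumes that "ANSI" means Windows-1252 (which is the case on a Western European Windows installation, but not everywhere), and it writes the UTF-16 BOMs explicitly so the result does not depend on the byte order of the machine running it. `bytes.hex(" ")` needs Python 3.8 or later.

```python
import codecs

text = "test"

# Each entry mimics one of Notepad's "Save as" encodings.
samples = {
    # Assumption: "ANSI" is Windows-1252 on this system.
    "ANSI (cp1252 assumed)": text.encode("cp1252"),
    # "utf-8-sig" prepends the 3-byte UTF-8 BOM (EF BB BF).
    "UTF-8 with BOM": text.encode("utf-8-sig"),
    # BOM added by hand so the byte order is explicit.
    "UTF-16LE with BOM": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
    "UTF-16BE with BOM": codecs.BOM_UTF16_BE + text.encode("utf-16-be"),
}

for label, raw in samples.items():
    # Print the bytes as space-separated hex, like the hexdump output.
    print(f"{label:22} {raw.hex(' ')}")
```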

Now consider a text file with the 4 Chinese characters "琀攀猀琀". When I save that as Unicode (big endian), the result looks like this:

C:\>C:\gnuwin32\bin\hexdump -C test2-unicode-big-endian.txt
00000000  fe ff 74 00 65 00 73 00  74 00                    |þÿt.e.s.t.|
0000000a

As you see, the word "test" in UTF-16LE is stored the same way as the word "琀攀猀琀" in UTF-16BE. But because the BOM is stored differently, you can tell whether the file contains "test" or "琀攀猀琀". Without a BOM you would have to guess.
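The same ambiguity can be shown in a few lines of Python. This is only a sketch: the variable names are made up for the example, and it relies on Python's standard "utf-16" codec, which reads a leading BOM and picks the byte order from it.

```python
import codecs

# The 8 bytes that follow the BOM in both hexdumps above.
payload = bytes.fromhex("7400650073007400")

# Without a BOM, the reader has to be told (or guess) the byte order.
as_le = payload.decode("utf-16-le")
as_be = payload.decode("utf-16-be")
print(as_le)                                    # test
print(" ".join(f"U+{ord(c):04X}" for c in as_be))
# U+7400 U+6500 U+7300 U+7400

# With a BOM, the generic "utf-16" codec picks the byte order itself
# and strips the BOM from the decoded text.
from_le_file = (codecs.BOM_UTF16_LE + payload).decode("utf-16")
from_be_file = (codecs.BOM_UTF16_BE + payload).decode("utf-16")
print(from_le_file == as_le, from_be_file == as_be)  # True True
```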
