What is encoding in XML?

What is encoding in XML? The usual encoding is UTF-8. How is it different from other encodings? What is the purpose of using it?


A character encoding specifies how characters are mapped onto bytes. Since XML documents are stored and transferred as byte streams, an encoding is needed to represent the Unicode characters that make up an XML document.

UTF-8 is chosen as the default, because it has several advantages:

  • it is compatible with ASCII: all valid ASCII-encoded text is also valid UTF-8 (but not necessarily the other way around!)
  • it uses only 1 byte per character for "common" letters (those that also exist in ASCII)
  • it can represent all existing Unicode characters
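The properties listed above can be checked with a short Python sketch. The sample characters and variable names are chosen here purely for illustration:

```python
# ASCII text produces exactly the same bytes in ASCII and in UTF-8.
ascii_text = "encoding"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# UTF-8 is variable-width: ASCII letters take 1 byte, other characters take 2 to 4.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, an emoji
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

# The reverse direction fails: UTF-8 bytes for a non-ASCII character are not valid ASCII.
try:
    "\u00e9".encode("utf-8").decode("ascii")
    decodes_as_ascii = True
except UnicodeDecodeError:
    decodes_as_ascii = False

print(same_bytes, utf8_lengths, decodes_as_ascii)
```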

Character encodings are a more general topic than just XML. UTF-8 is not restricted to being used in XML only.

What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text is a good article that gives an overview of the topic.


When computers were first created, they mostly worked only with characters found in the English language, leading to the 7-bit US-ASCII standard.

However, there are a lot of different written languages in the world, and ways had to be found to be able to use them in computers.

The first way works fine if you restrict yourself to a certain language: use a culture-specific encoding, such as ISO-8859-1, which represents the characters of Western European languages in 8 bits, or GB2312 for Chinese characters.
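As a rough illustration of that limitation, the following Python sketch encodes text with two culture-specific encodings and then tries to use one of them outside its repertoire. The sample strings are assumptions made for the example:

```python
# French text fits in ISO-8859-1 at one byte per character.
text_fr = "\u00e9t\u00e9"          # "ete" with acute accents on both e's
latin1_bytes = text_fr.encode("iso-8859-1")

# Chinese text fits in GB2312 at two bytes per Chinese character.
text_zh = "\u4e2d\u6587"            # the two characters meaning "Chinese language"
gb_bytes = text_zh.encode("gb2312")

# A culture-specific encoding cannot represent characters outside its repertoire.
try:
    text_zh.encode("iso-8859-1")
    latin1_handles_chinese = True
except UnicodeEncodeError:
    latin1_handles_chinese = False

print(len(latin1_bytes), len(gb_bytes), latin1_handles_chinese)
```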

The second way is a bit more complicated, but in theory allows every character in the world to be represented: the Unicode standard, in which every character from every language has a specific code (a code point). However, given the large number of existing characters (roughly 109,000 as of Unicode 6.0), a code point does not fit into one or two bytes. Code points run up to U+10FFFF, so a raw code point takes up to three bytes: one byte for the Unicode plane and two bytes for the character's position within that plane.
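The plane-plus-code split described above can be sketched in Python. The helper name `plane_and_offset` is hypothetical and exists only for this example:

```python
def plane_and_offset(ch):
    """Split a character's code point into (plane, 16-bit position in the plane)."""
    code_point = ord(ch)
    return code_point >> 16, code_point & 0xFFFF

# A Latin letter, a Chinese character, an emoji, and the highest code point.
for ch in ["A", "\u4e2d", "\U0001F600", chr(0x10FFFF)]:
    plane, offset = plane_and_offset(ch)
    print(f"U+{ord(ch):04X}: plane {plane}, offset 0x{offset:04X}")
```

Most everyday characters live in plane 0; emoji such as U+1F600 live in plane 1.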

In order to maximize compatibility with existing code (some of which still uses ASCII text), the UTF-8 encoding was devised as a way to store Unicode characters using a minimal amount of space, as described in Joachim Sauer's answer.

So, it's common to see files encoded with specific charsets such as ISO-8859-1 if the file is meant to be edited or read only by software (and people) understanding only these languages, and UTF-8 when there's a need to be highly interoperable and culture-independent. The current tendency is for UTF-8 to replace other charsets, even though it requires work from software developers, since UTF-8 strings are more complicated to handle than fixed-width charset strings.
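One reason variable-width strings are harder to handle is that byte counts and character counts no longer match, so cutting a byte string at an arbitrary position can split a character. A minimal Python sketch, with a sample string chosen for illustration:

```python
text = "na\u00efve caf\u00e9"       # 10 characters, two of them non-ASCII

latin1 = text.encode("iso-8859-1")  # fixed width: one byte per character
utf8 = text.encode("utf-8")         # variable width: the accented letters take 2 bytes

char_count = len(text)
latin1_size = len(latin1)
utf8_size = len(utf8)

# Cutting after 3 bytes lands in the middle of the 2-byte sequence for the third character.
truncated = utf8[:3]
try:
    truncated.decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False

print(char_count, latin1_size, utf8_size, cut_is_valid)
```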


XML documents can contain non-ASCII characters, such as Norwegian æ ø å or French ê è é. To avoid errors, set the encoding in the XML declaration or save the XML file as Unicode.

XML Encoding Rules
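As a sketch of how this looks in practice, the following uses Python's standard `xml.etree.ElementTree` module to write the same document in two encodings and read both back. The element name `greeting` is made up for the example, and the `xml_declaration` argument to `tostring` assumes Python 3.8 or later:

```python
import xml.etree.ElementTree as ET

root = ET.Element("greeting")
root.text = "\u00e6 \u00f8 \u00e5"  # the Norwegian letters ae, o-slash, a-ring

# Serialize to bytes; the XML declaration records which encoding was used.
latin1_doc = ET.tostring(root, encoding="iso-8859-1", xml_declaration=True)
utf8_doc = ET.tostring(root, encoding="utf-8", xml_declaration=True)

# The parser reads the declaration and decodes the bytes accordingly.
from_latin1 = ET.fromstring(latin1_doc).text
from_utf8 = ET.fromstring(utf8_doc).text

print(latin1_doc)
print(utf8_doc)
```

The two byte strings differ, but both parse back to the same text because each one states its own encoding.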


When data is stored or transferred, it is only bytes. Those bytes need some interpretation. Users with non-English locales used to have problems with characters that appeared only in their locale: those characters were frequently displayed incorrectly.

Because an XML document carries information on how to interpret its bytes, its characters can be displayed correctly.
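The display problem is easy to reproduce: decode bytes with an encoding other than the one that produced them. A small Python sketch, with the sample text chosen for illustration:

```python
original = "\u00e6\u00f8\u00e5"      # three Norwegian letters
raw = original.encode("utf-8")       # six bytes on disk or on the wire

# Interpreting the bytes with the wrong encoding gives garbage instead of an error.
wrong = raw.decode("iso-8859-1")

# Interpreting them with the encoding that was actually used restores the text.
right = raw.decode("utf-8")

print(repr(wrong), repr(right))
```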
