What is encoding in XML?

2023-02-24 08:05 问答作者：

What is encoding in XML? The normal encoding used is 开发者_JS百科utf-8. How is it different from other encoding? What is the purpose of using it?

A character encoding specifies how characters are mapped onto bytes. Since XML documents are stored and transferred as byte streams, this is necessary to represent the unicode characters that make up an XML document.

UTF-8 is chosen as the default, because it has several advantages:

it is compatible with ASCII in that all valid ASCII encoded text is also valid UTF-8 encoded (but not necessarily the other way around!)
it uses only 1 byte per character for "common" letters (those that also exist in ASCII)
it can represent all existing Unicode characters

Character encodings are a more general topic than just XML. UTF-8 is not restricted to being used in XML only.

What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text is a good article that gives a good overview over the topic.

When computers were first created, they mostly only worked with characters found in the english language, leading to the 7-bit US-ASCII standard.

However, there are a lot of different written languages in the world, and ways had to be found to be able to use them in computers.

The first way works fine if you restrict yourself to a certain language, it's to use a culture specific encoding, such as ISO-8859-1, which is able to represent latin-european language characters on 8-bits, or GB2312 for chinese characters.

The second way is a bit more complicated, but allows theoretically to represent every character in the world, it's the Unicode standard, in which every character from every language has a specific code. However, given the high number of existing characters (109,000 in Unicode 5), unicode characters are normally represented using a three byte representation (one byte for the Unicode plane, and two bytes for the character code.

In order to maximize compatibility with existing code (some is still using text in ASCII), the UTF-8 standard encoding was devised as a way to store Unicode characters, only using the minimal amount of space, as described in Joachim Sauer's answer.

So, it's common to see files encoded with specific charsets such as ISO-8859-1 if the file is meant to be edited or read only by software (and people) understanding only these languages, and UTF-8 when there's the need to be highly interoperable and culture-independant. The current tendancy is for UTF-8 to replace other charsets, even though it needs work from software developers, since UTF-8 strings are more complicated to handle than fixed-width charset strings.

XML documents can contain non ASCII characters, like Norwegian æ ø å , or French ê è é. So, to avoid errors you set the encoding or save the XML file as Unicode.

XML Encoding Rules

When data is stored or transfered it is only bytes. Those bytes need some interpretation. Users with non English locales used to have some problems with characters that only appeared in their locale. Those characters were displayed in a wrong way frequently.

With XML having an information how to interpret its bytes character can be displayed in a correct way.

继续阅读：encoding utf-8 xml xsd

What is encoding in XML?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？