Is Unicode the default Character Set of HTML & XML?

I see that some sources, such as The Unicode Book and some Wikipedia articles, say that Unicode is the default character set of HTML and XML.

I understand "character set" to mean the repertoire of characters you can work with when you are making a file. Some editors set their own default character set regardless of what kind of file is being edited: even if you are making an HTML file, some editors don't use Unicode as the default.

Which leaves the question: is Unicode the default character set of HTML and XML, or does it depend on the editor used to create the file?


I suppose that you could call Unicode "the default" because both HTML and XML define their allowed content in terms of Unicode.

However, a file can't be "in Unicode"; it has to be in some encoding of Unicode. By default, XML files are required to be in either UTF-8 or UTF-16 encoding, unless the encoding declaration in the prolog specifies otherwise. The HTML spec explicitly leaves the supported encodings undefined, and indicates that the encoding is handled by the transport protocol (e.g., HTTP).
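To make the difference between Unicode and an encoding of Unicode concrete, here is a minimal Python sketch. The sample string is just an illustration (not from the question); it shows that the same two characters become different byte sequences depending on the encoding chosen, and that both decode back to the same text.

```python
# Hypothetical sample text: two characters from the Unicode repertoire.
text = "\u00e9\u20ac"  # 'é' (U+00E9) and '€' (U+20AC)

# The same characters, stored with two different encodings of Unicode.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # little-endian, no byte order mark

print(utf8_bytes.hex(" "))   # bytes as they would appear in a UTF-8 file
print(utf16_bytes.hex(" "))  # bytes as they would appear in a UTF-16LE file

# Decoding with the matching encoding recovers the same characters.
roundtrip_ok = (
    utf8_bytes.decode("utf-8") == text
    and utf16_bytes.decode("utf-16-le") == text
)
print(roundtrip_ok)
```

The characters are identical in both cases; only the bytes on disk differ, which is why a reader has to know which encoding was used.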


Depends on the person editing the document, not so much on the editor. The editor uses the encoding best suited to the author (or what they believe to be best suited to the author) as the default.

Basically, if you don't specify an encoding, or if the client software does not recognize the headers that the server sends, it might/should default to Unicode. I don't think that any of this is mandatory; it just became commonplace behavior.


If I read your question correctly, you need to make a distinction between

  • the character set you have used
  • the character set you have declared

The character set you have actually used when you created the document is the one you have set in your editor. Now you need to make sure that consumers of your file will read it correctly, i.e., that the character set you have used is also the one you declare.

If you don't use a declaration, the default will be UTF-8 for XML documents, as you have said. That's what an application which reads your file will assume. So you had better make sure your editor is set to UTF-8, or else use the appropriate XML header, e.g.

<?xml version="1.0" encoding="ISO-8859-1"?>
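As a rough illustration of why the declaration matters, the sketch below feeds three small documents to Python's standard `xml.etree.ElementTree` parser. The element name `a` and the sample word are made up for the example; the behavior shown is that of this particular parser, which treats undeclared input as UTF-8.

```python
import xml.etree.ElementTree as ET


def parse_text(data: bytes) -> str:
    """Parse raw XML bytes and return the text of the root element."""
    return ET.fromstring(data).text


# Saved as ISO-8859-1 and declared as such: the parser decodes it correctly.
declared = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><a>caf\u00e9</a>'
).encode("iso-8859-1")

# Saved as ISO-8859-1 but NOT declared: the parser assumes UTF-8.
undeclared = "<a>caf\u00e9</a>".encode("iso-8859-1")

# Saved as UTF-8 without a declaration: matches the parser's assumption.
utf8_undeclared = "<a>caf\u00e9</a>".encode("utf-8")

declared_text = parse_text(declared)
utf8_text = parse_text(utf8_undeclared)

try:
    parse_text(undeclared)
    undeclared_ok = True
except ET.ParseError:
    # The lone 0xE9 byte is not valid UTF-8, so parsing is rejected.
    undeclared_ok = False

print(declared_text, utf8_text, undeclared_ok)
```

In other words, the file is only readable when the encoding actually used and the encoding declared (or assumed by default) agree.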

For HTML documents, the default encoding is usually set in the server config, so check that out. UTF-8 is the most common choice these days.
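For HTML served over HTTP, the declared encoding typically travels in the `charset` parameter of the `Content-Type` header. The sketch below shows one way a client might read that parameter, using the header parser from Python's standard `email` package as a stand-in; the function name `charset_from_content_type` and the header values are hypothetical.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the value and returns None if absent.
    return msg.get_content_charset()


with_charset = charset_from_content_type("text/html; charset=ISO-8859-1")
without_charset = charset_from_content_type("text/html")

print(with_charset)     # charset as sent by the (hypothetical) server
print(without_charset)  # nothing declared: the client must fall back
```

When the header carries no charset, the client has to fall back to something else, such as a declaration inside the document or its own default.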


It's important to differentiate between the set of characters that may appear in an HTML document (which is a rather abstract concept), and the character encoding that is used to store/transfer the HTML file.

The default for the latter depends on OS/browser/HTML editor settings, and it's definitely not Unicode, because Unicode is not an encoding. It may be "UTF-8", which is a character encoding for Unicode, just like e.g. "UTF-16" (these encodings are different from e.g. "ISO-8859-1", which cannot encode all Unicode characters).

Overall, it's important that you set your editor to the same encoding that you declare in your HTML file. Some editors do this automatically, but many do not.
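A small Python sketch of what goes wrong when the two disagree: here a word is saved as UTF-8 (the editor's setting) but read back as ISO-8859-1 (the declared encoding). The sample word is only an illustration.

```python
# What the editor wrote to disk: the text encoded as UTF-8.
saved = "caf\u00e9".encode("utf-8")

# What a reader sees if the file is declared (and decoded) as ISO-8859-1.
misread = saved.decode("iso-8859-1")

# What a reader sees if the declaration matches the editor's encoding.
correct = saved.decode("utf-8")

print(misread)  # the two UTF-8 bytes of 'é' show up as two separate characters
print(correct)
```

The bytes never changed; only the reader's assumption about them did, which is why the editor setting and the declaration need to match.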
