
Is UTF-8 an Encoding or a Document Character Set?

According to a W3C Recommendation, every application requires its own document character set (not to be confused with a character encoding).

A document character set consists of:

  • A Repertoire: A set of abstract characters, such as the Latin letter "A", the Cyrillic letter "I", the Chinese character meaning "water", etc.

  • Code positions: A set of integer references to characters in the repertoire.

Each document is a sequence of characters from the repertoire.
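
To make the repertoire / code position distinction concrete, here is a minimal Python sketch. It assumes Unicode as the document character set (which is what Python strings use) and assumes the "Cyrillic letter I" above means U+0418; the names `samples` and `code_position` are made up for this illustration.

```python
# Each abstract character in the repertoire has exactly one code position,
# independent of how it is later stored as bytes.
samples = {
    "Latin capital letter A": "A",
    "Cyrillic capital letter I": "\u0418",  # assumed to be the letter meant above
    "Chinese character for water": "\u6c34",
}


def code_position(ch: str) -> str:
    """Return the code position of one character in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


for name, ch in samples.items():
    print(f"{name}: {ch} -> {code_position(ch)}")
```

Nothing here says anything about bytes yet: `ord` and `chr` only convert between a character and its integer code position.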

A character encoding is how those characters may be represented.

When I save a file in Windows Notepad, I am guessing that these are the "document character sets":

  • ANSI
  • UNICODE
  • UNICODE BIG ENDIAN
  • UTF-8

Three simple questions:

I want to know if those are the "document character sets". And if they are,

  1. Why is UTF-8 on the list? Isn't UTF-8 supposed to be an encoding?

    If I am not wrong about all of this:

  2. Are there other document character sets that Windows does not allow you to define?

  3. How can I define other document character sets?


In my understanding:

  • ANSI is both a character set and an encoding of that character set.
  • Unicode is a character set; the encoding in question is probably UTF-16. An alternative encoding of the same character set is big-endian UTF-16, which is probably what the third option is referring to.
  • UTF-8 is an encoding of Unicode.
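
As a rough illustration of the list above, the sketch below encodes the same character with four Python codecs. The mapping from Notepad's labels to codec names is an assumption: "ANSI" depends on the system code page (cp1252 is assumed here, as on Western European Windows), and "Unicode" is assumed to mean little-endian UTF-16. Notepad may also prepend a byte order mark, which this sketch leaves out. `NOTEPAD_OPTIONS` and `encode_all` are names invented for the example.

```python
# Assumed mapping from Notepad's dropdown labels to Python codec names.
NOTEPAD_OPTIONS = {
    "ANSI": "cp1252",                   # assumption: Western European code page
    "Unicode": "utf-16-le",             # assumption: little-endian UTF-16
    "Unicode big endian": "utf-16-be",
    "UTF-8": "utf-8",
}


def encode_all(text: str) -> dict:
    """Encode the same text once per option and return the raw bytes."""
    return {label: text.encode(codec) for label, codec in NOTEPAD_OPTIONS.items()}


# One character (code position U+00E9), four different byte sequences.
for label, data in encode_all("\u00e9").items():
    print(f"{label:20} {data.hex()}")
```

The code position stays the same in every row; only the byte representation changes, which is the difference between the character set and its encodings.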

The purpose of that dropdown in the Save dialog is really to select both a character set and an encoding for it, but they've been a little careless with the naming of the options.

(Technically, though, an encoding just maps integers to byte sequences, so any encoding could be used with any character set that is small enough to "fit" the encoding. However, the UTF-* encodings are designed with Unicode in mind.)

Also, see Joel on Software's mandatory article on the subject.


UTF-8 is a character encoding that is also used to specify a character set for HTML and other textual documents. It is one of several Unicode encodings (UTF-16 is another).

To answer your questions:

  • It is on the list because Microsoft decided to implement it in Notepad.
  • There are many other character sets, though defining your own is not useful, so it is not really possible.
  • You can't define other character sets to save with Notepad. Try using a programmer's editor such as Notepad++, which will give you more character sets to choose from.
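
If an editor is not at hand, a short script can also write a file in an encoding Notepad does not offer. This is only a sketch: ISO-8859-15 is picked as an arbitrary example, and `save_text` is a hypothetical helper, not part of any tool mentioned here.

```python
import os
import tempfile


def save_text(path: str, text: str, encoding: str) -> None:
    """Write text to path using the given encoding."""
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(text)


# Write the euro sign as ISO-8859-15 and inspect the bytes on disk.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
save_text(path, "\u20ac", "iso-8859-15")
with open(path, "rb") as f:
    raw = f.read()
print(raw.hex())
```

Python raises `UnicodeEncodeError` if the text contains a character that the chosen encoding cannot represent.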