What is the difference between charsets and character encoding

2022-12-24 09:31 问答作者：

What is the difference between charsets and character 开发者_开发百科encoding? When i say i am using utf-8 encoding then what will be my charset? Does it take unicode as charset by default?

UTF-8 is an encoding of the Unicode character set. Therefore if you're using UTF-8, the character set is Unicode, but you're not likely to have to specify this separately anywhere. The other main encoding of Unicode is UTF-16, which is not put into 8-bit byte streams because it contains zero bytes. If you are dealing with Unicode in a byte sequence, it is certainly encoded as UTF-8.

Other than Unicode, character sets are usually considered to have a single fixed encoding, and then terms like character set, charset, codepage, encoding are often used interchangeably, or depending on the vendor. This is sloppy but creates no runtime problems.

The only possible exceptions I can think of are East Asian: JIS and EUC originally defined multiple encodings for the same character set, but in practice today, each encoding is just treated separately.

Character set: definition which character has which numeric code point (ascii, jis, unicode)

Encoding: definition how the numeric code point is physically represented (utf, ucs, shiftjis)

According to Unicode terminology

ACR: Abstract Character Repertoire = the set of characters to be encoded, for example, some alphabet or symbol set
CCS: Coded Character Set = a mapping from an abstract character repertoire to a set of nonnegative integers
CEF: Character Encoding Form = a mapping from a set of nonnegative integers that are elements of a CCS to a set of sequences of particular code units of some specified width, such as 32-bit integers
CES: Character Encoding Scheme = a reversible transformation from a set of sequences of code units (from one or more CEFs to a serialized sequence of bytes)
CM: Character Map = a mapping from sequences of members of an abstract character repertoire to serialized sequences of bytes bridging all four levels in a single operation
TES: Transfer Encoding Syntax = a reversible transform of encoded data, which may or may not contain textual data

Older protocols like MIME use "charset" when they really mean "character encoding scheme". Originally, different character encodings were though of as independent character repertoires instead of subsets of Unicode.

A character set defines the mapping between numbers and characters. Almost all char sets say 65 is A, and agree in general about mappings of numbers up to 127. But they might have different stands when it comes to numbers above 127.

There are a lot of character sets

EBCDIC
Double Byte Character Set
ANSI
Different OEM char sets
Unicode, an effort to create a single character set that included every reasonable writing system on the planet and some make-believe ones like Klingon, too.

When you say character encoding, you're talking about how a Unicode code point (a character) is stored internally.

In UTF-8 encoding, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.
There's something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero
There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language).
UTF 7, 8, 16, and 32 all have the nice property of being able to store any code point correctly.

This post is almost entirely based on Joel Spolsky's post on Unicode: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. Read it to get a better idea.

Charset is synonym for character encoding

Default encoding depends on the operating system and locale.

EDIT http://www.w3.org/TR/REC-xml/#sec-TextDecl

http://www.w3.org/TR/REC-xml/#NT-EncodingDecl

继续阅读：character-encoding unicode

What is the difference between charsets and character encoding

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？