Why haven't ASCII and ISO-8859-1 encoding been relegated to history?

It seems to me that if UTF-8 were the only encoding used everywhere, ever, there would be far fewer issues with code:

  • Don't even need to think about encoding issues.
  • No issues with mixed 1-2-byte character streaming, because everything uses 2 bytes.
  • Browsers don't need to wait for the <meta> tag specifying encoding before they can do anything. StackOverflow doesn't even have the meta tag, making browsers download the full page first, slowing page rendering.
  • You would never see ? and other random symbols on old web pages (e.g. in place of Microsoft Word's special [read: horrible] quotes).
  • More characters can be represented in UTF-8.
  • Other things I can't think of right now.

So why haven't the inferior encodings been nuked from space?


  • Don't even need to think about encoding issues.

True. Except for all the data that's still in the old ASCII format.

  • No issues with mixed 1-2-byte character streaming, because everything uses 2 bytes.

Incorrect. UTF-8 is variable length: 1 to 4 bytes per character (the original design allowed sequences up to 6 bytes, but RFC 3629 restricts it to 4).
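A quick Python illustration of that variable width:

```python
# Each code point's UTF-8 length depends on its value: ASCII is 1 byte,
# accented Latin letters are 2, most CJK and symbols are 3, emoji are 4.
for ch in ["A", "é", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):05X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```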

  • Browsers don't need to wait for the <meta> tag specifying encoding before they can do anything. StackOverflow doesn't even have the meta tag, making browsers download the full page first, slowing page rendering.

Browsers don't generally wait for the full page, they make a guess based on the first part of the page data.
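A rough sketch of that kind of guess, in Python. The function `guess_encoding` is hypothetical; real browsers use far more elaborate heuristics (and the WHATWG Encoding Standard spells out the actual sniffing rules):

```python
def guess_encoding(head: bytes) -> str:
    """Crude sniff on a page prefix, loosely like a browser's early guess."""
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"               # UTF-8 byte-order mark
    try:
        head.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "iso-8859-1"              # Latin-1 never fails, so: fallback

print(guess_encoding("naïve".encode("utf-8")))        # utf-8
print(guess_encoding("naïve".encode("iso-8859-1")))   # iso-8859-1
```

Note one pitfall of sniffing only a prefix: a multi-byte sequence truncated at the chunk boundary would raise `UnicodeDecodeError` and trigger a spurious fallback.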

  • You would never see ? and other random symbols on old web pages (e.g. in place of Microsoft Word's special [read: horrible] quotes).

Except for all those other old web pages that use other non-UTF-8 encodings (the non-English speaking world is pretty big).
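Those Word quotes are a concrete example of the mismatch: the curly quotes live in Windows-1252-specific byte positions, so decoding them with the wrong codec produces exactly the junk the question describes:

```python
# Word's curly quotes are Windows-1252-specific bytes; decoded with the
# wrong codec they turn into mojibake or replacement characters.
raw = "“smart quotes”".encode("cp1252")       # b'\x93smart quotes\x94'
print(raw.decode("utf-8", errors="replace"))  # '�smart quotes�'
print(raw.decode("iso-8859-1"))  # U+0093/U+0094 control chars: renders as junk
```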

  • More characters can be represented in UTF-8.

True. Your problems of data validation just got harder, too.


Why are EBCDIC, Baudot, and Morse still not nuked from orbit? Why did the buggy-whip manufacturers not close their doors the day after Gottlieb Daimler shipped his first automobile?

Relegating a technology to history takes non-zero time.


No issues with mixed 1-2-byte character streaming, because everything uses 2 bytes.

Not true at all. UTF-8 is a mixed-width 1, 2, 3, and 4-byte encoding. You may have been thinking of UTF-16, but even that has had 4-byte characters for a while. If you want a “simple” fixed-width encoding, you need UTF-32.
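The three widths side by side, in Python:

```python
# UTF-16 needs a surrogate pair (4 bytes) for code points above U+FFFF,
# so it is variable-width too; only UTF-32 gives every character 4 bytes.
for ch in ["A", "😀"]:
    print(ch,
          len(ch.encode("utf-8")),      # 1 and 4 bytes
          len(ch.encode("utf-16-le")),  # 2 and 4 bytes
          len(ch.encode("utf-32-le")))  # always 4 bytes
```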

You would never see ? and other random symbols on old web pages

Even with UTF-8 web pages, you still might not have a font that supports every Unicode character, so this is still a problem.

More characters can be represented in UTF-8.

Sometimes this is a disadvantage. Having more characters means more bits are required to encode the characters. And to keep track of which ones are letters, digits, etc. And to store the fonts for displaying those characters. And to deal with additional Unicode-related complexities like normalization.

This is probably a non-issue for modern computers with gigabytes of RAM, but don't expect your TI-83 to support Unicode any time soon.
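Normalization is a good example of those extra complexities; a minimal Python sketch:

```python
import unicodedata

# "é" can be one precomposed code point or 'e' plus a combining accent;
# the two spellings display identically but compare unequal until normalized.
precomposed = "caf\u00e9"   # é as U+00E9
decomposed = "cafe\u0301"   # e + U+0301 COMBINING ACUTE ACCENT
print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```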


But still, if you do need those extra characters, it's far easier to work with UTF-8 than with zillions of different 8-bit character encodings (plus a few non-self-synchronizing East Asian multibyte encodings).
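A small sketch of why that self-synchronizing property matters:

```python
# UTF-8 is self-synchronizing: continuation bytes all match 0b10xxxxxx,
# a pattern no lead byte uses, so a decoder dropped into the middle of a
# stream can back up to the nearest character boundary.
data = "héllo wörld".encode("utf-8")
i = 2                                       # arbitrary offset, mid-character
while i > 0 and (data[i] & 0xC0) == 0x80:   # skip continuation bytes
    i -= 1
print(data[i:].decode("utf-8"))             # éllo wörld
```

In a non-self-synchronizing encoding such as Shift_JIS, a byte in the middle of a stream cannot be classified in isolation, so a decoder that loses its place can misread everything that follows.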

So why haven't the inferior encodings been nuked from space?

In large part, this is because the “inferior” programming languages haven't been nuked from space. Lots of code is still written in languages like C and C++ (and even COBOL!) that predate Unicode and still don't have good support for it.

I badly wish we could get rid of the situation where some libraries use char-based strings encoded in UTF-8 while others think char is only for legacy encodings and Unicode should always use wchar_t, and then you have to deal with whether wchar_t is UTF-16 or UTF-32 (or neither).


I don't think UTF-8 uses "2 bytes"; it's variable length. Also, a lot of OS-level code uses UTF-16 or UTF-32 respectively, which means the choice for single-byte Latin encodings is between ASCII and ISO-8859-1.


Well, your question is a bit of a why-is-the-world-so-bad complaint. It is the way it is. The pages written in encodings other than UTF-8 come from the times when UTF-8 was poorly supported by operating systems and was not yet the de facto standard.

These pages will stay in their original encodings until someone changes them, which in many cases is not very likely. Many of them are no longer maintained by anyone.

There are also a lot of documents on the internet in non-Unicode encodings, in many formats. Someone COULD convert them, but, as above, that requires a lot of effort.

So support for non-Unicode encodings must also stay.

And for the present day, keep to the rule that every time someone uses a non-Unicode encoding, a kitten dies.
