How to Determine "Lowest" Encoding Possible?

2023-01-16 17:58

Scenario

You have lots of XML files stored as UTF-16 in a database or on a server where space is not an issue. A large majority of these files must be delivered to other systems as XML files, and there it is critical to use as little space as possible.

Issue

In reality only about 10% of the files stored as UTF-16 need to be stored as UTF-16; the rest can safely be stored as UTF-8. If we keep UTF-16 only for the files that need it and store the rest as UTF-8, we can use about 40% less space on the file system.

We have tried compressing the data, which is useful, but we find that we get the same compression ratio with UTF-8 as with UTF-16, and UTF-8 compresses faster as well. Therefore, if as much of the data as possible is stored as UTF-8, we save space when it is stored uncompressed, we still save space when it is compressed, and we even save time on the compression itself.

Goal

To figure out when there are Unicode characters in the XML file that require UTF-16 so we can only use UTF-16 when we have to.

Some Details about XML File and Data

While we control the schema for the XML itself, we do not control what kind of "strings" can go in the values from a Unicode perspective, as the source is free to provide any Unicode data. However, this is rare, so we would like not to have to use UTF-16 every time just to support something that is only needed 10% of the time.

Development Environment

We are using C# with the .NET Framework 4.0.

EDIT: Solution

The solution is just to use UTF-8.

The question was based on my misunderstanding of UTF and I appreciate everyone helping set me straight. Thank you!

Edit: I didn’t realise that your question implies that you think that there are Unicode strings that cannot be safely encoded as UTF-8. This is not the case. The following answer assumes that what you really meant was that some strings will simply be longer (take more storage space) as UTF-8.

I would say even less than 10% of the files need to be stored as UTF-16. Even if your XML contains significant amounts of Chinese, Japanese, Korean, or another language that is larger in UTF-8 than UTF-16, it is still only an issue if there is more text in that language than there is XML syntax.

Therefore, my initial intuition is “use UTF-8 until it’s a problem”. It makes for consistency, too.

If you have serious reason to believe that a large proportion of the XML will be East Asian, only then do you need to worry about it. In that case, I would apply a simple heuristic: go through the XML and count the number of characters at or above U+0800 (those are three bytes in UTF-8), and use UTF-16 only if this is greater than the number of characters below U+0080 (those are one byte in UTF-8).
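The heuristic above could be sketched roughly as follows. This is an illustration in Python rather than the C# of the question, and the function name `prefer_utf16` is made up for the example. As an assumption of this sketch, characters beyond U+FFFF are left out of the count, since they take four bytes in either encoding.

```python
def prefer_utf16(text: str) -> bool:
    """Return True only when three-byte UTF-8 characters outnumber one-byte ones.

    Hypothetical helper: counts BMP characters at or above U+0800 (three bytes
    in UTF-8, two in UTF-16) against ASCII characters (one byte in UTF-8, two
    in UTF-16). Characters above U+FFFF are ignored (assumption of this sketch).
    """
    three_byte = sum(1 for ch in text if 0x0800 <= ord(ch) <= 0xFFFF)
    one_byte = sum(1 for ch in text if ord(ch) < 0x80)
    return three_byte > one_byte


if __name__ == "__main__":
    # Pure ASCII markup and text: UTF-8 is the obvious choice.
    print(prefer_utf16("<name>Alice</name>"))  # False
```

Even for XML with East Asian content, the markup itself usually tips the count back toward UTF-8.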

Encode everything in UTF-8. UTF-8 can handle anything UTF-16 can, and is almost surely going to be smaller in the case of an XML document. The only case in which UTF-8 would be larger than UTF-16 is a file largely composed of characters from U+0800 to U+FFFF, which take three bytes in UTF-8 but two in UTF-16 (characters beyond the BMP take four bytes in both). In the best case (pure ASCII, which includes every character you can type on a standard U.S. 104-key keyboard), a UTF-8 file would be half the size of the UTF-16 one.

UTF-8 requires two bytes or fewer per character for everything at or below U+07FF, and one byte for any ASCII character (U+0000 to U+007F). That means UTF-8 will never be larger than UTF-16 (and will probably be far smaller) for any document in a modern-day language using the Latin, Greek, Cyrillic, Hebrew or Arabic alphabets, including most of the common symbols used in algebra and the IPA. That range is the low end of the Basic Multilingual Plane, and it covers more than 90% of all official national languages outside of Asia.
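As a quick reference, the per-character cost of each encoding depends only on the code point's range. A minimal sketch in Python (the helper names `utf8_len` and `utf16_len` are invented for illustration), cross-checked against Python's own encoders:

```python
def utf8_len(cp: int) -> int:
    """Bytes needed to encode one code point in UTF-8."""
    if cp < 0x80:
        return 1      # ASCII
    if cp < 0x800:
        return 2      # Latin supplements, Greek, Cyrillic, Hebrew, Arabic, ...
    if cp < 0x10000:
        return 3      # rest of the BMP: Devanagari, CJK, Hangul, ...
    return 4          # supplementary planes


def utf16_len(cp: int) -> int:
    """Bytes needed to encode one code point in UTF-16 (no BOM)."""
    return 2 if cp < 0x10000 else 4  # surrogate pair above the BMP


if __name__ == "__main__":
    for ch in ["A", "é", "я", "ह", "中", "😀"]:
        cp = ord(ch)
        # Confirm the range table against the real encoders.
        assert len(ch.encode("utf-8")) == utf8_len(cp)
        assert len(ch.encode("utf-16-le")) == utf16_len(cp)
        print(f"U+{cp:04X}  utf-8: {utf8_len(cp)}  utf-16: {utf16_len(cp)}")
```

UTF-16 is smaller only in the three-byte row; everywhere else UTF-8 ties or wins.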

UTF-16, as a general rule, will give you a smaller file for documents written primarily in the Devanagari (Hindi), Japanese, Chinese, or Hangul (Korean) scripts, or any ancient or "esoteric" script (Cherokee or Inuit anyone?), and MAY be smaller for documents that heavily use specialized mathematical, scientific, engineering or game symbols. If the XML you're working with is for localization files for India, China and Japan, you MAY get a smaller file size with UTF-16, but you will have to make your program smart enough to know the localization file is encoded that way.

You never 'need' to use UTF-16 instead of UTF-8 and the choice is not about 'safety'. Both encodings have the same encodable character repertoire.

There is no such thing as a document that has to be UTF-16. Any UTF-16 document can also be encoded as UTF-8. It is theoretically possible to have a document which is larger as UTF-8 than as UTF-16, but this is vanishingly unlikely, and not worth stressing over.

Just encode everything as UTF-8 and stop worrying about it.

There are no characters that require UTF-16 rather than UTF-8. Both UTF-8 and UTF-16 (and for that matter, UTF-32 along with some other non-recommended formats) can encode the entire UCS (that's what UTF means).
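A small round-trip check illustrates the point. This is only a sketch in Python, with a hypothetical helper name (`round_trips`) and sample strings chosen for the example: each string, including one with a character outside the BMP, survives encoding and decoding in both formats unchanged.

```python
def round_trips(text: str) -> bool:
    """True if text comes back identical after a trip through UTF-8 and UTF-16."""
    via_utf8 = text.encode("utf-8").decode("utf-8")
    via_utf16 = text.encode("utf-16").decode("utf-16")  # BOM written, then consumed
    return via_utf8 == text and via_utf16 == text


if __name__ == "__main__":
    samples = ["plain ASCII", "Ελληνικά", "हिन्दी", "日本語", "😀 beyond the BMP"]
    for s in samples:
        print(round_trips(s), repr(s))
```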

There are some streams that will be smaller in UTF-16 than in UTF-8. In practice, such streams will largely contain Asian ideographs, which are linguistically very concise. However, XML gives specific meanings to some characters in the 0x20-0x7F range, and element and attribute names quite often use alphabet-based scripts.

Because of the aforementioned concision of these ideographs, the ratio of XML tags (including the element and attribute names along with the less-thans and greater-thans) to human-targeted text will be much higher than in languages that use alphabets and syllabaries. For this reason, even in cases where plain text in UTF-16 would be appreciably smaller than the same text in UTF-8, when it comes to XML either this difference will be less, or the UTF-8 will still be smaller.
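To make this concrete, here is a small measurement in Python. The `<address>` and `<city>` element names and the sample text are invented for illustration, and `encoded_sizes` is a hypothetical helper. The same six ideographs are measured alone and then wrapped in ASCII markup.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count without BOM) for text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Six CJK ideographs: three bytes each in UTF-8, two each in UTF-16.
plain = "東京都渋谷区"

# The same text inside hypothetical XML elements (32 ASCII markup characters).
xml = "<address><city>東京都渋谷区</city></address>"

if __name__ == "__main__":
    print(encoded_sizes(plain))  # (18, 12): UTF-16 is smaller for the bare text
    print(encoded_sizes(xml))    # (50, 76): UTF-8 is smaller once markup is added
```

The markup costs one byte per character in UTF-8 but two in UTF-16, which more than cancels the saving on the ideographs here.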

As a rule, use UTF-8 for transmission and storage.

Edit: Just noticed that you're compressing too. In that case the balance is even less important; just use UTF-8 and be done with it.
