Efficient way to ASCII encode UTF-8

2022-12-25 17:31

I'm looking for a simple and efficient way to store UTF-8 strings in 7-bit ASCII. By efficient I mean the following:

all ASCII alphanumeric chars in the input should stay the same ASCII alphanumeric chars in the output
the resulting string should be as short as possible
the operation needs to be reversible without any data loss
the resulting ASCII string should be case insensitive
there should be no restriction on the input length
the whole UTF-8 range should be allowed

My first idea was to use Punycode (IDNA) as it fits the first four requirements, but it fails at the last two.

Can anyone recommend an alternative encoding scheme? Even better if there's some code available to look at.

UTF-7, or, slightly less transparent but more widespread, quoted-printable.

all ASCII chars in the input should stay ASCII chars in the output

(Obviously not fully possible as you need at least one character to act as an escape.)

Since ASCII covers the full range of 7-bit values, an encoding scheme that preserves all ASCII characters, is 7-bits long, and encodes the full Unicode range is not possible.

Edited to add:

I think I understand your requirements now. You are looking for a way to encode UTF-8 strings in a seven-bit code, in which, if that encoded string were interpreted as ASCII text, then the case of the alphabetic characters may be arbitrarily modified, and yet the decoded string will be byte-for-byte identical to the original.

If that's the case, then your best bet would probably be just to encode the binary representation of the original as a string of hexadecimal digits. I know you are looking for a more compact representation, but that's a pretty tall order given the other constraints of the system, unless some custom encoding is devised.
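As a rough illustration of the hexadecimal approach, here is a minimal Python sketch. The helper names `to_hex` and `from_hex` are made up for this example; it relies only on `bytes.hex()` and `bytes.fromhex()`, which accept hex digits in either case.

```python
def to_hex(text: str) -> str:
    """Encode text as UTF-8, then as lowercase hexadecimal digits."""
    return text.encode("utf-8").hex()


def from_hex(encoded: str) -> str:
    """Decode hexadecimal digits (upper or lower case) back to the text."""
    return bytes.fromhex(encoded).decode("utf-8")


original = "Grüße, 世界"
encoded = to_hex(original)

# The round trip survives arbitrary case changes of the encoded form.
assert from_hex(encoded) == original
assert from_hex(encoded.upper()) == original

# Every input byte becomes two hex digits, so the output is twice as long.
assert len(encoded) == 2 * len(original.encode("utf-8"))
```

Note that this gives up the first requirement: ASCII alphanumerics do not stay readable, and the output is always double the size of the UTF-8 input.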

Since the hexadecimal representation can encode arbitrary binary values, it might be possible to shrink the string by compressing the data before converting it to hex.
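A possible sketch of compress-then-hex in Python, assuming zlib as the compressor (the answer does not name one) and using hypothetical helper names. Whether this actually saves space depends on the input: short strings can come out longer because of the zlib header and checksum.

```python
import zlib


def to_compressed_hex(text: str) -> str:
    """UTF-8 encode, compress with zlib, then hex-encode the result."""
    return zlib.compress(text.encode("utf-8"), 9).hex()


def from_compressed_hex(encoded: str) -> str:
    """Reverse of to_compressed_hex; hex digits may be in either case."""
    return zlib.decompress(bytes.fromhex(encoded)).decode("utf-8")


# A deliberately repetitive sample, where compression pays off.
sample = "日本語のテキスト " * 20
plain_hex = sample.encode("utf-8").hex()
packed_hex = to_compressed_hex(sample)

assert from_compressed_hex(packed_hex.upper()) == sample
print(len(plain_hex), len(packed_hex))
```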

If you're talking about non-standard schemes - MECE

URL encoding or numeric character references are two possible options.
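For comparison, both options can be tried with the Python standard library. This is only a sketch: neither form is case-insensitive for the ASCII letters themselves, and the numeric-character-reference round trip shown here assumes the input contains no literal `&`, since `html.unescape` would also interpret existing references.

```python
import html
from urllib.parse import quote, unquote

text = "héllo wörld"

# URL (percent) encoding: non-ASCII bytes become %XX escapes.
url_encoded = quote(text, safe="")
assert url_encoded == "h%C3%A9llo%20w%C3%B6rld"
assert unquote(url_encoded) == text

# Numeric character references: non-ASCII code points become &#NNN;.
ncr_encoded = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
assert ncr_encoded == "h&#233;llo w&#246;rld"
assert html.unescape(ncr_encoded) == text
```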

It depends on the distribution of characters in your strings.

Quoted-printable is good for mostly-ASCII strings because there's no overhead except with '=' and control characters. However, non-ASCII characters take an inefficient 6-12 bytes each, so if you have a lot of those, you'll want to consider UTF-7 or Base64 instead.
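To see how the distribution of characters changes the picture, one could measure the encoded sizes directly. The sketch below uses the standard `quopri` and `base64` modules and Python's built-in `utf-7` codec; `encoded_sizes` is a hypothetical helper, and the two sample strings are arbitrary.

```python
import base64
import quopri


def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes for three candidate schemes."""
    raw = text.encode("utf-8")
    return {
        "quoted-printable": len(quopri.encodestring(raw)),
        "utf-7": len(text.encode("utf-7")),
        "base64": len(base64.b64encode(raw)),
    }


ascii_sizes = encoded_sizes("hello world")
cjk_sizes = encoded_sizes("日本語")

# Pure ASCII passes through quoted-printable unchanged (11 bytes),
# while Base64 always adds roughly one third (16 bytes).
print(ascii_sizes)

# Each CJK character is 3 UTF-8 bytes, so quoted-printable needs 9 bytes
# per character (27 in total), against 12 bytes for Base64.
print(cjk_sizes)
```

Keep in mind that Base64 and the Base64 runs inside UTF-7 are case-sensitive, so they do not satisfy the case-insensitivity requirement from the question.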

Punycode is used for IDNA, but you can also use it on its own, outside the restrictions IDNA imposes.

Per se, Punycode doesn't fail your last 2 requirements:

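>>> import sys
>>> # a long run of the highest code point still encodes
>>> _ = ("\U0010FFFF"*10000).encode("punycode")
>>> # every code point up to and including sys.maxunicode gives a non-empty result
>>> all(chr(c).encode("punycode") for c in range(sys.maxunicode + 1))
True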

(For IDNA itself, Python supplies a separate codec, named "idna".)

Obviously, if you don't Nameprep the input, the encoded string isn't strictly case-insensitive anymore. But if you supply only lowercase input (or if you don't care about the case of the decoded text), you should be good to go.
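The lowercase-only caveat can be sketched as a pair of helpers around Python's `punycode` codec. The names `to_ascii` and `from_ascii` are invented for this example, and rejecting ASCII uppercase letters is just one way to handle the caveat. The sketch relies on the codec emitting lowercase digits, so lowercasing the received string restores the exact encoded form.

```python
def to_ascii(text: str) -> str:
    """Punycode-encode text; ASCII uppercase is refused because it would
    not survive a case-insensitive channel."""
    if any("A" <= ch <= "Z" for ch in text):
        raise ValueError("input must not contain ASCII uppercase letters")
    return text.encode("punycode").decode("ascii")


def from_ascii(encoded: str) -> str:
    """Normalise to lowercase, then decode the Punycode string."""
    return encoded.lower().encode("ascii").decode("punycode")


encoded = to_ascii("bücher")
assert encoded == "bcher-kva"

# The decoded text is unchanged even if the stored form was upper-cased.
assert from_ascii(encoded.upper()) == "bücher"
```

The ASCII letters of the input stay visible at the front of the output, which is what makes Punycode attractive for the first requirement.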
