Intra-Unicode "lean" Encoding Converters

2023-01-03 06:02

Windows provides encoding conversion functions ("MultiByteToWideChar" and "WideCharToMultiByte") which are capable of UTF-8 to/from UTF-16 conversions, among other things. But I've seen people offer home-grown 30 to 40 line functions that claim also to perform UTF-8 / UTF-16 encoding conversions.

My question is, how reliable are such tiny converters? Can such a tiny amount of code handle problems such as converting a UTF-16 surrogate pair (such as <D800 DC00>) into a UTF-8 single four byte sequence (rather than making the mistake of converting into a pair of three byte sequences)? Can they correctly spot "unpaired" surrogate input, and provide an error?
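To make the surrogate question concrete, here is a minimal sketch in C of the check a converter needs on the UTF-16 side. The function name `utf16_next` and its return convention are made up for this illustration; they are not taken from Windows, ICU, or any converter mentioned in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from the UTF-16 code units
 * starting at s[*pos]. On success returns 0, stores the code point in *cp
 * and advances *pos by 1 or 2. On an unpaired surrogate returns -1 and
 * consumes nothing. */
int utf16_next(const uint16_t *s, size_t len, size_t *pos, uint32_t *cp)
{
    assert(s != NULL && pos != NULL && cp != NULL && *pos < len);

    uint16_t hi = s[*pos];
    if (hi < 0xD800 || hi > 0xDFFF) {   /* ordinary BMP code unit */
        *cp = hi;
        *pos += 1;
        return 0;
    }
    if (hi >= 0xDC00)                   /* low surrogate with no lead */
        return -1;
    if (*pos + 1 >= len)                /* high surrogate at end of input */
        return -1;

    uint16_t lo = s[*pos + 1];
    if (lo < 0xDC00 || lo > 0xDFFF)     /* high surrogate not followed by a low one */
        return -1;

    /* Combine the two 10-bit halves and add the 0x10000 offset. */
    *cp = 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
    *pos += 2;
    return 0;
}
```

With this, <D800 DC00> yields the single code point U+10000, which a UTF-8 encoder then writes as one four byte sequence. A lone D800 or DC00 is reported as an error instead of being passed through.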

In short, are such tiny converters mere toys, or can they be taken seriously? For that matter, why does unicode.org seemingly offer no advice on an algorithm for accomplishing such conversions?

The open source ICU library has 113 lines of code for ucnv_fromUnicode_UTF8 (source/common/ucnv_u8.c). Error checking included, proper surrogate handling, some comments. You should only consider using something else if you don't like the naming conventions.

Converting between UTF-8, -16 and -32 is a pretty simple process. It is simple because they all work with the same "character set", and just use different encodings to represent each code point.
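As a rough illustration of how little is involved, the following C sketch encodes one code point as UTF-8. The name `utf8_encode` and the "return 0 on error" convention are assumptions made for this example, not part of any library named here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8 into out, which must
 * have room for 4 bytes. Returns the number of bytes written, or 0 if cp is
 * a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {                        /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)   /* surrogates are not characters */
            return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* beyond the Unicode range */
}
```

The same code point value goes in regardless of whether it came from UTF-16 or UTF-32 input; only the bit packing differs between the encodings.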

The tricky part is converting to/from a non-UTF format. MultiByteToWideChar can do that. A 15-line conversion function can't.

Yes, production quality functions can be that short. I've written full-strength, error checking, defensive, pedantic, understandable, bulletproof conversions for UTF-8 -> UTF-32 and UTF-32 to UTF-8 in about 50 lines each, with comments (but not including the unit tests). There are denser coding styles that could probably do the same in 30-40 lines for each function. There are also shortcuts you can take transcoding UTF-8 to/from UTF-16 directly without UTF-32 in between.
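For a sense of scale, here is a sketch (not the code described above) of a checked UTF-8 to UTF-32 step in C. The name `utf8_decode` and its "0 means malformed" convention are assumptions for illustration. It rejects stray continuation bytes, truncated sequences, overlong forms, encoded surrogates, and values above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from the bytes s[0..len).
 * Returns the number of bytes consumed (1 to 4) and stores the code point
 * in *cp, or returns 0 if the input is empty or malformed. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;

    unsigned char b0 = s[0];
    size_t need;        /* number of continuation bytes expected */
    uint32_t v, min;    /* accumulated value, smallest value for this length */

    if (b0 < 0x80)      { *cp = b0; return 1; }
    else if (b0 < 0xC0) { return 0; }                         /* stray continuation byte */
    else if (b0 < 0xE0) { need = 1; v = b0 & 0x1F; min = 0x80; }
    else if (b0 < 0xF0) { need = 2; v = b0 & 0x0F; min = 0x800; }
    else if (b0 < 0xF8) { need = 3; v = b0 & 0x07; min = 0x10000; }
    else                { return 0; }                         /* 0xF8..0xFF never valid */

    if (len < need + 1)
        return 0;                                             /* truncated sequence */

    for (size_t i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                                         /* not a continuation byte */
        v = (v << 6) | (uint32_t)(s[i] & 0x3F);
    }

    if (v < min)                      return 0;               /* overlong encoding */
    if (v >= 0xD800 && v <= 0xDFFF)   return 0;               /* encoded surrogate */
    if (v > 0x10FFFF)                 return 0;               /* outside Unicode */

    *cp = v;
    return need + 1;
}
```

A full string converter is a loop around this function plus a decision about what to do on error (stop, or substitute U+FFFD), which is where the remaining lines of a 50-line version tend to go.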

You are correct - most "copy/paste" routines you can find on the Internet don't perform validity checks at all.

If you want a small library that performs those checks, take a look at UTF8-CPP. It has both "checked" and "unchecked" versions of the conversion functions.

There used to be a sample converter in C at the Unicode web site at ftp://ftp.unicode.org/Public/PROGRAMS/CVTUTF/ but it was removed. I have no idea why as it was very useful and had a non-restrictive license - you would have to ask them.

It was pretty small and I have used it. I believe it did handle surrogate pairs properly but as I don't have the code in front of me I can't swear by it. I'm sure you can find copies of it elsewhere on the web though.

The downside is that it's of no use if you have to convert to or from a non-Unicode character set, as it only converts between UTF variants.
