Intra-Unicode "lean" Encoding Converters

Windows provides encoding conversion functions ("MultiByteToWideChar" and "WideCharToMultiByte") which are capable of UTF-8 to/from UTF-16 conversions, among other things. But I've seen people offer home-grown 30 to 40 line functions that also claim to perform UTF-8 / UTF-16 encoding conversions.

My question is, how reliable are such tiny converters? Can such a tiny amount of code handle problems such as converting a UTF-16 surrogate pair (such as <D800 DC00>) into a UTF-8 single four byte sequence (rather than making the mistake of converting into a pair of three byte sequences)? Can they correctly spot "unpaired" surrogate input, and provide an error?

In short, are such tiny converters mere toys, or can they be taken seriously? For that matter, why does unicode.org seemingly offer no advice on an algorithm for accomplishing such conversions?


The open source ICU library has 113 lines of code for ucnv_fromUnicode_UTF8 (source/common/ucnv_u8.c). Error checking included, proper surrogate handling, some comments. You should only consider using something else if you don't like the naming conventions.


Converting between UTF-8, -16 and -32 is a pretty simple process. It is simple because they all work with the same "character set", and just use different encodings to represent each code point.
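To make the "same code points, different encodings" point concrete, here is a minimal sketch in Python (my choice of language for illustration; the function names encode_utf8 and encode_utf16 are made up for this example). It takes one code point, which is effectively its UTF-32 value, and produces the UTF-8 bytes and the UTF-16 code units for it.

```python
def _check_scalar(cp):
    """Reject anything that is not a Unicode scalar value."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %#x" % cp)


def encode_utf8(cp):
    """Encode one code point as a list of UTF-8 byte values."""
    _check_scalar(cp)
    if cp < 0x80:                       # 1 byte: 0xxxxxxx
        return [cp]
    if cp < 0x800:                      # 2 bytes: 110xxxxx 10xxxxxx
        return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    if cp < 0x10000:                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    return [0xF0 | (cp >> 18),          # 4 bytes: 11110xxx + 3 continuation bytes
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)]


def encode_utf16(cp):
    """Encode one code point as a list of UTF-16 code units."""
    _check_scalar(cp)
    if cp < 0x10000:                    # BMP: a single code unit
        return [cp]
    v = cp - 0x10000                    # 20 bits split across a surrogate pair
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]
```

For U+10000 this gives the UTF-16 pair <D800 DC00> and the UTF-8 bytes F0 90 80 80, which is the case the question asks about.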

The tricky part is converting to/from a non-UTF format. MultiByteToWideChar can do that. A 15-line conversion function can't.


Yes, production quality functions can be that short. I've written full-strength, error checking, defensive, pedantic, understandable, bulletproof conversions for UTF-8 -> UTF-32 and UTF-32 to UTF-8 in about 50 lines each, with comments (but not including the unit tests). There are denser coding styles that could probably do the same in 30-40 lines for each function. There are also shortcuts you can take transcoding UTF-8 to/from UTF-16 directly without UTF-32 in between.
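As an illustration of how little code a checked decoder needs, here is a sketch of UTF-8 -> UTF-32 in Python. This is not the code described above, just one possible way to write it; the name decode_utf8 and the choice to raise ValueError on bad input are assumptions for this example. It rejects invalid lead bytes, truncated sequences, bad continuation bytes, overlong forms, encoded surrogates, and values above U+10FFFF.

```python
def decode_utf8(data):
    """Decode UTF-8 bytes into a list of code points (UTF-32 values).

    Raises ValueError on any ill-formed input.
    """
    out = []
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        # Lead byte gives the payload bits, the number of continuation
        # bytes, and the smallest code point allowed for that length.
        if b0 < 0x80:
            cp, need, lowest = b0, 0, 0
        elif 0xC2 <= b0 <= 0xDF:        # C0 and C1 could only be overlong
            cp, need, lowest = b0 & 0x1F, 1, 0x80
        elif 0xE0 <= b0 <= 0xEF:
            cp, need, lowest = b0 & 0x0F, 2, 0x800
        elif 0xF0 <= b0 <= 0xF4:        # F5..FF can never appear in UTF-8
            cp, need, lowest = b0 & 0x07, 3, 0x10000
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        if i + need >= n and need > 0:
            raise ValueError("truncated sequence at offset %d" % i)
        for k in range(1, need + 1):
            b = data[i + k]
            if b & 0xC0 != 0x80:
                raise ValueError("bad continuation byte at offset %d" % (i + k))
            cp = (cp << 6) | (b & 0x3F)
        if cp < lowest:
            raise ValueError("overlong encoding at offset %d" % i)
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("encoded surrogate at offset %d" % i)
        if cp > 0x10FFFF:
            raise ValueError("code point out of range at offset %d" % i)
        out.append(cp)
        i += need + 1
    return out
```

That is roughly 35 lines including comments, and every check in it is a range comparison; none of them needs a lookup table.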


You are correct - most "copy/paste" routines you can find on the Internet don't perform validity checks at all.

If you want a small library that performs those checks, take a look at UTF8-CPP. It has both "checked" and "unchecked" versions of the conversion functions.
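For comparison, here is a rough sketch of what a "checked" UTF-16 -> UTF-8 conversion has to do. It is written in Python for brevity and is not taken from UTF8-CPP or any other library; the name utf16_to_utf8 and the use of ValueError are assumptions for this example. The input is a sequence of 16-bit code units given as integers.

```python
def utf16_to_utf8(units):
    """Convert UTF-16 code units (ints) to UTF-8 bytes.

    Raises ValueError on unpaired surrogates or out-of-range units.
    """
    out = bytearray()
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if not 0 <= u <= 0xFFFF:
            raise ValueError("not a 16-bit code unit at index %d" % i)
        i += 1
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: the next unit must be a low surrogate.
            if i >= n or not (0xDC00 <= units[i] <= 0xDFFF):
                raise ValueError("unpaired high surrogate at index %d" % (i - 1))
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate at index %d" % (i - 1))
        else:
            cp = u
        # Emit the code point as 1 to 4 UTF-8 bytes.
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out.extend([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        elif cp < 0x10000:
            out.extend([0xE0 | (cp >> 12),
                        0x80 | ((cp >> 6) & 0x3F),
                        0x80 | (cp & 0x3F)])
        else:
            out.extend([0xF0 | (cp >> 18),
                        0x80 | ((cp >> 12) & 0x3F),
                        0x80 | ((cp >> 6) & 0x3F),
                        0x80 | (cp & 0x3F)])
    return bytes(out)
```

With the input <D800 DC00> this produces the single four-byte sequence F0 90 80 80. A routine that skips the surrogate branch and encodes each code unit on its own would instead emit ED A0 80 ED B0 80, which is the "pair of three byte sequences" mistake and is not valid UTF-8.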


There used to be a sample converter in C at the Unicode web site at ftp://ftp.unicode.org/Public/PROGRAMS/CVTUTF/ but it was removed. I have no idea why as it was very useful and had a non-restrictive license - you would have to ask them.

It was pretty small and I have used it. I believe it did handle surrogate pairs properly but as I don't have the code in front of me I can't swear by it. I'm sure you can find copies of it elsewhere on the web though.

The downside is that it's of no use if you have to convert to or from a non-Unicode character set, as it only converts between UTF variants.
