
Is there a C library to convert Unicode code points to UTF-8?

I have to go through some text and write UTF-8 output according to the character patterns. I thought it would be easy if I could work with the code points and convert them to UTF-8. I have been reading about Unicode and UTF-8, but couldn't find a good solution. Any help will be appreciated.


Converting Unicode code points to UTF-8 is so trivial that making the call to a library probably takes more code than just doing it yourself:

if (c<0x80) *b++=c;
else if (c<0x800) *b++=192+c/64, *b++=128+c%64;
else if (c-0xd800u<0x800) goto error;
else if (c<0x10000) *b++=224+c/4096, *b++=128+c/64%64, *b++=128+c%64;
else if (c<0x110000) *b++=240+c/262144, *b++=128+c/4096%64, *b++=128+c/64%64, *b++=128+c%64;
else goto error;

Also, doing it yourself means you can tune the API to the type of work you need (one character at a time, or long strings?). You can remove the error cases if you know your input is always a valid Unicode scalar value.
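As an illustration of the character-at-a-time variant, the snippet above could be wrapped in a function like the following. This is only a sketch: the name utf8_encode, the return convention (0 on error instead of goto error), and the use of uint32_t are assumptions, not part of the original code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrapper around the snippet above.
 * Encodes code point c into b, which must have room for 4 bytes.
 * Returns the number of bytes written, or 0 if c is a surrogate
 * or lies above 0x10FFFF. */
size_t utf8_encode(uint32_t c, unsigned char *b)
{
    unsigned char *start = b;

    if (c < 0x80) {
        *b++ = (unsigned char)c;
    } else if (c < 0x800) {
        *b++ = (unsigned char)(192 + c / 64);
        *b++ = (unsigned char)(128 + c % 64);
    } else if (c - 0xd800u < 0x800) {
        return 0;                       /* surrogate: not a scalar value */
    } else if (c < 0x10000) {
        *b++ = (unsigned char)(224 + c / 4096);
        *b++ = (unsigned char)(128 + c / 64 % 64);
        *b++ = (unsigned char)(128 + c % 64);
    } else if (c < 0x110000) {
        *b++ = (unsigned char)(240 + c / 262144);
        *b++ = (unsigned char)(128 + c / 4096 % 64);
        *b++ = (unsigned char)(128 + c / 64 % 64);
        *b++ = (unsigned char)(128 + c % 64);
    } else {
        return 0;                       /* beyond the Unicode range */
    }
    return (size_t)(b - start);
}
```

Calling utf8_encode(0x20AC, buf) with a 4-byte buf should write the three bytes E2 82 AC and return 3. A string-oriented variant would loop over the input and advance the output pointer by the returned length.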

The other direction is a good bit harder to get correct. I recommend a finite automaton approach rather than the typical bit-arithmetic loops that sometimes decode invalid sequences as aliases for real characters (which is very dangerous and can lead to security problems).
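To make the automaton idea concrete, here is one possible byte-at-a-time decoder. It is a sketch under assumptions: the names utf8_state and utf8_feed and the result codes are invented for illustration, and the per-state byte ranges follow the well-formed sequence rules of RFC 3629. The state is the number of continuation bytes still expected plus the range allowed for the next byte, which is what rejects overlong forms, surrogates, and values above 0x10FFFF.

```c
#include <stdint.h>

/* Hypothetical decoder state; zero-initialize before first use. */
struct utf8_state {
    uint32_t cp;          /* code point accumulated so far */
    int need;             /* continuation bytes still expected */
    unsigned char lo, hi; /* allowed range for the next continuation byte */
};

enum { UTF8_REJECT = -1, UTF8_ACCEPT = 0, UTF8_MORE = 1 };

/* Feeds one byte. Returns UTF8_ACCEPT when s->cp holds a complete code
 * point, UTF8_MORE when more bytes are needed, UTF8_REJECT on invalid
 * input (the state is then reset to expect a new lead byte). */
int utf8_feed(struct utf8_state *s, unsigned char byte)
{
    if (s->need == 0) {                 /* expecting a lead byte */
        s->lo = 0x80;
        s->hi = 0xBF;
        if (byte < 0x80) {
            s->cp = byte;
            return UTF8_ACCEPT;
        } else if (byte >= 0xC2 && byte <= 0xDF) {
            s->need = 1;
            s->cp = byte & 0x1Fu;
        } else if (byte >= 0xE0 && byte <= 0xEF) {
            s->need = 2;
            s->cp = byte & 0x0Fu;
            if (byte == 0xE0) s->lo = 0xA0;   /* no overlong 3-byte forms */
            if (byte == 0xED) s->hi = 0x9F;   /* no surrogates */
        } else if (byte >= 0xF0 && byte <= 0xF4) {
            s->need = 3;
            s->cp = byte & 0x07u;
            if (byte == 0xF0) s->lo = 0x90;   /* no overlong 4-byte forms */
            if (byte == 0xF4) s->hi = 0x8F;   /* nothing above 0x10FFFF */
        } else {
            return UTF8_REJECT;         /* 0x80-0xC1 and 0xF5-0xFF */
        }
        return UTF8_MORE;
    }

    if (byte < s->lo || byte > s->hi) { /* bad continuation byte */
        s->need = 0;
        return UTF8_REJECT;
    }
    s->lo = 0x80;                       /* later bytes use the full range */
    s->hi = 0xBF;
    s->cp = (s->cp << 6) | (byte & 0x3Fu);
    s->need -= 1;
    return s->need == 0 ? UTF8_ACCEPT : UTF8_MORE;
}
```

A caller would feed bytes in a loop and act on each ACCEPT. On REJECT, a real decoder would also have to decide whether to re-examine the offending byte as a possible lead byte; this sketch leaves that policy to the caller.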

Even if you do end up going with a library, I think you should either try writing it yourself first or at least seriously study the UTF-8 specification before going further. A lot of bad design can come from treating UTF-8 as a black box when the whole point is that it's not a black box but was created to have very powerful properties, and too many programmers new to UTF-8 fail to see this until they've worked with it a lot themselves.


I figure iconv could be used.

#include <iconv.h>

iconv_t cd;
char out[7];
wchar_t in = CODE_POINT_VALUE;
/* iconv() advances these pointers, so it needs real pointer variables */
char *inptr = (char *)&in, *outptr = out;
size_t inlen = sizeof(in), outlen = sizeof(out);

cd = iconv_open("utf-8", "wchar_t");
iconv(cd, &inptr, &inlen, &outptr, &outlen);
iconv_close(cd);

But I fear that wchar_t might hold arbitrary implementation-defined values rather than Unicode code points. EDIT: I guess you can do it by simply using a Unicode source encoding (note that UCS-2 only covers code points up to 0xFFFF):

uint16_t in = UNICODE_POINT_VALUE;
cd = iconv_open("utf-8", "ucs-2");
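One way to sidestep both the wchar_t question and the UCS-2 range limit is to hand iconv explicit UTF-32 bytes. The sketch below assumes the local iconv implementation accepts the encoding names "UTF-8" and "UTF-32LE" (supported names vary between implementations); the function name iconv_encode_utf8 is made up for illustration.

```c
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts one code point to UTF-8 via iconv.
 * out must have room for 4 bytes. Returns the number of bytes written,
 * or 0 if the converter is unavailable or the conversion fails. */
size_t iconv_encode_utf8(uint32_t cp, char *out)
{
    /* Serialize the code point as little-endian bytes so the result
     * does not depend on host byte order or on the width of wchar_t. */
    unsigned char in[4] = {
        (unsigned char)(cp & 0xFF),
        (unsigned char)((cp >> 8) & 0xFF),
        (unsigned char)((cp >> 16) & 0xFF),
        (unsigned char)((cp >> 24) & 0xFF)
    };
    char *inptr = (char *)in;
    char *outptr = out;
    size_t inlen = sizeof in;
    size_t outlen = 4;

    iconv_t cd = iconv_open("UTF-8", "UTF-32LE");   /* (to, from) */
    if (cd == (iconv_t)-1)
        return 0;

    size_t rc = iconv(cd, &inptr, &inlen, &outptr, &outlen);
    iconv_close(cd);

    return rc == (size_t)-1 ? 0 : (size_t)(outptr - out);
}
```

Opening and closing a conversion descriptor for every character is wasteful; for real text it would make more sense to open it once and convert a whole buffer per call.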


A good part of the genius of UTF-8 is that converting from a Unicode scalar value to a UTF-8-encoded sequence can be done almost entirely with bitwise operations rather than integer arithmetic.

The accepted answer is very terse, but not particularly efficient or comprehensible as written. I replaced magic numbers with named constants, divisions with bit shifts, modulo with bit masking, and additions with bit-ors. I also wrote a doc comment pointing out that the caller is responsible for ensuring that the buffer is large enough.

#define SURROGATE_LOW_BITS 0x7FF
#define MAX_SURROGATE     0xDFFF
#define MAX_FOUR_BYTE   0x10FFFF
#define ONE_BYTE_BITS          7
#define TWO_BYTE_BITS         11
#define TWO_BYTE_PREFIX     0xC0
#define THREE_BYTE_BITS       16
#define THREE_BYTE_PREFIX   0xE0
#define FOUR_BYTE_PREFIX    0xF0
#define CONTINUATION_BYTE   0x80
#define CONTINUATION_MASK   0x3F

/**
 * Ensure that buffer has space for AT LEAST 4 bytes before calling this function,
 *   or a buffer overrun will occur.
 * Returns the number of bytes written to buffer (0-4).
 * If scalar is a surrogate value, or is out of range for a Unicode scalar,
 *   writes nothing and returns 0.
 * Surrogate values are integers from 0xD800 to 0xDFFF, inclusive.
 * Valid Unicode scalar values are non-surrogate integers between
 *   0 and 1_114_111 decimal (0x10_FFFF hex), inclusive.
 */
int encode_utf_8(unsigned long scalar, char* buffer) {
  if ((scalar | SURROGATE_LOW_BITS) == MAX_SURROGATE || scalar > MAX_FOUR_BYTE) {
    return 0;
  }

  int bytes_written = 0;

  if ((scalar >> ONE_BYTE_BITS) == 0) {
    *buffer++ = scalar;
    bytes_written = 1;
  }
  else if ((scalar >> TWO_BYTE_BITS) == 0) {
    *buffer++ = TWO_BYTE_PREFIX | (scalar >> 6);
    bytes_written = 2;
  }
  else if ((scalar >> THREE_BYTE_BITS) == 0) {
    *buffer++ = THREE_BYTE_PREFIX | (scalar >> 12);
    bytes_written = 3;
  }
  else {
    *buffer++ = FOUR_BYTE_PREFIX | (scalar >> 18);
    bytes_written = 4;
  }
  // Intentionally falling through each case
  switch (bytes_written) {
    case 4: *buffer++ = CONTINUATION_BYTE | ((scalar >> 12) & CONTINUATION_MASK);
    case 3: *buffer++ = CONTINUATION_BYTE | ((scalar >>  6) & CONTINUATION_MASK);
    case 2: *buffer++ = CONTINUATION_BYTE |  (scalar        & CONTINUATION_MASK);
    default: return bytes_written;
  }
}
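Since the caller is responsible for the buffer size, a small length helper can be useful for sizing an output buffer before encoding a whole run of scalars. The following is a sketch only; utf8_encoded_length and utf8_total_length are hypothetical names and not part of the function above, though they apply the same validity rules.

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes the UTF-8 encoding of scalar
 * needs (1-4), or 0 for a surrogate or out-of-range value. */
size_t utf8_encoded_length(unsigned long scalar)
{
    if ((scalar >= 0xD800 && scalar <= 0xDFFF) || scalar > 0x10FFFF)
        return 0;
    if (scalar < 0x80)    return 1;   /* fits in 7 bits  */
    if (scalar < 0x800)   return 2;   /* fits in 11 bits */
    if (scalar < 0x10000) return 3;   /* fits in 16 bits */
    return 4;
}

/* Total bytes needed to encode n scalars; invalid values count as 0,
 * matching an encoder that writes nothing for them. */
size_t utf8_total_length(const unsigned long *scalars, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += utf8_encoded_length(scalars[i]);
    return total;
}
```

Allocating utf8_total_length(text, n) bytes up front (plus one if a terminating NUL is wanted) would let a loop over the encoder run without any per-character bounds checks.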


libiconv.


Which platform? On Windows, you can use WideCharToMultiByte(CP_UTF8,...)

Arguably, the source codepoint must be encoded in UTF-16, which means you must be able to do such encoding. In some cases (surrogate pairs), it's not trivial.
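For reference, the surrogate-pair step looks roughly like this. It is a sketch with made-up names (utf16_encode, utf16_combine), assuming the input is a plain code point value; code points above 0xFFFF are split into a high and a low surrogate carrying 10 bits each.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the UTF-16 code units for cp into out.
 * Returns 1 or 2, or 0 if cp is a surrogate or above 0x10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* BMP: one code unit */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate  */
    return 2;
}

/* Inverse step: combines a high/low surrogate pair into a code point.
 * The caller is assumed to have checked that both are in range. */
uint32_t utf16_combine(uint16_t high, uint16_t low)
{
    return 0x10000 + (((uint32_t)(high - 0xD800) << 10)
                      | (uint32_t)(low - 0xDC00));
}
```

For example, U+1F600 should come out as the pair D83D DE00, which is the form WideCharToMultiByte expects as input.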

My understanding is that you have some text in a given codepage and you want to convert it to Unicode (UTF-16). Right? A MultiByteToWideChar(codePage, sourceText,...) / WideCharToMultiByte(CP_UTF8, utf16Text,...) roundtrip will do the trick.


I agree with Clement that the accepted answer does not explain things very well. The following document explains the encoding in a very simple way:

Yergeau, F. 2003. UTF-8, a transformation format of ISO 10646. RFC 3629, section 3, pp. 3-4.

The following book provides a good general explanation of UTF-8 on page 298:

Korpela, Jukka K. 2006. Unicode Explained. Sebastopol, etc.: O'Reilly Media, Inc.
