ANSI C UTF-8 problem

First, I am developing a platform-independent library in ANSI C (not C++, and no non-standard libraries such as the MS CRT or glibc, ...).

After some searching, I found that one of the best ways to do internationalization in ANSI C is to use the UTF-8 encoding.

In UTF-8:

  • strlen(s): always counts the number of bytes.
  • mbstowcs(NULL, s, 0): counts the number of characters (code points), under a locale that uses UTF-8.
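For example, the character count can also be computed in plain ANSI C by skipping continuation bytes (a minimal sketch; it assumes the input is valid UTF-8):

#include <stddef.h>

/* Counts code points by skipping UTF-8 continuation bytes
   (10xxxxxx). A sketch only: assumes valid UTF-8 input. */
size_t utf8_strlen(const char *s)
{
  size_t count = 0;
  for (; *s; s++)
    if (((unsigned char)*s & 0xC0) != 0x80)
      count++;
  return count;
}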

But I have a problem when I want random access to the elements (characters) of a UTF-8 string.

In ASCII encoding:

char get_char(char* ascii_str, int n)
{
  // It is very FAST.
  return ascii_str[n];
}

In UTF-16/32 encoding:

wchar_t get_char(wchar_t* wstr, int n)
{
  // It is very FAST.
  return wstr[n];
}

And here is my problem with the UTF-8 encoding:

// What should the return type be?
// A single UTF-8 character can occupy 8, 16, 24, or 32 bits.
/*?*/ get_char(char* utf8str, int n)
{
  // I can find the Nth character of the string with a loop,
  // but that is too slow.
  // What is the best way?
}

Thanks.


Perhaps you're thinking about this a bit wrongly. UTF-8 is an encoding which is useful for serializing data, e.g. writing it to a file or the network. It is a very non-trivial encoding, though, and a raw string of Unicode codepoints can end up in any number of encoded bytes.

What you should probably do, if you want to handle text (given your description), is to store raw, fixed-width strings internally. If you're going for Unicode (which you should), then you need 21 bits per codepoint, so the nearest integral type is uint32_t. In short, store all your strings internally as arrays of integers. Then you can random-access each codepoint.

Only encode to UTF-8 when you are writing to a file or console, and decode from UTF-8 when reading.
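A minimal decoding sketch along those lines (a hypothetical helper, assuming valid UTF-8 input and a sufficiently large output buffer):

#include <stddef.h>
#include <stdint.h>

/* Decodes a UTF-8 string into an array of code points.
   A sketch only: no validation, `out` must be large enough. */
size_t utf8_decode(const char *s, uint32_t *out)
{
  const unsigned char *p = (const unsigned char *)s;
  size_t n = 0;

  while (*p) {
    uint32_t cp;
    int extra;

    if (*p < 0x80)      { cp = *p;        extra = 0; }  /* 1-byte sequence */
    else if (*p < 0xE0) { cp = *p & 0x1F; extra = 1; }  /* 2-byte sequence */
    else if (*p < 0xF0) { cp = *p & 0x0F; extra = 2; }  /* 3-byte sequence */
    else                { cp = *p & 0x07; extra = 3; }  /* 4-byte sequence */
    p++;

    while (extra-- > 0)
      cp = (cp << 6) | (*p++ & 0x3F);

    out[n++] = cp;  /* out[i] is now random-accessible */
  }
  return n;
}

After decoding, get_char is just out[n] again.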

By the way, a Unicode codepoint is still a long way from a character. The concept of a character is just far too high-level to have a simple, general mechanism. (E.g. "a" + "accent grave": two codepoints, how many characters?)


You simply can't. If you need a lot of such queries, you can build an index for the UTF-8 string, or convert it to UTF-32 up front. UTF-32 is a better in-memory representation, while UTF-8 is good on disk.
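A sketch of such an index (hypothetical helpers; the STEP granularity and the caller-provided index array are assumptions, and the input is assumed to be valid UTF-8):

#include <stddef.h>

#define STEP 32  /* one index entry per 32 code points */

/* Records the byte offset of every STEP-th code point.
   Returns the number of index entries written. */
size_t build_index(const char *s, size_t *index)
{
  size_t cp = 0, entries = 0;
  const char *p;

  for (p = s; *p; p++) {
    if (((unsigned char)*p & 0xC0) != 0x80) {  /* lead byte */
      if (cp % STEP == 0)
        index[entries++] = (size_t)(p - s);
      cp++;
    }
  }
  return entries;
}

/* Random access in at most STEP steps: jump close, then scan. */
const char *utf8_at(const char *s, const size_t *index, size_t n)
{
  const char *p = s + index[n / STEP];
  size_t skip = n % STEP;

  while (skip > 0) {
    p++;
    if (((unsigned char)*p & 0xC0) != 0x80)
      skip--;
  }
  return p;  /* lead byte of code point n */
}

This trades memory for lookup speed; converting to UTF-32 up front is the extreme case of the same trade-off.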

By the way, the code you listed for UTF-16 is not correct either: you need to take care of surrogate pairs.
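For illustration, a surrogate-aware decoder looks like this (a sketch assuming well-formed UTF-16 stored in 16-bit units; note that the code-unit index is still not a character index):

#include <stddef.h>
#include <stdint.h>

/* Decodes the code point starting at code-unit index i,
   combining surrogate pairs. Assumes well-formed UTF-16. */
uint32_t utf16_decode_at(const uint16_t *s, size_t i)
{
  uint16_t u = s[i];

  if (u >= 0xD800 && u <= 0xDBFF)  /* high surrogate */
    return 0x10000 + (((uint32_t)(u - 0xD800) << 10)
                      | (uint32_t)(s[i + 1] - 0xDC00));
  return u;
}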


What do you want to count? As Kerrek SB has noted, you can have decomposed glyphs, i.e. "é" can be represented as a single character (LATIN SMALL LETTER E WITH ACUTE, U+00E9), or as two characters (LATIN SMALL LETTER E, U+0065, followed by COMBINING ACUTE ACCENT, U+0301). Unicode has composed and decomposed normalization forms.

What you are probably interested in counting is not characters, but grapheme clusters. You need some higher-level library to deal with this, and to deal with normalization forms, proper (locale-dependent) collation, proper line breaking, proper case folding (e.g. German ß -> SS), proper bidi support, etc. Real I18N is complex.
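For example, with ICU (one such higher-level library; a minimal sketch without full error handling, and the fixed 256-unit buffer is an assumption for the demo):

#include <unicode/ubrk.h>
#include <unicode/ustring.h>

/* Counts grapheme clusters ("user-perceived characters") in a
   UTF-8 string using ICU's character break iterator. */
long count_graphemes(const char *utf8)
{
  UChar buf[256];  /* assumed large enough for this demo */
  int32_t len;
  UErrorCode status = U_ZERO_ERROR;
  UBreakIterator *bi;
  long count = 0;

  u_strFromUTF8(buf, 256, &len, utf8, -1, &status);
  bi = ubrk_open(UBRK_CHARACTER, "", buf, len, &status);
  if (U_FAILURE(status))
    return -1;

  while (ubrk_next(bi) != UBRK_DONE)
    count++;

  ubrk_close(bi);
  return count;
}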


Contrary to what others have said, I don't really see a benefit in using UTF-32 instead of UTF-8: when processing text, grapheme clusters (or 'user-perceived characters') are far more useful than Unicode characters (i.e. raw codepoints), so even UTF-32 has to be treated as a variable-length encoding.

If you do not want to use a dedicated library, I suggest using UTF-8 as the on-disk, endian-agnostic representation and modified UTF-8 (which differs from UTF-8 by encoding the zero character as a two-byte sequence) as an in-memory representation compatible with ASCIIZ strings.
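A sketch of the encoder with that one exception (a hypothetical helper; `out` must have room for up to 4 bytes):

#include <stdint.h>

/* Encodes one code point as modified UTF-8 and returns the
   number of bytes written. The only difference from standard
   UTF-8 is that U+0000 becomes the overlong pair 0xC0 0x80,
   so no byte of the encoded string is ever 0x00. */
int mutf8_encode(uint32_t cp, unsigned char *out)
{
  if (cp == 0) {  /* the modified-UTF-8 special case */
    out[0] = 0xC0; out[1] = 0x80;
    return 2;
  }
  if (cp < 0x80) {
    out[0] = (unsigned char)cp;
    return 1;
  }
  if (cp < 0x800) {
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp < 0x10000) {
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  out[0] = (unsigned char)(0xF0 | (cp >> 18));
  out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
  out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
  out[3] = (unsigned char)(0x80 | (cp & 0x3F));
  return 4;
}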

The necessary information for splitting strings into grapheme clusters can be found in Unicode Standard Annex #29 and the Unicode Character Database.
