Questions while updating some scanner code to use ICU

2023-03-09 04:59

I am working on a rudimentary hand-coded lexical scanner and wish to support UTF-8 input (it's not 1970 anymore!). Input characters are read from stdin or a file one at a time and pushed into a buffer until whitespace is seen, etc. I thought about writing my own wrapper for fgetc() that would instead return a char[] of the bytes that make up the UTF-8 character and work with the result as a string... it'd be easy enough, but would become a slippery slope. I'd rather not waste time re-inventing the wheel and instead use an existing, tested library like ICU. And so now I have code without UTF-8 support that works with fgetc(), isspace(), strcmp(), etc., which I am trying to update to use ICU. This is my first foray with ICU, and I have been reading through the documentation and trying to find usage examples with Google code search, but there are still some points of confusion I'm hoping someone will be able to clarify.

The u_fgetc() function returns UChar, and u_fgetcx() returns UChar32... the documentation recommends using u_fgetcx() to read codepoints, so that's what I'm starting with. I'm keeping the same approach as above, but I'm pushing UChar32s into a buffer instead of chars.

What is the proper way to compare a character against a known value? Originally I was able to do if (c == '+') to check if the plus-sign was fetched from the input. GCC doesn't complain when c is a UChar32 (which is then a comparison between UChar32 and char) but is this really proper?
I was able to use strcmp() to compare the buffered characters with a known value, for example if (strcmp(buf, "else") == 0). There is u_strcmp() provided by ICU and I think I may need to use the U_STRING_DECL and U_STRING_INIT macros to specify the known literal, but I am not certain. The documentation shows they result in UChar[], though I assume I need UChar32[]... and I'm uncertain how to use them correctly anyway. Any guidance here would be welcomed.
After reading in a series of numeric characters I have been converting them with strtol() so I can work with them. Is there a similar function made available by ICU since I am converting UChar32[] now?

UChar is for holding a Code Unit, while UChar32 is for holding a Code Point. If your input stays on the Basic Multilingual Plane (BMP), UChar is sufficient, and indeed most ICU functions operate on UChar[].
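To make the code unit / code point distinction concrete, here is a minimal sketch in plain C (it does not use ICU) of how one code point maps to one or two UTF-16 code units. The helper name `utf16_encode` is hypothetical, and `uint16_t` / `uint32_t` are only stand-ins for UChar / UChar32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of ICU: encode one code point as UTF-16.
 * uint16_t stands in for UChar (code unit), uint32_t for UChar32 (code point).
 * Returns the number of code units written (1 or 2), or 0 if cp is not a
 * valid Unicode scalar value. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                       /* out of range or a lone surrogate */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* BMP: one code unit is enough */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high (lead) surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low (trail) surrogate */
    return 2;
}
```

With this sketch, U+002B (PLUS SIGN) fits in a single code unit, while U+1F600, which lies outside the BMP, needs the pair 0xD83D 0xDE00.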

Strongly recommended reading is the ICU User Guide, which explains most of the internals and best practices.

What is the proper way to compare a Unicode character variable against a known value? A character (or UChar or UChar32) is just another integer type with a certain width and signedness, and can be compared to other integer types with the usual caveats and restrictions. As for defining a character value, C99 (section 6.4.3) provides universal character names: \u followed by four hex digits, or \U followed by eight hex digits, specifying the ISO/IEC 10646 "short identifier". Values below 0x00a0, with the exceptions of 0x0024 ('$'), 0x0040 ('@'), and 0x0060 (backtick), cannot be written this way (but can be represented by casting a simple character constant to UChar). Also excluded is the range from 0xd800 through 0xdfff (used by UTF-16 for surrogates).
How to define Unicode string literals? U_STRING_DECL and U_STRING_INIT are indeed what you're looking for. (As written above, ICU mainly operates on UChar[].) If you were using C++ instead of C, UNICODE_STRING_SIMPLE (optionally followed by getTerminatedBuffer() to yield UChar[] again) provides a much more comfortable way of defining Unicode string literals.
How to convert a Unicode string representing a number into that number's value? unum_parse() and its brethren in unum.h will help you there.
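If the scanner keeps its buffer as code points (UChar32[]) instead of converting to UChar[], u_strcmp() does not apply to that buffer directly. One option, sketched here in plain C without ICU, is a small comparison against an ASCII keyword. `cp_equals_ascii` is a hypothetical helper, `int32_t` stands in for UChar32, and the sketch assumes a 0-terminated buffer and an ASCII-based execution character set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not an ICU function. Returns 1 if the 0-terminated
 * code point buffer `buf` spells exactly the ASCII keyword `kw`, else 0.
 * Assumes the execution character set is ASCII-based, so that each char in
 * `kw` has the same value as its Unicode code point. */
int cp_equals_ascii(const int32_t *buf, const char *kw)
{
    assert(buf != NULL && kw != NULL);
    size_t i = 0;
    while (kw[i] != '\0') {
        /* Also stops at buf's terminator, because 0 != kw[i] here. */
        if (buf[i] != (int32_t)(unsigned char)kw[i])
            return 0;
        i++;
    }
    return buf[i] == 0;   /* both must end at the same position */
}
```

In the scanner this would replace the old strcmp() test, for example `if (cp_equals_ascii(buf, "else")) { ... }`. Keywords containing non-ASCII characters would need a different approach, such as the ICU macros mentioned above.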

The Unicode value for PLUS SIGN is U+002B, and the normal (Latin-1) value for '+' is also 0x2B (053, 43). What you wrote is safe enough where the code set is based on ASCII or ISO-8859-x. The C99 standard provides for Unicode (Universal character names) of the forms \u0123 and \U00102345 (with 4 and 8 hexadecimal digits), but stipulates that you cannot specify values less than \u00A0, such as \u002B. So, I think what you wrote is correct.

However, you could save yourself future angst by using an enum such as
```
 enum { PLUS_SIGN = '+' };
```
defined in an appropriate header and used wherever you need a literal plus sign. That way, if your assumption (and my assumption) is wrong, you have one place to edit - the header.
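The enum can also be paired with a compile-time check of the assumption it rests on. The following is a sketch only, assuming a C11 compiler (for static_assert); `MINUS_SIGN`, the token kinds, and `classify_operator` are hypothetical names, and `int32_t` stands in for UChar32.

```c
#include <assert.h>
#include <stdint.h>

/* Character constants collected in one place, as suggested above. */
enum { PLUS_SIGN = '+', MINUS_SIGN = '-' };

/* C11: fail the build if '+' is not U+002B in this execution character set. */
static_assert(PLUS_SIGN == 0x2B, "execution character set is not ASCII-based");

/* Hypothetical token kinds, for illustration only. */
enum token_kind { TOK_PLUS, TOK_MINUS, TOK_OTHER };

/* Classify one code point; int32_t stands in for ICU's UChar32. */
enum token_kind classify_operator(int32_t c)
{
    switch (c) {
    case PLUS_SIGN:  return TOK_PLUS;
    case MINUS_SIGN: return TOK_MINUS;
    default:         return TOK_OTHER;
    }
}
```

If the build ever targets a platform where the assumption fails, the static_assert points straight at the header that needs editing.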

I note that the page on Strings with ICU suggests that using UTF-32 in an application is unusual.
In pure C, you'd probably use wcscmp(buf, L"else"), assuming that the wchar_t on your system is equivalent to uint32_t and/or UChar32. There seem to be ways to use UnicodeString and UNICODE_STRING("...") followed by toUTF32() to create a UTF-32 string. There may also be neater ways.
There are 'Formatting' classes which handle both formatting and parsing. You would probably use classes derived from the NumberFormat class.
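ICU's number parsing is aimed at locale-aware formats. If the scanner only ever accepts plain ASCII digits, a hand-rolled conversion over the code point buffer may be enough. This is a plain C sketch, not ICU's API: `cp_parse_long` is a hypothetical helper and `int32_t` stands in for UChar32.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not an ICU function. Converts `len` code points, each
 * of which must be an ASCII digit U+0030..U+0039, to a non-negative long.
 * Returns 1 on success and stores the value in *out; returns 0 on empty
 * input, a non-digit, or overflow. int32_t stands in for ICU's UChar32. */
int cp_parse_long(const int32_t *buf, size_t len, long *out)
{
    assert(buf != NULL && out != NULL);
    if (len == 0)
        return 0;
    long value = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] < 0x30 || buf[i] > 0x39)
            return 0;                        /* not an ASCII digit */
        int digit = (int)(buf[i] - 0x30);
        if (value > (LONG_MAX - digit) / 10)
            return 0;                        /* next step would overflow */
        value = value * 10 + digit;
    }
    *out = value;
    return 1;
}
```

Digits from other scripts (fullwidth or Arabic-Indic, for instance) are rejected by this sketch on purpose. Accepting them is where ICU earns its keep, for example through u_charDigitValue() or the unum.h functions.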
