How to convert multi-byte UTF-8 character representation to one byte while retaining (non)alphanumeric property?
I have a UTF-8 string as a char*. In order to get the one byte per character property (and thus have random access into the string by character index), I currently just remove all UTF-8 continuation bytes from it (I would like to avoid a "proper" conversion to a fixed byte width representation).
Instead of removing all continuation bytes, I would like to be able to check whether a given multi-byte UTF-8 character is alphanumeric (or not) and then replace it with a corresponding ASCII character (let's say 'a' for alphanumerics and '.' otherwise). How do I do this?
For each byte in the string:
- If it is an ASCII byte, just copy it.
- If it is a UTF-8 head byte, decode starting from that byte to wchar_t using mbrtowc, determine an ASCII character whose classification matches by comparing the results of the isw*() functions, and copy that ASCII character to the output.
- If it is anything else, skip it.
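The steps above could be sketched roughly as follows. This is only an illustration under some assumptions: the function names squash_utf8 and select_utf8_locale are made up for this example, only iswalnum() is consulted (mapping to 'a' or '.' as in the question) rather than every isw*() function, and mbrtowc only decodes UTF-8 if a UTF-8 locale has been selected first. The locale names tried here are common ones but are not guaranteed to exist on every system.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: try a few common UTF-8 locale names.
 * Returns 1 if one could be selected, 0 otherwise. */
int select_utf8_locale(void)
{
    return setlocale(LC_CTYPE, "C.UTF-8") != NULL
        || setlocale(LC_CTYPE, "en_US.UTF-8") != NULL;
}

/* Collapse each UTF-8 character of `in` to one byte in `out`.
 * ASCII bytes are copied; a multi-byte character becomes 'a' if
 * iswalnum() accepts it and '.' otherwise; stray or undecodable
 * bytes are skipped. `out` must hold at least strlen(in) + 1 bytes.
 * Returns the length of the output string. */
size_t squash_utf8(const char *in, char *out)
{
    assert(in != NULL && out != NULL);

    size_t len = strlen(in);
    size_t i = 0, n = 0;

    while (i < len) {
        unsigned char b = (unsigned char)in[i];

        if (b < 0x80) {                  /* ASCII: copy unchanged */
            out[n++] = in[i++];
        } else if (b >= 0xC0) {          /* head byte of a multi-byte sequence */
            wchar_t wc;
            mbstate_t st;
            memset(&st, 0, sizeof st);   /* start from the initial state */
            size_t r = mbrtowc(&wc, in + i, len - i, &st);
            if (r == (size_t)-1 || r == (size_t)-2 || r == 0) {
                i++;                     /* invalid or truncated: skip the byte */
                continue;
            }
            out[n++] = iswalnum((wint_t)wc) ? 'a' : '.';
            i += r;                      /* consume the whole sequence */
        } else {                         /* stray continuation byte: skip */
            i++;
        }
    }
    out[n] = '\0';
    return n;
}
```

A caller would run select_utf8_locale() once at startup and then call squash_utf8(input, buffer). Classification results depend on the locale data of the platform, and on systems where wchar_t is 16 bits wide, characters outside the Basic Multilingual Plane may not decode as expected.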
There's no way to do this in general, as letters outside the ASCII range (such as α) may be accented as well (ἄ). But you can apply the NFD Unicode normalization to decompose accented codepoints into their constituents, then check whether the components lie within the ASCII range. ICU has normalization support.
Unicode's highest code point is 1114111 (0x10FFFF), which means there are over a million possible code points. A single byte can represent only 256 values.
So the simple answer is that you can't do it that way.
As far as I understand from the question, you want this for random access to characters in the string. For that, use 32-bit characters. (Correct me if I am wrong.)
Rather than handling it by writing your own code, use ICU and convert the string to UTF-32 (4 bytes per character) with a converter. ucnv_convertEx is the function to use for this.
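To show what the fixed-width representation buys, here is a small sketch that does not use ICU at all: it is a hand-written decoder, swapped in only so the example is self-contained, and the name utf8_to_utf32 is invented for it. It does not reject overlong encodings or surrogate code points, which a real converter such as ICU's would handle, so treat it as an illustration of the idea and not as a replacement for the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a NUL-terminated UTF-8 string into UTF-32 code points.
 * Writes at most `cap` code points to `out` and returns the number
 * written. Invalid bytes and truncated sequences become U+FFFD.
 * Simplified sketch: overlong forms and surrogates are not rejected. */
size_t utf8_to_utf32(const char *in, uint32_t *out, size_t cap)
{
    assert(in != NULL && out != NULL);

    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;

    while (*p && n < cap) {
        uint32_t cp;
        int extra;                       /* continuation bytes expected */

        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else                          { cp = 0xFFFD;    extra = 0; }
        p++;

        while (extra > 0) {
            if ((*p & 0xC0) != 0x80) {   /* missing continuation byte */
                cp = 0xFFFD;
                break;
            }
            cp = (cp << 6) | (uint32_t)(*p & 0x3F);
            p++;
            extra--;
        }
        out[n++] = cp;
    }
    return n;
}
```

After the conversion, the character at index i is simply out[i], which is the random access the question is after, at the cost of four bytes per character.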