Checking Unicode string for whitespace - byte for byte!
Quick & dirty Q: Can I safely assume that a byte of a UTF-8, UTF-16 or UTF-32 codepoint (character) will not be an ASCII whitespace character (unless the codepoint is representing one)?
I'll explain:
Say that I have a UTF-8 encoded string. This string contains some characters that take more than one byte to store. I need to find out if any of the characters in this string are ASCII whitespace characters (space, horizontal tab, vertical tab, carriage return, linefeed etc - Unicode defines some more whitespace characters, but forget about them).
So what I do is that I loop through the string and check if any of the bytes match the bytes that define whitespace characters. Take e.g. 0D (hex) for carriage return. Note that we are talking bytes here, not characters.
Will this work? Will there be UTF-8 codepoints where the first byte will be 0D and the second byte something else - and this codepoint does not represent a carriage return? Maybe the other way around? Will there be codepoints where the first byte is something weird, and the second (or third, or fourth) byte is 0D - and this codepoint does not represent a carriage return?
UTF-8 is backwards compatible with ASCII, so I really hope that it will work for UTF-8. From what I know of it, it might, but I don't know the details well enough to say for sure.
As for UTF-16 and UTF-32 I doubt it'll work 开发者_运维百科at all, but I barely know anything about the details of these, so feel free to surprise me there...
The reason for this whacky question is that I have code checking for whitespace that works for ASCII, and I need to know if it may break on Unicode. I have no choice but to check byte-for-byte, for a bunch of reasons. I'm hoping that the backwards compatibility with ASCII might give me at least UTF-8 support for free.
For UTF-8, yes, you can. All non-ASCII characters are represented by bytes with the high-bit set and all ASCII characters have the high bit unset.
Just to be clear, every byte in the encoding of a non-ASCII character has the high bit set; this is by design.
You should never operate on UTF-16 or UTF-32 at the byte level. This almost certainly won't work. In fact lots of things will break, since every second byte is likely to be '\0'
(unless you typically work in another language).
In correctly encoded UTF-8, all ASCII characters will be encoded as one byte each, and the numeric value of each byte will be equal to the Unicode and ASCII code points. Furthermore, any non-ASCII character will be encoded using only bytes that have the eighth bit set. Therefore, a byte value of 0D will always represent a carriage return, never the second or third byte of a multibyte UTF-8 sequence.
However, sometimes the UTF-8 decoding rules are abused to store ASCII characters in other ways. For example, if you take the two-byte sequence C0 A0 and UTF-8-decode it, you get the one-byte value 20, which is a space. (Any time you find the byte C0 or C8, it's the first byte of a two-byte encoding of an ASCII character.) I've seen this done to encode strings that were originally assumed to be single words, but later requirements grew to allow the value to have spaces. In order to not break existing code (which used stuff like strtok
and sscanf
to recognize space-delimited fields), the value was encoded using this bastardized UTF-8 instead of real UTF-8.
You probably don't need to worry about that, though. If the input to your program uses that format, then your code probably isn't meant to detect the specially encoded whitespace at that point anyway, so it's safe for you to ignore it.
Yes, but see caveat below about the pitfalls of processing non-byte-oriented streams in this way.
For UTF-8, any continuation bytes always start with the bits 10
, making them greater than 0x7f
, no there's no chance they could be mistaken for a ASCII space.
You can see this in the following table:
Range Encoding Binary value
----------------- -------- --------------------------
U+000000-U+00007f 0xxxxxxx 0xxxxxxx
U+000080-U+0007ff 110yyyxx 00000yyy xxxxxxxx
10xxxxxx
U+000800-U+00ffff 1110yyyy yyyyyyyy xxxxxxxx
10yyyyxx
10xxxxxx
U+010000-U+10ffff 11110zzz 000zzzzz yyyyyyyy xxxxxxxx
10zzyyyy
10yyyyxx
10xxxxxx
You can also see that the non-continuation bytes for code points outside the ASCII range also have the high bit set, so they can never be mistaken for a space either.
See wikipedia UTF-8 for more detail.
UTF-16 and UTF-32 shouldn't be processed byte-by-byte in the first place. You should always process the unit itself, either a 16-bit or 32-bit value. If you do that, you're covered as well. If you process these byte-by-byte, there is a danger you'll find a 0x20
byte that is not a space (e.g., the second byte of a 16-bit UTF-16 value).
For UTF-16, since the extended characters in that encoding are formed from a surrogate pair whose individual values are in the range 0xd800
through 0xdfff
, there's no danger that these surrogate pair components could be mistaken for spaces either.
See wikipedia UTF-16 for more detail.
Finally, UTF-32 (wikipedia link here) is big enough to represent all of the Unicode code points so no special encoding is required.
It is strongly suggested not to work against bytes when dealing with Unicode. The two major platforms (Java and .Net) support unicode natively and also provide a mechanism for determining these kind of things. For e.g. In Java you can use Character class's isSpace()/isSpaceChar()/isWhitespace() methods for your use case.
精彩评论