Check for invalid UTF-8
I am converting UTF-8 encoded text to the actual code point values in hex. However, there are some invalid byte sequences that I need to catch. Is there a quick way to check in C++ whether a byte sequence is not valid UTF-8?
Follow the tables in the Unicode standard, chapter 3. (I used the Unicode 5.1.0 version of the chapter (p103); it was Table 3-7 on p94 of the Unicode 6.0.0 version, and was on p95 in the Unicode 6.3 version — and it is on p125 of the Unicode 8.0.0 version.)
Bytes 0xC0, 0xC1, and 0xF5..0xFF cannot appear in valid UTF-8. The valid sequences are documented; all others are invalid.
Table 3-7. Well-Formed UTF-8 Byte Sequences

Code Points           First Byte   Second Byte   Third Byte   Fourth Byte
U+0000..U+007F        00..7F
U+0080..U+07FF        C2..DF       80..BF
U+0800..U+0FFF        E0           A0..BF        80..BF
U+1000..U+CFFF        E1..EC       80..BF        80..BF
U+D000..U+D7FF        ED           80..9F        80..BF
U+E000..U+FFFF        EE..EF       80..BF        80..BF
U+10000..U+3FFFF      F0           90..BF        80..BF       80..BF
U+40000..U+FFFFF      F1..F3       80..BF        80..BF       80..BF
U+100000..U+10FFFF    F4           80..8F        80..BF       80..BF
Note that the irregularities are in the second byte for certain ranges of values of the first byte. The third and fourth bytes, when needed, are consistent. Note that not every code point within the ranges identified as valid has been allocated (and some are explicitly 'non-characters'), so there is more validation needed still.
The code points U+D800..U+DBFF are for UTF-16 high surrogates and U+DC00..U+DFFF are for UTF-16 low surrogates; those cannot appear in valid UTF-8 (you encode the values outside the BMP — Basic Multilingual Plane — directly in UTF-8), which is why that range is marked invalid.
Other excluded ranges (initial byte C0 or C1, or initial byte E0 followed by 80..9F, or initial byte F0 followed by 80..8F) are non-minimal encodings. For example, C0 80 would encode U+0000, but that's encoded by 00, and UTF-8 defines that the non-minimal encoding C0 80 is invalid. And the maximum Unicode code point is U+10FFFF; UTF-8 encodings starting from F4 90 upwards generate values that are out of range.
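Since the question asks for C++, here is a minimal sketch of a validator that walks the table row by row. The function name is_valid_utf8 and its signature are my own choices for illustration, not something from the standard or a library. It only checks well-formedness; it does not check whether a code point is allocated or is a non-character.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true if the `len` bytes at `data` are well-formed UTF-8
// according to Table 3-7 (hypothetical helper, written for this answer).
bool is_valid_utf8(const unsigned char* data, std::size_t len) {
    std::size_t i = 0;
    while (i < len) {
        const unsigned char b0 = data[i];
        if (b0 <= 0x7F) { ++i; continue; }      // single-byte (ASCII) row

        std::size_t trail = 0;                  // continuation bytes expected
        unsigned char lo = 0x80, hi = 0xBF;     // allowed range of second byte

        if      (b0 >= 0xC2 && b0 <= 0xDF) { trail = 1; }
        else if (b0 == 0xE0)               { trail = 2; lo = 0xA0; } // no overlongs
        else if (b0 >= 0xE1 && b0 <= 0xEC) { trail = 2; }
        else if (b0 == 0xED)               { trail = 2; hi = 0x9F; } // no surrogates
        else if (b0 == 0xEE || b0 == 0xEF) { trail = 2; }
        else if (b0 == 0xF0)               { trail = 3; lo = 0x90; } // no overlongs
        else if (b0 >= 0xF1 && b0 <= 0xF3) { trail = 3; }
        else if (b0 == 0xF4)               { trail = 3; hi = 0x8F; } // <= U+10FFFF
        else return false;                  // 80..BF, C0, C1, F5..FF

        if (len - i <= trail) return false;     // sequence is truncated

        // Only the second byte has an irregular range.
        if (data[i + 1] < lo || data[i + 1] > hi) return false;
        for (std::size_t k = 2; k <= trail; ++k) {
            if (data[i + k] < 0x80 || data[i + k] > 0xBF) return false;
        }
        i += trail + 1;
    }
    return true;
}
```

Call it with a pointer to the raw bytes and the byte count, for example is_valid_utf8(reinterpret_cast<const unsigned char*>(s.data()), s.size()) for a std::string s. The empty input is treated as valid.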
Good answer already, I'm just chipping in another take on this for fun.
UTF-8 uses a general scheme by Prosser and Thompson to encode large numbers as sequences of single bytes. This scheme can actually represent 2^36 values, but for Unicode we only need 2^21. Here's how it works. Let N be the number you want to encode (e.g. a Unicode code point):
- If N < 128, just one byte: 0nnnnnnn. The highest bit is zero.
- Otherwise, several bytes. The first byte starts with as many ones as there are bytes in the sequence, followed by a zero, and then the data bits; successive bytes start with 10 followed by six data bits. Examples:
  - 3 byte sequence: 1110xxxx 10xxxxxx 10xxxxxx
  - 5 byte sequence: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
  - 7 byte sequence: 11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
A k-byte sequence fits 5 k + 1 bits (when k > 1), so you can determine how many bytes you need given N. For decoding, read one byte; if its top bit is zero, store its value as is, otherwise use the first byte to figure out how many bytes are in the sequence and process all those.
For Unicode as of today we only need at most k = 4 bytes.
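To make the scheme concrete, here is a rough C++ sketch of the encoding side. The names sequence_length and encode are made up for this post. Note that it implements the general scheme (up to 7 bytes) and does not apply the extra UTF-8 rules, so it will happily encode surrogates or values above U+10FFFF.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Bytes needed for n under the general scheme: 1 if n < 128, otherwise
// the smallest k in 2..7 such that n fits in 5k + 1 bits.
// Returns 0 if n needs more than 36 bits (not representable).
int sequence_length(std::uint64_t n) {
    if (n < 0x80) return 1;
    for (int k = 2; k <= 7; ++k) {
        if ((n >> (5 * k + 1)) == 0) return k;
    }
    return 0;
}

// Encodes n as a byte sequence; returns an empty string if n is too large.
std::string encode(std::uint64_t n) {
    const int k = sequence_length(n);
    if (k == 0) return std::string();
    if (k == 1) return std::string(1, static_cast<char>(n));

    std::string out(static_cast<std::size_t>(k), '\0');
    // Continuation bytes, filled from the end: 10xxxxxx (six data bits each).
    for (int i = k - 1; i >= 1; --i) {
        out[static_cast<std::size_t>(i)] = static_cast<char>(0x80 | (n & 0x3F));
        n >>= 6;
    }
    // Lead byte: k one-bits, a zero, then whatever data bits are left.
    const unsigned lead_mark = (0xFFu << (8 - k)) & 0xFFu;
    out[0] = static_cast<char>(lead_mark | n);
    return out;
}
```

For example, U+20AC needs 14 bits, which is more than the 11 bits a 2-byte sequence holds but within the 16 bits of a 3-byte sequence, so encode(0x20AC) yields the three bytes E2 82 AC.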
public static bool IsValidUtf8(byte[] bytes, int length)
{
// https://en.wikipedia.org/wiki/UTF-8#Codepage_layout
// http://www.w3.org/International/questions/qa-forms-utf-8
// https://social.msdn.microsoft.com/Forums/vstudio/en-US/df18cca9-5e54-410e-a5c5-74efc7b52e29
// http://gallery.technet.microsoft.com/scriptcenter/ConvertTo-String-d79aed45
Encoding enc = Encoding.GetEncoding("iso-8859-1");
string binaryText = enc.GetString(bytes, 0, length);
Regex rx = new Regex(@"\A(
[\x09\x0A\x0D\x20-\x7E] # ASCII: tab, LF, CR, and 0x20-0x7E only
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*\z", RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);
return rx.IsMatch(binaryText);
}
Note:
To match a regular expression against binary data (bytes), the data is first converted to a string (all .NET strings are Unicode) using the "iso-8859-1" encoding. It is the only single-byte encoding that maps one-to-one onto the first 256 Unicode code points, so every byte value survives the conversion. Other encodings do not preserve all bytes when converting to text.
static void Main(string[] args)
{
string filename = "myfile.txt";
byte[] bytes = File.ReadAllBytes(filename);
if (IsValidUtf8(bytes, bytes.Length))
{
Console.WriteLine("encoding: utf-8");
}
else
{
Console.WriteLine("unknown encoding.");
}
}