How to determine if an uploaded file is in UTF-8 or UTF-16?
I have a website where a user can upload a txt file of data and the data will be imported into the db. However, some users are uploading the data in UTF-8, and others are uploading it in UTF-16.
// Allocate the buffer before reading; Read throws if the buffer is null.
byte[] fileData = new byte[length];
uploader.PostedFile.InputStream.Read(fileData, 0, length);
data = TLCommon.EncodeJsString(System.Text.Encoding.UTF8.GetString(fileData));
When the file is saved in UTF-16 and uploaded, the data is garbage. How can I handle this situation?
There are various heuristics you can employ, such as checking for a high percentage of 00 bytes in the stream. (These won't normally be present in UTF-8 text, but are common in UTF-16 text that contains ASCII characters.)
This, however, can't distinguish between UTF-8 and Windows-1252, which are incompatible 8-bit encodings that are both very common on U.S. English Windows systems. You can add more checks, such as looking for byte sequences that are invalid in one encoding but valid in another, but this gets very complex quickly and typically can't distinguish between different single-byte encodings.
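As a rough illustration of these two checks, here is a minimal sketch. The class and method names (EncodingGuess, GuessUtf8OrUtf16, IsValidUtf8) and the 30% threshold are assumptions made up for this example, not part of any library, and the zero-byte test only works for text that is mostly ASCII.

```csharp
using System;
using System.Text;

static class EncodingGuess
{
    // Hypothetical helper: guesses between UTF-8 and UTF-16 for mostly-ASCII text
    // by counting 0x00 bytes. The 30% threshold is an arbitrary assumption.
    public static Encoding GuessUtf8OrUtf16(byte[] data)
    {
        if (data.Length < 2) return Encoding.UTF8;

        int evenZeros = 0, oddZeros = 0;
        for (int i = 0; i < data.Length; i++)
        {
            if (data[i] != 0) continue;
            if (i % 2 == 0) evenZeros++; else oddZeros++;
        }

        if ((evenZeros + oddZeros) * 10 >= data.Length * 3)
        {
            // ASCII text in UTF-16LE has zeros at odd offsets; UTF-16BE at even offsets.
            return oddZeros >= evenZeros ? Encoding.Unicode : Encoding.BigEndianUnicode;
        }
        return Encoding.UTF8;
    }

    // Returns true when the bytes are well-formed UTF-8. A strict decoder throws
    // on invalid sequences instead of silently substituting U+FFFD.
    public static bool IsValidUtf8(byte[] data)
    {
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetCharCount(data);
            return true;
        }
        catch (ArgumentException) // DecoderFallbackException derives from ArgumentException
        {
            return false;
        }
    }
}
```

A failed IsValidUtf8 check suggests the data is some other 8-bit encoding, but a passing check proves little for pure-ASCII input, since ASCII bytes are valid in UTF-8 and Windows-1252 alike.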
Microsoft provides a library named MLang, which can automatically detect UTF-8, UTF-16, and many 8-bit codepages using statistical analysis of the bytes in the stream. Its accuracy is quite good if it has a large-enough sample of text to work with. I blogged about how to use this method, and posted the full source code on GitHub.
There are a few options you can use. Check the Content-Type header to see whether it includes a charset parameter indicating the encoding (e.g. Content-Type: text/plain; charset=utf-16). Check whether the uploaded data starts with a BOM, the first few bytes of the file, which encode the Unicode character U+FEFF (2 bytes for UTF-16: FF FE or FE FF; 3 bytes for UTF-8: EF BB BF). Or, if you know something about the file (for example, the first character is supposed to be ASCII, as in XML, which starts with a '<'), you can use that to work out the encoding. If you don't have any of those pieces of information, you'll have to guess using heuristics.
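The BOM check can be sketched as follows. BomSniffer and DetectBom are hypothetical names chosen for this example. The sketch only covers UTF-8 and UTF-16, so a UTF-32 little-endian file (whose BOM starts with the same FF FE bytes) would be misreported as UTF-16.

```csharp
using System.Text;

static class BomSniffer
{
    // Hypothetical helper: returns the encoding indicated by a leading BOM,
    // or null when the data does not start with a UTF-8 or UTF-16 BOM.
    public static Encoding DetectBom(byte[] data)
    {
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
            return Encoding.UTF8;             // EF BB BF
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
            return Encoding.Unicode;          // FF FE: UTF-16 little-endian
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
            return Encoding.BigEndianUnicode; // FE FF: UTF-16 big-endian
        return null;
    }
}
```

With the question's buffer, one way to use it would be `Encoding enc = BomSniffer.DetectBom(fileData) ?? Encoding.UTF8;` followed by `enc.GetString(fileData)`. Alternatively, a StreamReader constructed over the upload stream with detectEncodingFromByteOrderMarks set to true performs the same BOM check and also strips the BOM from the decoded text.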