
Unicode in CSV file?

I need to generate a CSV file. Maybe I am 'doing it wrong' because I am dumping the file with my own code instead of using a library, but anyway.

It looks like I have everything right: quotes, commas, and everything else seem to be escaped correctly. It was rather easy. The problem is that the Unicode strings I am testing with come out as ????. When I use MS Excel to save a file containing my test string as CSV and then reopen that file, I get the same problem (Unicode letters become ?????). Is Unicode not supported?

I just tried dumping the string to a file like this, instead of outputting it to a web page:

var f = new System.IO.StreamWriter(filename, false, System.Text.Encoding.Unicode);

and now I see the Unicode text, but everything is in one column. What's weird is that everything looks normal in my text editor of choice, and if I copy a few columns out, paste them into a new file, and save it as .csv, I see the columns fine, although that probably strips the Unicode out.

How do I save this properly?


System.Text.Encoding.Unicode uses UTF-16 (little-endian) encoding. Try telling your text editor to decode the file as UTF-16; I'm guessing the editor you are using to display the output file is defaulting to UTF-8 or ASCII. If so, an alternative might be to encode the output with System.Text.Encoding.UTF8 instead.
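
As a rough sketch of the UTF-8 alternative, the writer from the question could be created as below. The temp-file path and the sample rows are made up for illustration, and the code assumes C# 9 or later for top-level statements. It is only one way to do this, not a definitive fix.

```csharp
// C# 9+ top-level statements; on older compilers, wrap this in a class with Main.
using System;
using System.IO;
using System.Text;

// Hypothetical output path and sample rows, used only for illustration.
string path = Path.Combine(Path.GetTempPath(), "unicode-test.csv");
string[] rows = { "id,name", "1,\"Zoë, 日本語\"" };

// UTF8Encoding(true) makes StreamWriter emit the UTF-8 BOM when it creates the file.
var utf8WithBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
using (var f = new StreamWriter(path, false, utf8WithBom))
{
    foreach (string row in rows)
        f.WriteLine(row);
}

// Check the first three bytes of the file.
byte[] bytes = File.ReadAllBytes(path);
Console.WriteLine(BitConverter.ToString(bytes, 0, 3)); // prints EF-BB-BF
```

Passing `false` for the append argument overwrites any existing file, so the BOM always lands at the very start of the output.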


You need to do two things: mark the text file (or html page) as containing Unicode chars (either UTF-8 or UTF-16), and make sure that you are using a text editor that supports Unicode text. Notepad is a good choice on Windows.

To mark a text file (such as .csv) as containing Unicode text, you need to write a Byte Order Mark (BOM) as the first character in the text file. For UTF-16 little-endian (Intel), the BOM would be bytes 0xFF, 0xFE. The Byte Order Mark tells the document reader whether the characters in the document are ordered as big-endian or little-endian. The BOM character is a reserved non-printing character in the Unicode character tables. This BOM can also be used to distinguish ASCII text from UTF-8 and other Unicode encodings (because the UTF-8 BOM byte sequence is different from UTF-16, etc).
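
To see the BOM bytes for each encoding without writing a file, the .NET `Encoding.GetPreamble()` method can be printed directly. This is a small illustrative sketch (the `Hex` helper is a name introduced here, and top-level statements assume C# 9 or later):

```csharp
// C# 9+ top-level statements; on older compilers, wrap this in a class with Main.
using System;
using System.Text;

// Hypothetical helper: formats an encoding's BOM (preamble) as hex bytes.
string Hex(Encoding e) => BitConverter.ToString(e.GetPreamble());

Console.WriteLine("UTF-8:     " + Hex(Encoding.UTF8));             // EF-BB-BF
Console.WriteLine("UTF-16 LE: " + Hex(Encoding.Unicode));          // FF-FE
Console.WriteLine("UTF-16 BE: " + Hex(Encoding.BigEndianUnicode)); // FE-FF
```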

Some document writers will write the BOM for you, or have an option to include or exclude the BOM. Use a binary hex dump to view the text file bytes to determine whether you have a BOM or not. Do not use a text editor - the BOM is a non-display char.
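
If no hex dump tool is at hand, a few lines of code can stand in for one. The sketch below writes a throwaway file with `Encoding.Unicode`, reads its first bytes, and reports which BOM it finds. `DetectBom` and the temp-file name are hypothetical names chosen for this example, and it only checks the UTF-8 and UTF-16 marks discussed above.

```csharp
// C# 9+ top-level statements; on older compilers, wrap this in a class with Main.
using System;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical helper: names the BOM found at the start of a byte array.
// Only UTF-8 and UTF-16 are checked; other encodings report "none".
static string DetectBom(byte[] b)
{
    if (b.Length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8";
    if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16 LE";
    if (b.Length >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16 BE";
    return "none";
}

// Write a small sample file as UTF-16 LE, then inspect its first four bytes.
string path = Path.Combine(Path.GetTempPath(), "bom-check.csv");
File.WriteAllText(path, "a,b\r\n", Encoding.Unicode);
byte[] head = File.ReadAllBytes(path).Take(4).ToArray();

// The file starts with FF-FE, so this reports UTF-16 LE.
Console.WriteLine(BitConverter.ToString(head) + " -> " + DetectBom(head));
```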

To indicate that an HTML page you are generating contains Unicode characters, you need to set the Content-Type header to indicate a Unicode charset: Content-Type: text/html; charset=utf-8 indicates UTF-8 encoded Unicode text, for example.


It could also just be that the font Word is using is missing the characters you are trying to display. If I open Word, hold ALT and mash my numpad, it changes the font to a math font, but still displays the missing-character glyph from the font in question.
