C#: Japanese characters with Unicode encoding

The code is meant to write a Unicode string as Japanese characters to a file:

   String s = "\u30a2\u30c3\u30d7\u30ed\u30fc\u30c9\u3059\u308b\u30d5\u30a1\u30a4\u30eb\u304c\u6307\u5b9a\u3055\u308c\u3066\u3044\u307e\u305b\u3093";
   var Bytes = Encoding.Unicode.GetBytes(s);      
   string  key = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Unicode, Encoding.UTF8, Bytes));

key is what I want to write to the file, but it has the value \u30a2\u30c3\u30d7\u30ed\u30fc\u30c9\u3059\u308b\u30d5\u30a1\u30a4\u30eb\u304c\u6307\u5b9a\u3055\u308c\u3066\u3044\u307e\u305b\u3093. Any ideas what's wrong?


What's wrong is that a string (key) has no notion of the bytes used to store it. In this case, your string is:

String:

アップロードするファイルが指定されていません

This is exactly what

"\u30a2\u30c3\u30d7\u30ed\u30fc\u30c9\u3059\u308b\u30d5\u30a1\u30a4\u30eb\u304c\u6307\u5b9a\u3055\u308c\u3066\u3044\u307e\u305b\u3093" 

means. The expression '\u30a2' looks like 2 Unicode bytes, but it is just an escape sequence for the single character 'ア'.
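To make that concrete, here is a minimal sketch (the class and method names are made up for illustration) showing that the escaped literal already holds the Japanese characters, and that the conversion from the question hands back the same string:

```csharp
using System;
using System.Text;

public static class EscapeDemo
{
    // The text from the question, written with \u escapes.
    public const string Escaped =
        "\u30a2\u30c3\u30d7\u30ed\u30fc\u30c9\u3059\u308b\u30d5\u30a1\u30a4\u30eb" +
        "\u304c\u6307\u5b9a\u3055\u308c\u3066\u3044\u307e\u305b\u3093";

    // Mirrors the question's code: UTF-16 bytes -> UTF-8 bytes -> string.
    public static string RoundTrip(string s)
    {
        byte[] utf16 = Encoding.Unicode.GetBytes(s);
        byte[] utf8 = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16);
        return Encoding.UTF8.GetString(utf8);
    }

    public static void Main()
    {
        Console.WriteLine(Escaped.Length);                 // 22 characters, not 132
        Console.WriteLine(Escaped[0] == 'ア');             // True: the escape is the character
        Console.WriteLine(RoundTrip(Escaped) == Escaped);  // True: the conversion changes nothing
    }
}
```

If key really does contain literal backslashes when inspected at runtime, the escapes were probably not part of a C# string literal in the first place (for example, they were read from a file or escaped twice). That is a guess, since the question does not show where the value was observed.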

Bytes only come into play when you save the string, and which bytes are written depends on the encoding you choose:

UTF-8 bytes

File.WriteAllText("temp.txt", "アップロードするファイルが指定されていません", Encoding.UTF8);

The contents will be (in bytes, after the EF BB BF byte-order mark that Encoding.UTF8 writes first)

 E3 82 A2 E3 83 83 E3 83 97 E3 83 AD E3 83 BC E3 83 89 E3 81 99 E3 82 8B E3 83 
 95 E3 82 A1 E3 82 A4 E3 83 AB E3 81 8C E6 8C 87 E5 AE 9A E3 81 95 E3 82 8C E3 
 81 A6 E3 81 84 E3 81 BE E3 81 9B E3 82 93

UTF-16 bytes

File.WriteAllText("temp.txt", "アップロードするファイルが指定されていません", Encoding.Unicode);

The contents will be (in bytes, after the FF FE byte-order mark that Encoding.Unicode writes first)

 A2 30 C3 30 D7 30 ED 30 FC 30 C9 30 59 30 8B 30 D5 30 A1 30 A4 30 EB 30 4C 30 
 07 63 9A 5B 55 30 8C 30 66 30 44 30 7E 30 5B 30 93 30


One doesn't "convert" Unicode to UTF-8 :-/

Unicode, besides being the parent for the entire set of specifications, can be thought of as "simply" defining code-points/characters and the rules of interaction. The UTF-8 encoding is the specific set of rules to map a sequence of Unicode code-points into a sequence of octets (8-bit bytes).

Try this in LINQPad:

String s = "\u30a2\u30c3\u30d7\u30ed";
s.Dump();     // original string
var bytes = Encoding.UTF8.GetBytes(s);      
bytes.Dump(); // see UTF-8 encoded byte sequence
string key = Encoding.UTF8.GetString(bytes);
key.Dump();   // contents restored

UTF-8 exists only at the byte level; once the bytes are decoded, the resulting string carries no encoding.
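For the original goal of getting the text into a file, no manual conversion should be needed. The following is a rough sketch rather than a definitive solution: it writes the string with an explicit encoding and reads it back to check that nothing was lost. The class name, the method name, and the use of a temporary file are assumptions made for the example.

```csharp
using System;
using System.IO;
using System.Text;

public static class FileRoundTrip
{
    // Writes text to a temporary file as UTF-8, reads it back,
    // and reports whether the text survived unchanged.
    public static bool WriteAndVerify(string text)
    {
        string path = Path.GetTempFileName(); // placeholder location for the demo
        try
        {
            // The encoding is applied here, at the point where bytes are produced.
            File.WriteAllText(path, text, Encoding.UTF8);

            // Decoding with the same encoding restores the original string.
            string readBack = File.ReadAllText(path, Encoding.UTF8);
            return readBack == text;
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        string s = "\u30a2\u30c3\u30d7\u30ed\u30fc\u30c9";
        Console.WriteLine(WriteAndVerify(s)); // True
    }
}
```

Whether the file then displays as Japanese depends on the viewer opening it as UTF-8 and having a font with those glyphs.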

Happy coding.
