C#: Japanese characters with Unicode encoding
The code is meant to write a Unicode string to a file as Japanese characters:
String s = "\u30a2\u30c3\u30d7\u30ed\u30fc\u30c9\u3059\u308b\u30d5\u30a1\u30a4\u30eb\u304c\u6307\u5b9a\u3055\u308c\u3066\u3044\u307e\u305b\u3093";
var Bytes = Encoding.Unicode.GetBytes(s);
string key = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Unicode, Encoding.UTF8, Bytes));
key is what I want to print to the file, but it has the value \u30a2\u30c3\u30d7\u30ed\u30fc\u30c9\u3059\u308b\u30d5\u30a1\u30a4\u30eb\u304c\u6307\u5b9a\u3055\u308c\u3066\u3044\u307e\u305b\u3093
Any ideas what's wrong?
What's wrong is that a string (key) has no notion of the bytes used to store it. In this case, your string is:
String:
アップロードするファイルが指定されていません
This is exactly what
"\u30a2\u30c3\u30d7\u30ed\u30fc\u30c9\u3059\u308b\u30d5\u30a1\u30a4\u30eb\u304c\u6307\u5b9a\u3055\u308c\u3066\u3044\u307e\u305b\u3093"
means. The expression '\u30a2' looks like 2 Unicode bytes, but it actually just means the character 'ア'.
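A minimal sketch of that point, assuming C# 9 or later so that top-level statements compile (the variable names are only illustrative). It compares the escaped literal with the character typed directly:

```csharp
using System;

// The compiler resolves the \u escape at compile time, so both
// literals denote the same one-character string.
string escaped = "\u30a2";
string literal = "ア";

Console.WriteLine(escaped == literal);          // True
Console.WriteLine(escaped.Length);              // 1
Console.WriteLine((int)escaped[0] == 0x30A2);   // True
```

The escape is only a way of typing the character in source code; nothing about it survives into the string at run time.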
If you save to a UTF-8 file, the bytes written will be:
UTF-8 bytes
File.WriteAllText("temp.txt", "アップロードするファイルが指定されていません", Encoding.UTF8);
The contents will be (in bytes, after the EF BB BF byte order mark that Encoding.UTF8 writes first):
E3 82 A2 E3 83 83 E3 83 97 E3 83 AD E3 83 BC E3 83 89 E3 81 99 E3 82 8B E3 83
95 E3 82 A1 E3 82 A4 E3 83 AB E3 81 8C E6 8C 87 E5 AE 9A E3 81 95 E3 82 8C E3
81 A6 E3 81 84 E3 81 BE E3 81 9B E3 82 93
UTF-16 bytes
File.WriteAllText("temp.txt", "アップロードするファイルが指定されていません", Encoding.Unicode);
The contents will be (in bytes, after the FF FE byte order mark that Encoding.Unicode writes first):
A2 30 C3 30 D7 30 ED 30 FC 30 C9 30 59 30 8B 30 D5 30 A1 30 A4 30 EB 30 4C 30
07 63 9A 5B 55 30 8C 30 66 30 44 30 7E 30 5B 30 93 30
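The same byte sequences can be inspected in memory without touching a file. This is a sketch, assuming C# 9 top-level statements and using only the first four characters of the string to keep the output short; Encoding.GetBytes does not add a byte order mark:

```csharp
using System;
using System.Text;

string s = "\u30a2\u30c3\u30d7\u30ed"; // アップロ

// One string, two different byte representations
byte[] utf8 = Encoding.UTF8.GetBytes(s);
byte[] utf16 = Encoding.Unicode.GetBytes(s); // UTF-16, little-endian

Console.WriteLine(BitConverter.ToString(utf8));  // E3-82-A2-E3-83-83-E3-83-97-E3-83-AD
Console.WriteLine(BitConverter.ToString(utf16)); // A2-30-C3-30-D7-30-ED-30

// Decoding with the matching encoding restores the same string
Console.WriteLine(Encoding.UTF8.GetString(utf8) == s);     // True
Console.WriteLine(Encoding.Unicode.GetString(utf16) == s); // True
```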
One doesn't "convert" Unicode to UTF-8 :-/
Unicode, besides being the parent for the entire set of specifications, can be thought of as "simply" defining code-points/characters and the rules of interaction. The UTF-8 encoding is the specific set of rules to map a sequence of Unicode code-points into a sequence of octets (8-bit bytes).
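To make that mapping concrete, here is a hand-rolled sketch for a single code point, assuming C# 9 top-level statements. It covers only the three-byte form (code points U+0800 through U+FFFF, surrogates excluded), so it is an illustration and not a general encoder:

```csharp
using System;
using System.Text;

int codePoint = 0x30A2; // ア

// Three-byte UTF-8 layout: 1110xxxx 10xxxxxx 10xxxxxx
byte[] manual =
{
    (byte)(0xE0 | (codePoint >> 12)),          // top 4 bits
    (byte)(0x80 | ((codePoint >> 6) & 0x3F)),  // middle 6 bits
    (byte)(0x80 | (codePoint & 0x3F)),         // low 6 bits
};

// The framework encoder produces the same octets
byte[] viaFramework = Encoding.UTF8.GetBytes("\u30a2");

Console.WriteLine(BitConverter.ToString(manual));       // E3-82-A2
Console.WriteLine(BitConverter.ToString(viaFramework)); // E3-82-A2
```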
Try this in LINQPad:
String s = "\u30a2\u30c3\u30d7\u30ed";
s.Dump(); // original string
var bytes = Encoding.UTF8.GetBytes(s);
bytes.Dump(); // see UTF-8 encoded byte sequence
string key = Encoding.UTF8.GetString(bytes);
key.Dump(); // contents restored
The UTF-8 exists only in the bytes.
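Applied to the original goal of writing the text to a file, a sketch might look like the following. It assumes C# 9 top-level statements, and the temp-directory path is a hypothetical location chosen only for the example:

```csharp
using System;
using System.IO;
using System.Text;

string s = "\u30a2\u30c3\u30d7\u30ed"; // アップロ
string path = Path.Combine(Path.GetTempPath(), "temp.txt"); // hypothetical location

// The encoding is chosen only when the string is turned into bytes
File.WriteAllText(path, s, Encoding.UTF8);

// Reading with the same encoding yields the same string
string roundTripped = File.ReadAllText(path, Encoding.UTF8);
Console.WriteLine(roundTripped == s); // True

File.Delete(path);
```

No Encoding.Convert step is needed: the string is passed as-is and the encoding argument decides which bytes land in the file.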
Happy coding.