ASCII Encoding and Umlauts and Accents
I have a requirement to produce text files with ASCII encoding. I have a database full of Greek, French, and German characters with Umlauts and Accents. Is this even possible?
string reportString = report.makeReport();
Dictionary<string, string> replaceCharacters = new Dictionary<string, string>();
byte[] encodedReport = Encoding.ASCII.GetBytes(reportString);
Response.BufferOutput = false;
Response.ContentType = "text/plain";
Response.AddHeader("Content-Disposition", "attachment;filename=" + reportName + ".txt");
Response.OutputStream.Write(encodedReport, 0, encodedReport.Length);
Response.End();
When I get the reportString back the characters are represented faithfully. When I save the text file I have ? in place of the special characters.
As I understand it, the ASCII standard is for American English only, and something like UTF-8 would be for an international audience. Is this correct?
I'm going to make the statement that if the requirement is ASCII encoding we can't have the accents and umlauts represented correctly.
Or, am I way off and doing/saying something stupid?
You cannot represent accents and umlauts in an ASCII encoded file simply because these characters are not defined in the standard ASCII charset.
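The ? characters in the question can be reproduced in isolation. This is a minimal sketch, assuming the default behaviour of Encoding.ASCII (replacing anything it cannot encode with a question mark); the class and method names are made up for illustration.

```csharp
using System;
using System.Text;

public static class AsciiLossDemo
{
    // Encodes the text as 7-bit ASCII and decodes it again,
    // so the result shows exactly what would end up in the file.
    public static string RoundTripAscii(string text)
    {
        // Characters outside ASCII are replaced by the byte 0x3F ('?').
        byte[] bytes = Encoding.ASCII.GetBytes(text);
        return Encoding.ASCII.GetString(bytes);
    }

    public static void Main()
    {
        // \u00FC = u with umlaut, \u00E9 = e with acute, \u03A9 = Greek capital omega
        Console.WriteLine(RoundTripAscii("M\u00FCller, caf\u00E9, \u03A9mega"));
        // prints: M?ller, caf?, ?mega
    }
}
```

The original characters are gone once the bytes are written, so nothing on the receiving side can recover them.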
Before Unicode this was handled by "code pages". You can think of a code page as a mapping between Unicode characters and the 256 values that fit into a single byte (so every code page is missing most Unicode characters).
The original ASCII code page includes only English letters, but it's unlikely that someone really wants the original 7-bit code page; they are probably calling any 8-bit character set ASCII.
The Western European code page known as Latin-1 is ISO-8859-1. Windows-1252 is the closest code page supported by Windows; it differs from ISO-8859-1 only in the range 0x80-0x9F, where it defines printable characters (such as the euro sign) instead of control codes.
To support characters not in Latin-1 you need to encode using different code pages, for example:
874 — Thai
932 — Japanese
936 — Chinese (simplified) (PRC, Singapore)
949 — Korean
950 — Chinese (traditional) (Taiwan, Hong Kong)
1250 — Latin (Central European languages)
1251 — Cyrillic
1252 — Latin (Western European languages)
1253 — Greek
1254 — Turkish
1255 — Hebrew
1256 — Arabic
1257 — Latin (Baltic languages)
1258 — Vietnamese
UTF-8 is something completely different: it encodes the entire Unicode character set using a variable number of bytes per character. Digits and English letters are encoded the same as in ASCII (and Windows-1252), while most other characters take 2 to 4 bytes each.
UTF-8 is mostly compatible with ASCII systems because English is encoded the same as ASCII and there are no embedded nulls in the strings.
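To make the variable width concrete, here is a small sketch that counts the UTF-8 bytes for a few characters. The class and method names are hypothetical; Encoding.UTF8.GetByteCount is the only library call it relies on.

```csharp
using System;
using System.Text;

public static class Utf8SizeDemo
{
    // Returns how many bytes the text occupies once encoded as UTF-8.
    public static int Utf8ByteCount(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    public static void Main()
    {
        Console.WriteLine(Utf8ByteCount("A"));          // 1 (same byte as in ASCII)
        Console.WriteLine(Utf8ByteCount("\u00FC"));     // 2 (u with umlaut)
        Console.WriteLine(Utf8ByteCount("\u03A9"));     // 2 (Greek capital omega)
        Console.WriteLine(Utf8ByteCount("\u20AC"));     // 3 (euro sign)
        Console.WriteLine(Utf8ByteCount("\U0001F600")); // 4 (an emoji outside the BMP)
    }
}
```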
Converting between .NET strings (UTF-16) and other encodings is done by the System.Text.Encoding class.
IMPORTANT NOTE: the most important thing is that the system on the receiving end uses the same code page as the system on the sending end; otherwise you will get gibberish.
The ASCII character set only contains A-Z in upper and lower case, digits, and some punctuation. No Greek characters, no umlauts, no accents.
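The gibberish is easy to demonstrate. The sketch below assumes the sender writes UTF-8 and the receiver wrongly reads the bytes as ISO-8859-1; the class and method names are invented for the example.

```csharp
using System;
using System.Text;

public static class CodePageMismatchDemo
{
    // Encodes with UTF-8 (sender) and decodes with ISO-8859-1 (receiver),
    // which is the kind of mismatch the note above warns about.
    public static string DecodeWithWrongCodePage(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        return latin1.GetString(utf8Bytes);
    }

    public static void Main()
    {
        string garbled = DecodeWithWrongCodePage("M\u00FCller");
        // The two UTF-8 bytes of the umlaut (0xC3 0xBC) come back as
        // two separate Latin-1 characters, U+00C3 and U+00BC.
        Console.WriteLine(garbled == "M\u00C3\u00BCller"); // True
        Console.WriteLine(garbled.Length);                  // 7 instead of 6
    }
}
```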
You can use a character set from the group that is sometimes referred to as "extended ASCII", which uses 256 characters instead of 128.
The problem with using a different character set than ASCII is that you have to use the correct one, i.e. the one that the receiving party is expecting, or it will fail to interpret any of the extended characters correctly.
You can use Encoding.GetEncoding(...) to create an extended encoding. See the reference for the Encoding class for a list of possible encodings.
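One way to find out whether a candidate encoding can hold your data is to ask GetEncoding for a strict encoder that throws instead of silently writing ?. This is only a sketch: CanEncode is a made-up helper, and it sticks to "us-ascii" and "iso-8859-1" because those names are available everywhere; on .NET Core and later, Windows code pages such as 1253 generally have to be made available first by registering CodePagesEncodingProvider.Instance.

```csharp
using System;
using System.Text;

public static class EncodingCheck
{
    // Returns true if every character of the text exists in the named encoding.
    public static bool CanEncode(string text, string encodingName)
    {
        // Exception fallbacks make the encoder throw rather than substitute '?'.
        Encoding strict = Encoding.GetEncoding(
            encodingName,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(CanEncode("M\u00FCller", "us-ascii"));   // False
        Console.WriteLine(CanEncode("M\u00FCller", "iso-8859-1")); // True
        Console.WriteLine(CanEncode("\u03A9mega", "iso-8859-1"));  // False (Greek is not in Latin-1)
    }
}
```

Running a check like this over the report before sending it tells you whether a single 8-bit code page is enough for data that mixes Greek, French, and German.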
You are correct.
- Pure US ASCII is a 7-bit encoding, featuring English characters only.
- You need a different encoding to capture characters from other alphabets. UTF-8 is a good choice.
UTF-8 is backward compatible with ASCII, so if you encode your files as UTF-8, then ASCII clients can read whatever is in their character set, and Unicode clients can read all the extended characters.
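The backward compatibility can be checked directly: text that only uses ASCII characters produces byte-for-byte identical output under both encodings. A small sketch, with hypothetical names:

```csharp
using System;
using System.Linq;
using System.Text;

public static class Utf8CompatibilityDemo
{
    // True when the ASCII and UTF-8 encodings of the text are the same bytes.
    public static bool SameBytesInAsciiAndUtf8(string text)
    {
        byte[] ascii = Encoding.ASCII.GetBytes(text);
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        return ascii.SequenceEqual(utf8);
    }

    public static void Main()
    {
        Console.WriteLine(SameBytesInAsciiAndUtf8("Report 2024")); // True
        Console.WriteLine(SameBytesInAsciiAndUtf8("M\u00FCller")); // False (the umlaut needs 2 bytes in UTF-8)
    }
}
```

In the question's code this would mean swapping Encoding.ASCII for Encoding.UTF8, assuming whoever consumes the file can accept UTF-8.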
There's no way to get all the accents you want in ASCII; some accented characters (like ü) are however available in the "extended ASCII" (8-bit) character set.
Several of the encodings mentioned in other answers can be loosely described as extended ASCII.
When your users are asking for ASCII encoding, they are probably asking for one of these.
A statement like "if the requirement is ASCII encoding we can't have the accents and umlauts represented correctly" risks sounding pedantic to a non-technical user. An alternative is to get a sample of what they want (probably either the ANSI or OEM code page of their PC), determine the appropriate code page, and specify that.
The above is only partially correct. While it's true that you can't encode those characters in ASCII, you can represent them with substitute spellings. These substitutions exist because some typewriters and early computers couldn't handle the characters.
Ä=Ae
ä=ae
ö=oe
Ö=Oe
ü=ue
Ü=Ue
ß=ss (historically also sz)
Edit: andyraddatz has already written code that replaces lots of Unicode characters with ASCII representations. They might not be correct for some languages/cultures, but at least you won't have encoding errors. https://gist.github.com/andyraddatz/e6a396fb91856174d4e3f1bf2e10951c
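For the German substitutions listed above, a replacement table is only a few lines. This is a rough sketch and not the code from the gist: the class name is invented, only the seven German characters are covered, and ß is mapped to "ss" as an assumption (use "sz" if that is what the recipient expects).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class GermanTransliterator
{
    // Substitute spellings for the German characters that ASCII lacks.
    private static readonly Dictionary<char, string> Replacements =
        new Dictionary<char, string>
        {
            { '\u00C4', "Ae" }, // A with umlaut
            { '\u00E4', "ae" }, // a with umlaut
            { '\u00D6', "Oe" }, // O with umlaut
            { '\u00F6', "oe" }, // o with umlaut
            { '\u00DC', "Ue" }, // U with umlaut
            { '\u00FC', "ue" }, // u with umlaut
            { '\u00DF', "ss" }, // sharp s (assumption: "ss" rather than "sz")
        };

    public static string ToAscii(string text)
    {
        // Normalize first so "u + combining diaeresis" matches the table too.
        string composed = text.Normalize(NormalizationForm.FormC);
        var result = new StringBuilder(composed.Length);
        foreach (char c in composed)
        {
            string replacement;
            if (Replacements.TryGetValue(c, out replacement))
                result.Append(replacement);
            else
                result.Append(c); // anything not in the table passes through unchanged
        }
        return result.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(ToAscii("Gr\u00FC\u00DFe aus K\u00F6ln")); // Gruesse aus Koeln
    }
}
```

French accents and Greek letters are not in this table, so they would still turn into ? if the result were then encoded as ASCII.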