What are some common character encodings that a text editor should support?

I have a text editor that can load ASCII and Unicode files. It automatically detects the encoding by looking for the BOM at the beginning of the file and/or searching the first 256 bytes for characters > 0x7f.

What other encodings should be supported, and what characteristics would make that encoding easy to auto-detect?
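For reference, here is a minimal Python sketch of the scheme described above: check for a BOM first, then scan the leading bytes for values above 0x7F. The function name and the returned labels are illustrative, not any real editor's API.

    import codecs

    # Check longer BOMs first: the UTF-16 LE BOM is a prefix of the UTF-32 LE BOM.
    _BOMS = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8,     "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]

    def sniff(data: bytes, window: int = 256) -> str:
        for bom, name in _BOMS:
            if data.startswith(bom):
                return name
        if any(b > 0x7F for b in data[:window]):
            return "non-ascii"  # some 8-bit or multi-byte encoding; needs further heuristics
        return "ascii"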


Definitely UTF-8. See http://www.joelonsoftware.com/articles/Unicode.html.

As far as I know, there's no guaranteed way to detect this automatically (although the probability of a mistaken diagnosis can be reduced to a very small amount by scanning).
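One way to do that scan, sketched in Python: a strict UTF-8 decode either succeeds or raises, so a successful decode of data that actually contains non-ASCII bytes is strong (though not conclusive) evidence of UTF-8. The function name is illustrative.

    def looks_like_utf8(data: bytes) -> bool:
        try:
            data.decode("utf-8", errors="strict")
            return True
        except UnicodeDecodeError:
            return False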


I don't know about encodings, but make sure it can support the different line-ending conventions! (\n vs. \r\n)
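A rough sketch of line-ending detection, assuming the file has already been decoded to text; the check order matters because "\r\n" contains both of the single-character endings.

    def detect_line_ending(text: str) -> str:
        if "\r\n" in text:
            return "\r\n"   # Windows / DOS
        if "\r" in text:
            return "\r"     # classic Mac OS
        return "\n"         # Unix / Linux / modern macOS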

If you haven't checked out Michael Kaplan's blog yet, I suggest doing so: http://blogs.msdn.com/michkap/

Specifically this article may be useful: http://www.siao2.com/2007/04/22/2239345.aspx


There is no reliable way to detect an encoding. The best you can do is something like what IE does: rely on letter distributions in different languages, as well as characters that are standard for a given language. But that's a long shot at best.

I would advise getting your hands on a large library of character sets (check out projects like iconv) and making all of them available to the user. But don't bother with auto-detection. Simply allow the user to select a preferred default charset, which itself would be UTF-8 by default.
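A sketch of that "let the user pick, default to UTF-8" approach in Python; the encoding list here is an illustrative subset, not an attempt to mirror everything a library like iconv offers.

    # Encodings offered in the editor's UI (illustrative subset).
    SUPPORTED_ENCODINGS = ["utf-8", "utf-16", "latin-1", "cp1252", "shift_jis", "gb18030"]

    def load_text(path: str, encoding: str = "utf-8") -> str:
        # 'encoding' would come from the user's charset preference in the editor's settings.
        with open(path, "r", encoding=encoding) as f:
            return f.read()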


Latin-1 (ISO-8859-1) and its Windows extension CP-1252 must definitely be supported for Western users. One could argue that UTF-8 is a superior choice, but people often don't have that choice. Chinese users would require GB18030, and remember there are Japanese, Russian, and Greek users too, who all have their own encodings besides UTF-8-encoded Unicode.

As for detection, most encodings are not safely detectable. In some (CP-1252, for example), certain byte values are simply unassigned. In UTF-8, any byte value can occur, but not every sequence of byte values is valid. In practice, however, you would not do the decoding yourself, but use an encoding/decoding library, try to decode, and catch errors. So why not support all the encodings that this library supports?
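A sketch of that "try to decode and catch errors" approach using Python's built-in codecs; the candidate list and its order are assumptions, and latin-1 goes last because it never fails (every byte value maps to something).

    def guess_by_decoding(data: bytes, candidates=("utf-8", "cp1252", "latin-1")):
        # Return the first candidate encoding that decodes without error, plus the decoded text.
        for enc in candidates:
            try:
                return enc, data.decode(enc)
            except UnicodeDecodeError:
                continue
        return None, None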

You could also develop heuristics, like decoding with a specific encoding and then testing the result for strange characters, odd character combinations, or the frequency of such characters. But this would never be safe, and I agree with Vilx- that you shouldn't bother. In my experience, people normally know that a file has a certain encoding, or that only two or three are possible. So if they see you chose the wrong one, they can easily adapt. And have a look at other editors: the cleverest solution is not always the best, especially if people are used to other programs.


UTF-16 is not very common in plain text files. UTF-8 is much more common because it is backward compatible with ASCII and is specified in standards like XML.

1) Check for the BOM of the various Unicode encodings. If one is found, use that encoding.
2) If there is no BOM, check whether the file text is valid UTF-8, reading until you reach a sufficient non-ASCII sample (since many files are almost all ASCII but may have a few accented characters or smart quotes) or the file ends. If it is valid UTF-8, use UTF-8 (see the sketch after this list).
3) If it's not Unicode, it's probably the current platform's default codepage.
4) Some encodings are easy to detect; for example, Japanese Shift-JIS makes heavy use of the lead bytes 0x82 and 0x83, which introduce hiragana and katakana.
5) Give the user the option to change the encoding if the program's guess turns out to be wrong.
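A rough Python sketch of steps 1 to 3 above; the BOM table and the use of the locale module for the platform default are assumptions, not part of the answer itself.

    import codecs
    import locale

    def detect_encoding(data: bytes) -> str:
        # Step 1: BOM check (longer BOMs first, since the UTF-16 LE BOM is a prefix of UTF-32 LE's).
        for bom, name in [(codecs.BOM_UTF32_LE, "utf-32-le"),
                          (codecs.BOM_UTF32_BE, "utf-32-be"),
                          (codecs.BOM_UTF8,     "utf-8-sig"),
                          (codecs.BOM_UTF16_LE, "utf-16-le"),
                          (codecs.BOM_UTF16_BE, "utf-16-be")]:
            if data.startswith(bom):
                return name
        # Step 2: no BOM, so test whether the bytes are valid UTF-8.
        try:
            data.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            pass
        # Step 3: fall back to the platform's default codepage.
        return locale.getpreferredencoding(False)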


Whatever you do, use more than 256 bytes for a sniff test. It's important to get it right, so why not check the whole doc? Or at least the first 100KB or so.

Try UTF-8 and obvious UTF-16 (lots of alternating 0 bytes), then fall back to the ANSI codepage for the current locale.
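A sketch of the "lots of alternating 0 bytes" heuristic for BOM-less UTF-16; the 30% threshold is an arbitrary assumption, and mostly Latin-script text is assumed, since that is what produces the zero high bytes.

    def looks_like_utf16(data: bytes):
        half = max(len(data) // 2, 1)
        zeros_at_odd = sum(1 for b in data[1::2] if b == 0)   # UTF-16 LE: high byte follows low byte
        zeros_at_even = sum(1 for b in data[0::2] if b == 0)  # UTF-16 BE: high byte comes first
        if zeros_at_odd > 0.3 * half:
            return "utf-16-le"
        if zeros_at_even > 0.3 * half:
            return "utf-16-be"
        return None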
