Extract correct text from a wifstream regardless of encoding

2022-12-19 22:51 Q&A author:

Here is the program: http://codepad.org/eyxunHot

The encoding of the file is UTF-8.

I have a text file named "config.ini" with the following word in it: ➑ball

If I use notepad to save the file with "UTF-8" encoding, then run the program, according to the debugger the value of eight_ball is: ï»¿âball

If I use notepad to save the file with "Unicode" encoding, then run the program, according to the debugger the value of eight_ball is: ÿþ'b

If I use notepad to save the file with "Unicode big endian" encoding, then run the program, according to the debugger the value of eight_ball is: þÿ'

In all these cases the result is incorrect. Also ANSI encoding doesn't support the ➑ symbol. How do I make sure that the word ➑ball will be extracted from the file when I go config_file >> eight_ball, regardless of encoding? I want the output of this program to be "Program is correct" regardless of the encoding of config.ini.

If you're on Windows and you want to use INI files, keep in mind that the INI APIs support Unicode (UTF-16 little endian) INI files without problems; you just have to create the empty file with the BOM at the beginning.
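A minimal sketch of the "empty file with the BOM" step, assuming the UTF-16 little-endian BOM is the two bytes FF FE written at offset 0. The helper name create_utf16le_ini is hypothetical and not part of any Windows API.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper: create an empty INI file that contains only the
// UTF-16 little-endian byte order mark (bytes FF FE).
bool create_utf16le_ini(const std::string& path) {
    assert(!path.empty());
    // Binary mode so the two BOM bytes are written exactly as given.
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write("\xFF\xFE", 2);
    return static_cast<bool>(out);
}
```

Once such a file exists, the wide-character INI functions (for example WritePrivateProfileStringW and GetPrivateProfileStringW) would be the ones to use for adding and reading entries, as this answer describes. That part is Windows-only and is not shown here.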

By the way, if you want to work with C++ streams and Unicode, you may want to look at this article. Besides the UTF-8 material, you'll also learn how character conversion works under the hood in C++ streams.

Maybe you can use the ICU library.

Windows has many problems with UTF support. My Ubuntu uses UTF-8 encoding by default, and that solved this problem for me, but Unix-like OSes have a somewhat strange implementation of the C++ standard library. I mean using char* to hold UTF-8 text (a single letter can occupy 2 array cells). With the string class it is cleaner.

You need to set the locale before wstreams will work correctly. I would instead suggest using regular streams and some library for character conversion, as your input encoding typically will differ anyway. The best algorithm these days is to try reading as UTF-8 first and if that fails, try reading as CP1252 or some other user-configurable legacy charset.
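The approach in this answer (regular byte streams plus explicit conversion) could be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: all function names are hypothetical, the result is always UTF-8 in a std::string, the BOM check covers only the three Notepad encodings from the question, the UTF-8 check is structural only (it does not reject every overlong form), and CP1252 is hard-coded as the legacy fallback.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>

// Append one Unicode code point to a UTF-8 string.
void append_utf8(std::string& out, std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Structural UTF-8 check: valid lead bytes followed by enough 10xxxxxx bytes.
bool is_valid_utf8(const std::string& s) {
    for (std::size_t i = 0; i < s.size();) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        if (c < 0x80) extra = 0;
        else if (c >= 0xC2 && c <= 0xDF) extra = 1;
        else if (c >= 0xE0 && c <= 0xEF) extra = 2;
        else if (c >= 0xF0 && c <= 0xF4) extra = 3;
        else return false;
        if (i + extra >= s.size()) return false;  // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k)
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) return false;
        i += extra + 1;
    }
    return true;
}

// Convert UTF-16 bytes (starting at offset `start`) to UTF-8.
std::string utf16_to_utf8(const std::string& bytes, std::size_t start, bool big_endian) {
    auto unit = [&](std::size_t i) -> std::uint32_t {
        unsigned char a = bytes[i], b = bytes[i + 1];
        return big_endian ? (a << 8 | b) : (b << 8 | a);
    };
    std::string out;
    for (std::size_t i = start; i + 1 < bytes.size(); i += 2) {
        std::uint32_t cp = unit(i);
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 3 < bytes.size()) {
            std::uint32_t lo = unit(i + 2);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {  // combine a surrogate pair
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            }
        }
        if (cp >= 0xD800 && cp <= 0xDFFF) cp = 0xFFFD;  // unpaired surrogate
        append_utf8(out, cp);
    }
    return out;
}

// Convert CP1252 bytes to UTF-8; unassigned bytes map to the same code point.
std::string cp1252_to_utf8(const std::string& bytes) {
    static const std::uint16_t high[32] = {
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178};
    std::string out;
    for (unsigned char c : bytes)
        append_utf8(out, (c >= 0x80 && c <= 0x9F) ? high[c - 0x80] : c);
    return out;
}

// Use a BOM if present; otherwise try UTF-8 first, then fall back to CP1252.
std::string decode_to_utf8(const std::string& raw) {
    if (raw.compare(0, 2, "\xFF\xFE") == 0) return utf16_to_utf8(raw, 2, false);
    if (raw.compare(0, 2, "\xFE\xFF") == 0) return utf16_to_utf8(raw, 2, true);
    std::string body = raw.compare(0, 3, "\xEF\xBB\xBF") == 0 ? raw.substr(3) : raw;
    return is_valid_utf8(body) ? body : cp1252_to_utf8(body);
}

// Read a whole file as raw bytes, with no locale or newline translation.
std::string read_file_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
}
```

With this sketch, decode_to_utf8(read_file_bytes("config.ini")) would yield the same UTF-8 bytes for all three Notepad encodings in the question, and the word can then be extracted from a std::istringstream built on that string. A library such as ICU would replace the hand-written converters in real code.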
