Reading Unicode files line by line C++
What is the correct way to read Unicode files line by line in C++?
I am trying to read a file saved as Unicode (LE) by Windows Notepad.
Suppose the file contains simply the characters A and B on separate lines.
Reading the file byte by byte, I see the following byte sequence (hex):
FF FE 41 00 0D 00 0A 00 42 00 0D 00 0A 00
So: 2-byte BOM, 2-byte 'A', 2-byte CR, 2-byte LF, 2-byte 'B', 2-byte CR, 2-byte LF.
I tried reading the text file using the following code:
std::wifstream file("test.txt");
file.seekg(2); // skip BOM
std::wstring A_line;
std::wstring B_line;
getline(file,A_line); // I get "A"
getline(file,B_line); // I get "\0B"
I get the same results using the >> operator instead of getline:
file >> A_line;
file >> B_line;
It appears that the CR is being consumed as only a single byte, or that CR NUL LF is being consumed but the high NUL byte is not. I would expect wifstream in text mode to read the 2-byte CR and 2-byte LF.
What am I doing wrong? It does not seem right that one should have to read a text file byte by byte in binary mode just to parse the newlines.
std::wifstream exposes the wide character set to your program, which is typically UCS-2 on Windows and UTF-32 on Unix, but it assumes that the input file still uses narrow characters. If you want it to treat the on-disk data as wide characters, you need a std::codecvt<wchar_t, wchar_t> facet.

You should just be able to find your compiler's implementation of std::codecvt<char, char>, which is also a non-converting code conversion facet, and change the chars to wchar_ts.
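If writing a facet by hand seems heavy, one common alternative (not the only way, and a different technique than the one described above) is to imbue the stream with std::codecvt_utf16 from <codecvt>, which converts on-disk UTF-16 to the program's wchar_t. Note that <codecvt> is deprecated since C++17 but still shipped by the major compilers. A minimal sketch, assuming the UTF-16 LE file from the question is named test.txt:

#include <codecvt>   // std::codecvt_utf16 (deprecated in C++17, still widely available)
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    // Open in binary mode so the text-mode layer cannot touch the raw UTF-16 bytes.
    std::wifstream file("test.txt", std::ios::binary);

    // consume_header eats the BOM; little_endian matches Notepad's "Unicode" (UTF-16 LE) format.
    file.imbue(std::locale(file.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff,
            std::codecvt_mode(std::little_endian | std::consume_header)>));

    std::wstring line;
    while (std::getline(file, line))
    {
        // In binary mode the L'\r' survives getline (which splits on L'\n'), so strip it.
        if (!line.empty() && line.back() == L'\r')
            line.pop_back();
        std::wcout << line << L"\n";
    }
}

Because the stream is opened in binary mode, the facet sees the full 0D 00 0A 00 sequence and decodes it to L'\r' L'\n'; getline then stops at L'\n' and the stray L'\r' is removed manually.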