reading binary files C++

I would like to ask for help. I am just starting out in C++ and I got this homework at school: we have to write a function bool UTF8toUTF16 (const char * src, const char * dst ); which is supposed to read the src file encoded in UTF-8 and write it to the dst file in UTF-16. We also must not use any libraries other than the ones in my code below.

So the first thing I am trying to do is this: I make a file "xx.txt" and in classic Windows Notepad I write, for example, the char 'š' into it. Then I am trying to write a program which reads each char of this file in binary mode, byte by byte (or several bytes at a time), and prints its value, but my program doesn't work like that.

So I have this file 'xx.txt' which contains only 'š', which has the UTF-8 value 'c5 a1', the UTF-16 value '0161' and the Unicode value '161', and I expect the program to print: i = 161 (hex), or something close to that at least.

Here is my code so far:

#include <stdio.h>
#include <stdlib.h>
#include <iomanip>
#include <iostream>
#include <fstream>

using namespace std;

int main ( void ) {
    char name[] = "xx.txt";
    fstream F ( name, ios::in | ios::binary );
    unsigned int i;
    while( F.read ((char *) & i, 2))
    /* I dont know what size to write there - I would guess it s '2' - because I need 2     bytes for the char with hexUTF-16 code '0161', but 2 doesnt work*/
    cout << "i = " << hex << i << " (hex) ";
    cout << endl;
    F.close();
    system("PAUSE");
    return 0;}

Thanks in advance

Nikolas Jíša


You don't know how big a character is in UTF-8 until you finish parsing it, so you need to read "chars" one at a time until you have a complete UTF-8 character.

Edit: you don't say what you are getting as output, but I suspect it's a byte-ordering issue.
You might be better off reading the input (if you know it is always a 16-bit value) into a char array and then looking at the individual bytes.

See http://www.joelonsoftware.com/articles/Unicode.html
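
To see what is actually in the file, it can help to dump it one byte at a time before attempting any decoding. The following is only a minimal sketch: dumpBytesHex is a hypothetical helper name, and it uses <sstream> and <string> for convenience, which the homework rules may not allow (printing straight to cout would avoid them). Note also that reading two bytes into an unsigned int, as in the question, fills only the two lowest-addressed bytes of i; on a little-endian machine 0xC5 ends up in the lowest byte, so the low 16 bits print as a1c5, and the remaining bytes of i are uninitialized.

```cpp
#include <fstream>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper (not from the original post): returns the bytes of
// the file at `path` as lowercase two-digit hex values separated by spaces.
// Returns an empty string if the file cannot be opened or is empty.
std::string dumpBytesHex(const char* path) {
    std::ifstream in(path, std::ios::in | std::ios::binary);
    std::ostringstream out;
    out << std::hex << std::setfill('0');

    char c;
    bool first = true;
    while (in.get(c)) {  // read exactly one byte per iteration
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        unsigned int value = static_cast<unsigned char>(c);
        if (!first) {
            out << ' ';
        }
        out << std::setw(2) << value;
        first = false;
    }
    return out.str();
}
```

For a file holding only the two bytes of 'š', a call such as std::cout << dumpBytesHex("xx.txt") should show c5 a1. If three extra bytes ef bb bf appear at the front, the editor saved a UTF-8 byte order mark, which some versions of Notepad do.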


If your input is in UTF-8, you need to read one byte at a time, not two (you'll want i to have type unsigned char). This gives you a stream of binary data, which you need to decode following the UTF-8 Specification, which will yield a stream of unsigned ints (Unicode code points), which you'll then need to re-encode according to the UTF-16 specification.
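
The two steps described above (UTF-8 bytes to a code point, then code point to UTF-16 code units) could be sketched roughly as follows. The function names decodeUtf8 and encodeUtf16 are made up for illustration and are not part of the assignment. The sketch works on bytes already in memory and does not decide how the 16-bit units are written to the dst file: the byte order (little-endian or big-endian) and whether to emit a byte order mark are assumptions you would need to check against the homework requirements.

```cpp
#include <cstddef>

// Hypothetical helper: decodes one UTF-8 sequence starting at s, with n
// bytes available. Stores the code point in cp and returns the number of
// bytes consumed, or 0 if the sequence is malformed or truncated.
std::size_t decodeUtf8(const unsigned char* s, std::size_t n, unsigned long& cp) {
    if (n == 0) return 0;
    unsigned char b0 = s[0];
    std::size_t len;
    unsigned long minValue;  // smallest code point allowed for this length

    if (b0 < 0x80) { cp = b0; return 1; }  // 0xxxxxxx: plain ASCII
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; minValue = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; minValue = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; minValue = 0x10000; }
    else return 0;  // continuation byte or invalid lead byte

    if (n < len) return 0;
    for (std::size_t i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80) return 0;  // must look like 10xxxxxx
        cp = (cp << 6) | (s[i] & 0x3Fu);
    }
    // Reject overlong forms, surrogates, and values beyond U+10FFFF.
    if (cp < minValue || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    return len;
}

// Hypothetical helper: encodes a valid code point as UTF-16 code units in
// out[0..1] and returns how many units were written (1 or 2).
std::size_t encodeUtf16(unsigned long cp, unsigned short out[2]) {
    if (cp < 0x10000) {
        out[0] = static_cast<unsigned short>(cp);
        return 1;
    }
    cp -= 0x10000;  // 20 bits remain, split into two 10-bit halves
    out[0] = static_cast<unsigned short>(0xD800 | (cp >> 10));    // high surrogate
    out[1] = static_cast<unsigned short>(0xDC00 | (cp & 0x3FF));  // low surrogate
    return 2;
}
```

With the bytes c5 a1 from the question, decodeUtf8 consumes 2 bytes and yields the code point 0x161, and encodeUtf16 then produces the single unit 0x0161. A conversion loop would call the two functions repeatedly, advancing by the returned byte count, and return false from UTF8toUTF16 as soon as decodeUtf8 returns 0.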


It depends. If the role of the class is to contain such objects (e.g. a container class), then it's very idiomatic, and the normal way of doing things. In most other cases, however, it is considered preferable to use getter and setter methods. Not necessarily named getXxx and setXxx: the most frequent naming convention I've seen uses m_attr for the name of the attribute, and simply attr for the name of both the getter and the setter. (Overload resolution will choose between them according to the number of arguments.)

-- James Kanze
