How to read unicode (utf-8) / binary file line by line
Hi programmers,
I want read line by line a Unicode (UTF-8) text file created by Notepad, i don't want display the Unicode string in the screen, i want just read and compare the strings!.
This code read ANSI file line by line, and compare the strings
What i want
Read test_ansi.txt line by line
if the line = "b" print "YES!"
else print "NO!"
read_ansi_line_by_line.c
#include <stdio.h>
int main()
{
char *inname = "test_ansi.txt";
FILE *infile;
char line_buffer[BUFSIZ]; /* BUFSIZ is defined if you include stdio.h */
char line_number;
infile = fopen(inname, "r");
if (!infile) {
printf("\nfile '%s' not found\n", inname);
return 0;
}
printf("\n%s\n\n", inname);
line_number = 0;
while (fgets(line_buffer, sizeof(line_buffer), infile)) {
++line_number;
/* note that the newline is in the buffer */
if (strcmp("b\n", line_buffer) == 0 ){
printf("%d: YES!\n", line_number);
}else{
printf("%d: NO!\n", line_number,line_buffer);
}
}
printf("\n\nTotal: %d\n", line_number);
return 0;
}
test_ansi.txt
a
b
c
Compiling
gcc -o read_ansi_line_by_line read_ansi_line_by_line.c
Output
test_ansi.txt
1: NO!
2: YES!
3: NO!
Total: 3
Now i need read Unicode (UTF-8) file created by Notepad, after more than 6 months i don't found any good code/library in C can read file coded in UTF-8!, i don't know exactly why but i think the standard C don't support Unicode!
Reading Unicode binary file its OK!, but the probleme is the binary file most be already created in binary mode!, that mean if we want read a Unicode (UTF-8) file created by Notepad we need to translate it from UTF-8 file to BINARY file!
This code write Unicode string to a binary file, NOTE the C file is coded in UTF-8 and compiled by GCC
What i want
Write the Unicode char "ب" to test_bin.dat
create_bin.c
#define UNICODE
#ifdef UNICODE
#define _UNICODE
#else
#define _MBCS
#endif
#include <stdio.h>
#include <wchar.h>
int main()
{
/*Data to be stored in file*/
wchar_t line_buffer[BUFSIZ]=L"ب";
/*Opening file for writing in binary mode*/
FILE *infile=fopen("test_bin.dat","wb");
/*Writing data to file*/
fwrite(line_buffer, 1, 13, infile);
/*Closing File*/
fclose(infile);
return 0;
}
Compiling
gcc -o create_bin create_bin.c
Output
create test_bin.dat
Now i want read the binary file line by line and compare!
What i want
Read test_bin.dat line by line if the line = "ب" print "YES!" else print "NO!"
read_bin_line_by_line.c
#define UNICODE
#ifdef UNICODE
#define _UNICODE
开发者_如何学C#else
#define _MBCS
#endif
#include <stdio.h>
#include <wchar.h>
int main()
{
wchar_t *inname = L"test_bin.dat";
FILE *infile;
wchar_t line_buffer[BUFSIZ]; /* BUFSIZ is defined if you include stdio.h */
infile = _wfopen(inname,L"rb");
if (!infile) {
wprintf(L"\nfile '%s' not found\n", inname);
return 0;
}
wprintf(L"\n%s\n\n", inname);
/*Reading data from file into temporary buffer*/
while (fread(line_buffer,1,13,infile)) {
/* note that the newline is in the buffer */
if ( wcscmp ( L"ب" , line_buffer ) == 0 ){
wprintf(L"YES!\n");
}else{
wprintf(L"NO!\n", line_buffer);
}
}
/*Closing File*/
fclose(infile);
return 0;
}
Output
test_bin.dat
YES!
THE PROBLEM
This method is VERY LONG! and NOT POWERFUL (i m beginner in software engineering)
Please any one know how to read Unicode file ? (i know its not easy!) Please any one know how to convert Unicode file to Binary file ? (simple method) Please any one know how to read Unicode file in binary mode ? (i m not sure)
Thank You.
A nice property of UTF-8 is that you do not need to decode in order to compare it. The order returned from strcmp will be the same whether you decode it first or not. So just read it as raw bytes and run strcmp.
I found a solution to my problem, and I would like to share the solution to any one interested in reading UTF-8 file in C99.
void ReadUTF8(FILE* fp)
{
unsigned char iobuf[255] = {0};
while( fgets((char*)iobuf, sizeof(iobuf), fp) )
{
size_t len = strlen((char *)iobuf);
if(len > 1 && iobuf[len-1] == '\n')
iobuf[len-1] = 0;
len = strlen((char *)iobuf);
printf("(%d) \"%s\" ", len, iobuf);
if( iobuf[0] == '\n' )
printf("Yes\n");
else
printf("No\n");
}
}
void ReadUTF16BE(FILE* fp)
{
}
void ReadUTF16LE(FILE* fp)
{
}
int main()
{
FILE* fp = fopen("test_utf8.txt", "r");
if( fp != NULL)
{
// see http://en.wikipedia.org/wiki/Byte-order_mark for explaination of the BOM
// encoding
unsigned char b[3] = {0};
fread(b,1,2, fp);
if( b[0] == 0xEF && b[1] == 0xBB)
{
fread(b,1,1,fp); // 0xBF
ReadUTF8(fp);
}
else if( b[0] == 0xFE && b[1] == 0xFF)
{
ReadUTF16BE(fp);
}
else if( b[0] == 0 && b[1] == 0)
{
fread(b,1,2,fp);
if( b[0] == 0xFE && b[1] == 0xFF)
ReadUTF16LE(fp);
}
else
{
// we don't know what kind of file it is, so assume its standard
// ascii with no BOM encoding
rewind(fp);
ReadUTF8(fp);
}
}
fclose(fp);
}
fgets() can decode UTF-8 encoded files if you use Visual Studio 2005 and up. Change your code like this:
infile = fopen(inname, "r, ccs=UTF-8");
I know I am bad... but you don't even take under consideration BOM! Most examples here will fail.
EDIT:
Byte Order Marks are a few bytes at the beginnig of the file, which can be used to identify the encoding of the file. Some editors add them, and many times they just break things in faboulous ways (I remember fighting a PHP headers problems for several minutes because of this issue).
Some RTFM: http://en.wikipedia.org/wiki/Byte_order_mark http://blogs.msdn.com/oldnewthing/archive/2004/03/24/95235.aspx What is XML BOM and how do I detect it?
In this article a coding and decoding routine is written and it is explained how the unicode is encoded:
http://www.codeguru.com/cpp/misc/misc/multi-lingualsupport/article.php/c10451/
It can be easily adjusted to C. Simply encode your ANSI or decode the UTF-8 String and make a byte compare
EDIT: After the OP said that it is too hard to rewrite the function from C++ here a template:
What is needed:
+ Free the allocated memory (or wait till the process ends or ignore it)
+ Add the 4 byte functions
+ Tell me that short and int is not guaranteed to be 2 and 4 bytes long (I know, but
C is really stupid !) and finally
+ Find some other errors
#include <stdlib.h>
#include <string.h>
#define MASKBITS 0x3F
#define MASKBYTE 0x80
#define MASK2BYTES 0xC0
#define MASK3BYTES 0xE0
#define MASK4BYTES 0xF0
#define MASK5BYTES 0xF8
#define MASK6BYTES 0xFC
char* UTF8Encode2BytesUnicode(unsigned short* input)
{
int size = 0,
cindex = 0;
while (input[size] != 0)
size++;
// Reserve enough place; The amount of
char* result = (char*) malloc(size);
for (int i=0; i<size; i++)
{
// 0xxxxxxx
if(input[i] < 0x80)
{
result[cindex++] = ((char) input[i]);
}
// 110xxxxx 10xxxxxx
else if(input[i] < 0x800)
{
result[cindex++] = ((char)(MASK2BYTES | input[i] >> 6));
result[cindex++] = ((char)(MASKBYTE | input[i] & MASKBITS));
}
// 1110xxxx 10xxxxxx 10xxxxxx
else if(input[i] < 0x10000)
{
result[cindex++] = ((char)(MASK3BYTES | input[i] >> 12));
result[cindex++] = ((char)(MASKBYTE | input[i] >> 6 & MASKBITS));
result[cindex++] = ((char)(MASKBYTE | input[i] & MASKBITS));
}
}
}
wchar_t* UTF8Decode2BytesUnicode(char* input)
{
int size = strlen(input);
wchar_t* result = (wchar_t*) malloc(size*sizeof(wchar_t));
int rindex = 0,
windex = 0;
while (rindex < size)
{
wchar_t ch;
// 1110xxxx 10xxxxxx 10xxxxxx
if((input[rindex] & MASK3BYTES) == MASK3BYTES)
{
ch = ((input[rindex] & 0x0F) << 12) | (
(input[rindex+1] & MASKBITS) << 6)
| (input[rindex+2] & MASKBITS);
rindex += 3;
}
// 110xxxxx 10xxxxxx
else if((input[rindex] & MASK2BYTES) == MASK2BYTES)
{
ch = ((input[rindex] & 0x1F) << 6) | (input[rindex+1] & MASKBITS);
rindex += 2;
}
// 0xxxxxxx
else if(input[rindex] < MASKBYTE)
{
ch = input[rindex];
rindex += 1;
}
result[windex] = ch;
}
}
char* getUnicodeToUTF8(wchar_t* myString) {
int size = sizeof(wchar_t);
if (size == 1)
return (char*) myString;
else if (size == 2)
return UTF8Encode2BytesUnicode((unsigned short*) myString);
else
return UTF8Encode4BytesUnicode((unsigned int*) myString);
}
just to settle the BOM argument. Here is a file from notepad
[paul@paul-es5 tests]$ od -t x1 /mnt/hgfs/cdrive/test.txt
0000000 ef bb bf 61 0d 0a 62 0d 0a 63
0000012
with a BOM at the start
Personally I dont think there should be a BOM (since its a byte format) but thats not the point
精彩评论