
Convert a UTF-8 (without BOM) XML file into ISO 8859-1

I have an XML file that is UTF-8 encoded without a BOM. In a hex editor it starts with: 3c 3f 78 6d

I buffer my xml file and add the BOM at the beginning:

char* BufferEncoder = (char*)malloc(3);
BufferEncoder[0] = (char)0xef;
BufferEncoder[1] = (char)0xbb;
BufferEncoder[2] = (char)0xbf;
// concatenate into a new buffer containing the BOM followed by the old xml

I then tried to convert from UTF-8 (with BOM) to ISO 8859-1 using these lines of code:

int size = WideCharToMultiByte(28591 /*ISO-8859-1*/, 0, pBuffer, -1, NULL, 0, NULL, 0);
if (size > 0)
{
    char* pBuffer2 = (char*)malloc(size);
    memset(pBuffer2, 0, size);
    WideCharToMultiByte(28591, 0, pBuffer, -1, pBuffer2, size, NULL, 0);
    // .........

This code is not tested yet. Do you think this is the best solution? Any ideas or advice are welcome. Thank you in advance.


As I touched on in my comment: I think this line of thought necessitates a few questions right back at you, so to speak:

  1. Why are you doing this conversion in the first place?

  2. Do you actually know what WideCharToMultiByte() does?

I'll freely admit that I myself am not entirely clear on exactly what WideCharToMultiByte() does; but I'm going to go right ahead and assume that it converts a string of wide characters to a string of multibyte characters. From a quick glance at the documentation, it seems as if it does this into a new buffer, returning the length of the new string.

Which is all well and dandy. The problem is that UTF-8 is not in fact a wide character encoding, and ISO-8859-1 is not a multibyte encoding. UTF-8 is a multibyte encoding, but that doesn't really help you here: WideCharToMultiByte() expects its input to already be wide (UTF-16), so you would first have to decode your UTF-8 bytes with MultiByteToWideChar() before you could ask for ISO-8859-1 back out.
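For what it's worth, here is a rough sketch of that two-step route. It is untested, error handling is omitted, and the function and buffer names are mine, not from your code — treat it as an illustration of the pattern, not a drop-in fix:

#include <windows.h>
#include <string>

// Sketch: UTF-8 bytes -> UTF-16 (wide) -> ISO-8859-1 (code page 28591).
// Both API calls use the usual "ask for the size first, then convert" pattern.
std::string Utf8ToLatin1(const std::string& utf8)
{
    // Step 1: decode UTF-8 into wide characters.
    int wideLen = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), NULL, 0);
    std::wstring wide(wideLen, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &wide[0], wideLen);

    // Step 2: encode the wide string as ISO-8859-1.
    int narrowLen = WideCharToMultiByte(28591, 0, wide.c_str(), (int)wide.size(), NULL, 0, NULL, NULL);
    std::string latin1(narrowLen, '\0');
    WideCharToMultiByte(28591, 0, wide.c_str(), (int)wide.size(), &latin1[0], narrowLen, NULL, NULL);

    return latin1;
}

Note that no BOM juggling is involved at any point; the BOM does not change what encoding the bytes are in.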

My advice, then, is that you read up on character encodings, especially the differences between UTF-8 (multibyte) and UTF-16 (wide).

I also suggest that you find a different interface for whatever you are trying to do, one that actually accepts UTF-8 strings, because any interface that requires ISO-8859-1 strings, especially when dealing with XML, strikes me as hopelessly legacy, bordering on completely insane.

Of course, had you actually stated what you are trying to achieve on the whole, more specific advice could be given.

Edit: If I understand your conundrum correctly, the issue is that you are receiving a correctly formatted and encoded XML file that may contain characters outside of the ASCII range (U+0000…U+007F). If that is the problem, using ISO-8859-1 in any way, shape or form will set you up for the mother of all headaches down the road:

Encoding Issues

If the text file can contain some character outside of the ASCII range, then it can conceivably contain any character outside of the ASCII range. And while UTF-8 can represent any Unicode character, ISO-8859-1 can only represent the first 256 code points (U+0000…U+00FF).
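If you are forced through an ISO-8859-1 interface anyway, you can at least detect when information is being destroyed: WideCharToMultiByte() has an lpUsedDefaultChar output parameter that is set whenever some character could not be represented in the target code page. A rough, untested sketch (the function name is mine):

#include <windows.h>
#include <vector>

// Sketch: returns TRUE if converting a UTF-16 string to ISO-8859-1 (28591)
// would replace at least one character with the substitution character.
BOOL Latin1ConversionIsLossy(const wchar_t* wide, int wideLen)
{
    int size = WideCharToMultiByte(28591, 0, wide, wideLen, NULL, 0, NULL, NULL);
    if (size <= 0)
        return TRUE; // treat outright failure as lossy as well

    std::vector<char> out(size);
    BOOL usedDefault = FALSE;
    WideCharToMultiByte(28591, 0, wide, wideLen, &out[0], size, NULL, &usedDefault);
    return usedDefault;
}

That only tells you that data was lost, though; it does not prevent it.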

In other words: your best case scenario, if you stick to interfaces that mistreat encodings, is irreversible loss of information; the worst case scenario is crashing and burning.

My point is: don't coddle the broken interface, and always use UTF-8.
