Convert a UTF-8 without BOM xml file into ISO 8859-1
I have an XML file encoded in UTF-8 without a BOM. In a hex editor it starts with: 3c 3f 78 6d
I buffer my xml file and add the BOM at the beginning:
char* BufferEncoder = (char*)malloc(3);
memset(BufferEncoder, 0, 3);
// UTF-8 BOM: EF BB BF
BufferEncoder[0]=(char)0xef;
BufferEncoder[1]=(char)0xbb;
BufferEncoder[2]=(char)0xbf;
// concatenate into a new Buffer containing old xml and the BOM
I then tried to convert from UTF-8 with BOM to ISO 8859-1 using these lines of code:
int size = WideCharToMultiByte(28591 /*ISO-8859-1*/, 0, pBuffer, -1, NULL, 0, NULL, 0);
if (size>0)
{
char* pBuffer2 = (char*)malloc(size);
memset(pBuffer2, 0, size);
WideCharToMultiByte(28591, 0,pBuffer,-1, pBuffer2, size, NULL, 0);
// .........
This code is not yet tested. Do you think that this is the best solution? Any idea or advice is welcome. Thank you in advance.
As I touched on in my comment: I think this line of thought necessitates a few questions right back at you, so to speak:
Why are you doing this conversion in the first place?
Do you actually know what WideCharToMultiByte() does?
I'll freely admit that I myself am not entirely clear on exactly what WideCharToMultiByte() does; but I'm going to go right ahead and assume that it converts a string of wide characters to a string of multibyte characters. From a quick glance at the documentation, it seems as if it does this into a new buffer, returning the length of the new string.
Which is all well and dandy. The problem is that UTF-8 is not in fact a wide character encoding; and ISO-8859-1 is not a multibyte encoding. UTF-8 is a multibyte encoding; but that doesn't really help you much in this case.
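To make that concrete, here is a minimal sketch of what the conversion would have to do: decode each UTF-8 sequence into a code point, then write it out as a single ISO-8859-1 byte if it fits. The name utf8_to_latin1 and the '?' substitution policy are my own assumptions, not anything from the question, and the sketch does not reject overlong encodings or surrogates. On Windows, the equivalent route would presumably be MultiByteToWideChar() with CP_UTF8 to get UTF-16, followed by WideCharToMultiByte() with code page 28591.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the question): converts NUL-terminated
 * UTF-8 in src to ISO 8859-1 in dst, whose capacity dst_cap includes the
 * terminator. A leading UTF-8 BOM is skipped. Code points above U+00FF
 * and malformed bytes are written as '?'. Returns the number of '?'
 * substitutions, or -1 if dst is too small. */
long utf8_to_latin1(const char *src, char *dst, size_t dst_cap)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t out = 0;
    long lost = 0;

    assert(src != NULL && dst != NULL);

    /* Skip the BOM (EF BB BF) if present; ISO 8859-1 has no use for it. */
    if (s[0] == 0xEF && s[1] == 0xBB && s[2] == 0xBF)
        s += 3;

    while (*s != 0) {
        unsigned long cp;
        int extra; /* number of continuation bytes still expected */

        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else                          { cp = 0xFFFD;    extra = 0; } /* stray byte */
        s++;

        while (extra > 0) {
            if ((*s & 0xC0) != 0x80) { cp = 0xFFFD; break; } /* truncated sequence */
            cp = (cp << 6) | (unsigned long)(*s & 0x3F);
            s++;
            extra--;
        }

        if (out + 1 >= dst_cap)
            return -1; /* no room for this byte plus the terminator */

        if (cp <= 0xFF) {
            dst[out++] = (char)cp; /* Latin-1 byte value equals the code point */
        } else {
            dst[out++] = '?';      /* information is lost here */
            lost++;
        }
    }

    if (out >= dst_cap)
        return -1;
    dst[out] = '\0';
    return lost;
}
```

Called on the bytes EF BB BF 63 61 66 C3 A9 ("café" with a BOM), this writes 63 61 66 E9 and returns 0. Called on E2 82 AC (the euro sign), it writes '?' and returns 1, because U+20AC has no slot in ISO-8859-1.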
My advice, then, is that you read up on character encodings, especially on the differences between UTF-8 (multibyte) and UTF-16 (wide).
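As a rough illustration of that difference, the sketch below computes how many code units one code point takes in each encoding. The function names are hypothetical, chosen only for this example. The '<' that opens your file (3c) is one byte in UTF-8, but would be one 16-bit unit (3c 00 in little-endian order) in UTF-16.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes needed to encode code point cp in UTF-8,
 * a multibyte encoding built from 8-bit units (1 to 4 per character). */
size_t utf8_length(unsigned long cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Hypothetical helper: 16-bit units needed to encode cp in UTF-16,
 * a wide encoding (1 unit, or 2 when a surrogate pair is required). */
size_t utf16_length(unsigned long cp)
{
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 1 : 2;
}
```

For U+00E9 (é) this gives 2 bytes in UTF-8 and 1 unit in UTF-16; for U+20AC (the euro sign) it gives 3 bytes and 1 unit. A buffer of UTF-8 bytes is therefore not something a function expecting wide characters can read as is.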
I also suggest that you find a different interface for whatever you are trying to do that actually accepts UTF-8 strings; because any interface that requires ISO-8859-1 strings, especially when dealing with XML, strikes me as being insanely legacy-y, bordering on completely insane.
Of course, had you actually stated what you were trying, on the whole, to achieve, more specific advice could be given.
Edit: If I understand your conundrum correctly, the issue is that you are getting a correctly formatted and encoded XML file that may contain characters outside of the ASCII range (U+0000…U+007F). If this is the problem, using ISO-8859-1 in any way, shape or form will set you up for the mother of all headaches down the road:
Encoding Issues
If the text file can contain some character outside of the ASCII range, then it can conceivably contain any character outside of the ASCII range. And while UTF-8 can represent any character, this is not true of ISO-8859-1.
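If you want to know how much a given file would lose, a quick check is possible without a full decoder. ISO-8859-1 covers exactly U+0000 through U+00FF, and in well-formed UTF-8 those code points never use a lead byte above C3, so every byte of C4 or higher marks a character that will not survive. This is only a sketch that assumes well-formed input; count_beyond_latin1 is a name I made up for it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts the characters in well-formed,
 * NUL-terminated UTF-8 text that ISO 8859-1 cannot represent
 * (code points above U+00FF). Continuation bytes are 80..BF and the
 * lead bytes for U+0080..U+00FF are C2 and C3, so any byte >= C4
 * starts a character outside Latin-1. */
size_t count_beyond_latin1(const char *utf8)
{
    size_t n = 0;
    const unsigned char *p;

    assert(utf8 != NULL);

    for (p = (const unsigned char *)utf8; *p != 0; p++) {
        if (*p >= 0xC4)
            n++;
    }
    return n;
}
```

For "café" (63 61 66 C3 A9) the count is 0, so that text would convert cleanly. For "Łódź" (C5 81 C3 B3 64 C5 BA) it is 2: the ó fits in ISO-8859-1, but Ł and ź do not.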
In other words, your best case scenario if you stick to interfaces that mistreat encodings is irreversible lossage of information; worst case scenario is crashage and burnage.
My point is: Don't coddle the broken interface, and Never Don't Use UTF-8.