
Convert a UTF-8 (without BOM) XML file into ISO 8859-1

I have an XML file that is UTF-8 encoded without a BOM. In a hex editor it starts with: 3c 3f 78 6d

I buffer my xml file and add the BOM at the beginning:

char* BufferEncoder = (char*)malloc(3);
BufferEncoder[0] = (char)0xef;
BufferEncoder[1] = (char)0xbb;
BufferEncoder[2] = (char)0xbf;
// concatenate into a new buffer containing the BOM followed by the old xml

I then tried to convert from UTF-8 (with BOM) to ISO 8859-1 using these lines of code:

int size = WideCharToMultiByte(28591 /*ISO-8859-1*/, 0, pBuffer, -1, NULL, 0, NULL, 0);
if (size > 0)
{
    char* pBuffer2 = (char*)malloc(size);
    memset(pBuffer2, 0, size);
    WideCharToMultiByte(28591, 0, pBuffer, -1, pBuffer2, size, NULL, 0);
    // .........

This code is not tested yet. Do you think this is the best solution? Any ideas or advice are welcome. Thank you in advance.


As I touched on in my comment: I think this line of thought necessitates a few questions right back at you, so to speak:

  1. Why are you doing this conversion in the first place?

  2. Do you actually know what WideCharToMultiByte() does?

I'll freely admit that I myself am not entirely clear on exactly what WideCharToMultiByte() does; but I'm going to go right ahead and assume that it converts a string of wide characters to a string of multibyte characters. From a quick glance at the documentation, it seems as if it does this into a new buffer, returning the length of the new string.

Which is all well and dandy. The problem is that UTF-8 is not in fact a wide character encoding, and ISO-8859-1 is not a multibyte encoding. UTF-8 is a multibyte encoding, but that doesn't really help you here: WideCharToMultiByte() expects its input to already be wide (UTF-16), so you would first have to decode your UTF-8 bytes with MultiByteToWideChar() before you could ask for ISO-8859-1 back out.
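For what it's worth, here is a rough sketch of that two-step route. It is untested, error handling is omitted, and the function and buffer names are mine, not from your code — treat it as an illustration of the pattern, not a drop-in fix:

#include <windows.h>
#include <string>

// Sketch: UTF-8 bytes -> UTF-16 (wide) -> ISO-8859-1 (code page 28591).
// Both API calls use the usual "ask for the size first, then convert" pattern.
std::string Utf8ToLatin1(const std::string& utf8)
{
    // Step 1: decode UTF-8 into wide characters.
    int wideLen = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), NULL, 0);
    std::wstring wide(wideLen, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &wide[0], wideLen);

    // Step 2: encode the wide string as ISO-8859-1.
    int narrowLen = WideCharToMultiByte(28591, 0, wide.c_str(), (int)wide.size(), NULL, 0, NULL, NULL);
    std::string latin1(narrowLen, '\0');
    WideCharToMultiByte(28591, 0, wide.c_str(), (int)wide.size(), &latin1[0], narrowLen, NULL, NULL);

    return latin1;
}

Note that no BOM juggling is involved at any point; the BOM does not change what encoding the bytes are in.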

My advice, then, is that you read up on character encodings, especially the differences between UTF-8 (multibyte) and UTF-16 (wide).

I also suggest that you find a different interface for whatever you are trying to do, one that actually accepts UTF-8 strings, because any interface that requires ISO-8859-1 strings, especially when dealing with XML, strikes me as hopelessly legacy, bordering on completely insane.

Of course, had you actually stated what you are trying to achieve on the whole, more specific advice could be given.

Edit: If I understand your conundrum correctly, the issue is that you are receiving a correctly formatted and encoded XML file that may contain characters outside of the ASCII range (U+0000…U+007F). If that is the problem, using ISO-8859-1 in any way, shape or form will set you up for the mother of all headaches down the road:

Encoding Issues

If the text file can contain some character outside of the ASCII range, then it can conceivably contain any character outside of the ASCII range. And while UTF-8 can represent any Unicode character, ISO-8859-1 can only represent the first 256 code points (U+0000…U+00FF).
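If you are forced through an ISO-8859-1 interface anyway, you can at least detect when information is being destroyed: WideCharToMultiByte() has an lpUsedDefaultChar output parameter that is set whenever some character could not be represented in the target code page. A rough, untested sketch (the function name is mine):

#include <windows.h>
#include <vector>

// Sketch: returns TRUE if converting a UTF-16 string to ISO-8859-1 (28591)
// would replace at least one character with the substitution character.
BOOL Latin1ConversionIsLossy(const wchar_t* wide, int wideLen)
{
    int size = WideCharToMultiByte(28591, 0, wide, wideLen, NULL, 0, NULL, NULL);
    if (size <= 0)
        return TRUE; // treat outright failure as lossy as well

    std::vector<char> out(size);
    BOOL usedDefault = FALSE;
    WideCharToMultiByte(28591, 0, wide, wideLen, &out[0], size, NULL, &usedDefault);
    return usedDefault;
}

That only tells you that data was lost, though; it does not prevent it.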

In other words: your best case scenario, if you stick to interfaces that mistreat encodings, is irreversible loss of information; the worst case scenario is crashing and burning.

My point is: don't coddle the broken interface, and always use UTF-8.
