Convert ISO-8859-1 strings to UTF-8 in C/C++
You would think this would be readily available, but I'm having a hard time finding a simple library function that will convert a C or C++ string from ISO-8859-1 encoding to UTF-8. I'm reading data that is in 8-bit ISO-8859-1 encoding, but need to convert it to a UTF-8 string for use in an SQLite database and eventually an Android app.
I found one commercial product, but it's beyond my budget at this time.
If your source encoding will always be ISO-8859-1, this is trivial. Here's a loop:
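unsigned char *in, *out;   /* in: NUL-terminated ISO-8859-1 input; out: buffer of at least 2*strlen(in)+1 bytes */
while (*in)
    if (*in<128) *out++=*in++;   /* ASCII: copy unchanged */
    else *out++=0xc2+(*in>0xbf), *out++=(*in++&0x3f)+0x80;   /* lead byte 0xC2 or 0xC3, then one continuation byte */
*out = 0;   /* terminate the output string */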
For safety you need to ensure that the output buffer is twice as large as the input buffer, or else include a size limit and check it in the loop condition.
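As a rough illustration of the "size limit" variant mentioned above, here is a minimal sketch that checks the remaining space before every write. The function name latin1_to_utf8_bounded and its return convention are made up for this example; they are not from any library.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (name and return convention are assumptions for this sketch).
// Converts the NUL-terminated ISO-8859-1 string `in` to UTF-8 in `out`, writing at
// most `out_size` bytes including the terminating NUL.
// Returns the number of bytes written (not counting the NUL), or (size_t)-1 if the
// output buffer is too small.
std::size_t latin1_to_utf8_bounded(const unsigned char* in,
                                   unsigned char* out,
                                   std::size_t out_size)
{
    assert(in != nullptr && out != nullptr);
    const std::size_t kTooSmall = static_cast<std::size_t>(-1);
    if (out_size == 0) return kTooSmall;

    std::size_t used = 0;
    for (; *in; ++in) {
        // ASCII needs one byte, everything else in Latin-1 needs two.
        const std::size_t need = (*in < 0x80) ? 1 : 2;
        // Always keep one byte in reserve for the terminating NUL.
        if (used + need + 1 > out_size) return kTooSmall;

        if (*in < 0x80) {
            out[used++] = *in;
        } else {
            out[used++] = static_cast<unsigned char>(0xC0 | (*in >> 6));
            out[used++] = static_cast<unsigned char>(0x80 | (*in & 0x3F));
        }
    }
    out[used] = 0;
    return used;
}
```

On failure the buffer may hold a partial, unterminated result, so a caller would normally discard it and retry with a larger buffer.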
For C++, I use this:
#include <cstdint>
#include <string>

std::string iso_8859_1_to_utf8(const std::string &str)
{
    std::string strOut;
    for (std::string::const_iterator it = str.begin(); it != str.end(); ++it)
    {
        std::uint8_t ch = *it;
        if (ch < 0x80) {
            // ASCII: one byte, unchanged
            strOut.push_back(ch);
        }
        else {
            // 0x80..0xFF: two bytes, 1100001x 10xxxxxx
            strOut.push_back(0xc0 | (ch >> 6));
            strOut.push_back(0x80 | (ch & 0x3f));
        }
    }
    return strOut;
}
You can use the boost::locale library:
http://www.boost.org/doc/libs/1_49_0/libs/locale/doc/html/charset_handling.html
The code would look like this:
#include <boost/locale.hpp>
std::string utf8_string = boost::locale::conv::to_utf<char>(latin1_string,"Latin1");
The C++03 standard does not provide functions to convert directly between specific charsets.
Depending on your OS, you can use iconv() on Linux, or MultiByteToWideChar() followed by WideCharToMultiByte() on Windows. ICU is an open-source library with extensive support for string conversion.
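For the iconv() route, a minimal sketch might look like the following. It assumes a POSIX system with the glibc-style prototype (char** for the input buffer; some platforms declare it as const char** or need -liconv at link time). The wrapper name latin1_to_utf8_iconv is made up for this example.

```cpp
#include <iconv.h>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical wrapper (the name is an assumption for this sketch).
// Converts an ISO-8859-1 std::string to UTF-8 using POSIX iconv.
std::string latin1_to_utf8_iconv(const std::string& in)
{
    // iconv_open takes the target encoding first, then the source encoding.
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // Worst case for Latin-1 input: every byte becomes two UTF-8 bytes.
    std::string out(in.size() * 2, '\0');

    char* src = const_cast<char*>(in.data());  // iconv does not modify the input
    std::size_t src_left = in.size();
    char* dst = &out[0];
    std::size_t dst_left = out.size();

    const std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (std::size_t)-1)
        throw std::runtime_error("iconv failed");

    assert(src_left == 0);               // all input was consumed
    out.resize(out.size() - dst_left);   // trim the unused tail
    return out;
}
```

Opening a conversion descriptor per call keeps the sketch short; code that converts many strings would more likely open it once and reuse it.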
The Unicode folks have some tables that might help if faced with Windows 1252 instead of true ISO-8859-1. The definitive one seems to be this one which maps every code point in CP1252 to a code point in Unicode. Encoding the Unicode as UTF-8 is a straightforward exercise.
It would not be difficult to parse that table directly and form a lookup table from it at compile time.
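A hand-written version of such a lookup table could be sketched as below. Only bytes 0x80..0x9F differ between CP1252 and ISO-8859-1, so the table has 32 entries. The five bytes that CP1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) are passed through as the matching C1 control code points here; that is an assumption of this sketch, and rejecting them would be an equally reasonable choice. The names kCp1252High, append_utf8 and cp1252_to_utf8 are made up for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

namespace {

// Unicode code points for CP1252 bytes 0x80..0x9F.
// Undefined bytes map to the same-valued C1 control (assumption, see lead-in).
const std::uint16_t kCp1252High[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

// Appends the UTF-8 encoding of a code point below U+10000 (1 to 3 bytes).
void append_utf8(std::string& out, std::uint32_t cp)
{
    assert(cp < 0x10000);
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

}  // namespace

// Converts a Windows-1252 string to UTF-8.
std::string cp1252_to_utf8(const std::string& in)
{
    std::string out;
    out.reserve(in.size() * 3);  // worst case: three bytes per input byte
    for (unsigned char ch : in) {
        const std::uint32_t cp =
            (ch >= 0x80 && ch <= 0x9F) ? kCp1252High[ch - 0x80] : ch;
        append_utf8(out, cp);
    }
    return out;
}
```

For example, the CP1252 euro sign (byte 0x80) maps to U+20AC, which encodes as the three bytes E2 82 AC.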
The code
int
isolat1ToUTF8(unsigned char* out, int *outlen,
              const unsigned char* in, int *inlen) {
    unsigned char* outstart = out;
    const unsigned char* base = in;
    const unsigned char* processed = in;
    unsigned char* outend = out + *outlen;
    const unsigned char* inend;
    unsigned int c;
    int bits;
    inend = in + (*inlen);
    while ((in < inend) && (out - outstart + 5 < *outlen)) {
        c= *in++;
        /* assertion: c is a single UCS-4 value */
        if (out >= outend)
            break;
        if (c < 0x80) { *out++= c; bits= -6; }
        else { *out++= ((c >> 6) & 0x1F) | 0xC0; bits= 0; }
        for ( ; bits >= 0; bits-= 6) {
            if (out >= outend)
                break;
            *out++= ((c >> bits) & 0x3F) | 0x80;
        }
        processed = (const unsigned char*) in;
    }
    *outlen = out - outstart;   /* bytes written */
    *inlen = processed - base;  /* bytes consumed */
    return(0);
}
I think this could be helpful. Sorry about my previous comment, which was deleted. I can give you the link if needed; there is a full explanation in the .c file that I took this from. Cheers!
ISO-8859-1 to UTF-8 involves nothing more than the encoding algorithm because ISO-8859-1 is a subset of Unicode. So you already have the Unicode code points. Check Wikipedia for the algorithm.
The C++ aspects -- integrating that with iostreams -- are much harder.
I suggest you walk around that mountain instead of trying to drill through it or climb it, that is, implement a simple string to string converter.
Cheers & hth.,