How to convert UTF-8 to ASCII in C++?

I am getting a response from the server in UTF-8 but am not able to read it. How do I convert UTF-8 to ASCII in C++?


First, note that ASCII is a 7-bit format. There are 8-bit encodings; if you are after one of these (such as ISO 8859-1), you'll need to be more specific.

To convert an ASCII string to UTF-8, do nothing: they are the same. So if your UTF-8 string is composed only of ASCII characters, then it is already an ASCII string, and no conversion is necessary.
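Whether a given UTF-8 string is already pure ASCII can be checked byte by byte. A minimal sketch, where the function name `is_ascii` is an assumption made for illustration:

```cpp
#include <string>

// Returns true when every byte is below 0x80, i.e. the UTF-8 data is
// already valid 7-bit ASCII and can be used unchanged.
bool is_ascii(const std::string& utf8)
{
    for (unsigned char byte : utf8)
    {
        if (byte >= 0x80)
            return false;
    }
    return true;
}
```

If this returns true for the server response, no conversion step is needed at all.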

If the UTF-8 string contains non-ASCII characters (anything with accents or non-Latin characters), there is no way to convert it to ASCII. (You may be able to convert it to one of the ISO encodings perhaps.)

There are ways to strip the accents from Latin characters to get at least some resemblance in ASCII. Alternatively if you just want to delete the non-ASCII characters, simply delete all bytes with values >= 128 from the utf-8 string.
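Deleting the non-ASCII bytes could look roughly like this. It is only a sketch: the name `strip_non_ascii` is made up here, and the non-ASCII characters are dropped without any replacement.

```cpp
#include <string>

// Copies only the bytes below 0x80. Every byte of a multibyte UTF-8
// sequence has its high bit set, so whole non-ASCII characters vanish.
std::string strip_non_ascii(const std::string& utf8)
{
    std::string ascii;
    ascii.reserve(utf8.size());
    for (unsigned char byte : utf8)
    {
        if (byte < 0x80)
            ascii.push_back(static_cast<char>(byte));
    }
    return ascii;
}
```

This works without decoding because UTF-8 never uses a byte below 0x80 as part of a multibyte sequence.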


This example works under Windows (you did not mention your target operating system):

    // The sample buffer contains "©ha®a©te®s" in UTF-8
    unsigned char buffer[15] = { 0xc2, 0xa9, 0x68, 0x61, 0xc2, 0xae, 0x61, 0xc2, 0xa9, 0x74, 0x65, 0xc2, 0xae, 0x73, 0x00 };
    // utf8 is the pointer to your UTF-8 string
    char* utf8 = (char*)buffer;
    // convert multibyte UTF-8 to wide string UTF-16
    int length = MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)utf8, -1, NULL, 0);
    if (length > 0)
    {
        wchar_t* wide = new wchar_t[length];
        MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)utf8, -1, wide, length);

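        // convert it to ANSI, use setlocale() to set your locale, if not set
        // one wide character can need up to MB_CUR_MAX bytes (from <stdlib.h>) in a multibyte locale
        size_t convertedChars = 0;
        size_t ansiSize = (size_t)length * MB_CUR_MAX;
        char* ansi = new char[ansiSize];
        wcstombs_s(&convertedChars, ansi, ansiSize, wide, _TRUNCATE);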
    }

Remember to delete[] wide and/or ansi when they are no longer needed. Since this is Unicode, I'd recommend sticking to wchar_t* instead of char* unless you are certain that the input buffer contains only characters that belong to the same ANSI subset.


If the string contains characters which do not exist in ASCII, then there is nothing you can do, because, well, those characters do not exist in ASCII.

If the string contains only characters which do exist in ASCII, then there is nothing you need to do, because the string is already in the ASCII encoding: UTF-8 was specifically designed to be backwards-compatible with ASCII in such a way that any character which is in ASCII has the exact same encoding in UTF-8 as it has in ASCII, and that any character which is not in ASCII can never have an encoding which is valid ASCII, i.e. will always have an encoding which is illegal in ASCII (specifically, any non-ASCII character will be encoded as a sequence of 2–4 octets all of which have their most significant bit set, i.e. have an integer value > 127).

Instead of simply trying to convert the string, you could try to transliterate the string. Most languages on this planet have some form of ASCII transliteration scheme that at least keeps the text somewhat comprehensible. For example, my first name is "Jörg" and its ASCII transliteration would be "Joerg". The name of the creator of the Ruby Programming Language is "まつもとゆきひろ" and its ASCII transliteration would be "Matsumoto Yukihiro". However, please note that you will lose information. For example, the German sz-ligature gets transliterated to "ss", so the word "Maße" (measurements) gets transliterated to "Masse". However, "Masse" (mass, in the physicist's sense, not the Christian's) is also a word. As another example, Turkish has 4 "i"s (small and capital, with and without dot) and ASCII only has 2 (small with dot and capital without dot), therefore you will either lose information about the dot or whether or not it was a capital letter.
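A transliteration step can be sketched with a lookup table. The table below is hypothetical and deliberately tiny (German only); a real one would be far larger and language-specific. The name `transliterate` and the `?` placeholder for unknown characters are assumptions as well.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sample table: UTF-8 byte sequence -> ASCII replacement.
static const std::vector<std::pair<std::string, std::string>> kTable = {
    {"\xC3\xA4", "ae"}, // a with diaeresis
    {"\xC3\xB6", "oe"}, // o with diaeresis
    {"\xC3\xBC", "ue"}, // u with diaeresis
    {"\xC3\x9F", "ss"}, // sharp s
};

std::string transliterate(const std::string& utf8)
{
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size())
    {
        unsigned char byte = static_cast<unsigned char>(utf8[i]);
        if (byte < 0x80) // ASCII passes through unchanged
        {
            out.push_back(utf8[i]);
            ++i;
            continue;
        }
        bool matched = false;
        for (const auto& entry : kTable)
        {
            if (utf8.compare(i, entry.first.size(), entry.first) == 0)
            {
                out += entry.second;
                i += entry.first.size();
                matched = true;
                break;
            }
        }
        if (!matched)
        {
            // Unknown character: emit '?' and skip its continuation bytes.
            out.push_back('?');
            ++i;
            while (i < utf8.size() &&
                   (static_cast<unsigned char>(utf8[i]) & 0xC0) == 0x80)
                ++i;
        }
    }
    return out;
}
```

With this table, the UTF-8 bytes of "Jörg" come out as "Joerg" and those of "Maße" as "Masse", which shows the information loss described above.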

So, the only way which will not lose information (in other words: corrupt data), is to somehow encode the non-ASCII characters into sequences of ASCII characters. There are many popular encoding schemes: SGML entity references, MIME, Unicode escape sequences, ΤΕΧ or LaΤΕΧ. So, you would encode the data as it enters your system and decode it when it leaves the system.
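As an illustration of the escape-sequence idea, here is a sketch that rewrites every non-ASCII character as `\uXXXX` (or `\UXXXXXXXX` above U+FFFF) and doubles existing backslashes so the result stays unambiguous. The function name `escape_non_ascii` and the exact output format are assumptions, and malformed input is simply mapped to U+FFFD rather than validated rigorously.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

std::string escape_non_ascii(const std::string& utf8)
{
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size())
    {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if (lead < 0x80)
        {
            if (utf8[i] == '\\')
                out += "\\\\"; // keep literal backslashes distinguishable
            else
                out.push_back(utf8[i]);
            ++i;
            continue;
        }

        // Work out how many continuation bytes follow the lead byte.
        int extra = 0;
        unsigned long cp = 0xFFFD; // used for stray or invalid lead bytes
        if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        ++i;

        for (int k = 0; k < extra; ++k)
        {
            if (i >= utf8.size() ||
                (static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80)
            {
                cp = 0xFFFD; // truncated sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i]) & 0x3F);
            ++i;
        }

        char buf[16];
        if (cp <= 0xFFFF)
            std::snprintf(buf, sizeof buf, "\\u%04lX", cp);
        else
            std::snprintf(buf, sizeof buf, "\\U%08lX", cp);
        out += buf;
    }
    return out;
}
```

The output contains only ASCII bytes, and a matching decoder at the other end of the system can restore the original text.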

Of course, the easiest way would be to simply fix your system.


UTF-8 is an encoding that can map every unicode character. ASCII only supports a very small subset of unicode.

For the subset of unicode that is ASCII, the mapping from UTF-8 to ASCII is a direct one-to-one byte mapping, so if the server sends you a document that only contains ASCII characters in UTF-8 encoding then you can directly read that as ASCII.

If the response contains non-ASCII characters then, whatever you do, you won't be able to express them in ASCII. To filter these out of a UTF-8 stream you can just filter out any byte >= 128 (0x80 hex).


Check this utf-8 String Library, forget about converting it to ASCII.


Note that UTF-8 files come in two flavors: with a BOM (UTF8_with_BOM) and without one (UTF8_without_BOM). They need to be handled differently when converting to ANSI. The following functions should work on Windows.

  • UTF8_with_BOM to ANSI

    #include <windows.h>
    #include <cstring>
    #include <fstream>
    #include <string>
    using namespace std;

    // change a string's encoding from UTF-8 to ANSI (the active code page);
    // the caller owns the returned buffer and must delete[] it
    char* change_encoding_from_UTF8_to_ANSI(char* szU8)
    {
        // UTF-8 -> UTF-16
        int u8Len = (int)strlen(szU8);
        int wcsLen = ::MultiByteToWideChar(CP_UTF8, 0, szU8, u8Len, NULL, 0);
        wchar_t* wszString = new wchar_t[wcsLen + 1];
        ::MultiByteToWideChar(CP_UTF8, 0, szU8, u8Len, wszString, wcsLen);
        wszString[wcsLen] = L'\0';

        // UTF-16 -> ANSI
        int ansiLen = ::WideCharToMultiByte(CP_ACP, 0, wszString, wcsLen, NULL, 0, NULL, NULL);
        char* szAnsi = new char[ansiLen + 1];
        ::WideCharToMultiByte(CP_ACP, 0, wszString, wcsLen, szAnsi, ansiLen, NULL, NULL);
        szAnsi[ansiLen] = '\0';

        delete[] wszString;
        return szAnsi;
    }

    void change_encoding_from_UTF8_with_BOM_to_ANSI(const char* filename)
    {
        ifstream infile(filename);
        string strLine;
        string strResult;
        // testing getline() itself avoids appending a spurious extra line at EOF
        while (getline(infile, strLine))
        {
            strResult += strLine + "\n";
        }
        infile.close();

        // the first 3 bytes (ef bb bf) are the UTF-8 byte order mark;
        // drop them so they do not end up in the output
        if (strResult.compare(0, 3, "\xEF\xBB\xBF") == 0)
        {
            strResult.erase(0, 3);
        }

        // +1 leaves room for the terminating NUL written by strcpy
        char* changeTemp = new char[strResult.length() + 1];
        strcpy(changeTemp, strResult.c_str());
        char* changeResult = change_encoding_from_UTF8_to_ANSI(changeTemp);
        strResult = changeResult;
        delete[] changeTemp;
        delete[] changeResult;

        ofstream outfile(filename);
        outfile.write(strResult.c_str(), strResult.length());
        outfile.close();
    }
    
  • UTF8_without_BOM to ANSI

    void change_encoding_from_UTF8_without_BOM_to_ANSI(const char* filename)
    {
        ifstream infile(filename);
        string strLine;
        string strResult;
        // testing getline() itself avoids appending a spurious extra line at EOF
        while (getline(infile, strLine))
        {
            strResult += strLine + "\n";
        }
        infile.close();

        // +1 leaves room for the terminating NUL written by strcpy
        char* changeTemp = new char[strResult.length() + 1];
        strcpy(changeTemp, strResult.c_str());
        char* changeResult = change_encoding_from_UTF8_to_ANSI(changeTemp);
        strResult = changeResult;
        // release the temporary buffers once the result has been copied
        delete[] changeTemp;
        delete[] changeResult;

        ofstream outfile(filename);
        outfile.write(strResult.c_str(), strResult.length());
        outfile.close();
    }
    


UTF-8 is backwards compatible with ASCII meaning all ASCII characters are encoded as single unchanged byte values in UTF-8. If the text should be ASCII but you are unable to read it then there must be another issue.


ASCII is a code page representing 128 characters and control codes, whereas UTF-8 can represent any character in the Unicode standard, which goes far beyond ASCII's capabilities. So the answer to your question is: not possible, unless you have more specifications for the data source.


As for the phrase

"If the string contains characters which do not exist in ASCII, then there is nothing you can do, because, well, those characters do not exist in ASCII."

it's wrong.

UTF-8 is a multibyte encoding and can carry symbols from more than two alphabets (languages). In practice, a text usually uses either a single language (typically English) or two languages, one of which is English.

  • The first case is plain ASCII characters (the same in any encoding).
  • The second case can be expressed in the single-byte encoding that corresponds to the second language, provided it is not Chinese or Arabic.

Under the conditions above you can convert UTF-8 to single-byte characters. C++ has no standard function for this, so you have to do it manually. It is easy to tell multibyte symbols from single-byte ones: the high bit of the first byte is set for multibyte sequences and unset otherwise (a two-byte sequence starts with a byte of the form 110xxxxx).
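A manual conversion of this kind might look like the sketch below. It assumes the target single-byte encoding is ISO 8859-1 (Latin-1), which the answer does not name; the function name `utf8_to_latin1` and the `?` placeholder are likewise assumptions. Only two-byte sequences starting with 0xC2 or 0xC3 encode code points up to U+00FF, so everything else is replaced.

```cpp
#include <cstddef>
#include <string>

std::string utf8_to_latin1(const std::string& utf8)
{
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size())
    {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if (lead < 0x80)
        {
            // High bit clear: a single-byte (ASCII) character.
            out.push_back(utf8[i]);
            ++i;
        }
        else if ((lead == 0xC2 || lead == 0xC3) && i + 1 < utf8.size() &&
                 (static_cast<unsigned char>(utf8[i + 1]) & 0xC0) == 0x80)
        {
            // Two-byte sequence for U+0080..U+00FF: combine the payload bits.
            unsigned char next = static_cast<unsigned char>(utf8[i + 1]);
            out.push_back(static_cast<char>(((lead & 0x03) << 6) | (next & 0x3F)));
            i += 2;
        }
        else
        {
            // Not representable in Latin-1: emit '?' and skip the sequence.
            out.push_back('?');
            ++i;
            while (i < utf8.size() &&
                   (static_cast<unsigned char>(utf8[i]) & 0xC0) == 0x80)
                ++i;
        }
    }
    return out;
}
```

For other languages the arithmetic differs, and a lookup table for the chosen code page would be needed instead.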
