Convert std::string to Unicode in Linux

2023-03-06 05:56 Q&A author:

EDIT I modified the question after realizing it was wrong to begin with.

I'm porting part of a C# application to Linux, where I need to get the bytes of a UTF-16 string:

string myString = "ABC";
byte[] bytes = Encoding.Unicode.GetBytes(myString);

So that the bytes array is now:

"65 00 66 00 67 00" (bytes)

How can I achieve the same in C++ on Linux? I have a myString defined as std::string, and it seems that std::wstring on Linux is 4 bytes?

Your question isn't really clear, but I'll try to clear up some confusion.

Introduction

This is the status of character set handling in C (inherited by C++) after the 1995 amendment to the C standard:

the character set used is given by the current locale
wchar_t is meant to store a code point
char is meant to store a multibyte encoded form (one constraint, for instance, is that characters in the basic character set must be encoded in one byte)
string literals are encoded in an implementation-defined manner. If they use characters outside the basic character set, you can't assume they are valid in every locale.

Thus with a 16-bit wchar_t you are restricted to the BMP. Storing UTF-16 surrogates in wchar_t is not compliant, but I think MS and IBM are more or less forced to do so because they believed the Unicode consortium when it said Unicode would forever be a 16-bit character set. Those who delayed their Unicode support tend to use a 32-bit wchar_t.
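To make the surrogate mechanism concrete, here is a minimal sketch that turns a std::wstring into UTF-16 code units. It assumes wchar_t is 32 bits wide and holds Unicode code points, which is the case with glibc on Linux but is not guaranteed by the standard. The name to_utf16 and the choice to throw on invalid values are mine, not part of any library.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper. Assumes each wchar_t holds one Unicode code point
// (true for glibc's 32-bit wchar_t, not guaranteed by the standard).
std::u16string to_utf16(std::wstring const& s)
{
    std::u16string out;
    out.reserve(s.size());
    for (std::wstring::size_type i = 0; i < s.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(s[i]);
        if (cp >= 0xD800 && cp <= 0xDFFF)
            throw std::invalid_argument("surrogate value is not a code point");
        if (cp > 0x10FFFF)
            throw std::invalid_argument("value is outside the Unicode range");
        if (cp < 0x10000) {
            // BMP code point: one 16-bit unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Supplementary code point: split the 20 remaining bits
            // into a high and a low surrogate.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

For example, the single code point U+1F600 becomes the two units D83D DE00, while "ABC" stays three units long.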

Newer standards don't change much. Mostly, C++11 and C11 add literals for UTF-8, UTF-16 and UTF-32 encoded strings (u8"...", u"..." and U"...") and types for 16-bit and 32-bit characters (char16_t and char32_t). There is little or no additional support for Unicode in the standard libraries.
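When the text is available as a UTF-16 literal or a std::u16string, getting the bytes from the question is only a matter of serializing each 16-bit unit. The sketch below writes the low byte first (UTF-16LE), which matches what Encoding.Unicode produces in C#. The function name utf16le_bytes is an illustrative choice.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: serialize UTF-16 code units as bytes, low byte
// first (UTF-16LE), independent of the byte order of the host.
std::vector<std::uint8_t> utf16le_bytes(std::u16string const& s)
{
    std::vector<std::uint8_t> bytes;
    bytes.reserve(2 * s.size());
    for (std::size_t i = 0; i < s.size(); ++i) {
        std::uint16_t unit = static_cast<std::uint16_t>(s[i]);
        bytes.push_back(static_cast<std::uint8_t>(unit & 0xFF)); // low byte
        bytes.push_back(static_cast<std::uint8_t>(unit >> 8));   // high byte
    }
    return bytes;
}
```

Calling utf16le_bytes(u"ABC") yields the six bytes 65 00 66 00 67 00 from the question. Shifting and masking, rather than copying the memory of the string, keeps the result the same on big-endian machines.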

How to do the transformation of one encoding to the other

You have to be in a locale which uses Unicode. Hopefully

std::locale::global(std::locale(""));

will be enough for that. If not, your environment is not properly set up (or it is set up for another charset, in which case assuming Unicode would be a disservice to your users).
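A small sketch of how that setup could be checked at startup. std::locale("") throws std::runtime_error when the environment names a locale that is not installed, and nl_langinfo(CODESET) is a POSIX call (not standard C++) that reports the character set of the current C locale. The function name is hypothetical, and the comparison against the string "UTF-8" assumes glibc's spelling of the codeset name.

```cpp
#include <cstring>
#include <locale>
#include <stdexcept>
#include <langinfo.h>   // POSIX, for nl_langinfo(CODESET)

// Hypothetical helper: adopt the locale from the environment and report
// whether its character set is UTF-8. On failure the global locale is
// left unchanged.
bool use_environment_utf8_locale()
{
    try {
        // Installing a named locale as the global C++ locale also
        // updates the C locale.
        std::locale::global(std::locale(""));
    } catch (std::runtime_error const&) {
        return false; // the environment names a locale that is not installed
    }
    // Assumes the codeset is spelled "UTF-8", as glibc does.
    return std::strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

A program would call this once, early in main, and decide what to do (warn, fall back, or refuse to run) when it returns false.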

C Style

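Use the wcstombs and mbstowcs functions. Here is an example for what you asked.

std::string narrow(std::wstring const& s)
{
    // 4 bytes per wide character is enough for UTF-8; +1 for the terminator.
    std::vector<char> result(4*s.size() + 1);
    // c_str() guarantees the null terminator that wcstombs expects.
    size_t used = wcstombs(&result[0], s.c_str(), result.size());
    // wcstombs returns (size_t)-1 on an unconvertible character,
    // which this check rejects as well.
    assert(used < result.size());
    return result.data();
}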
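The opposite direction with mbstowcs looks much the same. This is a sketch under the same conditions as above: the conversion follows the current C locale, and the input is read up to its first null character. The name widen_c and the exception on failure are illustrative choices.

```cpp
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical counterpart to narrow(): multibyte string -> wide string,
// interpreted according to the current C locale.
std::wstring widen_c(std::string const& s)
{
    // A multibyte string never decodes to more wide characters than it
    // has bytes; +1 leaves room for the terminator.
    std::vector<wchar_t> result(s.size() + 1);
    std::size_t used = std::mbstowcs(&result[0], s.c_str(), result.size());
    // mbstowcs returns (size_t)-1 when it meets an invalid sequence.
    if (used == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence for the current locale");
    return std::wstring(&result[0], used);
}
```

In the default "C" locale this only handles ASCII reliably; UTF-8 input needs a UTF-8 locale to be active first.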

C++ Style

The codecvt facet of the locale provides the needed functionality. The advantage is that you don't have to change the global locale to use it. The drawback is that the usage is more complex.

#include <locale>
#include <iostream>
#include <string>
#include <vector>
#include <cassert>
#include <cwchar>
#include <iomanip>

std::string narrow(std::wstring const& s,
                   std::locale loc = std::locale())
{
    // 4 bytes per wide character is enough for UTF-8; +1 for the terminator.
    std::vector<char> result(4*s.size() + 1);
    wchar_t const* fromNext;
    char* toNext;
    std::mbstate_t state = std::mbstate_t();
    // Pointer arithmetic instead of &result[result.size()], which indexes
    // one past the end of the vector.
    std::codecvt_base::result convResult
        = std::use_facet<std::codecvt<wchar_t, char, std::mbstate_t> >(loc)
        .out(state, s.data(), s.data() + s.size(), fromNext,
             &result[0], &result[0] + result.size(), toNext);

    assert(fromNext == s.data() + s.size());
    assert(toNext != &result[0] + result.size());
    assert(convResult == std::codecvt_base::ok);
    *toNext = '\0';

    return &result[0];
}

std::wstring widen(std::string const& s,
                   std::locale loc = std::locale())
{
    // One wide character per input byte is always enough; +1 for the terminator.
    std::vector<wchar_t> result(s.size() + 1);
    char const* fromNext;
    wchar_t* toNext;
    std::mbstate_t state = std::mbstate_t();
    std::codecvt_base::result convResult
        = std::use_facet<std::codecvt<wchar_t, char, std::mbstate_t> >(loc)
        .in(state, s.data(), s.data() + s.size(), fromNext,
            &result[0], &result[0] + result.size(), toNext);

    assert(fromNext == s.data() + s.size());
    assert(toNext != &result[0] + result.size());
    assert(convResult == std::codecvt_base::ok);
    *toNext = L'\0';

    return &result[0];
}

You should replace the assertions with proper error handling.

BTW, this is standard C++ and doesn't assume Unicode, except for the computation of the size of result. You can do better by checking convResult, which can indicate a partial conversion.
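As one possible shape for that error handling, the sketch below checks the value returned by out() and throws instead of asserting. It also sizes the buffer from the facet's max_length(), so the size no longer assumes UTF-8. The name narrow_checked and the exception types are my choices, not something the standard prescribes.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical variant of narrow() that reports failures with exceptions.
std::string narrow_checked(std::wstring const& s,
                           std::locale const& loc = std::locale())
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Facet;
    Facet const& facet = std::use_facet<Facet>(loc);
    if (s.empty())
        return std::string();

    // max_length() is the facet's own bound on bytes per wide character.
    std::size_t perChar = static_cast<std::size_t>(facet.max_length());
    std::string result(s.size() * perChar, '\0');
    std::mbstate_t state = std::mbstate_t();
    wchar_t const* fromNext = 0;
    char* toNext = 0;

    std::codecvt_base::result r = facet.out(
        state, s.data(), s.data() + s.size(), fromNext,
        &result[0], &result[0] + result.size(), toNext);

    switch (r) {
    case std::codecvt_base::ok:
        break;
    case std::codecvt_base::partial:
        throw std::runtime_error("conversion stopped early: output buffer too small");
    case std::codecvt_base::error:
        throw std::runtime_error("character not representable in the target locale");
    default: // noconv is not expected for a wchar_t -> char facet
        throw std::runtime_error("unexpected conversion result");
    }
    // Keep only the bytes that were actually written.
    result.resize(static_cast<std::size_t>(toNext - &result[0]));
    return result;
}
```

On an error result, fromNext points at the offending wide character, so a caller that prefers substitution over failure could replace it and resume from there.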

The easiest way is to grab a small library, such as UTF8 CPP, and do something like:

utf8::utf8to16(line.begin(), line.end(), std::back_inserter(utf16line));

I usually use the UnicodeConverter class from the Poco C++ libraries. If you don't want the dependency, you can have a look at its code.
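For a feel of what such a converter does internally, here is a dependency-free sketch that decodes a UTF-8 std::string and emits UTF-16LE bytes directly. It is not the code of UTF8 CPP or Poco. The function names and the policy of throwing std::invalid_argument on malformed input are assumptions made for illustration, and it assumes the std::string really contains UTF-8.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Append one 16-bit unit as two bytes, low byte first.
static void put_utf16le(std::vector<std::uint8_t>& out, std::uint32_t unit)
{
    out.push_back(static_cast<std::uint8_t>(unit & 0xFF));
    out.push_back(static_cast<std::uint8_t>((unit >> 8) & 0xFF));
}

// Hypothetical helper: UTF-8 in, UTF-16LE bytes out.
std::vector<std::uint8_t> utf8_to_utf16le(std::string const& in)
{
    // Smallest code point allowed for each sequence length (rejects overlongs).
    static const std::uint32_t minCp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::vector<std::uint8_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t len;
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < minCp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");
        i += len;
        if (cp < 0x10000) {
            put_utf16le(out, cp);
        } else {
            // Supplementary code point: emit a surrogate pair.
            cp -= 0x10000;
            put_utf16le(out, 0xD800 + (cp >> 10));
            put_utf16le(out, 0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}
```

With this, utf8_to_utf16le("ABC") gives 65 00 66 00 67 00, the same bytes as the C# snippet in the question, and it does not depend on the current locale.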
