How to get byte size of multibyte string

2023-01-09 02:17

How do I get the byte size of a multibyte-character string in Visual C? Is there a function or do I have to count the characters myself?

Or, more generally, how do I get the correct byte size of a TCHAR string?

Solution:

_tcslen(_T("TCHAR string")) * sizeof(TCHAR)

EDIT:

I was talking about null-terminated strings only.

Let's see if I can clear this up:

"Multi-byte character string" is a vague term to begin with, but in the world of Microsoft, it typically means "not ASCII, and not UTF-16". Thus, you could be using some character encoding which might use 1 byte per character, or 2 bytes, or possibly more. As soon as you do, the number of characters in the string != the number of bytes in the string.

Let's take UTF-8 as an example, even though it is not the traditional encoding on MS platforms. The character é is encoded as "c3 a9" in memory -- thus, two bytes, but 1 character. If I have the string "thé", it's:

text: t  h  é     \0
mem:  74 68 c3 a9 00

This is a "null terminated" string, in that it ends with a null. If we wanted to allow our string to have nulls in it, we'd need to store the size in some other fashion, such as:

struct my_string
{
    size_t length;
    char *data;
};

... and a slew of functions to help deal with that. (This is sort of how std::string works, quite roughly.)
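As a rough illustration of one such helper, here is a minimal sketch built on the struct above. The function names my_string_from_bytes and my_string_free are made up for this example; they are not from any library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Same layout as the struct above: an explicit byte count plus a buffer. */
struct my_string
{
    size_t length;  /* number of bytes in data, embedded nulls included */
    char *data;     /* not necessarily null-terminated */
};

/* Hypothetical helper: copy `length` bytes into a new my_string.
   On allocation failure the result has data == NULL and length == 0. */
struct my_string my_string_from_bytes(const char *bytes, size_t length)
{
    struct my_string s = { 0, NULL };
    assert(bytes != NULL);
    s.data = malloc(length ? length : 1);  /* avoid a zero-size request */
    if (s.data != NULL) {
        memcpy(s.data, bytes, length);
        s.length = length;
    }
    return s;
}

/* Hypothetical helper: release the buffer and reset the fields. */
void my_string_free(struct my_string *s)
{
    assert(s != NULL);
    free(s->data);
    s->data = NULL;
    s->length = 0;
}
```

Calling my_string_from_bytes("ab\0cd", 5) keeps all 5 bytes in length, whereas strlen on the same data would stop at the embedded null and report 2.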

For null-terminated strings, however, strlen() will compute their size in bytes, not characters. (There are other functions for counting characters.) strlen just counts the number of bytes before it sees a 0 byte -- nothing fancy.
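To make the bytes-versus-characters difference concrete for the UTF-8 "thé" example, here is a small sketch. The helper utf8_char_count is a hypothetical name, and it assumes the input is valid UTF-8; it does no validation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count UTF-8 characters (code points) in a
   null-terminated string. Assumes the input is valid UTF-8. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        /* Continuation bytes have the form 10xxxxxx;
           every other byte starts a new character. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the bytes 74 68 c3 a9 00, written in C as "th\xc3\xa9", strlen reports 4 bytes while utf8_char_count reports 3 characters.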

Now, "wide" or "unicode" strings in the world of MS refer to UTF-16 strings. They have similar problems in that the number of bytes != the number of characters. (Also: the number of bytes / 2 != the number of characters, because characters outside the Basic Multilingual Plane are stored as a surrogate pair of two 16-bit units.) Let's look at thé again:

text:   t      h      é      \0
shorts: 0x0074 0x0068 0x00e9 0x0000
mem:    74 00  68 00  e9 00  00 00

That's "thé" in UTF-16, stored in little endian (which is what your typical desktop is). Notice all the 00 bytes -- these trip up strlen. Thus, we call wcslen, which looks at it as 2-byte shorts, not single bytes.
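The same counting can be sketched portably with explicit 16-bit units, which sidesteps the fact that wchar_t is 2 bytes on Windows but often 4 elsewhere. The helper names below are hypothetical, and the character count assumes well-formed UTF-16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of 16-bit units before the 0x0000
   terminator. This is what wcslen does when wchar_t is 16 bits wide. */
size_t utf16_unit_count(const uint16_t *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] != 0)
        ++n;
    return n;
}

/* Byte size of the string data, terminator not included. */
size_t utf16_byte_size(const uint16_t *s)
{
    return utf16_unit_count(s) * sizeof(uint16_t);
}

/* Hypothetical helper: count characters (code points) by skipping low
   surrogates (0xDC00..0xDFFF). Assumes well-formed UTF-16. */
size_t utf16_char_count(const uint16_t *s)
{
    size_t count = 0;
    assert(s != NULL);
    for (; *s != 0; ++s) {
        if (*s < 0xDC00 || *s > 0xDFFF)
            ++count;
    }
    return count;
}
```

For the units 0x0074 0x0068 0x00e9 0x0000 this gives 3 units, 6 bytes and 3 characters. For a surrogate pair such as 0xD834 0xDD1E (U+1D11E) it gives 2 units and 4 bytes, but only 1 character.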

Lastly, you have TCHARs, which are one of the above two cases, depending on whether UNICODE is defined (the C runtime's _tcs functions key off _UNICODE, so the two are normally defined together). _tcslen will be the appropriate function (either strlen or wcslen), and TCHAR will be either char or wchar_t. TCHAR was created to ease the move to UTF-16 in the Windows world.
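Putting that together for TCHAR, a minimal sketch might look like the following. The helper names tstr_byte_size and tstr_buffer_size are hypothetical. The #else branch is only a stand-in so the sketch compiles off Windows; it mimics a build without _UNICODE and is not part of any real header.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
#include <tchar.h>          /* TCHAR, _T, _tcslen */
#else
/* Assumption for illustration only: stand-ins that mimic the
   non-_UNICODE build so the sketch compiles on other platforms. */
#include <string.h>
typedef char TCHAR;
#define _T(x) x
#define _tcslen strlen
#endif

/* Hypothetical helper: bytes occupied by the characters of a
   null-terminated TCHAR string, terminator not included. */
size_t tstr_byte_size(const TCHAR *s)
{
    assert(s != NULL);
    return _tcslen(s) * sizeof(TCHAR);
}

/* Hypothetical helper: bytes needed to store a copy,
   terminator included. */
size_t tstr_buffer_size(const TCHAR *s)
{
    return tstr_byte_size(s) + sizeof(TCHAR);
}
```

When allocating or copying, the second form is usually the one wanted, since the expression _tcslen(s) * sizeof(TCHAR) leaves out the terminating null.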

According to MSDN, _tcslen corresponds to strlen when _MBCS is defined, and strlen returns the number of bytes in the string. If you use _tcsclen instead, it corresponds to _mbslen, which returns the number of multibyte characters.
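_mbslen is specific to the Microsoft runtime. As a rough portable counterpart, the standard mblen function can be used to walk a string one multibyte character at a time. This is only a sketch: mb_char_count is a hypothetical name, and the result depends on the current C locale's encoding (set with setlocale), which is an assumption the caller has to get right.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: count multibyte characters in a null-terminated
   string, using the current C locale's encoding.
   Returns (size_t)-1 if an invalid sequence is found. */
size_t mb_char_count(const char *s)
{
    size_t count = 0;
    size_t remaining;
    assert(s != NULL);
    remaining = strlen(s);      /* size in bytes */
    (void)mblen(NULL, 0);       /* reset any shift state */
    while (remaining > 0) {
        int n = mblen(s, remaining);
        if (n <= 0)
            return (size_t)-1;  /* invalid or incomplete sequence */
        s += n;
        remaining -= (size_t)n;
        ++count;
    }
    return count;
}
```

For plain ASCII input the byte count from strlen and the character count agree; they diverge only once the locale's encoding uses more than one byte for some character.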

Also, multibyte strings do not (AFAIK) contain embedded nulls, no.

I would question the use of a multibyte encoding in the first place, though... unless you're supporting a legacy app, there's no reason to choose multibyte over Unicode.
