开发者

std::string implementation in GCC and its memory overhead for short strings

I am currently working on an application for a low-memory platform that requires an std::set of many short strings (>100,000 strings of 4-16 characters each). I recently transitioned this set from std::string to const char * to save memory and I was wondering whether I was really avoiding all that much overhead per string.

I tried using the following:

std::string sizeTest = "testString";
std::cout << sizeof(sizeTest) << " bytes";

But it just gave me an output of 4 bytes, indicating that the string contains a pointer. I'm well aware that strings store their data i开发者_JAVA技巧n a char * internally, but I thought the string class would have additional overhead.

Does the GCC implementation of std::string incur more overhead than sizeof(std::string) would indicate? More importantly, is it significant over this size of data set?

Here are the sizes of relevant types on my platform (it is 32-bit and has 8 bits per byte):

char: 1 bytes

void *: 4 bytes

char *: 4 bytes

std::string: 4 bytes


Well, at least with GCC 4.4.5, which is what I have handy on this machine, std::string is a typdef for std::basic_string<char>, and basic_string is defined in /usr/include/c++/4.4.5/bits/basic_string.h. There's a lot of indirection in that file, but what it comes down to is that nonempty std::strings store a pointer to one of these:

  struct _Rep_base
  {
size_type       _M_length;
size_type       _M_capacity;
_Atomic_word        _M_refcount;
  };

Followed in-memory by the actual string data. So std::string is going to have at least three words of overhead for each string, plus any overhead for having a higher capacity than `length (probably not, depending on how you construct your strings -- you can check by asking the capacity() method).

There's also going to be overhead from your memory allocator for doing lots of small allocations; I don't know what GCC uses for C++, but assuming it's similar to the dlmalloc allocator it uses for C, that could be at least two words per allocation, plus some space to align the size to a multiple of at least 8 bytes.


I'm going to guess you are on a 32 bit, 8 bit per byte platform. I'm also going to guess that at least on the gcc version you are using, that they are using a reference counted implementation for std::string. The 4 byte sizeof you see is a pointer to a structure containing the reference count and the string data (and any allocator state if applicable).

In this design of gcc's the only "short" string has size == 0, in which case it can share a representation with every other empty string. Otherwise you get a refcounted COW string.

To investigate this yourself, code up an allocator that keeps track of how much memory it allocates and deallocates, and how many times. Use this allocator to investigate the implementation of the container you're interested in.


If it's guaranteed that ">100,000 strings of 4-16 characters each", then don't use std::string. Instead, write your own ShortString class. It's interesting that "sizeof(std::string) == 4", how is that possible? What are sizeof(char) and sizeof(void *)?


I've performed some comparisons about std::string overhead. In general it is about 48 bytes! Take a look at the article on my blog: http://jovislab.com/blog/?p=76

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜