
Optimization: Preallocate a chunk of heap memory prior to use by multiple objects - GAINS?

Are there large gains in PERFORMANCE to be made by allocating heap memory in advance and filling it incrementally?

Consider the VERY simplified example below:

byte * heapSpace = malloc (1000000);
int currentWriteSpot = 0;

struct A {
  int x;
  byte * extraSpace;
  int extraSpaceLength;
};

//a1 needs 10 bytes of extra storage space:
A a1;  
a1.x = 2;
a1.extraSpace = heapSpace + currentWriteSpot;
a1.extraSpaceLength = 10;

currentWriteSpot += 10;

//a2 needs 120 bytes of extra storage space:
A a2;
a2.x = 24;
a2.extraSpace = heapSpace + currentWriteSpot;
a2.extraSpaceLength = 120;

currentWriteSpot += 120;

// ... many more elements added

for ( ... ) {
   //loop contiguously over the allocated elements, manipulating contents stored at "extraSpace"
}

free (heapSpace);

VS:

...
a1.extraSpace = malloc ( 10 );
a2.extraSpace = malloc ( 120 );
a3...
a4...
...

//do stuff

free (a1.extraSpace);
free (a2.extraSpace);
free ...
free ...
free ...

Or is this likely to simply add complexity without significant gains in performance?

Thanks folks!


First of all, doing this does not increase complexity; it decreases it. Because you have already determined at the beginning of your operation that malloc was successful, you don't need any further checks for failure, which would at least have to free the allocations already made and perhaps reverse other changes to various objects' states.

One of the other benefits, as you've noted, is performance. This will be a much bigger issue in multi-threaded programs where calls to malloc could result in lock contention.

Perhaps a more important benefit is avoiding fragmentation. If the entire data object is allocated together rather than in small pieces, freeing it will definitely return usable contiguous space of the entire size to the free memory pool to be used by later allocations. On the other hand, if you allocate each small piece separately, there's a good possibility that they won't be contiguous.

In addition to reducing fragmentation, allocating all the data as a single contiguous block also avoids per-allocation overhead (at least 8-16 bytes per allocation are wasted) and improves data locality for cache purposes.

By the way, if you're finding this sort of allocation strategy overly complex, you might try making a few functions to handle it for you, or using an existing library like GNU obstack.


The reason you would want to do this is if you need to guarantee consistent allocation times (where 'consistent' != 'fastest'). The classic example is the draw loop of a game or other paint operation - it's far more important that it not "hiccup" than to gain an extra 2 FPS at the expense of consistency.

If all you want is to complete an operation as fast as possible, the Windows 7 low-fragmentation heap (LFH) is quite fast and already does this optimization for you (this tip dates from the days when the heap manager typically sucked and was really slow). That being said, I could be completely wrong - always profile your workload and see what works and what doesn't.


Generally it is best to let the memory manager do this kind of thing, but some extreme cases (e.g. LOTS of small allocations and deallocations) can be better handled with your own implementation. That is, you grab one big chunk of memory and allocate/deallocate from it as required. Such cases are usually very simplified ones (e.g. your own sparse matrix implementation) where you can apply domain-specific optimizations that the standard memory manager cannot. E.g. in the sparse matrix example, each chunk of memory is going to be the same size. This makes garbage collection relatively simple - chunks of deallocated memory do not need to be joined; a simple "in use" flag per chunk is all that's required.


You should only request as many separate blocks of memory from the memory manager as you actually need to control separately - in an ideal world where we have infinite optimization time, of course. If you have several A objects that do not need separate lifetimes, do not allocate them separately.

Of course, whether or not this is actually worth the extra optimization time is another question.

