malloc cpu cycles

What is the cost of malloc(), in terms of CPU cycles? (Vista OS, latest version of gcc, highest optimization level, ...)

Basically, I'm implementing a complex DAG structure (similar to a linked list) composed of 16-byte nodes (less common) and 20-byte nodes (more common).

Occasionally, I will have to remove some nodes and then add some. But, rather than always using malloc() and free(), I can simply move unneeded nodes to the end of my data structure, and then update the fields as my algorithm continues. If a free node is available, I will update the fields; if not, I'll have to allocate a new one.
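A minimal sketch of that reuse pattern, assuming the free nodes are threaded into a singly linked list through a next pointer (the Node layout and names here are hypothetical):

```cpp
#include <cstdlib>

// Hypothetical node layout; the real nodes are 16 or 20 bytes.
struct Node {
    Node* next;
    int   payload;
};

static Node* free_list = nullptr;  // unneeded nodes parked here

// Reuse a parked node if one exists; fall back to malloc() otherwise.
Node* acquire_node(void) {
    if (free_list != nullptr) {
        Node* n = free_list;
        free_list = n->next;
        return n;
    }
    return static_cast<Node*>(std::malloc(sizeof(Node)));
}

// Instead of free(), park the node for later reuse.
void release_node(Node* n) {
    n->next = free_list;
    free_list = n;
}
```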

The problem is, I might have only one free node available while having to input, for example, 20 nodes worth of data. This means:

  • I will check for an available free node
  • The check will succeed, and that free node will get updated
  • I will check for an available node 19 more times
  • All checks will fail, and malloc() will be called each time

Question: Is it really worth it? Should I just malloc() and free() as usual, or is it worth it to keep some free nodes available at the end of the list, and keep checking even if it will usually fail and lead to malloc() anyway?

More concretely,

What is the CPU cost of malloc()?


Does it matter what it costs? Really?

The true answer is "it depends".

It depends on loads of things:

  • What else the OS is doing at the time
  • How fragmented memory has become
  • The speed of the memory and processor on the client PC
  • Etc.

If this code is massively performance-critical, then time everything you can and work out the best pattern for your usage case.

If it isn't the most performance-critical bit of code, just do whatever is the clearest and simplest to implement and maintain.

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil", Donald Knuth


malloc() does not have a fixed cost in terms of latency because of the numerous possible states the memory manager has to deal with to fulfill your request.

Since your node sizes are relatively small, you should consider always doing an allocation of some larger size, perhaps 10 or more node sizes per allocation, and stuffing the extra ones into your unused pool. That way you'll incur allocation uncertainty less frequently. But more importantly, you'll reduce the amount of memory fragmentation caused by so many tiny allocations.
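A sketch of that chunking idea, reusing the hypothetical Node, free_list, and release_node from the earlier sketch (CHUNK is an assumed tuning parameter):

```cpp
enum { CHUNK = 10 };  // nodes obtained per malloc() call; tune as needed

// Allocate CHUNK nodes in one malloc(); hand out one, park the rest.
Node* acquire_node_chunked(void) {
    if (free_list == nullptr) {
        Node* block = static_cast<Node*>(std::malloc(CHUNK * sizeof(Node)));
        if (block == nullptr)
            return nullptr;
        // Thread nodes 1..CHUNK-1 onto the free list; node 0 is handed out.
        for (int i = 1; i < CHUNK; ++i)
            release_node(&block[i]);
        return &block[0];
    }
    Node* n = free_list;
    free_list = n->next;
    return n;
}
```

Note that nodes carved from a shared block can no longer be free()d individually; they have to stay parked on the free list until the whole block is released.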

Incidentally, I don't consider this sort of design consideration "Premature Optimization" since you aren't looking for an excuse to inject obtuse design characteristics without good reason. Data structures which can grow to arbitrary size and persist for arbitrary durations do need a little bit of forethought.

Particularly since data structures tend to find their way into unplanned usages later and often by other developers, it is important to strike a reasonable balance in terms of clarity and anticipated behavior.

Write your structure properly, with your own allocation and deallocation functions. Implement those separately. Initially have them just malloc and free a single node to make debugging easier. Later you can redesign them with fancier algorithms as your needs dictate.
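In its initial debugging-friendly form, that indirection can be as thin as this sketch (using the hypothetical Node type from above; function names are placeholders):

```cpp
// First version: a straight pass-through to malloc()/free().
// Callers never touch malloc directly, so these bodies can later be
// swapped for a pool or arena without changing any calling code.
Node* node_alloc(void) {
    return static_cast<Node*>(std::malloc(sizeof(Node)));
}

void node_free(Node* n) {
    std::free(n);
}
```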


Is it really worth it?

You will have to measure in order to know, period.


If memory is never freed, malloc() will tend to run pretty fast. If many blocks of memory are used and freed, malloc() may become quite slow. The particulars of how fast or slow it will be for any given pattern of usage depend strongly upon the implementation, and sometimes only slightly less strongly on the phase of the moon.

In some cases, particularly with embedded systems, memory usage will strictly follow a LIFO pattern. In that case, it may be helpful to simply grab all the memory one might want to use (on embedded systems this can often be done at link time), and keep a pointer to the start of that area and the end of allocated space (which initially is the start of the area). To allocate 'n' bytes, simply copy the end-of-allocated-space pointer, add 'n' to the original, and return the copied value. To free up a chunk and everything allocated after it, copy the address of the chunk to the end-of-allocated-space pointer.

Note that this approach has zero per-block overhead, and that both allocation and deallocation are very cheap. The LIFO limitation might be a problem, but if most of the usage is LIFO and one explicitly knows everything that needs to persist after a "sweep", one may be able to relocate everything that needs to be kept after a "sweep" to the start of allocable space, and put the pointer after the relocated stuff.
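A sketch of that LIFO scheme, assuming a statically sized arena (the size and names are placeholders, and alignment handling is omitted for brevity):

```cpp
#include <cstddef>

static unsigned char arena[64 * 1024];           // all memory grabbed up front
static unsigned char* end_of_allocated = arena;  // initially the start

// Allocate n bytes: copy the end-of-allocated-space pointer, advance the
// original by n, and return the copy. Zero per-block overhead.
void* lifo_alloc(std::size_t n) {
    if (n > static_cast<std::size_t>(arena + sizeof(arena) - end_of_allocated))
        return nullptr;  // arena exhausted
    void* p = end_of_allocated;
    end_of_allocated += n;
    return p;
}

// Free a chunk and everything allocated after it (LIFO discipline):
// just copy its address back into the end-of-allocated-space pointer.
void lifo_free_from(void* chunk) {
    end_of_allocated = static_cast<unsigned char*>(chunk);
}
```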


In addition to what @rikh highlighted, if you want ultra fast memory allocation, one technique is to pre-allocate blocks that are the size you need (lots of them).

I've written custom memory managers that have pre-allocated lists of blocks of different sizes.

In addition, you can also incorporate a memory bounds checking scheme into the blocks you are managing.
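One common form of such a bounds-checking scheme (a sketch, not the poster's actual code) writes sentinel bytes on both sides of each block and verifies them on release:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

enum { GUARD_LEN = 4 };
static const unsigned char GUARD = 0xFD;  // arbitrary sentinel value

// Allocate n usable bytes with guard regions on both sides.
void* guarded_alloc(std::size_t n) {
    unsigned char* raw =
        static_cast<unsigned char*>(std::malloc(n + 2 * GUARD_LEN));
    if (raw == nullptr)
        return nullptr;
    std::memset(raw, GUARD, GUARD_LEN);                  // front guard
    std::memset(raw + GUARD_LEN + n, GUARD, GUARD_LEN);  // back guard
    return raw + GUARD_LEN;
}

// Check the guards before freeing; n must match the original request.
void guarded_free(void* p, std::size_t n) {
    unsigned char* raw = static_cast<unsigned char*>(p) - GUARD_LEN;
    for (int i = 0; i < GUARD_LEN; ++i) {
        assert(raw[i] == GUARD);                  // buffer underrun check
        assert(raw[GUARD_LEN + n + i] == GUARD);  // buffer overrun check
    }
    std::free(raw);
}
```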


You might want to look into pooled allocators; AT&T's vmalloc package provides a pooled allocator, for example.


Heaps, especially for small memory allocations, are often structured as a linked list, where each heap cell points to the next. When allocating memory, the allocator will walk the heap until it finds a cell big enough for the required allocation. As your memory becomes more fragmented, you will have to walk a larger and larger number of cells. Although a large amount of work has been done to minimize allocation times, it is better to avoid the problem altogether.
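That walk is essentially a first-fit search; a toy sketch (Cell and its fields are illustrative, not any particular heap's real layout):

```cpp
#include <cstddef>

// Illustrative free-cell record; real allocators store more bookkeeping.
struct Cell {
    Cell*       next;  // next free cell in the heap's list
    std::size_t size;  // usable bytes in this cell
};

// First-fit: walk the list until a cell is big enough. The more
// fragmented the heap, the longer this walk tends to get.
Cell* first_fit(Cell* free_cells, std::size_t wanted) {
    for (Cell* c = free_cells; c != nullptr; c = c->next)
        if (c->size >= wanted)
            return c;
    return nullptr;  // no cell large enough
}
```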

It may well be a good idea to allocate a large block and divide it amongst a number of list items. This will probably mean you have fewer cache misses when walking your linked list.

For this reason, I would avoid the high-frequency use of malloc and free and add the extra complexity of a freelist.


It's worth finding out what the minimum allocatable block is in your target OS. You may be better off malloc()ing in 4K blocks and using that as your unused pool.


Asking for the cost of a single malloc is the wrong question.

Usual performance degradation factors are:

  • Size of working set (how many bytes you are "touching" within e.g. a second)
  • Memory fragmentation (how long does it take malloc to find a suitable block, and how much will this increase working set size)

In my experience, when you have to expect many nodes of that size (roughly 100K to millions), these things do matter.

Custom Allocator
Of course, if you can tune your algorithm to use less memory, or less nodes, do so. However, instead of letting the allocation cost concern leak into your solution, isolate it in a custom allocator.

The simplest choice for that would be overloading new for your class; this means your solution code is not affected.
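In C++ that might look like the following sketch; the pool_alloc/pool_free stand-ins here are hypothetical placeholders for a real pool:

```cpp
#include <cstddef>
#include <cstdlib>

// Stand-in pool interface; swap these bodies for a real fixed-size pool.
// Error handling (e.g. throwing std::bad_alloc) is omitted for brevity.
static void* pool_alloc(std::size_t n) { return std::malloc(n); }
static void  pool_free(void* p)        { std::free(p); }

struct DagNode {
    DagNode* next;
    int      payload;

    // Every `new DagNode` / `delete` now routes through the pool,
    // so the algorithm's code is unaffected.
    static void* operator new(std::size_t n) { return pool_alloc(n); }
    static void  operator delete(void* p)    { pool_free(p); }
};
```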

Which allocator to use depends a bit on the needs of the algorithm. For frequently allocating and freeing same-sized blocks, a fixed-size pool is the canonical choice.

An arena allocator can perform even better if you have many allocations and very few deletes (i.e. you can afford to not release the memory that was freed).

However, the deciding factor between the two is usually locality of reference. If there's anything you can do to boost that, you can win big time.


Any advice above that urges you to try some specific technique is wrong. The advice above that tells you to avoid premature optimization (a very important principle indeed) is right.

You have given us a question which is meaningless. What CPU? What speed? What architecture? Malloc is a C function. Which implementation of the standard heap routines are you talking about? The one in Microsoft Visual C/C++? The one that comes with the GNU standard library (glibc) on Linux/Unix/POSIX?

You haven't measured your performance and then told us what the performance under load is, nor did you tell us you wrote unit tests for load testing. Are you doing your initial design and your "thinking about how many cycles" at the same time? Because that's just silly.
