Realloc Performance

Recently I was tracking a post reasoning about the current implementation of realloc() on glibc and following it into the kernel. Very interesting reading but it left a bitter taste in the end since it was not followed by a careful investigation campaign. Of course, challenge accepted. I set off to write it.

So before that, what is realloc?  Malloc/Realloc/Free are part of the C dynamic memory allocation subsystem. It has been standardized in the POSIX.1-2008 standard and IEEE Std 1003.1, 2013 Edition.

In short, Malloc allocates memory and Free deallocates it. Realloc is used when you need to extend a block that has been allocated with Malloc.

The article goes down the rabbit hole that the C library is (glibc on Linux) and eventually into the kernel. The bottom line: for blocks bigger than 128kb, a simple virtual memory trick is used within the kernel which makes (re)allocation extremely fast. But that was a conclusion based on inspection or, in my own perspective, speculation. And speculation is for traders, not technologists.

So I set off to write my own tests. It took roughly one hour to write the tests and compile the results. I am going to get technical from here on. For non technologists, here is where you should stop reading.

I used two machines: one older AMD Phenom-X2 and a newer Haswell i7-5820k. Both running on Ubuntu 14.04.2 LTS, and the standard glibc 2.19 plus G++ 4.8.4.

The test loops through sizes 1 up to 512 in powers of two. For each size, it forks a separate process and allocates a vector of 1M pointers. Then the test process starts calling malloc(),  then realloc() for a size 2x bigger and finally free().

I measure all these calls with a RDTSC, of course taking care of all the pitfalls following the current Intel time measurement guidelines. Also I take steps to avoid out-of-order execution by both compiler and processor, discounting “dead time”, ie time to actually execute the timing instructions (RDTSC, RDTSCP, CPUID).

All results are in clock cycles to discard all edge cases relating to Intel speed step and to put the two different architectures on same footing.

For the first batch I used the AMD Phenom box.  On the first set, I allocate all vectors in a serial manner, ie with a loop ( j=0; j<N; ++j ). The results are as follows:

On the second set, I allocated the vectors in random order to kill all possible cache read-ahead effects. The results as expected are slightly worse:

What is interesting is that usually the average is higher than the median, which indicates normal spikes of high duration on the samples. However on this batch for realloc() with SZ<16 the median is larger. Digging into the stats, I see a double mode pattern. For example for realloc/SZ=1:

    32-64 cycles: 42.2%
64-128 cycles: 2.3%
128-256 cycles: 50.8%
256-512 cycles: 4.2%

Ie the majority of the samples take 208 cycles but a distinct group – ~40% of them – take much less, I’d guess ~60 cycles. This is very telling.

For the second batch, I used a newer Haswell i7-5820k (hyperthreading enabled). The results are better now as expected due to the better memory management.

As one can see, times in general improved quite substantially as expected. Struck me that on the newer Haswell the cache effects were much less noticeable.

I quickly ran a comparision batch with a memory pool I currently use in our trading systems. For shuffle/512 I get alloc: 25 avg, 2 med realloc: 59 avg, 4 med and free: 9 avg, 1 med. In this approach realloc is a free followed by an alloc.

The conclusion is that the standard C library malloc/realloc/free are really orders of magnitude more expensive than what you can do yourself implementing your own allocators.  This is mandatory to achieve true low latency speeds.

But the article’s central point was really focused on large realloc’s so I just ran another batch with the newer Haswell and in serial mode, now for blocks 1k up to 256M.

The results are very telling in the sense that the 128k threshold is clearly evidenced in the test results. There is a large buildup O(N), roughly ~ 1.3 cycles/byte from 1kb up to 64kb and then a sudden drop to 4k cycles.

To summarize the two batches and sets, I plotted the results in a double-log chart for easier visualization:

On this chart it is easy to see that all three function calls follow an exponential path up to a certain limit. Malloc hits the nail on the head earlier and smooths out to constant time at about 8kb. However it seems that the current glibc limits for realloc are very high and should be lowered since it spikes up to 80 thousand cycles per realloc (~ 20 microseconds on a typical server) before dropping to more reasonable levels.

With this result it is pretty clear that one to tune up their algos so to never issue a realloc call with sizes between 8k and 128k bytes. Instead, one should resort to a custom reallocation algorithm or tweak the mmap() threshold down to 8k with a mallocopt() call.

TL;DR  Memory management is a complex but very important part of any mission critical system. Building fast algorithms that do very well on its own is very easy. To make it work in the jungle of an application where all components compete against each other for memory, cache lines and CPU resources, is a much more difficult task that has to be tackled head on with painstaking, detail-oriented care.

* Henrique (the author) is a geek, nerd, owner of Vitorian LLC that is always finding ways to make his (and his clients) algos faster or more powerful. He lives in Chicago and has a beautiful family.