Last update: 05 August 2022
An `arena_allocator` implementation that keeps both `allocate()` and `deallocate()` at a complexity of O(1).
This `arena_allocator` is useful when:
- the program makes extensive use of `new` and `delete` for a well-defined data type;
- we want to avoid memory fragmentation, since the program is supposed to run for a long time without interruption;
- performance matters and we want to avoid wasting CPU cycles.
Let's see how it works and how it is possible to keep both allocation and deallocation at a complexity of O(1).
In the following [Figure 1] there is a representation of the `arena_allocator`. In this specific example we consider `chunk_size = 5`, since we only want to illustrate the mechanism and 5 memory slots are more than enough for this purpose.
As we can see in [Figure 1], there are different elements:
- MEMORY CHUNK: in order to avoid massive calls to `new` and `delete`, there is a single allocation for each chunk, where the amount of memory is enough to store `chunk_size` items of type `data_t` plus some extra memory required to manage the structure. In fact, the price of better performance, at least with this implementation, is that we need one pointer for each single `memory_slot`, so 32 or 64 bits depending on the CPU architecture. Moreover, we need to keep two other pointers, respectively `first` and `last`, used for integrity checks as well as for deallocating the memory once the `arena_allocator` is destroyed.
- MEMORY SLOT: each `memory_slot` is divided into two areas, reserved respectively for the `user_data` information and for a pointer to the next free `memory_slot`. When the `memory_chunk` is initialized, this pointer is set to the address of the next contiguous `memory_slot`; after that, during operation, it can hold the address of any available memory slot in any of the allocated `memory_chunk`s (see the sketch after this list).
- next_free: a pointer to `memory_slot` that always contains the address of the next free `memory_slot` to be returned by a call to `allocate()`. If there are no available `memory_slot`s, its value will be `nullptr`.
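To make the layout concrete, here is a minimal sketch of a `memory_slot` as described above; the member names and the exact layout are assumptions for illustration only, so refer to arena_allocator.h for the real definition.

```cpp
#include <cstddef>   // offsetof

// Sketch only: the real memory_slot in arena_allocator.h may differ.
template <typename data_t>
struct memory_slot {
  data_t        _user_data;   // area handed to the user by allocate()
  memory_slot*  _next;        // next free memory_slot, nullptr while "in use"

  data_t*       prt()  { return &_user_data; }
  memory_slot*  next() { return _next;       }

  // The "1 x subtraction" used by deallocate(): recover the owning slot
  // from the user_data pointer with simple pointer arithmetic, in O(1).
  static memory_slot* slot_from_user_data( data_t* userdata )
  {
    return reinterpret_cast<memory_slot*>(
             reinterpret_cast<char*>(userdata) - offsetof(memory_slot, _user_data) );
  }
};
```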
Now let's look at minimal code for `allocate()` and `deallocate()`:
```cpp
data_t* arena_allocator::allocate()
{
  memory_slot* pCurrSlot = _next_free;
  if ( pCurrSlot == nullptr )       // There is no free slot available: this can occur if
    return nullptr;                 // our arena_allocator has been limited to a maximum
                                    // number of memory_slot (size_limit template parameter)
                                    // or in case the system runs out of memory.

  // When we still have a free memory_slot, the following 3 steps are done.
  _next_free = pCurrSlot->next();   // STEP 1: set _next_free to the next() in the chain.
                                    //         With reference to Figure 1, if _next_free
                                    //         was referring to 'Slot_1', then we move it
                                    //         to Slot_1->next, which means 'Slot_2'.
  pCurrSlot->_next = nullptr;       // STEP 2: set the next pointer to nullptr, since this
                                    //         memory_slot is "in use" and its ownership
                                    //         is released to the user.
  return pCurrSlot->prt();          // STEP 3: return a pointer to data_t to the caller.
}
```
The above method `allocate()` basically performs:
- 1 x initial check
- 2 x assignments, respectively at STEP 1 and STEP 2
```cpp
void arena_allocator::deallocate( data_t* userdata ) noexcept
{
  assert( userdata != nullptr );
  slot_pointer pSlot = memory_slot::slot_from_user_data(userdata);
  userdata->~data_t();          // invoke the ~data_t() destructor, as delete would do
  pSlot->_next = _next_free;    // STEP 1: set _next to _next_free. Following the status
                                //         after allocate() above, we have pSlot = Slot_1,
                                //         pSlot->_next = nullptr and _next_free = Slot_2;
                                //         after this assignment we will have:
                                //         pSlot->_next = Slot_2
  _next_free = pSlot;           // STEP 2: _next_free = Slot_1, so in this specific
                                //         use case we roll back to the original status.
}
```
The above method `deallocate()` basically performs:
- 1 x subtraction, done by `memory_slot::slot_from_user_data()`
- 2 x assignments, respectively at STEP 1 and STEP 2
In the above examples there is no mention of constructor and destructor time, since these are `data_t` dependent and in charge of the user in any case. Moreover, both constructor and destructor are executed outside of any synchronization mechanism, so they have no impact on `arena_allocator` performance in a concurrent environment.
Please note that both methods above are simplified, since there is no synchronization. For the full implementation, please refer to arena_allocator.h or to core/arena_allocator.h, where there are two different methods for both `allocate()` and `deallocate()`: one version is thread safe, the other one is unsafe, so synchronization for the unsafe_ version is in charge of the caller.
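To show how the pieces fit together from the caller's side, here is a minimal usage sketch; the template parameters and the placement-new construction below are assumptions for illustration, so check arena_allocator.h for the actual interface.

```cpp
#include <cassert>
#include <new>        // placement new

struct particle {     // an example data_t: any well-defined data type
  double x{0.0}, y{0.0};
};

// Hypothetical instantiation: the real template parameter list
// (chunk_size, size_limit, ...) may differ; see arena_allocator.h.
arena_allocator<particle, /*chunk_size*/ 1024> pool;

void example()
{
  particle* p = pool.allocate();   // O(1): pop the head of the free-slot list
  assert( p != nullptr );          // nullptr only when size_limit is reached
                                   // or the system is out of memory

  new (p) particle{ 1.0, 2.0 };    // construction is in charge of the user,
                                   // here done with placement new

  pool.deallocate( p );            // O(1): invokes ~particle() and pushes the
                                   // slot back to the head of the free list
}
```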
Let's run some benchmarks under different conditions.
The first results are obtained without multi-threading, so without context switches and without concurrency, since we want to see the different behaviours under different conditions. The following table has been obtained using nanobench with bm_arena_allocator.cpp, compiler gcc 12.0.1 20220319, CPU 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz.
ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
16.92 | 59,111,448.27 | 0.2% | 233.62 | 38.92 | 6.003 | 53.62 | 0.0% | 7.63 | new and delete |
21.60 | 46,293,002.98 | 0.0% | 53.00 | 49.74 | 1.066 | 7.00 | 0.0% | 9.74 | lock_free::arena_allocator |
20.77 | 48,149,902.91 | 0.1% | 46.00 | 47.81 | 0.962 | 4.00 | 0.0% | 9.41 | lock_free::arena_allocator unsafe |
12.40 | 80,662,221.74 | 0.9% | 186.00 | 28.53 | 6.518 | 43.00 | 0.0% | 5.67 | core::arena_allocator |
8.61 | 116,171,334.50 | 1.4% | 36.00 | 19.81 | 1.817 | 3.00 | 0.0% | 3.88 | core::arena_allocator unsafe |
The table shows the difference between the classic `new` and `delete` and the arena_allocator's `allocate()` and `deallocate()`, in both safe and unsafe modes and for both implementations.
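For anyone who wants to reproduce this kind of measurement, here is a minimal nanobench sketch along the lines of bm_arena_allocator.cpp; the benchmark body is an assumption for illustration, not the exact code that produced the table above.

```cpp
#define ANKERL_NANOBENCH_IMPLEMENT
#include <nanobench.h>

struct data_t { int a{0}; int b{0}; };   // stand-in for the benchmarked type

int main()
{
  ankerl::nanobench::Bench bench;
  bench.minEpochIterations( 1'000'000 );

  bench.run( "new and delete", [] {
    data_t* p = new data_t{};
    ankerl::nanobench::doNotOptimizeAway( p );   // keep the allocation observable
    delete p;
  } );

  // The arena_allocator rows are measured the same way, e.g. with
  //   data_t* p = pool.allocate(); ... pool.deallocate( p );
  return 0;
}
```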
In this specific benchmark we have only one thread allocating and deallocating memory; but when your program makes many other calls to `new` and `delete` with different sizes, running it for hours will degrade `new` and `delete` performance due to fragmentation, while the arena_allocator keeps its performance constant over time.
There is nothing magic here: the arena allocator leverages the fact that all of its `memory_slot`s have the same size, while `new` and `delete` have to deal with generic requests, and even with `tcache` optimization and/or `buddy` implementations the `new` operation will cost more than O(1) to find a suitable slot for the user.
Now that we have seen what happens with a single thread, let's move to a multi-threaded environment; in this new condition we will only use the thread-safe methods.
Just note that we are going to create an environment with a highly contended resource only because we are interested in the behaviour of our implementations in extreme conditions, as well as in the behaviour of the `new` and `delete` operators, which rely on `malloc` (and, on the system I'm running, leverage `mmap`); we will come back to this topic at the end. What is important to highlight is that extreme conditions like this shall be avoided in any application design.
- For the following benchmarks each run will execute 50 million allocations and 50 million deallocations; we keep the total number of operations constant in order to better focus on the time variation while increasing the number of concurrent threads. So the first run will do all operations with one thread; in the second run the operations will be distributed over 2 threads, which means each thread will do 25 million allocations and 25 million deallocations, and so on up to the maximum number of threads.
- The number of threads starts from 1 and is increased in steps of 1 up to 16 threads, so for each implementation we end up with 16 runs.
- Results refer to gcc 12.0.1 20220319 on a CPU 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz; a different compiler such as MSVC or Clang can produce different results, since they rely on different implementations with different solutions. In the following I will present only results generated with the GCC compiler.
- Source code used for this specific benchmark: bm_mt_arena_allocator.cpp
- Results are also available in mt_results.ods
- For each row we compute the AVG, which is then used to compare the 3 results, plus one more result obtained using `std::mutex` instead of `core::mutex`.
new & delete (each cell is the time in seconds measured by one thread; the row label is the number of threads in the run):

Threads | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | AVG |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1.301 | 1.301 | |||||||||||||||
2 | 0.631 | 0.66 | 0.646 | ||||||||||||||
3 | 0.507 | 0.519 | 0.476 | 0.501 | |||||||||||||
4 | 0.373 | 0.443 | 0.422 | 0.37 | 0.402 | ||||||||||||
5 | 0.426 | 0.366 | 0.377 | 0.329 | 0.312 | 0.362 | |||||||||||
6 | 0.406 | 0.327 | 0.41 | 0.32 | 0.295 | 0.338 | 0.349 | ||||||||||
7 | 0.373 | 0.371 | 0.355 | 0.342 | 0.317 | 0.26 | 0.304 | 0.332 | |||||||||
8 | 0.328 | 0.325 | 0.347 | 0.329 | 0.246 | 0.298 | 0.246 | 0.283 | 0.300 | ||||||||
9 | 0.331 | 0.335 | 0.253 | 0.253 | 0.327 | 0.243 | 0.301 | 0.279 | 0.309 | 0.290 | |||||||
10 | 0.293 | 0.249 | 0.268 | 0.229 | 0.302 | 0.333 | 0.343 | 0.231 | 0.229 | 0.255 | 0.281 | ||||||
11 | 0.318 | 0.282 | 0.228 | 0.32 | 0.257 | 0.228 | 0.305 | 0.209 | 0.284 | 0.255 | 0.251 | 0.268 | |||||
12 | 0.319 | 0.312 | 0.252 | 0.276 | 0.313 | 0.274 | 0.251 | 0.274 | 0.195 | 0.244 | 0.266 | 0.241 | 0.284 | ||||
13 | 0.3 | 0.31 | 0.27 | 0.266 | 0.252 | 0.195 | 0.234 | 0.291 | 0.238 | 0.264 | 0.285 | 0.294 | 0.233 | 0.265 | |||
14 | 0.235 | 0.288 | 0.298 | 0.298 | 0.223 | 0.289 | 0.303 | 0.244 | 0.188 | 0.289 | 0.235 | 0.294 | 0.222 | 0.227 | 0.272 | ||
15 | 0.3 | 0.301 | 0.3 | 0.308 | 0.313 | 0.256 | 0.305 | 0.302 | 0.295 | 0.297 | 0.287 | 0.311 | 0.288 | 0.116 | 0.244 | 0.298 | |
16 | 0.301 | 0.248 | 0.291 | 0.29 | 0.241 | 0.285 | 0.293 | 0.284 | 0.278 | 0.297 | 0.282 | 0.289 | 0.274 | 0.278 | 0.157 | 0.146 | 0.279 |
lock_free::arena_allocator (lock-free; time in seconds per thread):

Threads | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | AVG |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1.04 | 1.040 | |||||||||||||||
2 | 7.193 | 6.676 | 6.935 | ||||||||||||||
3 | 8.958 | 8.763 | 9.104 | 8.942 | |||||||||||||
4 | 9.297 | 9.401 | 8.86 | 9.366 | 9.231 | ||||||||||||
5 | 9.189 | 9.408 | 9.47 | 9.58 | 9.458 | 9.421 | |||||||||||
6 | 9.887 | 10.103 | 8.284 | 10.096 | 9.803 | 10.175 | 9.725 | ||||||||||
7 | 10.663 | 8.453 | 10.758 | 10.767 | 10.366 | 10.852 | 10.792 | 10.379 | |||||||||
8 | 11.301 | 10.912 | 11.295 | 8.893 | 10.996 | 10.07 | 11.353 | 11.106 | 10.741 | ||||||||
9 | 10.997 | 7.877 | 9.067 | 10.307 | 10.771 | 10.823 | 11.203 | 10.526 | 11.189 | 10.196 | |||||||
10 | 9.175 | 10.239 | 10.006 | 9.961 | 7.228 | 8.042 | 11.007 | 9.746 | 10.985 | 10.499 | 9.426 | ||||||
11 | 9.347 | 6.603 | 7.463 | 10.227 | 9.85 | 8.551 | 10.935 | 9.038 | 10.92 | 10.389 | 10.291 | 9.002 | |||||
12 | 10.657 | 7.708 | 10.731 | 8.599 | 9.859 | 11.076 | 8.301 | 11.099 | 6.006 | 10.773 | 9.167 | 10.738 | 9.754 | ||||
13 | 10.684 | 8.997 | 8.308 | 10.485 | 9.928 | 10.637 | 8.015 | 10.537 | 10.676 | 9.401 | 10.559 | 10.501 | 10.265 | 9.699 | |||
14 | 8.81 | 10.208 | 9.529 | 10.851 | 10.471 | 10.281 | 7.728 | 10.314 | 6.222 | 10.221 | 9.173 | 9.577 | 10.839 | 10.384 | 9.774 | ||
15 | 9.82 | 9.439 | 9.977 | 8.706 | 9.584 | 10.37 | 7.27 | 10.387 | 9.957 | 8.695 | 9.405 | 9.898 | 9.975 | 9.693 | 9.974 | 9.444 | |
16 | 10.002 | 9.253 | 9.297 | 9.923 | 7.991 | 10.497 | 10.063 | 10.491 | 8.783 | 10.028 | 9.92 | 9.135 | 9.513 | 8.575 | 8.045 | 10.082 | 9.690 |
core::arena_allocator (spinlock on core::mutex; time in seconds per thread):

Threads | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | AVG |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.732 | 0.732 | |||||||||||||||
2 | 6.527 | 6.662 | 6.595 | ||||||||||||||
3 | 7.527 | 7.144 | 7.995 | 7.555 | |||||||||||||
4 | 11.509 | 12.235 | 13.447 | 13.269 | 12.615 | ||||||||||||
5 | 18.322 | 20.645 | 20.954 | 21.15 | 20.876 | 20.389 | |||||||||||
6 | 20.058 | 13.398 | 18.757 | 21.403 | 18.865 | 21.41 | 18.982 | ||||||||||
7 | 22.631 | 22.271 | 22.68 | 18.891 | 19.795 | 13.406 | 20.827 | 20.072 | |||||||||
8 | 25.28 | 22.443 | 19.029 | 25.518 | 22.126 | 21.79 | 19.746 | 25.482 | 22.677 | ||||||||
9 | 26.245 | 27.364 | 23.283 | 17.574 | 15.791 | 27.325 | 14.596 | 21.768 | 27.352 | 21.743 | |||||||
10 | 25.093 | 27.146 | 25.93 | 14.911 | 14.773 | 29.44 | 22.381 | 30.523 | 30.345 | 30.517 | 23.775 | ||||||
11 | 29.139 | 25.356 | 27.401 | 18.279 | 14.591 | 31.17 | 31.131 | 23.907 | 29.547 | 27.278 | 31.078 | 25.122 | |||||
12 | 34.612 | 32.115 | 13.793 | 19.552 | 15.229 | 35.082 | 23.196 | 31.419 | 31.696 | 34.031 | 31.894 | 35.099 | 25.625 | ||||
13 | 39.23 | 25.543 | 38.904 | 34.507 | 41.056 | 40.871 | 40.163 | 41.068 | 14.414 | 28.761 | 39.404 | 31.804 | 38.503 | 37.668 | |||
14 | 23 | 34.515 | 35.448 | 12.444 | 35.778 | 37.65 | 18.954 | 37.935 | 24.742 | 37.937 | 28.553 | 31.715 | 34.714 | 37.801 | 29.466 | ||
15 | 48.109 | 38.414 | 48.091 | 47.921 | 24.952 | 48.058 | 47.325 | 41.911 | 35.352 | 47.396 | 46.152 | 46.477 | 46.645 | 42.713 | 42.588 | 43.098 | |
16 | 49.105 | 44.869 | 39.921 | 51.058 | 42.998 | 50.617 | 49.154 | 51.268 | 51.296 | 49.352 | 43.876 | 51.499 | 51.5 | 49.531 | 35.247 | 47.484 | 47.374 |
From the above data it is already clear what happened, but for a better view let's put everything in a graph, which also includes a different implementation leveraging `std::mutex`.
Note: if someone wants to repeat the tests with `std::mutex` as well, the only change to be done in core/arena_allocator.h is the following:

```cpp
..
....
    //mutable mutex        _mtx_next;
    mutable std::mutex     _mtx_next;
..
...
```
This image confirms what we saw before in the first single-thread benchmark: all implementations show the same or better performance than the `new` and `delete` operators, but this holds only for 1 thread, since as soon as we move to multi-threading everything changes.
What emerges from this first graph are the results obtained with `std::mutex`, which on this system relies on pthread_mutex, although I believe that with other compilers and on other OSes we would see the same type of degradation: the total time reached up to 67 seconds to perform 100 million operations in total, compared with the first result of 1.3 seconds required in a single-thread context. With 16 threads we have about 67 - 1.3 = 65.7 seconds lost across all threads waiting to acquire the `mutex`.
Removing the `std::mutex` results, we get a graph that is more intelligible and matches the three tables presented above. With that one element removed, the remaining information is scaled up, and the results with one thread, as well as the behaviour of all 3 allocation strategies, are more visible.
- Spinlock: we can observe that the trend grows linearly; increasing the number of threads also generates more collisions, and the time increases accordingly.
- Lock-Free: the results in a single-thread environment are the worst, since the sum of all the atomic operations for a single call to `allocate()` or `deallocate()` is comparable to the time required to acquire a `mutex` (or to create a "critical section") and release it. However, starting from 4 threads in parallel, the lock-free implementation performs better than the spin-lock implementation, and the curious fact is that from this point forward performance seems unaffected by the number of concurrent threads; in fact we get more or less the same results for each run (a sketch of the underlying technique follows this list). So, as far as we can see from these results, a lock-free implementation seems perfect for environments with highly contended resources.
- New & Delete: bizarre results, right? With one single thread it is one of the worst, but after that it seems to improve, getting better and better as the number of threads increases. To understand what is going on, we have to dig a little into the details of the malloc() implementation, where we can find two reasons for these results.
Coming back to the new and delete results: first of all, here is a link for everyone interested in knowing more about malloc, with an explanation of how it works internally. I really hope these results will help to dispel the misconception that new() and delete() are slow: there is a huge amount of work behind this simple but fundamental functionality, and a lot of research and improvements are still in progress; not perfect for sure, but improving release by release.
Please note that with other implementations the results can be quite different, and yes, this includes other implementations of `new` and `delete`.
Now let's see the two main reasons that explain these strange, sorry, bizarre results obtained using `new` and `delete`.
In the following image we have the comparison between `new` and `delete` and `std::mutex`, which take respectively 1.301s and 1.344s; quite close, right? From this result it seems that malloc() internally uses a `mutex`, and checking the implementation, it is exactly like that.
At this point it is really important to remember that the benchmarks have been executed keeping the total number of operations constant, which means the operations are distributed over the threads accordingly: for example, with 16 threads each thread performs only 3.125 million allocations and 3.125 million deallocations. This simple reminder helps to explain the following graph, in which increasing the number of threads makes the global time, here represented by the AVG, go down, in clear contrast with the trends observed for all other implementations.
The above results are real, and they are possible only without contended resources; from our observations it seems that `new` and `delete` act as if in a single thread even when there are multiple threads operating with those operators. Again, the explanation comes from the malloc() implementation: in order to avoid, or at least limit, the number of resources shared between threads, each thread is assigned an `arena`. This means it is as if each thread had its own `new` and `delete`, and depending on the total number of threads such an arena is not shared, or better, it is used by one single thread. If we combine this information with the table above, where increasing the number of threads also reduces the number of operations per single thread, the graph makes perfect sense.
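As an aside, on glibc it is possible to observe the effect of per-thread arenas by limiting their number; the sketch below is glibc-specific and only meant to illustrate the mechanism described above.

```cpp
#include <malloc.h>   // glibc-specific: mallopt(), M_ARENA_MAX

int main()
{
  // Force every thread to share a single malloc arena: with this setting the
  // multi-threaded new/delete benchmark above would degrade as the thread
  // count grows, instead of scaling thanks to one arena per thread.
  mallopt( M_ARENA_MAX, 1 );

  // ... run the multi-threaded new/delete benchmark here ...
  return 0;
}
```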
It would not be surprising if someone asked themselves why we need an arena allocator at all, if `new` and `delete` have such great performance. An arena_allocator is useful in many environments, but as we saw from all the results, we must take care to use it in the right manner, or we will waste a lot of CPU computational power, as well as energy, obtaining wretched performance. An arena allocator is a building block that can implement many allocation strategies, helping to solve issues like memory fragmentation and also to obtain relevant performance improvements.
I hope you enjoyed this article and found useful information in it.