Memory Allocators: Table of Contents

It is perhaps a bit late, but I now have quite a few posts on the topic of memory allocators and navigating between them was becoming a bit complex.

I thought it would be a good idea to have a central place that links everything together.

All the relevant code for this lives under Gin (code) on GitHub.

Table of Contents

Virtual Memory Aware Stack Frame Allocator

Today we cover the second variant: the virtual memory aware stack frame allocator (code). This is a fairly popular incarnation of the stack frame allocator pattern and I have seen quite a few implementations in the wild along the lines of this toy implementation. Similar in spirit to the virtual memory aware linear allocator, here again we gain in simplicity and performance by leveraging the virtual memory system.

How it works

Not a whole lot differs from the previous implementation aside from the fact that we are no longer segmented. In AAA video games, it is quite common for these allocators to have a known (or easily approximated) upper bound, and as such we can simplify the implementation quite a bit by having a single segment. To avoid wasting memory, we simply reserve a virtual address range and commit/decommit memory as needed.

With a single segment, a lot of logic is removed and simplified. When we allocate, we commit more memory gradually in blocks of our page size. In a real implementation, this block size would likely be larger than the page size; 64KB would be a decent choice.
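To make this concrete, here is a minimal POSIX sketch of the reserve/commit mechanism. The class and member names are illustrative (they are not taken from the Gin code), alignment and error handling are simplified, the reserved size is assumed to be a multiple of the commit block size, and a Windows implementation would use VirtualAlloc with MEM_RESERVE and MEM_COMMIT instead of mmap/mprotect.

#include <sys/mman.h>
#include <cstddef>
#include <cstdint>

// Illustrative sketch: reserve a large address range up front, then commit
// memory on demand as the bump pointer crosses the committed watermark.
class VMemStackAllocator
{
public:
    bool Initialize(size_t reserveSize)
    {
        // PROT_NONE reserves address space without committing pages.
        void* base = mmap(nullptr, reserveSize, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED)
            return false;

        m_base = static_cast<uint8_t*>(base);
        m_reservedSize = reserveSize;
        m_allocatedSize = 0;
        m_committedSize = 0;
        return true;
    }

    void* Allocate(size_t size)
    {
        if (m_allocatedSize + size > m_reservedSize)
            return nullptr; // out of reserved address space

        if (m_allocatedSize + size > m_committedSize)
        {
            // Commit in 64KB blocks; on Linux the physical pages are
            // actually faulted in on first touch.
            const size_t kBlockSize = 64 * 1024;
            size_t newCommittedSize = m_allocatedSize + size;
            newCommittedSize = ((newCommittedSize + kBlockSize - 1) / kBlockSize) * kBlockSize;

            if (mprotect(m_base + m_committedSize,
                         newCommittedSize - m_committedSize,
                         PROT_READ | PROT_WRITE) != 0)
                return nullptr;

            m_committedSize = newCommittedSize;
        }

        void* result = m_base + m_allocatedSize;
        m_allocatedSize += size;
        return result;
    }

private:
    uint8_t* m_base = nullptr;
    size_t m_reservedSize = 0;
    size_t m_allocatedSize = 0;
    size_t m_committedSize = 0;
};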

Popping is very easy and fast, and it never decommits memory. This avoids repeated commit/decommit calls and the need to manage hysteresis. In practice, these allocators often have a known fixed lifetime. For example, we might have an instance of the allocator for main thread allocations during a single frame. As such, it is easy at the end of the frame to decommit once and retain some amount of slack. It is also very common to have 2 or 3 instances for rendering purposes and while they might be double (or triple) buffered, the principle remains.
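The end-of-frame cleanup might look like the following sketch, in the spirit of the class above; the function and its parameters are hypothetical. A single pass releases all committed pages past the live watermark while optionally retaining some slack (on Windows, VirtualFree with MEM_DECOMMIT plays the same role).

#include <sys/mman.h>
#include <cstddef>
#include <cstdint>

// Hypothetical end-of-frame cleanup: decommit everything past the live
// watermark in one pass. 'slack' retains some committed memory to absorb
// the next frame's spike.
inline void DecommitSlack(uint8_t* base, size_t allocatedSize,
                          size_t& committedSize, size_t slack)
{
    const size_t kBlockSize = 64 * 1024;
    size_t keepSize = allocatedSize + slack;
    keepSize = ((keepSize + kBlockSize - 1) / kBlockSize) * kBlockSize;

    if (committedSize <= keepSize)
        return;

    // MADV_DONTNEED releases the physical pages and PROT_NONE re-protects
    // the range; the address space reservation itself stays intact.
    madvise(base + keepSize, committedSize - keepSize, MADV_DONTNEED);
    mprotect(base + keepSize, committedSize - keepSize, PROT_NONE);
    committedSize = keepSize;
}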

What can we use it for

Similar to the classic segmented implementation, this variant is almost a drop-in replacement. It is best suited when the upper bound on the usage is known. It would not be unusual for this implementation to live alongside the segmented implementation: during development, the segmented version is used such that it never fails, and closer to releasing the product, the upper bound is calculated and the virtual memory aware variant is used instead.

What we can’t use it for

For obvious reasons, this variant is not ideal when the maximum memory footprint is not known or if it needs to grow unbounded. While on a 64-bit system you could technically reserve 4TB of address space and it would likely be enough, in practice another variant might be a better fit.

Edge cases

As with the classic greedy variant, popping does not decommit memory; as such, if memory pressure is high, special care needs to be taken to avoid issues. Under some circumstances, if an out-of-memory situation arises with some other allocator, decommitting the slack might allow you to retry the allocation.

Performance

The performance of this allocator is very good. Allocation is O(1) and deallocation is O(1). Both operations are very cheap if no call to the kernel is required to commit memory.

Conclusion

This is our second usable allocator and its usage is quite common. Single segment implementations might not always be virtual memory aware, but they remain the same in nature. Many AAA video game titles have been released with very similar implementations to great success.

Next up, we will revisit the linear memory allocator family to cover an important addition: the thread safe linear allocator. This will be our first thread safe implementation and it is a variant that is very common in multithreaded rendering code.

Reddit thread

Back to table of contents

Greedy Stack Frame Allocator

Today we cover the first stack frame allocator variant: the greedy stack frame allocator (code). It takes the term greedy from how it manages free segments after popping a frame. To better understand how this allocator works, you should first read up on the stack frame allocator family.

How it works

This allocator uses memory segments chained together. When an allocation is made, we find a free segment that can accommodate our allocation request. The first segment we check is the current live segment (the segment last used). Should it be full or the remaining space too small, we then look in the free segment list and pick (or allocate) a suitable candidate. Finally, we update our current live segment to the new used segment.

A segment behaves much like our linear allocator previously seen (also commonly called a memory region). Allocation is very cheap since it only involves bumping a pointer forward.
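As a minimal sketch, assuming an illustrative segment layout (the actual Gin code differs), allocation boils down to aligning the current offset and bumping it:

#include <cstddef>
#include <cstdint>

// Illustrative segment layout; 'next' chains segments in the live chain
// or the free list.
struct Segment
{
    Segment* next;
    uint8_t* memory;        // start of the segment's buffer
    size_t   size;          // total segment size
    size_t   allocatedSize; // current bump offset

    // Bump allocation; 'alignment' must be a power of two.
    void* Allocate(size_t allocSize, size_t alignment)
    {
        uintptr_t current = reinterpret_cast<uintptr_t>(memory) + allocatedSize;
        uintptr_t aligned = (current + (alignment - 1)) & ~(alignment - 1);
        size_t newAllocatedSize = (aligned - reinterpret_cast<uintptr_t>(memory)) + allocSize;

        if (newAllocatedSize > size)
            return nullptr; // segment full: the caller picks or allocates another

        allocatedSize = newAllocatedSize;
        return reinterpret_cast<void*>(aligned);
    }
};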

When we pop a frame, all segments that were used and are no longer required or alive are appended to a free list. This is where the allocator is greedy: it will only free the segments once the allocator is destroyed. The upside of this is that we will rarely need to allocate new segments, which can involve an expensive malloc/kernel call.
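Popping a frame could then look like the following sketch, reusing the hypothetical Segment above: every segment that became live after the frame was pushed is reset and pushed onto the free list, and nothing is returned to the system.

// Hypothetical pop for the greedy variant. 'next' links each live segment
// to the previous (older) one; 'frameSegment' and 'frameOffset' were
// captured when the frame was pushed.
void PopFrame(Segment*& liveSegment, Segment*& freeList,
              Segment* frameSegment, size_t frameOffset)
{
    while (liveSegment != frameSegment)
    {
        Segment* segment = liveSegment;
        liveSegment = segment->next;   // step back to the older segment
        segment->allocatedSize = 0;    // reset the bump pointer
        segment->next = freeList;      // greedy: keep it on the free list
        freeList = segment;
    }

    // Rewind the frame's segment to the offset captured when it was pushed.
    liveSegment->allocatedSize = frameOffset;
}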

Supporting realloc is trivial and works as expected.

The allocator also supports supplying it with pre-allocated segments that are externally managed. This is handy if you want to prime it with a larger segment, which then becomes the minimum managed size. For example, if you know the allocator typically uses around 2MB, instead of letting it allocate 64KB segments on demand from the start, you register a single 2MB segment at initialization; if it later needs more memory, it will allocate 64KB segments on demand.
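Priming might look like this; RegisterSegment is a hypothetical name for illustration purposes:

// Hypothetical priming call: one large externally managed segment up front,
// extra 64KB segments are allocated on demand afterwards.
static uint8_t s_primeBuffer[2 * 1024 * 1024];
allocator.RegisterSegment(s_primeBuffer, sizeof(s_primeBuffer));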

What can we use it for

The greedy nature of the algorithm makes it ideal when the upper bound on memory usage is known, or at the very least manageable, and we can afford to keep the memory allocated. This is suitable for many applications and is indeed used in a similar form in Unreal Engine 3 under the name FMemStack. Many AAA games have shipped making extensive use of it.

What we can’t use it for

The greedy nature makes it far less ideal when the upper bound isn’t known ahead of time or is possibly unbounded.

Edge cases

Again, the usual edge cases here involve overflow caused by the size or alignment requested and must be carefully dealt with. Another edge case is specific to this allocator: because of the linear allocation nature of the algorithm, live segments might not be 100% used and will naturally keep some unused slack. Generally speaking, this isn't a big deal, but it can trivially be tracked if required.

Potential optimizations

Here again we can omit the overflow checks. In practice they are rarely required and could be left to assert instead. This is generally the approach taken in AAA game implementations.

Reallocate support is optional here as well and could be stripped if needed.

The implementation uses a FrameDescription in order to keep track of the order in which the frames are pushed to ensure we pop them back in reverse order. This is entirely for safety and in practice it could be omitted in a shipping build.

Performance

The performance of this allocator is very good. Allocation is O(1) and deallocation is a no-op. Frame pushing is O(1) and popping is O(n) where n is the number of live segments freed. All of these operations are very cheap.

Frames are very compact and their footprint is split between AllocatorFrame and a frame description allocated internally.

Segments have a small header (at most 24 bytes) required to chain them as well as to keep track of the allocated size, segment size, and some flags.

Generally, a stack frame allocator manages less than 4GB of memory; as such, it can be templated to use uint32_t internally. This yields a footprint of 40 bytes for the allocator on 64-bit platforms (it fits in a single cache line). This ensures that when we allocate, usually only 2 cache lines will be touched (the allocator instance and the live segment header).

Conclusion

This is our first usable allocator and it is a powerful one. On older, slower hardware where taking locks and using a general malloc/free was expensive, I used a thread local stack allocator to great success in the performance-sensitive portions of the code. Its simplicity and flexibility make it ideal for replacing the allocators of containers that end up on the stack and might often reallocate memory.

Next up, we will cover a virtual memory aware variant better suited for high performance AAA video games where the upper bound is generally known or can be reliably approximated. This second variant will be the last variant we will cover in this allocator family.

Note that it is often desirable to prevent the allocator from growing unbounded in the presence of spikes. A common way to deal with this is to free some segments when we pop frames, keeping a free list with only a small number of entries (slack); this is easily achieved with a hysteresis constraint. Alternatively, at a specific point in the code, you could call a function on the allocator to free some of its slack (e.g., at the start or end of a frame).

Reddit thread

Back to table of contents

Stack Frame Allocators

This allocator family is quite possibly one of the most common memory management techniques.

Note that there are many possible variants of this allocator. As such, we will only cover a few to give a general overview of what it can look like. Actual production versions in the wild will often be custom and tailored to the specific application's needs. Unlike our linear allocator previously covered, this allocation technique is very common and in use in many applications.

How it works

A topic that comes up very early when introducing many programming languages is the call stack. It is central to C, C++, and many other languages. The execution stack used for this mechanism operates as a stack frame based memory management algorithm, handled by the compiler and the hardware (on x86 and x64).

Every C++ programmer is familiar with the stack and all it has to offer. Each function call pushes a new frame on the stack when it enters and later pops it when it exits. In this context, a frame is a section of memory that holds our local function variables that cannot be stored in registers.

Our stack frame allocator works in much the same way except that pushing and popping of frames is made explicit. This makes it a very powerful and familiar tool.

Our frame based allocators will use a common AllocatorFrame class and the usage is simple:

{
    AllocatorFrame frame(allocatorInstance);
    // Allocate here with 'allocatorInstance'
}   // the frame pops when it goes out of scope

Popping of the frame is automatic when the destructor is called or it can be popped manually with AllocatorFrame::Pop().

In a frame based allocator such as this, when the frame is popped, all allocations made within our frame can be freed and the memory reclaimed. Different allocator variants will handle this differently as we will later see. The bottom line is that you should not keep pointers to that memory since it is no longer valid. On the plus side, freeing memory is very fast as a bulk operation.
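For example, nested frames compose naturally. This hypothetical usage assumes the allocator exposes an Allocate(size, alignment) method:

{
    AllocatorFrame outer(allocatorInstance);
    void* a = allocatorInstance.Allocate(128, 16);

    {
        AllocatorFrame inner(allocatorInstance);
        void* b = allocatorInstance.Allocate(256, 16);
    }   // 'b' is reclaimed here in bulk

    // 'a' is still valid here; 'b' must no longer be referenced
}       // 'a' is reclaimed here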

What can we use it for

It allows us to allocate memory dynamically and return it from a function. This is very common and handy when we do not know the maximum number of potential results returned (maybe we have a rare worst case scenario) and we want to avoid growing the call stack.

It supports realloc for the topmost allocation, useful when appending to vectors.
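To see why restricting realloc to the topmost allocation makes it cheap, consider this sketch reusing the illustrative Segment from the greedy post above: if the allocation being resized is the last one in the live segment, we simply move the bump pointer.

// Hypothetical in-place realloc: only the topmost allocation can grow or
// shrink in place; otherwise the caller falls back to allocate + copy.
void* ReallocateTopmost(Segment& segment, void* oldPtr,
                        size_t oldSize, size_t newSize)
{
    uint8_t* top = segment.memory + segment.allocatedSize;
    bool isTopmost = static_cast<uint8_t*>(oldPtr) + oldSize == top;

    if (isTopmost && segment.allocatedSize - oldSize + newSize <= segment.size)
    {
        segment.allocatedSize = segment.allocatedSize - oldSize + newSize;
        return oldPtr;  // resized in place, no copy needed
    }

    return nullptr;     // not topmost or no room: allocate + copy instead
}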

When using a thread local allocator, we can avoid expensive heap accesses and get better performance thanks to reduced cache and TLB usage. Execution stack memory is typically forced to use 4KB pages controlled by the kernel, but our parallel stack can use any page size it wants (when such control is available to user space).

Moving large allocations (such as strings) off the call stack reduces the risk that a malicious user can overwrite and abuse the return address in the presence of buffer overflow bugs. It can trivially replace alloca and everything it is used for.

Depending on the variant, it can access all system memory unlike our linear allocator.

Because it is often faster than a generalized heap allocator (potentially shared between multiple threads), if our allocated data is thread local (e.g., a vector on the stack), we can trivially migrate the allocations to our frame allocator and save on allocation, deallocation, and reallocation.

What we can’t use it for

Much like our linear allocator, stack frame allocators do not support generalized deallocation of memory. This is largely mitigated by the fact that freeing is still supported but now happens in a last in/first out order due to our stack frames.

Performance

Performance of stack frame based allocators is generally very good due in large part to their simplicity when allocating and the bulk free operation. It is very common to have a stack frame allocator per thread, which also avoids locking since the allocator generally does not need to be shared between threads.

Generally speaking, the performance is better than that of a generalized heap allocator because by design we use a smaller virtual memory range, or we guarantee that we can re-use a previously used range, keeping our TLB usage optimal. This also helps the CPU cache.

Conclusion

Up next we will cover a simple variant: the greedy stack frame allocator.

Alternate names

This memory allocator is often abbreviated to simply 'Stack Allocator'. There are so many possible variants that there are probably just as many names.

Note that if you know a better name or alternate names for this allocator, feel free to contact me.

Reddit thread

Back to table of contents

GPGPU woes part 4

After fixing the previous addressing issue, we move on to the next.

Inconsistencies and limitations

Here are the outputs for the various OpenCL driver values on my laptop:

Windows 8.1 x64

  • Device: Intel(R) Iris(TM) Graphics 5100

    • Max workgroup size: 512
    • Num compute units: 40
    • Local mem size: 65536
    • Global mem size: 1837105152
    • Is memory unified: true
    • Cache line size: 64
    • Global cache size: 2097152
    • Address space size: 64
    • Kernel local workgroup size: 512
    • Kernel local mem size: 0
  • Device: Intel(R) Core(TM) i5-4278U CPU @ 2.60GHz

    • Max workgroup size: 1024
    • Num compute units: 4
    • Local mem size: 32768
    • Global mem size: 1837105152
    • Is memory unified: true
    • Cache line size: 64
    • Global cache size: 262144
    • Address space size: 64
    • Kernel local workgroup size: 1024
    • Kernel local mem size: 12288

OS X x64

  • Device: Iris

    • Max workgroup size: 512
    • Num compute units: 40
    • Local mem size: 65536
    • Global mem size: 1610612736
    • Is memory unified: true
    • Cache line size: 0
    • Global cache size: 0
    • Address space size: 64
    • Kernel local workgroup size: 512
    • Kernel local mem size: 30720
  • Device: Intel(R) Core(TM) i5-4278U CPU @ 2.60GHz

    • Max workgroup size: 1024
    • Num compute units: 4
    • Local mem size: 32768
    • Global mem size: 0
    • Is memory unified: true
    • Cache line size: 3145728
    • Global cache size: 64
    • Address space size: 64
    • Kernel local workgroup size: 1
    • Kernel local mem size: 0

As we can see, many things differ and both devices seem to report bad values for a few fields.

An important difference between the CPU and GPU is that CPU local storage is limited to 32KB. But why? Couldn't the CPU variant simply consider local storage to be the same as normal memory and thus be bounded only by virtual (or physical) memory?

This seemingly arbitrary limitation makes a lot of sense when you consider that local storage is meant to avoid touching main memory by keeping intermediate results in fast memory. The CUDA programming guide states that functions are exposed to let you tweak how the GPU L1 is split between real L1 usage and local storage; L1 is fast, really fast.
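For reference, here is the CUDA runtime call in question; the kernel below is a placeholder:

#include <cuda_runtime.h>

__global__ void myKernel() { /* placeholder kernel */ }

void ConfigureKernel()
{
    // Ask the driver to favor local storage (shared memory in CUDA terms)
    // over L1 cache for this kernel; the splits actually available depend
    // on the GPU architecture.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
}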

Sadly, on the CPU we usually have no such control over how the L1 is used, and it is also quite small: usually 32KB for code and 32KB for data. Note that some exotic platforms do allow you some minimal control over the CPU caches, such as the Wii cache line locking instructions. However, generally speaking, it is very hard to control effectively by hand.

With most CPUs (including the one I am using) having very small L1 caches, it makes sense to limit local storage to the size of the L1. After all, if you design an algorithm around the use of very fast memory, performance could be adversely affected should you have less available in reality.