GPGPU woes part 3

After fixing the previous mind-bending issue, we move on to the next one.

On the GPU, the address space size can differ

Virtually every programmer out there knows that the address space size on the CPU can differ based on the hardware, the executable, and the kernel. The two most common sizes are 32 and 64 bits (the latter only partially true since in practice only 48 bits are currently used). Some older programmers will also remember the days of 16-bit address spaces.

What few programmers realize is that other chips in their computer might have a different address space size, and as such, sharing binary structures that contain pointers with them is generally unsafe.

Case in point: my GPU has a 64-bit address space.

What this means is that if I run an i386 executable on either OS X or Windows, pointers will be 4 bytes on the CPU while the GPU will use 8 bytes for them. As a result, shared structures like this one will not work:

struct dom_node
{
    struct dom_node *parent;
    cl_int id;
    cl_int tag_name;
    cl_int class_count;
    cl_int first_class;
    cl_int style[MAX_STYLE_PROPERTIES];
};

When running on the GPU, the above structure caused reads and writes far past the end of the buffer that contained these nodes. This caused the output to differ from the expected result of the CPU version.

In particular, it ended up corrupting a read-only buffer the kernel was using (the read-only stylesheet), causing further confusion since the output differed from run to run. I was lucky not to corrupt anything more critical, which could have made debugging the issue much harder (e.g: a driver crash).

The fix is to ensure there is proper padding by wrapping the pointer in a union:

struct dom_node
{
    union
    {
        struct dom_node *parent;
        cl_ulong pad_parent;
    };
    cl_int id;
    cl_int tag_name;
    cl_int class_count;
    cl_int first_class;
    cl_int style[MAX_STYLE_PROPERTIES];
};

OpenCL provides integer types with the cl_ prefix to ensure their size does not differ between the CPU and GPU, but sadly no such trick is possible with pointers: we have to resort to our manual union hack. In practice, we could probably introduce a macro to wrap this and make it cleaner.
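
Something along these lines could work. This is only a sketch of the idea; the macro name is mine and is not part of the actual code:

// Hypothetical helper, not from the actual code: pads a pointer member to
// 8 bytes so that the struct layout matches a 64-bit GPU address space
// even when the host executable is 32-bit.
#define GPU_SHARED_PTR(type, name) \
    union                          \
    {                              \
        type *name;                \
        cl_ulong pad_##name;       \
    }

struct dom_node
{
    GPU_SHARED_PTR(struct dom_node, parent);
    cl_int id;
    cl_int tag_name;
    cl_int class_count;
    cl_int first_class;
    cl_int style[MAX_STYLE_PROPERTIES];
};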

In truth, sharing structures that contain pointers with the GPU isn't a good idea in the first place. Unless memory is unified, the GPU will not be able to access the pointed-to memory, and carrying the pointer around is wasteful. Even when memory is unified, GPU accessible memory typically must be allocated with particular flags, and depending on the platform this might only be possible through the driver.
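
For reference, this is roughly what such a driver allocation looks like in OpenCL. This is an illustration only; the context and node_count parameters are assumptions, not taken from the sample code:

#include <CL/cl.h>

// Illustration only: ask the driver for a buffer the GPU can access,
// instead of handing it a structure full of raw host pointers.
cl_mem create_dom_buffer(cl_context context, size_t node_count, cl_int *error)
{
    return clCreateBuffer(
        context,
        CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, // let the driver pick suitable memory
        sizeof(struct dom_node) * node_count,
        NULL,                                      // no host pointer supplied
        error);
}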

GPGPU woes part 2

After fixing the previous painful issue, we move on to the next strange issue (and much worse!).

On the GPU, a NULL pointer isn’t always NULL

Despite the code being very simple, the CPU and GPU versions produced very different outputs for me on OS X (not verified on Windows 8.1). For the life of me I could not figure it out, as it proved even stranger than the previous issue.

Ultimately, the issue was here:

// Inlined the CSS_CUCKOO_HASH_FIND macro and minor formatting cleanup

__global const struct css_rule * rule = 0;
if (hash_->left[left_index_].type != 0 && hash_->left[left_index_].value == value_) {
    rule = &hash_->left[left_index_];
}
if (hash_->right[right_index_].type != 0 && hash_->right[right_index_].value == value_) {
    rule = &hash_->right[right_index_];
}
if (rule != 0) {
    // code

}

Can you see it? If you can’t, I don’t think anybody could blame you.

After painfully narrowing it down to this precise piece of code, it turned out that the rule != 0 check was failing on the GPU when no rule was found (neither if statement was taken), while it obviously worked as expected on the CPU.

This is mere speculation, but my gut feeling is that the issue is caused by the GPU internally using some bits of the memory address to tell global memory apart from local memory.

It is entirely possible that a NULL pointer for global memory might not equal a NULL pointer for local memory (or constant memory). In such a scenario, comparing both would return false. Perhaps the memory qualifier information was lost and the optimizer was left to perform a comparison with the literal 0 value.

The only other alternative would be a compiler bug. Sadly I could not take a look at the generated assembly to double check what was happening.

The fix was simply to introduce a boolean/integer variable that we could safely compare against. Again, note that I was more concerned with getting the code to work while keeping it as close to the original as possible than with making it fast. It is also possible that force casting the 0 literal to the pointer type might have worked; this is left as an exercise to the reader.
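
Based on the snippet above, the workaround looks roughly like this (a sketch, not the exact committed code):

__global const struct css_rule * rule = 0;
int rule_found = 0; // compare against this flag instead of the pointer
if (hash_->left[left_index_].type != 0 && hash_->left[left_index_].value == value_) {
    rule = &hash_->left[left_index_];
    rule_found = 1;
}
if (hash_->right[right_index_].type != 0 && hash_->right[right_index_].value == value_) {
    rule = &hash_->right[right_index_];
    rule_found = 1;
}
if (rule_found != 0) {
    // code that uses rule
}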

GPGPU woes part 1

Curiosity about the Servo project from Mozilla finally got the best of me and I looked around to see if I could contribute somehow. A particular exploratory task caught my eye that involved running the CSS rule matching on the GPU. I forked the original sample code and got to work.

Little did I know I would hit some very weird and peculiar issues related to OpenCL on the Intel Iris 5100 inside my MacBook Pro. These issues are so exotic and rarely discussed that I figured they warranted their own blog posts.

Just getting the sample to work reliably on OS X and Windows 8.1 while attempting to get identical results in x86, x64, CPU, and GPU versions proved to take considerable time due to various issues.

Moving pieces

GPGPU has a lot of moving pieces that can easily cause havoc. Code is typically written in a C-like dialect (OpenCL, CUDA) and it is easy to make mistakes if you are not careful. If the old adage of blowing your foot off with C is true on the CPU, writing C on the GPU is probably akin to blowing up your whole neighbourhood along with your foot.

The first moving piece is the driver you use. Drivers differ per platform and per hardware, and how they get updated differs as well. Driver bugs are not unheard of and are in fact quite frequent, since the hardware is still rapidly evolving and the range of supported devices grows every day. For example, OS X provides the drivers, as opposed to the manufacturer providing them as they do on Windows. This means the update process is much slower.

The second moving piece is the hardware itself. Even from a single manufacturer there is considerable variation: the number of compute units, the size of local storage, the size of the address space, all the way to whether memory is unified with the CPU or not.

This brings us to our first issue.

On the GPU, there is no stack

The first exotic issue I hit was that the original code would not run for me. On Windows 8.1, the GPU would hit 100% utilization and cause the driver to time out (sometimes forcing me to power cycle). On OS X, after I modified the program to run the kernel more than once to gather average timings, the kernel would return after 5 or 10 seconds of runtime and attempting to run it a second time would crash the program.

After hunting for several hours, I finally found the culprit: a C-style stack array with 16 elements. The total size of this array was 3 integers times 16 elements, or 192 bytes. This seems fairly small, but that impression fails to take into account how the GPU and the generated assembly handle C-style stack arrays.

struct css_matched_property
{
    cl_int specificity;
    cl_int property_index;
    cl_int property_count;
};

// Later inside the kernel function

struct css_matched_property matched_properties[16];

On the GPU, there is no stack. From past experience, the generated assembly will attempt to keep everything in registers instead of putting it in local or global storage (since using local or global storage requires explicit keywords). For the same reason, it will also fail to spill to local or global memory if we run out of registers. In practice the driver could probably spill to memory when this happens, but the performance would be terrible.

According to the hardware specifications of my GPU, each thread has 128 registers that each store 32 bytes (SIMD 8 elements of 32 bits). The above array requires 48 such registers if the data is not coalesced into fewer registers, which is a reasonable assumption since we use a struct with 3 individual integers. Along with everything else going on in the function, my kernel would in all likelihood exhaust all available registers (due to the nature of the crash, I failed to get exact measurements of the number of registers used).

This marks the second time I see a GPU crash caused by the driver attempting to run a kernel that requires more registers than are available.

These sorts of issues are nasty since the same kernel would work on hardware with more registers. The code also looks clean and simple if you aren't aware of what happens behind the curtains.

The fix, as is probably obvious by now, was to move the array into shared local memory. This way, we can calculate exactly how much memory we require and, based on the amount available on the given hardware, cap the maximum number of threads we execute in a group to avoid running out.

const cl_int MAX_NUM_WORKITEMS_PER_GROUP = 320;
__local struct css_matched_property matched_properties[16 * MAX_NUM_WORKITEMS_PER_GROUP];
cl_int matched_properties_base_offset = 16 * get_local_id(0);
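
On the host side, the cap itself (320 above) can be derived from how much local memory the device exposes. Here is a rough sketch of that calculation, with names of my own choosing:

#include <CL/cl.h>

// Rough host-side sketch: how many work items fit in a group if each one
// needs 16 matched properties worth of local storage.
size_t max_workitems_per_group(cl_device_id device)
{
    cl_ulong local_mem_size = 0;
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_mem_size), &local_mem_size, NULL);

    size_t max_group_size = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(max_group_size), &max_group_size, NULL);

    size_t bytes_per_workitem = 16 * sizeof(struct css_matched_property);
    size_t max_items = (size_t)(local_mem_size / bytes_per_workitem);

    return max_items < max_group_size ? max_items : max_group_size;
}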

Keep in mind that at this stage of development, I was more concerned with getting the code running correctly than with getting it to run fast. There is no point in having fast code that does not do what you want.

Out of Memory

I have been playing Game of War: Fire Age for some time now and a peculiar issue keeps recurring which prompted this post: the game often runs out of memory.

This issue is rarely discussed, and its solutions even less so. This post is an attempt to document its various causes and solutions and to offer some insight into them.

The causes

In video games, the memory workload is typically very predictable: we have hard limits on many features (e.g: max number of players) and generally fixed data. These two things coupled together imply that out of memory situations are generally quite rare. The typical causes are as follows:

An excessively large memory size is requested and cannot be serviced.

By this I mean that 2GB or more might be requested on a device with very little memory (e.g: 256MB). This is generally caused by attempting to allocate memory with a negative size: allocation sizes are typically represented in C++ with size_t, which is unsigned, but depending on the warning level and the compiler, an automatic (or sometimes programmer-forced) coercion from a signed value can happen. This is generally an error due to unforeseen circumstances; it rarely happens in a released title but does happen from time to time during development.
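
A minimal illustration of how a negative size turns into an enormous request:

#include <stdio.h>
#include <stddef.h>

int main(void)
{
    int element_count = -1; // result of some earlier logic error
    // The signed value is coerced to size_t when multiplied by sizeof.
    size_t request = element_count * sizeof(int);
    // On a 64-bit platform this prints 18446744073709551612 (roughly 16 exabytes),
    // on a 32-bit platform 4294967292 (roughly 4GB).
    printf("requested %zu bytes\n", request);
    return 0;
}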

More memory is allocated than the system allows.

Again, due to the predictable memory footprint of things in video games, it generally happens during development and very rarely in a released title.

Over time, due to memory leaking, you run out of memory.

This can happen in released titles and more than a few have shipped with memory leaks.

Memory fragmentation.

If memory pressure is high and fragmentation is present, an allocation request might fail even though enough free memory exists to service it (either because of fragmentation in user space or because of physical memory fragmentation on some embedded devices). Fragmentation is a real and painful issue to deal with when it creeps up. It will often remain hidden until very late in development, primarily for two reasons: final content often comes very late in production, and until late in production the game is rarely run for more than an hour at a time. Out of memory situations of this kind can happen in released titles on memory constrained devices; I see it at least twice a day in Game of War on my Android tablet.

How to deal with it

On memory constrained devices, if you have memory fragmentation it is a fact of life that you will hit out of memory situations. This is even more likely if your software might run on devices below your minimum specifications (e.g: mobile Android devices). When this happens, there are a few ways to deal with it. Here are the ones that come to mind, in increasing order of complexity:

Do nothing and let it crash and burn.

Many games go this route and it costs literally nothing to adopt this strategy (if you can call it that). Sadly, not all crashes are equal in impact. Saving a game commonly generates quite a few memory allocations, and crashing while the game is saving can easily result in a corrupted save game. For obvious reasons, this is very bad for the user experience.

Let it crash, but do so in a controlled manner.

Games that realize they might crash and opt to handle it with the least amount of effort will typically poll how much free memory remains and, when it passes a threshold, crash in a controlled and safe manner. Typically this means doing so when the game isn't doing anything important (such as saving) and presenting some kind of fatal error message to the user. As far as I can recall, a version of Gears of War running on Unreal 3 simply displayed a dirty disc error. This is generally considered acceptable: while it isn't ideal for the user, at least nothing of value is lost and it remains a minor annoyance (depending on the frequency, of course).
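
As a rough sketch of the idea (every function name below is hypothetical, not taken from any particular engine):

#include <stddef.h>

// Hypothetical per-frame check: all of the names below are made up for illustration.
static const size_t FATAL_FREE_MEMORY_THRESHOLD = 10 * 1024 * 1024; // e.g: 10MB

void poll_memory_watchdog(void)
{
    size_t free_memory = query_free_system_memory(); // platform specific, assumed to exist

    // Only fail when nothing critical (such as saving) is in flight.
    if (free_memory < FATAL_FREE_MEMORY_THRESHOLD && !is_save_in_progress())
    {
        show_fatal_error_screen(); // e.g: a generic "dirty disc" style message
        flush_logs_and_terminate();
    }
}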

Deal with it in a clever way.

Game of War is a good example of this. When the game runs out of memory, it sends you back to your home city screen and briefly flashes some colours. (I do not have access to the source code to confirm this, but it appears to me to be the cause of this peculiar behaviour.) This can happen almost anywhere except on the city screen itself, likely because the city screen has a low or very predictable memory footprint. This is superior to the previous approaches: while it remains a minor annoyance, at least you stay within the game and can presumably continue playing for a little while longer.

Fix the underlying issues.

This often requires the largest time investment. Not only does it require extensive testing to make sure you do not exceed the maximum memory allowed even when all your imposed hard limits are reached (e.g: a full 16 vs 16 player match), but it often also requires dealing with memory fragmentation and making sure there is none or very little.

Case study: AAA title (2014)

During the development of AAA title (2014), our final content for most maps came in very late in development and it all came at once. This made testing everything very hard. We knew very early on that one platform in particular would struggle with memory pressure, and that prediction proved very accurate: our PlayStation 3 version suffered from rampant out of memory situations.

A number of factors led to this:

  • 64KB page sizes meant that our memory allocator had to deal properly with virtual memory to avoid fragmentation.
  • The PS3 has ~213MB of usable main memory and ~256MB of usable video memory. While you can use video memory as general purpose memory, accessing it from the CPU is very slow and is generally not recommended. This makes it the platform with the least amount of general purpose memory.
  • With high memory pressure comes memory fragmentation issues.

While we made sure to perform memory optimizations throughout development to reduce our footprint, ultimately this proved insufficient as we neared our release date. The final major memory optimizations (both code and data) came in about 6 months before we released our title. Around that time, memory fragmentation reared its ugly head and the battle began.

Fighting memory fragmentation is hard and painful. Even though I knew how it happens before facing it, I had never had actual experience dealing with it.

The battle raged on for 6 months before we finally eradicated it for good. We ultimately released the game with much more free memory than we anticipated: our efforts finally paid off.

But that is not the whole story. Dealing with memory fragmentation is complicated and is best left to a future blog post. I will however discuss our plan to deal with our worst case scenario: failure to fix our memory fragmentation issues.

Few people on AAA title (2014) knew how bad it really got. At one point I had over 100 separate bug reports of out of memory issues: so many that whenever I made an improvement, all I could do was claim everything as fixed and see what came back. It became a running gag that I had over half the bug reports assigned to me.

At some point, about 2 months before our release date, we could play the game for about an hour or two before running out of memory due to fragmentation. It was bad, and we weren't sure we would be able to fix the issue in time for release, or even in time for our first patch. To prepare for this scenario, we took an approach similar to Game of War: when we detected a low memory situation, we forced a map transition into the Player Hub level. This was a small level that you returned to in between story arcs, which made it the perfect place. The map transition was expected to always succeed (at least before the user got tired!) since whatever was unloading was larger than what we loaded. The transition would also save your progress, ensuring that your save game could never be corrupted by this.

It was a horrible, ugly hack, but it was necessary. It was better than crashing, and if the user kept playing and fragmentation became really bad, at worst they would not be able to leave that level without being sent back into it; eventually they would get the message and restart the title. A necessary evil.

Ultimately, only 2 or 3 bug reports ever spoke of this weird behaviour, and by the time we released our game, our memory fragmentation issues were fixed and the hack had become unnecessary. In the end we removed it from the final product since it only targeted a single platform and no longer served a purpose. I personally played the game for over 4 consecutive hours the night prior to our release and made sure our free memory never dipped below our acceptable threshold: 10MB. Most maps ended up with 15-20MB of free memory, with our biggest maps closer to 10MB.

These hacks are a poor substitute for a real fix but with the pressures of the real world, they are often a necessary and realistic option. Do you have a similar war story?


Virtual Memory Aware Linear Allocator

This allocator is a variant of the linear allocator we covered last time, and again it serves to introduce a few important allocator concepts. Today we cover the virtual memory aware linear allocator (code).

How it works

The internal logic is nearly identical to the linear allocator with a few important tweaks:

  • We Initialize the allocator with a buffer size. The allocator will use this size to reserve virtual memory but it will not commit any physical memory to it until it is needed.
  • Allocate will commit physical memory on demand as we allocate.
  • Deallocate remains unchanged and does nothing.
  • Reallocate remains largely unchanged and like Allocate it will commit physical memory when needed.
  • Reset remains largely the same but it now decommits all previously committed physical memory.

Much like the vanilla linear allocator, the buffer is not modified by the allocator and there is no memory overhead per allocation.
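
The general shape of those tweaks, using Windows style virtual memory calls, might look something like this. This is only a sketch under my own assumptions; the linked code is the real implementation:

// Sketch only (the linked code is the real implementation): reserve the
// address space up front, commit physical pages on demand as allocations grow.
#include <windows.h>
#include <stddef.h>
#include <stdint.h>

class VMemLinearAllocator
{
public:
    bool Initialize(size_t buffer_size)
    {
        m_buffer_size = buffer_size;
        m_allocated_size = 0;
        m_committed_size = 0;
        // Reserve address space only; no physical memory is committed yet.
        m_buffer = static_cast<uint8_t*>(VirtualAlloc(nullptr, buffer_size, MEM_RESERVE, PAGE_NOACCESS));
        return m_buffer != nullptr;
    }

    void* Allocate(size_t size, size_t alignment)
    {
        size_t offset = AlignTo(m_allocated_size, alignment);
        if (offset + size > m_buffer_size)
            return nullptr; // out of reserved virtual memory

        size_t required = offset + size;
        if (required > m_committed_size)
        {
            // Commit on demand, rounded up to the commit granularity
            // (4KB here, but any larger multiple would do, e.g: 128KB).
            size_t new_committed = AlignTo(required, COMMIT_GRANULARITY);
            if (VirtualAlloc(m_buffer + m_committed_size, new_committed - m_committed_size, MEM_COMMIT, PAGE_READWRITE) == nullptr)
                return nullptr; // out of physical memory
            m_committed_size = new_committed;
        }

        m_allocated_size = required;
        return m_buffer + offset;
    }

    void Reset()
    {
        // Decommit everything but keep the reservation. A real policy might
        // keep some slack committed to reduce kernel round trips.
        VirtualFree(m_buffer, m_committed_size, MEM_DECOMMIT);
        m_allocated_size = 0;
        m_committed_size = 0;
    }

private:
    static constexpr size_t COMMIT_GRANULARITY = 4 * 1024;

    static size_t AlignTo(size_t value, size_t alignment)
    {
        return (value + (alignment - 1)) & ~(alignment - 1);
    }

    uint8_t* m_buffer = nullptr;
    size_t m_buffer_size = 0;
    size_t m_allocated_size = 0;
    size_t m_committed_size = 0;
};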

There are two things currently missing from the implementation. The first is that we do not specify how eagerly we commit physical memory. There is a cost associated with committing (and decommitting) physical memory since we must call into the kernel to update the page tables (and invalidate the relevant TLB entries). Depending on the usage scenario, this can be an important cost. Currently we commit memory with a granularity of 4KB (the default page size on most platforms). In practice, even if the system uses 4KB pages, we could commit with any larger multiple of that size (e.g: commit in blocks of 128KB).

The second missing detail is that we simply decommit everything when we Reset. In practice, some sort of policy would be used here to manage slack, both at initialization and when we reset. For the reasons stated above, committing and decommitting have a cost and reducing that overhead is important; in many cases it makes sense to keep some amount of memory always committed. While these two details are not necessarily critical for this allocator variant, in many other allocators they are of critical importance. Decommitting memory is often very important to fight fragmentation and is critical when multiple allocators must work alongside each other.

What can we use it for

Unlike the vanilla linear allocator, because we commit physical memory on demand, this allocator is better suited for large buffers or for buffers where the used size is not known ahead of time. The only requirement is that we know an upper bound.

In the past, I have used something very similar to manage video game checkpoints. It is often known, or enforced by the platform, that a save game or checkpoint cannot exceed a known constant size. However, it is very rare that we know the exact size at the time the buffer must be created. To create your save game, you can simply employ this linear allocator variant with that upper bound, put it behind a stream type class to facilitate serialization, and be done with it. You can sleep well at night knowing that no realloc will happen and no memory will be needlessly copied.
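
As a purely hypothetical usage sketch, reusing the allocator sketched above (the sizes and calls here are made up for illustration):

// Hypothetical usage: the platform guarantees a save never exceeds 8MB,
// but the exact size is unknown when the buffer must be created.
static const size_t MAX_SAVE_SIZE = 8 * 1024 * 1024;

void write_save_game()
{
    VMemLinearAllocator save_allocator;
    save_allocator.Initialize(MAX_SAVE_SIZE); // reserves 8MB, commits nothing yet

    // A stream type built on top would only commit what it actually writes.
    void* header = save_allocator.Allocate(256, 16);        // e.g: a fixed size header
    void* payload = save_allocator.Allocate(64 * 1024, 16); // grows as serialization proceeds

    // ... serialize into header/payload, flush to storage, then release everything.
    save_allocator.Reset();
}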

What we can’t use it for

Much like the vanilla linear allocator, this allocator is ill suited if freeing memory at the pointer granularity is required.

Edge cases

The edge cases of this allocator are identical to the vanilla linear allocator with the exception of two new ones we add to the list.

When we first initialize the allocator, virtual memory must be reserved. Virtual memory is a finite resource and can run out like everything else. On 32 bit systems, this value is typically 2 to 4GB. On 64 bit systems, this value is very large since typically 40 bits are used to represent it.

When we commit physical memory, we might end up running out. In this scenario, we still have free reserved virtual memory but we are out of physical memory. This can happen if physical memory becomes fragmented while the application uses mixed page sizes. For example, consider an allocator A using 2MB pages while another allocator B uses 4KB pages. The system might end up with free holes of physical memory that are all smaller than 2MB. Since a large page must map a contiguous region of physical memory, this is bad. On many platforms, the kernel will defragment physical memory when this happens by copying memory around and remapping the page table entries. However, not all platforms will do this and some will simply bail out on you.

Potential optimizations

Much like the vanilla linear allocator, all the previously observed optimization opportunities are also available here. Also, as previously discussed, depending on how greedy we are with committing and how much slack we keep when decommitting, we can tune the performance quite effectively.

Another notable optimization avenue, not for the faint of heart, is to remove the commit check inside the Allocate function and instead let the system generate an invalid access fault. By registering a fault handler that knows about our allocator, we could commit memory when the fault happens and retry the faulting instruction. Depending on the granularity of the pages used to commit memory, this could somewhat reduce allocation overhead by removing the branch required for the check.
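
For illustration only (this is not part of the implementation), the POSIX flavour of that idea looks roughly like this: reserve with PROT_NONE, commit the faulting page from a SIGSEGV handler, and let the faulting instruction retry:

// Illustration only, not production code: commit pages lazily from a fault
// handler instead of branching inside Allocate.
#include <signal.h>
#include <sys/mman.h>
#include <stdint.h>
#include <stddef.h>

static uint8_t* g_buffer = nullptr;
static size_t g_buffer_size = 0;
static const size_t k_page_size = 4096;

static void commit_on_fault(int /*sig*/, siginfo_t* info, void* /*context*/)
{
    uint8_t* fault_address = static_cast<uint8_t*>(info->si_addr);
    if (fault_address >= g_buffer && fault_address < g_buffer + g_buffer_size)
    {
        // Make the faulting page accessible; returning from the handler
        // retries the faulting instruction, which now succeeds.
        uintptr_t page = reinterpret_cast<uintptr_t>(fault_address) & ~(k_page_size - 1);
        mprotect(reinterpret_cast<void*>(page), k_page_size, PROT_READ | PROT_WRITE);
    }
    // A real implementation would forward unrelated faults to the previous handler.
}

void initialize_lazy_commit(size_t buffer_size)
{
    g_buffer_size = buffer_size;
    // Reserve address space with no access rights; any touch faults.
    g_buffer = static_cast<uint8_t*>(mmap(nullptr, buffer_size, PROT_NONE,
                                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));

    struct sigaction action = {};
    sigemptyset(&action.sa_mask);
    action.sa_sigaction = commit_on_fault;
    action.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &action, nullptr);
}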

While this implementation uses a variable to keep track of how much memory is committed, depending on the actual policy used it could potentially be dropped as well. It was added for simplicity, not out of necessity: in the current implementation, the allocated size could simply be rounded up to a multiple of the page size to derive the committed size.

Performance

Due to its simplicity, it offers great performance. All allocation and deallocation operations are O(1) and only amount to a few instructions. However, resetting and destroying the allocator now carry the added cost of decommitting physical memory and will thus be linearly dependent on the amount of committed memory.

On most 32 bit platforms, the size of an instance should be 28 bytes if size_t is used. On 64 bit platforms, the size should be 56 bytes with size_t. Both versions can be made even smaller with smaller integral types such as uint32_t or by stripping support for Reallocate. As such, either version will comfortably fit inside a single typical cache line of 64 bytes.

Conclusion

Once again, this is a very simple and perhaps even toy allocator. However, it serves as an important building block to discuss the various implications of committing and decommitting memory, as well as the general implementation details surrounding them. Manual virtual memory management is a classic and important tool of modern memory allocators, and seeing it put to use in a simple context makes for a good learning example.

Fundamentally, linear allocators are a variation of a much more interesting and important allocator: the stack frame allocator. In essence, linear allocators are stack frame allocators that support only a single frame: the frame is pushed at initialization and popped when we reset or at destruction.

Next up, we will cover the ever so useful stack frame allocators.

Alternate names

To my knowledge, there is no alternate name for this allocator since it isn't really an allocator one would see in the wild.

Note that if you know a better name or alternate names for this allocator, feel free to contact me.
