03 May 2015
gin aims to be a playground for high performance C++ ideas, all under the MIT license. gin is currently empty but it will not remain so for long!
Here is an overview of the things I intend to cover in the coming posts and that will be included in the library:
- A convenient memory allocator interface suitable for high performance and containers
- Several high performance memory allocators (linear, frame, small block, heap, external, etc.)
- A number of containers using the above technology (all sorts of array variants, bit sets, etc.)
- And much more!
I will use C++11 features where relevant and keep things in headers as much as I can for easy integration as well as all the popular reasons.
I chose the name gin for a number of reasons: it is a refreshing drink for the upcoming summer months, there is no associated programming material with that word, it is short, and if I ever need a side project, I can cleverly name it tonic.
The code can be found here on github.
01 May 2015
In the AAA video game community, it is common to reimplement containers to fix a number of deficiencies; EA STL is a very good example of this. You can find a document here that explains the reasoning behind it in great detail. Some of the points are a bit dated and have been fixed with C++11 or will be fixed in C++14. I will not repeat them here, but I will add some of my own.
Some function names are unintuitive
deque and list have push/pop, which are symmetric, and this is good. But insert/erase seems a poor choice of words, and we have the emplace variants which overlap the two functionalities.
vector appears to have a mix of queue/stack-like functions (push/pop and front/back), but this choice of words is a poor fit when talking about arrays.
Naming things is always a sensitive subject, and in the interest of discussion I give the following two examples: Java uses add/remove for all container operations, while C# uses Add/Remove and Insert/RemoveAt.
Vector is a poor container name
In video games and mathematical applications, vectors are N dimensional. The STL container name reflects this, but in video games vectors are almost always 2D, 3D, or 4D. Ultimately, this can yield some confusion. Even in the vector documentation, it is explained in terms of arrays, while array is reserved for an array of fixed size that also happens to be inline. In practice, there are many useful variants of array implementations, and adding a simple qualifier to the name makes everything much clearer: FixedArray, InlineArray, DynamicArray, HybridArray, etc.
The meaning of the word ‘size’ is overloaded and confusing
Size in C++ means many things. size_t is defined as the type of the result of the sizeof family of operators. On most systems it will be the same size as the largest integral register (32 or 64 bits) and will typically match uintptr_t for this reason. The sizeof operators also return a size, and this size is measured in bytes (which can be 8 or more bits depending on the platform). For the containers, size() returns the number of elements.
Size is thus a type, a number of bytes, and a number of elements, depending on the context.
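To make the three meanings concrete, here is a small illustration (bytes_in is a hypothetical helper, written purely for demonstration):

```cpp
#include <cstdint>
#include <vector>

// 'size' as a type: size_t is what the sizeof operators yield.
// 'size' as a byte count: sizeof measures storage in bytes.
// 'size' as an element count: vector::size() counts elements.
std::size_t bytes_in(const std::vector<std::uint32_t>& v)
{
    // element count multiplied by bytes per element
    return v.size() * sizeof(std::uint32_t);
}
```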
Templating the allocator on the container type is a bad idea
EA STL touches on most of the issues related to the poor STL memory allocator support, but it forgets an important point: whatever speed is gained by templating on the allocator type comes at the cost of making code much harder to refactor. It is not unusual to write a function that takes a container as an argument, and at a later time the need to change the allocator type arises. This forces the user to change every site where the allocator type appears, which can end up being many functions. I have witnessed optimizations that could not be performed because the amount of work involved would be too great for this reason. It is also often desirable to write allocator-agnostic functions that require allocation/deallocation, without templating the whole function on the allocator type.
A common fix for this is to use an allocator interface with virtual functions which is simple and clean. The bitsquid engine does this. It is not without its faults but it is ultimately much more flexible. I plan to discuss this in further detail in later posts.
Containers are often larger than they need to be
It is not unusual to end up with hundreds of thousands or even millions of array containers and the like. In this scenario, every byte counts and this can end up bloating the objects that hold them. Generally, size_t is far too large to keep track of sizes and in fact, many video game engines will simply hardcode a uint32_t instead (only the most recent game consoles have more than 4GB of memory, and just barely). However, even this is often wasteful as the norm is to have tiny arrays, not large ones. It would be much more flexible if we had a base container templated on the type used to track the sizes and defined all the popular variants: Vector8, Vector16, Vector32, Vector64, etc. The container should then check for overflow and assert/fail when it happens. We could have a default Vector without a size specified that defaults to a sensible configurable value (e.g: Vector32 or Vector64).
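A minimal sketch of what such a size-templated container could look like; the growth policy and all names are illustrative only, not from an actual library:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of an array container templated on the type used to track sizes.
template<typename T, typename SizeType>
class Vector
{
public:
    Vector() = default;
    Vector(const Vector&) = delete;             // keep the sketch simple
    Vector& operator=(const Vector&) = delete;
    ~Vector() { delete[] m_data; }

    void push_back(const T& value)
    {
        if (m_size == m_capacity)
            grow();
        m_data[m_size++] = value;
    }

    SizeType size() const { return m_size; }

private:
    void grow()
    {
        // Assert instead of silently wrapping around when the size type overflows.
        const SizeType max = static_cast<SizeType>(~static_cast<SizeType>(0));
        assert(m_capacity <= max / 2 && "size type overflow");
        const SizeType new_capacity = m_capacity == 0 ? SizeType(4) : SizeType(m_capacity * 2);
        T* new_data = new T[new_capacity];
        for (SizeType i = 0; i < m_size; ++i)
            new_data[i] = m_data[i];
        delete[] m_data;
        m_data = new_data;
        m_capacity = new_capacity;
    }

    T* m_data = nullptr;
    SizeType m_size = 0;
    SizeType m_capacity = 0;
};

// The popular variants become simple aliases.
template<typename T> using Vector8  = Vector<T, std::uint8_t>;
template<typename T> using Vector16 = Vector<T, std::uint16_t>;
template<typename T> using Vector32 = Vector<T, std::uint32_t>;
template<typename T> using Vector64 = Vector<T, std::uint64_t>;
```

On a typical 64-bit platform, Vector8 here is smaller than Vector64 purely because of its size and capacity members, and that difference adds up quickly across millions of instances.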
Another source of waste is the allocator contained inside the containers. For the reasons discussed above, it is desirable to store a pointer to the allocator inside containers. Since a container that requires allocation will already have a pointer to the allocated memory, and since the container has two states (unallocated and allocated), we can store the allocator pointer inside the data pointer in the former case (using either a flag in the LSB or the capacity of the container to tell which state it is in) and simply append it at the front or back of the allocated memory in the latter case. This saves a few bytes in the container object itself, reducing the size of the holding object. I will discuss this again in a future blog post.
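A rough sketch of this pointer-packing idea, assuming the allocator type is at least 2-byte aligned so its low bit is free to serve as the state flag (all names here are invented for illustration):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// alignas(2) guarantees the low bit of an allocator's address is always
// zero, freeing it up as a state flag.
struct alignas(2) Allocator
{
    void* allocate(std::size_t size) { return std::malloc(size); }
    void deallocate(void* ptr) { std::free(ptr); }
};

// One pointer-sized member serves two roles.
class PackedBuffer
{
public:
    // Unallocated state: store the allocator pointer with the low bit set.
    explicit PackedBuffer(Allocator& alloc)
        : m_data(reinterpret_cast<std::uintptr_t>(&alloc) | 1)
    {}

    ~PackedBuffer()
    {
        if (is_allocated())
        {
            Allocator** raw = reinterpret_cast<Allocator**>(m_data) - 1;
            (*raw)->deallocate(raw);
        }
    }

    bool is_allocated() const { return (m_data & 1) == 0; }

    Allocator& allocator() const
    {
        if (!is_allocated())
            return *reinterpret_cast<Allocator*>(m_data & ~std::uintptr_t(1));
        // Allocated state: the allocator pointer lives just before the data.
        return **(reinterpret_cast<Allocator**>(m_data) - 1);
    }

    void allocate(std::size_t size)
    {
        Allocator& alloc = allocator();
        // Append the allocator pointer at the front of the allocated block.
        void* raw = alloc.allocate(size + sizeof(Allocator*));
        *static_cast<Allocator**>(raw) = &alloc;
        m_data = reinterpret_cast<std::uintptr_t>(raw) + sizeof(Allocator*);
    }

private:
    std::uintptr_t m_data;  // allocator pointer or data pointer
};
```

In both states the container remains exactly one pointer wide, yet the allocator is always reachable.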
Removing items from an unsorted array
It is generally the norm that array containers will hold their contents unsorted. When this is the case, upon removing an item, we can simply overwrite it with the last item of the array instead of shifting all the elements left. This ensures a much more deterministic runtime cost for this operation.
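This technique, often called swap-and-pop, can be sketched as follows with a std::vector (the helper name is illustrative):

```cpp
#include <cstddef>
#include <vector>

// Remove the element at 'index' from an unsorted array in O(1):
// overwrite it with the last element and shrink, instead of shifting
// every element after it one slot to the left.
template<typename T>
void unordered_remove(std::vector<T>& items, std::size_t index)
{
    items[index] = items.back();  // element order is not preserved
    items.pop_back();
}
```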
Containers hide everything
Sometimes, having a C-style API with a raw view is beneficial. The bitsquid foundation library explains and uses this. However, in the vast majority of cases it does not make sense to expose programmers directly to this, as it can be error prone (it becomes easy to make a container out of sync or corrupted). For example, bit sets often come in two variants: an inline array of bits and a dynamically allocated array of bits. It is often the case that the code will be duplicated to support both. However, at times a third case arises: when you need to make your dynamic bit array inline as part of a larger memory block (e.g.: imagine a dynamic array where for each slot we also store a bit; with the STL containers, we would have two dynamic allocations when one would suffice). In the latter scenario, none of the bit set functions can be reused and they must again be duplicated. In practice, all three cases can be implemented by reusing functions that take as input a raw view of the bit set along with the number of bits.
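A minimal sketch of such a raw view, as free functions over a bare word pointer and a bit index (names are illustrative):

```cpp
#include <cstddef>
#include <cstdint>

// Free functions operating on a raw view of a bit set: a word pointer
// plus a bit index. The inline, dynamic, and embedded variants can all
// reuse these instead of duplicating the logic.
inline void bitset_set(std::uint32_t* words, std::size_t bit)
{
    words[bit / 32] |= std::uint32_t(1) << (bit % 32);
}

inline void bitset_clear(std::uint32_t* words, std::size_t bit)
{
    words[bit / 32] &= ~(std::uint32_t(1) << (bit % 32));
}

inline bool bitset_test(const std::uint32_t* words, std::size_t bit)
{
    return ((words[bit / 32] >> (bit % 32)) & 1) != 0;
}
```

Each container variant then only has to decide where the words live; the bit manipulation itself is written once.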
I plan to look at a number of these topics more in depth in future blog posts, stay tuned!
30 Apr 2015
Very often, C++ interviews ask about the virtual table: how it works, how it is implemented, and the tricky bits around constructors & destructors. However, the following questions are rarely asked: why is it the way it is, and are there alternatives?
Seeing how the C++ standard does not mandate how virtual function dispatching should be implemented, it is interesting that most compilers ended up converging on the current table approach. The wikipedia article does a fairly good job of explaining how it works and gives a good idea of the problem it is trying to solve. However, it fails to address the assumptions and performance constraints that likely shaped the current design.
For simplicity’s sake, we will only consider the case of single inheritance but keep in mind all methods discussed here can be extended to support multiple inheritance.
Let us go back to the drawing board and take a closer look at the problem we are trying to solve, what our tools are and our performance constraints.
The problem:
It is often desirable to derive from a base class and override part of its behaviour. This is best achieved with a specialized function that the base class (or, for that matter, any code) calls when it is present, falling back to the base implementation when our only knowledge at a particular call site is the type of the base class.
All classes and functions are known at compile time and ideally we want to leverage that as much as possible.
Memory and CPUs are slow and scarce and we want to have this mechanism be as fast as possible and at the same time be as compact in memory as possible. We can safely assume that for some classes we could have a large number of such virtual functions and we can assume as well that for some classes we might have a large number of instances.
The inline solution:
The simplest solution is to store a function pointer for every such virtual function inside the object itself. At object construction we set these pointers, and later, when we need to invoke them, we call the correct function pointer.
The upside of such a simple method is that the information we need is right there inside the object and we can access it directly. It will also end up with the rest of the adjacent object data in whatever cache line is loaded to access the function pointer. By identifying which data is used by which function, we could move relevant data close to that function pointer and improve locality. Calling such a virtual function would thus incur a single data cache miss (to access the pointer).
The downside is that it does not scale. The more virtual functions we add, the larger our objects will end up being. This is made even worse on platforms with 64 bit pointers. The cost to initialize those data members also increases with each new virtual function. Finally, if enough virtual functions are added, our cache line might end up being completely filled by them such that no other immediately useful information ends up being loaded anyway.
Ultimately, this method will not degrade gracefully when either the number of object instances increases or the number of virtual functions rises.
This method is ideal when:
- the number of virtual functions per class is low
- meaningful data can be moved close to the virtual function pointer (e.g.: a getter and the returned value)
- the number of class instances is low
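A bare-bones sketch of the inline approach in C-style C++; the Shape type and its functions are invented purely for illustration:

```cpp
// Every overridable function is a raw function pointer stored directly in
// the object and set at construction.
struct Shape
{
    float (*area)(const Shape*);  // one pointer per 'virtual' function
    float width;
    float height;
};

inline float rectangle_area(const Shape* s) { return s->width * s->height; }
inline float triangle_area(const Shape* s)  { return s->width * s->height * 0.5f; }

inline Shape make_rectangle(float w, float h) { return Shape{ &rectangle_area, w, h }; }
inline Shape make_triangle(float w, float h)  { return Shape{ &triangle_area, w, h }; }
```

A call is then a single indirect call through the member, s.area(&s), and note how sizeof(Shape) grows by one pointer for every function added.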
The dispatch function:
A more flexible approach would be for our objects to have a single function pointer to a per class dispatch function. This function would then be called in a similar fashion to syscall(..): an extra argument to indicate which virtual function is expected would be provided. On object construction, similar to the previous approach, we simply set the proper dispatch function.
The upside is that now our cost per object is a single function pointer and we are very flexible. At runtime we could add extra data to change the dispatching behaviour (and indeed a mechanism similar to this is used by many dynamic languages for duck typing and such) and we are also very flexible in how we store the virtual function pointers: we can either store them in an array somewhere along with the identifier that we need to match, or we can store them inline in the assembly stream. The latter approach is also interesting because it could end up being more compact than the former on architectures where instruction sizes can vary and that support relative jumps (e.g.: x64). Another point in favour of the latter approach is that the code cache generally has less pressure than the data cache, by virtue of the facts that code execution is largely linear and more easily prefetched and that there is less code than data. Last but not least, a dispatch function for a given class only needs to test the virtual functions it implements and defer resolution to its base class should it not handle the call.
The downside is that finding the correct virtual function pointer to call introduces extra branching which could end up being poorly predicted. Various methods can mitigate this: sorting the virtual functions in a tree structure to guarantee O(log(n)) access time (it could even be updated at runtime to favour hot functions) or the introduction of a bloom filter as an early-out mechanism. Another problem is the calling convention: because for efficiency reasons we need a register to hold the virtual function identifier when we call our dispatch function, that register must be reserved to avoid shuffling all the registers and the stack when the effective virtual function is called. While this approach scales well when the number of object instances increases, it does not scale as well when the number of virtual functions rises.
Compared to the previous approach, we will incur one data cache miss for the dispatch function pointer and one (or more) code cache miss for the dispatching code. The compiler could be smart here and position the dispatch code close to the virtual functions of the same class, in the hope that they end up on the same page, thus avoiding a TLB miss when the actual function is called.
Much like the previous approach, this method does not degrade gracefully when the number of virtual functions increases.
This method is ideal when:
- The dispatching behaviour needs to be altered at runtime
- The dispatching logic is complicated
- The number of virtual functions per class is low
- The number of class instances is high
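A bare-bones sketch of the dispatch-function approach; all types and function identifiers are invented for illustration:

```cpp
#include <cstdint>

// One pointer per object to a per-class dispatch function; an extra
// identifier argument selects the behaviour, much like syscall(..).
enum FunctionId : std::uint32_t { FUNC_AREA, FUNC_PERIMETER };

struct Object;
using DispatchFn = float (*)(const Object*, FunctionId);

struct Object
{
    DispatchFn dispatch;  // single pointer per object
    float width;
    float height;
};

inline float base_dispatch(const Object*, FunctionId)
{
    return 0.0f;  // a real base class would handle or reject the call here
}

inline float rectangle_dispatch(const Object* obj, FunctionId id)
{
    // Only the functions this class implements are tested; everything
    // else is deferred to the base class dispatcher.
    switch (id)
    {
    case FUNC_AREA:      return obj->width * obj->height;
    case FUNC_PERIMETER: return 2.0f * (obj->width + obj->height);
    default:             return base_dispatch(obj, id);
    }
}
```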
The virtual table:
Finally we reach the current popular implementation: a single pointer to an array of virtual function pointers. Similar to previous methods, on object construction we simply set the proper array pointer. For this method to work, the array must contain function pointers for all virtual functions of the current class as well as all base classes. When we perform a virtual function call, knowing which function is called we can infer an offset in the array to use and simply call the correct virtual function pointer.
The upside is that like the previous approach, this scales well to a large number of object instances due to the low per object overhead but unlike the previous approach, it also scales well to a large number of virtual functions since finding the correct one is O(1).
There are no obvious downsides and in the end, this method will degrade gracefully.
However, things are not perfect: the array of virtual functions must be put somewhere. When we call a virtual function, we will thus incur two data cache misses: one for the pointer to the array and another for the actual virtual function pointer used. This second cache miss may very well not contain relevant information in the cache line besides the few bytes we need for the pointer. This will of course depend heavily on the usage but in general, there will be a lot of waste in that cache line. Because we can’t position the array in a meaningful location, it is also possible that the page where it resides does not contain other relevant information, causing pressure on the TLB. In practice, other virtual function arrays are likely to end up in that page and a large number of them could fit inside. Thus a program using this method with 16KB worth of virtual function arrays could very easily end up permanently using four 4KB TLB entries (it is typical for most kernels to use 4KB pages for this sort of data). Incidentally, with increased TLB pressure, we also have increased pressure on higher level caches such as the ERAT on PowerPC and the micro TLB on ARM chips.
This method is ideal when:
- The number of class instances is high
- The number of virtual functions per class is high
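A hand-rolled sketch of what the compiler-generated machinery boils down to; types and names are invented for illustration:

```cpp
// A single pointer per object to a shared, per-class array of function
// pointers. The slot of each function is known at compile time, so the
// lookup is O(1) regardless of how many virtual functions exist.
using VirtualFn = float (*)(const void*);

struct VTable
{
    VirtualFn area;       // slot 0
    VirtualFn perimeter;  // slot 1
};

struct Circle
{
    const VTable* vtable;  // one pointer per object
    float radius;
};

inline float circle_area(const void* obj)
{
    const Circle* c = static_cast<const Circle*>(obj);
    return 3.14159265f * c->radius * c->radius;
}

inline float circle_perimeter(const void* obj)
{
    const Circle* c = static_cast<const Circle*>(obj);
    return 2.0f * 3.14159265f * c->radius;
}

// One table per class, shared by every instance.
const VTable circle_vtable = { &circle_area, &circle_perimeter };
```

The two data cache misses described above are visible here: one to load c->vtable, and one to load the function pointer out of the table.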
The conclusion:
Depending on the scenario and usage, each variant could end up being the superior choice but ultimately, if a language does not expose control over which method is used for this, it must make a safe choice that supports its relevant use cases (e.g.: duck typing at runtime) and degrades as gracefully as possible under unknown scenarios.
26 Apr 2015
The art of writing fast software is subtle and ever changing. Fundamentally, it is all about where your data comes from and how you access it.
At a high level, you have the entire field of algorithmic complexity dedicated to touching the least amount of data for a given task. At this level, the actual hardware that executes the algorithm hardly matters since we deal in very large numbers but this is not to say that it is irrelevant, faster hardware will always help.
As the amount of data manipulated shrinks, the gains from an optimal algorithm versus a slightly less optimal one will blur and a new perspective is necessary.
At a lower level, knowing how to leverage the hardware becomes of paramount importance. The operating system and the hardware go to great lengths to cache data in various ways and knowing how they are used is key in squeezing the very last drop of performance.
Last but not least, the modern processor is really a tiny virtual computer. At this micro level, the instruction stream ordering and what instructions are used can still make a significant difference on performance.
Each tier has its own optimization specialists but they do not all enjoy the same amount of attention. The reason for this is simple: they are not all equal in value. Out in the real world, most programmers will be impacted largely by the first and second tiers, in that order. It isn’t unusual to work with thousands of elements or more, and at these scales, algorithmic complexity and good cache usage will always dominate. The last tier is mostly reserved for extreme low level programmers: hardware designers, embedded programmers, compiler programmers and specialized library programmers.
The secret to writing fast software is to make your assumptions explicit and to measure early and often. Explicit assumptions are key to ensuring, and documenting, that the choices you make as a result are sound. Should your assumptions change or prove wrong, bad things could happen or new opportunities might arise.
It is not unusual for this to happen due to changes in technology. Consider how performance assumptions had to be revised after SSDs were introduced or when C++11 added move semantics.
However, high performance software is slightly different. Much like a race car, performance is built in from the ground up; it is intrinsic to its DNA. A Toyota Camry is a good car, but fitting a Formula One engine into it and dropping the car on a race track will yield disappointing results compared to a true thoroughbred. The same applies to software, and the same could be said of the many goals previously discussed (as Adobe Flash found out, security cannot be retrofitted easily).
25 Apr 2015
Modern computers are an amazing and complex creation. From the smallest cellphone up to your supercharged desktop PC, each and every one of them is in reality a mixture of many smaller specialized computers working in concert.
However, at the end of the day, they all do the same thing: they munch on data. That is their only purpose, and from the meaning inscribed in that data emerge the complex behaviours that we see in everyday programs and devices.
Programming is the art of organizing that data in meaningful ways in order to achieve a specific end result. Many criteria exist to take into consideration when designing such systems:
- User friendly: how do we show the data and how do we allow the data to be manipulated by the user?
- Development friendly: how do we organize the data (and code) for it to be easily manipulated in the ways that are required by the product?
- Hardware friendly: how do we make sure that the data is laid out in the optimal way for the hardware underneath?
- Communication friendly: how do we make sure that the data is easily communicated in between various systems either internally in the computer or externally?
- Resilient to damage: how do we make the data safe from hardware failures?
- Tamper proof: how do we make the data safe from malicious tampering?
- Secure: how do we make sure that data in transit isn’t intercepted by a third party?
I am sure there are many more; these are simply a small subset.
What is often less discussed is that these goals very often conflict with one another. In such scenarios, it is paramount to clearly identify the main goals of the software in order to make and validate our assumptions. These should guide all the important decisions regarding how the software is built and how the data is dealt with.
For example, in the context of AAA video games, with the hardware being largely fixed and the demand for ever higher & prettier visuals, the two most important criteria for single player games are almost always user friendliness and hardware friendliness. For multiplayer games, the complexity increases and communication & tampering become important. Last but not least, for games with micro transactions, security comes into play. Depending on the platform and the game, the risks will also vary in scale and scope.
In contrast, in the context of mobile games, development friendliness is king, to allow churning out updates, content and new games as quickly as possible. It isn’t unusual to find mobile games with poor user interfaces, bad performance, and that are easily tampered with or compromised. In the top tier of games, user & hardware friendliness are again very important and show up in very polished games such as Clash of Clans and Candy Crush.
The single most important thing for a team is to be aligned on these goals and consider them when making every decision. Failure to do this exercise can have disastrous consequences: a change might easily introduce a performance issue or a security issue if one isn’t careful and paying attention to these goals. Ultimately the entire product can become at risk.
This is the reality in software design as much as it is in any other industry where teams must juggle multiple goals.