<h1>Christmas came early: ACL 2.1 is out!</h1>
<p><em>2023-12-16</em></p>
<p>After over 30 months of work, the <a href="https://github.com/nfrechette/acl">Animation Compression Library</a> has finally reached <a href="https://github.com/nfrechette/acl/releases/tag/v2.1.0">v2.1</a>, along with an updated <a href="https://github.com/nfrechette/acl-ue4-plugin/releases/tag/v2.1.0">v2.1 Unreal Engine plugin</a>.</p>
<p>Notable changes in this release include:</p>
<ul>
<li>Added support for <a href="/2022/01/23/anim_compression_bind_pose_stripping/">bind pose stripping</a> which reduces memory usage by <strong>3-5%</strong> on average</li>
<li>Added support for <a href="/2022/04/03/anim_compression_looping/">loop handling</a> through a looping policy</li>
<li>Added support for pre-processing</li>
<li>Added automatic compression level selection which reduces memory usage by <strong>1-2%</strong> on average</li>
<li>Optimized compression through <a href="/2023/02/26/dominance_based_error_estimation/">dominant shell computation</a> which reduces memory usage by up to <strong>10%</strong></li>
<li>Added support for <a href="/2022/06/05/anim_compression_rounding_time/">per sub-track rounding</a></li>
<li>Updated to <a href="https://github.com/nfrechette/rtm">Realtime Math</a> 2.2.0</li>
<li>Many other improvements!</li>
</ul>
<p>Overall, memory savings average between <strong>10-15%</strong> for this release compared to the prior version. Decompression performance remained largely the same, if not slightly faster. Compression is slightly slower, in part because the default compression level is now more aggressive: many performance improvements have made it viable. Visual fidelity improved slightly as well.</p>
<p>This release took much longer than prior releases because a lot of time was spent on quality-of-life improvements. CI now runs regression tests and covers more platforms and toolchains. CI uses Docker, which makes finding and fixing regressions <em>much</em> easier. A <a href="https://github.com/nfrechette/acl-test-data">satellite project</a> was spun off to generate the regression test data. Together, these changes make it much faster for me to polish, test, and finalize a release.</p>
<p>This release saw many contributions from the community through GitHub issues, questions, code, and fixes. Special thanks to all <a href="https://github.com/nfrechette/acl#contributors-">ACL contributors</a>!</p>
<h1 id="who-uses-acl">Who uses ACL?</h1>
<p>The Unreal Engine plugin of ACL has proved to be quite popular over the years. The reverse engineering community over at <a href="https://www.gildor.org/smf/index.php/topic,8304.0.html">Gildor’s Forums</a> has a post that tracks which games use ACL in UE. At the time of writing, the list includes:</p>
<blockquote>
<p>Desktop games:
All of Dontnod’s latest games: the last two episodes of Life is Strange 2, Twin Mirror, Tell Me Why, and most likely all their future releases as well.
Beyond a Steel Sky, Chivalry 2, Fortnite (current version), Kena: Bridge of Spirits, Remnant: From the Ashes (current version), Rogue Company (current version), Star Wars: Tales From The Galaxy’s Edge, Final Fantasy VII Remake (including Intergrade), The Ascent, The Dark Pictures Anthology (Man of Medan / Little Hope / House of Ashes / Devil in Me) (current version), Valorant, Evil Dead: The Game, The Quarry, The DioField Chronicle, Borderlands 3, Tiny Tina’s Wonderlands, Medieval Dynasty (current version), Divine Knockout, Lost Ark, SD Gundam Battle Alliance, Back 4 Blood, KartRider: Drift (current version), Dragon Quest X Offline, Gran Saga (current version), Gundam Evolution (current version), The Outlast Trials, Harvestella, Valkyrie Elysium, The Dark Pictures Anthology: The Devil In Me, The Callisto Protocol, Synced (current version), High On Life, PlayerUnknown’s Battlegrounds (current version), Deliver Us Mars, Sherlock Holmes The Awakened, Hogwarts Legacy, Wanted: Dead, Like a Dragon: Ishin, Atomic Heart, Crime Boss: Rockay City, Dead Island 2, Star Wars Jedi: Survivor, Redfall, Park Beyond, The Lord of the Rings: Gollum, Gangstar NY, Lies of P, Warhaven (current version), AEW: Fight Forever, Legend of the Condor, Crossfire: Sierra Squad, Mortal Kombat 1, My Hero Ultra Rumble, Battle Crush, Overkill’s The Walking Dead, Payday 3</p>
<p>Mobile games:
Apex Legends Mobile (current version), Dislyte, Marvel Future Revolution, Mir4, Ni no Kuni: Cross Worlds, PUBG Mobile (current version), Undecember, Blade & Soul Revolution (current version), Crystal of Atlan, Mortal Kombat: Onslaught (old versions), Farlight 84, ArcheAge War, Assassin’s Creed: Codename Jade, Undawn, Arena Breakout, High Energy Heroes, Dream Star</p>
<p>Console games:
No More Heroes 3</p>
</blockquote>
<p>Many others use ACL in their own internal game engines. If your game uses ACL and is missing from the list, let me know!</p>
<h1 id="whats-next">What’s next</h1>
<p>I’ve already started fleshing out the task list for the next minor release <a href="https://github.com/nfrechette/acl/milestone/12">here</a>. This release will bring about more memory footprint improvements.</p>
<p>If you use ACL and would like to help prioritize the work I do, feel free to reach out and provide feedback or requests!</p>
<p>Now that ACL is the <a href="/2023/09/17/acl_in_ue/">default codec in Unreal Engine 5.3</a>, my maintenance burden is considerably reduced, as Epic has taken over the plugin development. This leaves me free to focus on improving the standalone ACL project and Realtime Math. Going forward, I will aim for smaller and more frequent releases.</p>
<h1>The Animation Compression Library in Unreal Engine 5.3</h1>
<p><em>2023-09-17</em></p>
<p>The very first release of the <a href="https://github.com/nfrechette/acl-ue4-plugin">Animation Compression Library (ACL) Unreal Engine plugin</a> came out in July 2018 and required custom engine changes. It was a bit rough around the edges. Still, in the releases that followed, despite the integration hurdles, it slowly grew in popularity in games large and small. Two years later, version 1.0 came out with official support for UE 4.25 through the <a href="https://www.unrealengine.com/marketplace/en-US/product/animation-compression-library?sessionInvalidated=true">Unreal Engine Marketplace</a>. Offering backwards compatibility with each new release and a solid alternative to the stock engine codecs, the plugin continued to grow in popularity. Today, many of the most demanding and popular console and mobile games that use Unreal already use ACL (Fortnite itself has been using it for many years). It is thus with great pleasure that I can announce that as of <a href="https://docs.unrealengine.com/5.3/en-US/unreal-engine-5.3-release-notes/">UE 5.3</a>, the ACL plugin comes out of the box, batteries included, as the default animation compression codec!</p>
<p>This represents the culmination of many years of hard work and slow but steady progress. This new chapter is very exciting for a few reasons:</p>
<ul>
<li>Better quality and better compression and decompression performance for all UE users, out of the box.</li>
<li>A dramatically smaller memory footprint: ACL saves over 30% compared with the default codecs in UE 5.2, and more savings will come with the next ACL 2.1 release later this year (up to 46% smaller than UE 5.2).</li>
<li>Maintenance churn to keep up with each UE release was a significant burden. Epic will now take ownership of the plugin by continuing its development in their fork as part of the main Unreal Engine development branch.</li>
<li>With official UE distribution comes better documentation, localization, and support through the Unreal Developer Network (UDN). This means a reduced burden for me, as I can now support the plugin as part of my day job at Epic instead of doing so on my personal time (burning that midnight oil).</li>
</ul>
<p>Best of all, by no longer having to maintain the plugin in my spare time, it will allow me to better focus on the <a href="https://github.com/nfrechette/acl">core of ACL</a> that benefits everyone. Many proprietary game engines leverage standalone ACL and in fact quite a few code contributions (and ideas) over the years came from them.</p>
<p>Unfortunately, it does mean that some tradeoffs had to be made:</p>
<ul>
<li>The GitHub version of the plugin will remain largely in read-only mode, taking in only critical bug fixes. It will retain its current MIT License. Contributions should instead be directed to the Unreal Engine version of the plugin.</li>
<li>The Unreal Engine Marketplace version of the plugin will no longer be updated to keep up with each UE release (since it comes packaged with the engine already).</li>
<li>Any changes to the plugin will not be backwards compatible as they will be made in the Unreal Engine main development branch. To get the latest improvements, you’ll have to update your engine version (like any other built-in feature).</li>
<li>Each plugin release will no longer publish statistics comparing ACL to the other UE codecs.</li>
<li>The ACL plugin within UE will take on the engine code license (it already had a dual license like all code plugins do on the Unreal Engine Marketplace).</li>
</ul>
<p>That being said, once ACL 2.1 comes out later this year, one last release of the plugin will be published on GitHub and the Unreal Engine Marketplace (for UE versions before 5.3). It will bring many improvements to the memory footprint and visual fidelity. A future version of UE will include it (TBD).</p>
<p>To further clarify, the core ACL project will remain with me and continue to evolve on GitHub with the MIT License as it always has. I still have a long list of improvements and research topics to explore!</p>
<h1>Dominance based error estimation</h1>
<p><em>2023-02-26</em></p>
<p>As I’ve <a href="/2016/11/01/anim_compression_accuracy/">previously written about</a>, the <a href="https://github.com/nfrechette/acl">Animation Compression Library</a> measures the error introduced through compression by approximating the <a href="https://en.wikipedia.org/wiki/Skeletal_animation">skinning deformation</a> typically done with a graphical mesh. To do so, we define a rigid shell around each joint: a sphere. It is a crude approximation, but the method is otherwise very flexible, and any other shape could be used: a box, a convex hull, etc. This rigid shell is then transformed using its joint transform at every point in time (provided as input through an animation clip). When only an individual joint is used, this process works in the local space of the joint, as transforms are relative to their parent joint in the skeleton hierarchy. When a joint chain that includes the root joint is used, the process works in the local space of the object (object space). While local space only accounts for a single transform (the joint’s), in object space we can account for any number of joints; how many depends entirely on the topology of the joint hierarchy. As a result of these extra joints, measuring the error in object space is slower than in local space. Nevertheless, this method is very simple to implement and tune since the resulting error is a 3D displacement between the raw rigid shell and the lossy rigid shell undergoing compression.</p>
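<p>As a minimal sketch of that idea (my own illustration with hypothetical names, not ACL’s actual code), the rigid shell vertex is transformed by both the raw and the lossy joint transforms, and the error is simply the distance between the two resulting positions:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cmath&gt;

struct Vec3 { float x, y, z; };

// `raw_shell_pos` and `lossy_shell_pos` are the same shell vertex after being
// transformed by the raw and lossy joint transforms respectively (in local or
// object space, depending on which error we are measuring).
float shell_error(Vec3 raw_shell_pos, Vec3 lossy_shell_pos)
{
    const float dx = raw_shell_pos.x - lossy_shell_pos.x;
    const float dy = raw_shell_pos.y - lossy_shell_pos.y;
    const float dz = raw_shell_pos.z - lossy_shell_pos.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);  // 3D displacement
}
</code></pre></div></div>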
<p><a href="https://github.com/nfrechette/acl/blob/develop/docs/error_metrics.md">ACL leverages this technique</a> and uses it in two ways when optimizing the quantization bit rates (e.g finding how many bits to use for each joint transform).</p>
<p>First, for each joint, we proceed to find the lowest bit rate that meets our specified precision requirement (user provided, per joint). To do so, we use a pre-generated, hardcoded list of all possible bit rate permutations for a joint’s rotation, translation, and scale, sorted by total transform size. We then simply iterate over that list in order, testing candidates until we meet our precision target. We start with the most lossy bit rates and work our way up, adding more precision as we go along. This is an exhaustive search and, to keep it fast, we do this processing in the local space of each joint. Once this first pass is done, each joint has an approximate answer for its ideal bit rate. A joint may end up needing more precision to account for object space error (since error accumulates down the joint hierarchy), but it will never need less precision. This thus puts a lower bound on our compressed size.</p>
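<p>In pseudocode, this first pass boils down to a linear scan over the size-sorted permutation list. The sketch below is my own simplification with illustrative names and types, not ACL’s actual API:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;functional&gt;
#include &lt;vector&gt;

struct BitRates { int rotation; int translation; int scale; };

// Scan the permutation list (sorted by total transform size) and return the
// first candidate that meets the precision target: since the list is sorted,
// the first hit is also the smallest. `local_error` stands in for measuring
// the local space shell error with the candidate bit rates applied.
BitRates find_lowest_bit_rates(
    const std::vector&lt;BitRates&gt;&amp; sorted_permutations,
    const std::function&lt;float(const BitRates&amp;)&gt;&amp; local_error,
    float precision)
{
    for (const BitRates&amp; candidate : sorted_permutations)
    {
        if (local_error(candidate) &lt;= precision)
            return candidate;  // lowest acceptable bit rate for this joint
    }

    return BitRates{ 32, 32, 32 };  // assumed fallback: retain full precision
}
</code></pre></div></div>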
<p>Second, for each joint chain, we check the error in object space and try multiple permutations by slowly increasing the bit rate on various joints in the chain. This process is much slower and is controlled through the <code class="language-plaintext highlighter-rouge">compression level</code> setting which allows a user to tune how much time should be spent by ACL when attempting to optimize for size. Because the first pass only found an approximate answer in local space, this second pass is very important as it enforces our precision requirements in object space by accounting for the full joint hierarchy. Despite aggressive caching, unrolling, and many optimizations, this is by far the slowest part during compression.</p>
<p>This algorithm was implemented years ago, and it works great. But ever since I wrote it, I’ve been thinking of ways to improve it. After all, while it finds a local minimum, there is no guarantee that it is the true global minimum which would yield the optimal compressed size. I’ve so far failed to come up with a way to find that optimal minimum, but I’ve gotten one step closer!</p>
<h2 id="to-the-drawing-board">To the drawing board</h2>
<p>Why do we have two passes?</p>
<p>The second pass, in object space, is necessary because it is the one that accounts for the true error as it accumulates throughout the joint hierarchy. However, it is very hard to direct properly because the search space is huge. We need to try multiple bit rate permutations on multiple joint permutations within a chain. How do we pick which bit rate permutations to try? How do we pick which joint permutations to try? If we increase the bit rate towards the root of the hierarchy, we add precision that will improve every child that follows, but there is no guarantee that we’ll be able to meet our precision requirements even if we push it to retain full precision (some clips are exotic). This pass thus stumbles toward an acceptable solution and, if it fails after spending a tunable amount of time, we give up and greedily bump the bit rates of every joint in the chain, one by one, as high as needed. This is far from optimal…</p>
<p>To help it find a solution, it would be best if we could supply it with an initial guess that we hope is close to the ideal solution. The closer this initial guess is to the global minimum, the higher our chances will be that we can find it in the second pass. A better guess thus directly leads to a lower memory footprint (the second pass can’t drift as far) and faster compression (the search space is vastly reduced). This is where the first pass comes in. Right off the bat, we know that bit rates which violate our precision requirements in local space will also violate them in object space, since our parent joints can only add to our overall error (in practice, the error sometimes compensates, but it only does so randomly and thus can’t be relied on). Since testing the error in local space is very fast, we can perform the exhaustive search mentioned above to trim the search space dramatically.</p>
<p>If only we could calculate the object space error using local space information… We can’t, but we can get close!</p>
<h2 id="long-range-attachments">Long-range attachments</h2>
<blockquote>
<p>Kim, Tae & Chentanez, Nuttapong & Müller, Matthias. (2012). Long Range Attachments - A Method to Simulate Inextensible Clothing in Computer Games. 305-310. 10.2312/SCA/SCA12/305-310.</p>
</blockquote>
<p><img src="/public/acl/long_range_attachments.jpg" alt="Long-range attachments" /></p>
<blockquote>
<p>Above: A static vertex in red with 3 attached dynamic vertices in black. A long-range constraint is added for each dynamic vertex, labeled <em>d1</em>, <em>d2</em>, and <em>d3</em>. <em>T0</em> shows the initial configuration and its evolution over time. When a vertex extends past the long-range constraint, a correction (in green) is applied. Vertices within the limits of the long-range constraints are unaffected.</p>
</blockquote>
<p>Back around 2014, I read the above paper on cloth simulation. It describes the concept of long-range attachments to accelerate and improve convergence of distance constraints (e.g. making sure the cloth doesn’t stretch). They start at a fixed point (where the cloth is rigidly attached to something) and they proceed to add additional clamping distance constraints (meaning the distance can be shorter than the desired distance, but no longer) between each simulated vertex on the cloth and the fixed point. These extra constraints help bring back the vertices when the constraints are violated under stress.</p>
<p>A few years later, it struck me: we can use the same general idea when estimating our error!</p>
<h2 id="dominance-based-error-estimation">Dominance based error estimation</h2>
<p>When we compress a joint transform, error is introduced. However, the error may or may not be visible to the naked eye. How important the error is depends on one thing: the distance of the rigid shell.</p>
<p><img src="/public/acl/translation_error_contribution.jpg" alt="Translation error contribution" /></p>
<p>For the translation and scale parts of each transform, the error is independent of the distance of the rigid shell. This means that a 1% error in those yields a 1% error regardless of how far the rigid shell lives with respect to the joint.</p>
<p><img src="/public/acl/rotation_error_contribution.jpg" alt="Rotation error contribution" /></p>
<p>However, the story is different for a transform’s rotation: distance acts as a <a href="https://en.wikipedia.org/wiki/Lever">lever</a>. A 1% error on the rotation does not translate in the same way to the rigid shell. It depends on how close the rigid shell is to the joint. The closer it lives, the lower the error will be and the further it lives, the higher the error.</p>
<p>Using this knowledge, we can reframe our problem. When we measure the error in local space for a joint <code class="language-plaintext highlighter-rouge">X</code>, we wish to do so on the furthest rigid shell it can influence. That furthest rigid shell of joint <code class="language-plaintext highlighter-rouge">X</code> will be associated with a joint that lives in the same chain. It could be <code class="language-plaintext highlighter-rouge">X</code> itself or it could be some child joint <code class="language-plaintext highlighter-rouge">Y</code>. We’ll call this joint <code class="language-plaintext highlighter-rouge">Y</code> the <strong>dominant joint</strong> of <code class="language-plaintext highlighter-rouge">X</code>.</p>
<p>The <strong>dominant rigid shell</strong> of a dominant joint is formed through its associated virtual vertex, the <strong>dominant vertex</strong>.</p>
<p><img src="/public/acl/dominant_joint_step1.jpg" alt="Single joint chain" /></p>
<p>For leaf joints with no children, we have a single joint in our chain, and it is thus its own dominant joint.</p>
<p><img src="/public/acl/dominant_joint_step2.jpg" alt="Two joints chain" /></p>
<p>If we add a new joint, we look at it and its immediate children. Are the dominant joints of its children still dominant or is the new joint its own dominant joint? To find it, we look at the rigid shells formed by the dominant vertices of each joint and we pick the furthest.</p>
<p>After all, if we were to measure the error on any rigid shell enclosed within the dominant shell, the error would either be identical (if there is no error contribution from the rotation component) or lower. Distance is king and we wish to look for error as far as we can.</p>
<p><img src="/public/acl/dominant_joint_step3.jpg" alt="Three joints chain" /></p>
<p>As more joints are introduced, we iterate with the same logic at each joint. Iteration thus starts with leaf joints and proceeds towards the root. You can find the code for this <a href="https://github.com/nfrechette/acl/blob/develop/includes/acl/compression/impl/rigid_shell_utils.h">here</a>.</p>
<p>An important part of this process is its stepwise nature. By evaluating joints one at a time, we find and update the dominant vertex distance by adding the current vertex distance (the distance between the vertex and its joint). This propagates the <a href="https://en.wikipedia.org/wiki/Geodesic"><em>geodesic distance</em></a>. It ensures that even if a joint chain folds on itself, the resulting dominant shell distance only ever increases, as if every joint was laid out in a straight line.</p>
<p>The last piece of the puzzle is the precision threshold when we measure the error. Because we use the dominant shell as a conservative estimate, we must account for the fact that intermediate joints in a chain will contain some error. Our precision threshold represents an upper bound on the error we’ll allow and a target that we’ll try to reach when optimizing the bit rates. For some bit rate, if the error is above the threshold, we’ll try a lower value, but if it is below the threshold, we’ll try a higher value since we have room to grow. As such, joints in a chain will add their precision threshold to their virtual vertex distance when computing the dominant shell distance except for the dominant joint. The dominant joint’s precision threshold will be used when measuring the error and accounted for there.</p>
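<p>Roughly, the leaf-to-root propagation might look like the sketch below. This is an approximation of the idea with illustrative names; the real bookkeeping differs in its details (see <code class="language-plaintext highlighter-rouge">rigid_shell_utils.h</code> linked above):</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;algorithm&gt;
#include &lt;vector&gt;

// Joints are assumed sorted parent-first, so iterating backwards visits
// children before their parents (leaf to root).
void compute_dominant_shell_distances(
    const std::vector&lt;int&gt;&amp; parent_of,          // parent index, -1 for the root
    const std::vector&lt;float&gt;&amp; vertex_distance,  // per joint virtual vertex distance
    const std::vector&lt;float&gt;&amp; precision,        // per joint precision threshold
    std::vector&lt;float&gt;&amp; dominant_distance)      // in/out, seeded with vertex_distance
{
    for (int joint = (int)parent_of.size() - 1; joint &gt;= 0; --joint)
    {
        const int parent = parent_of[joint];
        if (parent &lt; 0)
            continue;  // the root has no parent to propagate to

        // Propagate the geodesic distance: seen from the parent, the child's
        // dominant shell grows by the child's own vertex distance, and
        // intermediate joints also contribute their precision threshold.
        const float candidate = dominant_distance[joint] + vertex_distance[joint] + precision[joint];

        // The parent keeps whichever shell reaches furthest: its own or a child's.
        dominant_distance[parent] = std::max(dominant_distance[parent], candidate);
    }
}
</code></pre></div></div>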
<p>This process is done for every keyframe in a clip and the dominant rigid shell is retained for each joint. I also tried to do this processing for each segment ACL works with and there was no measurable impact.</p>
<p>Once computed, ACL then proceeds as before using this new dominant rigid shell distance and the dominant precision threshold. It is used in both passes. Even when measuring a joint in object space, we need to account for its children.</p>
<h2 id="but-wait-theres-more">But wait, there’s more!</h2>
<p>After implementing this, I realized I could use this trick to fix something else that had been bothering me: <a href="/2016/11/03/anim_compression_constant_tracks/">constant sub-track collapsing</a>. For each sub-track (rotation, translation, and scale), ACL attempts to determine if they are constant along the whole clip. If they are, we can store a single sample: the reference constant value (12 bytes). This considerably lowers the memory footprint. Furthermore, if this single sample is equal to the default value for that sub-track (either the identity or the bind pose value when <a href="/2022/01/23/anim_compression_bind_pose_stripping/">stripping the bind pose</a>), the constant sample can be removed as well and reconstructed during decompression. A default sub-track uses only 2 bits!</p>
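<p>To put rough, illustrative numbers on this (my own arithmetic, not measured data): a skeleton with 100 joints has 300 sub-tracks. If 200 of them are constant and equal to their default value, storing each as a constant sample would cost 200 × 12 bytes = 2400 bytes, while the default encoding costs only 200 × 2 bits = 50 bytes for the entire clip.</p>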
<p>Until now, ACL used individual precision thresholds for rotation, translation, and scale sub-tracks. This has been a source of frustration for some time. While the error metric uses a single 3D displacement as its precision threshold with intuitive units (e.g. whatever the runtime uses to measure distance), the constant thresholds were this special edge case with non-intuitive units (scale doesn’t even have units). It also meant that ACL was not able to account for the error down the chain: it only looked in local space, not object space. It never occurred to me that I could have leveraged the regular error metric and computed the object space error until I implemented this optimization. Suddenly, it just made sense to use the same trick and skip the object space part altogether: we can use the dominant rigid shell.</p>
<p>This allowed me to remove the constant detection thresholds in favor of the single precision threshold used throughout the rest of the compression. A simpler, leaner API, with less room for user error while improving the resulting accuracy by properly accounting for every joint in a chain.</p>
<h2 id="show-me-the-money">Show me the money</h2>
<p>The results speak for themselves:</p>
<p><strong>Baseline:</strong></p>
<table>
<thead>
<tr>
<th>Data Set</th>
<th>Compressed Size</th>
<th>Compression Speed</th>
<th>Error 99th percentile</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>CMU</strong></td>
<td>72.14 MB</td>
<td>13055.47 KB/sec</td>
<td>0.0089 cm</td>
</tr>
<tr>
<td><strong>Paragon</strong></td>
<td>208.72 MB</td>
<td>10243.11 KB/sec</td>
<td>0.0098 cm</td>
</tr>
<tr>
<td><strong>Matinee Fight</strong></td>
<td>8.18 MB</td>
<td>16419.63 KB/sec</td>
<td>0.0201 cm</td>
</tr>
</tbody>
</table>
<p><strong>With dominant rigid shells:</strong></p>
<table>
<thead>
<tr>
<th>Data Set</th>
<th>Compressed Size</th>
<th>Compression Speed</th>
<th>Error 99th percentile</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>CMU</strong></td>
<td>65.83 MB (-8.7%)</td>
<td>34682.23 KB/sec (2.7x)</td>
<td>0.0088 cm</td>
</tr>
<tr>
<td><strong>Paragon</strong></td>
<td>184.41 MB (-11.6%)</td>
<td>20858.25 KB/sec (2.0x)</td>
<td>0.0088 cm</td>
</tr>
<tr>
<td><strong>Matinee Fight</strong></td>
<td>8.11 MB (-0.9%)</td>
<td>17097.23 KB/sec (1.0x)</td>
<td>0.0092 cm</td>
</tr>
</tbody>
</table>
<p>The memory footprint shrinks by over <strong>10%</strong> in some data sets and compression is <strong>twice</strong> as fast (the median compression time per clip for Paragon is now 11 milliseconds)! Gains like this are <em>hard</em> to come by now that ACL has been so heavily optimized already; this is <em>really</em> significant. This optimization maintains the same great accuracy with no impact whatsoever on decompression performance.</p>
<p>The memory footprint shrinks because our initial guess following the first pass is now much better: it accounts for the potential error of any children. This leaves less room for drifting off course in the second pass. And with our improved guess, the search space is considerably reduced, leading to much faster convergence.</p>
<p>Now that this optimization has landed in ACL develop, I am getting close to finishing up the <a href="https://github.com/nfrechette/acl/milestone/11">upcoming ACL 2.1</a> release which will include it and much more. Stay tuned!</p>
<p><a href="/2016/10/21/anim_compression_toc/"><strong>Back to table of contents</strong></a></p>
<h1>Manipulating the sampling time for fun and profit</h1>
<p><em>2022-06-05</em></p>
<p>The <a href="https://github.com/nfrechette/acl">Animation Compression Library</a> works with uniformly sampled animation clips. For each animated data track, every sample falls at a predictable and regular rate (e.g. 30 samples/frames per second). This is done to keep the decompression code simple and fast: with samples being so regular, they can be packed efficiently, sorted by their timestamp.</p>
<p><img src="/public/uniform_sampling.jpg" alt="Uniform sampling" /></p>
<p>When we decompress, we simply find the surrounding samples to reconstruct the value we need.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span> <span class="n">t</span> <span class="o">=</span> <span class="p">...</span> <span class="c1">// At what time in the clip we sample</span>
<span class="kt">float</span> <span class="n">duration</span> <span class="o">=</span> <span class="p">(</span><span class="n">num_samples</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="n">sample_rate</span><span class="p">;</span>
<span class="kt">float</span> <span class="n">normalized_t</span> <span class="o">=</span> <span class="n">t</span> <span class="o">/</span> <span class="n">duration</span><span class="p">;</span>
<span class="kt">float</span> <span class="n">sample_offset</span> <span class="o">=</span> <span class="n">normalized_t</span> <span class="o">*</span> <span class="p">(</span><span class="n">num_samples</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">first_sample</span> <span class="o">=</span> <span class="n">trunc</span><span class="p">(</span><span class="n">sample_offset</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">second_sample</span> <span class="o">=</span> <span class="n">min</span><span class="p">(</span><span class="n">first_sample</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">num_samples</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
<span class="kt">float</span> <span class="n">interpolation_alpha</span> <span class="o">=</span> <span class="n">sample_offset</span> <span class="o">-</span> <span class="n">first_sample</span><span class="p">;</span>
<span class="n">interpolation_alpha</span> <span class="o">=</span> <span class="n">apply_rounding</span><span class="p">(</span><span class="n">interpolation_alpha</span><span class="p">,</span> <span class="n">rounding_mode</span><span class="p">);</span>
<span class="kt">float</span> <span class="n">result</span> <span class="o">=</span> <span class="n">lerp</span><span class="p">(</span><span class="n">sample_values</span><span class="p">[</span><span class="n">first_sample</span><span class="p">],</span> <span class="n">sample_values</span><span class="p">[</span><span class="n">second_sample</span><span class="p">],</span> <span class="n">interpolation_alpha</span><span class="p">);</span>
</code></pre></div></div>
<p>This allows us to continuously animate every value smoothly over time. However, that is not always necessary or desired. To support these use cases, ACL supports several <strong>sample time rounding modes</strong> in order to achieve various effects.</p>
<p>Rounding modes:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">none</code>: no rounding means that we linearly interpolate.</li>
<li><code class="language-plaintext highlighter-rouge">floor</code>: the first/earliest sample is returned with full weight (<code class="language-plaintext highlighter-rouge">interpolation alpha = 0.0</code>). This can be used to step animations forward. Visually, on a character animation, it would look like a stop motion performance.</li>
<li><code class="language-plaintext highlighter-rouge">ceil</code>: the second/latest sample is returned with full weight (<code class="language-plaintext highlighter-rouge">interpolation alpha = 1.0</code>). This might not have a valid use case but ACL supports it regardless for the sake of completeness. Unlike <code class="language-plaintext highlighter-rouge">floor</code> above, ceil returns values that are in the future which might not line up with other events that haven’t happened yet: particle effects, audio cues (e.g. footstep sounds), and other animations. <em>If you know of a valid use case for this, please reach out!</em></li>
<li><code class="language-plaintext highlighter-rouge">nearest</code>: the nearest sample is returned with full weight (<code class="language-plaintext highlighter-rouge">interpolation alpha = 0.0 or 1.0 using round to nearest</code>). ACL uses this internally when measuring the compression error. The error is measured at every keyframe and because of floating point rounding, we need to ensure that a whole sample is used. Otherwise, we might over/undershoot.</li>
<li><code class="language-plaintext highlighter-rouge">per_track</code>: each animated track can specify which rounding mode it wishes to use.</li>
</ul>
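<p>As a sketch of what the <code class="language-plaintext highlighter-rouge">apply_rounding</code> call from the snippet above might do for the first four modes (a hypothetical helper matching the behaviors listed; <code class="language-plaintext highlighter-rouge">per_track</code> simply dispatches to one of these per track):</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cmath&gt;

enum class RoundingMode { none, floor, ceil, nearest };

// Snap the interpolation alpha according to the rounding mode.
float apply_rounding(float interpolation_alpha, RoundingMode mode)
{
    switch (mode)
    {
    case RoundingMode::floor:   return 0.0f;  // full weight on the first sample
    case RoundingMode::ceil:    return 1.0f;  // full weight on the second sample
    case RoundingMode::nearest: return std::round(interpolation_alpha);
    case RoundingMode::none:
    default:                    return interpolation_alpha;  // plain interpolation
    }
}
</code></pre></div></div>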
<h2 id="per-track-rounding">Per track rounding</h2>
<p><em>This is a new feature that has been requested by a few studios (included in the upcoming ACL 2.1 release).</em></p>
<p>When working with animated data, sometimes tracks within a clip are not homogeneous and represent very different things. Synthetic joints can be introduced which do not represent something directly observable (e.g. camera, IK targets, pre-processed data). At times, these do not represent values that can be safely interpolated.</p>
<p>For example, imagine that we wish to animate a stop motion character along with a camera track in a single clip. We wish for the character joints to step forward in time and not interpolate (<code class="language-plaintext highlighter-rouge">floor</code> rounding) but the camera must move smoothly and interpolate (no rounding).</p>
<p>More commonly, this happens with scalar tracks. These are often used for animating blend shape (aka morph target) weights along with various other gameplay values. Gameplay tracks can mean anything and even though they are best stored as floating point values (to keep storage and their manipulation simple), they might not represent a smooth range (e.g. integral values, enums, etc).</p>
<p>When such mixing is used, each track can specify what rounding mode to use during decompression (ACL does not store this information in the compressed data).</p>
<p><em>Note: tracks that do not interpolate smoothly also do not blend smoothly (between multiple clips) and special care must be taken when doing so.</em></p>
<p>See <a href="https://github.com/nfrechette/acl/blob/develop/docs/handling_per_track_rounding.md">here</a> for details on how to use this new feature.</p>
<h2 id="performance-implications">Performance implications</h2>
<p>During decompression, ACL always interpolates between two samples even if the per track rounding feature is disabled during compilation. The most common rounding mode is <code class="language-plaintext highlighter-rouge">none</code> which means we interpolate. This allows us to keep the code simple. We simply snap the interpolation alpha to <code class="language-plaintext highlighter-rouge">0.0</code> or <code class="language-plaintext highlighter-rouge">1.0</code> when we <code class="language-plaintext highlighter-rouge">floor</code> and <code class="language-plaintext highlighter-rouge">ceil</code> respectively (or with <code class="language-plaintext highlighter-rouge">nearest</code>) and we make sure to use a <a href="https://fgiesen.wordpress.com/2012/08/15/linear-interpolation-past-present-and-future/">stable linear interpolation function</a>.</p>
<p>As a result, every rounding mode has the same performance with the exception of the <code class="language-plaintext highlighter-rouge">per track</code> one.</p>
<p>Per track rounding is disabled by default and the code is entirely stripped out because it isn’t commonly required. When enabled, it adds a small amount of overhead on the order of <strong>2-10%</strong> slower (AMD Zen2 with SSE2) regardless of which rounding mode individual tracks use.</p>
<p>During decompression, ACL unpacks 8 samples at a time and interpolates them in pairs. These are then cached in a structure on the execution stack. This is done ahead of time, before the interpolated samples need to be written out, which means that during interpolation, ACL does not know which sample belongs to which track. That determination is only made when we read from the cached structure. The reason for this will be the subject of a future blog post, but it is done to speed up decompression by hiding memory latency.</p>
<p>As a result of this design decision, the only way to support per track rounding is to cache all possible rounding results. When we need to read an interpolated sample, we can simply index into the cache with the rounding mode chosen for that specific track. In practice, this is much cheaper than it sounds. Even though it adds quite a few extra instructions (SSE2 fares quite a bit worse than AVX here due to the lack of the VEX prefix and blend instructions), they can execute independently of other surrounding instructions and in the shadow of the expensive square root instructions needed for rotation sub-tracks (we need to reconstruct the quaternion <code class="language-plaintext highlighter-rouge">w</code> component and we need to normalize the resulting interpolated value). In practice, we don’t have to interpolate three times (once for each possible alpha value) because both <code class="language-plaintext highlighter-rouge">0.0</code> and <code class="language-plaintext highlighter-rouge">1.0</code> are trivial.</p>
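<p>The gist of the caching trick, in a heavily simplified sketch (illustrative layout, not ACL’s internal structure): the <code class="language-plaintext highlighter-rouge">floor</code> and <code class="language-plaintext highlighter-rouge">ceil</code> results are free since they are the unpacked samples themselves, so only the <code class="language-plaintext highlighter-rouge">none</code> result needs an actual interpolation, and <code class="language-plaintext highlighter-rouge">nearest</code> maps to one of the other two based on the alpha.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct Vec3 { float x, y, z; };

Vec3 lerp(Vec3 a, Vec3 b, float t)
{
    return { a.x + (b.x - a.x) * t, a.y + (b.y - a.y) * t, a.z + (b.z - a.z) * t };
}

// All possible rounding results for one sample, cached up front before we
// know which track (and thus which rounding mode) it belongs to.
struct CachedSample
{
    Vec3 values[3];  // [0] = none, [1] = floor, [2] = ceil
};

CachedSample cache_sample(Vec3 first, Vec3 second, float alpha)
{
    CachedSample result;
    result.values[0] = lerp(first, second, alpha);  // none: true interpolation
    result.values[1] = first;                       // floor: alpha = 0.0
    result.values[2] = second;                      // ceil:  alpha = 1.0
    return result;
}

// When a track's value is finally read, index with its chosen rounding mode.
Vec3 read_sample(const CachedSample&amp; cached, int rounding_index)
{
    return cached.values[rounding_index];
}
</code></pre></div></div>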
<h2 id="caveats-when-key-frames-are-removed">Caveats when key frames are removed</h2>
<p>When the <a href="/2021/01/17/progressive_animation_streaming/">database feature</a> is used, things get a bit more complicated. Whole keyframes can be moved to the database and optionally stripped. This means that non-neighboring keyframes might interpolate together.</p>
<p>When all tracks interpolate with the same rounding mode, we can reconstruct the right interpolation alpha to use based on our closest keyframes present.</p>
<p>For example, let’s say that we have 3 keyframes: A, B, and C. Consider the case where we sample the time just after where B lies (at time <code class="language-plaintext highlighter-rouge">t</code>).</p>
<p><img src="/public/keyframe_reconstruction.jpg" alt="Keyframe reconstruction" /></p>
<p>If B has been moved to the database and isn’t present in memory during decompression, we have to reconstruct our value based on our closest remaining neighbors A and C (an inherently lossy process). When all tracks use the same rounding mode, this is clean and fast since only one interpolation alpha is needed for all tracks:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">none</code>: one alpha value is needed between A and C, past the 50% mark where B lies (see <code class="language-plaintext highlighter-rouge">x'</code> above)</li>
<li><code class="language-plaintext highlighter-rouge">floor</code>: one alpha value is needed at 50% between A and C to reconstruct B (see <code class="language-plaintext highlighter-rouge">B'</code> above)</li>
<li><code class="language-plaintext highlighter-rouge">ceil</code>: one alpha value is needed at 100% on C</li>
<li><code class="language-plaintext highlighter-rouge">nearest</code>: the same alpha value as <code class="language-plaintext highlighter-rouge">floor</code> and <code class="language-plaintext highlighter-rouge">ceil</code> is used</li>
</ul>
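<p>As a concrete example (my own numbers): with A at frame 0, B at frame 1, and C at frame 2, sampling at <code class="language-plaintext highlighter-rouge">t = 1.2</code> frames after B was moved to the database yields an alpha of <code class="language-plaintext highlighter-rouge">1.2 / 2 = 0.6</code> between A and C for <code class="language-plaintext highlighter-rouge">none</code>, while <code class="language-plaintext highlighter-rouge">floor</code> needs an alpha of exactly <code class="language-plaintext highlighter-rouge">0.5</code> between A and C to reconstruct <code class="language-plaintext highlighter-rouge">B'</code>.</p>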
<p>However, consider what happens when individual tracks have their own rounding mode. We need three different interpolation alphas that are not trivial (<code class="language-plaintext highlighter-rouge">0.0</code> or <code class="language-plaintext highlighter-rouge">1.0</code>). Trivial alphas are very cheap since they fully match our known samples. To keep the code flexible and fast during decompression, we would have to fully interpolate three times:</p>
<ul>
<li>Once for <code class="language-plaintext highlighter-rouge">floor</code> since the interpolation alpha needed for it might not be a nice number like <code class="language-plaintext highlighter-rouge">0.5</code> if multiple consecutive keyframes have been removed.</li>
<li>Once for <code class="language-plaintext highlighter-rouge">ceil</code> for the same reason as <code class="language-plaintext highlighter-rouge">floor</code>.</li>
<li>Once for <code class="language-plaintext highlighter-rouge">none</code> for our actual interpolated value.</li>
</ul>
<p>As a result, the behavior will differ for the time being as ACL will return A with <code class="language-plaintext highlighter-rouge">floor</code> instead of the reconstructed B. This is unfortunate and I hope to address this in the next minor release (v2.2).</p>
<p><em>Note: This discrepancy will only happen when interpolating with missing keyframes. If the missing keyframes are streamed in, everything will work as expected as if no database was used.</em></p>
<p>In a future release, ACL will support <a href="https://github.com/nfrechette/acl/issues/392">splitting tracks into layers</a> to group them together. This will allow us to use different rounding modes for different layers. We will also be able to specify <a href="https://github.com/nfrechette/acl/issues/407">which tracks have values that can be interpolated or not</a>, allowing us to identify boundary keyframes that cannot be moved to the database.</p>
<p>I would have liked to properly handle this right off the bat however due to time constraints, I opted to defer this work until the next release. I’m hoping to finalize the release of <a href="https://github.com/nfrechette/acl/milestone/11">v2.1</a> this year. There is no ETA yet for v2.2 but it will focus on <a href="https://github.com/nfrechette/acl/milestone/12">streaming and other improvements</a>.</p>
<p><a href="/2016/10/21/anim_compression_toc/"><strong>Back to table of contents</strong></a></p>
<h1>Compressing looping animation clips</h1>
<p><em>2022-04-03</em></p>
<p>Animations that loop have a unique property in that the last character pose must exactly match the first for playback to be seamless as it wraps around. As a result, the repeating first/last pose is redundant. Looping animations are fairly common, and it thus makes sense to try and leverage this in order to reduce the memory footprint. However, there are important caveats, and special care must be taken.</p>
<p>The <a href="https://github.com/nfrechette/acl">Animation Compression Library</a> now supports two looping modes: clamp and wrap.</p>
<p>To illustrate things, I will use a simple 1D repeating sequence but in practice, everything can be generalized in 3D (e.g. translations), 4D (e.g. rotation quaternions), and other dimensions.</p>
<h2 id="clamping-loops">Clamping loops</h2>
<p>Clamping requires the first and last samples to match exactly, and both will be retained in the final compressed data. This keeps things very simple.</p>
<p><img src="/public/acl/loop_clamp_mode.jpg" alt="Clamped loop" /></p>
<p>Calculating the clip duration can be achieved by taking the number of samples, subtracting one, and dividing by the sample rate (e.g. 30 FPS). We subtract one because we count how many intervals of time lie between our samples.</p>
<p>If we wish to sample the clip at some time <code class="language-plaintext highlighter-rouge">t</code>, we normalize it by dividing by the clip duration to bring the resulting value between <code class="language-plaintext highlighter-rouge">[0.0, 1.0]</code> inclusive.</p>
<p>We can then multiply it with the last sample index to find which two values to use when interpolating:</p>
<ul>
<li>The first sample is the <code class="language-plaintext highlighter-rouge">floor</code> value (or the truncated value)</li>
<li>The second sample is the <code class="language-plaintext highlighter-rouge">ceil</code> value (or simply the first + 1) we clamp with the last sample index</li>
</ul>
<p>The interpolation alpha value between the two is the resulting fractional part.</p>
<p>Reconstructing our desired value can then be achieved by interpolating the two samples.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span> <span class="n">duration</span> <span class="o">=</span> <span class="p">(</span><span class="n">num_samples</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="n">sample_rate</span><span class="p">;</span>
<span class="kt">float</span> <span class="n">normalized_t</span> <span class="o">=</span> <span class="n">t</span> <span class="o">/</span> <span class="n">duration</span><span class="p">;</span>
<span class="kt">float</span> <span class="n">sample_offset</span> <span class="o">=</span> <span class="n">normalized_t</span> <span class="o">*</span> <span class="p">(</span><span class="n">num_samples</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">first_sample</span> <span class="o">=</span> <span class="n">trunc</span><span class="p">(</span><span class="n">sample_offset</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">second_sample</span> <span class="o">=</span> <span class="n">min</span><span class="p">(</span><span class="n">first_sample</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">num_samples</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
<span class="kt">float</span> <span class="n">interpolation_alpha</span> <span class="o">=</span> <span class="n">sample_offset</span> <span class="o">-</span> <span class="n">first_sample</span><span class="p">;</span>
<span class="kt">float</span> <span class="n">result</span> <span class="o">=</span> <span class="n">lerp</span><span class="p">(</span><span class="n">sample_values</span><span class="p">[</span><span class="n">first_sample</span><span class="p">],</span> <span class="n">sample_values</span><span class="p">[</span><span class="n">second_sample</span><span class="p">],</span> <span class="n">interpolation_alpha</span><span class="p">);</span>
</code></pre></div></div>
<p>This works for any sequence with at least one sample, regardless of whether or not it loops. By design, we never interpolate between the first and last samples as playback loops around.</p>
<p>This is the behavior in ACL v2.0 and prior. No effort had been made to take advantage of looping animations.</p>
<h2 id="wrapping-loops">Wrapping loops</h2>
<p>Wrapping strips the last sample, and instead uses the first sample to interpolate with. As a result, unlike with the clamp mode, it will interpolate between the first and last samples as playback loops around.</p>
<p><img src="/public/acl/loop_wrap_mode.jpg" alt="Wrapped loop" /></p>
<p>This complicates things because it means that the duration of our clip is no longer a function of the number of samples actually present. One less sample is stored, but we have to remember to account for our repeating first sample.</p>
<p>Accounting for that fact, this gives us the following:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span> <span class="n">duration</span> <span class="o">=</span> <span class="n">num_samples</span> <span class="o">*</span> <span class="n">sample_rate</span><span class="p">;</span>
<span class="kt">float</span> <span class="n">normalized_t</span> <span class="o">=</span> <span class="n">t</span> <span class="o">/</span> <span class="n">duration</span><span class="p">;</span>
<span class="kt">float</span> <span class="n">sample_offset</span> <span class="o">=</span> <span class="n">normalized_t</span> <span class="o">*</span> <span class="n">num_samples</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">first_sample</span> <span class="o">=</span> <span class="n">trunc</span><span class="p">(</span><span class="n">sample_offset</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">second_sample</span> <span class="o">=</span> <span class="n">first_sample</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">first_sample</span> <span class="o">==</span> <span class="n">num_samples</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// Sampling the last sample with full weight, use the first sample</span>
<span class="n">first_sample</span> <span class="o">=</span> <span class="n">second_sample</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">sample_offset</span> <span class="o">=</span> <span class="mf">0.0</span><span class="n">f</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">else</span> <span class="nf">if</span> <span class="p">(</span><span class="n">second_sample</span> <span class="o">==</span> <span class="n">num_samples</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// Interpolating with the last sample, use the first sample</span>
<span class="n">second_sample</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">float</span> <span class="n">interpolation_alpha</span> <span class="o">=</span> <span class="n">sample_offset</span> <span class="o">-</span> <span class="n">first_sample</span><span class="p">;</span>
<span class="kt">float</span> <span class="n">result</span> <span class="o">=</span> <span class="n">lerp</span><span class="p">(</span><span class="n">sample_values</span><span class="p">[</span><span class="n">first_sample</span><span class="p">],</span> <span class="n">sample_values</span><span class="p">[</span><span class="n">second_sample</span><span class="p">],</span> <span class="n">interpolation_alpha</span><span class="p">);</span>
</code></pre></div></div>
<p>By design, the above logic only works for clips which have had their last sample removed. As such, it is not suitable for non-looping clips.</p>
<h2 id="how-much-memory-does-it-save">How much memory does it save?</h2>
<p>I analyzed the animation clips from <a href="https://github.com/nfrechette/acl/blob/develop/docs/paragon_performance.md">Paragon</a> to see how much memory a single key frame (whole pose at a point in time) uses. The results were somewhat underwhelming to say the least.</p>
<table>
<thead>
<tr>
<th>Paragon</th>
<th>Before</th>
<th>After</th>
</tr>
</thead>
<tbody>
<tr>
<td>Compressed size</td>
<td>208.93 MB</td>
<td>208.71 MB</td>
</tr>
<tr>
<td>Track error 99th percentile</td>
<td>0.0099 cm</td>
<td>0.0099 cm</td>
</tr>
</tbody>
</table>
<p>Roughly 200 KB were saved, or about 0.1%, even though 1454 clips (out of 6558) were identified as looping. Looking deeper into the data reveals the following hard truth: <a href="/2020/08/09/animation_data_numbers/">animated key frame data</a> is pretty small to begin with.</p>
<table>
<thead>
<tr>
<th>Paragon</th>
<th>50th percentile</th>
<th>85th percentile</th>
<th>99th percentile</th>
</tr>
</thead>
<tbody>
<tr>
<td>Animated frame size</td>
<td>146.62 Bytes</td>
<td>328.88 Bytes</td>
<td>756.38 Bytes</td>
</tr>
</tbody>
</table>
<p>Half the clips have an animated key frame size of 146 bytes or smaller. Because joints are not all animated, they do not all benefit equally.</p>
<p>Here is the full distribution of how much memory was saved as a percentage of the original size:</p>
<p><img src="/public/acl/loop_optimization_paragon_distribution.png" alt="Distribution of percentage saved with loop optimization on Paragon animations" /></p>
<p>Interestingly, some clips end up larger when the last key frame is removed. How come? It all boils down to the fact that ACL tries to split the number of key frames evenly across all sub-segments it generates. <a href="/2016/11/10/anim_compression_uniform_segmenting/">Segments</a> are used to improve accuracy and lower the memory footprint but the current method is naive and will be improved on <a href="https://github.com/nfrechette/acl/issues/404">in the future</a>. As such, when a loop is detected, ACL can do one of two things: the last key frame can be removed <em>before</em> segments are created or <em>afterwards</em>. I opted to do so <em>before</em> to ensure proper balancing. As such, through the re-balancing that occurs, some clips can end up being larger. Either way, the small size of the animated key frame data puts a fairly low ceiling on how much this optimization can save.</p>
<p><em>Side note: thanks to this, I found a <a href="https://github.com/nfrechette/acl/issues/403">small bug</a> that impacts a very small subset of clips. Clips impacted by it ended up with segments containing more key frames than they otherwise would have, resulting in a slightly larger memory footprint.</em></p>
<h2 id="caveats">Caveats</h2>
<p>Special care must be taken when removing the last sample, as joint tracks are not all equal. In a looping clip, the last sample will match the first for joints which have a direct impact on visual feedback to the user, but it might not be the case for joints which have an indirect impact, or those that have no impact. For example, while the joint for a character’s hand has direct visual feedback as we see it on screen, other joints like the root have only indirect feedback (through root motion), and any number of custom invisible joints might exist for things like inverse kinematics and other metadata required.</p>
<p>A looping clip with root motion might have a perfectly matching last sample for the visible portion of the character pose, but the root value might differ between the two. When this is the case, removing the last sample means truncating important data! For that reason, ACL only detects wrapping when <em>ALL</em> joints repeat perfectly between their first and last sample.</p>
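<p>A minimal sketch of that detection (a hypothetical helper assuming scalar tracks for brevity; ACL compares full transform samples):</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cmath&gt;
#include &lt;vector&gt;

// A clip only qualifies for the wrap policy if EVERY track's first and last
// samples match within some tolerance, root included; otherwise we clamp.
bool can_wrap(const std::vector&lt;std::vector&lt;float&gt;&gt;&amp; tracks, float tolerance)
{
    for (const std::vector&lt;float&gt;&amp; samples : tracks)
    {
        if (samples.size() &lt; 2)
            continue;  // empty or single-sample tracks trivially repeat

        if (std::fabs(samples.front() - samples.back()) &gt; tolerance)
            return false;  // a single mismatching track disqualifies the clip
    }

    return true;
}
</code></pre></div></div>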
<h2 id="why-is-root-motion-special">Why is root motion special?</h2>
<p>An animation clip contains a series of transforms for every joint at different points in time. To remove as much redundancy as possible, joints are stored relative to their parent (if they have one, local space). In order to display the visual mesh, each joint transform is combined with its parent, recursively, all the way to the root of the character (converting from local space to object space).</p>
<p>In an animation editor, a character will be animated relative to some origin point at <code class="language-plaintext highlighter-rouge">[0,0,0]</code> (object space). However, in practice, we need to be able to play that animation anywhere in the world (world space). As such, we need to transform joints from object space into world space by using the character’s transform.</p>
<p><img src="/public/acl/no_root_displacement.jpg" alt="Character motion without root joint" /></p>
<p>The pelvis joint is usually the highest visible joint in the hierarchy, with the legs and spine in turn parented to it, etc. When the character moves, every joint has direct visual feedback as its movement deforms the parts of the visual mesh attached to it (through the rigid skinning process).</p>
<p>Now what happens if we wish to make a walking animation where the character moves at some desired speed? If the pelvis joint is the highest joint in the hierarchy, this motion will happen there, and it will drag along every joint parented to it (because children transforms are stored relative to their parent to avoid having redundant motion in every joint).</p>
<p>This is problematic because now the character motion is mixed in with what the pelvis does in this particular clip (e.g. swaying back and forth). It is no longer possible to cleanly extract that motion, free from the noise of the animation. As the animation plays back in the world, things might be attached to the character (e.g. a camera) that we wish to attach to the character itself, and not to some particular joint. If only the joints move, this won’t work. As such, we often want to extract this character motion and apply it as a delta from the previous animation update. We would remove this motion from the joints and instead transfer it onto the character.</p>
<p><img src="/public/acl/root_displacement.jpg" alt="Character motion with root joint" /></p>
<p>To simplify this process, character motion is generally stored in a root joint that sits highest in the joint hierarchy. This joint does not directly move anything, and we instead extract the character motion from it and apply it onto the character.</p>
<p>As a result, character motion is stored as an absolute displacement value relative to that <code class="language-plaintext highlighter-rouge">[0,0,0]</code> origin mentioned earlier. In a 1-second-long animation where the character moves 1 meter, on the first sample the root value will be zero meters and on the last sample, the value is 1 meter. Values in between can vary depending on how the character velocity changes. As such, even if the character pose is identical between the first and last samples, their root joints differ and <em>CANNOT</em> be interpolated between.</p>
<p>When we sample the animation, we can find out our character motion delta as follow: <code class="language-plaintext highlighter-rouge">current_root_position - previous_root_position</code>. This gives us how much the character moved since the last update. When the animation loops, we must break this calculation down into two parts:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">last_root_position - previous_root_position</code> to get the trailing end of the motion</li>
<li><code class="language-plaintext highlighter-rouge">current_root_position - first_root_position</code> to get the rest of the motion (note that <code class="language-plaintext highlighter-rouge">first_root_position</code> is usually zero and can be omitted)</li>
</ul>
<p>There is one more special case to consider: the animation may loop more than once since the last update. This can happen for any number of reasons: playback rate changes, the character was off screen for a while and is now visible and updating again, etc. When this happens, we must take the total displacement of the animation and multiply it by how many times we looped through. The total displacement value lives in our last sample and can simply be read from there: <code class="language-plaintext highlighter-rouge">last_root_position * num_loops</code>.</p>
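<p>Putting the pieces together, a sketch of the delta computation might look like this. It treats the displacement as a single float for brevity (in practice it is a 3D translation), and every name here is illustrative:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// 'num_loops' is how many times playback wrapped around since the
// previous update (0 when no loop occurred).
float calc_root_motion_delta(float previous_root_position, float current_root_position,
                             float last_root_position, int num_loops)
{
    if (num_loops == 0)
        return current_root_position - previous_root_position;

    // Trailing end of the motion, up to the loop point
    float delta = last_root_position - previous_root_position;

    // Every extra full loop contributes the clip's total displacement
    delta += last_root_position * float(num_loops - 1);

    // Rest of the motion from the start (first_root_position assumed zero)
    delta += current_root_position;
    return delta;
}
</code></pre></div></div>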
<p>If, for whatever reason, we forcibly remove the last sample, we will no longer have information on how fast the root was moving as it wrapped around. This velocity information is permanently lost and cannot be reconstructed.</p>
<p>Within an animation clip, it is generally desirable for compression purposes to keep the number of samples the same for every joint. If we don’t, we need to store either how many samples we have per joint (and there can be many) or whether the last sample has been removed or not. This adds overhead and complexity we would ideally do without.</p>
<p>It is thus tempting to remove the last sample and estimate the missing value somehow. While this might work with few to no visible artifacts, it can have subtle implications when other things that should be moving at the same overall rate are visible on screen. For example, two characters might be running side by side at the same velocity, but with different animations. Just because the overall velocity is the same (e.g. both move at 1 meter per second), it does not mean that the velocity on the last sample matches. One animation might start fast and slow down before looping, while the other does the opposite. If this is the case, both characters will end up slowly drifting from each other even though they should not.</p>
<p><a href="/2016/10/21/anim_compression_toc/"><strong>Back to table of contents</strong></a></p>
Bind pose stripping2022-01-23T00:00:00+00:00http://nfrechette.github.io/2022/01/23/anim_compression_bind_pose_stripping<p>A little over three years ago, <a href="/2018/05/08/anim_compression_additive_bind/">I wrote about storing animation clips relative to the bind pose</a> in order to improve the compression ratio. In a gist, this stores each clip as an additive onto the base <a href="/2016/10/26/anim_compression_terminology/">bind pose</a>.</p>
<p>This does two things for us:</p>
<ul>
<li>It reduces the range of each sub-track (rotation, translation, scale) which allows more accuracy to be retained (e.g. instead of having the pelvis bone animating at 60cm above the root on the ground, it will animate around 0cm relative to the bind position 60cm above said ground).</li>
<li>It increases the likelihood that sub-tracks will become equal to their identity value as very often joints are not animated and will be equal to their bind pose value. <a href="/2016/11/03/anim_compression_constant_tracks/">This allows us to remove their values entirely</a> from the compressed byte stream since we only need a bit set to reconstruct them: whether the value is equal to the identity or not.</li>
</ul>
<p>It is very common for sub-tracks to not be animated and to retain the bind pose value, especially for translations. For example, upper body animations might have the entire lower body identical to the bind pose. Facial animations might have the rest of the body equal to it. For that reason, at the time, I reported memory savings of up to <strong>8%</strong> with that method.</p>
<p>The main drawback of the technique is that in order to reconstruct the clip, we have to apply the output pose onto the bind pose much like we would with an additive clip. This means performing an expensive transform multiplication. With a QVV (quat-vec3-vec3) format, this means performing three quaternion multiplications which isn’t cheap.</p>
<p>However, there is an alternate way to leverage the bind pose to achieve similar memory savings without the expensive overhead: stripping it entirely.</p>
<h2 id="how-it-works">How it works</h2>
<p>The <a href="https://github.com/nfrechette/acl">Animation Compression Library</a> now allows you to specify per joint what its default value should be. Default values are not stored within the compressed clip and instead rely on a simple bit set.</p>
<p>If a sub-track is not animated, it has a constant value across all its samples. If that constant value is equal to the default value specified, then ACL simply strips it. Later, during decompression, if a sub-track has a default value, we simply write it into the output pose.</p>
<p>ACL is very flexible here and it allows you to specify either a constant value for every default sub-track (e.g. the identity) or you can use a unique value per sub-track (e.g. the bind pose). This way, the full output pose can be safely reconstructed. As a bonus, it also allows default sub-tracks to be skipped entirely. This is very handy when you pre-fill the output pose buffer with the bind pose before decompressing a clip. This can be achieved efficiently with memcpy (or similar) and during decompression default sub-tracks will be skipped, leaving the value untouched in the output pose.</p>
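<p>A rough sketch of the idea (the types and functions below are illustrative, not ACL’s actual API):</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cstring&gt;

// 'compressed_clip' and 'pose' are hypothetical types; the point is that
// default sub-tracks never touch the output, so the pre-filled bind pose
// values remain in place.
void decompress_pose(const compressed_clip&amp; clip, const pose&amp; bind_pose, pose&amp; out_pose)
{
    std::memcpy(&amp;out_pose, &amp;bind_pose, sizeof(pose)); // cheap pre-fill

    for (int track = 0; track &lt; clip.num_tracks(); ++track)
    {
        if (clip.is_default(track))
            continue; // stripped: the bind pose value is already in the output

        out_pose.set_track(track, clip.read_track(track)); // constant or animated
    }
}
</code></pre></div></div>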
<p>By stripping the bind pose, we achieve a result very similar to storing the clip as an additive on top of it. We can leverage the fact that many sub-tracks are not animated and are equal to their bind pose value. However, we do not reduce the range of motion.</p>
<p>Crucially, reconstructing the original pose is now much cheaper: it does not involve any expensive arithmetic, and the bind pose will often be warm in the CPU cache since multiple clips use it and it may be used elsewhere as part of the animation update/evaluation.</p>
<p><em>Side note: reducing the range of motion can be partially achieved for translation and scale by simply removing the bind pose value with a component-wise subtraction. This allows us to reconstruct the original value by adding the bind pose value back, which is very cheap.</em></p>
<h2 id="results">Results</h2>
<p>Now that ACL supports this, I measured bind pose stripping against two data sets:</p>
<ul>
<li><a href="https://github.com/nfrechette/acl/blob/develop/docs/cmu_performance.md">Carnegie-Mellon University’s motion capture database</a> which contains 2534 clips.</li>
<li><a href="https://github.com/nfrechette/acl/blob/develop/docs/paragon_performance.md">Paragon</a> which contains 6558 clips.</li>
</ul>
<p>I measured the final compressed size before and after as well as the 99th percentile error (99% of joint samples have an error below this value):</p>
<table>
<thead>
<tr>
<th>CMU</th>
<th>Before</th>
<th>After</th>
</tr>
</thead>
<tbody>
<tr>
<td>Compressed size</td>
<td>75.55 MB</td>
<td>75.50 MB</td>
</tr>
<tr>
<td>Track error 99th percentile</td>
<td>0.0088 cm</td>
<td>0.0088 cm</td>
</tr>
</tbody>
</table>
<p>For CMU, the small gain is expected due to the nature of motion capture data and bind pose stripping performs about as well as storing clips relative to it. Motion capture is often noisy and sub-tracks are unlikely to be constant, let alone equal to the bind pose.</p>
<table>
<thead>
<tr>
<th>Paragon</th>
<th>Before</th>
<th>After</th>
</tr>
</thead>
<tbody>
<tr>
<td>Compressed size</td>
<td>224.30 MB</td>
<td>220.85 MB</td>
</tr>
<tr>
<td>Track error 99th percentile</td>
<td>0.0095 cm</td>
<td>0.0090 cm</td>
</tr>
</tbody>
</table>
<p>However, for Paragon, the compressed data is <strong>1.54%</strong> smaller. It turns out that quite a few clips are adversely impacted by bind pose stripping. They can end up far larger, which I found surprising.</p>
<p><img src="/public/acl/bind_pose_stripping_paragon_distribution.png" alt="Paragon size delta distribution" /></p>
<p>In the above image, I plotted the size delta as a percentage. Positive values denote a reduction in size.</p>
<p>As we can see, for the vast majority of clips, we observe a reduction in size: 5697 (87% of) clips ended up smaller. 1889 (29% of) clips saw a reduction of 10% or more. The median saving is 4.8%. Not bad!</p>
<p>Sadly, 59 (1% of) clips saw an increase in size of 10% or more with the largest increase at 67.7%.</p>
<p>Looking at some of the clips that perform terribly helped shed light on what is going on. Clips with long bone chains of default sub-tracks equal to their bind pose can end up with very high error; to compensate, ACL retains more bits to preserve quality. This is caused by two things.</p>
<p>First, we use an error threshold to detect when a sub-track is equal to its default value or not. Even though the error threshold is very conservative, a very small amount of error is introduced and it can compound in long bone chains.</p>
<p>Second, when we measure the compression error, we do so against the original raw clip. Because the bind pose values we strip rely on the error threshold mentioned above, the optimization algorithm can end up trying to reach a pose that it can never reach. For example, if a joint is stripped for being equal to the bind pose and doing so introduces an error of 1cm on some distant child, that error will remain even if every joint in between keeps its full raw precision values.</p>
<p><em>Side note: constant sub-tracks use the same error threshold to detect whether they are animated or not, which can lead to the same issues.</em></p>
<p>This is not easily remedied without some form of error compensation which ACL does not currently support. However, I’m hoping to integrate a partial solution in the coming months. Stay tuned!</p>
<p><em>Side note: we could pre-process the raw data to ensure that constant and default sub-tracks are clean, with every sample perfectly repeating. This would ensure that the error metric does not need to compensate. However, ACL does not own the raw data and as such cannot do such transformations safely. A future release might expose functions to clean up the raw data prior to compression.</em></p>
<h2 id="conclusion">Conclusion</h2>
<p>For the time being, if you use this new feature, I recommend also trying compression without bind pose stripping and picking the better of the two results (for most clips, ACL compresses very fast). The develop branch of the <a href="https://github.com/nfrechette/acl-ue4-plugin">Unreal Engine 4 ACL plugin</a> now supports this experimental feature and testing both codec variations can easily be done there (and in parallel too).</p>
<p>Anecdotally, a few people have reached out to me about leveraging this feature and they reported memory savings in the <strong>3-5%</strong> range. YMMV.</p>
<p>While the memory savings of this technique aren’t as impressive as storing clips as additives of the bind pose, its dramatically lower decompression cost makes it a very attractive optimization.</p>
<p><a href="/2016/10/21/anim_compression_toc/"><strong>Back to table of contents</strong></a></p>
ACL 2.0 is hot off the press2021-05-04T00:00:00+00:00http://nfrechette.github.io/2021/05/04/acl_v2.0.0<p>After 18 months of work, the <a href="https://github.com/nfrechette/acl">Animation Compression Library</a> has finally reached <a href="https://github.com/nfrechette/acl/releases/tag/v2.0.0">v2.0</a> along with an updated <a href="https://github.com/nfrechette/acl-ue4-plugin/releases/tag/v2.0.1">v2.0 Unreal Engine 4 plugin</a>.</p>
<p>Notable changes in this release include:</p>
<ul>
<li>Unified and cleaned up APIs</li>
<li>Cleaned up naming convention to match C++ stdlib, boost</li>
<li><a href="/2021/01/17/progressive_animation_streaming/">Introduced streaming database support</a></li>
<li>Decompression profiling now uses <a href="https://github.com/google/benchmark">Google Benchmark</a></li>
<li>Decompression has been heavily optimized</li>
<li>Compression has been heavily optimized</li>
<li>First release to support backwards compatibility going forward</li>
<li>Migrated all math to <a href="https://github.com/nfrechette/rtm">Realtime Math</a></li>
<li>Clips now support 4 billion samples/tracks</li>
<li><a href="https://github.com/nfrechette/acl-js">WebAssembly support added through emscripten</a></li>
<li>Many other improvements</li>
</ul>
<p>Overall, this release is cleaner, leaner, and much faster.</p>
<p><a href="https://github.com/nfrechette/acl-ue4-plugin/blob/release/2.0/Docs/decompression_performance.md">Decompression is now <strong>1.4x</strong> (PC) to <strong>1.8x</strong> (Mobile) faster than ACL 1.3</a> which is no small feat! I’ll be writing a blog post in the next few months with the juicy details. To make this possible, the memory footprint may increase slightly (mostly header related changes, a few bytes here and there, and alignment padding) but many datasets showed no increase at all. Quality remains unchanged.</p>
<h1 id="whats-next">What’s next</h1>
<p>I’ve already started fleshing out the task list for the next minor release <a href="https://github.com/nfrechette/acl/milestone/11">here</a>. This release will bring about more memory footprint improvements.</p>
<p>If you use ACL and would like to help prioritize the work I do, feel free to reach out and provide feedback or requests!</p>
<p>ACL 2.0 turned out to be a massive undertaking and it took countless hours, late nights, and weekends to make it happen (over 1700 commits since the project began 4 years ago). As such, I’ll pause development for a month or two (or three) while I focus on writing a few blog posts I’ve been meaning to get to and take a much needed break. However, I’ll continue to make patch releases during this time if anything important pops up.</p>
<p>Special thanks to <a href="https://kr.ncsoft.com">NCSOFT</a> for sponsoring many of the improvements that came with this major release!</p>
Controlling animation quality through streaming2021-01-17T00:00:00+00:00http://nfrechette.github.io/2021/01/17/progressive_animation_streaming<p>For a few years now, I’ve had ideas on how to leverage streaming to further improve compression with the <a href="https://github.com/nfrechette/acl">Animation Compression Library</a>. Thanks to the support of <a href="https://kr.ncsoft.com">NCSOFT</a>, I’ve been able to try them out and integrate progressive quality streaming for the upcoming 2.0 release next month.</p>
<p>Progressive quality streaming is perfectly suited for modern games on high end consoles all the way down to mobile devices and web browsers. Its unique approach will empower animators to better control animation quality and when to pay the price for it.</p>
<h2 id="space-is-precious">Space is precious</h2>
<p>For many mobile and web games out there, the size of animation data remains an ever-present issue. All of this data needs to be downloaded to disk and later loaded into memory, which takes time and resources that devices do not always have in large supply. Moreover, on a small screen, animation quality matters less: compression artifacts are much less visible than on a large 4K monitor. Although it might seem like an ancient problem only older consoles and early mobile phones had to worry about, many modern games still contend with it today.</p>
<p>A popular technique to deal with this is to use <a href="/2016/11/17/anim_compression_sub_sampling/">sub-sampling</a>: take the key frames of an input animation clip and re-sample it with fewer key frames (e.g. going from 30 FPS to 24 FPS). Unreal Engine 4 implements a special case of this called <a href="https://docs.unrealengine.com/en-US/API/Runtime/Engine/Animation/UAnimSequence/index.html">frame stripping</a>: every other key frame is removed.</p>
<p>By their nature, these techniques are indiscriminate and destructive: the data is permanently removed and cannot be recovered. Furthermore, if a specific key frame is of particular importance (e.g. highest point of a jump animation), it could end up being removed leading to undesirable artifacts. Despite these drawbacks, they remain very popular.</p>
<p>In practice, some data within each animation cannot be removed (metadata, etc) and as I have documented in a <a href="/2020/08/09/animation_data_numbers/">previous blog post</a>, the animated portion of the data isn’t always dominant. For that reason, frame stripping often yields a memory reduction around <strong>30-40%</strong> despite the fact that we remove every other key frame.</p>
<h2 id="bandwidth-is-limited">Bandwidth is limited</h2>
<p>Animation data also competes with every other asset for space and bandwidth. Even with modern SSDs, loading and copying hundreds of megabytes of data still takes time. Modern games now have hundreds of megabytes of high quality textures to load as well as dense geometry meshes. These often have solutions to alleviate their cost both at runtime and at load time in the form of Levels of Detail (e.g. mip maps). However, animation data does not have an equivalent to deal with this problem because animations are closer in spirit to audio assets: most of them are either very short (a 2 second long jump and its 200 millisecond sound effect) or very long (a cinematic and its background music).</p>
<p>Most assets that leverage streaming end up doing so in one of two ways: on demand (e.g. texture/mip map streaming) or ahead of time (e.g. video/audio streaming). Sadly, neither solution is popular with animation data.</p>
<p>When a level starts, it generally has to wait for all animation data to be loaded. Stalling and waiting for IO to complete during gameplay would be unacceptable, and playing a generic <a href="https://en.wikipedia.org/wiki/T-pose">T-stance</a> in the meantime would quickly become an obvious eyesore. By their nature, gameplay animations can start and end at any moment based on player input or world events. Worse still, gameplay animations are often very short and we wouldn’t have enough time to stream them in part (or in whole) before the data is needed.</p>
<p>On the other hand, long cinematic sequences, which contain lots of data and play linearly and predictably, can benefit from streaming. This is often straightforward to implement as the whole animation clip can be split into smaller segments each compressed and loaded independently. In practice, cinematics are often loaded on demand through higher level management constructs and as such progressive streaming is not very common (UE4 does support it but to my knowledge it is not currently documented).</p>
<h2 id="the-gist">The gist</h2>
<p>Here are the constraints we work with and our wish list:</p>
<ul>
<li>Animations can play at any time and must retain some quality for their full duration</li>
<li>Not all key frames are equally important, some must always be present while others can be discarded or loaded later</li>
<li>Most animations are short</li>
<li>Large contiguous reads from disk are better than many small random reads</li>
<li>Decompression must remain as fast as possible</li>
</ul>
<h2 id="enter-progressive-quality-streaming">Enter progressive quality streaming</h2>
<p>Because most animations are short, it makes sense to attempt to load their data in bulk. As such, ACL now supports aggregating multiple animation clips into a single database. Important key frames and other metadata required to decompress remain in the animation clip; at runtime, the database can optionally be provided to supply the remaining data.</p>
<p>This leads us to the next question: how do we partition our data? How do we determine what remains in the animation clip and what can be streamed later? Crucially, how do we make sure that decompression remains fast now that our data can live in multiple locations?</p>
<p>To solve the second part of the problem, during compression ACL tags each whole key frame with how much error removing it contributes. We use this information to construct a variant of <a href="/2016/12/07/anim_compression_key_reduction/">linear key reduction</a> where only whole key frames are removed. In our case however, they are simply moved to the database instead of being lost forever. When we sample our animation clip, a single search tells us which key frames we need, keeping the cost constant regardless of how many key frames or joints the clip has. By further limiting the number of key frames in a segment to a maximum of 32, finding the data we need boils down to a few bit scanning operations efficiently implemented in hardware.</p>
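<p>For example, with at most 32 key frames per segment, the retained key frames can be tracked with a single 32 bit mask, and the two retained key frames surrounding our sample can be found with a couple of bit scans. Here is a sketch, assuming GCC/Clang builtins and that the first and last key frames of a segment are always retained (so neither mask is ever zero):</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cstdint&gt;
#include &lt;utility&gt;

// 'retained' has bit i set when key frame i is present in memory.
std::pair&lt;int, int&gt; find_interpolation_keys(uint32_t retained, int sample_index)
{
    // Retained key frames at or before our sample; the closest is the highest set bit
    const uint32_t prev_mask = retained &amp; ((2u &lt;&lt; sample_index) - 1);
    const int prev_key = 31 - __builtin_clz(prev_mask);

    // Retained key frames at or after our sample; the closest is the lowest set bit
    const uint32_t next_mask = retained &amp; ~((1u &lt;&lt; sample_index) - 1);
    const int next_key = __builtin_ctz(next_mask);

    return { prev_key, next_key };
}
</code></pre></div></div>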
<p>The algorithm is straightforward. We first assume that every key frame is retained. We pick the first one that is movable (e.g. not the first/last in a segment) and measure the resulting error when we remove it (both on itself and on its neighbors that might have already been removed). Any missing key frames are reconstructed using linear interpolation from their neighbors. To measure the error, we use the same <a href="/2016/11/01/anim_compression_accuracy/">error metric</a> used to optimize the variable bit rates. We record how much error the key frame contributes and add it back. We iterate over every key frame this way. Once we have the contributing error of every key frame, we pick the one with the lowest error and remove it permanently, then repeat the whole process with the remaining key frames. Each step removes the key frame that contributes the least error until all are removed. This yields a list of key frames sorted by importance. While not perfect or exhaustive, this gives us a pretty good approximation. The error contribution is then stored as extra metadata within the animation clip; it is only required to build the database and is stripped when we do so.</p>
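<p>In pseudocode form, the pass might look like this (every helper below is a stand-in for the real logic, not ACL’s API):</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cfloat&gt;
#include &lt;vector&gt;

// Greedy ordering sketch: repeatedly remove the key frame that contributes
// the least error until none remain, recording the removal order.
std::vector&lt;int&gt; order_key_frames_by_importance(clip_t&amp; clip)
{
    std::vector&lt;int&gt; removal_order;

    while (has_movable_key_frames(clip))
    {
        int best_key = -1;
        float best_error = FLT_MAX;

        for (int key : movable_key_frames(clip))
        {
            remove_key_frame(clip, key);             // neighbors now interpolate linearly
            const float error = measure_error(clip); // same metric as the bit rates
            restore_key_frame(clip, key);

            if (error &lt; best_error)
            {
                best_error = error;
                best_key = key;
            }
        }

        remove_key_frame(clip, best_key); // permanently remove it
        removal_order.push_back(best_key);
    }

    return removal_order; // least important first
}
</code></pre></div></div>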
<p><img src="/public/error_contribution_step1.jpg" alt="Calculating the error contribution step 1" /></p>
<p><img src="/public/error_contribution_step2.jpg" alt="Calculating the error contribution step 2" /></p>
<p><img src="/public/database_frame_importance.jpg" alt="Final importance tier" /></p>
<p>Now that we know which key frames are important and which aren’t, we’ll iterate over every animation clip and move the least important key frames out into the database first. Our most important key frames will remain within each animation clip to retain some quality if we need to play back with no database or with only partial database data. How much data each tier contains is user controlled, and the lowest tier can optionally be stripped.</p>
<p><img src="/public/database_settings.jpg" alt="Database settings" /></p>
<p>We consider three quality tiers for now:</p>
<ul>
<li>High importance key frames will remain in the animation clips as they are not optional</li>
<li>Medium and low importance key frames are moved to the database each in its separate tier which can be streamed independently</li>
</ul>
<p>Since we know every clip that is part of the database, we can find the globally optimal distribution. As such, if we wish to move out the least important 50%, we remove as much data as frame stripping would, but now the operation is far less destructive. Some clips will contribute more key frames than others to reach the same memory footprint. This is frame stripping on steroids!</p>
<p><img src="/public/importance_tier_vs_fidelity.jpg" alt="Importance tiers VS visual fidelity" /></p>
<p>This partitioning allows us to represent three visual fidelity levels:</p>
<ul>
<li>Highest visual fidelity requires all three importance quality tiers to be loaded</li>
<li>Medium visual fidelity requires only the high and medium importance tiers to be loaded</li>
<li>Lowest visual fidelity requires only the high importance tier to be loaded (the clip itself)</li>
</ul>
<p><img src="/public/streaming_blueprint_editing.jpg" alt="Controlling visual fidelity with UE4 blueprints" /></p>
<p>Under the hood, ACL allows you to stream both tiers independently and in any order you wish. For simplicity, the UE4 plugin exposes the desired visual fidelity level and the streaming request size granularity while abstracting what needs to stream in or out. This allows the game to allocate memory on demand when data is streamed in while also allowing the game to unload tiers that are no longer needed. In a pinch, the entire database can be unloaded or destroyed and animations can continue to play at the lowest visual fidelity level.</p>
<h2 id="unprecedented-control">Unprecedented control</h2>
<p>What this means is that you can now group animations into as many databases as makes sense for your game. Some animations always need the highest fidelity and shouldn’t belong to any database (e.g. main character locomotion) while general gameplay animations and exotic animations (e.g. emotes) can be split into separate databases for ultimate control. You can now decide at a high level how much data to stream later and when to stream it. Crucially, this means that you can decide ahead of time if a quality tier isn’t required and strip it entirely from disk or you can make that decision at runtime based on the device the game runs on.</p>
<p>You can make a single package for your mobile game that can run with reduced quality on lower end devices while retaining full quality for higher end ones.</p>
<p>Your multiplayer game can stream in the hundreds of emotes by grouping them by popularity lazily in the background.</p>
<iframe src="https://giphy.com/embed/QIiqoufLNmWo8" width="480" height="360" frameborder="0" class="giphy-embed" allowfullscreen=""></iframe>
<p><a href="https://giphy.com/gifs/QIiqoufLNmWo8">via GIPHY</a></p>
<h2 id="room-to-grow">Room to grow</h2>
<p>Because the feature is new, I had to make a few executive decisions to keep the scope down while leaving room for future changes. Here are a few points that could be improved on over time.</p>
<p>I had to settle on three quality tiers both for simplicity and performance. Making the number arbitrary would complicate authoring while potentially degrading decompression performance (database decompression currently adds only 100-150 instructions and 2 cache misses compared to the normal path without a database lookup). That being said, if a good case can be made, that number could be increased to five without too much work or overhead.</p>
<p>Evaluating how much error each key frame contributes works fine but it ends up treating every joint equally. In practice, some joints contribute more while others contribute less. Facial and finger joints are often far less important. Joints that move fast are also less important as any error is less visible to the naked eye (see <a href="http://www.jp.square-enix.com/tech/publications.html">Velocity-based compression of 3D rotation, translation, and scale animations</a> by David Goodhue on using velocity to compress animations). Instead of selecting the contributing error solely along one axis (which key frame), we could split it along a second axis: how important each joint is. This would allow us to retain more quality for a given tier while reaching the same memory footprint. The downside is that this would slightly increase the decompression cost as we would now need to search for four key frames to interpolate (two for high importance joints and two for low importance joints). Adding more partitioning axes increases that cost linearly.</p>
<p>Any key frame can currently be moved to the database with two exceptions: each internal segment retains its first and last key frame. If the settings are aggressive enough, everything else can be moved out into the database. In practice, this is achieved by simply pinning their contributing error to infinity which prevents them from being moved. This same trick could be used to prevent specific key frames from being moved if an animator wished to author that information and feed it to ACL.</p>
<p>In order to avoid small reads from disk, data is split into chunks of 1 MB. At runtime, we specify how many chunks to stream at a time. This means that each chunk contains multiple animation clips. No metadata is currently kept for this mapping and as a result it is not currently possible to stream in specific animation clips as it would somewhat defeat the bulk streaming nature of the work. Should this be needed, we can introduce metadata to stream in individual chunks but I hope that it won’t be necessary. In practice, you can split your animations into as many databases as you need: one per character, per character type, per gameplay mode, etc. Managing streaming in bulk ensures a more optimal usage of resources while at the same time lowering the complexity of what to stream and when.</p>
<h2 id="coming-soon">Coming soon</h2>
<p>All of this work has now landed into the main development branches for ACL and its <a href="https://github.com/nfrechette/acl-ue4-plugin">UE4 plugin</a>. You can try it out now if you wish but if not, the next major release scheduled for late February 2021 will include this and many more improvements. Stay tuned!</p>
<p><a href="/2016/10/21/anim_compression_toc/">Animation Compression Table of Contents</a></p>
Windows Hyper-V woes2020-12-27T00:00:00+00:00http://nfrechette.github.io/2020/12/27/windows_hyper_v_woes<p>This weekend, I noticed that <a href="https://www.amd.com/en/technologies/ryzen-master">AMD Ryzen Master</a> (a great tool to control your CPU when profiling code performance) and <a href="https://www.virtualbox.org/">VirtualBox</a> both stopped working for me under Windows 10. I wasn’t too sure how it happened and I spent a good deal of time chasing what went wrong so I am documenting my process and result here in hope that it might help others as well.</p>
<h2 id="why-arent-they-working">Why aren’t they working?</h2>
<p>Over the last few years, Windows has been rolling out a hypervisor of its own: <a href="https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/about/">Hyper-V</a>. This allows you to run virtual machines (without VMWare, VirtualBox, or some other similar program). It also allows isolation of various <a href="https://docs.microsoft.com/en-us/windows/security/identity-protection/credential-guard/credential-guard-requirements">security components</a>, further improving kernel safety. When the hypervisor is used, Windows runs inside it, much like a guest OS runs inside a host OS in a virtual machine setup. This is the root cause of my woes.</p>
<p>AMD Ryzen Master <a href="https://community.amd.com/t5/drivers-software/amd-ryzen-master-1-3-0-618-vbs-error/td-p/179477">complains</a> (see also <a href="https://github.com/microsoft/WSL/issues/4771">this</a>) that <a href="https://docs.microsoft.com/en-us/windows-hardware/design/device-experiences/oem-vbs">Virtualization Based Security (VBS)</a> is enabled. From my understanding, what this means is that running this program within a virtual machine isn’t supported or recommended. <a href="https://github.com/milesburchell/RyzenMasterVBSFix">Some hacks</a> are floating around to patch the executable so the check is skipped, and they do appear to work (although I haven’t tested them myself). Attempting to disable VBS led me to pain and misery and in the end nothing worked.</p>
<p>In turn, VirtualBox attempts to run a virtual machine inside the hypervisor virtual machine. This is problematic because, to keep performance fast under virtualization, the CPU exposes a hardware feature to accelerate virtual memory translation; that feature is now used by Windows and cannot be shared with VirtualBox. While the latest version will still run your guest OS (Ubuntu in my case), it will be terribly slow. So slow as to be unusable. Older VirtualBox versions might simply refuse to start the guest OS. <a href="https://forums.virtualbox.org/viewtopic.php?f=6&t=90853">This thread</a> is filled with sorrow.</p>
<p>I spent hours poring over forum threads trying to piece together the puzzle.</p>
<h2 id="how-did-it-suddenly-get-turned-on">How did it suddenly get turned on?</h2>
<p>I was confused at first, because I hadn’t changed anything related to Hyper-V or anything of the sort. How then, did it suddenly start being used? As it turns out, I installed <a href="https://www.docker.com/">Docker</a> which is built on top of this.</p>
<h2 id="fixing-this-once-and-for-all">Fixing this once and for all</h2>
<p>Finding a solution wasn’t easy. Microsoft documents the many security features that virtualization provides and since I use my desktop for work as well, I wanted to make sure that turning it off was safe for me. Many forum posts offer various suggestions on what to do from modifying the registry, to uninstalling various things.</p>
<p>The end goal for me was to be able to stop using the virtualization as it cannot coexist with Ryzen Master and VirtualBox. To do so, you must uninstall every piece of software that requires it. <a href="https://forums.virtualbox.org/viewtopic.php?f=1&t=62339#p452019">This VirtualBox forum post</a> lists known components that use it:</p>
<ul>
<li>Application Guard</li>
<li>Credential Guard</li>
<li>Device Guard</li>
<li>&lt;Any&gt; * Guard</li>
<li>Containers</li>
<li>Hyper-V</li>
<li>Virtual Machine Platform</li>
<li>Windows Hypervisor Platform</li>
<li>Windows Sandbox</li>
<li>Windows Server Containers</li>
<li>Windows Subsystem for Linux 2 (WSL2) (WSL1 does not enable Hyper-V)</li>
</ul>
<p>To remove them, press the Windows Key + X -> App and Features -> Programs and Features -> Turn Windows features on or off.</p>
<p>You’ll of course need to uninstall Docker and other similar software.</p>
<p>To figure out if you are using Device Guard and Credential Guard, Microsoft provides <a href="https://www.microsoft.com/en-us/download/details.aspx?id=53337">this tool</a>. Run it from a PowerShell instance with administrative privileges. Make sure the execution environment is unrestricted as well (as per the readme). And run it with the <code class="language-plaintext highlighter-rouge">-Ready</code> switch to see whether or not they are used. Those features might be required by your system administrator if you work in an office or perhaps required by your VPN to connect. In my case, the features weren’t used and as such it was safe for me to remove the virtualization.</p>
<p>Once everything that might require virtualization has been removed, it is time to turn it off. Whether Windows boots inside the virtualized environment or not is determined by the boot loader. <a href="https://stackoverflow.com/questions/30496116/how-to-disable-hyper-v-in-command-line">See this thread for examples</a>. You can create new bootloader entries if you like but I opted to simply turn it off by executing the following in a command prompt with administrative privileges: <code class="language-plaintext highlighter-rouge">bcdedit /set hypervisorlaunchtype off</code>. If you need to turn it back on, execute this: <code class="language-plaintext highlighter-rouge">bcdedit /set hypervisorlaunchtype auto</code>.</p>
<p>Simply reboot and you should be good to go!</p>
Animation clip metadata packing2020-08-11T00:00:00+00:00http://nfrechette.github.io/2020/08/11/clip_metadata_packing<p>In my <a href="/2020/08/09/animation_data_numbers/">previous blog post</a>, I analyzed over 15000 animations. This offered a number of key insights about what animation data looks like for compression purposes. If you haven’t read it yet, I suggest you do so first as it covers the terminology and basics for this post. I’ll also be using the same datasets as described in the previous post.</p>
<p>As mentioned, most clips are fairly short and the metadata we retain ends up having a disproportionate impact on the memory footprint. This is because long and short clips have the same amount of metadata, everything else being equal.</p>
<h2 id="packing-range-values">Packing range values</h2>
<p>Each animated track has its samples normalized within the full range of motion for each clip. This ends up being stored as a minimum value and a range extent. Both are three full precision 32 bit floats. Reconstructing our original sample is done like this:</p>
<p><code class="language-plaintext highlighter-rouge">sample = (normalized sample * range extent) + range minimum</code></p>
<p>This is quick and efficient.</p>
<p>Originally, I decided to retain full precision here out of convenience and expediency. But for a while now, I’ve been wondering whether we could quantize the range values onto fewer bits as well. In particular, each sample is also normalized a second time within the range of motion of the segment it lies in, and those per segment range values are quantized to 8 bits per component. This works great as range values do not need to be all that accurate. In fact, over two years ago I tried quantizing the segment range values onto 16 bits instead to see if things would improve (accuracy or memory footprint) and to my surprise, the result was about the same. The larger metadata footprint allowed higher accuracy and fewer bits retained per animated sample, but over a large dataset the two canceled out.</p>
<p>In order to quantize our range values, we must first extract the total range: every sample from every track. This creates a sort of Axis Aligned Bounding Box for rotation, translation, and scale. Ideally we want to treat those separately since by their very nature, their accepted range of values can differ by quite a bit. For translation and scale, things are a bit complicated as some tracks require full precision and the range can be very dynamic from track to track. In order to test out this optimization idea, I opted to try with rotations first. Rotations are much easier to handle since quaternions have their components already normalized within <code class="language-plaintext highlighter-rouge">[-1.0, 1.0]</code>. I went ahead and quantized each component to 8 bits with padding to maintain the alignment. Instead of storing 6x float32 (24 bytes), we are storing 8x uint8 (8 bytes). This represents a 3x reduction in size.</p>
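<p>As a sketch, here is what the round trip might look like for a single rotation range component (my own helper names, not ACL’s):</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;algorithm&gt;
#include &lt;cstdint&gt;

// Quaternion components already lie within [-1.0, 1.0] so we can map
// that interval onto [0, 255] and back.
inline uint8_t pack_range_value(float value)
{
    const float normalized = (value + 1.0f) * 0.5f; // now within [0.0, 1.0]
    return static_cast&lt;uint8_t&gt;(std::clamp(normalized * 255.0f + 0.5f, 0.0f, 255.0f));
}

inline float unpack_range_value(uint8_t value)
{
    return (static_cast&lt;float&gt;(value) / 255.0f) * 2.0f - 1.0f;
}
</code></pre></div></div>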
<p>Here are the results:</p>
<table>
<thead>
<tr>
<th>CMU</th>
<th>Before</th>
<th>After</th>
</tr>
</thead>
<tbody>
<tr>
<td>Compressed size</td>
<td>70.61 MB</td>
<td>70.08 MB</td>
</tr>
<tr>
<td>Compressed size 50th percentile</td>
<td>15.35 KB</td>
<td>15.20 KB</td>
</tr>
<tr>
<td>Compression ratio</td>
<td>20.24 : 1</td>
<td>20.40 : 1</td>
</tr>
<tr>
<td>Max error</td>
<td>0.0725 cm</td>
<td>0.0741 cm</td>
</tr>
<tr>
<td>Track error 99th percentile</td>
<td>0.0089 cm</td>
<td>0.0089 cm</td>
</tr>
<tr>
<td>Error threshold percentile rank (0.01 cm)</td>
<td>99.86 %</td>
<td>99.86 %</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th>Paragon</th>
<th>Before</th>
<th>After</th>
</tr>
</thead>
<tbody>
<tr>
<td>Compressed size</td>
<td>208.04 MB</td>
<td>205.92 MB</td>
</tr>
<tr>
<td>Compressed size 50th percentile</td>
<td>15.83 KB</td>
<td>15.12 KB</td>
</tr>
<tr>
<td>Compression ratio</td>
<td>20.55 : 1</td>
<td>20.77 : 1</td>
</tr>
<tr>
<td>Max error</td>
<td>2.8824 cm</td>
<td>3.5543 cm</td>
</tr>
<tr>
<td>Track error 99th percentile</td>
<td>0.0099 cm</td>
<td>0.0111 cm</td>
</tr>
<tr>
<td>Error threshold percentile rank (0.01 cm)</td>
<td>99.04 %</td>
<td>98.89 %</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th>Fortnite</th>
<th>Before</th>
<th>After</th>
</tr>
</thead>
<tbody>
<tr>
<td>Compressed size</td>
<td>482.65 MB</td>
<td>500.95 MB</td>
</tr>
<tr>
<td>Compressed size 50th percentile</td>
<td>9.69 KB</td>
<td>9.65 KB</td>
</tr>
<tr>
<td>Compression ratio</td>
<td>36.72 : 1</td>
<td>35.38 : 1</td>
</tr>
<tr>
<td>Max error</td>
<td>69.375 cm</td>
<td>69.375 cm</td>
</tr>
<tr>
<td>Track error 99th percentile</td>
<td>0.0316 cm</td>
<td>0.0319 cm</td>
</tr>
<tr>
<td>Error threshold percentile rank (0.01 cm)</td>
<td>97.69 %</td>
<td>97.62 %</td>
</tr>
</tbody>
</table>
<p>At first glance, it appears to be a small net win with CMU and Paragon but then everything goes downhill with Fortnite. Even though all three datasets see a win in the compressed size for 50% of their clips, the end result is a significant net loss for Fortnite. The accuracy is otherwise slightly lower. As I’ve <a href="/2017/09/10/acl_v0.4.0/">mentioned before</a>, the max error, although interesting, can be very misleading.</p>
<p>It is clear that for some clips this is a win, but not always nor overall. Due to the added complexity and the small gain for CMU and Paragon, I’ve opted not to enable this optimization nor push it further at this time. It requires more nuance to get right but it is nonetheless worth revisiting at some point in the future. In particular, I want to wait until I have rewritten how constant tracks are identified. Nearly every animation compression implementation out there that detects constant tracks (ACL included) does so by using a local space error threshold. This means that it ignores the object space error that it contributes to. In turn, this sometimes causes undesirable artifacts in exotic cases where a track needs to be animated below the threshold at which it is detected to be constant. I plan to handle this more holistically by integrating it as part of the global optimization algorithm: a track will be constant for the clip only if it contributes an acceptable error in object space.</p>
<h2 id="packing-constant-samples">Packing constant samples</h2>
<p>Building on the previous range packing work, we can also use the same trick to quantize our constant track samples. Here however, 8 bits is too little, so I quantized the constant rotation components to 16 bits. Instead of storing 3x float32 (12 bytes) for each constant rotation sample, we’ll be storing 4x uint16 (8 bytes): a 1.5x reduction in size.</p>
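<p>A sketch of the 16 bit packing for a single quaternion component, along with reconstructing the dropped W component (my own helper names; it assumes the quaternion was stored with a positive W, as discussed further below):</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;algorithm&gt;
#include &lt;cmath&gt;
#include &lt;cstdint&gt;

inline uint16_t pack_quat_component_16(float value) // value within [-1.0, 1.0]
{
    const float normalized = (value + 1.0f) * 0.5f;
    return static_cast&lt;uint16_t&gt;(std::clamp(normalized * 65535.0f + 0.5f, 0.0f, 65535.0f));
}

inline float unpack_quat_component_16(uint16_t value)
{
    return (static_cast&lt;float&gt;(value) / 65535.0f) * 2.0f - 1.0f;
}

// Reconstruct W from the other three components; quaternions are unit length
inline float reconstruct_quat_w(float x, float y, float z)
{
    return std::sqrt(std::max(0.0f, 1.0f - (x * x + y * y + z * z)));
}
</code></pre></div></div>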
<p><em>Before results contain packed range values as described above.</em></p>
<table>
<thead>
<tr>
<th>CMU</th>
<th>Before</th>
<th>After</th>
</tr>
</thead>
<tbody>
<tr>
<td>Compressed size</td>
<td>70.08 MB</td>
<td>72.54 MB</td>
</tr>
<tr>
<td>Compressed size 50th percentile</td>
<td>15.20 KB</td>
<td>15.72 KB</td>
</tr>
<tr>
<td>Compression ratio</td>
<td>20.40 : 1</td>
<td>19.70 : 1</td>
</tr>
<tr>
<td>Max error</td>
<td>0.0741 cm</td>
<td>0.0734 cm</td>
</tr>
<tr>
<td>Track error 99th percentile</td>
<td>0.0089 cm</td>
<td>0.0097 cm</td>
</tr>
<tr>
<td>Error threshold percentile rank (0.01 cm)</td>
<td>99.86 %</td>
<td>99.38 %</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th>Paragon</th>
<th>Before</th>
<th>After</th>
</tr>
</thead>
<tbody>
<tr>
<td>Compressed size</td>
<td>205.92 MB</td>
<td>213.17 MB</td>
</tr>
<tr>
<td>Compressed size 50th percentile</td>
<td>15.12 KB</td>
<td>15.43 KB</td>
</tr>
<tr>
<td>Compression ratio</td>
<td>20.77 : 1</td>
<td>20.06 : 1</td>
</tr>
<tr>
<td>Max error</td>
<td>3.5543 cm</td>
<td>5.8224 cm</td>
</tr>
<tr>
<td>Track error 99th percentile</td>
<td>0.0111 cm</td>
<td>0.0344 cm</td>
</tr>
<tr>
<td>Error threshold percentile rank (0.01 cm)</td>
<td>98.89 %</td>
<td>96.84 %</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th>Fortnite</th>
<th>Before</th>
<th>After</th>
</tr>
</thead>
<tbody>
<tr>
<td>Compressed size</td>
<td>500.95 MB</td>
<td>663.01 MB</td>
</tr>
<tr>
<td>Compressed size 50th percentile</td>
<td>9.65 KB</td>
<td>9.83 KB</td>
</tr>
<tr>
<td>Compression ratio</td>
<td>35.38 : 1</td>
<td>26.73 : 1</td>
</tr>
<tr>
<td>Max error</td>
<td>69.375 cm</td>
<td>5537580.0 cm</td>
</tr>
<tr>
<td>Track error 99th percentile</td>
<td>0.0319 cm</td>
<td>0.9272 cm</td>
</tr>
<tr>
<td>Error threshold percentile rank (0.01 cm)</td>
<td>97.62 %</td>
<td>88.53 %</td>
</tr>
</tbody>
</table>
<p>Even though our clip metadata shrinks considerably, overall this yields a significant net loss: the reduced accuracy forces animated samples to retain more bits. Lossless compression techniques might work better here, although it would still be quite hard since each constant sample is disjoint: there is little redundancy to take advantage of.</p>
<p>With constant rotation tracks, the quaternion W component is dropped just like for animated samples. I also tried to retain the W component with full precision along with the other three. The idea being that if reducing accuracy increases the footprint, would increasing the accuracy reduce it? Sadly, it doesn’t. The memory footprint ended up being marginally higher. It seems like the sweet spot is to drop one of the quaternion components.</p>
<h2 id="is-there-any-hope-left">Is there any hope left?</h2>
<p>Although both of these optimizations turned out to be failures, I thought it best to document them here anyway. With each idea I try, whether it pans out or not I learn more about the domain and I grow wiser.</p>
<p>There still remain opportunities to optimize the clip metadata but they require a bit more engineering to test out. For one, many animation clips will have constant tracks in common. For example, if the same character is animated in many different ways over several animations, each of them might find that many sub-tracks are not animated. In particular, translation is rarely animated but very often constant as it often holds the bind pose. To better optimize these, animation clips must be compressed together into a database of sorts. This gives us the opportunity to identify redundancies across many clips.</p>
<p>In a few weeks I’ll begin implementing animation streaming and to do so I’ll need to create such a database. This will open the door to these kind of optimizations. Stay tuned!</p>
<p><a href="/2016/10/21/anim_compression_toc/">Animation Compression Table of Contents</a></p>
Animation data in numbers2020-08-09T00:00:00+00:00http://nfrechette.github.io/2020/08/09/animation_data_numbers<p>For a while now, I’ve been meaning to take some time to instrument the <a href="https://github.com/nfrechette/acl">Animation Compression Library (aka ACL)</a> and look more in depth at the animations I have. Over the years I’ve had a lot of ideas on what could be improved and good data is critical in deciding where my time is best spent optimizing. I’m not sure if this will be of interest to many people out there but since I went ahead and did all that work, I might as well publish it!</p>
<h2 id="the-basics">The basics</h2>
<p>An animation clip is made up of a number of tracks each containing an equal number of transform or scalar samples (we use uniform sampling). Transform tracks are further broken down into sub-tracks for rotation, translation, and scale. Each sub-track is in one of three states:</p>
<ul>
<li>Animated: every sample is retained and quantized</li>
<li>Constant: a single sample is repeating and retained</li>
<li>Default: a single repeating sample equal to the sub-track identity, no sample retained</li>
</ul>
<p>Each sub-track type has its own identity. Rotations have the quaternion identity, translations have <code class="language-plaintext highlighter-rouge">0.0, 0.0, 0.0</code>, and scale tracks have <code class="language-plaintext highlighter-rouge">0.0, 0.0, 0.0</code> or <code class="language-plaintext highlighter-rouge">1.0, 1.0, 1.0</code> depending on whether the clip is additive or not and its additive type.</p>
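<p>For illustration, the states and identities could be expressed like this (these names are mine, not ACL’s):</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>enum class sub_track_state { animated, constant, default_ };

constexpr float rotation_identity[4]    = { 0.0f, 0.0f, 0.0f, 1.0f }; // quaternion identity
constexpr float translation_identity[3] = { 0.0f, 0.0f, 0.0f };
// The scale identity is 0.0f or 1.0f per component depending on whether
// the clip is additive and on its additive type, as described above
</code></pre></div></div>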
<p>Being able to collapse constant and default tracks is <a href="/2016/11/03/anim_compression_constant_tracks/">an important optimization</a>. They are fairly common and allow us to considerably trim down on the number of samples we need to compress.</p>
<p>Sub-tracks that are animated end up having their <a href="/2016/11/09/anim_compression_range_reduction/">samples normalized</a> within the full range of values within the clip. This range information is later used to reconstruct our original sample.</p>
<p><em>Note that constant sample values and range values are currently stored with full float32 precision.</em></p>
<h2 id="the-compression-format">The compression format</h2>
<p>When an animation is compressed with ACL, it is first broken down into <a href="/2016/11/10/anim_compression_uniform_segmenting/">segments</a> of approximately 16 samples per track. As such, we end up with data that is needed regardless where we sample our clip and data that is needed only when we need a particular segment.</p>
<p>The memory layout roughly breaks down like this:</p>
<ul>
<li>Per clip metadata
<ul>
<li>Headers</li>
<li>Offsets</li>
<li>Track sub-type states</li>
<li>Per clip constant samples</li>
<li>Per clip animated sample ranges</li>
</ul>
</li>
<li>Our segments</li>
<li>Optional metadata</li>
</ul>
<p>Each segment ends up containing the following:</p>
<ul>
<li>Per segment metadata
<ul>
<li>Number of bits used per sub-track</li>
<li>Sub-track range information (we also normalize our samples per segment)</li>
</ul>
</li>
<li>The animated samples packed on a variable number of bits</li>
</ul>
<p>In order to figure out what to try and optimize next, I wanted to see where the memory goes within the above categories.</p>
<h2 id="the-datasets">The datasets</h2>
<p>I use three datasets for regression testing and for research and development:</p>
<ul>
<li><a href="https://github.com/nfrechette/acl/blob/develop/docs/cmu_performance.md">Carnegie-Mellon University motion capture database</a></li>
<li><a href="https://github.com/nfrechette/acl/blob/develop/docs/paragon_performance.md">Paragon</a></li>
<li><a href="/2019/01/25/compressing_fortnite_animations/">Fortnite</a></li>
</ul>
<p>We’ll focus more on Paragon and Fortnite since they are more representative and substantial but CMU is included regardless.</p>
<p><em>Special thanks to Epic for letting me use their animations for research purposes!</em></p>
<p>When calculating the raw size of an animation clip, I assume that each track is animated and that nothing is stripped. As such, the raw size can be calculated easily: <code class="language-plaintext highlighter-rouge">raw size = num bones * num samples * 10 * sizeof(float)</code>. Each sample is made up of 10 floats: 4x for the rotation, 3x for the translation, and 3x for the scale.</p>
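<p><em>For example, a hypothetical clip with 60 bones and 300 samples per track would have a raw size of 60 * 300 * 10 * 4 bytes = 720,000 bytes, or roughly 0.69 MB.</em></p>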
<p>Here is how they look at a glance:</p>
<table>
<thead>
<tr>
<th> </th>
<th>CMU</th>
<th>Paragon</th>
<th>Fortnite</th>
</tr>
</thead>
<tbody>
<tr>
<td>Number of clips</td>
<td>2534</td>
<td>6558</td>
<td>8310</td>
</tr>
<tr>
<td>Raw size</td>
<td>1429.38 MB</td>
<td>4276.11 MB</td>
<td>17724.75 MB</td>
</tr>
<tr>
<td>Compressed size</td>
<td>71.00 MB</td>
<td>208.39 MB</td>
<td>483.54 MB</td>
</tr>
<tr>
<td>Compression ratio</td>
<td>20.13 : 1</td>
<td>20.52 : 1</td>
<td>36.66 : 1</td>
</tr>
<tr>
<td>Number of tracks</td>
<td>111496</td>
<td>816700</td>
<td>1559545</td>
</tr>
<tr>
<td>Number of sub-tracks</td>
<td>334488</td>
<td>2450100</td>
<td>4678635</td>
</tr>
</tbody>
</table>
<h3 id="sub-track-breakdown">Sub-track breakdown</h3>
<table>
<thead>
<tr>
<th> </th>
<th>CMU</th>
<th>Paragon</th>
<th>Fortnite</th>
</tr>
</thead>
<tbody>
<tr>
<td>Number of default sub-tracks</td>
<td>34.09%</td>
<td>37.26%</td>
<td>42.01%</td>
</tr>
<tr>
<td>Number of constant sub-tracks</td>
<td>52.29%</td>
<td>43.95%</td>
<td>50.33%</td>
</tr>
<tr>
<td>Number of animated sub-tracks</td>
<td>13.62%</td>
<td>18.78%</td>
<td>7.66%</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th>CMU</th>
<th>Default</th>
<th>Constant</th>
<th>Animated</th>
</tr>
</thead>
<tbody>
<tr>
<td>Number of rotation sub-tracks</td>
<td>0.00%</td>
<td>64.41%</td>
<td>38.59%</td>
</tr>
<tr>
<td>Number of translation sub-tracks</td>
<td>2.27%</td>
<td>95.46%</td>
<td>2.27%</td>
</tr>
<tr>
<td>Number of scale sub-tracks</td>
<td>100.00%</td>
<td>0.00%</td>
<td>0.00%</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th>Paragon</th>
<th>Default</th>
<th>Constant</th>
<th>Animated</th>
</tr>
</thead>
<tbody>
<tr>
<td>Number of rotation sub-tracks</td>
<td>10.85%</td>
<td>47.69%</td>
<td>41.45%</td>
</tr>
<tr>
<td>Number of translation sub-tracks</td>
<td>3.32%</td>
<td>82.98%</td>
<td>13.70%</td>
</tr>
<tr>
<td>Number of scale sub-tracks</td>
<td>97.62%</td>
<td>1.19%</td>
<td>1.19%</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th>Fortnite</th>
<th>Default</th>
<th>Constant</th>
<th>Animated</th>
</tr>
</thead>
<tbody>
<tr>
<td>Number of rotation sub-tracks</td>
<td>21.15%</td>
<td>62.84%</td>
<td>16.01%</td>
</tr>
<tr>
<td>Number of translation sub-tracks</td>
<td>7.64%</td>
<td>86.78%</td>
<td>5.58%</td>
</tr>
<tr>
<td>Number of scale sub-tracks</td>
<td>97.23%</td>
<td>1.38%</td>
<td>1.39%</td>
</tr>
</tbody>
</table>
<p>Overall, across all three data sets, about half the sub-tracks are constant. Translation sub-tracks tend to be constant much more often. Most sub-tracks aren’t animated, but rotation sub-tracks tend to be animated the most: they are 3x more likely to be animated than translation sub-tracks, and scale sub-tracks are very rarely animated (~1.3% of the time). As such, segment data mostly contains animated rotation data.</p>
<h3 id="segment-breakdown">Segment breakdown</h3>
<table>
<thead>
<tr>
<th>Number of segments</th>
<th>CMU</th>
<th>Paragon</th>
<th>Fortnite</th>
</tr>
</thead>
<tbody>
<tr>
<td>Total</td>
<td>51909</td>
<td>49213</td>
<td>121175</td>
</tr>
<tr>
<td>50th percentile</td>
<td>10</td>
<td>3</td>
<td>2</td>
</tr>
<tr>
<td>85th percentile</td>
<td>42</td>
<td>9</td>
<td>18</td>
</tr>
<tr>
<td>99th percentile</td>
<td>116</td>
<td>117</td>
<td>187</td>
</tr>
</tbody>
</table>
<p>Half the clips are very short and only contain 2 or 3 segments for Paragon and Fortnite. Those clips are likely to be 50 frames or less, or about 1-2 seconds at 30 FPS. For Paragon, 85% of the clips have 9 segments or less and in Fortnite we have 18 segments or less.</p>
<h3 id="compressed-size-breakdown">Compressed size breakdown</h3>
<table>
<thead>
<tr>
<th>Compressed size</th>
<th>CMU</th>
<th>Paragon</th>
<th>Fortnite</th>
</tr>
</thead>
<tbody>
<tr>
<td>Total</td>
<td>71.00 MB</td>
<td>208.39 MB</td>
<td>483.54 MB</td>
</tr>
<tr>
<td>50th percentile</td>
<td>15.42 KB</td>
<td>15.85 KB</td>
<td>9.71 KB</td>
</tr>
<tr>
<td>85th percentile</td>
<td>56.36 KB</td>
<td>48.18 KB</td>
<td>72.22 KB</td>
</tr>
<tr>
<td>99th percentile</td>
<td>153.68 KB</td>
<td>354.54 KB</td>
<td>592.13 KB</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th>Clip metadata size</th>
<th>CMU</th>
<th>Paragon</th>
<th>Fortnite</th>
</tr>
</thead>
<tbody>
<tr>
<td>Total size</td>
<td>4.22 MB (5.94%)</td>
<td>24.38 MB (11.70%)</td>
<td>38.32 MB (7.92%)</td>
</tr>
<tr>
<td>50th percentile</td>
<td>9.73%</td>
<td>22.03%</td>
<td>46.06%</td>
</tr>
<tr>
<td>85th percentile</td>
<td>18.68%</td>
<td>46.06%</td>
<td>97.43%</td>
</tr>
<tr>
<td>99th percentile</td>
<td>37.38%</td>
<td>98.48%</td>
<td>98.64%</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th>Segment metadata size</th>
<th>CMU</th>
<th>Paragon</th>
<th>Fortnite</th>
</tr>
</thead>
<tbody>
<tr>
<td>Total size</td>
<td>6.23 MB (8.78%)</td>
<td>22.61 MB (10.85%)</td>
<td>54.21 MB (11.21%)</td>
</tr>
<tr>
<td>50th percentile</td>
<td>8.07%</td>
<td>6.88%</td>
<td>0.75%</td>
</tr>
<tr>
<td>85th percentile</td>
<td>9.28%</td>
<td>11.37%</td>
<td>10.95%</td>
</tr>
<tr>
<td>99th percentile</td>
<td>10.59%</td>
<td>21.00%</td>
<td>26.21%</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th>Segment animated data size</th>
<th>CMU</th>
<th>Paragon</th>
<th>Fortnite</th>
</tr>
</thead>
<tbody>
<tr>
<td>Total size</td>
<td>60.44 MB (85.13%)</td>
<td>160.98 MB (77.25%)</td>
<td>390.15 MB (80.69%)</td>
</tr>
<tr>
<td>50th percentile</td>
<td>81.92%</td>
<td>70.55%</td>
<td>48.22%</td>
</tr>
<tr>
<td>85th percentile</td>
<td>87.08%</td>
<td>79.62%</td>
<td>81.65%</td>
</tr>
<tr>
<td>99th percentile</td>
<td>88.55%</td>
<td>87.93%</td>
<td>89.25%</td>
</tr>
</tbody>
</table>
<p>From this data, we can conclude that our efforts might be best spent optimizing the clip metadata where the constant track data and the animated range data will contribute much more relative to the overall footprint. Short clips have less animated data but just as much metadata as longer clips with an equal number of bones. Even though overall the clip metadata doesn’t contribute much, for the vast majority of clips it does contribute a significant amount (for half the Fortnite clips, clip metadata represented 46.06% or more of the total clip size).</p>
<p>The numbers are somewhat skewed by the fact that a few clips are <em>very</em> long. Their animated footprint ends up dominating the overall numbers, which is why breaking things down by percentile is insightful here.</p>
<p>The clip metadata isn’t as optimized and contains more low-hanging fruit to attack. I’ve spent most of my time focusing on the segment animated data; as a result, pushing things further on that front is much harder and carries a higher risk that what I try won’t pan out.</p>
<h3 id="animated-data-breakdown">Animated data breakdown</h3>
<table>
<thead>
<tr>
<th> </th>
<th>CMU</th>
<th>Paragon</th>
<th>Fortnite</th>
</tr>
</thead>
<tbody>
<tr>
<td>Total compressed size</td>
<td>71.00 MB</td>
<td>208.39 MB</td>
<td>483.54 MB</td>
</tr>
<tr>
<td>Animated data size</td>
<td>60.44 MB (85.13%)</td>
<td>160.98 MB (77.25%)</td>
<td>390.15 MB (80.69%)</td>
</tr>
<tr>
<td>70% of animated data removed</td>
<td>28.69 MB (2.47x smaller)</td>
<td>95.70 MB (2.18x smaller)</td>
<td>210.44 MB (2.30x smaller)</td>
</tr>
<tr>
<td>50% of animated data removed</td>
<td>40.78 MB (1.74x smaller)</td>
<td>127.90 MB (1.63x smaller)</td>
<td>288.47 MB (1.68x smaller)</td>
</tr>
<tr>
<td>25% of animated data removed</td>
<td>55.89 MB (1.27x smaller)</td>
<td>168.15 MB (1.24x smaller)</td>
<td>386.00 MB (1.25x smaller)</td>
</tr>
</tbody>
</table>
<p>A common optimization is to strip a number of frames from the animation data (aka sub-sampling or frame stripping). This is very destructive but can yield good memory savings. Since we know how much animated data we have and its relative footprint, we can compute ballpark numbers for how much smaller the result might be if we removed 70%, 50%, or 25% of our animated data. The numbers above represent the total compressed size after stripping and the reduction factor.</p>
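<p>For reference, here is the arithmetic behind those estimates (a minimal sketch, not part of ACL): only the animated portion shrinks while the clip and segment metadata remain untouched.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Ballpark estimate of the compressed size after frame stripping.
// Only the animated data shrinks; the metadata is unaffected.
double estimate_stripped_size(double total_size, double animated_size, double stripped_fraction)
{
    return total_size - (animated_size * stripped_fraction);
}

// e.g. Fortnite with 70% of the animated data stripped:
// estimate_stripped_size(483.54, 390.15, 0.70) yields ~210.44 MB,
// or about 483.54 / 210.44 = ~2.30x smaller, matching the table above.
</code></pre></div></div>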
<p>In my next post, I’ll explore the results of <a href="/2020/08/11/clip_metadata_packing/">quantizing the constant sample values and the clip range values</a>. Stay tuned!</p>
<p><a href="/2016/10/21/anim_compression_toc/">Animation Compression Table of Contents</a></p>
Zero overhead backward compatibility2020-07-18T00:00:00+00:00http://nfrechette.github.io/2020/07/18/zero_overhead_backward_compat<p>The <a href="https://github.com/nfrechette/acl">Animation Compression Library</a> finally <a href="https://github.com/nfrechette/acl/pull/307">supports backward compatibility</a> (going forward once 2.0 comes out). I’m really happy with how it turned out so I thought I would write a bit about how the ACL decompression is designed.</p>
<h2 id="the-api">The API</h2>
<p>At runtime, decompressing animation data could not be easier:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">acl</span><span class="o">::</span><span class="n">decompression_context</span><span class="o"><</span><span class="n">custom_decompression_settings</span><span class="o">></span> <span class="n">context</span><span class="p">;</span>
<span class="n">context</span><span class="p">.</span><span class="n">initialize</span><span class="p">(</span><span class="n">compressed_data</span><span class="p">);</span>
<span class="c1">// Seek 1.0 second into our compressed animation</span>
<span class="c1">// and don't use rounding to interpolate</span>
<span class="n">context</span><span class="p">.</span><span class="n">seek</span><span class="p">(</span><span class="mf">1.0</span><span class="n">f</span><span class="p">,</span> <span class="n">acl</span><span class="o">::</span><span class="n">sample_rounding_policy</span><span class="o">::</span><span class="n">none</span><span class="p">);</span>
<span class="n">custom_writer</span> <span class="nf">writer</span><span class="p">(</span><span class="n">output_data</span><span class="p">);</span>
<span class="n">context</span><span class="p">.</span><span class="n">decompress_tracks</span><span class="p">(</span><span class="n">writer</span><span class="p">);</span>
</code></pre></div></div>
<p>A small context object is created and bound to our compressed data. Its construction is cheap enough that it can be done on the stack on demand. It can then subsequently be used (and re-used) to seek and decompress.</p>
<p>A key design goal is to have as little overhead as possible: pay only for what you use. This is achieved through templating in two ways:</p>
<ul>
<li>The <code class="language-plaintext highlighter-rouge">custom_decompression_settings</code> argument on the context object controls what features are enabled or disabled.</li>
<li>The <code class="language-plaintext highlighter-rouge">custom_writer</code> wraps whatever container you might be using in your game engine to represent the animation data. This is to make sure no extra copying is required (a sketch follows below).</li>
</ul>
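<p>Here is a hedged sketch of what such a writer might look like. The base class and method names follow my understanding of the ACL track writer interface and should be verified against the headers; <code class="language-plaintext highlighter-rouge">engine_transform</code> is a hypothetical engine-side pose type.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Writes decompressed transforms straight into the engine's pose
// buffer so that no intermediate copy is needed.
struct custom_writer final : public acl::track_writer
{
    explicit custom_writer(engine_transform* output) : pose(output) {}

    void RTM_SIMD_CALL write_rotation(uint32_t track_index, rtm::quatf_arg0 rotation)
    { pose[track_index].rotation = rotation; }

    void RTM_SIMD_CALL write_translation(uint32_t track_index, rtm::vector4f_arg0 translation)
    { pose[track_index].translation = translation; }

    void RTM_SIMD_CALL write_scale(uint32_t track_index, rtm::vector4f_arg0 scale)
    { pose[track_index].scale = scale; }

    engine_transform* pose; // hypothetical pose buffer, one entry per track
};
</code></pre></div></div>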
<p>The decompression settings are where the magic happens.</p>
<h2 id="compile-time-user-control">Compile time user control</h2>
<p>There are many game engines out there, each handling animation in its own specific way. In order to integrate as seamlessly as possible, ACL exposes a small struct that can be overridden to control its behavior. By leveraging <code class="language-plaintext highlighter-rouge">constexpr</code> and templating, features that aren’t used or needed can be removed entirely at compile time to ensure zero cost at runtime.</p>
<p><a href="https://github.com/nfrechette/acl/blob/8f1849adbf52c6f4af7aa3ba88e59c93f57404b6/includes/acl/decompression/decompress.h#L64">Here is how it looks</a>:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">decompression_settings</span>
<span class="p">{</span>
<span class="c1">// Whether or not to clamp the sample time when `seek(..)`</span>
<span class="c1">// is called. Defaults to true.</span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="kt">bool</span> <span class="n">clamp_sample_time</span><span class="p">()</span> <span class="p">{</span> <span class="k">return</span> <span class="nb">true</span><span class="p">;</span> <span class="p">}</span>
<span class="c1">// Whether or not the specified track type is supported.</span>
<span class="c1">// Defaults to true.</span>
<span class="c1">// If a track type is statically known not to be supported,</span>
<span class="c1">// the compiler can strip the associated code.</span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="kt">bool</span> <span class="n">is_track_type_supported</span><span class="p">(</span><span class="n">track_type8</span> <span class="cm">/*type*/</span><span class="p">)</span>
<span class="p">{</span> <span class="k">return</span> <span class="nb">true</span><span class="p">;</span> <span class="p">}</span>
<span class="c1">// Other stuff ...</span>
<span class="c1">// Which version we should optimize for.</span>
<span class="c1">// If 'any' is specified, the decompression context will</span>
<span class="c1">// support every single version with full backwards</span>
<span class="c1">// compatibility.</span>
<span class="c1">// Using a specific version allows the compiler to</span>
<span class="c1">// statically strip code for all other versions.</span>
<span class="c1">// This allows the creation of context objects specialized</span>
<span class="c1">// for specific versions which yields optimal performance.</span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="n">compressed_tracks_version16</span> <span class="n">version_supported</span><span class="p">()</span>
<span class="p">{</span> <span class="k">return</span> <span class="n">compressed_tracks_version16</span><span class="o">::</span><span class="n">any</span><span class="p">;</span> <span class="p">}</span>
<span class="c1">// Whether the specified rotation/translation/scale format</span>
<span class="c1">// are supported or not.</span>
<span class="c1">// Use this to strip code related to formats you do not need.</span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="kt">bool</span> <span class="n">is_rotation_format_supported</span><span class="p">(</span><span class="n">rotation_format8</span> <span class="cm">/*format*/</span><span class="p">)</span>
<span class="p">{</span> <span class="k">return</span> <span class="nb">true</span><span class="p">;</span> <span class="p">}</span>
<span class="c1">// Other stuff ...</span>
<span class="c1">// Whether rotations should be normalized before being</span>
<span class="c1">// output or not. Some animation runtimes will normalize</span>
<span class="c1">// in a separate step and do not need the explicit</span>
<span class="c1">// normalization.</span>
<span class="c1">// Enabled by default for safety.</span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="kt">bool</span> <span class="n">normalize_rotations</span><span class="p">()</span> <span class="p">{</span> <span class="k">return</span> <span class="nb">true</span><span class="p">;</span> <span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>
<p>Extending this is simple and clean:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">default_transform_decompression_settings</span> <span class="o">:</span> <span class="k">public</span> <span class="n">decompression_settings</span>
<span class="p">{</span>
<span class="c1">// Only support transform tracks</span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="kt">bool</span> <span class="n">is_track_type_supported</span><span class="p">(</span><span class="n">track_type8</span> <span class="n">type</span><span class="p">)</span>
<span class="p">{</span> <span class="k">return</span> <span class="n">type</span> <span class="o">==</span> <span class="n">track_type8</span><span class="o">::</span><span class="n">qvvf</span><span class="p">;</span> <span class="p">}</span>
<span class="c1">// By default, we only support the variable bit rates as</span>
<span class="c1">// they are generally optimal</span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="kt">bool</span> <span class="n">is_rotation_format_supported</span><span class="p">(</span><span class="n">rotation_format8</span> <span class="n">format</span><span class="p">)</span>
<span class="p">{</span> <span class="k">return</span> <span class="n">format</span> <span class="o">==</span> <span class="n">rotation_format8</span><span class="o">::</span><span class="n">quatf_drop_w_variable</span><span class="p">;</span> <span class="p">}</span>
<span class="c1">// Other stuff ...</span>
<span class="p">};</span>
</code></pre></div></div>
<p>A new struct is created to inherit from the desired decompression settings and specific functions are defined to hide the base implementations, thus replacing them.</p>
<p>By templating the <code class="language-plaintext highlighter-rouge">decompression_context</code> with the settings structure, it can be used to determine everything needed at compile time:</p>
<ul>
<li>Which memory representation is needed depending on whether we are decompressing scalar or transform tracks.</li>
<li>Which algorithm version to support and optimize for.</li>
<li>Which features to strip when they aren’t needed.</li>
</ul>
<p>This is much nicer than the common C approach of using <code class="language-plaintext highlighter-rouge">#define</code> macros. By using a template argument, multiple setting objects can easily be created (with type safety) and used within the same application or file.</p>
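<p>For instance, several settings structs can coexist in the same file, each yielding its own specialized context. A small illustration using the types above; the <code class="language-plaintext highlighter-rouge">float1f</code> value is an assumption to double check against the <code class="language-plaintext highlighter-rouge">track_type8</code> enum:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Only support scalar float tracks (float1f assumed from the enum).
struct scalar_only_settings : public acl::decompression_settings
{
    static constexpr bool is_track_type_supported(acl::track_type8 type)
    { return type == acl::track_type8::float1f; }
};

// Two differently specialized contexts, side by side, with type safety.
acl::decompression_context&lt;acl::default_transform_decompression_settings&gt; transform_context;
acl::decompression_context&lt;scalar_only_settings&gt; scalar_context;
</code></pre></div></div>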
<h2 id="backward-compatibility">Backward compatibility</h2>
<p>By using the <code class="language-plaintext highlighter-rouge">decompression_settings</code>, we can specify which version we optimize for. If no specific version is provided (the default behavior), we will branch and handle all supported versions. However, if a specific version is provided, we can strip the code for all other versions removing any runtime overhead. This is clean and simple thanks to <a href="https://github.com/nfrechette/acl/blob/develop/includes/acl/decompression/impl/decompression_version_selector.h">templates</a>.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">template</span><span class="o"><</span><span class="n">compressed_tracks_version16</span> <span class="n">version</span><span class="p">></span>
<span class="k">struct</span> <span class="nc">decompression_version_selector</span> <span class="p">{};</span>
<span class="c1">// Specialize for ACL 2.0's format</span>
<span class="k">template</span><span class="o"><</span><span class="p">></span> <span class="k">struct</span>
<span class="nc">decompression_version_selector</span><span class="o"><</span><span class="n">compressed_tracks_version16</span><span class="o">::</span><span class="n">v02_00_00</span><span class="o">></span>
<span class="p">{</span>
<span class="k">static</span> <span class="kt">bool</span> <span class="n">is_version_supported</span><span class="p">(</span><span class="n">compressed_tracks_version16</span> <span class="n">version</span><span class="p">)</span>
<span class="p">{</span> <span class="k">return</span> <span class="n">version</span> <span class="o">==</span> <span class="n">compressed_tracks_version16</span><span class="o">::</span><span class="n">v02_00_00</span><span class="p">;</span> <span class="p">}</span>
<span class="k">template</span><span class="o"><</span><span class="k">class</span> <span class="nc">decompression_settings_type</span><span class="p">,</span> <span class="k">class</span> <span class="nc">context_type</span><span class="p">></span>
<span class="n">ACL_FORCE_INLINE</span> <span class="k">static</span> <span class="kt">bool</span> <span class="n">initialize</span><span class="p">(</span><span class="n">context_type</span><span class="o">&</span> <span class="n">context</span><span class="p">,</span> <span class="k">const</span> <span class="n">compressed_tracks</span><span class="o">&</span> <span class="n">tracks</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">acl_impl</span><span class="o">::</span><span class="n">initialize_v0</span><span class="o"><</span><span class="n">decompression_settings_type</span><span class="o">></span><span class="p">(</span><span class="n">context</span><span class="p">,</span> <span class="n">tracks</span><span class="p">);</span>
<span class="p">}</span>
<span class="c1">// Other stuff ...</span>
<span class="p">};</span>
<span class="c1">// Specialize to support all versions</span>
<span class="k">template</span><span class="o"><</span><span class="p">></span> <span class="k">struct</span>
<span class="nc">decompression_version_selector</span><span class="o"><</span><span class="n">compressed_tracks_version16</span><span class="o">::</span><span class="n">any</span><span class="o">></span>
<span class="p">{</span>
<span class="k">static</span> <span class="kt">bool</span> <span class="n">is_version_supported</span><span class="p">(</span><span class="n">compressed_tracks_version16</span> <span class="n">version</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">version</span> <span class="o">>=</span> <span class="n">compressed_tracks_version16</span><span class="o">::</span><span class="n">first</span> <span class="o">&&</span> <span class="n">version</span> <span class="o"><=</span> <span class="n">compressed_tracks_version16</span><span class="o">::</span><span class="n">latest</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">template</span><span class="o"><</span><span class="k">class</span> <span class="nc">decompression_settings_type</span><span class="p">,</span> <span class="k">class</span> <span class="nc">context_type</span><span class="p">></span>
<span class="k">static</span> <span class="kt">bool</span> <span class="n">initialize</span><span class="p">(</span><span class="n">context_type</span><span class="o">&</span> <span class="n">context</span><span class="p">,</span> <span class="k">const</span> <span class="n">compressed_tracks</span><span class="o">&</span> <span class="n">tracks</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// TODO: Note that the `any` decompression can be optimized further to avoid a complex switch on every call.</span>
<span class="k">const</span> <span class="n">compressed_tracks_version16</span> <span class="n">version</span> <span class="o">=</span> <span class="n">tracks</span><span class="p">.</span><span class="n">get_version</span><span class="p">();</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">version</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">case</span> <span class="n">compressed_tracks_version16</span><span class="o">::</span><span class="n">v02_00_00</span><span class="p">:</span>
<span class="k">return</span> <span class="n">decompression_version_selector</span><span class="o"><</span><span class="n">compressed_tracks_version16</span><span class="o">::</span><span class="n">v02_00_00</span><span class="o">>::</span><span class="n">initialize</span><span class="o"><</span><span class="n">decompression_settings_type</span><span class="o">></span><span class="p">(</span><span class="n">context</span><span class="p">,</span> <span class="n">tracks</span><span class="p">);</span>
<span class="nl">default:</span>
<span class="n">ACL_ASSERT</span><span class="p">(</span><span class="nb">false</span><span class="p">,</span> <span class="s">"Unsupported version"</span><span class="p">);</span>
<span class="k">return</span> <span class="nb">false</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">// Other stuff ...</span>
<span class="p">};</span>
</code></pre></div></div>
<p>This is ideal for many game engines. For example, <em>Unreal Engine 4</em> always compresses locally and caches the result in its <em>Derived Data Cache</em>. This means that the compressed format is always the latest one used by the plugin. As such, UE4 only needs to support a single version and it can do so without any overhead.</p>
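<p>A sketch of what such a single-version setup might look like (the enum value comes from the selector code above; the struct itself is illustrative):</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Pin decompression to the ACL 2.0 format only; the code paths for
// every other version are stripped at compile time.
struct pinned_decompression_settings final : public acl::default_transform_decompression_settings
{
    static constexpr acl::compressed_tracks_version16 version_supported()
    { return acl::compressed_tracks_version16::v02_00_00; }
};

// This context carries no version dispatch overhead at runtime.
acl::decompression_context&lt;pinned_decompression_settings&gt; context;
</code></pre></div></div>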
<p>Other game engines might choose to support the latest two versions, emitting a warning to recompress old animations while still being able to support them with very little overhead: a single branch to pick which context to use.</p>
<p>More general applications might opt to support every version (e.g. a glTF viewer).</p>
<p><em>Note that backward compatibility will only be supported for official releases as the develop branch is constantly subject to change.</em></p>
<h2 id="conclusion">Conclusion</h2>
<p>This C++ customization pattern is clean and simple to use and it allows a compact API with a rich feature set. It was present in a slightly different form ever since ACL 0.1 and the more I use it, the more I love it.</p>
<p>In fact, in my opinion the Animation Compression Library and <a href="https://github.com/nfrechette/rtm">Realtime Math</a> contain some of the best code (quality wise) that I’ve ever written in my career. Free from time or budget constraints, I can carefully craft each facet to the best of my ability.</p>
<p>ACL 2.0 continues to progress nicely. It is still missing a few features but it is already an amazing step up from 1.3.</p>
Realtime Math 2.0 is out, cleaner, and faster!2020-06-28T00:00:00+00:00http://nfrechette.github.io/2020/06/28/rtm_v2.0.0<p>A lot of work went into this latest release and here is the gist of what changed:</p>
<ul>
<li>Added support for GCC10, clang8, clang9, clang10, VS 2019 clang, and emscripten</li>
<li>Added a lot of matrix math</li>
<li>Added trigonometric functions (scalar and vector)</li>
<li>Angle types have been removed</li>
<li>Lots of optimizations and improvements</li>
<li>Tons of cleanup to ensure a consistent API</li>
</ul>
<p>It should now contain everything needed by most realtime applications. The one critical feature missing at the moment is proper browsable documentation. While every function is currently documented, the lack of web browsable documentation makes using the library a bit more difficult than I’d like. Hopefully I can remedy this in the coming months.</p>
<h2 id="migrating-from-1x">Migrating from 1.x</h2>
<p>Most of the APIs haven’t changed materially and, depending on what you use, simply recompiling should work. Where compilation fails, a few minor fixes might be required. There are two main reasons why this release is a major one:</p>
<ul>
<li>The <code class="language-plaintext highlighter-rouge">anglef</code> and <code class="language-plaintext highlighter-rouge">angled</code> types have been removed</li>
<li>Extensive usage of return type overloading</li>
</ul>
<p>The angle types have been removed because I could not come up with a clean API for angular constants that would work well without introducing a LOT more complexity while remaining optimal for code generation. Angular constants (and constants in general) are used with all sorts of code. In particular, SIMD code (such as SSE2 or NEON) often ends up needing them and I wanted to be able to use them efficiently. As such, they are now simple typedefs for floating point types and can easily be used with ordinary scalar or SIMD code. The <a href="https://github.com/nfrechette/rtm/blob/develop/includes/rtm/constants.h">pattern used for constants</a> is inspired by Boost.</p>
<p>I had originally introduced them in the hope of providing added type safety but the constants weren’t really usable in RTM 1.1. For now, it is easier to document that all angles are represented as radians. The typedef remains to clarify the API.</p>
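<p>The pattern itself boils down to constants as tiny structs with conversion operators; here is a minimal illustrative sketch (not RTM’s actual code) showing why it mixes cleanly with both scalar and SIMD call sites:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// The constant converts to whatever floating point type the call
// site needs, at no runtime cost.
struct pi_constant
{
    constexpr operator float() const { return 3.14159265358979323846f; }
    constexpr operator double() const { return 3.14159265358979323846; }
};

constexpr pi_constant pi() { return pi_constant{}; }

// Usage:
// float angle = pi();     // float where a float is needed
// double precise = pi();  // double where a double is needed
</code></pre></div></div>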
<h3 id="return-type-overloading">Return type overloading</h3>
<p>C++ doesn’t really have return type overloading but it can be faked. It looks like this in action:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">vector4f</span> <span class="n">vec1</span> <span class="o">=</span> <span class="n">vector_set</span><span class="p">(</span><span class="mf">1.0</span><span class="n">f</span><span class="p">,</span> <span class="mf">2.0</span><span class="n">f</span><span class="p">,</span> <span class="mf">3.0</span><span class="n">f</span><span class="p">);</span>
<span class="n">vector4f</span> <span class="n">vec2</span> <span class="o">=</span> <span class="n">vector_set</span><span class="p">(</span><span class="mf">5.0</span><span class="n">f</span><span class="p">,</span> <span class="mf">6.0</span><span class="n">f</span><span class="p">,</span> <span class="mf">7.0</span><span class="n">f</span><span class="p">);</span>
<span class="n">scalarf</span> <span class="n">dot_sse2</span> <span class="o">=</span> <span class="n">vector_dot3</span><span class="p">(</span><span class="n">vec1</span><span class="p">,</span> <span class="n">vec2</span><span class="p">);</span>
<span class="kt">float</span> <span class="n">dot_x87</span> <span class="o">=</span> <span class="n">vector_dot3</span><span class="p">(</span><span class="n">vec1</span><span class="p">,</span> <span class="n">vec2</span><span class="p">);</span>
<span class="n">vector4f</span> <span class="n">dot_broadcast</span> <span class="o">=</span> <span class="n">vector_dot3</span><span class="p">(</span><span class="n">vec1</span><span class="p">,</span> <span class="n">vec2</span><span class="p">);</span>
</code></pre></div></div>
<p>Usage is very clean and the compiler can figure out what to do fairly easily in most cases. The implementation behind the scenes is a bit complicated but it is worth it for the flexibility it provides:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// A few things omited for brevity</span>
<span class="k">struct</span> <span class="nc">dot3_helper</span>
<span class="p">{</span>
<span class="kr">inline</span> <span class="k">operator</span> <span class="kt">float</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">do_the_math_here1</span><span class="p">();</span>
<span class="p">}</span>
<span class="kr">inline</span> <span class="k">operator</span> <span class="n">scalarf</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">do_the_math_here2</span><span class="p">();</span>
<span class="p">}</span>
<span class="kr">inline</span> <span class="k">operator</span> <span class="n">vector4f</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">do_the_math_here3</span><span class="p">();</span>
<span class="p">}</span>
<span class="n">vector4f</span> <span class="n">left_hand_side</span><span class="p">;</span>
<span class="n">vector4f</span> <span class="n">right_hand_side</span><span class="p">;</span>
<span class="p">};</span>
<span class="k">constexpr</span> <span class="n">dot3_helper</span> <span class="nf">vector_dot3</span><span class="p">(</span><span class="n">vector4f</span> <span class="n">left_hand_side</span><span class="p">,</span> <span class="n">vector4f</span> <span class="n">right_hand_side</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">dot3_helper</span><span class="p">{</span> <span class="n">left_hand_side</span><span class="p">,</span> <span class="n">right_hand_side</span> <span class="p">};</span>
<span class="p">}</span>
</code></pre></div></div>
<p>One motivating reason for this pattern is that very often we perform some operation and return a scalar value. Depending on the architecture, it might be optimal to return it as a SIMD register type instead of a regular <code class="language-plaintext highlighter-rouge">float</code> as those do not always mix well. ARM NEON doesn’t suffer from this issue and for that platform, <code class="language-plaintext highlighter-rouge">scalarf</code> is a typedef for <code class="language-plaintext highlighter-rouge">float</code>. But for x86 with SSE2 and for some PowerPC processors, this distinction is very important in order to achieve optimal performance. It doesn’t stop there though: even when floating point arithmetic uses the same registers as SIMD arithmetic (such as x64 with SSE2), there is sometimes a benefit to having a different type in order to improve code generation. VS2019 still struggles today to avoid extra shuffles when ordinary scalar and SIMD code are mixed. The type distinction allows for improved performance.</p>
<p>This pattern was present from day one inside RTM but usage of <code class="language-plaintext highlighter-rouge">scalarf</code> wasn’t as widespread. The latest release pushes its usage much further and as such a lot of code was modified to support both return types. This can sometimes lead to ambiguous function calls (and those will need fixing in user code) but it is fairly rare in practice. It forces the programmer to be explicit about what types are used, which is in line with RTM’s philosophy.</p>
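<p>For illustration, here is the kind of ambiguity that can arise (the <code class="language-plaintext highlighter-rouge">apply</code> overloads are hypothetical): both user-defined conversions rank equally during overload resolution, so the caller must state the desired type.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// A user function overloaded on both result types.
void apply(float value);
void apply(scalarf value);

// apply(vector_dot3(vec1, vec2));        // error: ambiguous call, both
//                                        // conversions are equally good
apply(float(vector_dot3(vec1, vec2)));    // fix: be explicit about the type
</code></pre></div></div>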
<h2 id="quaternion-math-improvements">Quaternion math improvements</h2>
<p>The <a href="https://github.com/nfrechette/acl">Animation Compression Library (ACL)</a> heavily relies on quaternions and as such I spend a good deal of time trying to optimize them. This release introduces the important <code class="language-plaintext highlighter-rouge">quat_slerp</code> function as well as many optimizations for ARM processors.</p>
<h3 id="arm-neon-performance-can-be-surprising">ARM NEON performance can be surprising</h3>
<p>RTM supports both ARMv7 and ARM64 and very often what is optimal for one isn’t optimal for the other. Worse, different devices disagree about what code is optimal, sometimes by quite a bit.</p>
<p>I spent a good deal of time trying to optimize two functions: quaternion multiplication and rotating a 3D vector with a quaternion. Rotating a 3D vector uses two quaternion multiplications.</p>
<p>For quaternion multiplication, I tried a few variations:</p>
<ul>
<li><a href="https://github.com/nfrechette/rtm/blob/bc5e3fa1b00fd1b99c62120281c8b5505237a907/tools/bench/sources/bench_quat_mul.cpp#L185">Swizzling + floating point multiplication to flip the right signs</a></li>
<li><a href="https://github.com/nfrechette/rtm/blob/bc5e3fa1b00fd1b99c62120281c8b5505237a907/tools/bench/sources/bench_quat_mul.cpp#L221">Swizzling + XOR to flip the right signs (what SSE2 uses)</a></li>
<li><a href="https://github.com/nfrechette/rtm/blob/bc5e3fa1b00fd1b99c62120281c8b5505237a907/tools/bench/sources/bench_quat_mul.cpp#L253">Swizzling + floating point negation</a></li>
<li><a href="https://github.com/nfrechette/rtm/blob/bc5e3fa1b00fd1b99c62120281c8b5505237a907/tools/bench/sources/bench_quat_mul.cpp#L31">A scalar implementation with no SIMD</a></li>
</ul>
<p>The first two implementations are inspired by the classic SSE2 implementation. This is the same code used by <em>DirectX Math</em> on SSE2 and ARM as well.</p>
<p>The third implementation is a bit more clever. Instead of using constants that must be loaded and applied in order to align our signs to leverage fused-multiply-add, we use the floating point negation instruction. This is done once and mixed in with the various swizzle instructions that NEON supports. This ends up being extremely compact and uses only 12 instructions with ARM64!</p>
<p>I measured extensively using micro benchmarks (with <em>Google Benchmark</em>) as well as within ACL. <a href="https://github.com/nfrechette/rtm/issues/61">The results</a> turned out to be quite interesting.</p>
<p>On a Pixel 3 Android phone with ARMv7, the scalar version was fastest: it beat the multiplication variant by 1.05x and the negation variant by 1.10x. However, with ARM64, the negation variant was best: it beat the multiplication variant by 1.05x and the scalar variant by 1.16x.</p>
<p>On a Samsung S8 Android phone, the results were similar: scalar wins with ARMv7 and negation wins with ARM64 (both by a significant margin again).</p>
<p>On an iPad Pro with ARM64, the results agreed again: the negation variant was fastest.</p>
<p>I hadn’t seen that particular variant used anywhere else so I was quite pleased to see it perform so well with ARM64. In light of these results, RTM now uses the scalar version with ARMv7 and the negation version with ARM64.</p>
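<p>For reference, here is the shape of the scalar variant. The sign placement follows the standard Hamilton product with an x, y, z, w layout; RTM’s exact argument convention may differ and <code class="language-plaintext highlighter-rouge">quat_scalar</code> is illustrative.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct quat_scalar { float x, y, z, w; };

// Plain scalar quaternion multiplication: many independent
// multiply-adds that in-order ARMv7 pipelines can overlap well.
quat_scalar quat_mul_scalar(quat_scalar lhs, quat_scalar rhs)
{
    return quat_scalar
    {
        (lhs.w * rhs.x) + (lhs.x * rhs.w) + (lhs.y * rhs.z) - (lhs.z * rhs.y),
        (lhs.w * rhs.y) + (lhs.y * rhs.w) + (lhs.z * rhs.x) - (lhs.x * rhs.z),
        (lhs.w * rhs.z) + (lhs.z * rhs.w) + (lhs.x * rhs.y) - (lhs.y * rhs.x),
        (lhs.w * rhs.w) - (lhs.x * rhs.x) - (lhs.y * rhs.y) - (lhs.z * rhs.z)
    };
}
</code></pre></div></div>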
<p>Since rotating a 3D vector with a quaternion is two quaternion multiplications back-to-back, I set out to use the same tricks as above with one addition.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">vector4f</span> <span class="nf">quat_mul_vector3</span><span class="p">(</span><span class="n">vector4f</span> <span class="n">vector</span><span class="p">,</span> <span class="n">quatf</span> <span class="n">rotation</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">quatf</span> <span class="n">vector_quat</span> <span class="o">=</span> <span class="n">quat_set_w</span><span class="p">(</span><span class="n">vector_to_quat</span><span class="p">(</span><span class="n">vector</span><span class="p">),</span> <span class="mf">0.0</span><span class="n">f</span><span class="p">);</span>
<span class="n">quatf</span> <span class="n">inv_rotation</span> <span class="o">=</span> <span class="n">quat_conjugate</span><span class="p">(</span><span class="n">rotation</span><span class="p">);</span>
<span class="k">return</span> <span class="n">quat_to_vector</span><span class="p">(</span><span class="n">quat_mul</span><span class="p">(</span><span class="n">quat_mul</span><span class="p">(</span><span class="n">inv_rotation</span><span class="p">,</span> <span class="n">vector_quat</span><span class="p">),</span> <span class="n">rotation</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>
<p>We first extend our <code class="language-plaintext highlighter-rouge">vector3</code> into a proper <code class="language-plaintext highlighter-rouge">vector4</code> by padding it with <code class="language-plaintext highlighter-rouge">0.0</code>. Using this information, we can strip out a few operations from the first quaternion multiplication.</p>
<p>Again, I tested all four variants and surprisingly, the scalar variant won out every time with both ARMv7 and ARM64 on both Android devices. The iPad saw the negation variant as fastest. Code generation was identical yet it seems that the iPad CPU has very different performance characteristics. As a compromise, the scalar variant is used with all ARM flavors. It isn’t optimal on the iPad but it remains much better than the reference implementation.</p>
<p>I suspect that the scalar implementation performs better because more operations are independent. Despite having way more instructions, there must be fewer stalls and this leads to an overall win. It is possible that this information can be better leveraged to further improve things but that is a problem for another day.</p>
<h2 id="compiler-bugs-bonanza">Compiler bugs bonanza</h2>
<p>Realtime Math appears to put considerable stress on compilers and often ends up breaking them. In the first 10 years of my career, I found maybe 2-3 C++ compiler bugs. Here are just some of the bugs I remember from the past year:</p>
<ul>
<li><a href="https://github.com/nfrechette/rtm/issues/37">clang7 and clang8 hang, not fixed</a></li>
<li><a href="https://github.com/nfrechette/rtm/issues/35">VS2019 code generation bug in scalar_sqrt_reciprocal, fixed</a></li>
<li><a href="https://github.com/nfrechette/rtm/issues/34">VS2019 code generation bug in mask_set, fixed</a></li>
<li><a href="https://travis-ci.org/github/nfrechette/rtm/jobs/695261257">clang5 crashes, not fixed</a></li>
<li><a href="https://github.com/nfrechette/rtm/commit/b0ae7fbb20a1bbd38a36b73a86a4c68c7efaa94d">GCC doesn’t generate the right assembly with x86 and SSE2, not fixed</a></li>
<li><a href="https://developercommunity.visualstudio.com/content/problem/940424/sub-optimal-xmm-register-allocation-with-msvc.html">VS2019 sometimes spills registers on the stack when it isn’t necessary, not fixed</a></li>
<li><a href="https://developercommunity.visualstudio.com/content/problem/1055298/vs-2019-sometimes-incorrectly-reads-return-value-f.html">VS2019 sometimes incorrectly reads the return value of a function that uses __vectorcall, not fixed</a></li>
<li><a href="https://developercommunity.visualstudio.com/content/problem/1076955/vs2019-vs2017-vs2015-crash-when-mm-set-epi64x-is-u.html">VS2015, 2017, and 2019 crash when _mm_set_epi64x is used in debug with x86, not fixed</a></li>
</ul>
<p>And those are just the ones I remember off the top of my head. I also found one or two with ACL that aren’t in the above list. Some of these will never get fixed because the compiler versions are too old but thankfully the Microsoft Visual Studio team has been very quick to address some of the above issues.</p>
<iframe src="https://giphy.com/embed/oSGgdCmx0igJa" width="480" height="360" frameborder="0" class="giphy-embed" allowfullscreen=""></iframe>
Keep an eye out for buffer security checks2020-06-26T00:00:00+00:00http://nfrechette.github.io/2020/06/26/buffer_security_checks<p>By default, when compiling C++, Visual Studio enables the <code class="language-plaintext highlighter-rouge">/GS</code> flag for <a href="https://docs.microsoft.com/en-us/cpp/build/reference/gs-buffer-security-check?view=vs-2019">buffer security checks</a>.</p>
<p>In functions that the compiler deems vulnerable to the stack getting overwritten, it adds buffer security checks. To detect if the stack has been tampered with during execution of the function, it first writes a sentinel value past the end of the reserved space the function needs. This sentinel value is random per process to avoid an attacker guessing its value. Just before the function exits, it calls a small function that validates the sentinel value: <code class="language-plaintext highlighter-rouge">__security_check_cookie()</code>.</p>
<p>The rules on what can trigger this are as follows (from the MSDN documentation):</p>
<ul>
<li>The function contains an array on the stack that is larger than 4 bytes, has more than two elements, and has an element type that is not a pointer type.</li>
<li>The function contains a data structure on the stack whose size is more than 8 bytes and contains no pointers.</li>
<li>The function contains a buffer allocated by using the <code class="language-plaintext highlighter-rouge">_alloca</code> function.</li>
<li>The function contains a data structure that contains another structure which triggers one of the above checks.</li>
</ul>
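<p>A small illustration of the first rule (hypothetical function; <code class="language-plaintext highlighter-rouge">__declspec(safebuffers)</code> is the documented MSVC per-function opt-out, shown for contrast and only appropriate where you know the stack cannot be overrun):</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// The local array is larger than 4 bytes with more than two non-pointer
// elements: /GS writes a cookie on entry and validates it on exit
// through __security_check_cookie.
int sum_first_16(const int* values) // values must hold at least 16 ints
{
    int scratch[16];
    for (int i = 0; i &lt; 16; ++i)
        scratch[i] = values[i];

    int sum = 0;
    for (int value : scratch)
        sum += value;
    return sum;
}

// Same function with the check explicitly disabled: no cookie overhead.
__declspec(safebuffers) int sum_first_16_unchecked(const int* values);
</code></pre></div></div>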
<p>This is all fine and well and <strong>you should never disable it program/file wide</strong>. But you should keep an eye out. In very performance critical code, this small overhead can have a sizable impact. I’ve observed this over the years a few times but it now popped up somewhere I didn’t expect it: my math library <a href="https://github.com/nfrechette/rtm">Realtime Math (RTM)</a>.</p>
<h2 id="non-zero-cost-abstractions">Non-zero cost abstractions</h2>
<p>The SSE2 headers define a few types. Of interest to us today is <code class="language-plaintext highlighter-rouge">__m128</code> but others suffer from the same issue as well (including wider types such as <code class="language-plaintext highlighter-rouge">__m256</code>). Those define a register wide value suitable for SIMD intrinsics: <code class="language-plaintext highlighter-rouge">__m128</code> contains four 32 bit floats. As such, it takes up 16 bytes.</p>
<p>Because it is considered a native type by the compiler, it does not trigger the insertion of buffer security checks despite being larger than 8 bytes and containing no pointers.</p>
<p>However, the same is not true if you wrap it in a struct or class.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">scalarf</span>
<span class="p">{</span>
<span class="n">__m128</span> <span class="n">value</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
<p>The above struct might trigger security checks to be inserted: it is a struct larger than 8 bytes that does not contain pointers.</p>
<p>Similarly, many other common math types suffer from this:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">matrix3x3f</span>
<span class="p">{</span>
<span class="n">__m128</span> <span class="n">x_axis</span><span class="p">;</span>
<span class="n">__m128</span> <span class="n">y_axis</span><span class="p">;</span>
<span class="n">__m128</span> <span class="n">z_axis</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
<p>Many years ago, in a discussion about an unrelated compiler bug, someone at Microsoft mentioned to me that it is better to typedef SIMD types than to wrap them in a concrete type; it should lead to better code generation. They didn’t offer any insights as to why that might be (and I didn’t ask) and honestly I had never noticed any difference until buffer security checks came into play, <em>last Friday</em>. Their math library <em>DirectX Math</em> uses a typedef for its vector type and so does RTM everywhere it can. But sometimes it can’t be avoided.</p>
<p>RTM also extensively uses a helper struct pattern to help keep the code clean and flexible. Some code such as a vector dot product returns a scalar value. But on some architectures, it isn’t desirable to treat it as a <code class="language-plaintext highlighter-rouge">float</code> for performance reasons (PowerPC, x86, etc). For example, with x86 float arithmetic does not use SSE2 unless you explicitly use intrinsics for it: by default it uses the x87 floating point stack (with MSVC at least). If this value is later needed as part of more SSE2 vector code (such as vector normalization), the value will be calculated from two SSE2 vectors, be stored on the x87 float stack, only to be passed back to SSE2. To avoid this roundtrip when using SSE2 with x86, RTM exposes the <code class="language-plaintext highlighter-rouge">scalarf</code> type. However, sometimes you really need the value as a float. The usage dictates what you need. To support both variants with as little syntactic overhead as possible, RTM leverages return type overloading (it’s not really a thing in C++ but it can be faked with implicit coercion). It makes the following possible:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">vector4f</span> <span class="n">vec1</span> <span class="o">=</span> <span class="n">vector_set</span><span class="p">(</span><span class="mf">1.0</span><span class="n">f</span><span class="p">,</span> <span class="mf">2.0</span><span class="n">f</span><span class="p">,</span> <span class="mf">3.0</span><span class="n">f</span><span class="p">);</span>
<span class="n">vector4f</span> <span class="n">vec2</span> <span class="o">=</span> <span class="n">vector_set</span><span class="p">(</span><span class="mf">5.0</span><span class="n">f</span><span class="p">,</span> <span class="mf">6.0</span><span class="n">f</span><span class="p">,</span> <span class="mf">7.0</span><span class="n">f</span><span class="p">);</span>
<span class="n">scalarf</span> <span class="n">dot_sse2</span> <span class="o">=</span> <span class="n">vector_dot3</span><span class="p">(</span><span class="n">vec1</span><span class="p">,</span> <span class="n">vec2</span><span class="p">);</span>
<span class="kt">float</span> <span class="n">dot_x87</span> <span class="o">=</span> <span class="n">vector_dot3</span><span class="p">(</span><span class="n">vec1</span><span class="p">,</span> <span class="n">vec2</span><span class="p">);</span>
<span class="n">vector4f</span> <span class="n">dot_broadcast</span> <span class="o">=</span> <span class="n">vector_dot3</span><span class="p">(</span><span class="n">vec1</span><span class="p">,</span> <span class="n">vec2</span><span class="p">);</span>
</code></pre></div></div>
<p>This is very clean to use and the compiler can figure out what code to call easily. But it is ugly to implement; a small price to pay for readability.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// A few things omited for brevity</span>
<span class="k">struct</span> <span class="nc">dot3_helper</span>
<span class="p">{</span>
<span class="kr">inline</span> <span class="k">operator</span> <span class="kt">float</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">do_the_math_here1</span><span class="p">();</span>
<span class="p">}</span>
<span class="kr">inline</span> <span class="k">operator</span> <span class="n">scalarf</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">do_the_math_here2</span><span class="p">();</span>
<span class="p">}</span>
<span class="kr">inline</span> <span class="k">operator</span> <span class="n">vector4f</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">do_the_math_here3</span><span class="p">();</span>
<span class="p">}</span>
<span class="n">vector4f</span> <span class="n">left_hand_side</span><span class="p">;</span>
<span class="n">vector4f</span> <span class="n">right_hand_side</span><span class="p">;</span>
<span class="p">};</span>
<span class="k">constexpr</span> <span class="n">dot3_helper</span> <span class="nf">vector_dot3</span><span class="p">(</span><span class="n">vector4f</span> <span class="n">left_hand_side</span><span class="p">,</span> <span class="n">vector4f</span> <span class="n">right_hand_side</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">dot3_helper</span><span class="p">{</span> <span class="n">left_hand_side</span><span class="p">,</span> <span class="n">right_hand_side</span> <span class="p">};</span>
<span class="p">}</span>
</code></pre></div></div>
<blockquote>
<p>Side note, on ARM <code class="language-plaintext highlighter-rouge">scalarf</code> is a typedef for <code class="language-plaintext highlighter-rouge">float</code> in order to achieve optimal performance.</p>
</blockquote>
<p>There is a lot of boilerplate but the code is very simple and either <em>constexpr</em> or marked <em>inline</em>. We create a small struct and return it, and at the call site the compiler invokes the right implicit coercion operator. It works just fine and the compiler optimizes everything away to yield the same lean assembly you would expect. We leverage inlining to create what should be a zero cost abstraction. Except, in rare cases (at least with Visual Studio as recent as 2019), inlining fails and everything goes wrong.</p>
<blockquote>
<p>Side note, the above is one example why <code class="language-plaintext highlighter-rouge">scalarf</code> cannot be a typedef because we need it distinct from <code class="language-plaintext highlighter-rouge">vector4f</code> both of which are represented in SSE2 as a <code class="language-plaintext highlighter-rouge">__m128</code>. To avoid this issue, <code class="language-plaintext highlighter-rouge">vector4f</code> is a typedef while <code class="language-plaintext highlighter-rouge">scalarf</code> is a wrapping struct.</p>
</blockquote>
<p>Many math libraries out there wrap SIMD types in a proper type and use similar patterns. And while generally most math functions are small and get inlined fine, it isn’t always the case. In particular, these security buffer checks can harm the ability of the compiler to inline while at the same time degrading performance of perfectly safe code.</p>
<h2 id="8-instructions-is-too-much">8 instructions is too much</h2>
<p>All of this worked well until I noticed, out of the blue, a performance regression in the <a href="https://github.com/nfrechette/acl">Animation Compression Library</a> when I updated to the latest RTM version. This was very strange as it should have only contained performance optimizations.</p>
<p><a href="https://github.com/nfrechette/acl/blob/1cd782149e59c4d682791f4cc23b104a51627c16/includes/acl/compression/skeleton_error_metric.h#L342">Here</a> is where the code generation changed:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// A few things omited for brevity</span>
<span class="kr">__declspec</span><span class="p">(</span><span class="n">safebuffers</span><span class="p">)</span> <span class="n">rtm</span><span class="o">::</span><span class="n">scalarf</span> <span class="nf">calculate_error_no_scale</span><span class="p">(</span><span class="k">const</span> <span class="n">calculate_error_args</span><span class="o">&</span> <span class="n">args</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">const</span> <span class="n">rtm</span><span class="o">::</span><span class="n">qvvf</span><span class="o">&</span> <span class="n">raw_transform_</span> <span class="o">=</span> <span class="o">*</span><span class="k">static_cast</span><span class="o"><</span><span class="k">const</span> <span class="n">rtm</span><span class="o">::</span><span class="n">qvvf</span><span class="o">*></span><span class="p">(</span><span class="n">args</span><span class="p">.</span><span class="n">transform0</span><span class="p">);</span>
<span class="k">const</span> <span class="n">rtm</span><span class="o">::</span><span class="n">qvvf</span><span class="o">&</span> <span class="n">lossy_transform_</span> <span class="o">=</span> <span class="o">*</span><span class="k">static_cast</span><span class="o"><</span><span class="k">const</span> <span class="n">rtm</span><span class="o">::</span><span class="n">qvvf</span><span class="o">*></span><span class="p">(</span><span class="n">args</span><span class="p">.</span><span class="n">transform1</span><span class="p">);</span>
<span class="k">const</span> <span class="n">rtm</span><span class="o">::</span><span class="n">vector4f</span> <span class="n">vtx0</span> <span class="o">=</span> <span class="n">args</span><span class="p">.</span><span class="n">shell_point_x</span><span class="p">;</span>
<span class="k">const</span> <span class="n">rtm</span><span class="o">::</span><span class="n">vector4f</span> <span class="n">vtx1</span> <span class="o">=</span> <span class="n">args</span><span class="p">.</span><span class="n">shell_point_y</span><span class="p">;</span>
<span class="k">const</span> <span class="n">rtm</span><span class="o">::</span><span class="n">vector4f</span> <span class="n">raw_vtx0</span> <span class="o">=</span> <span class="n">rtm</span><span class="o">::</span><span class="n">qvv_mul_point3_no_scale</span><span class="p">(</span><span class="n">vtx0</span><span class="p">,</span> <span class="n">raw_transform_</span><span class="p">);</span>
<span class="k">const</span> <span class="n">rtm</span><span class="o">::</span><span class="n">vector4f</span> <span class="n">raw_vtx1</span> <span class="o">=</span> <span class="n">rtm</span><span class="o">::</span><span class="n">qvv_mul_point3_no_scale</span><span class="p">(</span><span class="n">vtx1</span><span class="p">,</span> <span class="n">raw_transform_</span><span class="p">);</span>
<span class="k">const</span> <span class="n">rtm</span><span class="o">::</span><span class="n">vector4f</span> <span class="n">lossy_vtx0</span> <span class="o">=</span> <span class="n">rtm</span><span class="o">::</span><span class="n">qvv_mul_point3_no_scale</span><span class="p">(</span><span class="n">vtx0</span><span class="p">,</span> <span class="n">lossy_transform_</span><span class="p">);</span>
<span class="k">const</span> <span class="n">rtm</span><span class="o">::</span><span class="n">vector4f</span> <span class="n">lossy_vtx1</span> <span class="o">=</span> <span class="n">rtm</span><span class="o">::</span><span class="n">qvv_mul_point3_no_scale</span><span class="p">(</span><span class="n">vtx1</span><span class="p">,</span> <span class="n">lossy_transform_</span><span class="p">);</span>
<span class="k">const</span> <span class="n">rtm</span><span class="o">::</span><span class="n">scalarf</span> <span class="n">vtx0_error</span> <span class="o">=</span> <span class="n">rtm</span><span class="o">::</span><span class="n">vector_distance3</span><span class="p">(</span><span class="n">raw_vtx0</span><span class="p">,</span> <span class="n">lossy_vtx0</span><span class="p">);</span>
<span class="k">const</span> <span class="n">rtm</span><span class="o">::</span><span class="n">scalarf</span> <span class="n">vtx1_error</span> <span class="o">=</span> <span class="n">rtm</span><span class="o">::</span><span class="n">vector_distance3</span><span class="p">(</span><span class="n">raw_vtx1</span><span class="p">,</span> <span class="n">lossy_vtx1</span><span class="p">);</span>
<span class="k">return</span> <span class="n">rtm</span><span class="o">::</span><span class="n">scalar_max</span><span class="p">(</span><span class="n">vtx0_error</span><span class="p">,</span> <span class="n">vtx1_error</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<blockquote>
<p>Side note, as part of a prior effort to optimize that performance critical function, I had already disabled buffer security checks which are unnecessary here.</p>
</blockquote>
<p>The above code is fairly simple. We take two 3D vertices in local space and transform them to world space. We do this for our source data (which is raw) and for our compressed data (which is lossy). We calculate the 3D distance between the raw and lossy vertices and it yields our compression error. We take the maximum value as our final result.</p>
<p><code class="language-plaintext highlighter-rouge">qvv_mul_point3_no_scale</code> is <a href="https://github.com/nfrechette/rtm/blob/bc5e3fa1b00fd1b99c62120281c8b5505237a907/includes/rtm/qvvf.h#L117">fairly heavy</a> instruction wise and it doesn’t get fully inlined. Some of it does but the <code class="language-plaintext highlighter-rouge">quat_mul_vector3</code> it contains does not inline.</p>
<p>By the time it reaches the <code class="language-plaintext highlighter-rouge">vector_distance3</code> calls, the compiler struggles. Both VS2017 and VS2019 fail to inline the 8 instructions of the <code class="language-plaintext highlighter-rouge">vector_length3</code> <a href="https://github.com/nfrechette/rtm/blob/bc5e3fa1b00fd1b99c62120281c8b5505237a907/includes/rtm/vector4f.h#L1368">it contains</a>.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// const rtm::scalarf vtx0_error = rtm::vector_distance3(raw_vtx0, lossy_vtx0);</span>
<span class="n">vsubps</span> <span class="n">xmm1</span><span class="p">,</span><span class="n">xmm0</span><span class="p">,</span><span class="n">xmm6</span>
<span class="n">vmovups</span> <span class="n">xmmword</span> <span class="n">ptr</span> <span class="p">[</span><span class="n">rsp</span><span class="o">+</span><span class="mx">20h</span><span class="p">],</span><span class="n">xmm1</span>
<span class="n">lea</span> <span class="n">rcx</span><span class="p">,[</span><span class="n">rsp</span><span class="o">+</span><span class="mx">20h</span><span class="p">]</span>
<span class="n">vzeroupper</span>
<span class="n">call</span> <span class="n">rtm</span><span class="o">::</span><span class="n">rtm_impl</span><span class="o">::</span><span class="n">vector4f_vector_length3</span><span class="o">::</span><span class="k">operator</span> <span class="n">rtm</span><span class="o">::</span><span class="n">scalarf</span>
</code></pre></div></div>
<blockquote>
<p>Side note, when AVX is enabled, Visual Studio often ends up attempting to use wider registers when they aren’t needed, causing the addition of <code class="language-plaintext highlighter-rouge">vzeroupper</code> and other artifacts that can degrade performance.</p>
</blockquote>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// inline operator scalarf() const</span>
<span class="c1">// {</span>
<span class="n">sub</span> <span class="n">rsp</span><span class="p">,</span><span class="mx">18h</span>
<span class="n">mov</span> <span class="n">rax</span><span class="p">,</span><span class="n">qword</span> <span class="n">ptr</span> <span class="p">[</span><span class="n">__security_cookie</span><span class="p">]</span>
<span class="n">xor</span> <span class="n">rax</span><span class="p">,</span><span class="n">rsp</span>
<span class="n">mov</span> <span class="n">qword</span> <span class="n">ptr</span> <span class="p">[</span><span class="n">rsp</span><span class="p">],</span><span class="n">rax</span>
<span class="c1">// const scalarf len_sq = vector_length_squared3(input);</span>
<span class="n">vmovups</span> <span class="n">xmm0</span><span class="p">,</span><span class="n">xmmword</span> <span class="n">ptr</span> <span class="p">[</span><span class="n">rcx</span><span class="p">]</span>
<span class="n">vmulps</span> <span class="n">xmm2</span><span class="p">,</span><span class="n">xmm0</span><span class="p">,</span><span class="n">xmm0</span>
<span class="n">vshufps</span> <span class="n">xmm1</span><span class="p">,</span><span class="n">xmm2</span><span class="p">,</span><span class="n">xmm2</span><span class="p">,</span><span class="mi">1</span>
<span class="n">vaddss</span> <span class="n">xmm0</span><span class="p">,</span><span class="n">xmm2</span><span class="p">,</span><span class="n">xmm1</span>
<span class="n">vshufps</span> <span class="n">xmm2</span><span class="p">,</span><span class="n">xmm2</span><span class="p">,</span><span class="n">xmm2</span><span class="p">,</span><span class="mi">2</span>
<span class="n">vaddss</span> <span class="n">xmm0</span><span class="p">,</span><span class="n">xmm0</span><span class="p">,</span><span class="n">xmm2</span>
<span class="c1">// return scalar_sqrt(len_sq);</span>
<span class="n">vsqrtss</span> <span class="n">xmm0</span><span class="p">,</span><span class="n">xmm0</span><span class="p">,</span><span class="n">xmm0</span>
<span class="c1">// }</span>
<span class="n">mov</span> <span class="n">rcx</span><span class="p">,</span><span class="n">qword</span> <span class="n">ptr</span> <span class="p">[</span><span class="n">rsp</span><span class="p">]</span>
<span class="n">xor</span> <span class="n">rcx</span><span class="p">,</span><span class="n">rsp</span>
<span class="n">call</span> <span class="n">__security_check_cookie</span>
<span class="n">add</span> <span class="n">rsp</span><span class="p">,</span><span class="mx">18h</span>
<span class="n">ret</span>
</code></pre></div></div>
<p>And that is where everything goes wrong. Those 8 instructions calculate the 3D dot product and the square root required to get the length of the vector between our two points. They balloon up to 16 instructions because of the security check overhead. Behind the scenes, VS doesn’t fail to inline 8 instructions; it fails to inline something larger than what we see when everything goes right.</p>
<p>This was quite surprising to me, because until now I had never seen this behavior kick in this way. Indeed, try as I might, I could not reproduce this behavior in a small playground (yet).</p>
<p>In light of this, and because the math code within Realtime Math is very simple, I have decided to annotate every function to explicitly disable buffer security checks. Not every function requires this but it is easier to be consistent for maintenance. This restores the optimal code generation and <code class="language-plaintext highlighter-rouge">vector_distance3</code> finally inlines properly.</p>
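<p>For reference, the annotation might look something like this, assuming MSVC’s <code class="language-plaintext highlighter-rouge">__declspec(safebuffers)</code> attribute (the macro and function below are illustrative, not the actual RTM source):</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Opts a function out of the /GS buffer security checks on MSVC;
// other compilers compile the macro away (macro name is hypothetical)
#if defined(_MSC_VER)
    #define RTM_NO_SECURITY_COOKIE __declspec(safebuffers)
#else
    #define RTM_NO_SECURITY_COOKIE
#endif

RTM_NO_SECURITY_COOKIE inline scalarf vector_length3(vector4f input)
{
    const scalarf len_sq = vector_length_squared3(input);
    return scalar_sqrt(len_sq);
}
</code></pre></div></div>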
<p>I am now wondering if I should explicitly force inline short functions…</p>
ACL right in your browser2020-05-28T00:00:00+00:00http://nfrechette.github.io/2020/05/28/acl_js_v0.1.0<p>This week, the first beta release of the <a href="https://github.com/nfrechette/acl-js/releases/tag/v0.1.0">Animation Compression Library JavaScript</a> module has been released. You can find it on NPM <a href="https://www.npmjs.com/package/@nfrechette/acl">here</a> and you can try it live in your browser right <a href="https://nfrechette.github.io/acl_viewer">here</a> by drag and dropping glTF files.</p>
<p><em>The library should be usable, but keep in mind that until ACL reaches version 2.0 with backwards compatibility, you might have to recompress when you upgrade to future versions.</em></p>
<p>The module uses the powerful <a href="https://emscripten.org">emscripten</a> compiler toolchain to take C++ and compile it into WebAssembly. That means that as part of this effort, <a href="https://github.com/nfrechette/acl">ACL</a> and <a href="https://github.com/nfrechette/rtm">Realtime Math</a> have been upgraded to support emscripten as well. Note that WASM SIMD isn’t supported yet (contributions welcome).</p>
<p>Both compression and decompression are supported, and if you enable dead code stripping in your JavaScript bundler, you should be able to pay only for what you use.</p>
<p><em>Special thanks to <a href="https://zeux.io">Arseny Kapoulkine</a> and his excellent <a href="https://github.com/zeux/meshoptimizer">meshoptimizer</a> library. Using his blog posts and his code as a guide, I was able to get up and running fairly quickly.</em></p>
<h2 id="next-steps">Next steps</h2>
<p>Progress continues towards ACL 2.0 but I will take a few weeks to finish up RTM 2.0 first. It is now used extensively by the main ACL development branch. A few new features will be introduced, some cleanup remains, and I want to double check some optimizations before releasing it.</p>
Morph target animation compression2020-05-04T00:00:00+00:00http://nfrechette.github.io/2020/05/04/morph_target_compresion<p>The <a href="https://github.com/nfrechette/acl/releases/tag/v1.3.0">Animation Compression Library v1.3</a> introduced support for compressing animated floating point tracks but it wasn’t until now that I managed to get around to trying it inside the <a href="https://github.com/nfrechette/acl-ue4-plugin">Unreal Engine 4 Plugin</a>. Scalar tracks are used in various ways from animating the <a href="https://en.wikipedia.org/wiki/Field_of_view_in_video_games">Field Of View</a> to animating <a href="https://en.wikipedia.org/wiki/Inverse_kinematics">Inverse Kinematic</a> blend targets. While they have many practical uses in modern video games, they are most commonly used to animate <a href="https://en.wikipedia.org/wiki/Morph_target_animation">morph targets</a> (aka blend shapes) in facial and cloth animations.</p>
<p><em>TL;DR: By using the morph target deformation information, we can better control how much precision each blend weight needs while yielding a lower memory footprint.</em></p>
<h2 id="how-do-morph-targets-work">How do morph targets work?</h2>
<p>Morph targets are essentially a set of meshes that get blended together per vertex. It is a way to achieve vertex based animation. Let’s use a simple triangle as an example:</p>
<p><img src="/public/morph_target_concept.jpg" alt="Morph Targets Explained" /></p>
<p>In blue we have our reference mesh and in green our morph target. To animate our vertices, we apply a scaled displacement to every vertex. This scale factor is called the blend weight. Typically it lies between <strong>0.0</strong> (our reference mesh) and <strong>1.0</strong> (our target mesh). Values in between end up being a linear combination of both:</p>
<p><code class="language-plaintext highlighter-rouge">final vertex = reference vertex + (target vertex - reference vertex) * blend weight</code></p>
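<p>In code, that per-vertex blend might look like this (a minimal sketch with a single morph target; the names are illustrative, not an engine API):</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct float3 { float x; float y; float z; };

// final vertex = reference vertex + (target vertex - reference vertex) * blend weight
float3 blend_vertex(float3 reference, float3 target, float blend_weight)
{
    return float3{
        reference.x + (target.x - reference.x) * blend_weight,
        reference.y + (target.y - reference.y) * blend_weight,
        reference.z + (target.z - reference.z) * blend_weight
    };
}
</code></pre></div></div>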
<p>Each morph target is thus controlled by a single blend weight and each vertex can end up having a contribution from multiple targets. With our blend weights being independent, we can compress them as a set of floating point curves. This is typically done by specifying a desired level of precision to retain on each curve. However, in practice this can be hard to control: blend weights have no units.</p>
<p>At the time of writing:</p>
<ul>
<li>Unreal Engine 4 leaves its curves in a compact raw form (a spline) by default but they can be optionally compressed with various codecs.</li>
<li>Unity, Lumberyard, and CRYENGINE do not document how they are stored and they do not appear to expose a way to further compress them (as far as I know).</li>
</ul>
<p>All of this animated data does add up and it can benefit from being compressed. In order to achieve this, the UE4 ACL plugin uses an intuitive technique to control the resulting quality.</p>
<h2 id="compressing-blend-weights">Compressing blend weights</h2>
<p>Our animated blend weights ultimately yield a mesh deformation. Can we use that information to our advantage? Let’s take a closer look at the math behind morph targets. Here is how each vertex is transformed:</p>
<p><code class="language-plaintext highlighter-rouge">final vertex = reference vertex + (target vertex - reference vertex) * blend weight</code></p>
<p>As we saw earlier, <code class="language-plaintext highlighter-rouge">(target vertex - reference vertex)</code> is the displacement delta to apply to achieve our deformation. Let’s simplify our equation a bit:</p>
<p><code class="language-plaintext highlighter-rouge">final vertex = reference vertex + vertex delta * blend weight</code></p>
<p>Let’s plug in some displacement numbers with some units and see what happens.</p>
<table>
<thead>
<tr>
<th>Reference Position</th>
<th>Target Position</th>
<th>Delta</th>
</tr>
</thead>
<tbody>
<tr>
<td>5 cm</td>
<td>5 cm</td>
<td>0 cm</td>
</tr>
<tr>
<td>5 cm</td>
<td>5.1 cm</td>
<td>0.1 cm</td>
</tr>
<tr>
<td>5 cm</td>
<td>50 cm</td>
<td>45 cm</td>
</tr>
</tbody>
</table>
<p>Let us assume that at a particular moment in time, our raw blend weight is <strong>0.2</strong>. Compression will introduce a small amount of imprecision into that value. Let’s see what happens if our lossy blend weight is <strong>0.22</strong> instead.</p>
<table>
<thead>
<tr>
<th>Delta</th>
<th>Delta * Raw Blend Weight</th>
<th>Delta * Lossy Blend Weight</th>
<th>Error</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 cm</td>
<td>0 cm</td>
<td>0 cm</td>
<td>0 cm</td>
</tr>
<tr>
<td>0.1 cm</td>
<td>0.02 cm</td>
<td>0.022 cm</td>
<td>0.002 cm</td>
</tr>
<tr>
<td>45 cm</td>
<td>9 cm</td>
<td>9.9 cm</td>
<td>0.9 cm</td>
</tr>
</tbody>
</table>
<p>From this, we can observe some important facts:</p>
<ul>
<li>When our vertex doesn’t move and has no displacement delta, the error introduced doesn’t matter.</li>
<li>The error introduced is linearly proportional to the displacement delta: for a fixed blend weight error, a small delta yields a small positional error while a large delta yields a large one.</li>
</ul>
<p>This means that a single precision value (in blend weight space) is not suitable for all our blend weights. Some morph targets require more precision than others and ideally we would like to take advantage of that fact.</p>
<p>Specifying the amount of precision that a blend weight should have isn’t easy if we don’t know how much deformation it ultimately drives. A better way to specify how much precision we need is by specifying it in meaningful units. If we assume that we want our vertices to have a precision of <strong>0.01 cm</strong> instead, how much precision do their blend weights need?</p>
<p><code class="language-plaintext highlighter-rouge">blend weight precision = vertex precision / vertex displacement delta</code></p>
<p><em>Notice how the <code class="language-plaintext highlighter-rouge">vertex precision</code> and <code class="language-plaintext highlighter-rouge">vertex displacement delta</code> units cancel out.</em></p>
<table>
<thead>
<tr>
<th>Delta</th>
<th>Blend Weight Precision</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 cm</td>
<td>Division by zero!</td>
</tr>
<tr>
<td>0.1 cm</td>
<td>0.1</td>
</tr>
<tr>
<td>45 cm</td>
<td>0.00022</td>
</tr>
</tbody>
</table>
<p>When a vertex doesn’t move, it doesn’t matter what precision our blend weight retains since the final vertex position will be identical regardless. Otherwise, smaller displacements require less precision to be retained while larger displacements require more.</p>
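<p>In code, deriving a precision value per morph target might look like this (a sketch; the names are illustrative, not the plugin API):</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstddef>

// 'delta_lengths' holds the per-vertex displacement magnitudes of one morph target
float blend_weight_precision(const float* delta_lengths, std::size_t num_deltas, float vertex_precision)
{
    // The largest displacement drives how much precision the blend weight needs
    float max_delta = 0.0f;
    for (std::size_t i = 0; i < num_deltas; ++i)
    {
        if (delta_lengths[i] > max_delta)
            max_delta = delta_lengths[i];
    }

    // blend weight precision = vertex precision / vertex displacement delta
    // If no vertex moves, any blend weight precision will do
    return max_delta > 0.0f ? vertex_precision / max_delta : 1.0f;
}
</code></pre></div></div>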
<p>This is what the ACL plugin does: when a <em>Skeletal Mesh</em> is provided to the compression codec, ACL will use a vertex displacement precision value instead of a generic scalar track precision value. This is intuitive to tune for an artist because the units now have meaning: we can easily find out how much <strong>0.01 cm</strong> represents within the context of our 3D model. Underneath the hood, this translates into unique precision requirements tailored to each morph target blend weight curve. This allows ACL to attain an even lower memory footprint while providing a strict guarantee on the visual fidelity of the animated deformations.</p>
<p><img src="/public/ue4_curve_compression_codec.jpg" alt="Curve Compression Example" /></p>
<h2 id="a-boy-and-his-kite">A Boy and His Kite</h2>
<p>In order to test this out, I used the <em>GDC 2015</em> demo from <em>Epic</em>: <a href="https://www.youtube.com/watch?v=JNgsbNvkNjE"><em>A Boy and His Kite</em></a>. The demo is available for free on the <em>Unreal Marketplace</em> under the <em>Learning</em> section.</p>
<p>It shows a boy running through a landscape along with his kite. The boy is animated by <strong>811</strong> curves within each of the <strong>31</strong> animation sequences that comprise the full cinematic. <strong>692</strong> of those curves end up driving morph targets for his facial and clothing deformations. Some of the shots have the camera very close to his face and as such, retaining as much quality as possible is critical.</p>
<p>I decided to compare <strong>5</strong> codecs:</p>
<ul>
<li>Compressed Rich Curves with an error threshold of <strong>0.0</strong> (default within UE 4.25)</li>
<li>Compressed Rich Curves with an error threshold of <strong>0.001</strong></li>
<li>Uniform Sampling</li>
<li>ACL with a generic precision of <strong>0.001</strong></li>
<li>ACL with a generic precision of <strong>0.001</strong> and a morph target deformation precision of <strong>0.01 cm</strong></li>
</ul>
<p>The <em>Compressed Rich Curves</em> and <em>Uniform Sampling</em> codecs are built into UE4 and provide a trade-off between memory and speed. The rich curves will tend to have a lower memory footprint but evaluating them at runtime is slower than when uniform sampling is used.</p>
<p>ACL uses uniform sampling internally for very fast evaluation but it is also much more aggressive with its compression.</p>
<table>
<thead>
<tr>
<th> </th>
<th>Compressed Size</th>
<th>Compression Rate</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Compressed Rich Curves 0.0</strong></td>
<td>3620.66 KB</td>
<td>1.0x (baseline)</td>
</tr>
<tr>
<td><strong>Compressed Rich Curves 0.001</strong></td>
<td>1458.77 KB</td>
<td>2.5x smaller</td>
</tr>
<tr>
<td><strong>Uniform Sampling</strong></td>
<td>2052.82 KB</td>
<td>1.8x smaller</td>
</tr>
<tr>
<td><strong>ACL 0.001</strong></td>
<td>540.24 KB</td>
<td>6.7x smaller</td>
</tr>
<tr>
<td><strong>ACL with morph 0.01 cm</strong></td>
<td>381.10 KB</td>
<td><strong>9.5x smaller</strong></td>
</tr>
</tbody>
</table>
<p>I manually inspected the visual fidelity of each codec and everything looked flawless. By leveraging our knowledge of the individual morph target deformations, ACL manages to reduce the memory footprint by an extra <strong>159.14 KB (29%)</strong>.</p>
<p>The new <a href="https://github.com/nfrechette/acl-ue4-plugin/issues/38">curve compression</a> will be available shortly in the ACL plugin v0.6 (suitable for UE 4.24) as well as v1.0 (suitable for UE 4.25 and later). The ACL plugin v1.0 isn’t out yet but it will come to the <em>Unreal Marketplace</em> as soon as UE 4.25 is released.</p>
ACL 1.3 is out: smaller, better, faster than ever2019-11-18T00:00:00+00:00http://nfrechette.github.io/2019/11/18/acl_v1.3.0<p>After 7 months of work, the <a href="https://github.com/nfrechette/acl">Animation Compression Library</a> has finally reached <a href="https://github.com/nfrechette/acl/releases/tag/v1.3.0">v1.3</a> along with an updated <a href="https://github.com/nfrechette/acl-ue4-plugin/releases/tag/v0.5.0">v0.5 Unreal Engine 4 plugin</a>. Notable changes in this release include:</p>
<ul>
<li>Added support for VS2019, GCC 9, clang7, and Xcode 11</li>
<li>Optimized compression and decompression significantly</li>
<li>Added support for multiple root bones</li>
<li>Added support for scalar track compression</li>
</ul>
<p>Compared to <strong>UE 4.23.1</strong>, the ACL plugin compresses up to <strong>2.9x smaller</strong>, is up to <strong>4.7x more accurate</strong>, up to <strong>52.9x faster to compress</strong>, and up to <strong>6.8x faster to decompress</strong> (results may vary depending on the platform and data).</p>
<p>This latest release is a bit more accurate than the previous one and it also reduces the memory footprint by about 4%. Numbers vary a bit but decompression is roughly 1.5x faster on every platform.</p>
<h2 id="realtime-compression">Realtime compression</h2>
<p>The compression speed improvements are massive. Compared to the previous release, it is <strong>over 2.6x faster</strong>!</p>
<table>
<thead>
<tr>
<th> </th>
<th>ACL v1.3</th>
<th>ACL v1.2</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://github.com/nfrechette/acl/blob/develop/docs/cmu_performance.md">CMU</a></td>
<td>2m 22.3s (10285.52 KB/sec)</td>
<td>6m 9.71s (3958.99 KB/sec)</td>
</tr>
<tr>
<td><a href="https://github.com/nfrechette/acl/blob/develop/docs/paragon_performance.md">Paragon</a></td>
<td>10m 23.05s (7027.87 KB/sec)</td>
<td>28m 56.48s (2521.62 KB/sec)</td>
</tr>
<tr>
<td><a href="https://github.com/nfrechette/acl/blob/develop/docs/fight_scene_performance.md">Matinee fight scene</a></td>
<td>4.89s (13074.59 KB/sec)</td>
<td>20.27s (3150.43 KB/sec)</td>
</tr>
</tbody>
</table>
<p>It is so fast that when I tried it in Unreal Engine 4 and the popup dialog showed instantaneously after compression, I thought something was wrong. I got curious and decided to take a look at how fast it is for individual clips. The results were quite telling. In the Carnegie-Mellon University motion capture database, 50% of the clips compress in less than 29ms, 85% take less than 114ms, and 99% compress in 313ms or less. In Paragon, 50% of the clips compress in less than 31ms, 85% take less than 128ms, and 99% compress in 1.461 seconds or less. Half of ordinary animations compress fast enough to do so in realtime!</p>
<p>I had originally planned more improvements to the compression speed as part of this release but ultimately opted to stop there for now. I tried to switch from <em>Arrays of Structures</em> to <em>Structures of Arrays</em> and while it was faster, the added complexity was not worth the very minimal gain. There remains lots of room for improvement though and the next release should be even faster.</p>
<h1 id="whats-next">What’s next</h1>
<p>The next release will be a major one: <a href="https://github.com/nfrechette/acl/milestone/7">v2.0</a> is scheduled around <strong>Summer 2020</strong>. A number of significant changes and additions are planned.</p>
<p>While <a href="https://github.com/nfrechette/rtm">Realtime Math</a> was integrated in this release for the new scalar track compression API, the next release will see it replace <em>all</em> of the math done in ACL. Preliminary results show that it will speed up compression by about 10%. This will significantly reduce the maintenance burden and speed up CI builds. This will constitute a fairly minor API break. RTM is already included within every release of ACL (through a sub-module). Integrations will simply need to add the required include path as well as change the few places that interface with the library.</p>
<p>The error metric functions will also change a bit to allow further optimization opportunities. This will constitute a minor API break as well <em>if</em> integrations have implemented their own error metric (which is unlikely).</p>
<p>So far, ACL is what might be considered a <em>runtime</em> compression format. It was designed to change with every release and as such requires recompression whenever a new version comes out. This isn’t a very big burden as more often than not, game engines already recompress animations on demand, often storing raw animations in source control. Starting with the next release, backwards compatibility will be introduced. In order to do this, the format will be standardized, modernized, and documented. The decompression code path will remain optimal through templating. This step is necessary to allow ACL to be suitable for long term storage. As part of this effort, a <a href="https://github.com/nfrechette/acl-gltf">glTF extension</a> will be created as well as tools to pack and unpack glTF files.</p>
<p>Last but not least, a new <a href="https://github.com/nfrechette/acl-js">JavaScript/WebAssembly</a> module will be created in order to support ACL on the modern web. Through the glTF extension and a ThreeJS integration, it will become a first-class citizen in your browser.</p>
<p>ACL is already in use on millions of consoles and mobile devices today and this next major release will make its adoption easier than ever.</p>
<p>If you use ACL and would like to help prioritize the work I do, feel free to reach out and provide feedback or requests!</p>
<p>Thanks to <a href="https://github.com/sponsors">GitHub Sponsors</a>, you can <a href="https://github.com/sponsors/nfrechette">sponsor me</a>! All funds donated will go towards purchasing new devices to optimize for as well as other related costs (like coffee). The best way to ensure that ACL continues to move forward is to sponsor me for specific feature work, custom integrations, through GitHub, or some other arrangement.</p>
Realtime Math: faster than ever2019-11-14T00:00:00+00:00http://nfrechette.github.io/2019/11/14/realtime_math_11<p>Today <a href="https://github.com/nfrechette/rtm">Realtime Math</a> has finally reached <a href="https://github.com/nfrechette/rtm/releases/tag/v1.1.0">v1.1.0</a>! This release brings a lot of good things:</p>
<ul>
<li>Added support for Windows ARM64</li>
<li>Added support for VS2019, GCC9, clang7, and Xcode 11</li>
<li>Added support for Intel FMA and ARM64 FMA</li>
<li>Many optimizations, minor fixes, and cleanup</li>
</ul>
<p>I spent a great deal of time optimizing the quaternion arithmetic for NEON and SSE and as a result, many functions are now among the fastest out there. In order to make sure not to introduce regressions, the <a href="https://github.com/google/benchmark">Google Benchmark</a> library has been integrated and allows me to quickly whip up tests to try various ideas and variants.</p>
<p>RTM will be used in the upcoming <a href="https://github.com/nfrechette/acl">Animation Compression Library</a> release for the new scalar track compression API. The subsequent ACL release will remove all of its internal math code and switch everything to RTM. Preliminary tests show that it speeds up its compression by about 10%.</p>
<h2 id="is-intel-fma-worth-it">Is Intel FMA worth it?</h2>
<p>As part of this release, <a href="https://github.com/nfrechette/rtm/issues/21">support for AVX2 and FMA was added</a>. Seeing how libraries like <a href="https://github.com/Microsoft/DirectXMath">DirectX Math</a> already use FMA, I expected it to give a measurable performance win. However, on my Haswell MacBook Pro and my Ryzen 2950X desktop, it turned out to be significantly slower. As a result of these findings, I opted to not use FMA (although the relevant defines are present and handled). If you are aware of a CPU where FMA is faster, please reach out!</p>
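<p>Concretely, here is the difference being measured, shown as a minimal illustration with Intel intrinsics (not the actual RTM source):</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <immintrin.h>

// (a * b) + c as a single fused instruction, requires FMA3 support
__m128 with_fma(__m128 a, __m128 b, __m128 c)
{
    return _mm_fmadd_ps(a, b, c);
}

// The same computation with a separate multiply and add
__m128 without_fma(__m128 a, __m128 b, __m128 c)
{
    return _mm_add_ps(_mm_mul_ps(a, b), c);
}
</code></pre></div></div>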
<p>It is also worth noting that typically when <em>Fast Math</em> type compiler optimizations are enabled, FMA instructions are often automatically generated when the compiler detects a pattern where they can be used. RTM explicitly disables this behavior with Visual Studio (by forcing <em>Precise Math</em> with a pragma) and as such, even when FMA is enabled, the functions retain the SSE/AVX instructions that showed the best performance.</p>
<h2 id="arm-neon-performance-notes">ARM NEON performance notes</h2>
<p>RTM, DirectX Math, and many other libraries make extensive use of NEON SIMD intrinsics. However, while measuring various implementation variants for quaternion multiplication, I noticed that using simple scalar math is considerably faster on both ARMv7 and ARM64 on my Pixel 3 phone and my iPad. Going forward, I will make sure to always measure the scalar code path as a baseline.</p>
<h2 id="compiler-bugs">Compiler bugs</h2>
<p>Surprisingly, RTM triggered 3 compiler code generation bugs this year. <a href="https://github.com/nfrechette/rtm/issues/34">#34</a> and <a href="https://github.com/nfrechette/rtm/issues/35">#35</a> in VS2019 as well as <a href="https://github.com/nfrechette/rtm/issues/37">#37</a> in clang7. As soon as those are fixed and validated by continuous integration, I will release a new version (minor or patch).</p>
<h2 id="whats-next">What’s next</h2>
<p>The development of RTM is largely driven by my work with ACL. If you’d like to see specific things in the next release, feel free to reach out or to create GitHub issues and I will prioritize them accordingly. As always, contributions welcome!</p>
<p>Thanks to <a href="https://github.com/sponsors">GitHub Sponsors</a>, you can <a href="https://github.com/sponsors/nfrechette">sponsor me</a>! All funds donated will go towards purchasing new devices to optimize for as well as other related costs (like coffee).</p>
Faster floating point arithmetic with Exclusive OR2019-10-22T00:00:00+00:00http://nfrechette.github.io/2019/10/22/float_xor_optimization<p>Today it’s time to talk about <a href="/2019/05/08/sign_flip_optimization/">another floating point arithmetic trick</a> that can sometimes come in very handy with SSE2. This trick isn’t novel and I don’t often get to use it, but a few days ago inspiration struck me late at night in the middle of a long 3-hour drive. The results inspired this post.</p>
<p>I’ll cover three functions that use it for <a href="https://en.wikipedia.org/wiki/Quaternion">quaternion</a> arithmetic which have already been merged into the <a href="https://github.com/nfrechette/rtm">Realtime Math (RTM)</a> library as well as the <a href="https://github.com/nfrechette/acl">Animation Compression Library (ACL)</a>. ACL uses quaternions heavily and I’m always looking for ways to make them faster.</p>
<p><em>TL;DR: With SSE2, XOR (and other logical operators) can be leveraged to speed up common floating point operations.</em></p>
<h1 id="xor-the-lesser-known-logical-operator">XOR: the lesser known logical operator</h1>
<p>Most programmers are familiar with <code class="language-plaintext highlighter-rouge">AND</code>, <code class="language-plaintext highlighter-rouge">OR</code>, and <code class="language-plaintext highlighter-rouge">NOT</code> logical operators. They form the bread and butter of everyday programming. And while we all learn about their cousin <code class="language-plaintext highlighter-rouge">XOR</code>, it doesn’t come in handy anywhere near as often. Here is a quick recap of what it does.</p>
<table>
<thead>
<tr>
<th>A</th>
<th>B</th>
<th>A XOR B</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>0</td>
</tr>
</tbody>
</table>
<p>We can infer a few interesting properties from it:</p>
<ul>
<li>XOR-ing any input with zero yields the input unchanged</li>
<li>XOR can be used to flip a bit from <code class="language-plaintext highlighter-rouge">true</code> to <code class="language-plaintext highlighter-rouge">false</code> or vice versa by XOR-ing it with one</li>
<li>Using XOR when both inputs are identical yields zero (every zero bit remains zero and every one bit flips to zero)</li>
</ul>
<p>Exclusive OR comes in handy mostly with bit twiddling hacks when every cycle counts. While it is commonly used with integer inputs, it can also be used with floating point values!</p>
<h1 id="xor-with-floats">XOR with floats</h1>
<p>SSE2 contains support for XOR (and other logical operators) on both integral (<code class="language-plaintext highlighter-rouge">_mm_xor_si128</code>) and floating point values (<code class="language-plaintext highlighter-rouge">_mm_xor_ps</code>). Usually when a register transitions from the integral domain to the floating point domain (or vice versa), the CPU incurs a 1 cycle penalty. By providing a separate instruction for each domain, this hiccup can be avoided. Logical operations can often execute on more than one execution port (even on older hardware), which can enable multiple instructions to dispatch in the same cycle.</p>
<p>The question then becomes, when does it make sense to use them?</p>
<h1 id="quaternion-conjugate">Quaternion conjugate</h1>
<p>For a quaternion <code class="language-plaintext highlighter-rouge">A</code> where <code class="language-plaintext highlighter-rouge">A = [x, y, z] | w</code> with imaginary (<code class="language-plaintext highlighter-rouge">[x, y, z]</code>) and real (<code class="language-plaintext highlighter-rouge">w</code>) parts, its <a href="https://en.wikipedia.org/wiki/Quaternion#Conjugation,_the_norm,_and_reciprocal">conjugate</a> can be expressed as follows: <code class="language-plaintext highlighter-rouge">conjugate(A) = [-x, -y, -z] | w</code>. The conjugate simply flips the sign of each component of the imaginary part.</p>
<p>The most common way to achieve this is by multiplying a constant just like <a href="https://github.com/microsoft/DirectXMath/blob/939c1a86b28f0d10858601b80faa7845070687fb/Inc/DirectXMathMisc.inl#L251">DirectX Math</a> and Unreal Engine 4 do.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">quatf</span> <span class="nf">quat_conjugate</span><span class="p">(</span><span class="n">quatf</span> <span class="n">input</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">constexpr</span> <span class="n">__m128</span> <span class="n">signs</span> <span class="o">=</span> <span class="p">{</span> <span class="o">-</span><span class="mf">1.0</span><span class="n">f</span><span class="p">,</span> <span class="o">-</span><span class="mf">1.0</span><span class="n">f</span><span class="p">,</span> <span class="o">-</span><span class="mf">1.0</span><span class="n">f</span><span class="p">,</span> <span class="mf">1.0</span><span class="n">f</span> <span class="p">};</span>
<span class="k">return</span> <span class="n">_mm_mul_ps</span><span class="p">(</span><span class="n">input</span><span class="p">,</span> <span class="n">signs</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This yields a single instruction (the constant will be loaded from memory as part of the multiply instruction) but we can do better. Flipping the sign bit can also be achieved by XOR-ing our input with the sign bit. To avoid flipping the sign of the <code class="language-plaintext highlighter-rouge">w</code> component, we can simply XOR it with zero which will leave the original value unchanged.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">quatf</span> <span class="nf">quat_conjugate</span><span class="p">(</span><span class="n">quatf</span> <span class="n">input</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">constexpr</span> <span class="n">__m128</span> <span class="n">signs</span> <span class="o">=</span> <span class="p">{</span> <span class="o">-</span><span class="mf">0.0</span><span class="n">f</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.0</span><span class="n">f</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.0</span><span class="n">f</span><span class="p">,</span> <span class="mf">0.0</span><span class="n">f</span> <span class="p">};</span>
<span class="k">return</span> <span class="n">_mm_xor_ps</span><span class="p">(</span><span class="n">input</span><span class="p">,</span> <span class="n">signs</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Again, this yields a single instruction but this time it is much faster. See full results <a href="https://github.com/nfrechette/rtm/issues/6">here</a> but on my MacBook Pro it is <strong>33.5% faster</strong>!</p>
<h1 id="quaternion-interpolation">Quaternion interpolation</h1>
<p>Linear interpolation for scalars and vectors is simple: <code class="language-plaintext highlighter-rouge">result = ((end - start) * alpha) + start</code>.</p>
<p>While this can be used for quaternions as well, it breaks down if both quaternions are not on the <a href="https://en.wikipedia.org/wiki/Quaternions_and_spatial_rotation#The_hypersphere_of_rotations">same side of the hypersphere</a>. Both quaternions <code class="language-plaintext highlighter-rouge">A</code> and <code class="language-plaintext highlighter-rouge">-A</code> represent the same 3D rotation but lie on opposite ends of the 4D hypersphere represented by unit quaternions. In order to properly handle this case, we first need to calculate the dot product of both inputs being interpolated and, depending on its sign, flip one of the inputs so that both lie on the same side of the hypersphere.</p>
<p>In code it looks like this:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">quatf</span> <span class="nf">quat_lerp</span><span class="p">(</span><span class="n">quatf</span> <span class="n">start</span><span class="p">,</span> <span class="n">quatf</span> <span class="n">end</span><span class="p">,</span> <span class="kt">float</span> <span class="n">alpha</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// To ensure we take the shortest path, we apply a bias if the dot product is negative</span>
<span class="kt">float</span> <span class="n">dot</span> <span class="o">=</span> <span class="n">vector_dot</span><span class="p">(</span><span class="n">start</span><span class="p">,</span> <span class="n">end</span><span class="p">);</span>
<span class="kt">float</span> <span class="n">bias</span> <span class="o">=</span> <span class="n">dot</span> <span class="o">>=</span> <span class="mf">0.0</span><span class="n">f</span> <span class="o">?</span> <span class="mf">1.0</span><span class="n">f</span> <span class="o">:</span> <span class="o">-</span><span class="mf">1.0</span><span class="n">f</span><span class="p">;</span>
<span class="n">vector4f</span> <span class="n">rotation</span> <span class="o">=</span> <span class="n">vector_neg_mul_sub</span><span class="p">(</span><span class="n">vector_neg_mul_sub</span><span class="p">(</span><span class="n">end</span><span class="p">,</span> <span class="n">bias</span><span class="p">,</span> <span class="n">start</span><span class="p">),</span> <span class="n">alpha</span><span class="p">,</span> <span class="n">start</span><span class="p">);</span>
<span class="k">return</span> <span class="n">quat_normalize</span><span class="p">(</span><span class="n">rotation</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p><em>The double <code class="language-plaintext highlighter-rouge">vector_neg_mul_sub</code> trick was explained in a <a href="/2019/05/08/sign_flip_optimization/">previous</a> blog post.</em></p>
<p>As mentioned, we take the sign of the dot product to calculate a bias and simply multiply it with the <code class="language-plaintext highlighter-rouge">end</code> input. This can be achieved with a compare instruction between the dot product and zero to generate a mask and using that mask to select between the positive and negative value of <code class="language-plaintext highlighter-rouge">end</code>. This is entirely branchless and boils down to a few instructions: 1x compare, 1x subtract (to generate <code class="language-plaintext highlighter-rouge">-end</code>), 1x blend (with AVX to select the bias), and 1x multiplication (to apply the bias). If AVX isn’t present, the selection is done with 3x logical operation instructions instead. This is what Unreal Engine 4 and many others do.</p>
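<p>A sketch of that select-based approach might look like this (illustrative only, not the actual UE4 source; it assumes the dot product is replicated across every lane):</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <smmintrin.h> // SSE4.1 for _mm_blendv_ps

__m128 apply_bias(__m128 end, __m128 dot)
{
    // Generate an all-ones mask in every lane where the dot product is positive or zero
    __m128 mask = _mm_cmpge_ps(dot, _mm_setzero_ps());
    // Select a bias of +1.0 or -1.0: a single blend with SSE4.1/AVX,
    // or 3x logical instructions (and/andnot/or) with plain SSE2
    __m128 bias = _mm_blendv_ps(_mm_set_ps1(-1.0f), _mm_set_ps1(1.0f), mask);
    // Apply the bias
    return _mm_mul_ps(end, bias);
}
</code></pre></div></div>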
<p>Here again, we can do better with logical operators.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">quatf</span> <span class="nf">quat_lerp</span><span class="p">(</span><span class="n">quatf</span> <span class="n">start</span><span class="p">,</span> <span class="n">quatf</span> <span class="n">end</span><span class="p">,</span> <span class="kt">float</span> <span class="n">alpha</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">__m128</span> <span class="n">dot</span> <span class="o">=</span> <span class="n">vector_dot</span><span class="p">(</span><span class="n">start</span><span class="p">,</span> <span class="n">end</span><span class="p">);</span>
<span class="c1">// Calculate the bias, if the dot product is positive or zero, there is no bias</span>
<span class="c1">// but if it is negative, we want to flip the 'end' rotation XYZW components</span>
<span class="n">__m128</span> <span class="n">bias</span> <span class="o">=</span> <span class="n">_mm_and_ps</span><span class="p">(</span><span class="n">dot</span><span class="p">,</span> <span class="n">_mm_set_ps1</span><span class="p">(</span><span class="o">-</span><span class="mf">0.0</span><span class="n">f</span><span class="p">));</span>
<span class="n">__m128</span> <span class="n">rotation</span> <span class="o">=</span> <span class="n">_mm_add_ps</span><span class="p">(</span><span class="n">_mm_mul_ps</span><span class="p">(</span><span class="n">_mm_sub_ps</span><span class="p">(</span><span class="n">_mm_xor_ps</span><span class="p">(</span><span class="n">end</span><span class="p">,</span> <span class="n">bias</span><span class="p">),</span> <span class="n">start</span><span class="p">),</span> <span class="n">_mm_set_ps1</span><span class="p">(</span><span class="n">alpha</span><span class="p">)),</span> <span class="n">start</span><span class="p">);</span>
<span class="k">return</span> <span class="n">quat_normalize</span><span class="p">(</span><span class="n">rotation</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>What we really want to achieve is to do nothing (use <code class="language-plaintext highlighter-rouge">end</code> as-is) if the sign of the bias is zero and to flip the sign of <code class="language-plaintext highlighter-rouge">end</code> if the bias is negative. This is a perfect fit for XOR! All we need to do is XOR <code class="language-plaintext highlighter-rouge">end</code> with the sign bit of the bias. We can easily extract it with a logical AND instruction and a mask of the sign bit. This boils down to just two instructions: 1x logical AND and 1x logical XOR. We managed to remove expensive floating point operations while simultaneously using fewer and cheaper instructions.</p>
<p><em>Measuring is left as an exercise for the reader.</em></p>
<h1 id="quaternion-multiplication">Quaternion multiplication</h1>
<p>While the previous two use cases have been in RTM for some time now, this one is brand new and is what crossed my mind the other night: quaternion multiplication can use the same trick!</p>
<p>Multiplying two quaternions with scalar arithmetic is done like this:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">quatf</span> <span class="nf">quat_mul</span><span class="p">(</span><span class="n">quatf</span> <span class="n">lhs</span><span class="p">,</span> <span class="n">quatf</span> <span class="n">rhs</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">float</span> <span class="n">x</span> <span class="o">=</span> <span class="p">(</span><span class="n">rhs</span><span class="p">.</span><span class="n">w</span> <span class="o">*</span> <span class="n">lhs</span><span class="p">.</span><span class="n">x</span><span class="p">)</span> <span class="o">+</span> <span class="p">(</span><span class="n">rhs</span><span class="p">.</span><span class="n">x</span> <span class="o">*</span> <span class="n">lhs</span><span class="p">.</span><span class="n">w</span><span class="p">)</span> <span class="o">+</span> <span class="p">(</span><span class="n">rhs</span><span class="p">.</span><span class="n">y</span> <span class="o">*</span> <span class="n">lhs</span><span class="p">.</span><span class="n">z</span><span class="p">)</span> <span class="o">-</span> <span class="p">(</span><span class="n">rhs</span><span class="p">.</span><span class="n">z</span> <span class="o">*</span> <span class="n">lhs</span><span class="p">.</span><span class="n">y</span><span class="p">);</span>
<span class="kt">float</span> <span class="n">y</span> <span class="o">=</span> <span class="p">(</span><span class="n">rhs</span><span class="p">.</span><span class="n">w</span> <span class="o">*</span> <span class="n">lhs</span><span class="p">.</span><span class="n">y</span><span class="p">)</span> <span class="o">-</span> <span class="p">(</span><span class="n">rhs</span><span class="p">.</span><span class="n">x</span> <span class="o">*</span> <span class="n">lhs</span><span class="p">.</span><span class="n">z</span><span class="p">)</span> <span class="o">+</span> <span class="p">(</span><span class="n">rhs</span><span class="p">.</span><span class="n">y</span> <span class="o">*</span> <span class="n">lhs</span><span class="p">.</span><span class="n">w</span><span class="p">)</span> <span class="o">+</span> <span class="p">(</span><span class="n">rhs</span><span class="p">.</span><span class="n">z</span> <span class="o">*</span> <span class="n">lhs</span><span class="p">.</span><span class="n">x</span><span class="p">);</span>
<span class="kt">float</span> <span class="n">z</span> <span class="o">=</span> <span class="p">(</span><span class="n">rhs</span><span class="p">.</span><span class="n">w</span> <span class="o">*</span> <span class="n">lhs</span><span class="p">.</span><span class="n">z</span><span class="p">)</span> <span class="o">+</span> <span class="p">(</span><span class="n">rhs</span><span class="p">.</span><span class="n">x</span> <span class="o">*</span> <span class="n">lhs</span><span class="p">.</span><span class="n">y</span><span class="p">)</span> <span class="o">-</span> <span class="p">(</span><span class="n">rhs</span><span class="p">.</span><span class="n">y</span> <span class="o">*</span> <span class="n">lhs</span><span class="p">.</span><span class="n">x</span><span class="p">)</span> <span class="o">+</span> <span class="p">(</span><span class="n">rhs</span><span class="p">.</span><span class="n">z</span> <span class="o">*</span> <span class="n">lhs</span><span class="p">.</span><span class="n">w</span><span class="p">);</span>
<span class="kt">float</span> <span class="n">w</span> <span class="o">=</span> <span class="p">(</span><span class="n">rhs</span><span class="p">.</span><span class="n">w</span> <span class="o">*</span> <span class="n">lhs</span><span class="p">.</span><span class="n">w</span><span class="p">)</span> <span class="o">-</span> <span class="p">(</span><span class="n">rhs</span><span class="p">.</span><span class="n">x</span> <span class="o">*</span> <span class="n">lhs</span><span class="p">.</span><span class="n">x</span><span class="p">)</span> <span class="o">-</span> <span class="p">(</span><span class="n">rhs</span><span class="p">.</span><span class="n">y</span> <span class="o">*</span> <span class="n">lhs</span><span class="p">.</span><span class="n">y</span><span class="p">)</span> <span class="o">-</span> <span class="p">(</span><span class="n">rhs</span><span class="p">.</span><span class="n">z</span> <span class="o">*</span> <span class="n">lhs</span><span class="p">.</span><span class="n">z</span><span class="p">);</span>
<span class="k">return</span> <span class="n">quat_set</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">z</span><span class="p">,</span> <span class="n">w</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Floating point multiplications and additions being expensive, we can reduce their number by converting this to SSE2 and shuffling our inputs to line everything up. A few shuffles can line the values up for our multiplications but it is clear that in order to use addition (or subtraction), we have to flip the signs of a few components. Again, <a href="https://github.com/microsoft/DirectXMath/blob/939c1a86b28f0d10858601b80faa7845070687fb/Inc/DirectXMathMisc.inl#L143">DirectX Math</a> does just this (and so does Unreal Engine 4).</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">quatf</span> <span class="nf">quat_mul</span><span class="p">(</span><span class="n">quatf</span> <span class="n">lhs</span><span class="p">,</span> <span class="n">quatf</span> <span class="n">rhs</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">constexpr</span> <span class="n">__m128</span> <span class="n">control_wzyx</span> <span class="o">=</span> <span class="p">{</span> <span class="mf">1.0</span><span class="n">f</span><span class="p">,</span><span class="o">-</span><span class="mf">1.0</span><span class="n">f</span><span class="p">,</span> <span class="mf">1.0</span><span class="n">f</span><span class="p">,</span><span class="o">-</span><span class="mf">1.0</span><span class="n">f</span> <span class="p">};</span>
<span class="k">constexpr</span> <span class="n">__m128</span> <span class="n">control_zwxy</span> <span class="o">=</span> <span class="p">{</span> <span class="mf">1.0</span><span class="n">f</span><span class="p">,</span> <span class="mf">1.0</span><span class="n">f</span><span class="p">,</span><span class="o">-</span><span class="mf">1.0</span><span class="n">f</span><span class="p">,</span><span class="o">-</span><span class="mf">1.0</span><span class="n">f</span> <span class="p">};</span>
<span class="k">constexpr</span> <span class="n">__m128</span> <span class="n">control_yxwz</span> <span class="o">=</span> <span class="p">{</span> <span class="o">-</span><span class="mf">1.0</span><span class="n">f</span><span class="p">,</span> <span class="mf">1.0</span><span class="n">f</span><span class="p">,</span> <span class="mf">1.0</span><span class="n">f</span><span class="p">,</span><span class="o">-</span><span class="mf">1.0</span><span class="n">f</span> <span class="p">};</span>
<span class="n">__m128</span> <span class="n">r_xxxx</span> <span class="o">=</span> <span class="n">_mm_shuffle_ps</span><span class="p">(</span><span class="n">rhs</span><span class="p">,</span> <span class="n">rhs</span><span class="p">,</span> <span class="n">_MM_SHUFFLE</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">));</span>
<span class="n">__m128</span> <span class="n">r_yyyy</span> <span class="o">=</span> <span class="n">_mm_shuffle_ps</span><span class="p">(</span><span class="n">rhs</span><span class="p">,</span> <span class="n">rhs</span><span class="p">,</span> <span class="n">_MM_SHUFFLE</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">));</span>
<span class="n">__m128</span> <span class="n">r_zzzz</span> <span class="o">=</span> <span class="n">_mm_shuffle_ps</span><span class="p">(</span><span class="n">rhs</span><span class="p">,</span> <span class="n">rhs</span><span class="p">,</span> <span class="n">_MM_SHUFFLE</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">));</span>
<span class="n">__m128</span> <span class="n">r_wwww</span> <span class="o">=</span> <span class="n">_mm_shuffle_ps</span><span class="p">(</span><span class="n">rhs</span><span class="p">,</span> <span class="n">rhs</span><span class="p">,</span> <span class="n">_MM_SHUFFLE</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">));</span>
<span class="n">__m128</span> <span class="n">lxrw_lyrw_lzrw_lwrw</span> <span class="o">=</span> <span class="n">_mm_mul_ps</span><span class="p">(</span><span class="n">r_wwww</span><span class="p">,</span> <span class="n">lhs</span><span class="p">);</span>
<span class="n">__m128</span> <span class="n">l_wzyx</span> <span class="o">=</span> <span class="n">_mm_shuffle_ps</span><span class="p">(</span><span class="n">lhs</span><span class="p">,</span> <span class="n">lhs</span><span class="p">,</span><span class="n">_MM_SHUFFLE</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">));</span>
<span class="n">__m128</span> <span class="n">lwrx_lzrx_lyrx_lxrx</span> <span class="o">=</span> <span class="n">_mm_mul_ps</span><span class="p">(</span><span class="n">r_xxxx</span><span class="p">,</span> <span class="n">l_wzyx</span><span class="p">);</span>
<span class="n">__m128</span> <span class="n">l_zwxy</span> <span class="o">=</span> <span class="n">_mm_shuffle_ps</span><span class="p">(</span><span class="n">l_wzyx</span><span class="p">,</span> <span class="n">l_wzyx</span><span class="p">,</span><span class="n">_MM_SHUFFLE</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">));</span>
<span class="n">__m128</span> <span class="n">lwrx_nlzrx_lyrx_nlxrx</span> <span class="o">=</span> <span class="n">_mm_mul_ps</span><span class="p">(</span><span class="n">lwrx_lzrx_lyrx_lxrx</span><span class="p">,</span> <span class="n">control_wzyx</span><span class="p">);</span> <span class="c1">// flip!</span>
<span class="n">__m128</span> <span class="n">lzry_lwry_lxry_lyry</span> <span class="o">=</span> <span class="n">_mm_mul_ps</span><span class="p">(</span><span class="n">r_yyyy</span><span class="p">,</span> <span class="n">l_zwxy</span><span class="p">);</span>
<span class="n">__m128</span> <span class="n">l_yxwz</span> <span class="o">=</span> <span class="n">_mm_shuffle_ps</span><span class="p">(</span><span class="n">l_zwxy</span><span class="p">,</span> <span class="n">l_zwxy</span><span class="p">,</span><span class="n">_MM_SHUFFLE</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">));</span>
<span class="n">__m128</span> <span class="n">lzry_lwry_nlxry_nlyry</span> <span class="o">=</span> <span class="n">_mm_mul_ps</span><span class="p">(</span><span class="n">lzry_lwry_lxry_lyry</span><span class="p">,</span> <span class="n">control_zwxy</span><span class="p">);</span> <span class="c1">// flip!</span>
<span class="n">__m128</span> <span class="n">lyrz_lxrz_lwrz_lzrz</span> <span class="o">=</span> <span class="n">_mm_mul_ps</span><span class="p">(</span><span class="n">r_zzzz</span><span class="p">,</span> <span class="n">l_yxwz</span><span class="p">);</span>
<span class="n">__m128</span> <span class="n">result0</span> <span class="o">=</span> <span class="n">_mm_add_ps</span><span class="p">(</span><span class="n">lxrw_lyrw_lzrw_lwrw</span><span class="p">,</span> <span class="n">lwrx_nlzrx_lyrx_nlxrx</span><span class="p">);</span>
<span class="n">__m128</span> <span class="n">nlyrz_lxrz_lwrz_wlzrz</span> <span class="o">=</span> <span class="n">_mm_mul_ps</span><span class="p">(</span><span class="n">lyrz_lxrz_lwrz_lzrz</span><span class="p">,</span> <span class="n">control_yxwz</span><span class="p">);</span> <span class="c1">// flip!</span>
<span class="n">__m128</span> <span class="n">result1</span> <span class="o">=</span> <span class="n">_mm_add_ps</span><span class="p">(</span><span class="n">lzry_lwry_nlxry_nlyry</span><span class="p">,</span> <span class="n">nlyrz_lxrz_lwrz_wlzrz</span><span class="p">);</span>
<span class="k">return</span> <span class="n">_mm_add_ps</span><span class="p">(</span><span class="n">result0</span><span class="p">,</span> <span class="n">result1</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The code this time is a bit harder to read but here is the gist:</p>
<ul>
<li>We need 7x shuffles to line everything up</li>
<li>With everything lined up, we need 4x multiplications and 3x additions</li>
<li>3x multiplications are also required to flip our signs ahead of each addition (which conveniently can also be done with fused-multiply-add)</li>
</ul>
<p>I’ll omit the code for brevity but by using <code class="language-plaintext highlighter-rouge">-0.0f</code> and <code class="language-plaintext highlighter-rouge">0.0f</code> as our control values to flip the sign bits with XOR instead, quaternion multiplication becomes much <a href="https://github.com/nfrechette/rtm/issues/29">faster</a>. On my MacBook Pro it is <strong>14%</strong> faster while on my Ryzen 2950X it is <strong>10%</strong> faster! I also measured with ACL to see what the speed up would be in a real world use case: compressing lots of animations. With the data sets I measure with, this new quaternion multiplication accelerates the compression by up to <strong>1.3%</strong>.</p>
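<p>For illustration, here is roughly what the change looks like: the three sign-flipping multiplications above become XORs against control values that carry only sign bits (a sketch, not the exact RTM source):</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// XOR with -0.0f flips the sign bit, XOR with 0.0f leaves the value untouched
constexpr __m128 control_wzyx = {  0.0f, -0.0f,  0.0f, -0.0f };
constexpr __m128 control_zwxy = {  0.0f,  0.0f, -0.0f, -0.0f };
constexpr __m128 control_yxwz = { -0.0f,  0.0f,  0.0f, -0.0f };

// ... same shuffles and multiplications as before ...
__m128 lwrx_nlzrx_lyrx_nlxrx = _mm_xor_ps(lwrx_lzrx_lyrx_lxrx, control_wzyx); // flip!
__m128 lzry_lwry_nlxry_nlyry = _mm_xor_ps(lzry_lwry_lxry_lyry, control_zwxy); // flip!
__m128 nlyrz_lxrz_lwrz_wlzrz = _mm_xor_ps(lyrz_lxrz_lwrz_lzrz, control_yxwz); // flip!
</code></pre></div></div>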
<p><em>Most x64 CPUs in use today (including those in the PlayStation 4 and Xbox One) do not yet support fused-multiply-add and when I add support for it in RTM, I will measure again.</em></p>
<h1 id="is-it-safe">Is it safe?</h1>
<p>In all three of these examples, the results are binary exact and identical to their reference implementations. Flipping the sign bit on normal floating point values (and infinities) with XOR yields a binary exact result. If the input is <code class="language-plaintext highlighter-rouge">NaN</code>, XOR will not yield the same output but it will yield a <code class="language-plaintext highlighter-rouge">NaN</code> with the sign bit flipped which is entirely valid and consistent (the sign bit is typically left unused on <code class="language-plaintext highlighter-rouge">NaN</code> values).</p>
<p>I also measured this trick with NEON on ARMv7 and ARM64 but sadly it is slower on those platforms (for now). It appears that there is indeed a penalty there for switching between the two domains and perhaps in time it will go away or perhaps something else is slowing things down.</p>
<p><em>ARM64 already uses fused-multiply-add where possible.</em></p>
<h1 id="progress-update">Progress update</h1>
<p>It has been almost a year since the <a href="https://nfrechette.github.io/2019/01/19/introducing_realtime_math/">first release</a> of RTM. The next release should happen very soon now, just ahead of the next ACL release which will introduce it as a dependency for its <a href="https://github.com/nfrechette/acl/issues/71">scalar track compression</a> API. I am on track to finish both releases before the end of the year.</p>
<p>Thanks to the new <a href="https://github.com/sponsors">GitHub Sponsors</a> program, you can now <a href="https://github.com/sponsors/nfrechette">sponsor me</a>! All funds donated will go towards purchasing new devices to optimize for as well as other related costs (like coffee).</p>
Pitfalls of linear sample reduction: Part 42019-07-31T00:00:00+00:00http://nfrechette.github.io/2019/07/31/pitfalls_linear_reduction_part4<p><strong>A quick recap:</strong> animation clips are formed from a set of time series called tracks. Tracks have a fixed number of samples per second and each track has the same length. The <a href="https://github.com/nfrechette/acl">Animation Compression Library</a> retains every sample while <em>Unreal Engine</em> uses the popular method of <a href="/2016/12/07/anim_compression_key_reduction/">removing samples that can be linearly interpolated</a> from their neighbors.</p>
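<p>As a refresher, a sample is a candidate for removal when linearly interpolating its retained neighbors reconstructs it closely enough. A minimal sketch of that test (illustrative; the engine’s actual error metric is more involved):</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cmath>

// 'alpha' is the normalized position of the sample between its two retained neighbors
bool can_remove_sample(float prev, float next, float sample, float alpha, float error_threshold)
{
    // Reconstruct the candidate sample by interpolating its neighbors
    const float interpolated = ((next - prev) * alpha) + prev;
    return std::fabs(interpolated - sample) <= error_threshold;
}
</code></pre></div></div>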
<p>The <a href="/2019/07/23/pitfalls_linear_reduction_part1/">first post</a> showed how removing samples negates some of the benefits that come from <a href="/2016/11/10/anim_compression_uniform_segmenting/">segmenting</a>, rendering the technique a lot less effective.</p>
<p>The <a href="/2019/07/25/pitfalls_linear_reduction_part2/">second post</a> explored how sorting (or not) the retained samples impacts the decompression performance.</p>
<p>The <a href="/2019/07/29/pitfalls_linear_reduction_part3/">third post</a> took a deep dive into the memory overhead of the three techniques we have been discussing so far:</p>
<ul>
<li>Retaining every sample with ACL</li>
<li>Sorted and unsorted linear sample reduction</li>
</ul>
<p>This fourth and final post in the series shows exactly how many samples are removed in practice.</p>
<h3 id="how-often-are-samples-removed">How often are samples removed?</h3>
<p>In order to answer this question, I instrumented the <a href="https://github.com/nfrechette/acl-ue4-plugin">ACL UE4 plugin</a> to extract how many samples per pose, per clip, and per track were dropped. I then ran this over the <a href="http://mocap.cs.cmu.edu/"><em>Carnegie-Mellon University</em> motion capture database</a> as well as <em>Paragon</em> and <em>Fortnite</em>. I did this while keeping every sample with full precision (quantizing the samples can only make things worse), with and without <a href="/2016/12/22/anim_compression_error_compensation/">error compensation</a> (retargeting). The idea behind retargeting is to compensate for the error by altering the samples slightly as we optimize a bone chain. While it will obviously be slower to compress than not using it, it should in theory reduce the overall error and possibly allow us to remove more samples as a result.</p>
<p><em>Note that constant and default tracks are ignored when calculating the number of samples dropped.</em></p>
<table>
<thead>
<tr>
<th>Carnegie-Mellon University</th>
<th>With retargeting</th>
<th>Without retargeting</th>
</tr>
</thead>
<tbody>
<tr>
<td>Total size</td>
<td>204.40 MB</td>
<td>204.40 MB</td>
</tr>
<tr>
<td>Compression speed</td>
<td>6487.83 KB/sec</td>
<td>9756.59 KB/sec</td>
</tr>
<tr>
<td>Max error</td>
<td>0.1416 cm</td>
<td>0.0739 cm</td>
</tr>
<tr>
<td>Median dropped per clip</td>
<td>0.13 %</td>
<td>0.13 %</td>
</tr>
<tr>
<td>Median dropped per pose</td>
<td>0.00 %</td>
<td>0.00 %</td>
</tr>
<tr>
<td>Median dropped per track</td>
<td>0.00 %</td>
<td>0.00 %</td>
</tr>
</tbody>
</table>
<p>As expected, the CMU motion capture database performs very poorly with sample reduction. By its very nature, motion capture data can be quite noisy as it comes from cameras or sensors.</p>
<table>
<thead>
<tr>
<th>Paragon</th>
<th>With retargeting</th>
<th>Without retargeting</th>
</tr>
</thead>
<tbody>
<tr>
<td>Total size</td>
<td>575.55 MB</td>
<td>572.60 MB</td>
</tr>
<tr>
<td>Compression speed</td>
<td>2125.21 KB/sec</td>
<td>3319.38 KB/sec</td>
</tr>
<tr>
<td>Max error</td>
<td>80.0623 cm</td>
<td>14.1421 cm</td>
</tr>
<tr>
<td>Median dropped per clip</td>
<td>12.37 %</td>
<td>12.70 %</td>
</tr>
<tr>
<td>Median dropped per pose</td>
<td>13.04 %</td>
<td>13.41 %</td>
</tr>
<tr>
<td>Median dropped per track</td>
<td>2.13 %</td>
<td>2.25 %</td>
</tr>
</tbody>
</table>
<p>Now the data is more interesting. Retargeting continues to slow down compression but surprisingly, it slightly increases the memory footprint and lowers the number of samples dropped. It even fails to improve the compression accuracy.</p>
<table>
<thead>
<tr>
<th>Fortnite</th>
<th>With retargeting</th>
<th>Without retargeting</th>
</tr>
</thead>
<tbody>
<tr>
<td>Total size</td>
<td>1231.97 MB</td>
<td>1169.01 MB</td>
</tr>
<tr>
<td>Compression speed</td>
<td>1273.36 KB/sec</td>
<td>2010.08 KB/sec</td>
</tr>
<tr>
<td>Max error</td>
<td>283897.4062 cm</td>
<td>172080.7500 cm</td>
</tr>
<tr>
<td>Median dropped per clip</td>
<td>7.11 %</td>
<td>7.72 %</td>
</tr>
<tr>
<td>Median dropped per pose</td>
<td>10.81 %</td>
<td>11.76 %</td>
</tr>
<tr>
<td>Median dropped per track</td>
<td>15.37 %</td>
<td>16.13 %</td>
</tr>
</tbody>
</table>
<p>The retargeting trend continues with <em>Fortnite</em>. One possible explanation for these disappointing results is that error compensation within UE4 does not measure the error in the same way that the engine does after compression is done: it does not take into account <a href="https://github.com/nfrechette/acl-ue4-plugin/blob/develop/Docs/error_measurements.md">virtual vertices or leaf bones</a>. This discrepancy leads to the optimizing algorithm thinking the error is lower than it really is.</p>
<p>This is all well and good but how does the full distribution look? <em>Retargeting will be omitted since it doesn’t appear to contribute much.</em></p>
<p><img src="/public/samples_dropped_distribution.png" alt="Samples dropped distribution" /></p>
<p><em>Note that Paragon and Fortnite have ~460 clips (7%) and ~2700 clips (32%) respectively with one or two samples and thus no reduction can happen in those clips.</em></p>
<p>The graph is quite telling: more often than not we fail to drop enough samples to match ACL with segmenting. Very few clips end up dropping over 35% of their samples: none do in CMU, 18% do in <em>Paragon</em>, and 23% in <em>Fortnite</em>.</p>
<p>This is despite using very optimistic memory overhead estimates. In practice, the overhead is almost always higher than what we used in our calculations and removing samples might negatively impact our quantization bit rates, further increasing the overall memory footprint.</p>
<p><em>Note that curve fitting might allow us to remove more samples but it would slow down compression and decompression.</em></p>
<h2 id="removing-samples-just-isnt-worth-it">Removing samples just isn’t worth it</h2>
<p>I have been implementing animation compression algorithms in one form or another for many years now and I have grown to believe that removing samples just isn’t worth it: retaining every sample is the overall best strategy.</p>
<p>Games often play animations erratically or in an unpredictable manner. Some randomly seek while others play forward and backward. Various factors control when and where an animation starts playing and when it stops. Clips are often sampled at rates that differ from their runtime playback rates. The ideal default strategy must handle all of these cases equally well. The last thing animators want to do is mess around with the compression parameters of individual clips to avoid an algorithm butchering their work.</p>
<p>When samples are removed, sorting what is retained and using a persistent context is a challenging idea in large scale games. Even if decompression has the potential to be the fastest under specific conditions, in practice the gains might not materialize. Regardless, whether the retained samples are sorted or not, metadata must be added to compensate and it eats away at the memory gains achieved. While the <em>Unreal Engine</em> codecs (which use unsorted sample reduction) could be optimized, the amount of cache misses cannot be significantly reduced and ultimately proves to be the bottleneck.</p>
<p>Furthermore, as ACL continues to show, removing samples is not necessary in order to achieve a low and competitive memory footprint. By virtue of having its data simply laid out in memory, very little metadata overhead is required and performance remains consistent and lightning fast. This also dramatically simplifies and streamlines compression as we do not need to consider which samples to retain while attempting to find the optimal bit rate for every track.</p>
<p>It is also worth noting that while we assumed that it is possible to achieve the same bit rates as ACL while removing samples, it might not be the case as the resulting error will combine in subtle ways. Despite being very conservative in our estimates with the sample reduction variants, ACL emerges a clear winner.</p>
<p>That being said, I do have some ideas of my own on how to tackle the problem of efficient sample reduction and maybe someday I will get the chance to try them even if only for the sake of research.</p>
<h2 id="a-small-note-about-curves">A small note about curves</h2>
<p>It is worth noting that by uniformly sampling an input curve, some amount of precision loss can happen. If the apex of a curve happens between two samples, it will be smoothed out and lost.</p>
<p>The most common curve formats (cubic) generally require 4 values in order to interpolate, twice as many as linear interpolation. This means that the context footprint also doubles. In theory, a curve might need fewer samples to represent the same time series but that is not always the case. Animations that come from motion capture or offline simulations such as cloth or hair will often have very noisy data and will not be well approximated by a curve. Such animations might see the number of samples removed drop below 10% as can be seen with the CMU motion capture database.</p>
<p>Curves might also need arbitrary time values that do not fall on uniformly distributed values. When this is the case, the time cannot be quantized too much as it will lower the resulting accuracy, further complicating things and increasing the memory footprint. If the data is non-uniform, a context object is required in order to keep decompression fast and everything I mentioned earlier applies. This is also true of techniques that store their data relative to previous samples (e.g. a delta or velocity change).</p>
<h2 id="special-thanks">Special thanks</h2>
<p>I spend a great deal of time implementing ACL and writing about the nuggets of knowledge I find along the way. All of this is made possible, in part, thanks to <em>Epic</em> which is generously allowing me to use the <em>Paragon</em> and <em>Fortnite</em> animations for research purposes. <a href="https://github.com/CodyDWJones">Cody Jones</a>, <a href="https://github.com/tirpidz">Martin Turcotte</a>, and <a href="https://keybase.io/visualphoenix">Raymond Barbiero</a> continue to contribute code, ideas, and proofread my work and their help is greatly appreciated. Many others have contributed to ACL and its UE4 plugin as well. Thank you for keeping me motivated and your ongoing support!</p>
Pitfalls of linear sample reduction: Part 32019-07-29T00:00:00+00:00http://nfrechette.github.io/2019/07/29/pitfalls_linear_reduction_part3<p><strong>A quick recap:</strong> animation clips are formed from a set of time series called tracks. Tracks have a fixed number of samples per second and each track has the same length. The <a href="https://github.com/nfrechette/acl">Animation Compression Library</a> retains every sample while <em>Unreal Engine</em> uses the popular method of <a href="/2016/12/07/anim_compression_key_reduction/">removing samples that can be linearly interpolated</a> from their neighbors.</p>
<p>The <a href="/2019/07/23/pitfalls_linear_reduction_part1/">first post</a> showed how removing samples negates some of the benefits that come from <a href="/2016/11/10/anim_compression_uniform_segmenting/">segmenting</a>, rendering the technique a lot less effective.</p>
<p>The <a href="/2019/07/25/pitfalls_linear_reduction_part2/">second post</a> explored how sorting (or not) the retained samples impacts the decompression performance.</p>
<p>This third post will take a deep dive into the memory overhead of the three techniques we have been discussing so far:</p>
<ul>
<li>Retaining every sample with ACL</li>
<li>Sorted and unsorted linear sample reduction</li>
</ul>
<h2 id="should-we-remove-samples-or-retain-them">Should we remove samples or retain them?</h2>
<p>ACL does not yet implement a sample reduction algorithm while UE4 is missing a number of features that ACL provides. As such, in order to keep things as fair as possible, some assumptions will be made for the sample reduction variants and we will thus extrapolate some results using ACL as a baseline.</p>
<p>In order to find out if it is worth it to remove samples and how to best go about storing the remaining samples, we will use the <em>Main Trooper</em> from the <a href="/2017/10/05/acl_in_ue4/">Matinee fight scene</a>. It has 541 bones with no animated 3D scale (1082 tracks in total) and the sequence has 1991 frames (~66 seconds long) per track. A total of 71 tracks are constant, 1 is default, and 1010 are animated. Where segmenting is concerned, I use the arbitrarily chosen segment #13 as a baseline. ACL splits this clip into 124 segments of about 16 frames each. This clip has a <em>LOT</em> of data and a high bone count which should highlight how well these algorithms scale.</p>
<p>We will track three things we care about:</p>
<ul>
<li>The number of bytes touched during decompression for a full pose</li>
<li>The number of cache lines touched during decompression for a full pose</li>
<li>The total compressed size</li>
</ul>
<p>Due to the simple nature of animation decompression, the number of cache lines touched is a good indicator of overall performance as it often can be memory bound.</p>
<p>All numbers will be rounded up to the nearest byte and cache line.</p>
<p>All three algorithms use linear interpolation and as such require two poses to interpolate our final result.</p>
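<p>As a point of reference, here is a minimal sketch of that interpolation step in C++. The types and function names are illustrative assumptions rather than ACL's or Unreal's actual API: rotations use a normalized linear blend (nlerp) while translations use a plain linear blend.</p>
<pre><code class="language-cpp">#include <cmath>
#include <cstddef>

// Illustrative pose types; real engines pack these differently.
struct Vector3 { float x, y, z; };
struct Quat { float x, y, z, w; };
struct Transform { Quat rotation; Vector3 translation; };

inline float lerp(float a, float b, float alpha) { return a + (b - a) * alpha; }

// Normalized lerp: cheaper than slerp and accurate enough for the closely
// spaced samples found in animation data.
inline Quat nlerp(Quat a, Quat b, float alpha)
{
    // Flip one quaternion if needed so we interpolate along the shortest arc.
    const float dot = a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
    const float sign = dot >= 0.0f ? 1.0f : -1.0f;
    Quat q = { lerp(a.x, sign * b.x, alpha), lerp(a.y, sign * b.y, alpha),
               lerp(a.z, sign * b.z, alpha), lerp(a.w, sign * b.w, alpha) };
    const float len = std::sqrt(q.x * q.x + q.y * q.y + q.z * q.z + q.w * q.w);
    return { q.x / len, q.y / len, q.z / len, q.w / len };
}

// Blend the two poses that bound the requested time into the output pose.
void interpolate_pose(const Transform* pose0, const Transform* pose1,
                      float alpha, Transform* out_pose, size_t num_bones)
{
    for (size_t i = 0; i < num_bones; ++i)
    {
        out_pose[i].rotation = nlerp(pose0[i].rotation, pose1[i].rotation, alpha);
        out_pose[i].translation = { lerp(pose0[i].translation.x, pose1[i].translation.x, alpha),
                                    lerp(pose0[i].translation.y, pose1[i].translation.y, alpha),
                                    lerp(pose0[i].translation.z, pose1[i].translation.z, alpha) };
    }
}
</code></pre>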
<p><em>See the annex at the end of the post for how the math breaks down</em></p>
<h3 id="the-shared-base">The shared base</h3>
<p>Some features are always a win and will be assumed present in our three algorithms:</p>
<ul>
<li>If a track has a single repeating sample (within a threshold), it will be deemed a <a href="/2016/11/03/anim_compression_constant_tracks/">constant track</a>. Constant tracks are collapsed into a single full resolution sample with 3x floats (even for rotations) with a single bit per track to tell them apart.</li>
<li>If a constant track is identical to the identity value for that track type (e.g. quaternion identity), it will be deemed a default track. Default tracks have no sample stored, just a single bit per track to tell them apart.</li>
<li>All animated tracks (not constant or default) will be normalized within their min/max values by performing <a href="/2016/11/09/anim_compression_range_reduction/">range reduction</a> over the whole clip. This increases accuracy, which leads to a lower memory footprint for the majority of clips. To do so, we store our range minimum and extent values as 3x floats each (even for rotations). A minimal sketch of this step follows the list.</li>
</ul>
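<p>To make that last point concrete, here is the range reduction sketch referenced above. It operates on a single float channel for brevity; real implementations work on whole vector3/quaternion channels and the names here are illustrative.</p>
<pre><code class="language-cpp">#include <algorithm>
#include <cstddef>

// Per-track range metadata, stored once per clip (3x floats each for the
// minimum and the extent when applied per component).
struct TrackRange { float minimum; float extent; };

TrackRange compute_range(const float* samples, size_t num_samples)
{
    float min_value = samples[0];
    float max_value = samples[0];
    for (size_t i = 1; i < num_samples; ++i)
    {
        min_value = std::min(min_value, samples[i]);
        max_value = std::max(max_value, samples[i]);
    }
    return { min_value, max_value - min_value };
}

// Normalized samples live in [0, 1] and can later be quantized with fewer
// bits while keeping the same absolute precision within the range of motion.
float normalize_sample(float value, const TrackRange& range)
{
    return range.extent != 0.0f ? (value - range.minimum) / range.extent : 0.0f;
}

float denormalize_sample(float normalized, const TrackRange& range)
{
    return normalized * range.extent + range.minimum;
}
</code></pre>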
<p>The current UE4 codecs do not have special treatment for constant and default tracks but they do support range reduction. There is room for improvement here but that is what ACL uses right now and it will be good enough for our calculations.</p>
<table>
<thead>
<tr>
<th> </th>
<th>Bytes touched</th>
<th>Cache lines touched</th>
<th>Compressed size</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Default bit set</strong></td>
<td>136</td>
<td>3</td>
<td>136 bytes</td>
</tr>
<tr>
<td><strong>Constant bit set</strong></td>
<td>136</td>
<td>3</td>
<td>136 bytes</td>
</tr>
<tr>
<td><strong>Constant values</strong></td>
<td>852</td>
<td>14</td>
<td>852 bytes</td>
</tr>
<tr>
<td><strong>Range values</strong></td>
<td>24240</td>
<td>379</td>
<td>24240 bytes</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td>25364</td>
<td>399</td>
<td>25 KB</td>
</tr>
</tbody>
</table>
<p>In order to support variable bit rates where each animated track can have its own bit rate, ACL stores 1 byte per animated track (and per segment). This is overkill as only 19 bit rates are currently supported but it keeps things simple and in the future the extra bits will be used for other things. When segmenting is enabled with ACL, range reduction is also performed per segment and adds 6 bytes of overhead per animated track for the quantized range values.</p>
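<p>The per-segment overhead figures in the table below follow from simple arithmetic on the numbers just stated: 1 byte of bit rate metadata and 6 bytes of quantized range values per animated track, repeated for every segment. A quick sketch to reproduce them:</p>
<pre><code class="language-cpp">#include <cstddef>
#include <cstdio>

int main()
{
    const size_t num_animated_tracks = 1010;
    const size_t num_segments = 124;

    const size_t bit_rates_per_segment = num_animated_tracks * 1; // 1010 bytes
    const size_t ranges_per_segment = num_animated_tracks * 6;    // 6060 bytes

    // Round up to the nearest KB, matching the tables in this post.
    std::printf("bit rates: %zu bytes/segment, %zu KB total\n",
                bit_rates_per_segment,
                (bit_rates_per_segment * num_segments + 1023) / 1024); // 123 KB
    std::printf("segment ranges: %zu bytes/segment, %zu KB total\n",
                ranges_per_segment,
                (ranges_per_segment * num_segments + 1023) / 1024);    // 734 KB
    return 0;
}
</code></pre>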
<table>
<thead>
<tr>
<th> </th>
<th>Bytes touched</th>
<th>Cache lines touched</th>
<th>Compressed size</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Bit rates</strong></td>
<td>1010</td>
<td>16</td>
<td>123 KB</td>
</tr>
<tr>
<td><strong>Segment range values</strong></td>
<td>6060</td>
<td>95</td>
<td>734 KB</td>
</tr>
</tbody>
</table>
<h3 id="the-acl-results">The ACL results</h3>
<p>We will consider two cases for ACL. The default compression settings have segmenting enabled which is great for the memory footprint and compression speed but due to the added memory overhead and range reduction, decompression is a bit slower. As such, we will also consider the case where we disable segmenting in order to bias for faster decompression.</p>
<p>With segmenting, the animated pose size (just the samples, without the bit rate and segment range values) for segment #13 is 3777 bytes (60 cache lines). This represents about 3.74 bytes per sample or about 30 <strong>b</strong>its <strong>p</strong>er <strong>s</strong>ample (<strong>bps</strong>).</p>
<p>Without segmenting, the animated pose size is 5903 bytes (93 cache lines). This represents about 5.84 bytes per sample or about 46 <strong>bps</strong>.</p>
<p>Although the UE4 codecs do not support variable bit rates the way ACL does, we will assume that we use the same algorithm and as such these numbers of 30 and 46 bits per sample will be used in our projections. Because 30 bps is only attainable with segmenting enabled, we will also assume it is enabled for the sample reduction algorithms when using that bit rate.</p>
<table>
<thead>
<tr>
<th> </th>
<th>Bytes touched</th>
<th>Cache lines touched</th>
<th>Compressed size</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>With segmenting</strong></td>
<td>39990</td>
<td>625</td>
<td>7695 KB</td>
</tr>
<tr>
<td><strong>Without segmenting</strong></td>
<td>38172</td>
<td>597</td>
<td>11503 KB</td>
</tr>
</tbody>
</table>
<p>As we can see, while segmenting considerably reduces the overall memory footprint (by 33%), it does contribute to quite a few extra cache lines being touched (4.7% more) during decompression despite the animated pose being 36% smaller. This highlights how normalizing the samples within the range of each segment increases their overall accuracy and reduces the number of bits required to maintain it.</p>
<h3 id="unsorted-sample-reduction">Unsorted sample reduction</h3>
<p>In order to decompress when samples are missing, we need to store the sample time (or index). For every track, we will search for the two samples that bound the current time we are interpolating at and reconstruct the correct interpolation alpha from them. To keep things simple, we will store this time value on 1 byte per sample retained (for a maximum of 256 samples per clip or segment) along with the total number of samples retained per track on 1 byte. Supporting arbitrary track decompression efficiently also requires storing an offset map where each track begins. For simplicity’s sake, we will omit this overhead but UE4 uses 4 bytes per track. When decompressing, we will assume that we immediately find the two samples we need within a single cache line and that both samples are within another cache line (2 cache misses per track).</p>
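<p>A minimal sketch of that seek follows, using the 1 byte per sample time assumed above; the names are illustrative. It assumes at least two retained samples per track (tracks with a single sample are handled as constant tracks) and omits clamping at the clip boundaries.</p>
<pre><code class="language-cpp">#include <cstdint>

// Per-track data for the unsorted layout: retained sample times followed
// by the number of retained samples.
struct TrackSamples
{
    const uint8_t* sample_times; // sorted, one byte per retained sample
    uint8_t num_samples;         // number of retained samples (>= 2)
};

// Find the two retained samples bounding `sample_time` and reconstruct the
// interpolation alpha between them.
void find_bounding_samples(const TrackSamples& track, float sample_time,
                           uint32_t& out_index0, uint32_t& out_index1, float& out_alpha)
{
    const uint32_t last_sample = uint32_t(track.num_samples) - 1;
    uint32_t index1 = 1;
    while (index1 < last_sample && float(track.sample_times[index1]) < sample_time)
        ++index1; // linear scan; a binary search works as well

    const float time0 = float(track.sample_times[index1 - 1]);
    const float time1 = float(track.sample_times[index1]);

    out_index0 = index1 - 1;
    out_index1 = index1;
    out_alpha = (sample_time - time0) / (time1 - time0);
}
</code></pre>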
<p>These estimates are very conservative. In practice, the offsets are required to support <em>Levels of Detail</em> as well as efficient single bone decompression and more than 1 byte is often required for sample indices and their count in larger clips. In the wild, the memory footprint is surely going to be larger than these projections will show.</p>
<p><em>Values below assume every sample is retained, for now.</em></p>
<table>
<thead>
<tr>
<th> </th>
<th>Bytes touched</th>
<th>Cache lines touched</th>
<th>Compressed size</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Sample times</strong></td>
<td>2020</td>
<td>1010</td>
<td>1964 KB</td>
</tr>
<tr>
<td><strong>Sample count with segmenting</strong></td>
<td>1010</td>
<td>16</td>
<td>123 KB</td>
</tr>
<tr>
<td><strong>Sample values with segmenting</strong></td>
<td>7575</td>
<td>1010</td>
<td>7346 KB</td>
</tr>
<tr>
<td><strong>Total with segmenting</strong></td>
<td>43039</td>
<td>2546</td>
<td>10315 KB</td>
</tr>
<tr>
<td><strong>Sample count without segmenting</strong></td>
<td>1010</td>
<td>16</td>
<td>1 KB</td>
</tr>
<tr>
<td><strong>Sample values without segmenting</strong></td>
<td>11615</td>
<td>1010</td>
<td>11470 KB</td>
</tr>
<tr>
<td><strong>Total without segmenting</strong></td>
<td>40009</td>
<td>2435</td>
<td>13583 KB</td>
</tr>
</tbody>
</table>
<p>When our samples are unsorted, it becomes obvious why decompression is quite slow. The number of cache lines touched is staggering: 2435 cache lines, which represents 153 KB! This scales linearly with the number of animated tracks. We can also see that despite the added overhead of segmenting, the overall memory footprint is lower (by roughly 24%) but not by as much as with ACL.</p>
<p><em>Despite my claims from the first post, segmenting appears attractive here. This is a direct result of our estimates being conservative and UE4 not supporting the aggressive per track quantization that ACL provides.</em></p>
<h3 id="sorted-sample-reduction">Sorted sample reduction</h3>
<p>With our samples sorted, we will add 16 bits of metadata per sample to store the sample time, the track index, and track type. This is optimistic. In reality, some samples would likely require more than that.</p>
<p>Our context will only store animated tracks to keep it as small as possible. We will consider two scenarios: when our samples are stored with full precision (96 bits per sample) and when they are packed with the same format as the compressed byte stream (in which case we simply copy the values into our context when seeking, no unpacking occurs). As previously mentioned, linear interpolation requires us to store two samples per animated track.</p>
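<p>Here is one plausible way to pack those 16 bits of metadata, purely as an estimate: 2 bits for the track type, 10 bits for the track index (enough for our 1010 animated tracks), and 4 bits for the sample time stored as a delta from the previous sample. A real format would need escape codes for larger deltas or larger clips, which is why the estimate is optimistic.</p>
<pre><code class="language-cpp">#include <cstdint>

enum class SampleType : uint16_t { Rotation = 0, Translation = 1, Scale = 2 };

// Pack: [type:2][track index:10][time delta:4]
inline uint16_t pack_sample_metadata(SampleType type, uint16_t track_index, uint16_t time_delta)
{
    return uint16_t((uint16_t(type) << 14) | ((track_index & 0x3FF) << 4) | (time_delta & 0xF));
}

inline void unpack_sample_metadata(uint16_t metadata, SampleType& out_type,
                                   uint16_t& out_track_index, uint16_t& out_time_delta)
{
    out_type = SampleType(metadata >> 14);
    out_track_index = (metadata >> 4) & 0x3FF;
    out_time_delta = metadata & 0xF;
}
</code></pre>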
<table>
<thead>
<tr>
<th> </th>
<th>Bytes touched</th>
<th>Cache lines touched</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Context values @ 96 bps</strong></td>
<td>24240</td>
<td>379</td>
</tr>
<tr>
<td><strong>Context values @ 46 bps</strong></td>
<td>12808</td>
<td>198</td>
</tr>
<tr>
<td><strong>Context values @ 30 bps</strong></td>
<td>14626</td>
<td>230</td>
</tr>
</tbody>
</table>
<p>Right off the bat, it is clear that if we want interpolation to be as fast as possible (with no unpacking), our context is quite large and requires evicting quite a bit of CPU cache. To keep the context footprint as low as possible, going forward we will assume that we store the values packed inside it. Storing packed samples into our context comes with challenges. Each segment will have a different pose size and as such we either need to resize the context or allocate it with the largest pose size. When packed in the compressed byte stream, each sample is often bit aligned and copying into another bit aligned buffer is more expensive than a regular <code class="language-plaintext highlighter-rouge">memcpy</code> operation. Keeping the context size low requires some work.</p>
<p><em>Note that the above numbers also include the bytes touched for the bit rates and segment range values (where relevant) because they are needed for interpolation when samples are packed.</em></p>
<table>
<thead>
<tr>
<th> </th>
<th>Bytes touched</th>
<th>Cache lines touched</th>
<th>Compressed size</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Compressed values with segmenting</strong></td>
<td>5798</td>
<td>91</td>
<td>11274 KB</td>
</tr>
<tr>
<td><strong>Compressed values without segmenting</strong></td>
<td>7919</td>
<td>123</td>
<td>15398 KB</td>
</tr>
<tr>
<td><strong>Total with segmenting</strong></td>
<td>45788</td>
<td>720</td>
<td>12156 KB</td>
</tr>
<tr>
<td><strong>Total without segmenting</strong></td>
<td>46091</td>
<td>720</td>
<td>15424 KB</td>
</tr>
</tbody>
</table>
<p><em>Note that the compressed size above does not consider the footprint of the context required at runtime to decompress but the bytes and cache lines touched do.</em></p>
<p>Compared to the unsorted algorithm, the memory overhead goes up quite a bit: the constant struggle between size and speed.</p>
<p>The number of cache lines touched during decompression is quite a bit higher (15%) than ACL. In order to match ACL, 100% of the samples would have to be dropped, which makes sense: we use the same bit rates as ACL in our estimates and our context stores two poses, which is also what ACL touches. Reading the compressed pose is added on top of this. As such, if every sample within a pose is dropped and already present in our context, the decompression cost will at best be identical to ACL since the two will perform about the same amount of work. Reading compressed samples and copying them into our context will take some time and lead to a net win for ACL.</p>
<p>If instead we keep the context at full precision and unpack samples once into it, the picture becomes a bit more complicated. If no unpacking occurs and we simply interpolate from the context, we will be touching more cache lines overall but everything will be very hardware friendly. Decompression is likely to beat ACL but the evicted CPU cache might slow down the caller slightly. Whether this yields a net win is uncertain. Any amount of unpacking that might be required will slow things down further as well. Ultimately, even if enough samples are dropped and it is faster, it will come with a noticeable increase in runtime memory footprint and the complexity to manage a persistent context.</p>
<h3 id="three-challengers-enter-only-one-emerges-victorious">Three challengers enter, only one emerges victorious</h3>
<table>
<thead>
<tr>
<th> </th>
<th>Bytes touched</th>
<th>Cache lines touched</th>
<th>Compressed size</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Uniform with segmenting</strong></td>
<td>39990</td>
<td>625</td>
<td>7695 KB</td>
</tr>
<tr>
<td><strong>Unsorted with segmenting</strong></td>
<td>43039</td>
<td>2546</td>
<td>10315 KB</td>
</tr>
<tr>
<td><strong>Sorted with segmenting</strong></td>
<td>45788</td>
<td>720</td>
<td>12156 KB</td>
</tr>
<tr>
<td><strong>Uniform without segmenting</strong></td>
<td>38172</td>
<td>597</td>
<td>11503 KB</td>
</tr>
<tr>
<td><strong>Unsorted without segmenting</strong></td>
<td>40009</td>
<td>2435</td>
<td>13583 KB</td>
</tr>
<tr>
<td><strong>Sorted without segmenting</strong></td>
<td>46091</td>
<td>720</td>
<td>15424 KB</td>
</tr>
</tbody>
</table>
<p>ACL retains every sample, touches the least amount of memory, and puts the least pressure on the CPU cache. However, so far our estimates assumed that all samples were retained and as such, we cannot make a determination as to whether or not it also wins on the overall memory footprint. What we can do, however, is determine how many samples we need to drop in order to match it.</p>
<p>With unsorted samples, we have to drop roughly 30% of our samples in order to match the compressed memory footprint of ACL with segmenting and 15% without. However, regardless of how many samples we remove, the decompression performance will never come close to the other two techniques due to the extremely high number of cache misses.</p>
<p>With sorted samples, we have to drop roughly 40% of our samples in order to match the compressed memory footprint of ACL with segmenting and 25% without. The number of cache lines touched during decompression is now quite a bit closer to ACL compared to the unsorted algorithm. It may or may not end up being faster to decompress depending on how many samples are removed, the context format used, and how many samples need unpacking but it will always evict more of the CPU cache.</p>
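<p>These break-even drop rates follow directly from the annex formulas: a reduction algorithm must drop enough sample data to pay for its extra metadata. The sketch below recomputes them from the measured totals, using ACL's actual compressed sizes (7695 KB and 11503 KB) as the targets.</p>
<pre><code class="language-cpp">#include <cstdio>

int main()
{
    struct Variant { const char* name; double total_kb; double acl_kb; double sample_data_kb; };
    const Variant variants[] = {
        { "unsorted, with segmenting",    10315.0, 7695.0,  9310.0 },  // ~28%
        { "unsorted, without segmenting", 13583.0, 11503.0, 13434.0 }, // ~15%
        { "sorted, with segmenting",      12156.0, 7695.0,  11274.0 }, // ~40%
        { "sorted, without segmenting",   15424.0, 11503.0, 15398.0 }, // ~25%
    };

    for (const Variant& v : variants)
    {
        const double drop_rate = (v.total_kb - v.acl_kb) / v.sample_data_kb;
        std::printf("%s: %.1f%%\n", v.name, drop_rate * 100.0);
    }
    return 0;
}
</code></pre>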
<p>It is worth noting that most of these numbers remain true if cubic interpolation is used instead. While the context object will double in size and thus require more cache lines to be touched during decompression, the total compressed size will remain the same if the same number of samples are retained.</p>
<p><a href="/2019/07/31/pitfalls_linear_reduction_part4/">The fourth</a> and last blog post in the series will look at how many samples are actually removed in <em>Paragon</em> and <em>Fortnite</em>. This will complete the puzzle and paint a clear picture of the strengths and weaknesses of linear sample reduction techniques.</p>
<h3 id="annex">Annex</h3>
<p>Inputs:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">num_segments = 124</code></li>
<li><code class="language-plaintext highlighter-rouge">num_samples_per_track = 1991</code></li>
<li><code class="language-plaintext highlighter-rouge">num_tracks = 1082</code></li>
<li><code class="language-plaintext highlighter-rouge">num_animated_tracks = 1010</code></li>
<li><code class="language-plaintext highlighter-rouge">num_constant_tracks = 71</code></li>
<li><code class="language-plaintext highlighter-rouge">bytes_per_sample_with_segmenting = 3.74</code></li>
<li><code class="language-plaintext highlighter-rouge">bytes_per_sample_without_segmenting = 5.84</code></li>
<li><code class="language-plaintext highlighter-rouge">num_animated_samples = num_samples_per_track * num_animated_tracks = 2010910</code></li>
<li><code class="language-plaintext highlighter-rouge">num_pose_to_interpolate = 2</code></li>
</ul>
<p>Shared math:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">bitset_size = num_tracks / 8 = 136 bytes</code></li>
<li><code class="language-plaintext highlighter-rouge">constant_values_size = num_constant_tracks * sizeof(float) * 3 = 852 bytes</code></li>
<li><code class="language-plaintext highlighter-rouge">range_values_size = num_animated_tracks * sizeof(float) * 6 = 24240 bytes</code></li>
<li><code class="language-plaintext highlighter-rouge">clip_shared_size = bitset_size * 2 + constant_values_size + range_values_size = 25364 bytes = 25 KB</code></li>
<li><code class="language-plaintext highlighter-rouge">bit_rates_size = num_animated_tracks * 1 = 1010 bytes</code></li>
<li><code class="language-plaintext highlighter-rouge">bit_rates_size_total_with_segmenting = bit_rates_size * num_segments = 123 KB</code></li>
<li><code class="language-plaintext highlighter-rouge">segment_range_values_size = num_animated_tracks * 6 = 6060 bytes</code></li>
<li><code class="language-plaintext highlighter-rouge">segment_range_values_size_total = segment_range_values_size * num_segments = 734 KB</code></li>
<li><code class="language-plaintext highlighter-rouge">animated_pose_size_with_segmenting = bytes_per_sample_with_segmenting * num_animated_tracks = 3778 bytes</code></li>
<li><code class="language-plaintext highlighter-rouge">animated_pose_size_without_segmenting = bytes_per_sample_without_segmenting * num_animated_tracks = 5899 bytes</code></li>
</ul>
<p>ACL math:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">acl_animated_size_with_segmenting = animated_pose_size_with_segmenting * num_samples_per_track = 7346 KB</code></li>
<li><code class="language-plaintext highlighter-rouge">acl_animated_size_without_segmenting = animated_pose_size_without_segmenting * num_samples_per_track = 11470 KB</code></li>
<li><code class="language-plaintext highlighter-rouge">acl_decomp_bytes_touched_with_segmenting = clip_shared_size + bit_rates_size + segment_range_values_size + animated_pose_size_with_segmenting * num_pose_to_interpolate = 39990 bytes</code></li>
<li><code class="language-plaintext highlighter-rouge">acl_decomp_bytes_touched_without_segmenting = clip_shared_size + bit_rates_size + animated_pose_size_without_segmenting * num_pose_to_interpolate = 38172 bytes</code></li>
<li><code class="language-plaintext highlighter-rouge">acl_size_with_segmenting = clip_shared_size + bit_rates_size_total_with_segmenting + segment_range_values_size_total + acl_animated_size_with_segmenting = 8106 KB</code> (actual size is lower due to the bytes per sample changing from segment to segment)</li>
<li><code class="language-plaintext highlighter-rouge">acl_size_without_segmenting = clip_shared_size + bit_rates_size + acl_animated_size_without_segmenting = 11496 KB</code> (actual size is higher by a few bytes due to misc. clip overhead)</li>
</ul>
<p>Unsorted linear sample reduction math:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">unsorted_decomp_bytes_touched_sample_times = num_animated_tracks * 1 * num_pose_to_interpolate = 1010 bytes</code></li>
<li><code class="language-plaintext highlighter-rouge">unsorted_sample_times_size_total = num_animated_samples * 1 = 1964 KB</code></li>
<li><code class="language-plaintext highlighter-rouge">unsorted_sample_counts_size = num_animated_tracks * 1 = 1010 bytes</code></li>
<li><code class="language-plaintext highlighter-rouge">unsorted_sample_counts_size_total = unsorted_sample_counts_size * num_segments = 123 KB</code></li>
<li><code class="language-plaintext highlighter-rouge">unsorted_animated_size_with_segmenting = acl_animated_size_with_segmenting = 7346 KB</code></li>
<li><code class="language-plaintext highlighter-rouge">unsorted_animated_size_without_segmenting = acl_animated_size_without_segmenting = 11470 KB</code></li>
<li><code class="language-plaintext highlighter-rouge">unsorted_total_size_with_segmenting = clip_shared_size + bit_rates_size_total_with_segmenting + segment_range_values_size_total + unsorted_sample_times_size_total + unsorted_sample_counts_size_total + unsorted_animated_size_with_segmenting = 10315 KB</code></li>
<li><code class="language-plaintext highlighter-rouge">unsorted_total_size_without_segmenting = clip_shared_size + bit_rates_size_total + unsorted_sample_times_size_total + unsorted_sample_counts_size + unsorted_animated_size_without_segmenting = 13583 KB</code></li>
<li><code class="language-plaintext highlighter-rouge">unsorted_total_sample_size_with_segmenting = unsorted_sample_times_size_total + unsorted_animated_size_with_segmenting = 9310 KB</code></li>
<li><code class="language-plaintext highlighter-rouge">unsorted_total_sample_size_without_segmenting = unsorted_sample_times_size_total + unsorted_animated_size_without_segmenting = 13434 KB</code></li>
<li><code class="language-plaintext highlighter-rouge">unsorted_drop_rate_with_segmenting = (unsorted_total_size_with_segmenting - acl_size_with_segmenting) / unsorted_total_sample_size_with_segmenting = 28 %</code></li>
<li><code class="language-plaintext highlighter-rouge">unsorted_drop_rate_without_segmenting = (unsorted_total_size_without_segmenting - acl_size_without_segmenting) / unsorted_total_sample_size_without_segmenting = 15.5 %</code></li>
</ul>
<p>Sorted linear sample reduction math:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">full_resolution_context_size = num_animated_tracks * num_pose_to_interpolate * sizeof(float) * 3 = 24240 bytes = 24 KB</code></li>
<li><code class="language-plaintext highlighter-rouge">with_segmenting_context_size = num_pose_to_interpolate * animated_pose_size_with_segmenting = 7556 bytes</code></li>
<li><code class="language-plaintext highlighter-rouge">without_segmenting_context_size = num_pose_to_interpolate * animated_pose_size_without_segmenting = 11798 bytes</code></li>
<li><code class="language-plaintext highlighter-rouge">with_segmenting_context_decomp_bytes_touched = with_segmenting_context_size + bit_rates_size + segment_range_values_size = 14626 bytes = 15 KB</code></li>
<li><code class="language-plaintext highlighter-rouge">without_segmenting_context_decomp_bytes_touched = without_segmenting_context_size + bit_rates_size = 12808 bytes = 13 KB</code></li>
<li><code class="language-plaintext highlighter-rouge">sorted_decomp_compressed_bytes_touched_with_segmenting = num_animated_tracks * sizeof(uint16) + animated_pose_size_with_segmenting = 5798 bytes</code></li>
<li><code class="language-plaintext highlighter-rouge">sorted_decomp_compressed_bytes_touched_without_segmenting = num_animated_tracks * sizeof(uint16) + animated_pose_size_without_segmenting = 7919 bytes</code></li>
<li><code class="language-plaintext highlighter-rouge">sorted_animated_size_with_segmenting = num_animated_samples * sizeof(uint16) + acl_animated_size_with_segmenting = 11274 KB</code></li>
<li><code class="language-plaintext highlighter-rouge">sorted_animated_size_without_segmenting = num_animated_samples * sizeof(uint16) + acl_animated_size_without_segmenting = 15398 KB</code></li>
<li><code class="language-plaintext highlighter-rouge">sorted_decomp_bytes_touched_with_segmenting = with_segmenting_context_decomp_bytes_touched + sorted_decomp_compressed_bytes_touched_with_segmenting + clip_shared_size = 45788 bytes = 45 KB</code></li>
<li><code class="language-plaintext highlighter-rouge">sorted_decomp_bytes_touched_without_segmenting = without_segmenting_context_decomp_bytes_touched + sorted_decomp_compressed_bytes_touched_without_segmenting + clip_shared_size = 46091 bytes = 46 KB</code></li>
<li><code class="language-plaintext highlighter-rouge">sorted_total_size_with_segmenting = clip_shared_size + bit_rates_size_total_with_segmenting + segment_range_values_size_total + sorted_animated_size_with_segmenting = 12156 KB</code></li>
<li><code class="language-plaintext highlighter-rouge">sorted_total_size_without_segmenting = clip_shared_size + bit_rates_size + sorted_animated_size_without_segmenting = 15424 KB</code></li>
<li><code class="language-plaintext highlighter-rouge">sorted_drop_rate_with_segmenting = (sorted_total_size_with_segmenting - acl_size_with_segmenting) / sorted_animated_size_with_segmenting = 40 %</code></li>
<li><code class="language-plaintext highlighter-rouge">sorted_drop_rate_without_segmenting = (sorted_total_size_without_segmenting - acl_size_without_segmenting) / sorted_animated_size_without_segmenting = 25.5 %</code></li>
</ul>
Pitfalls of linear sample reduction: Part 22019-07-25T00:00:00+00:00http://nfrechette.github.io/2019/07/25/pitfalls_linear_reduction_part2<script async="" src="//s.imgur.com/min/embed.js" charset="utf-8"></script>
<p><strong>A quick recap:</strong> animation clips are formed from a set of time series called tracks. Tracks have a fixed number of samples per second and each track has the same length. The <a href="https://github.com/nfrechette/acl">Animation Compression Library</a> retains every sample while the most commonly used <em>Unreal Engine</em> codecs use the popular method of <a href="/2016/12/07/anim_compression_key_reduction/">removing samples that can be linearly interpolated</a> from their neighbors.</p>
<p>The <a href="/2019/07/23/pitfalls_linear_reduction_part1/">first post</a> showed how removing samples negates some of the benefits that come from segmenting, rendering the technique a lot less effective.</p>
<p>Another area where it struggles is decompression performance. When we want to sample a clip at a particular point in time, we have to search and find the closest samples in order to interpolate them. <em>Unreal Engine</em> lays out the data per track: the indices for all the retained samples are followed by their sample values.</p>
<p><img src="/public/offset_map.jpg" alt="Offset map" /></p>
<p>This comes with a major performance drawback: each track will incur a cache miss for the sample indices in order to find the neighbors and another cache miss to read the two samples we need to interpolate. This is <em>very</em> slow. Each memory access will be random, preventing the hardware prefetcher from hiding the memory access latency. Even if we manage to prefetch by hand, we still touch a very large number of cache lines. Worse still, each cache line is only partially used as it also contains data we will not need. In the end, a significant portion of our CPU cache will be evicted with data that will only be read once.</p>
<p><img src="/public/sorted_uniform_samples.jpg" alt="Sorted uniform samples" /></p>
<p>In contrast, ACL retains every sample and sorts them by time (sample index). This ensures that all the samples we need at a particular point in time are contiguous in memory. Sampling our clip becomes very fast:</p>
<ul>
<li>We don’t need to search for our neighbors, just where the first sample lives</li>
<li>We don’t need to read indices, offsets, or the number of samples retained</li>
<li>Each cache line is fully used</li>
<li>The hardware prefetcher will detect our predictable access pattern and work properly</li>
</ul>
<p>Sorting is clearly the key to fast decompression.</p>
<h2 id="sorting-retained-samples">Sorting retained samples</h2>
<p>Back in 2017, if you searched for ‘‘<em>animation compression</em>’’, the most popular blog posts were <a href="http://bitsquid.blogspot.com/2011/10/low-level-animation-part-2.html">one by Bitsquid</a> which advocates using curve fitting with sorted samples for fast and cache friendly decompression and <a href="https://technology.riotgames.com/news/compressing-skeletal-animation-data">a post by Riot Games</a> about trying the same technique with some success.</p>
<p><img src="/public/sorted_samples.jpg" alt="Sorted samples" /></p>
<p>Without getting into the details too much (the two posts above explain it quite well), you sort the samples by the time you need them <em>at</em> (<strong>NOT</strong> by their sample time) and you keep track from frame to frame where and what you last decompressed from the compressed byte stream. Once decompressed, samples are stored (often raw, unpacked) in a persistent context object that is reused from frame to frame. This allows you to touch the least possible amount of <em>compressed</em> contiguous data every frame by unpacking only the new samples that you need to interpolate at the current desired time. Once all your samples are unpacked inside the context, interpolation is <em>very</em> fast. You can use tons of tricks like <em>Structure of Arrays</em>, wider SIMD registers with AVX, and you can easily interpolate two or three samples at a time in order to use all available registers and minimize pipeline stalls. This requires keeping a context object around all the time but it is by far the fastest way to interpolate a compressed animation because you can avoid unpacking samples that have already been cached.</p>
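<p>At its core, the technique is a forward-only cursor over the sorted stream. The sketch below is an illustration under simplifying assumptions (single float channels, no bit packing), not any engine's actual format.</p>
<pre><code class="language-cpp">#include <cstddef>
#include <cstdint>

// A sample in the sorted stream, ordered by the time it is needed at.
struct SortedSample
{
    float needed_at;      // time at which this sample becomes needed
    uint16_t track_index; // which track the value belongs to
    float value;          // one float channel, for simplicity
};

// Persistent context, reused from frame to frame.
struct DecompressionContext
{
    const SortedSample* stream; // sorted compressed stream
    size_t stream_size;
    size_t cursor;              // next sample to unpack
    float* values0;             // previous sample per track
    float* values1;             // next sample per track
};

// Advance to `sample_time`, unpacking only the samples that became needed
// since the previous call. Interpolation then reads values0/values1 without
// touching the compressed stream again.
void seek_forward(DecompressionContext& ctx, float sample_time)
{
    while (ctx.cursor < ctx.stream_size &&
           ctx.stream[ctx.cursor].needed_at <= sample_time)
    {
        const SortedSample& sample = ctx.stream[ctx.cursor++];
        ctx.values0[sample.track_index] = ctx.values1[sample.track_index];
        ctx.values1[sample.track_index] = sample.value;
    }
}
</code></pre>
<p>Played strictly forward, this touches each compressed byte exactly once. The drawbacks below all stem from that cursor only being able to move forward.</p>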
<p>Sorting our samples keeps things contiguous and CPU cache friendly and as such it stood to reason that it was a good idea worth trying. Some <em>Unreal Engine</em> codecs already support linear sample reduction and as such the remaining samples were simply sorted in the same manner.</p>
<p>With this new technique, decompression was up to <strong>4x</strong> faster on PC. It looked <strong>phenomenal</strong> in a <em>gym</em> (an isolated test level).</p>
<iframe src="https://giphy.com/embed/qTK56RyKJIZ9K" width="480" height="400" frameborder="0" class="giphy-embed" allowfullscreen=""></iframe>
<p>Unfortunately, this technique has a number of drawbacks that are either skimmed over, downplayed, or not mentioned at all in the two blog posts advocating it. Those proved too significant to ignore in <em>Fortnite</em>. Sometimes being the fastest isn’t the best idea.</p>
<h3 id="no-u-turn-allowed">No U-turn allowed</h3>
<p>By sorting the samples, the playback direction must now match the sort direction. If you attempt to play a sequence backwards that has its samples sorted forward, you must seek and read everything until you find the ones you need. You can sort the samples backwards to fix this but forward playback will now have the same issue. There is no optimal sort order for both playback directions. Similarly, random seeks in a sequence have equally abysmal performance.</p>
<p>As <em>Bitsquid</em> mentions, this can be mitigated by introducing a full frame of data at specific intervals to avoid fully reading everything or segments can be used for the same purpose. This comes at the cost of a larger memory footprint and it does not offset entirely the extra work done when seeking. One would think that most clips play forward in time and it isn’t that big a deal. Sure some clips play backward or randomly but those aren’t that common, right?</p>
<iframe src="https://giphy.com/embed/kZGxyEX1m9t7y" width="480" height="469" frameborder="0" class="giphy-embed" allowfullscreen=""></iframe>
<p>In practice, things are not always this way. Many prop animations will play forward <em>and</em> backward at runtime. For example, a chest opening and closing might have a single animation played in both directions. The same idea can be used with all sorts of other objects like doors, windows, etc. A more subtle problem is clips that play forward in time but do not start playing at the beginning. With motion matching, when you transition from one clip to another you will often not start playing the new clip at frame 0. When playback starts, you will have to read and skip samples, creating a performance spike. This can also happen with clips that advance in time but do not decompress due to <em>Level of Detail (LOD)</em> constraints. As soon as those characters start animating, performance spikes. It is also worth noting that even if you start playing at the first frame, you need to unpack everything required to prime the context, which creates a performance spike regardless.</p>
<h3 id="30-fps-is-more-cinematic">30 FPS is ‘more cinematic’</h3>
<p>It is not unusual for clips to be exported with varying sample rates such as 30, 60, or 120 FPS (and everything in between). Equally common are animations that play at various rates. However, unlike other techniques, these properties combine and can become problematic. If we play an animation faster than its sample rate (e.g. a 60 FPS game with 30 FPS animations) we will have frames where no data needs to be unpacked from the compressed stream and we can interpolate entirely from the context. This is very fast but it does mean that our decompression performance is inconsistent as some frames will need to unpack samples while others will not. This typically isn’t that big a deal but things get much worse if we play back slower than the sample rate (e.g. a 30 FPS game with 60 FPS animations). Our clip contains many more samples that we will not need to interpolate and because they are sorted, we will have to read and skip them in order to reach the ones we need. When samples are removed and sorted, decompression performance becomes a function of the playback speed and the animation sample rate. Such a problematic scenario can arise if an animation (such as a cinematic) requires a very high sample rate to maintain its quality (perhaps due to cloth and hair simulations).</p>
<h3 id="just-one-please">Just one please</h3>
<p>Although not as common, single bone decompression performance is pretty bad. Because all the data is mixed together, decompressing a specific bone requires decompressing (or at least skipping) all the other data. This is fine if you rarely do it or if it’s done at the same time as sampling the full pose while sharing the context between calls but this is not always possible. In some cases you have to sample individual bones at runtime for various gameplay or AI purposes (e.g. to predict where a bone will land in the future). This same property of the data means that lowering the LOD does not speed up seeking nor does it reduce the context memory footprint as everything needs to be unpacked just in case the LOD changes suddenly (although you can get away with interpolating only what you need).</p>
<h3 id="one-more-byte">One more byte</h3>
<blockquote class="imgur-embed-pub" lang="en" data-id="zussD" data-context="false"><a href="//imgur.com/zussD">Just one more bite</a></blockquote>
<p>Sorting the samples means that there is no pattern to them and metadata per sample needs to be introduced. You need to be able to tell what type a sample is (rotation, translation, 3D scale, scalar), at what time the sample appears, and which bone it belongs to. This overhead being per sample adds up quickly. You can use all sorts of clever tricks to use fewer bits if the bone index and sample time index are small compared to the previous one but ultimately it is quite a bit larger than alternative techniques and it cannot be hidden or removed entirely.</p>
<p>With all of this data, the context object becomes quite large to the point where its memory footprint cannot be ignored in a game with many animations playing at the same time. In order to interpolate, you need at least two full poses with linear interpolation (four if you use cubic curves) stored inside along with other metadata to keep track of things. For example, if a bone transform needs a <em>quaternion</em> (rotation) and a <em>vector3</em> (translation) for a total of 28 bytes, 100 bones will require 2.7 KB for a single pose and 5.5 KB for two (and often 3D scale is needed, adding even more data). With curves, those balloon to 11 KB and by touching them you evict over 30% of your CPU L1 cache (most CPUs have 32 KB of L1) for data that will not be needed again until the next frame. This is not cheap.</p>
<p>It is clear that while we touch less compressed memory and avoid the price of unpacking it, we end up accessing quite a bit of uncompressed memory, evicting precious CPU cache lines in the process. Typically, once decompression ends, the resulting pose will be blended with another intermediate pose later to be blended with another, and another. All of these intermediate poses benefit from remaining in the cache because they are needed over and over often reused by new decompression and pose blending calls. As such, the true cost of decompression cannot be measured easily: the cache impact can slow down the calling code as well. While sorting is definitely more cache friendly than not doing so when samples are removed, whether this is more so than retaining every sample is not as obvious.</p>
<p>You can keep the poses packed in some way within the context, either with the same format as the compressed stream or packed in a friendlier format at the cost of interpolation performance. Regardless, the overhead adds up and in a game like <em>Fortnite</em> where you can have <strong>50 vs 50</strong> players fighting with props, pets, and other things animating all at the same time, the overall memory footprint ended up too large to be acceptable on mobile devices. We attempted to not retain a context object per animation that was playing back, sharing them across characters and threads but this added a lot of complexity and we still had performance spikes from the higher amount of seeking. You can have a moderate amount of expensive seeks or a lower runtime memory footprint but not both.</p>
<p><em>This last point ended up being an issue even without any sorting (just with segmenting). Even though the memory overhead was not as significant, it still proved to be above what we would have liked, the complexity too high, and the decompression performance too unpredictable.</em></p>
<h3 id="a-lot-of-complexity-for-not-much">A lot of complexity for not much</h3>
<p>Overall, sorting the retained samples slightly increased the compressed size, increased the runtime memory footprint, and made decompression performance erratic even though peak decompression speed went up. Overcoming these issues would have required a lot more time and effort. With the complexity already very high, I was not confident I could beat the unsorted codecs consistently. We ultimately decided not to use this technique: the runtime memory footprint grew beyond what we considered acceptable for <em>Fortnite</em> and the decompression performance was too erratic.</p>
<p>Although the technique appeared very attractive as presented by <em>Bitsquid</em>, it ended up being quite underwhelming. This was made all the more apparent by my parallel efforts with ACL that retained every sample yet achieved remarkable results with little to no complexity. ACL has consistent and fast decompression performance regardless of the playback rate, playback direction, or sample rate and it does this without the need for a persistent context.</p>
<p>When linear sample reduction is used, both sorted and unsorted algorithms have significant drawbacks when it comes to decompression performance and memory usage. While both techniques require extra metadata that increases their memory footprint, if enough samples are removed, the overhead can be offset to yield a net win. The <a href="/2019/07/29/pitfalls_linear_reduction_part3/">next post</a> will look into how many samples need to be removed in order to beat ACL which retains all of them.</p>
Pitfalls of linear sample reduction: Part 12019-07-23T00:00:00+00:00http://nfrechette.github.io/2019/07/23/pitfalls_linear_reduction_part1<p>Two years ago I worked with <em>Epic</em> to try to improve the <em>Unreal Engine</em> animation compression codecs. <a href="/2019/06/14/acl_plugin_v0.4.0/">While I managed to significantly improve the compression performance</a>, despite my best intentions, a lot of research, and hard work, some of the ideas I tried failed to improve the decompression speed and memory footprint. In the end their numbers just didn’t add up when working at the scale <em>Fortnite</em> operates at.</p>
<p>I am now refactoring Unreal’s codec API to natively support animation compression plugins. As part of that effort, I am removing the experimental ideas I had added and I thought they deserved their own blog posts. There are many academic papers and posts about things that worked but very few about those that didn’t.</p>
<p>The UE4 codecs rely heavily on <a href="/2016/12/07/anim_compression_key_reduction/">linear sample reduction</a>: samples in a time series that can be linearly interpolated from their neighbors are removed to achieve compression. This technique is very common but it introduces a number of nuances that are often overlooked. We will look at some of these in this multi-part series:</p>
<ul>
<li>Splitting animation sequences into independent segments doesn’t work too well</li>
<li><a href="/2019/07/25/pitfalls_linear_reduction_part2/">Sorting the samples in a CPU cache friendly memory layout has far reaching implications</a></li>
<li><a href="/2019/07/29/pitfalls_linear_reduction_part3/">The added memory overhead when samples are removed is fairly large</a></li>
<li><a href="/2019/07/31/pitfalls_linear_reduction_part4/">Empirical data shows that in practice not many samples are removed</a></li>
</ul>
<p>TL;DR: As codecs grow in complexity, they can sometimes have unintended side-effects and ultimately be outperformed by simpler codecs.</p>
<h2 id="what-we-are-working-with">What are we working with?</h2>
<p><a href="/2016/10/27/anim_compression_data/">Animation data</a> consists of a series of values at various points in time (a time series). These drive the rotation, translation, and scale of various bones in a character’s skeleton or in some object (prop). Both <em>Unreal Engine</em> and the <a href="https://github.com/nfrechette/acl">Animation Compression Library</a> work with a fixed number of samples (frames) per second. Storing animation curves where the samples can be placed at arbitrary points in time isn’t as common in video games. The series of rotation, translation, and scale are called tracks and all have the same length. Together they combine to form an animation clip that describes how multiple bone transforms evolve over time.</p>
<p>Compression is commonly achieved by:</p>
<ul>
<li>Removing samples that can be reconstructed by interpolating between the remaining neighbors</li>
<li>Storing each sample using a reduced number of bits</li>
</ul>
<p><em>Unreal Engine</em> uses linear interpolation to reconstruct the samples it removes and it stores each track using 32, 48, or 96 bits per sample.</p>
<p>ACL retains every sample and stores each track using 9, 12, 16, … 96 bits per sample (19 possible bit rates).</p>
<h2 id="slice-it-up">Slice it up!</h2>
<p><img src="/public/segmenting_explained.jpg" alt="Segmenting Explained" /></p>
<p><a href="/2016/11/10/anim_compression_uniform_segmenting/">Segmenting a clip</a> is the act of splitting it into a number of independent and contiguous segments. For example, we can split an animation sequence that has tracks with 31 samples each (1 second at 30 samples per second) into two segments with 16 and 15 samples per track.</p>
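<p>A minimal sketch of such a split might look like this; the balancing heuristic is an assumption and ACL’s actual splitting logic is more involved.</p>
<pre><code class="language-cpp">#include <cstdint>
#include <vector>

// Split a clip into segments of roughly `ideal_segment_size` samples,
// mirroring the 31 = 16 + 15 example above.
std::vector<uint32_t> split_into_segments(uint32_t num_samples, uint32_t ideal_segment_size = 16)
{
    const uint32_t num_segments = (num_samples + ideal_segment_size - 1) / ideal_segment_size;
    const uint32_t base_size = num_samples / num_segments;
    const uint32_t remainder = num_samples % num_segments;

    // Hand out the remainder one sample at a time to keep segments balanced.
    std::vector<uint32_t> segment_sizes(num_segments, base_size);
    for (uint32_t i = 0; i < remainder; ++i)
        segment_sizes[i]++;
    return segment_sizes;
}
// split_into_segments(31) yields { 16, 15 }.
</code></pre>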
<p>Segmenting has a number of advantages:</p>
<ul>
<li>Values within a segment have a <a href="/2016/11/09/anim_compression_range_reduction/">reduced range of motion</a> which allows fewer bits to be used to represent them, achieving compression.</li>
<li>When samples are removed we have to search for their neighbors to reconstruct their value. Segmenting allows us to narrow down the search faster, reducing the cost of seeking.</li>
<li>Because all the data needed to sample a time <code class="language-plaintext highlighter-rouge">T</code> is within a contiguous segment, we can easily stream it from a slower medium or prefetch it.</li>
<li>If segments are small enough, they fit entirely within the processor L1 or L2 cache which leads to faster compression.</li>
<li>Independent segments can trivially be compressed in parallel.</li>
</ul>
<p>Around that time in the summer of 2017, I introduced segmenting into ACL and <a href="/2017/09/10/acl_v0.4.0/">saw massive gains</a>: the memory footprint reduced by roughly <strong>36%</strong>.</p>
<p>ACL uses 16 samples per segment on average and having access to the animations from <em>Paragon</em> I looked at how many segments it had: <strong>6558</strong> clips turned into <strong>49214</strong> segments.</p>
<p>What a <em>great</em> idea to try with the UE4 codecs as well!</p>
<h2 id="segmenting-in-unreal">Segmenting in Unreal</h2>
<p>Unfortunately, the <em>Unreal Engine</em> codecs were not designed with this concept in mind.</p>
<p><img src="/public/offset_map.jpg" alt="Offset map" /></p>
<p>In order to keep decompression fast, each track stores an offset into the compressed byte stream where its data starts as well as the number of samples retained. This allows great performance if all you need is a single bone or when bones are decompressed in an order different from the one they were compressed in (this is quite common in UE4 for various reasons).</p>
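<p>Conceptually, each track carries a small header along these lines; the field names are illustrative, not UE4’s actual layout.</p>
<pre><code class="language-cpp">#include <cstdint>

// Per-track header: where the track's data lives and how many samples
// survived the reduction pass.
struct TrackHeader
{
    uint32_t data_offset; // offset into the compressed stream
    uint32_t num_samples; // number of retained samples
};

// Decompressing a single bone (or bones out of order) only needs its own
// header: jump straight to the track data without touching other tracks.
const uint8_t* find_track_data(const uint8_t* compressed_stream,
                               const TrackHeader* headers, uint32_t track_index)
{
    return compressed_stream + headers[track_index].data_offset;
}
</code></pre>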
<p>Unfortunately, this overhead must be repeated in every segment. To avoid the compressed clip’s size increasing too much, I settled on a segment size of 64 samples. But these larger segments came with some drawbacks:</p>
<ul>
<li>They are less likely to fit in the CPU L1 cache when compressing</li>
<li>There are fewer opportunities for parallelism when compressing</li>
<li>They don’t narrow the sample search as much when decompressing</li>
</ul>
<p>Most of the <em>Paragon</em> clips are short. Roughly <strong>60%</strong> of them need only a single segment. Only <strong>19%</strong> have more than two segments. This meant most clips consumed the same amount of memory while being slightly slower to decompress. Only long cinematics like the <a href="/2017/10/05/acl_in_ue4/">Matinee fight scene</a> showed significant gains in their memory footprint, compression performance, and decompression performance. In my experience working on multiple games, short clips are by far the most common and <em>Paragon</em> isn’t an outlier.</p>
<p><strong>Overall, segmenting worked but it was very underwhelming within the UE4 codecs. It did not deliver what I had hoped it would and what I had seen with ACL.</strong></p>
<p>In an effort to fix the decompression performance regression, a context object was introduced, adding even more complexity. A context object persists from frame to frame for each clip being played back. It allows data to be reused from a previous decompression call to speed up the next call. It is also necessary in order to support sorting the samples which I tried next and will be covered in my <a href="/2019/07/25/pitfalls_linear_reduction_part2/">next post</a>.</p>
Comparing past and present Unreal Engine 4 animation compression2019-06-14T00:00:00+00:00http://nfrechette.github.io/2019/06/14/acl_plugin_v0.4.0<p>Ever since the <a href="https://github.com/nfrechette/acl-ue4-plugin">Animation Compression Library UE4 Plugin</a> was released in <em>July 2018</em>, I have been comparing and tracking its progress against <em>Unreal Engine 4.19.2</em>. For the sake of consistency, even when newer UE versions came out, I did not integrate ACL and measure from scratch. This was convenient and practical since animation compression doesn’t change very often in UE and the numbers remain fairly close over time. However, in <em>UE 4.21</em> significant changes were made and comparing against an earlier release was no longer a fair and accurate comparison.</p>
<p>As such, I have updated my baseline to <em>UE 4.22.2</em> and am releasing a new version of the plugin to go along with it: <a href="https://github.com/nfrechette/acl-ue4-plugin/releases/tag/v0.4.0">v0.4</a>. This new plugin release does not bring significant improvements on the previous one (they both still use <a href="https://github.com/nfrechette/acl/releases/tag/v1.2.0">ACL v1.2</a>) but it does bring the necessary changes to integrate cleanly with the newer UE API. One of the most notable changes in this release is the introduction of <a href="https://github.com/nfrechette/acl-ue4-plugin/tree/develop/Docs#engine-integration">git patches</a> for the custom engine changes required to support the ACL plugin. Note that earlier engine versions are not officially supported although if there is interest, feel free to reach out to me.</p>
<p>One benefit of the thorough measurements that I regularly perform is that not only can I track ACL’s progress over time but I can also do the same with Unreal Engine. Today we’ll talk about <em>Unreal</em> a bit more.</p>
<h2 id="what-changed-in-ue-421">What changed in UE 4.21</h2>
<p>Two years ago <em>Epic</em> asked me to improve their own animation compression. The primary focus was on improving the compression speed of their automatic compression codec while maintaining the existing memory footprint and decompression speed. Improvements to the other metrics were considered a stretch goal.</p>
<p>The automatic compression codec in <em>UE 4.20</em> and earlier tried <strong>35+</strong> codec permutations and picked the best overall. Understandably, this could be quite slow in many cases.</p>
<p>To speed it up, two important changes were made:</p>
<ul>
<li>The codecs tried were distilled into a white list of the 11 most commonly used codecs</li>
<li>The codecs are now evaluated in parallel</li>
</ul>
<p>This brought a significant win. As mentioned at the <a href="https://youtu.be/tWVZ6KO4lRs?t=1256">Game Developers Conference 2019</a>, the introduction of the white list brought the time to compress <em>Fortnite</em> (while cooking for the <em>Xbox One</em>) from <strong>6h 25mins</strong> down to <strong>1h 50mins</strong>, a speedup of <strong>3.5x</strong>. Executing the codecs in parallel lowered this even more, down to <strong>40mins</strong> or about <strong>2.75x</strong> faster. Overall, it ended up <strong>9.6x</strong> faster.</p>
<p>While none of these improvements changed the codecs or their output in any way, <em>UE 4.21</em> did see significant changes to the codec API. The reason for this will be the subject of my <a href="/2019/07/23/pitfalls_linear_reduction_part1/">next blog post</a>: failed optimization attempts and the lessons learned from them. In particular, I will show why removing samples that can be reconstructed through interpolation has significant drawbacks. Sometimes despite your best intentions and research, things just don’t pan out.</p>
<h2 id="the-juicy-numbers">The juicy numbers</h2>
<p>For the <em>GDC</em> presentation we only measured a few metrics and only while cooking <em>Fortnite</em>. However, every release I use a much more extensive set of animations to track progress and many more metrics. Here are all the numbers.</p>
<p><em>Note: The ACL plugin v0.3 numbers are from its integration in UE 4.19.2 while v0.4 is with UE 4.22.2. Because the Unreal codecs didn’t change, the decompression performance remained roughly the same. The ACL plugin accuracy numbers are slightly different due to fixes in the commandlet used to extract them. The compression speedup for the ACL plugin largely comes from the switch to Visual Studio 2017 that came with UE 4.20.</em></p>
<h3 id="carnegie-mellon-university-database-performance">Carnegie-Mellon University database performance</h3>
<p>For details about this data set and the metrics used, see <a href="https://github.com/nfrechette/acl-ue4-plugin/blob/develop/Docs/cmu_performance.md">here</a>.</p>
<table>
<thead>
<tr>
<th> </th>
<th>ACL Plugin v0.4.0</th>
<th>ACL Plugin v0.3.0</th>
<th>UE v4.22.2</th>
<th>UE v4.19.2</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Compressed size</strong></td>
<td>74.42 MB</td>
<td>74.42 MB</td>
<td>100.15 MB</td>
<td>99.94 MB</td>
</tr>
<tr>
<td><strong>Compression ratio</strong></td>
<td>19.21 : 1</td>
<td>19.21 : 1</td>
<td>14.27 : 1</td>
<td>14.30 : 1</td>
</tr>
<tr>
<td><strong>Compression time</strong></td>
<td>5m 10s</td>
<td>6m 24s</td>
<td>11m 11s</td>
<td>1h 27m 40s</td>
</tr>
<tr>
<td><strong>Compression speed</strong></td>
<td>4712 KB/sec</td>
<td>3805 KB/sec</td>
<td>2180 KB/sec</td>
<td>278 KB/sec</td>
</tr>
<tr>
<td><strong>Max ACL error</strong></td>
<td>0.0968 cm</td>
<td>0.0702 cm</td>
<td>0.1675 cm</td>
<td>0.1520 cm</td>
</tr>
<tr>
<td><strong>Max UE4 error</strong></td>
<td>0.0816 cm</td>
<td>0.0816 cm</td>
<td>0.0995 cm</td>
<td>0.0996 cm</td>
</tr>
<tr>
<td><strong>ACL Error 99<sup>th</sup> percentile</strong></td>
<td>0.0089 cm</td>
<td>0.0088 cm</td>
<td>0.0304 cm</td>
<td>0.0271 cm</td>
</tr>
<tr>
<td><strong>Samples below ACL error threshold</strong></td>
<td>99.90 %</td>
<td>99.93 %</td>
<td>47.81 %</td>
<td>49.34 %</td>
</tr>
</tbody>
</table>
<h3 id="paragon-database-performance">Paragon database performance</h3>
<p>For details about this data set and the metrics used, see <a href="https://github.com/nfrechette/acl-ue4-plugin/blob/develop/Docs/paragon_performance.md">here</a>.</p>
<table>
<thead>
<tr>
<th> </th>
<th>ACL Plugin v0.4.0</th>
<th>ACL Plugin v0.3.0</th>
<th>UE v4.22.2</th>
<th>UE v4.19.2</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Compressed size</strong></td>
<td>234.76 MB</td>
<td>234.76 MB</td>
<td>380.37 MB</td>
<td>392.97 MB</td>
</tr>
<tr>
<td><strong>Compression ratio</strong></td>
<td>18.22 : 1</td>
<td>18.22 : 1</td>
<td>11.24 : 1</td>
<td>10.88 : 1</td>
</tr>
<tr>
<td><strong>Compression time</strong></td>
<td>23m 58s</td>
<td>30m 14s</td>
<td>2h 5m 11s</td>
<td>15h 10m 23s</td>
</tr>
<tr>
<td><strong>Compression speed</strong></td>
<td>3043 KB/sec</td>
<td>2412 KB/sec</td>
<td>582 KB/sec</td>
<td>80 KB/sec</td>
</tr>
<tr>
<td><strong>Max ACL error</strong></td>
<td>0.8623 cm</td>
<td>0.8623 cm</td>
<td>0.8619 cm</td>
<td>0.8619 cm</td>
</tr>
<tr>
<td><strong>Max UE4 error</strong></td>
<td>0.8601 cm</td>
<td>0.8601 cm</td>
<td>0.6424 cm</td>
<td>0.6424 cm</td>
</tr>
<tr>
<td><strong>ACL Error 99<sup>th</sup> percentile</strong></td>
<td>0.0100 cm</td>
<td>0.0094 cm</td>
<td>0.0438 cm</td>
<td>0.0328 cm</td>
</tr>
<tr>
<td><strong>Samples below ACL error threshold</strong></td>
<td>99.00 %</td>
<td>99.19 %</td>
<td>81.75 %</td>
<td>84.88 %</td>
</tr>
</tbody>
</table>
<h3 id="matinee-fight-scene-performance">Matinee fight scene performance</h3>
<p>For details about this data set and the metrics used, see <a href="https://github.com/nfrechette/acl-ue4-plugin/blob/develop/Docs/fight_scene_performance.md">here</a>.</p>
<table>
<thead>
<tr>
<th> </th>
<th>ACL Plugin v0.4.0</th>
<th>ACL Plugin v0.3.0</th>
<th>UE v4.22.2</th>
<th>UE v4.19.2</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Compressed size</strong></td>
<td>8.77 MB</td>
<td>8.77 MB</td>
<td>23.67 MB</td>
<td>23.67 MB</td>
</tr>
<tr>
<td><strong>Compression ratio</strong></td>
<td>7.11 : 1</td>
<td>7.11 : 1</td>
<td>2.63 : 1</td>
<td>2.63 : 1</td>
</tr>
<tr>
<td><strong>Compression time</strong></td>
<td>16s</td>
<td>20s</td>
<td>9m 36s</td>
<td>54m 3s</td>
</tr>
<tr>
<td><strong>Compression speed</strong></td>
<td>3790 KB/sec</td>
<td>3110 KB/sec</td>
<td>110 KB/sec</td>
<td>19 KB/sec</td>
</tr>
<tr>
<td><strong>Max ACL error</strong></td>
<td>0.0588 cm</td>
<td>0.0641 cm</td>
<td>0.0426 cm</td>
<td>0.0671 cm</td>
</tr>
<tr>
<td><strong>Max UE4 error</strong></td>
<td>0.0617 cm</td>
<td>0.0617 cm</td>
<td>0.0672 cm</td>
<td>0.0672 cm</td>
</tr>
<tr>
<td><strong>ACL Error 99<sup>th</sup> percentile</strong></td>
<td>0.0382 cm</td>
<td>0.0382 cm</td>
<td>0.0161 cm</td>
<td>0.0161 cm</td>
</tr>
<tr>
<td><strong>Samples below ACL error threshold</strong></td>
<td>94.61 %</td>
<td>94.52 %</td>
<td>94.23 %</td>
<td>94.22 %</td>
</tr>
</tbody>
</table>
<h2 id="conclusion-and-what-comes-next">Conclusion and what comes next</h2>
<p>Overall, the new parallel white list is a clear winner. It is dramatically faster to compress and none of the other metrics measurably suffered. However, despite this massive improvement ACL remains <em>much</em> faster.</p>
<p>For the past few months I have been working with Epic to refactor the engine side of things to ensure that animation compression plugins are natively supported by the engine. This effort is ongoing and will hopefully land soon in an Unreal Engine near you.</p>
Faster arithmetic by flipping signs2019-05-08T00:00:00+00:00http://nfrechette.github.io/2019/05/08/sign_flip_optimization<p>Over the years, I picked up a number of tricks to optimize code. Today I’ll talk about one of them.</p>
<p>I first picked it up a few years ago when I was tasked with optimizing the cloth simulation code in <a href="https://en.wikipedia.org/wiki/Shadow_of_the_Tomb_Raider">Shadow of the Tomb Raider</a>. It had been fine tuned extensively with <em>PowerPC</em> intrinsics for the <em>Xbox 360</em> but its performance was lacking on <em>XboxOne</em> (<em>x64</em>). While I hoped to talk about it <a href="http://nfrechette.github.io/2017/04/11/modern_simd_hardware/">2 years ago</a>, I ended up sidetracked with the <a href="https://github.com/nfrechette/acl">Animation Compression Library (ACL)</a>. However, last week while introducing <em>ARM64</em> fused multiply-add support into ACL and the <a href="https://github.com/nfrechette/rtm">Realtime Math (RTM)</a> library, I noticed an opportunity to use this trick when linearly interpolating <a href="https://en.wikipedia.org/wiki/Quaternion">quaternions</a> and it all came back to me.</p>
<p>TL;DR: Flipping the sign of some arithmetic operations can lead to shorter and faster assembly.</p>
<h1 id="flipping-for-fun-and-profit-on-x64">Flipping for fun and profit on x64</h1>
<p>To understand how this works, we need to give a bit of context first.</p>
<p>On <em>PowerPC</em>, <em>ARM</em>, and many other platforms, before an instruction can use a value it must first be loaded from memory into a register explicitly with a load type instruction. This is not always the case with <em>x86</em> and <em>x64</em> where many instructions can take either a register or a memory address as input. In the latter case, the load still happens behind the scenes and while it isn’t really much faster by itself it does have a few benefits.</p>
<p>Not having an explicit load instruction means that we do not use one of the precious named registers. While under the hood the CPU has tons of registers (e.g. modern processors have <strong>50+ XMM registers</strong>), the instructions can only reference a few of them: only <strong>16</strong> named XMM registers can be referenced. This can be very important if a piece of code places considerable pressure on the number of registers it needs. Fewer named registers used means a reduced likelihood that registers have to spill to the stack, and spilling introduces quite a few instructions as well. Altogether, removing that single load instruction can considerably improve the chances of a function getting inlined.</p>
<p>Fewer instructions also mean a lower code cache footprint and better overall performance although in practice, I have found this impact very hard to measure.</p>
<p>Two things are important to keep in mind:</p>
<ul>
<li>An instruction that operates directly from memory will be slower than the same instruction working from a register: it has to perform the load operation and even if the value resides in the CPU cache, it still takes time.</li>
<li>Most arithmetic instructions that take two inputs only support having one of them come from memory: addition, subtraction, multiplication, etc. Typically, only the second input (on the right) can come from memory.</li>
</ul>
<p>To illustrate how flipping the sign of arithmetic operations can lead to improved code generation, we will use the following example:</p>
<script src="https://gist.github.com/nfrechette/0654f9ca2d722524844dd1b99f9b5e85.js"></script>
<p>Both <em>MSVC 19.20</em> and <em>GCC 9</em> generate the above assembly. The <code class="language-plaintext highlighter-rouge">1.0f</code> constant must be loaded from memory because <code class="language-plaintext highlighter-rouge">subss</code> only supports the second argument coming from memory. Interestingly, <em>Clang 8</em> figures out that it can use the sign flip trick all on its own:</p>
<script src="https://gist.github.com/nfrechette/b36d26bebfecf67b2bacace7382ad07d.js"></script>
<p>Because we multiply our input value by a constant, we are in control of its sign and we can leverage that fact to change our <code class="language-plaintext highlighter-rouge">subss</code> instruction into an <code class="language-plaintext highlighter-rouge">addss</code> instruction that can work with another constant from memory directly. Both functions are mathematically equivalent and their results are identical down to every last bit.</p>
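<p>Here is a representative sketch of the transformation (the constants are illustrative; the exact example lives in the gists above):</p>
<pre><code class="language-cpp">// Before: subss cannot take 1.0f from memory as its first operand,
// so the compiler must first load the constant into a register.
float before(float input)
{
    return 1.0f - (input * 0.5f);   // movss (load 1.0f), mulss, subss
}

// After: flipping the sign of the multiplication constant turns the
// subtraction into an addition whose memory operand is 1.0f itself.
// Both functions return bit-identical results.
float after(float input)
{
    return (input * -0.5f) + 1.0f;  // mulss, addss (1.0f read from memory)
}
</code></pre>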
<p>Short and sweet!</p>
<p>Not every mainstream compiler today is smart enough to do this on its own, especially in more complex cases where other instructions will end up in between the two sign flips. Doing it by hand ensures that they have all the tools to do it right. Should the compiler think that performance will be better by loading the constant into a register anyway, it will also be able to do so (for example if the function is inlined in a loop).</p>
<h1 id="but-wait-theres-more">But wait, there’s more!</h1>
<p>While that trick is very handy on <em>x64</em>, it can also be used in a different context and I found a good example on <em>ARM64</em>: when linearly interpolating quaternions.</p>
<p>Typically linear interpolation isn’t recommended with quaternions as it is not guaranteed to give very accurate results if the two quaternions are far apart. However, in the context of animation compression, successive quaternions are often very close to one another and linear interpolation works just fine. Here is what the function looked like before I used the trick:</p>
<script src="https://gist.github.com/nfrechette/83f989059160c2befdcc67d3e2496db1.js"></script>
<p>The code is fairly simple:</p>
<ul>
<li>We calculate a bias and if both quaternions are on opposite sides of the <a href="https://en.wikipedia.org/wiki/Quaternions_and_spatial_rotation#The_hypersphere_of_rotations">hypersphere</a> (negative dot product), we apply a bias to flip one of them to make sure that they lie on the same side. This guarantees that the path taken when interpolating will be the shortest.</li>
<li>We linearly interpolate: <code class="language-plaintext highlighter-rouge">(end - start) * alpha + start</code></li>
<li>Last but not least, we normalize our quaternion to make sure that it represents a valid 3D rotation.</li>
</ul>
<p>When I introduced the fused multiply-add support to <em>ARM64</em>, I looked at the above code and its generated assembly and noticed that we had a multiplication instruction followed by a subtraction instruction before our final fused multiply-add instruction. Can we do better?</p>
<p>While <a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=FMA"><em>FMA3</em></a> has a myriad of instructions for all sorts of fused multiply-add variations, <a href="https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics"><em>ARM64</em> does not</a>; it has only two: <code class="language-plaintext highlighter-rouge">fmla</code> ((a * b) + c) and <code class="language-plaintext highlighter-rouge">fmls</code> (-(a * b) + c).</p>
<p>Here is the interpolation broken down a bit more clearly:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">x = (end * bias) - start</code> (<code class="language-plaintext highlighter-rouge">fmul</code>, <code class="language-plaintext highlighter-rouge">fsub</code>)</li>
<li><code class="language-plaintext highlighter-rouge">result = (x * alpha) + start</code> (<code class="language-plaintext highlighter-rouge">fmla</code>)</li>
</ul>
<p>After a bit of thinking, it becomes obvious what the solution is:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">-x = -(end * bias) + start</code> (<code class="language-plaintext highlighter-rouge">fmls</code>)</li>
<li><code class="language-plaintext highlighter-rouge">result = -(-x * alpha) + start</code> (<code class="language-plaintext highlighter-rouge">fmls</code>)</li>
</ul>
<p>By flipping the sign of our intermediate value with the <code class="language-plaintext highlighter-rouge">fmls</code> instruction, we can flip it again and cancel it out by using it once more while removing an instruction in the process. This simple change resulted in a <strong>2.5%</strong> speedup during animation decompression (which also does a bunch of other stuff) on my iPad.</p>
<p><em>Note: because <code class="language-plaintext highlighter-rouge">fmls</code> reuses one of the input registers for its output, a new <code class="language-plaintext highlighter-rouge">mov</code> instruction was added in the final inlined decompression code but it executes much faster than a floating point instruction.</em></p>
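<p>Here is what those two steps might look like with NEON intrinsics (a minimal sketch and not the actual RTM code; normalization is omitted and <code class="language-plaintext highlighter-rouge">vfmsq_f32(a, b, c)</code> computes <code class="language-plaintext highlighter-rouge">a - (b * c)</code>, the <code class="language-plaintext highlighter-rouge">fmls</code> instruction):</p>
<pre><code class="language-cpp">#include &lt;arm_neon.h&gt;

// Interpolation core using two fmls instructions, normalization omitted.
float32x4_t quat_lerp_core(float32x4_t start, float32x4_t end, float32x4_t bias, float32x4_t alpha)
{
    // -x = -(end * bias) + start
    const float32x4_t neg_x = vfmsq_f32(start, end, bias);

    // result = -(-x * alpha) + start == ((end * bias) - start) * alpha + start
    return vfmsq_f32(start, neg_x, alpha);
}
</code></pre>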
<p>You can find the change in RTM <a href="https://github.com/nfrechette/rtm/pull/17">here</a>.</p>
The Animation Compression Library just got even faster2019-04-15T00:00:00+00:00http://nfrechette.github.io/2019/04/15/acl_v1.2.0<p>Slowly but surely, the <a href="https://github.com/nfrechette/acl">Animation Compression Library</a> has now reached <a href="https://github.com/nfrechette/acl/releases/tag/v1.2.0">v1.2</a> along with an updated <a href="https://github.com/nfrechette/acl-ue4-plugin/releases/tag/v0.3.0">v0.3 Unreal Engine 4 plugin</a>. The most notable changes in this release are as follow:</p>
<ul>
<li>More compilers and architectures added to continuous integration</li>
<li>Accuracy bug fixes</li>
<li>Floating point sample rate support</li>
<li>Dramatically faster compression through the introduction of a compression level setting</li>
</ul>
<p>TL;DR: Compared to <strong>UE 4.19.2</strong>, the ACL plugin compresses up to <strong>1.7x smaller</strong>, is up to <strong>3x more accurate</strong>, up to <strong>158x faster to compress</strong>, and up to <strong>7.5x faster to decompress</strong> (results may vary depending on the platform and data).</p>
<p><em>Note that UE 4.21 introduced changes that significantly sped up the compression with its Automatic Compression codec but I haven’t had the time to setup a new branch with it to measure.</em></p>
<h2 id="ue-4-plugin-support-and-progress">UE 4 plugin support and progress</h2>
<p>Now that ACL properly supports a floating point sample rate, the UE4 plugin has reached feature parity with the stock codecs.</p>
<p>As announced at the <em>GDC 2019</em>, work is ongoing to refactor the <em>Unreal Engine</em> to natively support animation compression plugins and is currently on track to land with <strong>UE 4.23</strong>. Once it does, the plugin will be updated once more, finally reaching <strong>v1.0</strong> on the <em>Unreal marketplace</em> for <strong>free</strong>.</p>
<h2 id="lighting-fast-compression">Lighting fast compression</h2>
<p>One of the most common pieces of feedback I received from those who use ACL in the wild (both within UE4 and outside) was the desire for faster compression. The optimization algorithm is very aggressive and despite its impressive performance overall (as highlighted in prior releases), some clips with deep bone hierarchies could take a very long time to compress, prohibitively so.</p>
<p>In order to address this, a new compression level was introduced in the compression settings to better control how much time should be spent attempting to find an optimal bit rate. Higher levels take more time but yield a lower memory footprint. A total of five levels were introduced (<em>Lowest, Low, Medium, High, Highest</em>) although the three lowest currently behave the same. The <em>Highest</em> level corresponds to what prior releases did by default. After carefully reviewing the impact of each level, a decision was made to make the default level <em>Medium</em> instead. This translates into dramatically faster compression and identical accuracy, with a very small and acceptable increase in memory footprint. This should provide a much better experience for animators during production. Once the game is ready to be released, the animations can easily and safely be recompressed with the <em>Highest</em> setting in order to squeeze out every byte.</p>
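<p>In code, selecting a level looks something like this (a sketch only: the identifier names approximate the ACL v1.2 API and may differ in your version, so consult the headers for the exact spelling):</p>
<pre><code class="language-cpp">// Sketch: names approximate the ACL v1.2 API and may differ in your version.
acl::CompressionSettings settings = acl::get_default_compression_settings();

// Fast iteration during production: dramatically faster, slightly larger.
settings.level = acl::CompressionLevel8::Medium;

// Final release build: squeeze out every byte at the cost of compression time.
settings.level = acl::CompressionLevel8::Highest;
</code></pre>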
<p>In order to extract the following results, I compressed the <a href="https://github.com/nfrechette/acl/blob/develop/docs/cmu_performance.md">Carnegie-Mellon University motion capture database</a>, <a href="https://github.com/nfrechette/acl/blob/develop/docs/paragon_performance.md">Paragon</a>, and Fortnite in parallel with <strong>4</strong> threads using ACL standalone. Numbers in parentheses represent the delta against the <em>Highest</em> level.</p>
<table>
<thead>
<tr>
<th>Compressed Size</th>
<th>Highest</th>
<th>High</th>
<th>Medium</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>CMU</strong></td>
<td>67.05 MB</td>
<td>68.85 MB (+2.7%)</td>
<td>71.01 MB (+5.9%)</td>
</tr>
<tr>
<td><strong>Paragon</strong></td>
<td>206.87 MB</td>
<td>211.81 MB (+2.4%)</td>
<td>218.58 MB (+5.7%)</td>
</tr>
<tr>
<td><strong>Fortnite</strong></td>
<td>491.79 MB</td>
<td>497.60 MB (+1.2%)</td>
<td>507.11 MB (+3.1%)</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th>Compression Time</th>
<th>Highest</th>
<th>High</th>
<th>Medium</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>CMU</strong></td>
<td>24m 57.59s</td>
<td>11m 51.48s</td>
<td>6m 20.89s</td>
</tr>
<tr>
<td><strong>Paragon</strong></td>
<td>4h 55m 42.57s</td>
<td>1h 19m 36.01s</td>
<td>29m 21.65s</td>
</tr>
<tr>
<td><strong>Fortnite</strong></td>
<td>8h 13m 1.66s</td>
<td>2h 29m 59.37s</td>
<td>1h 3m 18.17s</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th>Compression Speed</th>
<th>Highest</th>
<th>High</th>
<th>Medium</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>CMU</strong></td>
<td>977.36 KB/sec</td>
<td>2057.24 KB/sec (+2.1x)</td>
<td>3842.79 KB/sec (+3.9x)</td>
</tr>
<tr>
<td><strong>Paragon</strong></td>
<td>246.79 KB/sec</td>
<td>916.82 KB/sec (+3.7x)</td>
<td>2485.58 KB/sec (+10.1x)</td>
</tr>
<tr>
<td><strong>Fortnite</strong></td>
<td>613.56 KB/sec</td>
<td>2016.82 KB/sec (+3.3x)</td>
<td>4778.65 KB/sec (+7.8x)</td>
</tr>
</tbody>
</table>
<p>And here are the default settings in action on the <a href="https://github.com/nfrechette/acl-ue4-plugin/blob/develop/Docs/paragon_performance.md">animations from Paragon</a> with the ACL plugin inside UE4:</p>
<table>
<thead>
<tr>
<th> </th>
<th>ACL Plugin v0.3.0</th>
<th>ACL Plugin v0.2.0</th>
<th>UE v4.19.2</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Compressed size</strong></td>
<td>234.76 MB</td>
<td>226.09 MB</td>
<td>392.97 MB</td>
</tr>
<tr>
<td><strong>Compression ratio</strong></td>
<td>18.22 : 1</td>
<td>18.91 : 1</td>
<td>10.88 : 1</td>
</tr>
<tr>
<td><strong>Compression time</strong></td>
<td>30m 14.69s</td>
<td>6h 4m 18.21s</td>
<td>15h 10m 23.56s</td>
</tr>
<tr>
<td><strong>Compression speed</strong></td>
<td>2412.94 KB/sec</td>
<td>200.32 KB/sec</td>
<td>80.16 KB/sec</td>
</tr>
<tr>
<td><strong>Max ACL error</strong></td>
<td>0.8623 cm</td>
<td>0.8590 cm</td>
<td>0.8619 cm</td>
</tr>
<tr>
<td><strong>Max UE4 error</strong></td>
<td>0.8601 cm</td>
<td>0.8566 cm</td>
<td>0.6424 cm</td>
</tr>
<tr>
<td><strong>ACL Error 99<sup>th</sup> percentile</strong></td>
<td>0.0094 cm</td>
<td>0.0116 cm</td>
<td>0.0328 cm</td>
</tr>
<tr>
<td><strong>Samples below ACL error threshold</strong></td>
<td>99.19 %</td>
<td>98.85 %</td>
<td>84.88 %</td>
</tr>
</tbody>
</table>
<p>The <strong>99th</strong> percentile and the number of samples below the <strong>0.01 cm</strong> error threshold are calculated by measuring the error of every bone at every sample in each of the <strong>6558</strong> animation clips. More details on how the error is measured can be found <a href="https://github.com/nfrechette/acl/blob/develop/docs/error_metrics.md">here</a>.</p>
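<p>For reference, here is one way such a percentile can be computed from the per-sample errors (a sketch, not the actual commandlet code):</p>
<pre><code class="language-cpp">#include &lt;algorithm&gt;
#include &lt;cassert&gt;
#include &lt;vector&gt;

// Returns the 99th percentile of the provided per-sample errors.
float error_percentile_99(std::vector&lt;float&gt; errors)
{
    assert(!errors.empty());
    size_t index = static_cast&lt;size_t&gt;(errors.size() * 0.99);
    if (index &gt;= errors.size())
        index = errors.size() - 1;

    // Partially sorts just enough to place the right value at 'index'.
    std::nth_element(errors.begin(), errors.begin() + index, errors.end());
    return errors[index];
}
</code></pre>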
<p>In this new release, the decompression performance remains largely unchanged. It is worth noting that a month ago my <em>Google Nexus 5X</em> died abruptly and as such performance numbers will no longer be tracked on it. Instead, my new <em>Google Pixel 3</em> will be used from here on out.</p>
<h1 id="whats-next">What’s next</h1>
<p>The <a href="https://github.com/nfrechette/acl/milestone/9">next release v1.3</a>, currently scheduled for <strong>Fall 2019</strong>, will aim to tackle commonly requested features:</p>
<ul>
<li>Faster decompression in long clips by optimizing seeking</li>
<li>Multiple root transform support (e.g. rigid body simulation compression)</li>
<li>Scalar track support (e.g. float curves for blend shapes)</li>
<li>Faster compression in part by using multiple threads to compress a single clip (which will help the UE4 plugin a lot)</li>
</ul>
<p>If you use ACL and would like to help prioritize the work I do, feel free to reach out and provide feedback or requests!</p>
Compressing Fortnite Animations2019-01-25T00:00:00+00:00http://nfrechette.github.io/2019/01/25/compressing_fortnite_animations<p>New year, new stats! A few months ago, <em>Epic</em> agreed to let me use their <em>Fortnite</em> animations for my open source research with the <a href="https://github.com/nfrechette/acl">Animation Compression Library (ACL)</a>. Following months of work to refactor <em>Unreal Engine 4</em> in order to natively support animation compression plugins, it has finally entered the review stage on <em>Epic’s</em> end. While I had hoped the changes could make it in time for <em>Unreal Engine 4.22</em>, due to unforeseen delays, <em>4.23</em> seems a much more likely candidate.</p>
<p>Even though the code isn’t public yet, the new updated ACL plugin kicks ass and <em>Fortnite</em> is a great title to showcase it with. The real game uses the classic UE4 codecs but I recompressed everything with the latest and greatest. After spending several hundred hours compressing the animations, fixing bugs, and iterating I can finally present the results.</p>
<p><strong>TL;DR:</strong> Inside <em>Fortnite</em>, ACL shines bright with <strong>2x</strong> faster compression, <strong>2x</strong> smaller memory footprint, and higher accuracy. Decompression is <strong>1.6x</strong> faster on desktop, <strong>2.3x</strong> faster on a <em>Samsung S8</em>, and <strong>1.2x</strong> faster on the <em>Xbox One</em>.</p>
<h1 id="methodology">Methodology</h1>
<p>For the UE4 measurements I used a modified <em>UE 4.21</em> with its default <strong>Automatic Compression</strong>. It tries a list of codecs in parallel and selects the optimal result by considering both size and accuracy.</p>
<p>ACL uses a modified version of the open source <a href="https://github.com/nfrechette/acl-ue4-plugin">ACL Plugin v0.2.2</a>. It uses its own default compression settings and in the rare event where the error is above <strong>1cm</strong>, it falls back automatically to safer settings.</p>
<p>Although the UE4 refactor doesn’t change the legacy codecs, it does speed up their decompression a bit compared to previous UE4 releases. That is one of many benefits everyone will get to enjoy as a result of my refactor work regardless of which codec is used.</p>
<h2 id="error-measurements">Error measurements</h2>
<p>While the UE4 and ACL error measurements never exactly matched, they historically have been very close for every single clip I had tried, until <em>Fortnite</em>. As it turns out, some exotic animations brought to light the fact that some subtle differences in how they both measure the error can lead to some large perceived discrepancies. This has now been documented in the plugin <a href="https://github.com/nfrechette/acl-ue4-plugin/blob/develop/Docs/error_measurements.md">here</a>.</p>
<p><strong>Three</strong> differences stand out: how the error is measured, where the error is measured in space, and where the error is measured in time. You can follow the link above for the gory details but the gist is that ACL is more conservative and more accurate in how it measures the error and it should always be trusted over what UE4 reports in case of doubt or disagreement.</p>
<p>It is worth noting that because ACL does not currently support a floating point sample rate (e.g. 28.3 FPS), those clips (and there are many) have a higher reported error with UE4 because by rounding, we are effectively time stretching those clips a tiny bit. They still look just as good though. This will be fixed in the <a href="https://github.com/nfrechette/acl-ue4-plugin/issues/7">next version</a>.</p>
<h1 id="the-animations">The animations</h1>
<p>I extracted all the non-additive animations regardless of whether they were used by the game or not: a grand total of <strong>8304</strong> clips! A total raw size of <strong>17 GB</strong> and roughly <strong>17.5 hours</strong> worth of playback.</p>
<p><em>Fortnite</em> has a surprising number of exotic clips. Some take <strong>hours</strong> to compress with UE4 and others have a range of motion as wide as the <em>distance between the earth and the moon</em>! These allowed me to identify a number of very subtle bugs in ACL and to fix them.</p>
<h1 id="compression-stats">Compression stats</h1>
<table>
<thead>
<tr>
<th> </th>
<th>ACL Plugin</th>
<th>UE4</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Compressed size</strong></td>
<td>498.21 MB</td>
<td>1011.84 MB</td>
</tr>
<tr>
<td><strong>Compression ratio</strong></td>
<td>35.55 : 1</td>
<td>17.50 : 1</td>
</tr>
<tr>
<td><strong>Compression time</strong></td>
<td>12h 38m 04.99s</td>
<td>23h 8m 58.94s</td>
</tr>
<tr>
<td><strong>Compression speed</strong></td>
<td>398.72 KB/sec</td>
<td>217.62 KB/sec</td>
</tr>
<tr>
<td><strong>Max ACL error</strong></td>
<td>0.9565 cm</td>
<td>8392339 cm</td>
</tr>
<tr>
<td><strong>Max UE4 error</strong></td>
<td>108904.6797 cm</td>
<td>8397727 cm</td>
</tr>
<tr>
<td><strong>ACL Error 99<sup>th</sup> percentile</strong></td>
<td>0.0309 cm</td>
<td>2.1856 cm</td>
</tr>
<tr>
<td><strong>Samples below ACL error threshold</strong></td>
<td>97.71 %</td>
<td>77.37 %</td>
</tr>
</tbody>
</table>
<p>Once again, ACL performs admirably: the compression speed is twice as fast (<strong>1.83x</strong>), the memory footprint is cut in half (<strong>2.03x</strong> smaller), and the accuracy is right where we want it. This is also in line with the previous results from <a href="https://github.com/nfrechette/acl-ue4-plugin/blob/develop/Docs/paragon_performance.md">Paragon</a>.</p>
<p><img src="/public/acl/fortnite_max_error_distribution.png" alt="Fortnite Max Error Distribution" /></p>
<p>UE4’s accuracy struggles a bit with a few clips but in practice the error might not be visible as the overwhelming majority of samples are very accurate. This is consistent as well with <a href="http://nfrechette.github.io/2017/12/05/acl_paragon/">previous results</a>.</p>
<p><img src="/public/acl/ue4_importing.png" alt="UE4 Import Comic" /></p>
<p>A handful of clips contribute to a large portion of the UE4 compression time and its high error. One clip in particular stands out: it has <strong>1167 bones</strong>, <strong>8371 samples</strong> at <strong>120 FPS</strong>, and a total raw size of <strong>372.66 MB</strong>. Its range of motion peaks at <strong>477000 kilometers</strong> away from the origin! It truly pushes the codecs to their absolute limits.</p>
<table>
<thead>
<tr>
<th> </th>
<th>ACL Plugin</th>
<th>UE4</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Compressed size</strong></td>
<td>71.53 MB</td>
<td>220.87 MB</td>
</tr>
<tr>
<td><strong>Compression ratio</strong></td>
<td>5.21 : 1</td>
<td>1.69 : 1</td>
</tr>
<tr>
<td><strong>Compression time</strong></td>
<td>1m 38.07s</td>
<td>4h 51m 59.13s</td>
</tr>
<tr>
<td><strong>Compression speed</strong></td>
<td>3891.19 KB/sec</td>
<td>21.78 KB/sec</td>
</tr>
<tr>
<td><strong>Max ACL error</strong></td>
<td>0.0625 cm</td>
<td>8392339 cm</td>
</tr>
<tr>
<td><strong>Max UE4 error</strong></td>
<td>108904.6797 cm</td>
<td>8397727 cm</td>
</tr>
</tbody>
</table>
<p>It takes almost <strong>5 hours</strong> to compress with UE4! In comparison, ACL zips through in well under <strong>2 minutes</strong>. While it tries its best with the default settings, it ultimately ends up using the safety fallback and thus compresses <em>twice</em> in that amount of time.</p>
<p>Overall, if you added the ACL codec to the <em>Automatic Compression</em> list, here is how it would perform:</p>
<ul>
<li>ACL is smaller for <strong>7711</strong> clips (<strong>92.86 %</strong>)</li>
<li>ACL is more accurate for <strong>7576</strong> clips (<strong>91.23 %</strong>)</li>
<li>ACL has faster compression for <strong>5704</strong> clips (<strong>68.69 %</strong>)</li>
<li>ACL is smaller, better, and faster for <strong>5017</strong> clips (<strong>60.42 %</strong>)</li>
<li>ACL wins Automatic Compression for <strong>7863</strong> clips (<strong>94.69 %</strong>)</li>
</ul>
<h1 id="decompression-stats">Decompression stats</h1>
<p><em>Fortnite</em> has the handy ability to create replays. These make gathering deterministic profiling numbers a breeze. The numbers that follow are from a <em>50 vs 50</em> replay. On each platform, a section of the replay with some high intensity action was profiled.</p>
<h2 id="desktop">Desktop</h2>
<p><img src="/public/acl/fortnite_pc_decompression_time.png" alt="Fortnite Desktop Decompression Time" /></p>
<p>The performance on desktop looks pretty good. ACL is consistently faster, about <strong>38%</strong> on average. It also appears a bit less noisy, a direct benefit of the improved cache friendliness of its algorithm.</p>
<h2 id="samsung-s8">Samsung S8</h2>
<p><img src="/public/acl/fortnite_s8_decompression_time_both.png" alt="Fortnite Samsung S8 Decompression Time" /></p>
<p><img src="/public/acl/fortnite_s8_decompression_time_acl.png" alt="Fortnite Samsung S8 ACL Decompression Time" /></p>
<p>ACL really shines on mobile. On average it is <strong>56%</strong> faster but that is only part of the story. On the S8 it appears that the core is hyperthreaded and another thread does heavy work and applies cache pressure. This causes all sorts of spikes with UE4; in comparison, the cache aware nature of ACL allows it to maintain clean and consistent performance.</p>
<p>Hyperthreading on the CPU (and the GPU) works, roughly speaking, by the processor switching to another thread already executing when it notices that the current thread is stalled waiting on a slow operation, typically when memory needs to be pulled into the cache. Both threads are executing in the sense that they have data being held in registers but only one of them advances at a time on that core. When one stalls, the other executes.</p>
<p>When you have a piece of code that triggers a lot of cache misses, such as some of the legacy UE4 codecs, the processor will be more likely to switch to the other hyperthread. When this happens, execution is suspended and it will only resume once the other thread stalls or the time slice expires. This could be a long time especially if the other hyperthread is executing cache friendly code and doesn’t otherwise stall often.</p>
<p>This translates into the type of graph above where there is heavy fluctuation as the execution time varies widely from the noise of the neighboring hyperthread.</p>
<p>On the other hand, when the code is cache friendly, it doesn’t give the chance to the other thread to run. This gives a nice and smooth graph for that current thread as the risk of long interruptions is reduced. When the code is that optimized, hyperthreading typically doesn’t help speed things up much as both threads compete for the same time slice with few opportunities to hide stalling latency. This is also what I observed when <a href="http://nfrechette.github.io/2018/01/10/acl_v0.6.0/">measuring the compression performance</a>. In theory due to the higher cache pressure, performance could even degrade with hyperthreading but in practice I haven’t observed it, not with ACL at least.</p>
<h2 id="xbox-one">Xbox One</h2>
<p><img src="/public/acl/fortnite_xb1_decompression_time.png" alt="Fortnite Xbox One Decompression Time" /></p>
<p>On the <em>Xbox One</em> ACL is about <strong>13%</strong> faster on average. Both lines seem to have very similar shapes unlike the previous two platforms due in large part to the absence of hyperthreading. There are a few possibilities as to why the gain isn’t as significant on this platform:</p>
<ul>
<li>The MSVC compiler does not generate assembly as clean as what it generates on PC; it is certainly sub-optimal on a few points. It fails to inline many trivial functions and it leaves around unnecessary shuffle instructions.</li>
<li>Perhaps the other threads that share the L2 thrash the hardware prefetcher, preventing it from kicking in. ACL benefits heavily from hardware prefetching.</li>
<li>The CPU is quite slow compared to the speed of its memory. This reduces the benefit of cache locality as it keeps L2 cache misses fairly cheap in comparison.</li>
</ul>
<p>The last two points seem the most likely culprits. ACL does a bit more work per sample decompressed than UE4 but everything is cache friendly. This gives it a massive edge when memory is slow compared to the CPU clock as is the case on my desktop, the <em>Samsung S8</em>, and lots of other platforms.</p>
<h1 id="conclusion">Conclusion</h1>
<p>With faster compression, faster decompression on every platform, a smaller memory footprint, and higher accuracy, ACL continues to shine in UE4 and it won’t be long now before everyone can find it on the marketplace for free.</p>
<p>In the meantime, in the next few months I will perform another release of ACL and its plugin with all the latest fixes made possible with <em>Fortnite’s</em> data.</p>
<h1 id="status-update">Status update</h1>
<p>To my knowledge, the first game released with ACL came out in <em>November 2018</em> with the public UE4 plugin: <a href="https://store.steampowered.com/app/717690/OVERKILLs_The_Walking_Dead/">OVERKILL’s The Walking Dead</a>. I was told it reduced their animation memory footprint by over <strong>50%</strong> helping them fit within their console budgets.</p>
<p>A number of people have also integrated it into their own custom game engines and although I have no idea if they are using it or not, <a href="https://github.com/Remedy-Entertainment/acl">Remedy Entertainment</a> has forked ACL!</p>
<p>Last but not least, I’d like to extend a special shout-out to <em>Epic</em> for allowing me to do this and to the ACL contributors!</p>
Introducing Realtime Math v1.02019-01-19T00:00:00+00:00http://nfrechette.github.io/2019/01/19/introducing_realtime_math<p>Almost two years ago now, I began writing the <a href="https://github.com/nfrechette/acl">Animation Compression Library</a>. I set out to build it to be production quality which meant I needed a whole lot of optimized math. At the time, I took a look at the landscape of math libraries and I opted to roll out my own. It has served me well, propelling ACL to success with some of the <a href="http://nfrechette.github.io/2018/09/08/acl_v1.1.0/">fastest compression and decompression</a> performance in the industry. I am now proud to announce that the code has been refactored out into its own open source library: <a href="https://github.com/nfrechette/rtm">Realtime Math v1.0 (RTM)</a> (MIT license).</p>
<p>There were a few reasons that motivated the choice to move the code out on its own:</p>
<ul>
<li>A significant amount of the ACL Continuous Integration build time is compiling and running the math unit tests which slows things down a bit more than I’d like</li>
<li>It decouples code that will benefit from being on its own</li>
<li>I believe it has its place in the landscape of math libraries out there</li>
</ul>
<p>In order to support that last point, I reviewed <strong>9</strong> other popular and usable math libraries for realtime applications. I looked at these with the lenses of my own needs and experience, your mileage may vary.</p>
<p><em>Disclaimer: the list of reviewed libraries is in no way exhaustive but I believe it is representative. Note that Unreal Engine 4 is included for informational purposes as it isn’t really usable on its own. Libraries are listed in no particular order and I tried to be as objective as possible. If you spot any inaccuracies, don’t hesitate to reach out.</em></p>
<p>The list: <em>Realtime Math, MathFu, vectorial, VectorialPlusPlus, C OpenGL Graphics Math (CGLM), OpenGL Graphics Math (GLM), Industrial Light & Magic Base (ILMBase), DirectX Math, and Unreal Engine 4</em>.</p>
<h2 id="tldr-how-realtime-math-stands-out">TL;DR: How Realtime Math stands out</h2>
<p>I believe <em>Realtime Math</em> stands out for a few reasons.</p>
<p>It is geared for high performance, deeply hot code. Most functions will end up inlined but the price to pay is an API that is a bit more verbose as a result of being C-style. When the need arises to use intrinsics, it gets out of the way and lets you do your thing. Only two libraries had what I would call optimal inlinability: <em>Realtime Math</em> and <em>DirectX Math</em>. Only those two libraries properly support the <em>__vectorcall</em> calling convention explicitly and only RTM handles GCC and Clang argument passing explicitly.</p>
<p>While they still need a bit of love, quaternions are a first class citizen and RTM is the only standalone open source library I could find that supports QVV transforms (a rotation quaternion, a 3d scale vector, and a translation vector).</p>
<p><em>Realtime Math</em> uses a coding style similar to the C++ standard library and feels clean and natural to read and write.</p>
<p>It consists entirely of C++11 headers, it runs almost everywhere, it supports 64 bit floating point arithmetic, and it sports a very permissive MIT license.</p>
<h2 id="license">License</h2>
<p>ACL is open source and uses the MIT license. I am never keen on adding dependencies and if I really have to, I want a permissive license free of constraints.</p>
<table>
<thead>
<tr>
<th>Library</th>
<th>License</th>
</tr>
</thead>
<tbody>
<tr>
<td>Realtime Math</td>
<td>MIT</td>
</tr>
<tr>
<td>MathFu</td>
<td>Apache 2.0</td>
</tr>
<tr>
<td>vectorial</td>
<td>BSD 2-clause</td>
</tr>
<tr>
<td>VectorialPlusPlus</td>
<td>BSD 2-clause</td>
</tr>
<tr>
<td>CGLM</td>
<td>MIT</td>
</tr>
<tr>
<td>GLM</td>
<td>Modified MIT</td>
</tr>
<tr>
<td>ILMBase</td>
<td>Custom but permissive</td>
</tr>
<tr>
<td>DirectX Math</td>
<td>MIT</td>
</tr>
<tr>
<td>Unreal Engine 4</td>
<td>UE4 EULA</td>
</tr>
</tbody>
</table>
<h2 id="header-only">Header only</h2>
<p>For simplicity and ease of integration, I want ACL to be entirely made of C++11 headers. This also constrains any dependencies to the same requirement.</p>
<table>
<thead>
<tr>
<th>Library</th>
<th>Header Only</th>
</tr>
</thead>
<tbody>
<tr>
<td>Realtime Math</td>
<td>Yes</td>
</tr>
<tr>
<td>MathFu</td>
<td>Yes</td>
</tr>
<tr>
<td>vectorial</td>
<td>Yes</td>
</tr>
<tr>
<td>VectorialPlusPlus</td>
<td>Yes</td>
</tr>
<tr>
<td>CGLM</td>
<td>Yes (optional lib)</td>
</tr>
<tr>
<td>GLM</td>
<td>Yes</td>
</tr>
<tr>
<td>ILMBase</td>
<td>No</td>
</tr>
<tr>
<td>DirectX Math</td>
<td>Yes</td>
</tr>
<tr>
<td>Unreal Engine 4</td>
<td>No</td>
</tr>
</tbody>
</table>
<h2 id="verbosity-readability-and-power">Verbosity, readability, and power</h2>
<p>An important requirement for a math library is to be reasonably concise with average code without getting in the way if the need arises to dive right into raw intrinsics. In my experience, general math type abstractions take you very far but in order to squeeze out every cycle it is sometimes necessary to write custom per platform code. When this is required, it is important for the library to not hide its internals and leave the door open.</p>
<p>I am personally more a fan of C-style interfaces for a math library for various reasons: I can infer very well what happens under the hood (I have seen many libraries make <em>fancy</em> use of some operators that leave many newcomers wondering what they do) and they are optimal for performance as we will discuss later. The downside of course is that they tend to be a bit more verbose. However, this largely boils down to a matter of personal taste.</p>
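<p>To illustrate the trade-off, here is the same operation in both styles (a self-contained sketch with scalar placeholder types, not the API of any particular library):</p>
<pre><code class="language-cpp">// Minimal placeholder types to contrast the two styles (illustrative only).
struct Vector4
{
    float x, y, z, w;
    Vector4 operator+(const Vector4&amp; rhs) const { return { x + rhs.x, y + rhs.y, z + rhs.z, w + rhs.w }; }
    Vector4 operator*(float s) const { return { x * s, y * s, z * s, w * s }; }
};

using vector4f = Vector4;
inline vector4f vector_add(vector4f lhs, vector4f rhs) { return lhs + rhs; }
inline vector4f vector_mul(vector4f lhs, float rhs) { return lhs * rhs; }

// C++ wrapper style: concise, but the operators can hide what happens under the hood.
Vector4 midpoint_cpp(Vector4 a, Vector4 b) { return (a + b) * 0.5f; }

// C-style: more verbose, but each call maps plainly onto the generated code.
vector4f midpoint_c(vector4f a, vector4f b) { return vector_mul(vector_add(a, b), 0.5f); }
</code></pre>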
<p><em>vectorial</em> is one of the few libraries that offers both a C-style interface and C++ wrappers and at the other end of the spectrum <em>DirectX Math</em> has both a namespace and a prefix for every type, constant and function.</p>
<table>
<thead>
<tr>
<th>Library</th>
<th>Verbosity</th>
</tr>
</thead>
<tbody>
<tr>
<td>Realtime Math</td>
<td>Medium (C-style)</td>
</tr>
<tr>
<td>MathFu</td>
<td>Light (C++ wrappers)</td>
</tr>
<tr>
<td>vectorial</td>
<td>Light (C++ wrappers) and Medium (C-style)</td>
</tr>
<tr>
<td>VectorialPlusPlus</td>
<td>Light (C++ wrappers)</td>
</tr>
<tr>
<td>CGLM</td>
<td>Medium (C-style)</td>
</tr>
<tr>
<td>GLM</td>
<td>Light (C++ wrappers)</td>
</tr>
<tr>
<td>ILMBase</td>
<td>Light (C++ wrappers)</td>
</tr>
<tr>
<td>DirectX Math</td>
<td>Medium++ (C-style with prefix and namespace)</td>
</tr>
<tr>
<td>Unreal Engine 4</td>
<td>Light (C++ wrappers)</td>
</tr>
</tbody>
</table>
<p>It is very common for C-style math APIs to <em>typedef</em> their types to the underlying SIMD type. <em>Realtime Math</em>, <em>DirectX Math</em>, and many others do this. While this is great for performance, it does raise one problem: type safety is reduced. While usually those interfaces will opt to not expose proper vector2 or vector3 types and instead rely on functions that simply ignore the extra components, it doesn’t work so well when vector4 and quaternions are mixed. Only <em>Realtime Math</em>, <em>DirectX Math</em> and <em>CGLM</em> have quaternions with C-style interfaces but only the first two have a distinct type for quaternions when SIMD intrinsics are disabled. This somewhat mitigates the issue because with both <em>Realtime Math</em> and <em>DirectX Math</em> you can compile without intrinsics and still have type safety validated there. Although at the end of the day, all three have functions with distinct prefixes for vector and quaternion math and as such type safety is unlikely to be an issue.</p>
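<p>A short sketch of the aliasing issue (assuming SSE and typedef’d SIMD types):</p>
<pre><code class="language-cpp">#include &lt;xmmintrin.h&gt;

// With SIMD intrinsics enabled, both types often alias the same register type.
using vector4f = __m128;
using quatf = __m128;

void example(quatf rotation)
{
    // Compiles without complaint: the quaternion is silently treated as a vector.
    vector4f as_vector = rotation;
    (void)as_vector;
}

// Compiling without intrinsics, where quatf and vector4f become distinct structs,
// is what allows the compiler to catch such mistakes.
</code></pre>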
<h2 id="type-and-feature-support">Type and feature support</h2>
<p>By virtue of being an animation compression library, ACL’s needs are a bit different from a traditional realtime application. This dictated the need I had for specific types and features. I had no need for general 3x3 or 4x4 matrices or for 2D vectors, which are more commonly used in gameplay and rendering. However, 3x4 affine matrices, 3D and 4D vectors, quaternions, and QVV transforms (a quaternion, a vector3 translation, and a vector3 scale) are of critical importance. Those types are front and center in an animation runtime and I needed them to be fully featured and fast. Most of the libraries under review had way more features than I cared for (mostly for rendering) but generally lacked proper (or any) support for quaternions and QVV transforms.</p>
<p><em>MathFu appears to have a <a href="https://github.com/google/mathfu/issues/33">bug</a> where the Matrix 4x4 SIMD template specialization isn’t included by default and its <a href="https://github.com/google/mathfu/issues/34">quaternions are 32 bytes</a> instead of the ideal 16 due to alignment constraints.</em></p>
<p><em>VectorialPlusPlus quaternions also take 32 bytes instead of 16 due to alignment constraints and most of their quaternion code appears to be scalar.</em></p>
<p><em>UE 4 is notable for being the only other library to support QVV and it does offer a VectorRegister type to support SIMD for Vector2/3/4 although most of the code written in the engine uses the scalar version.</em></p>
<table>
<thead>
<tr>
<th>Library</th>
<th>Vector2</th>
<th>Vector3</th>
<th>Vector4</th>
<th>Quaternion</th>
<th>Matrix 3x3</th>
<th>Matrix 4x4</th>
<th>Matrix 3x4</th>
<th>QVV</th>
</tr>
</thead>
<tbody>
<tr>
<td>Realtime Math</td>
<td> </td>
<td>SIMD</td>
<td>SIMD</td>
<td>SIMD</td>
<td>SIMD</td>
<td>SIMD</td>
<td>SIMD</td>
<td>SIMD</td>
</tr>
<tr>
<td>MathFu</td>
<td>SIMD</td>
<td>SIMD</td>
<td>SIMD</td>
<td>Partial SIMD</td>
<td>Scalar</td>
<td>SIMD</td>
<td>Scalar</td>
<td> </td>
</tr>
<tr>
<td>vectorial</td>
<td> </td>
<td>SIMD</td>
<td>SIMD</td>
<td> </td>
<td> </td>
<td>SIMD</td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>VectorialPlusPlus</td>
<td>SIMD</td>
<td>SIMD</td>
<td>SIMD</td>
<td>Scalar</td>
<td>SIMD</td>
<td>SIMD</td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>CGLM</td>
<td> </td>
<td>SIMD</td>
<td>SIMD</td>
<td>SIMD</td>
<td>SIMD</td>
<td>SIMD</td>
<td>SIMD</td>
<td> </td>
</tr>
<tr>
<td>GLM</td>
<td>SIMD</td>
<td>SIMD</td>
<td>SIMD</td>
<td>Partial SIMD</td>
<td>SIMD</td>
<td>SIMD</td>
<td>SIMD</td>
<td> </td>
</tr>
<tr>
<td>ILMBase</td>
<td>Scalar</td>
<td>Scalar</td>
<td>Scalar</td>
<td>Scalar</td>
<td>Scalar</td>
<td>Scalar</td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>DirectX Math</td>
<td>SIMD</td>
<td>SIMD</td>
<td>SIMD</td>
<td>SIMD</td>
<td> </td>
<td>SIMD</td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>Unreal Engine 4</td>
<td>Scalar</td>
<td>Scalar</td>
<td>Scalar</td>
<td>SIMD</td>
<td> </td>
<td>SIMD</td>
<td> </td>
<td>SIMD</td>
</tr>
</tbody>
</table>
<h2 id="simd-architecture-support">SIMD architecture support</h2>
<p>Equally important was the SIMD architecture support. I want to run ACL everywhere with the best performance possible, especially on mobile. SSE, AVX, and NEON are all equally important to me.</p>
<p><em>Worth noting that 2 years ago DirectX NEON support appeared to be almost exclusively for Windows ARM NEON and I have no idea if it runs on iOS or Android even today.</em></p>
<table>
<thead>
<tr>
<th>Library</th>
<th>SSE</th>
<th>AVX</th>
<th>NEON</th>
</tr>
</thead>
<tbody>
<tr>
<td>Realtime Math</td>
<td>Yes</td>
<td>Yes</td>
<td>Yes</td>
</tr>
<tr>
<td>MathFu</td>
<td>Yes</td>
<td> </td>
<td>Yes</td>
</tr>
<tr>
<td>vectorial</td>
<td>Yes</td>
<td> </td>
<td>Yes</td>
</tr>
<tr>
<td>VectorialPlusPlus</td>
<td>Yes</td>
<td> </td>
<td>Partial</td>
</tr>
<tr>
<td>CGLM</td>
<td>Yes</td>
<td>Yes</td>
<td>Partial</td>
</tr>
<tr>
<td>GLM</td>
<td>Yes</td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>ILMBase</td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>DirectX Math</td>
<td>Yes</td>
<td>Yes</td>
<td>Yes</td>
</tr>
<tr>
<td>Unreal Engine 4</td>
<td>Yes</td>
<td>Yes</td>
<td>Yes</td>
</tr>
</tbody>
</table>
<h2 id="platform-and-compiler-support">Platform and compiler support</h2>
<p>Here things are a bit more complicated as libraries will list platforms but not compilers or compilers but not platforms. I need ACL to run everywhere and this means limiting myself to C++11 features.</p>
<ul>
<li>Realtime Math: Windows (VS2015, VS2017) x86 and x64, Linux (gcc5, gcc6, gcc7, gcc8, clang4, clang5, clang6) x86 and x64, OS X (Xcode 8.3, Xcode 9.4, Xcode 10.1) x86 and x64, Android clang ARMv7-A and ARM64, iOS (Xcode 8.3, Xcode 9.4, Xcode 10.1) ARM64</li>
<li>MathFu: Windows, Linux, OS X, Android</li>
<li>vectorial: Unlisted but probably Windows, Linux, OS X, Android, and iOS</li>
<li>VectorialPlusPlus: Unlisted but probably Windows</li>
<li>CGLM: Windows, Unix, and probably everywhere</li>
<li>GLM: VS2013+, Apple Clang 6, GCC 4.7+, ICC XE 2013+, LLVM 3.4+, CUDA 7+</li>
<li>ILMBase: Unlisted but probably Windows, Linux, OS X</li>
<li>DirectX Math: VS2015 and VS2017, possibly elsewhere</li>
<li>Unreal Engine 4: Windows (VS2015, VS2017) x64, Linux x64, OS X x64, Android ARMv7-A (no NEON) and ARM64, iOS ARM64</li>
</ul>
<h2 id="continuous-integration-support">Continuous integration support</h2>
<p>Continuous integration is a critical part of modern software development especially with C++ when multiple platforms are supported and maintained.</p>
<table>
<thead>
<tr>
<th>Library</th>
<th>Continuous Integration</th>
</tr>
</thead>
<tbody>
<tr>
<td>Realtime Math</td>
<td>Yes</td>
</tr>
<tr>
<td>MathFu</td>
<td>No</td>
</tr>
<tr>
<td>vectorial</td>
<td>No</td>
</tr>
<tr>
<td>VectorialPlusPlus</td>
<td>No</td>
</tr>
<tr>
<td>CGLM</td>
<td>Yes</td>
</tr>
<tr>
<td>GLM</td>
<td>Yes</td>
</tr>
<tr>
<td>ILMBase</td>
<td>No</td>
</tr>
<tr>
<td>DirectX Math</td>
<td>No</td>
</tr>
<tr>
<td>Unreal Engine 4</td>
<td>Not public</td>
</tr>
</tbody>
</table>
<h2 id="dependencies">Dependencies</h2>
<p>I’m not personally a big fan of pulling in tons of dependencies, especially for a math library. As mentioned earlier, the Unreal Engine 4 math library isn’t really usable on its own because of this but is included regardless.</p>
<table>
<thead>
<tr>
<th>Library</th>
<th>Dependencies</th>
</tr>
</thead>
<tbody>
<tr>
<td>Realtime Math</td>
<td> </td>
</tr>
<tr>
<td>MathFu</td>
<td>vectorial (BSD 2-clause)</td>
</tr>
<tr>
<td>vectorial</td>
<td> </td>
</tr>
<tr>
<td>VectorialPlusPlus</td>
<td>HandyCPP (custom license)</td>
</tr>
<tr>
<td>CGLM</td>
<td> </td>
</tr>
<tr>
<td>GLM</td>
<td> </td>
</tr>
<tr>
<td>ILMBase</td>
<td> </td>
</tr>
<tr>
<td>DirectX Math</td>
<td> </td>
</tr>
<tr>
<td>Unreal Engine 4</td>
<td>Unreal Engine 4</td>
</tr>
</tbody>
</table>
<h2 id="floating-point-support">Floating point support</h2>
<p>When I got started with ACL, I wasn’t sure at the time whether 64 bit floating point arithmetic would offer superior accuracy and be worth using. As a result, I needed the math code to support both float32 and float64 types for everything with a seamless API between the two for quick testing. It later <a href="http://nfrechette.github.io/2017/12/29/acl_research_arithmetic/">turned out</a> that the extra floating point precision doesn’t help enough to be worth using.</p>
<table>
<thead>
<tr>
<th>Library</th>
<th>Float 32 Support</th>
<th>Float 64 Support</th>
</tr>
</thead>
<tbody>
<tr>
<td>Realtime Math</td>
<td>Yes</td>
<td>Yes (partial SIMD)</td>
</tr>
<tr>
<td>MathFu</td>
<td>Yes</td>
<td>Yes (no SIMD)</td>
</tr>
<tr>
<td>vectorial</td>
<td>Yes</td>
<td> </td>
</tr>
<tr>
<td>VectorialPlusPlus</td>
<td>Yes</td>
<td>Yes (partial SIMD)</td>
</tr>
<tr>
<td>CGLM</td>
<td>Yes</td>
<td> </td>
</tr>
<tr>
<td>GLM</td>
<td>Yes</td>
<td> </td>
</tr>
<tr>
<td>ILMBase</td>
<td>Yes (no SIMD)</td>
<td>Yes (no SIMD)</td>
</tr>
<tr>
<td>DirectX Math</td>
<td>Yes</td>
<td> </td>
</tr>
<tr>
<td>Unreal Engine 4</td>
<td>Yes</td>
<td> </td>
</tr>
</tbody>
</table>
<h2 id="inlinability">Inlinability</h2>
<p>Due to the critical need for ACL to be as fast as possible on every platform, having the bulk of the math operations be inline is very important. Many things impact whether a function is inlined by the compiler but two stand out:</p>
<ul>
<li>Simple and short functions inline better</li>
<li>Passing arguments by register needs fewer instructions which inlines better</li>
</ul>
<p>Thankfully, most math functions are fairly simple and short: add, mul, div, etc. C-style functions will generally have a slight advantage over C++ wrappers mainly because the wrappers must also track the implicit <em>this</em> pointer being passed around, even if it is ultimately optimized out inside the caller. When the compiler needs to determine if it can inline a function, it uses a heuristic and the size of the intermediate assembly/IR/AST most likely plays a role. Generally speaking, C++ wrapper functions that are short will inline just fine but some operations have a harder time due to their size: matrix 4x4 multiplication, quaternion multiplication, and quaternion interpolation. For this reason, I personally favor a C-style API for this sort of code.</p>
<p>The second point is not to be underestimated. Most of the libraries in the list take their arguments either by value or by <em>const</em> reference. While passing SIMD types by value does the right thing on ARM and passes them by register (up to 4), it does not work for aggregate types like matrices and it does not work with the default <em>x64</em> calling convention of MSVC. In order to pass SIMD types by register with MSVC, you must use its <em>__vectorcall</em> <a href="https://msdn.microsoft.com/en-us/library/dn375768.aspx">calling convention</a>, which also works for aggregate and wrapper types and can use up to <strong>6</strong> registers. On desktop and Xbox One, using <em>__vectorcall</em> is critical for high performance code and sadly, most libraries do not support it explicitly (and not all support it implicitly if the whole compilation unit is forced to use that calling convention). With <em>Visual Studio 2015</em>, <em>__vectorcall</em> is the difference between quaternion interpolation getting inlined or not. When I added support for it in ACL, I measured a roughly <strong>5%</strong> speedup during decompression.</p>
<p>Note that once a function is inlined, whether the arguments are passed by register or not typically does not impact the generated assembly, although it sometimes does (at least with MSVC, especially when AVX is enabled).</p>
<p><em>Some libraries that use a generic vector template class with SIMD specializations (like MathFu) sometimes end up passing float32 arguments by const reference instead of by value, which is often suboptimal when not inlined.</em></p>
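<p>To make this concrete, here is a minimal sketch of the C-style, by-register convention discussed above. The <code class="language-plaintext highlighter-rouge">MATH_CALL</code> macro is a hypothetical name used purely for illustration (x86/x64 only), not something taken from any of the libraries compared here:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <xmmintrin.h>

#if defined(_MSC_VER)
    // Without this, the default MSVC x64 convention passes __m128 by pointer.
    #define MATH_CALL __vectorcall
#else
    #define MATH_CALL
#endif

// With __vectorcall (or on ABIs that already do this by default), both
// arguments arrive in SIMD registers instead of being spilled to the stack.
inline __m128 MATH_CALL vector_add(__m128 lhs, __m128 rhs)
{
    return _mm_add_ps(lhs, rhs);
}
</code></pre></div></div>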
<table>
<thead>
<tr>
<th>Library</th>
<th>Inlinability</th>
<th>Register Passing</th>
</tr>
</thead>
<tbody>
<tr>
<td>Realtime Math</td>
<td>Optimal (C-style + by register)</td>
<td>Explicit (everywhere)</td>
</tr>
<tr>
<td>MathFu</td>
<td>Decent (C++ wrappers)</td>
<td>None</td>
</tr>
<tr>
<td>vectorial</td>
<td>Good (C-style), Decent (C++ wrappers)</td>
<td>Implicit (C-style and ARM only)</td>
</tr>
<tr>
<td>VectorialPlusPlus</td>
<td>Decent (C++ wrappers)</td>
<td>None</td>
</tr>
<tr>
<td>CGLM</td>
<td>Good (C-style)</td>
<td>None</td>
</tr>
<tr>
<td>GLM</td>
<td>Decent (C++ wrappers)</td>
<td>None</td>
</tr>
<tr>
<td>ILMBase</td>
<td>Decent (C++ wrappers)</td>
<td>None</td>
</tr>
<tr>
<td>DirectX Math</td>
<td>Optimal (C-style + by register)</td>
<td>Explicit (vectorcall and ARM only)</td>
</tr>
<tr>
<td>Unreal Engine 4</td>
<td>Decent (C++ wrappers)</td>
<td>None</td>
</tr>
</tbody>
</table>
<h2 id="multiplication-order">Multiplication order</h2>
<p>An important point of contention is how things are multiplied. As the table below shows, the OpenGL way is by far the most popular among open source math libraries.</p>
<p>It all boils down to whether vectors are represented as a <em>row</em> or as a <em>column</em>. In the former case, multiplication with a matrix takes the form <code class="language-plaintext highlighter-rouge">v' = vM</code> while in the latter case we have <code class="language-plaintext highlighter-rouge">v' = Mv</code>. Linear algebra typically treats vectors as <em>columns</em> and OpenGL opted to use that convention for <a href="http://steve.hollasch.net/cgindex/math/matrix/column-vec.html">that reason</a>. If you think of matrices as functions that modify an input and return an output, it ends up reading like this: <code class="language-plaintext highlighter-rouge">result = object_to_world(local_to_object(input))</code>. This reads right-to-left, as is common with nested function evaluation. In my opinion, this is quite awkward to work with as most modern programming languages (and western languages) read left-to-right. Most linear algebra formulas use abstract letters and names which somewhat hides this nuance, but when I write code, I try to keep my matrix names as clear as possible: what space the input and output are in. While you could technically reverse the naming, <code class="language-plaintext highlighter-rouge">result = world_from_object * object_from_local * input</code>, so that it at least reads decently right-to-left, it remains harder to reason about because just about everything we work with in the world goes from somewhere to somewhere else and not the other way around: trains, buses, planes, Monday to Friday, 5@7, etc.</p>
<p>On the other hand, DirectX uses <em>row</em> vectors and ends up with the much more natural: <code class="language-plaintext highlighter-rouge">result = input * local_to_object * object_to_world</code>. Your input is in local space; it gets transformed into object space before finally ending up in world space. Clean, clear, and readable. If you instead multiply the two matrices together on their own, you get the clear <code class="language-plaintext highlighter-rouge">local_to_world = local_to_object * object_to_world</code> instead of the awkward <code class="language-plaintext highlighter-rouge">local_to_world = object_to_world * local_to_object</code> you would get with OpenGL and <em>column</em> vectors.</p>
<p>At the end of the day, which way you choose largely boils down to a personal choice (or whatever library you use for rendering) as I don’t think there’s a big performance difference between the two on modern hardware. For ACL, all its output data is in local space and although we evaluate the error in world space internally, this is entirely transparent to the client application and it is free to use either convention.</p>
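<p>To make the relationship between the two conventions concrete, here is a small self-contained sketch (the <code class="language-plaintext highlighter-rouge">Mat2</code> and <code class="language-plaintext highlighter-rouge">Vec2</code> types are illustrative only, not taken from any library above) showing that multiplying a row vector by a matrix yields the same result as multiplying the transpose of that matrix by the same column vector:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstdio>

struct Mat2 { float m[2][2]; };
struct Vec2 { float x, y; };

Vec2 mul_row(Vec2 v, const Mat2& m)   // v' = vM (DirectX style)
{
    return { v.x * m.m[0][0] + v.y * m.m[1][0],
             v.x * m.m[0][1] + v.y * m.m[1][1] };
}

Vec2 mul_col(const Mat2& m, Vec2 v)   // v' = Mv (OpenGL style)
{
    return { m.m[0][0] * v.x + m.m[0][1] * v.y,
             m.m[1][0] * v.x + m.m[1][1] * v.y };
}

Mat2 transpose(const Mat2& m)
{
    return {{{ m.m[0][0], m.m[1][0] }, { m.m[0][1], m.m[1][1] }}};
}

int main()
{
    Mat2 rot90 = {{{ 0.0f, 1.0f }, { -1.0f, 0.0f }}};  // 90 degree rotation, row vector form
    Vec2 v = { 1.0f, 0.0f };
    Vec2 a = mul_row(v, rot90);              // (0, 1)
    Vec2 b = mul_col(transpose(rot90), v);   // also (0, 1): same math, transposed storage
    printf("row: (%g, %g), col: (%g, %g)\n", a.x, a.y, b.x, b.y);
    return 0;
}
</code></pre></div></div>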
<table>
<thead>
<tr>
<th>Library</th>
<th>Multiplication Style</th>
</tr>
</thead>
<tbody>
<tr>
<td>Realtime Math</td>
<td>DirectX</td>
</tr>
<tr>
<td>MathFu</td>
<td>OpenGL</td>
</tr>
<tr>
<td>vectorial</td>
<td>OpenGL</td>
</tr>
<tr>
<td>VectorialPlusPlus</td>
<td>OpenGL</td>
</tr>
<tr>
<td>CGLM</td>
<td>OpenGL</td>
</tr>
<tr>
<td>GLM</td>
<td>OpenGL</td>
</tr>
<tr>
<td>ILMBase</td>
<td>OpenGL</td>
</tr>
<tr>
<td>DirectX Math</td>
<td>DirectX</td>
</tr>
<tr>
<td>Unreal Engine 4</td>
<td>DirectX</td>
</tr>
</tbody>
</table>
<h2 id="conclusion">Conclusion</h2>
<p>Ultimately, which math library you choose for a particular project largely boils down to personal preference. For the vast majority of the code you’ll write, the performance and code generation are likely to be very close if not identical. Two years ago, I knew that regardless of which option I picked, I would have to do a lot of work to add what was missing. This greatly motivated me to start from scratch as many middleware developers do, and I do not regret the experience or the results.</p>
<p>My top two favorite libraries are <em>Realtime Math</em> and <em>DirectX Math</em>. Both are quite similar today although <em>DirectX Math</em> wasn’t quite as attractive when I started.</p>
<h2 id="next-steps">Next steps</h2>
<p>Over the next few days I will populate various issues on GitHub to document things that are missing or that could benefit from some love.</p>
<p>A core part that is partially missing at the moment is the quantization and packing logic that ACL already contains. I have not migrated that code yet in large part because I am not sure how to best expose it in a clean and consistent API. I do believe it belongs in RTM where everyone can benefit from it.</p>
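<p>For illustration, here is a minimal sketch of the kind of fixed bit rate quantization involved. This is an assumption on my part for illustration purposes, not ACL’s actual packing code:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstdint>
#include <algorithm>

// Quantize a normalized float in [0, 1] onto num_bits bits (num_bits < 32)
// and back. Illustrative only; the real code handles many more cases.
inline uint32_t quantize_unorm(float value, uint32_t num_bits)
{
    const uint32_t max_value = (1u << num_bits) - 1u;
    const float clamped = std::min(std::max(value, 0.0f), 1.0f);
    return static_cast<uint32_t>(clamped * float(max_value) + 0.5f);
}

inline float dequantize_unorm(uint32_t value, uint32_t num_bits)
{
    const uint32_t max_value = (1u << num_bits) - 1u;
    return float(value) / float(max_value);
}
</code></pre></div></div>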
<p>ACL does not yet use RTM but that migration is planned for ACL <a href="https://github.com/nfrechette/acl/issues/170">v2.0</a>.</p>
Smaller, faster: ACL lets you cut your animation costs in half2018-09-08T00:00:00+00:00http://nfrechette.github.io/2018/09/08/acl_v1.1.0<p>I am excited to announce the open source <a href="https://github.com/nfrechette/acl">Animation Compression Library</a> has reached <a href="https://github.com/nfrechette/acl/releases/tag/v1.1.0">v1.1</a> along with an updated <a href="https://github.com/nfrechette/acl-ue4-plugin/releases/tag/v0.2.0">Unreal Engine 4 plugin v0.2</a>.</p>
<p>ACL now beats <em>Unreal Engine 4</em> on all the important metrics. The plugin is <strong>1.7x</strong> smaller, <strong>3x</strong> more accurate, <strong>2.5x</strong> faster to compress, and up to <strong>4.8x</strong> faster to decompress!</p>
<h1 id="whats-new">What’s new</h1>
<p>The latest release focused on decompression performance. <em>ACL v1.1</em> is about <strong>30%</strong> faster than the previous version on every platform when decompressing a whole pose. Decompressing a single bone, a common operation in Unreal, is now about <strong>3-4x</strong> faster. Also, <em>ARM-based</em> products will now use <em>NEON SIMD</em> acceleration when available.</p>
<p>The UE4 plugin was reworked to be more tightly integrated and is about <strong>twice as fast</strong> compared to the previous version.</p>
<p>ACL has now reached a point where I can confidently say that it is the best overall animation compression algorithm in the video games industry. While other techniques might beat ACL on some of these metrics, beating it simultaneously on <em>speed, size, and accuracy</em> will prove to be very challenging. In particular, unlike other algorithms that offer very fast decompression, ACL has <em>no extra runtime memory cost</em> beyond the compressed clip data.</p>
<h1 id="new-data">New data!</h1>
<p>One year ago, <em>Epic</em> generously agreed to let me use the <a href="https://www.unrealengine.com/en-US/paragon">Paragon</a> animations <a href="http://nfrechette.github.io/2017/12/05/acl_paragon/">for research purposes</a>. This helped me find and fix bugs in <em>Unreal Engine</em> and ACL, and see how well both animation compression approaches perform in a real game. <em>Paragon</em> also allows each release to be rigorously tested against a large, relevant, and varied data set.</p>
<p>I am excited to announce that <em>Epic</em> is allowing me to use <a href="https://www.epicgames.com/fortnite/en-US/home">Fortnite</a> to further my research as well! While <em>Paragon</em> will continue to play its role in tracking compression performance and regression testing, <em>Fortnite</em> will allow me to measure decompression performance in real world scenarios much more easily. Testing with <em>Fortnite</em> should highlight new ways ACL can be improved further.</p>
<h1 id="whats-next">What’s next</h1>
<p>I am shifting my focus to add animation compression plugin support to UE4 during the next few months. If everything goes well, when <strong>UE 4.22</strong> is released next year, I will be able to add the ACL plugin to the <em>Unreal Engine Marketplace</em> for everyone to use, for <strong>free</strong>.</p>
<p>Proper plugin support will remove overhead and help make ACL’s in-game decompression faster still.</p>
<p>Due to the rigorous testing and extensive statistics extraction every release now requires, I expect the release cycle to slow down. I will aim to perform non-bug fix releases about twice a year.</p>
<h1 id="compression-performance-overview">Compression performance overview</h1>
<p>Here is a quick glance of how well it performs on the <a href="https://github.com/nfrechette/acl-ue4-plugin/blob/develop/Docs/paragon_performance.md">animations from Paragon</a>:</p>
<table>
<thead>
<tr>
<th> </th>
<th>ACL Plugin v0.2.0</th>
<th>UE v4.19.2</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Compressed size</strong></td>
<td>226.09 MB</td>
<td>392.97 MB</td>
</tr>
<tr>
<td><strong>Compression ratio</strong></td>
<td>18.91 : 1</td>
<td>10.88 : 1</td>
</tr>
<tr>
<td><strong>Compression time</strong></td>
<td>6h 4m 18.21s</td>
<td>15h 10m 23.56s</td>
</tr>
<tr>
<td><strong>Bone Error 99<sup>th</sup> percentile</strong></td>
<td>0.0116 cm</td>
<td>0.0328 cm</td>
</tr>
<tr>
<td><strong>Samples below 0.01 cm error threshold</strong></td>
<td>98.85 %</td>
<td>84.88 %</td>
</tr>
</tbody>
</table>
<p>The <strong>99<sup>th</sup></strong> percentile and the number of samples below the <strong>0.01 cm</strong> error threshold are calculated by measuring the world-space error of every bone at every sample in each of the <strong>6558</strong> animation clips. To put this into perspective, over <strong>99 %</strong> of the compressed data has an error lower than the width of a human hair. More details on how the error is measured can be found <a href="https://github.com/nfrechette/acl/blob/develop/docs/error_metrics.md">here</a>.</p>
<h1 id="decompression-performance-overview">Decompression performance overview</h1>
<p><a href="https://github.com/nfrechette/acl-ue4-plugin/blob/develop/Docs/decompression_performance.md">Decompression performance</a> is currently tracked with the <a href="http://nfrechette.github.io/2017/10/05/acl_in_ue4/">Matinee fight scene</a>. The troopers have around 70 bones each while the main trooper has 541.</p>
<p><img src="/public/acl/acl_plugin_v020_decomp_s8_matinee.png" alt="Matinee S8 Median Performance" /></p>
<p>Much care was taken to ensure that ACL has consistent decompression performance. The following two images show the time taken to decompress a pose at every point of the Matinee fight scene which highlights how regular ACL is.</p>
<p><img src="/public/acl/acl_plugin_v020_decomp_x8_matinee_ue4_variance.png" alt="Matinee UE4 S8 Performance Variance" />
<img src="/public/acl/acl_plugin_v020_decomp_x8_matinee_acl_variance.png" alt="Matinee ACL S8 Performance Variance" /></p>
<p>It also has consistent decompression performance <a href="https://github.com/nfrechette/acl/blob/develop/docs/decompression_performance.md#uniformly-sampled-algorithm">regardless of the playback direction</a> and it works on every modern platform making it a safe choice when using it as the default algorithm in your games.</p>
<p>Overall, ACL is ideal for games with large amounts of animations playing concurrently such as those with large crowds, MMOs, and e-sports as well as those that run on mobile or slower platforms.</p>
Animation Compression Library: Release 1.0.02018-07-21T00:00:00+00:00http://nfrechette.github.io/2018/07/21/acl_v1.0.0<p>The long awaited <strong>ACL v1.0</strong> release is finally <a href="https://github.com/nfrechette/acl/releases/tag/v1.0.0">here</a>! And it comes with the brand new <a href="https://github.com/nfrechette/acl-ue4-plugin/releases/tag/v0.1.0">Unreal Engine 4 plugin v0.1</a>! It took over <strong>15</strong> months of late nights, days off, and weekends to reach this point and I couldn’t be more pleased with the results.</p>
<h1 id="recap">Recap</h1>
<p>The core idea behind ACL was to explore a different way to perform animation compression, one that departed from classic methods. Unlike the vast majority of algorithms in the wild, it uses bit aligned values as opposed to naturally aligned integers. This is slower to unpack but I hoped to compensate by not performing any sort of key reduction. By retaining every sample, the data is uniform in memory and offsets are trivially calculated, keeping things fast, the memory touched contiguous, and the hardware happy. While the technique itself isn’t novel and is often used with compression algorithms in other fields, to my knowledge it had never been tried to the extent ACL pushes it with animation compression, at least not publicly.</p>
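<p>To illustrate the idea, here is a hedged sketch (not ACL’s actual implementation) of reading a value at an arbitrary bit offset. Because every sample uses the same bit width, the offset of any sample is a trivial multiply instead of a per-key lookup:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstdint>

// Read num_bits starting at an arbitrary bit offset (MSB-first convention).
// With a fixed bit width per sample, a sample's offset is a simple multiply:
//     bit_offset = sample_index * num_bits_per_sample
inline uint64_t read_bits(const uint8_t* buffer, uint64_t bit_offset, uint32_t num_bits)
{
    uint64_t result = 0;
    for (uint32_t i = 0; i < num_bits; ++i)
    {
        const uint64_t bit = bit_offset + i;
        const uint64_t bit_value = (buffer[bit / 8] >> (7 - (bit % 8))) & 1;
        result = (result << 1) | bit_value;
    }
    return result;
}
</code></pre></div></div>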
<p>Very early, the technique proved competitive and over time it emerged as a superior alternative to traditional techniques involving key reduction. I then spent about 8 months writing the necessary infrastructure to make ACL not only production ready but production quality: unit tests were written, extensive regression tests were introduced, documentation and comments were added, scripts to replicate the results were created, cross platform support was completed (ACL now runs on <strong><em>every</em></strong> platform!), etc. All the good stuff one would expect from a professional product.</p>
<p>But don’t take my word for it! Check out the <strong>100% C++</strong> code (<em>MIT license</em>), the statistics below, and take the plugin out for a spin!</p>
<h1 id="performance">Performance</h1>
<p>While ACL provides various synthetic test harnesses to benchmark and extract statistics, nothing beats running it within a real game engine. This is where the <em>UE4</em> plugin comes in and really shines. Just as with ACL, three data sets are measured: CMU, Paragon, and the Matinee fight scene.</p>
<p><em>Note that there are small differences between measuring with the UE4 plugin and with the ACL test harnesses due to implementation choices in the plugin.</em></p>
<h2 id="carnegie-mellon-university-cmu">Carnegie-Mellon University (CMU)</h2>
<table>
<thead>
<tr>
<th> </th>
<th>ACL Plugin v0.1.0</th>
<th>UE v4.19.2</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Compressed size</strong></td>
<td>70.60 MB</td>
<td>99.94 MB</td>
</tr>
<tr>
<td><strong>Compression ratio</strong></td>
<td>20.25 : 1</td>
<td>14.30 : 1</td>
</tr>
<tr>
<td><strong>Max error</strong></td>
<td>0.0722 cm</td>
<td>0.0996 cm</td>
</tr>
<tr>
<td><strong>Compression time</strong></td>
<td>34m 30.51s</td>
<td>1h 27m 40.15s</td>
</tr>
</tbody>
</table>
<p>ACL was smaller for <strong>2532</strong> clips (<strong>99.92 %</strong>)<br />
ACL was more accurate for <strong>2486</strong> clips (<strong>98.11 %</strong>)<br />
ACL had faster compression for <strong>2534</strong> clips (<strong>100.00 %</strong>)<br />
ACL was smaller, better, and faster for <strong>2484</strong> clips (<strong>98.03 %</strong>)</p>
<p>Had the <em>ACL Plugin</em> been included in the <em>Automatic Compression</em> permutations tried, it would have won for <strong>2534</strong> clips (<strong>100.00 %</strong>)</p>
<p>Data tracked <a href="https://github.com/nfrechette/acl-ue4-plugin/blob/develop/Docs/cmu_performance.md">here</a> by the plugin, and <a href="https://github.com/nfrechette/acl/blob/develop/docs/cmu_performance.md">here</a> by ACL.</p>
<h2 id="paragon">Paragon</h2>
<table>
<thead>
<tr>
<th> </th>
<th>ACL Plugin v0.1.0</th>
<th>UE v4.19.2</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Compressed size</strong></td>
<td>226.02 MB</td>
<td>392.97 MB</td>
</tr>
<tr>
<td><strong>Compression ratio</strong></td>
<td>18.92 : 1</td>
<td>10.88 : 1</td>
</tr>
<tr>
<td><strong>Max error</strong></td>
<td>0.8566 cm</td>
<td>0.6424 cm</td>
</tr>
<tr>
<td><strong>Compression time</strong></td>
<td>6h 35m 03.24s</td>
<td>15h 10m 23.56s</td>
</tr>
</tbody>
</table>
<p>ACL was smaller for <strong>6413</strong> clips (<strong>97.79 %</strong>)<br />
ACL was more accurate for <strong>4972</strong> clips (<strong>75.82 %</strong>)<br />
ACL had faster compression for <strong>5948</strong> clips (<strong>90.70 %</strong>)<br />
ACL was smaller, better, and faster for <strong>4499</strong> clips (<strong>68.60 %</strong>)</p>
<p>Had the <em>ACL Plugin</em> been included in the <em>Automatic Compression</em> permutations tried, it would have won for <strong>6098</strong> clips (<strong>92.99 %</strong>)</p>
<p>Data tracked <a href="https://github.com/nfrechette/acl-ue4-plugin/blob/develop/Docs/paragon_performance.md">here</a> by the plugin, and <a href="https://github.com/nfrechette/acl/blob/develop/docs/paragon_performance.md">here</a> by ACL.</p>
<h2 id="matinee-fight-scene">Matinee fight scene</h2>
<table>
<thead>
<tr>
<th> </th>
<th>ACL Plugin v0.1.0</th>
<th>UE v4.19.2</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Compressed size</strong></td>
<td>8.67 MB</td>
<td>23.67 MB</td>
</tr>
<tr>
<td><strong>Compression ratio</strong></td>
<td>7.20 : 1</td>
<td>2.63 : 1</td>
</tr>
<tr>
<td><strong>Max error</strong></td>
<td>0.0674 cm</td>
<td>0.0672 cm</td>
</tr>
<tr>
<td><strong>Compression time</strong></td>
<td>52.44s</td>
<td>54m 03.18s</td>
</tr>
</tbody>
</table>
<p>ACL was smaller for <strong>1</strong> clip (<strong>20 %</strong>)<br />
ACL was more accurate for <strong>4</strong> clips (<strong>80 %</strong>)<br />
ACL had faster compression for <strong>5</strong> clips (<strong>100 %</strong>)<br />
ACL was smaller, better, and faster for <strong>0</strong> clip (<strong>0 %</strong>)</p>
<p>Had the <em>ACL Plugin</em> been included in the <em>Automatic Compression</em> permutations tried, it would have won for <strong>3</strong> clips (<strong>60 %</strong>)</p>
<p>Data tracked <a href="https://github.com/nfrechette/acl-ue4-plugin/blob/develop/Docs/fight_scene_performance.md">here</a> by the plugin, and <a href="https://github.com/nfrechette/acl/blob/develop/docs/fight_scene_performance.md">here</a> by ACL.</p>
<h2 id="decompression-performance">Decompression performance</h2>
<p><img src="/public/acl/acl_plugin_v010_decomp_s8_matinee.png" alt="Matinee S8 Median Performance" /></p>
<p><img src="/public/acl/acl_plugin_v010_decomp_s8_playground.png" alt="Playground S8 Median Performance" /></p>
<p>Data tracked <a href="https://github.com/nfrechette/acl-ue4-plugin/blob/develop/Docs/fight_scene_performance.md">here</a> by the plugin, and <a href="https://github.com/nfrechette/acl/blob/develop/docs/fight_scene_performance.md">here</a> by ACL (they also include other platforms and more data).</p>
<h2 id="performance-summary">Performance summary</h2>
<p>As the numbers clearly show, ACL beats <em>UE4</em> across every compression metric, sometimes by a significant margin: it is <em>MUCH</em> faster to compress, the quality is just as good, and the memory footprint is significantly reduced. ACL achieves all of this with default settings that animators rarely if <em>ever</em> need to tweak. What’s not to love?</p>
<p>However, the ACL decompression performance is sometimes ahead, sometimes behind, and sometimes the same. There are a few reasons for this, most of which I am hoping to fix in the next version to take the lead: <em>NEON</em> (<em>SIMD</em>) is not yet used on <em>ARM</em>, the ACL plugin needlessly performs <em>MUCH</em> more work than <em>UE4</em> when decompressing, and much low-hanging fruit was deliberately left for after the 1.0 release.</p>
<p><strong><em>ACL is just getting started!</em></strong></p>
<h1 id="how-to-use-the-acl-plugin">How to use the ACL Plugin</h1>
<p>As the documentation states <a href="https://github.com/nfrechette/acl-ue4-plugin/blob/develop/Docs/README.md">here</a>, a few minor engine changes are required in order to support the ACL plugin. These changes mostly consist of bug fixes and changes to expose the necessary hooks to plugins.</p>
<p>For the time being, the plugin is not yet on the marketplace as it is not fully plug-and-play. However, this summer I am working with <em>Epic</em> to introduce the necessary changes in order to publish the ACL plugin on the marketplace. <strong>Stay tuned!</strong></p>
<p><em>Note that the ACL Plugin will reach <strong>v1.0</strong> once it can be published on the marketplace but it is production ready regardless.</em></p>
<h1 id="whats-new-in-acl-v10">What’s new in ACL v1.0</h1>
<p>Few things actually changed in between <em>v0.8</em> and <em>v1.0</em>. Most of the changes revolved around minor additions, documentation updates, etc. There are two notable changes:</p>
<ul>
<li>The first is visible in the <a href="https://github.com/nfrechette/acl/blob/develop/docs/decompression_performance.md">decompression graphs</a>: we now yield the thread before measuring every sample. This helps ensure more stable results by reducing the likelihood that the kernel will swap out the thread and interrupt it while executing the decompression code.</li>
<li>The second is visible in the <a href="https://github.com/nfrechette/acl/blob/develop/docs/paragon_performance.md">compression stats</a> for Paragon: a bug was causing the visible error to sometimes be partially hidden when 3D scale is present. While the new version is not less accurate than the previous, the measured error can be higher in very rare cases (only <strong>1</strong> clip is higher).</li>
</ul>
<p>Regardless, the measuring should now be much more stable.</p>
<h1 id="whats-next">What’s next</h1>
<p>The next release of ACL will focus on improving the <a href="https://github.com/nfrechette/acl/milestone/6">compression and decompression performance</a>. While ACL was built from the ground up to be fast to decompress; so far the focus has been on making sure things function properly and safely to establish a solid baseline to work with. Now that this work is done, the fun part can begin: making it the best it can be! I have many improvements planned and while some of them will make it in <strong>v1.1</strong>, others will have to wait for future versions.</p>
<p>Special care will be taken to make sure ACL performs at its best in <em>UE4</em> but there is no reason why it couldn’t be used in your own favorite game engine or animation middleware. Developing with <em>UE4</em> is easier for me in large part because of my past experience with it, my relationship with <em>Epic</em>, and the fact that it is open source. Other game engines like Unity explicitly forbid their use for benchmarking purposes in their <em>EULA</em> which prevents me from publishing any results without prior written agreement from their legal department. Furthermore, without access to the source code, creating a plugin for it requires a lot more work. In due time, I hope to support Unity, Godot, and anyone else willing to try it out.</p>
Animation Compression Library: Release 0.8.02018-05-12T00:00:00+00:00http://nfrechette.github.io/2018/05/12/acl_v0.8.0<p>Today marks the <strong>v0.8</strong> <a href="https://github.com/nfrechette/acl/releases/tag/v0.8.0">release</a> of the <em>Animation Compression Library</em>. It contains lots of goodies but by far the most significant point is the fact that it has now reached feature parity with <em>Unreal 4</em>. For the first time <em>Unreal 4</em> games should be able to run exclusively with <em>ACL</em>. The focus for the next 2 months will be to validate this with my custom <em>UE 4.15</em> integration and implement whatever might be missing as well as to create a proper and free plugin to bring <em>ACL</em> to the marketplace.</p>
<p>While I have already published some <a href="/2018/05/11/acl_decompression_baseline/">decompression performance numbers</a> earlier this week, once a proper integration has been made, new numbers will be published to showcase how <em>ACL</em> performs against <em>Unreal 4</em> within the game engine itself. The existing numbers for the <a href="https://github.com/nfrechette/acl/blob/develop/docs/cmu_performance.md">Carnegie-Mellon University</a> database, <a href="https://github.com/nfrechette/acl/blob/develop/docs/paragon_performance.md">Paragon</a>, and the <a href="https://github.com/nfrechette/acl/blob/develop/docs/fight_scene_performance.md">Matinee fight scene</a> already clearly show <em>ACL</em> to be ahead in terms of compression time, compression ratio, and accuracy. However, while it remains to be seen if it will also be ahead with its decompression performance, I fully expect that it will.</p>
ACL decompression performance2018-05-11T00:00:00+00:00http://nfrechette.github.io/2018/05/11/acl_decompression_baseline<p>At long last, I got around to measuring the decompression performance of ACL. This blog post will detail the baseline performance from which we will measure future progress. As I have <a href="http://nfrechette.github.io/2018/04/02/acl_v0.7.0/">previously mentioned</a>, no effort has been made so far to optimize decompression and I hope to remedy that following the v1.0 release scheduled around June 2018.</p>
<p>In order to establish a reliable data set to measure against, I use the same <strong>42</strong> clips used for <a href="https://github.com/nfrechette/acl/blob/develop/test_data/README.md">regression testing</a> plus <strong>5</strong> more from the <a href="https://github.com/nfrechette/acl/blob/develop/docs/fight_scene_performance.md">Matinee fight scene</a>. To keep things interesting, I measure performance on everything I have on hand:</p>
<ul>
<li>Desktop: <a href="https://ark.intel.com/products/94188/Intel-Core-i7-6850K-Processor-15M-Cache-up-to-3_80-GHz">Intel i7-6850K @ 3.8 GHz</a></li>
<li>Laptop: <a href="https://support.apple.com/kb/sp703?locale=en_US">MacBook Pro mid 2014 @ 2.6 GHz</a></li>
<li>Phone: <a href="https://www.androidcentral.com/nexus-5x-specs">Android Nexus 5X @ 1.8 GHz</a></li>
<li>Tablet: <a href="https://en.wikipedia.org/wiki/IPad_Pro">iPad Pro 10.5 inch @ 2.39 GHz</a></li>
</ul>
<p>The first two use both <em>x86</em> and <em>x64</em> while the latter two use <em>armv7-a</em> and <em>arm64</em> respectively. Furthermore, on the desktop I also compare <em>VS 2015</em>, <em>VS 2017</em>, <em>GCC 7</em>, and <em>Clang 5</em>. The more data, the merrier!</p>
<p>Decompression is measured both with a warm CPU cache to remove the memory fetches as much as possible from the equation as well as with a cold CPU cache to simulate a more realistic game engine playback scenario.</p>
<p>Three forms of playback are measured: forward, backward, and random.</p>
<p>Each clip is sampled <strong>3</strong> times at every key frame based on the clip sample rate and the smallest value is retained for that key.</p>
<p>Finally, two ways to decompress are profiled: decompressing a whole pose in one go (<code class="language-plaintext highlighter-rouge">decompress_pose</code>), and decompressing a whole pose bone by bone (<code class="language-plaintext highlighter-rouge">decompress_bone</code>).</p>
<p>The profiling harness is not perfect but I hope the extensive data pulled from it will be sufficient for our purposes.</p>
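<p>As a rough sketch of the per-key measurement loop described above (a simplified illustration, not the actual harness):</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <chrono>
#include <algorithm>
#include <limits>

// Sample a key 3 times and keep the fastest run to filter out scheduler
// noise. 'Decompress' stands in for whatever decompression call is measured.
template<typename Decompress>
double measure_key_us(float sample_time, Decompress&& decompress)
{
    double best_us = std::numeric_limits<double>::max();
    for (int pass = 0; pass < 3; ++pass)
    {
        const auto start = std::chrono::high_resolution_clock::now();
        decompress(sample_time);
        const auto end = std::chrono::high_resolution_clock::now();
        const std::chrono::duration<double, std::micro> elapsed = end - start;
        best_us = std::min(best_us, elapsed.count());
    }
    return best_us;
}
</code></pre></div></div>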
<h2 id="playback-direction">Playback direction</h2>
<p>In a real game, the overwhelming majority of clips play forward in time. Some clips play backwards (e.g. opening and closing a chest might use the same animation played in reverse) and a few others play randomly (e.g. driven by a thumb stick).</p>
<p>Not all algorithms will exhibit the same performance regardless of playback direction. In particular, forms of delta encoding as well as any caching of the last played position will severely degrade when the playback direction isn’t the one they were optimized for (as is often the case with <a href="http://nfrechette.github.io/2016/12/10/anim_compression_curve_fitting/">key reduction</a> techniques due to the data being sorted by time).</p>
<p>ACL currently uses the <a href="https://github.com/nfrechette/acl/blob/develop/docs/algorithm_uniformly_sampled.md">uniformly sampled</a> algorithm which offers consistent performance regardless of the playback direction. To validate this claim, I hand picked <strong>3</strong> clips that are fairly long: <em>104_30</em> (<strong>44</strong> bones, <strong>11</strong> seconds) from CMU, and <em>Trooper_1</em> (<strong>71</strong> bones, <strong>66</strong> seconds) and <em>Trooper_Main</em> (<strong>541</strong> bones, <strong>66</strong> seconds) from the Matinee fight scene. To visualize the performance, I used a <a href="http://www.statisticshowto.com/probability-and-statistics/descriptive-statistics/box-plot/">box and whiskers chart</a> which shows concisely the min/max as well as the quartiles. Forward playback is shown in <strong>Red</strong>, backward in <strong>Green</strong>, and random in <strong>Blue</strong>.</p>
<p><img src="/public/acl/acl_decomp_v080_vs2015_x64_playback.png" alt="VS 2015 x64 Playback Performance" /></p>
<p>As we can see, the performance is identical for all intents and purposes regardless of the playback direction on my desktop with <em>VS 2015 x64</em>. Let’s see if this claim holds true on my <em>iPad</em> as well.</p>
<p><img src="/public/acl/acl_decomp_v080_ios_arm64_playback.png" alt="iOS arm64 Playback Performance" /></p>
<p>Here again we see that the performance is consistent. One thing that shows up on this chart is that, surprisingly, the <em>iPad</em> performance is <em>often</em> better than my <em>desktop’s</em>! That is <strong>INSANE</strong> and I nearly fell off my chair when I first saw this. Not only is the CPU clocked at a lower frequency but the desktop code makes use of <em>SSE</em> and <em>AVX</em> where it can for all basic vector arithmetic while there is currently no corresponding <em>NEON</em> SIMD support. I double and triple checked the numbers and the code. Suspecting that the compiler might be playing a big part in this, I undertook to dump all the compiler stats on desktop, something I did not originally intend to do. Read on!</p>
<h2 id="the-cpu-cache">The CPU cache</h2>
<p>Because animation clips are typically sampled once per rendered image, the CPU cache will generally be cold during decompression. Fortunately for us, modern CPUs offer hardware prefetching which greatly helps when reads are linear. The uniformly sampled algorithm ACL uses is uniquely optimized for this with <strong>ALL</strong> reads being linear and split into <strong>4</strong> streams: constant track values, clip range data, segment range data, and the animated segment data.</p>
<p><em>Notes: ACL does not currently have any software prefetching and the constant track and clip range data will <a href="https://github.com/nfrechette/acl/issues/72">later</a> be merged into a single stream since a track is one of three types: default (in which case there is neither constant nor range data), constant with no range data, or animated with range data and thus not constant.</em></p>
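<p>For reference, software prefetching of the kind mentioned in the note typically looks like the following hedged sketch (again, ACL does not currently contain this code):</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hint the CPU to start fetching the cache line at 'ptr' ahead of its use.
// GCC/Clang and MSVC x86/x64 shown; other targets compile to a no-op.
#if defined(__GNUC__) || defined(__clang__)
inline void prefetch(const void* ptr) { __builtin_prefetch(ptr); }
#elif defined(_MSC_VER)
#include <xmmintrin.h>
inline void prefetch(const void* ptr) { _mm_prefetch(static_cast<const char*>(ptr), _MM_HINT_T0); }
#else
inline void prefetch(const void*) {}
#endif
</code></pre></div></div>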
<p>For this reason, a cold cache is what will most interest us. That being said, I also measured with a warm CPU cache. This will allow us to see how much time is spent waiting on memory versus executing instructions. It will also allow us to compare the various platforms in terms of CPU and memory speed.</p>
<p>In the following graphs, the <em>x86</em> performance was omitted because for every compiler it is slower than <em>x64</em> (ranging from 25% slower up to 200%) except for my <em>OS X</em> laptop where the performance was nearly identical. I also omitted the <em>VS 2017</em> performance because it was identical to <em>VS 2015</em>. Forward playback is used along with <code class="language-plaintext highlighter-rouge">decompress_pose</code>. The median decompression time is shown.</p>
<p>Two new clips were added to the graphs to get a better picture.</p>
<p><img src="/public/acl/acl_decomp_v080_cold_cache1.png" alt="Cold CPU Cache Performance" />
<img src="/public/acl/acl_decomp_v080_cold_cache2.png" alt="Cold CPU Cache Performance cont." /></p>
<p>Again, we can see that the <em>iPad</em> outperforms almost everything with a cold cache except on the desktop with <em>GCC 7</em> and <em>Clang 5</em>. It is clear that <em>Clang</em> does an outstanding job and plays an integral part in the surprising <em>iPad</em> performance. Another point worth noting is that its memory is faster than what I have in my desktop. My <em>iPad</em> has memory clocked at <strong>1600 MHz (25 GB/s)</strong> while my desktop has its memory clocked at <strong>1067 MHz (16.6 GB/s)</strong>.</p>
<p>And now with a warm cache:</p>
<p><img src="/public/acl/acl_decomp_v080_warm_cache1.png" alt="Warm CPU Cache Performance" />
<img src="/public/acl/acl_decomp_v080_warm_cache2.png" alt="Warm CPU Cache Performance cont." /></p>
<p>We can see that the <em>iPad</em> now loses out to <em>VS 2015</em> with one exception: <em>Trooper_Main</em>. Why is that? That particular clip should easily fit within the CPU cache: only about <strong>40KB</strong> is touched when sampling (or about <strong>650</strong> cache lines). Further research led to another interesting fact: the <em>iPad A10X</em> processor has a <strong>64KB L1</strong> data cache per core (and <strong>8 MB L2</strong> shared) while my <em>i7-6850K</em> has a <strong>32KB L1</strong> data cache and a <strong>256KB L2</strong> (with <strong>15MB L3</strong> shared). The clip thus fits entirely within the L1 on the <em>iPad</em> but needs to be fetched from the L2 on desktop.</p>
<p>Another takeaway from these graphs is that <em>GCC 7</em> beats <em>VS 2015</em> and <em>Clang 5</em> beats both hands down on my desktop.</p>
<p>Finally, my <em>Nexus 5X</em> is <strong>really</strong> slow. On all the graphs, it exceeded any reasonable scale and I had to truncate it. I included it for the sake of completeness and to get a sense of how much slower it was.</p>
<h2 id="decompression-method">Decompression method</h2>
<p>ACL currently offers two ways to decompress: <code class="language-plaintext highlighter-rouge">decompress_pose</code> and <code class="language-plaintext highlighter-rouge">decompress_bone</code>. The former is more efficient if the whole pose is required but in practice it is very common to decompress specific bones individually or to decompress a pose bone by bone.</p>
<p>The following charts use the median decompression time with a cold CPU cache and forward playback.</p>
<p><img src="/public/acl/acl_decomp_v080_function_performance_cmu.png" alt="Function Performance on CMU" />
<img src="/public/acl/acl_decomp_v080_function_performance_matinee.png" alt="Function Performance on the Matinee Fight" /></p>
<p>Once more, we see very clearly how outstanding and consistent the <em>iPad</em> performance is. The numbers for the <em>Nexus 5X</em> are very noisy in comparison in large part because of the slower memory and larger footprint of some clips (<code class="language-plaintext highlighter-rouge">decompress_bone</code> is not shown for <em>Android</em> because it was far too slow and prevented a clean view of everything else).</p>
<p>We can clearly see that decompressing each bone separately is much slower and this is entirely because at the time of writing, each bone not required needs to be skipped over instead of using a direct look up with an offset. This will be optimized soon and the performance should end up much closer.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Despite having no external reference frame to compare them against, I could confirm and validate my hunches as well as observe a few interesting things:</p>
<ul>
<li>My <em>Nexus 5X</em> is really slow …</li>
<li>Both <em>GCC 7</em> and <em>Clang 5</em> generate much better code than <em>VS 2017</em></li>
<li><code class="language-plaintext highlighter-rouge">decompress_bone</code> is much slower than it needs to be</li>
<li>The playback direction has no impact on performance</li>
</ul>
<p>By far the most surprising thing to me was the <em>iPad</em> performance. Even though what I measure is not representative of ordinary application code, the numbers clearly demonstrate that the single core decompression performance matches that of a modern desktop. It might even exceed the single core performance of an <em>Xbox One</em> or <em>PlayStation 4</em>! <strong>Wow!!</strong></p>
<p>I do have some baseline Unreal 4 numbers on hand but this blog post is already getting long and the next ACL version aims to be integrated into a native Unreal 4 plugin which will allow for a superior comparison to be made. However, they do show that ACL will be very close and will likely <strong>exceed</strong> the UE 4.15 decompression performance; stay tuned!</p>
How much does additive bind pose help?2018-05-08T00:00:00+00:00http://nfrechette.github.io/2018/05/08/anim_compression_additive_bind<p>A common trick when compressing an animation clip is to store it relative to the bind pose. The conventional wisdom is that this allows us to <a href="/2016/11/09/anim_compression_range_reduction/">reduce the range of motion</a> of many bones, increasing the accuracy and the likelihood that constant bones will turn into the identity, and thus lowering the memory footprint. I have implemented this specific feature many times in the past and the results were consistent: a memory reduction of <strong>3-5%</strong> was generally observed.</p>
<p>Now that the <a href="https://github.com/nfrechette/acl">Animation Compression Library</a> supports additive animation clips, I thought it would be a good idea to test this claim once more.</p>
<h2 id="how-it-works">How it works</h2>
<p>The concept is very simple to implement:</p>
<ul>
<li>Before compression happens, the bind pose is removed from the clip by subtracting it from every key.</li>
<li>Then, the clip is compressed as usual.</li>
<li>Finally, after we decompress a pose, we simply add back the bind pose.</li>
</ul>
<p>The transformation is lossless aside from whatever loss happens as a result of floating point precision. It has two primary side effects.</p>
<p>The first is that bone translations end up having a much shorter range of motion. For example, a walking character might have the pelvic bone about <strong>60cm</strong> up from the ground (and root bone). The range of motion will thus hover around this large value for the whole track. Removing the bind pose brings the track closer to zero since the bind pose value of that bone is likely very near <strong>60cm</strong>. Smaller floating point values generally retain higher accuracy. The principle is identical to normalizing a track within its range.</p>
<p>The second impacts constant tracks. If the pelvic bone is not animated in a clip, it will retain some constant value. This value is often the bind pose itself. When this happens, removing the bind pose yields the identity rotation and translation. Since these values are trivial to reconstruct at runtime, instead of having to store the constant floating point values, we can store a simple bit set.</p>
<p>As a result, hand animated clips with the bind pose removed often find themselves with a lower memory footprint following compression.</p>
<p>Mathematically speaking, how the bind pose is added or removed can be done in a number of ways, much like additive animation clips. While additive animation clips heavily depend on the animation runtime, ACL now supports <a href="https://github.com/nfrechette/acl/blob/develop/docs/additive_clips.md">three variants</a>:</p>
<ul>
<li>Relative space</li>
<li>Additive space 0</li>
<li>Additive space 1</li>
</ul>
<p><em>The last two names are not very creative or descriptive… Suggestions welcome!</em></p>
<h2 id="relative-space">Relative space</h2>
<p>In this space, the clip is reconstructed by multiplying the bind pose with a normal <code class="language-plaintext highlighter-rouge">transform_mul</code> operation. For example, this is the same operation used to convert from local space to object space. Performance wise, this is the slowest: to reconstruct our value we end up having to perform <strong>3</strong> quaternion multiplications and if negative scale is present in the clip, it is even slower (extra code not shown below, see <a href="https://github.com/nfrechette/acl/blob/develop/includes/acl/math/transform_32.h">here</a>).</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Transform</span> <span class="nf">transform_mul</span><span class="p">(</span><span class="k">const</span> <span class="n">Transform</span><span class="o">&</span> <span class="n">lhs</span><span class="p">,</span> <span class="k">const</span> <span class="n">Transform</span><span class="o">&</span> <span class="n">rhs</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">Quat</span> <span class="n">rotation</span> <span class="o">=</span> <span class="n">quat_mul</span><span class="p">(</span><span class="n">lhs</span><span class="p">.</span><span class="n">rotation</span><span class="p">,</span> <span class="n">rhs</span><span class="p">.</span><span class="n">rotation</span><span class="p">);</span>
    <span class="n">Vector4</span> <span class="n">translation</span> <span class="o">=</span> <span class="n">vector_add</span><span class="p">(</span><span class="n">quat_rotate</span><span class="p">(</span><span class="n">rhs</span><span class="p">.</span><span class="n">rotation</span><span class="p">,</span> <span class="n">vector_mul</span><span class="p">(</span><span class="n">lhs</span><span class="p">.</span><span class="n">translation</span><span class="p">,</span> <span class="n">rhs</span><span class="p">.</span><span class="n">scale</span><span class="p">)),</span> <span class="n">rhs</span><span class="p">.</span><span class="n">translation</span><span class="p">);</span>
    <span class="n">Vector4</span> <span class="n">scale</span> <span class="o">=</span> <span class="n">vector_mul</span><span class="p">(</span><span class="n">lhs</span><span class="p">.</span><span class="n">scale</span><span class="p">,</span> <span class="n">rhs</span><span class="p">.</span><span class="n">scale</span><span class="p">);</span> <span class="c1">// simplified; the full version linked above handles negative scale</span>
    <span class="k">return</span> <span class="n">transform_set</span><span class="p">(</span><span class="n">rotation</span><span class="p">,</span> <span class="n">translation</span><span class="p">,</span> <span class="n">scale</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<h2 id="additive-space-0">Additive space 0</h2>
<p>This is the first of the two classic additive spaces. It simply multiplies the rotations, adds the translations, and multiplies the scales. The animation runtime <a href="http://guillaumeblanc.github.io/ozz-animation/">ozz-animation</a> uses this format. Performance wise, this is the fastest implementation.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Transform</span> <span class="nf">transform_add0</span><span class="p">(</span><span class="k">const</span> <span class="n">Transform</span><span class="o">&</span> <span class="n">base</span><span class="p">,</span> <span class="k">const</span> <span class="n">Transform</span><span class="o">&</span> <span class="n">additive</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">Quat</span> <span class="n">rotation</span> <span class="o">=</span> <span class="n">quat_mul</span><span class="p">(</span><span class="n">additive</span><span class="p">.</span><span class="n">rotation</span><span class="p">,</span> <span class="n">base</span><span class="p">.</span><span class="n">rotation</span><span class="p">);</span>
<span class="n">Vector4</span> <span class="n">translation</span> <span class="o">=</span> <span class="n">vector_add</span><span class="p">(</span><span class="n">additive</span><span class="p">.</span><span class="n">translation</span><span class="p">,</span> <span class="n">base</span><span class="p">.</span><span class="n">translation</span><span class="p">);</span>
<span class="n">Vector4</span> <span class="n">scale</span> <span class="o">=</span> <span class="n">vector_mul</span><span class="p">(</span><span class="n">additive</span><span class="p">.</span><span class="n">scale</span><span class="p">,</span> <span class="n">base</span><span class="p">.</span><span class="n">scale</span><span class="p">);</span>
<span class="k">return</span> <span class="n">transform_set</span><span class="p">(</span><span class="n">rotation</span><span class="p">,</span> <span class="n">translation</span><span class="p">,</span> <span class="n">scale</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
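<p>For completeness, here is a hedged sketch of the corresponding inverse operation that removes the bind pose in this space before compression. The <code class="language-plaintext highlighter-rouge">quat_conjugate</code>, <code class="language-plaintext highlighter-rouge">vector_sub</code>, and <code class="language-plaintext highlighter-rouge">vector_div</code> helpers are assumed to exist in the same style as the functions shown in this post:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Transform transform_remove0(const Transform& base, const Transform& pose)
{
    // Inverse of transform_add0: recover the additive part from the pose.
    Quat rotation = quat_mul(pose.rotation, quat_conjugate(base.rotation));
    Vector4 translation = vector_sub(pose.translation, base.translation);
    Vector4 scale = vector_div(pose.scale, base.scale);
    return transform_set(rotation, translation, scale);
}
</code></pre></div></div>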
<h2 id="additive-space-1">Additive space 1</h2>
<p>This last additive space combines the base pose in the same way as the previous except for the scale component. This is the format used by Unreal 4. Performance wise, it is very close to the previous space but requires an extra instruction or two.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Transform</span> <span class="nf">transform_add1</span><span class="p">(</span><span class="k">const</span> <span class="n">Transform</span><span class="o">&</span> <span class="n">base</span><span class="p">,</span> <span class="k">const</span> <span class="n">Transform</span><span class="o">&</span> <span class="n">additive</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">Quat</span> <span class="n">rotation</span> <span class="o">=</span> <span class="n">quat_mul</span><span class="p">(</span><span class="n">additive</span><span class="p">.</span><span class="n">rotation</span><span class="p">,</span> <span class="n">base</span><span class="p">.</span><span class="n">rotation</span><span class="p">);</span>
<span class="n">Vector4</span> <span class="n">translation</span> <span class="o">=</span> <span class="n">vector_add</span><span class="p">(</span><span class="n">additive</span><span class="p">.</span><span class="n">translation</span><span class="p">,</span> <span class="n">base</span><span class="p">.</span><span class="n">translation</span><span class="p">);</span>
<span class="n">Vector4</span> <span class="n">scale</span> <span class="o">=</span> <span class="n">vector_mul</span><span class="p">(</span><span class="n">vector_add</span><span class="p">(</span><span class="n">vector_set</span><span class="p">(</span><span class="mf">1.0</span><span class="n">f</span><span class="p">),</span> <span class="n">additive</span><span class="p">.</span><span class="n">scale</span><span class="p">),</span> <span class="n">base</span><span class="p">.</span><span class="n">scale</span><span class="p">);</span>
<span class="k">return</span> <span class="n">transform_set</span><span class="p">(</span><span class="n">rotation</span><span class="p">,</span> <span class="n">translation</span><span class="p">,</span> <span class="n">scale</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>It is worth noting that because these two additive spaces differ only by how they handle scale, if the animation clip has none, both methods will yield identical results.</p>
<h2 id="results">Results</h2>
<p>Measuring the impact is simple: I simply enabled all three modes one by one and compressed all of the <a href="https://github.com/nfrechette/acl/blob/develop/docs/cmu_performance.md">Carnegie-Mellon University</a> motion capture database as well as all of the <a href="https://github.com/nfrechette/acl/blob/develop/docs/paragon_performance.md">Paragon</a> data set. Decompression performance was not measured on its own but the compression time will serve as a hint as to how it would perform.</p>
<p>Everything has been measured with my desktop using <em>Visual Studio 2015</em> with <em>AVX</em> support enabled with up to <strong>4</strong> clips being compressed in parallel. All measurements were performed with the upcoming <em>ACL v0.8</em> release.</p>
<p><img src="/public/acl/acl_cmu_bind_additive_results.png" alt="CMU Results" /></p>
<p>CMU has no scale and it is thus no surprise that the two additive formats perform the same. The memory footprint and the max error remain overall largely identical but as expected the compression time degrades. No gain is observed from this technique which further highlights how this data set differs from hand authored animations.</p>
<p><img src="/public/acl/acl_paragon_bind_additive_results.png" alt="Paragon Results" /></p>
<p>Paragon shows the results I was expecting. The memory footprint reduces by about <strong>7.9%</strong> which is quite significant and the max error improves as well. Again, we can see both additive methods performing equally well. The relative space clearly loses out here and fails to show significant gains to compensate for the dramatically worse compression performance.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Overall it seems clear that any potential gains from this technique are heavily data dependent. A nearly <strong>8%</strong> smaller memory footprint is nothing to sneeze at but in the grand scheme of things, it might no longer be worth it in 2018 when decompression performance is likely much more important, especially on mobile devices. It is not immediately clear to me if the reduction in memory footprint could save enough to translate into fewer cache lines being fetched, but even so it seems unlikely that it would offset the extra cost of the math involved.</p>
<p>See also the related <a href="/2022/01/23/anim_compression_bind_pose_stripping/">bind pose stripping</a> optimization.</p>
<p><a href="/2016/10/21/anim_compression_toc/"><strong>Back to table of contents</strong></a></p>
Animation Compression Library: Release 0.7.02018-04-02T00:00:00+00:00http://nfrechette.github.io/2018/04/02/acl_v0.7.0<p>Almost a year ago, I began working on <em>ACL</em> and it is now one step closer to being production ready with the new <strong>v0.7</strong> <a href="https://github.com/nfrechette/acl/releases/tag/v0.7.0">release</a>!</p>
<p>This new release is significant for several reasons but two stand out above all others:</p>
<ul>
<li>Exhaustive automated testing</li>
<li>Full multi-platform support</li>
</ul>
<p>Unlike previous releases, the performance remained unchanged since <strong>v0.6</strong> but I went ahead and updated the <a href="https://github.com/nfrechette/acl/tree/develop/docs#performance-metrics">stats and graphs</a> regardless.</p>
<h1 id="testing-testing-one-two">Testing, Testing, One, Two</h1>
<p>This new release introduces extensive unit testing for all the <em>core</em> and <em>math</em> functions on top of which <em>ACL</em> is built. There is still lots of room for improvement here, contributions welcome! Continuous integration also executes the unit tests for every platform except <em>iOS</em> and <em>Android</em> where they must be executed manually for now.</p>
<p>More significant is the addition of exhaustive regression testing. A total of <strong>42</strong> clips from the <a href="https://github.com/nfrechette/acl/blob/develop/docs/cmu_performance.md">Carnegie-Mellon University motion capture database</a> are each compressed under a mix of <strong>7</strong> configurations. At the moment, these must be run manually for every platform but scripts are present to automate the whole process. The primary reason why it remains manual is that the data is too large for <em>GitHub</em> and I do not have a webserver to host it. The instructions can be found <a href="https://github.com/nfrechette/acl/blob/develop/test_data/README.md">here</a>.</p>
<h1 id="it-runs-everywhere">It runs everywhere</h1>
<p>ACL now officially supports <strong>12</strong> different compiler toolchains and all the <a href="https://github.com/nfrechette/acl#supported-platforms">major platforms</a>: <em>Windows</em>, <em>Linux</em>, <em>OS X</em>, <em>iOS</em>, and <em>Android</em>. Both compression and decompression are supported and can easily be tested with the provided unit and regression tests.</p>
<p>But this is only the list of platforms I can reliably and easily test. In practice, since all the code is now pure <strong>C++11</strong>, if it compiles it should run just fine as-is. Although I cannot test them yet, I fully expect all major consoles to work out of the box: <em>Nintendo Switch (ARM)</em>, <em>PlayStation 4 (x64)</em>, and <em>Xbox One (x64)</em>.</p>
<h1 id="paragon-data">Paragon data</h1>
<p>The data I obtained from Paragon under NDA last year may or may not differ from what has now been <a href="https://www.unrealengine.com/en-US/paragon">publicly released</a> by Epic. As soon as I get the chance, I will update the published stats with the new public data. This also means that I will be able to include Paragon clips into the regression tests as well to increase our coverage.</p>
<h1 id="next-steps">Next steps</h1>
<p>The next <strong>v0.8</strong> release aims to achieve three goals (<a href="https://github.com/nfrechette/acl/milestone/5">roadmap</a>):</p>
<ul>
<li>Document as much as possible</li>
<li>Add the remaining features to support real games</li>
<li>Add the necessary decompression profiling infrastructure</li>
</ul>
<p>Decompression performance is one of the most important metrics for modern games on both mobile and consoles. Measuring it accurately and reliably in an environment that is as close to a real game as possible is challenging, which is why it was left for last. However, ACL was built from the ground up to decompress as fast as possible: all memory reads are contiguous and linear, and writes can be too depending on the host game engine integration. I am quite confident it will end up competitive with the state of the art codecs within UE4 and there are many opportunities left to optimize that I have delayed until I can measure their individual impact properly.</p>
<p>This upcoming release is likely to be the last before the first production release which aims to be a drop-in replacement within UE4. If everything goes according to plan and no delays surface, at the current pace, I should be able to reach this milestone around June 2018.</p>
Animation Compression Library: Release 0.6.02018-01-10T00:00:00+00:00http://nfrechette.github.io/2018/01/10/acl_v0.6.0<p>Hot off the press, ACL <strong>v0.6</strong> has just been <a href="https://github.com/nfrechette/acl/releases/tag/v0.6.0">released</a> and contains <a href="https://github.com/nfrechette/acl/blob/develop/CHANGELOG.md">lots of great things</a>!</p>
<p>This release focused primarily on extending the platform support as well as improving the accuracy. Proper <strong>Linux</strong> and <strong>OS X</strong> support was added as well as the <strong>x86</strong> architecture. As always, the list of supported platforms is in the <a href="https://github.com/nfrechette/acl">readme</a>. This was made possible thanks to continuous build integration which has been added and contributed in part by <a href="https://github.com/janisozaur">Michał Janiszewski</a>!</p>
<p>Another notable mention is that the backlog and roadmap have been migrated to <a href="https://github.com/nfrechette/acl/issues">GitHub Issues</a>. This ensures complete transparency with where the project is going.</p>
<h1 id="compiler-battle-royal">Compiler battle royal</h1>
<p>Now that we have all of these compilers and platforms supported, I thought it would make sense to measure everything at least once on the full data set from <a href="https://github.com/nfrechette/acl/blob/develop/docs/cmu_performance.md">Carnegie-Mellon University</a>.</p>
<p>Another thing I wanted to measure is how much we gain from hyper-threading, and last but not least, I thought it would be interesting to include <em>x86</em> as well as <em>x64</em>.</p>
<p>Here is my setup to measure:</p>
<ul>
<li><em>Windows 10</em> running on an <em>Intel i7-6850K</em> with <strong>6</strong> physical cores and <strong>12</strong> logical cores</li>
<li><em>Ubuntu 16.04</em> running in <em>VirtualBox</em> with <strong>6</strong> cores assigned</li>
<li><em>OS X</em> running on an <em>Intel i5-4288U</em> with <strong>2</strong> physical cores and <strong>4</strong> logical cores</li>
</ul>
<p>The <a href="https://github.com/nfrechette/acl/tree/develop/tools/acl_compressor">acl_compressor.py</a> script is used to compress multiple clips in parallel, each clip running in its own independent process.</p>
<p>Every platform used a <em>Release</em> build with <em>AVX</em> enabled. The <em>wall clock time</em> is the cumulative time it took to run everything: compression, decompression to measure accuracy, reading the clip, writing the stats, etc. On the other hand, the <em>total thread time</em> is the sum of the time each thread spent on compression.</p>
<p><img src="/public/acl/acl_cmu_v060_compiler_performance.png" alt="Compiler performance" /></p>
<p>A number of things stand out:</p>
<ul>
<li><em>x86</em> is slower for <strong>VS 2015</strong> (<strong>66.5%</strong> slower), <strong>VS 2017</strong> (<strong>64.0%</strong> slower with <em>11</em> cores, <strong>108.8%</strong> slower with <em>3</em> cores), and <strong>Clang 5</strong> (<strong>36.0%</strong> slower) but it seems to be faster for <strong>GCC 5</strong> (<strong>10.9%</strong> faster)</li>
<li>Hyper-threading barely helps at all: going from <em>6</em> cores to <em>11</em> with <strong>VS 2017</strong> was only <strong>7.8%</strong> faster while the total thread time increased by <strong>69.9%</strong></li>
<li><strong>Clang 5</strong> with <em>x64</em> wins hands down: it is <strong>25.2%</strong> faster than <strong>VS 2017</strong> and <strong>220.8%</strong> faster than <strong>GCC 5</strong></li>
</ul>
<p><strong>GCC 5</strong> performs so badly here that I am wondering if the default <em>CMake</em> compiler flags for <em>Release</em> builds are sane or if I made a mistake somewhere. <strong>Clang 5</strong> really blew me away: despite running in a VM, it significantly outperforms all the other compilers with both <em>x86</em> and <em>x64</em>.</p>
<p>As expected, hyper-threading does not help all that much. When clips are compressed, most of the data manipulated fits within the L2 or L3 caches. With so little I/O, animation compression is primarily CPU bound. This leaves very little opportunity for a neighboring hyper-thread to execute since the busy threads hardly ever stall on expensive operations.</p>
<h1 id="accuracy-improvements">Accuracy improvements</h1>
<p>As I mentioned when the <a href="https://github.com/nfrechette/acl/blob/develop/docs/paragon_performance.md">Paragon</a> data set was <a href="/2017/12/05/acl_paragon/">announced</a>, some exotic clips brought to the surface some unusual accuracy issues. These were all investigated and they generally fell into one or both of these categories:</p>
<ul>
<li>Very small and very large scale coupled with very large translation caused unacceptable accuracy loss when using <em>affine matrices</em> to calculate the error</li>
<li>Very large translations in a long bone chain can lead to significant accuracy loss</li>
</ul>
<p>In order to fix the first issue, how we handle the error metric was refactored to better allow a game engine to supply its own. This is documented <a href="https://github.com/nfrechette/acl/blob/develop/docs/error_metrics.md">here</a>. Ultimately, what matters most about the error metric is that it closely approximates how the error will look in the host game engine. Some game engines use <em>affine matrices</em> to convert the local space bone transform into object or world space while others use <em>Vector-Quaternion-Vector</em> (VQV). ACL now supports both ways to calculate the error, and the default we will be using for all of our statistics is the latter as it more closely matches what <em>Unreal 4</em> does. This did not measurably impact the compression results but it did improve the accuracy of the more exotic clips, and the overall compression time is faster.</p>
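<p>To make this more concrete, here is a minimal sketch of what a virtual vertex style error metric boils down to: transform a probe point at a fixed distance from the bone with both the raw and the lossy object space transforms and measure how far apart they land. The types and helper names below are illustrative assumptions for this sketch, not the actual ACL API, and scale handling is omitted:</p>
<pre><code class="language-cpp">#include &lt;cmath&gt;

// Hypothetical sketch of a virtual vertex error metric; not the ACL API.
struct Vec3 { float x, y, z; };
struct Quat { float x, y, z, w; };
struct Transform { Quat rotation; Vec3 translation; };

// Rotate a point: v' = v + 2 * cross(q.xyz, cross(q.xyz, v) + q.w * v)
static Vec3 quat_rotate(const Quat&amp; q, const Vec3&amp; v)
{
    const Vec3 c = { q.y * v.z - q.z * v.y + q.w * v.x,
                     q.z * v.x - q.x * v.z + q.w * v.y,
                     q.x * v.y - q.y * v.x + q.w * v.z };
    return { v.x + 2.0f * (q.y * c.z - q.z * c.y),
             v.y + 2.0f * (q.z * c.x - q.x * c.z),
             v.z + 2.0f * (q.x * c.y - q.y * c.x) };
}

static Vec3 transform_point(const Transform&amp; t, const Vec3&amp; p)
{
    const Vec3 r = quat_rotate(t.rotation, p);
    return { r.x + t.translation.x, r.y + t.translation.y, r.z + t.translation.z };
}

// Probe a virtual vertex at a fixed distance from the bone so that
// rotational error shows up as a measurable positional error.
static float measure_bone_error(const Transform&amp; raw, const Transform&amp; lossy)
{
    const Vec3 probe = { 3.0f, 0.0f, 0.0f }; // e.g. 3cm away from the bone
    const Vec3 a = transform_point(raw, probe);
    const Vec3 b = transform_point(lossy, probe);
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}
</code></pre>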
<p>However, the problem of large translations in long bone chains has not been addressed. I compared how the error looked in <em>Unreal 4</em> and, for the time being, it does a much better job than ACL on those few clips. This is because it implements <a href="http://nfrechette.github.io/2016/12/22/anim_compression_error_compensation/">error compensation</a>, something ACL has not implemented <a href="https://github.com/nfrechette/acl/issues/69">yet</a>. In the meantime, ACL is perfectly safe for production use and if these rare clips with a visible error do pop up, less aggressive compression settings can be used. Only <strong>3</strong> clips within the <em>Paragon</em> data set suffer from this.</p>
<p>Ultimately, a lot of the error introduced for both ACL and <em>Unreal 4</em> comes from the rotation format we use internally: we drop the quaternion <strong>W</strong> component. This works well enough when its value is close to <strong>1.0</strong>, as the square root used to reconstruct it is accurate in that range, but it fails spectacularly when the value is very small and close to <strong>0.0</strong>. I already have plans to try two other rotation formats to help resolve this issue: <a href="https://github.com/nfrechette/acl/issues/47">dropping the largest quaternion component</a> and <a href="https://github.com/nfrechette/acl/issues/74">using the quaternion logarithm</a> instead.</p>
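<p>For reference, here is a minimal sketch of how decompression typically reconstructs the dropped component; the function is illustrative, not the actual ACL code:</p>
<pre><code class="language-cpp">#include &lt;cmath&gt;

// Minimal sketch of reconstructing W in the "drop W" rotation format.
// The derivative of sqrt(x) grows unbounded as x approaches 0, so small
// quantization errors in X, Y, Z become large errors in W precisely when
// W is close to 0.0, which is the failure case described above.
inline float reconstruct_quat_w(float x, float y, float z)
{
    const float w_squared = 1.0f - (x * x + y * y + z * z);
    // Quantization can push the sum slightly above 1.0; clamp to avoid a NaN
    return w_squared &gt; 0.0f ? std::sqrt(w_squared) : 0.0f;
}
</code></pre>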
<h1 id="updated-stats">Updated stats</h1>
<p>While investigating the accuracy issues and comparing against <em>Unreal 4</em>, I noticed that a fix I <a href="/2017/12/05/acl_paragon/">previously made</a> locally was partially incorrect and in rare cases could lead to bad things happening. This has been fixed and the statistics and graphs for <strong>UE 4.15</strong> were updated for <a href="https://github.com/nfrechette/acl/blob/develop/docs/cmu_performance.md">CMU</a> and <a href="https://github.com/nfrechette/acl/blob/develop/docs/paragon_performance.md">Paragon</a>. The results are very close to what they were before.</p>
<p>The accuracy improvements from this release are a bit more visible on the Paragon data set.</p>
<h1 id="next-steps">Next steps</h1>
<p>At this point, I can pretty confidently say that ACL is ready for production use, but many things are still missing for the library to be of production quality. While the performance and accuracy are good enough, <a href="https://github.com/nfrechette/acl/issues/26"><strong>iOS</strong> support</a> and support for <a href="https://github.com/nfrechette/acl/issues/48">additive animations</a> are still missing, along with lots of unit testing, documentation, and cleanup.</p>
<p>The next release will focus on:</p>
<ul>
<li>Cleaning up</li>
<li>Adding lots of unit tests</li>
<li><strong>iOS</strong> support</li>
<li>Better <strong>Android</strong> support</li>
<li><a href="https://github.com/nfrechette/acl/milestone/2">Many other things</a></li>
</ul>
Arithmetic Accuracy and Performance2017-12-29T00:00:00+00:00http://nfrechette.github.io/2017/12/29/acl_research_arithmetic<p>As I mentioned in my <a href="/2017/12/05/acl_paragon/">previous post</a>, <a href="https://github.com/nfrechette/acl">ACL</a> still suffers from some less than ideal accuracy in some exotic situations. Since the next release will have a strong focus on fixing this, I wanted to investigate using <em>float64</em> and <a href="https://en.wikipedia.org/wiki/Fixed-point_arithmetic"><em>fixed point</em> arithmetic</a>. It is general knowledge that <em>float32</em> arithmetic incurs rounding and can lead to severe accuracy loss in some cases. The question I hoped to answer was whether or not this had a significant impact on ACL. Originally, ACL performed the compression entirely with <em>float64</em> arithmetic but this was <a href="https://github.com/nfrechette/acl/releases/tag/v0.3.0">removed</a> because it caused more issues than it was worth, though I did not publish numbers to back this claim up. Now we revisit it once and for all.</p>
<p>To this end, the first <a href="https://github.com/nfrechette/acl/tree/research/float-vs-double-vs-fixed-point">research branch</a> was created. Research branches will play an important role in ACL. Their aim is to explore small and large ideas that we might not want to support in their entirety in the main branches while keeping them close. Unless otherwise specified, research branches will not be actively maintained. Once their purpose is complete, they will live on to document what worked and didn’t work and serve as a starting point for anyone hoping to investigate them further.</p>
<h1 id="float64-vs-float32-arithmetic"><em>float64</em> vs <em>float32</em> arithmetic</h1>
<p>In order to fully test <em>float64</em> arithmetic, I templated a few things to abstract the arithmetic used between <em>float32</em> and <em>float64</em>. This allowed easy conversion and support of both with nearly the same code path. The results proved somewhat underwhelming:</p>
<p><img src="/public/acl/arithmetic_float32_float64_summary.png" alt="Float32 VS Float64 Stat Summary" /></p>
<p>As it turns out, the small accuracy loss from <em>float32</em> arithmetic has a barely measurable impact on the memory footprint for CMU, and switching to <em>float64</em> only yields a <strong>0.6%</strong> reduction for Paragon. However, compression (and decompression) is much faster with <em>float32</em>.</p>
<p>With <em>float64</em>, the max error for CMU and Paragon is slightly improved for the majority of the clips but not by a significant margin and <strong>4</strong> exotic Paragon clips end up with a worse error.</p>
<p><img src="/public/acl/arithmetic_max_error_distribution.png" alt="Float32 VS Float64 Max Error Distribution" /></p>
<p>Consequently, it is my opinion that <em>float32</em> is the superior choice of the two. The small increase in accuracy and reduction in memory footprint are not significant enough to outweigh the performance degradation. Even if the <em>float64</em> code path were as optimized, it would remain slower due to the individual instructions being slower and the increased number of registers needed. It is possible the performance might improve considerably with AVX, so this is something we will keep in mind going forward.</p>
<h1 id="fixed-point-arithmetic">Fixed point arithmetic</h1>
<p>Another popular alternative to <em>floating point</em> arithmetic is <em>fixed point</em> arithmetic. Depending on the situation it can yield higher accuracy and depending on the hardware it can also be faster. Prior to this, I had never worked with <em>fixed point</em> arithmetic. There was a bit of a learning curve but it proved intuitive soon enough.</p>
<p>I will not explain in great detail how it works but intuitively, it is the same as <em>floating point</em> arithmetic minus the exponent part. For our purposes, during the decompression (and part of the compression), most of the values we work with are normalized and unsigned. This means that the range is known ahead of time and fixed which makes it a good candidate for <em>fixed point</em> arithmetic.</p>
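<p>As a quick illustration of the idea, here is a tiny standalone sketch of unsigned fixed point arithmetic with <strong>16</strong> fractional bits; the values are made up for the example:</p>
<pre><code class="language-cpp">#include &lt;cstdint&gt;
#include &lt;cstdio&gt;

int main()
{
    // Unsigned 0.16 fixed point: a value v represents v / 2^16,
    // covering the normalized range [0.0 .. 1.0) we care about.
    const uint16_t a = 0xC000; // 0.75
    const uint16_t b = 0x4000; // 0.25

    // The product of two 0.16 values is a 0.32 value; a wider
    // intermediate and a shift bring it back down to 0.16.
    const uint16_t product = uint16_t((uint32_t(a) * uint32_t(b)) &gt;&gt; 16);

    printf("0.75 * 0.25 = %f\n", product / 65536.0); // prints 0.187500
    return 0;
}
</code></pre>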
<p>Sadly, it differs so much from <em>floating point</em> arithmetic that I could not as easily support it in parallel with the other two. Instead, I created an <a href="https://github.com/nfrechette/acl/tree/research/float-vs-double-vs-fixed-point/tools/arithmetic_playground">arithmetic_playground</a> and tried a whole bunch of things within.</p>
<p>I focused on reproducing the decompression logic as close as possible. The original high level logic to decompress a single value is simple enough to include here:</p>
<script src="https://gist.github.com/nfrechette/2fe5be8ea2de50353e327164a3d3c15c.js"></script>
<h2 id="not-quite-10">Not quite 1.0</h2>
<p>The first obstacle to using <em>fixed point</em> arithmetic is the fact that our quantized values do not map <strong>1:1</strong>. Many engines dequantize with code that looks like this (including Unreal 4 and ACL):</p>
<script src="https://gist.github.com/nfrechette/28a61b389e3483c224da53333662ccd0.js"></script>
<p>This is great in that it allows us to exactly represent both <strong>0.0</strong> and <strong>1.0</strong>: we can support the full range we care about, <strong>[0.0 .. 1.0]</strong>. A case could be made for using a multiplication instead but it does not matter all that much for the present discussion. With <em>fixed point</em> arithmetic, we want to use all of our bits to represent the fractional part between those two values. This means the range of values we can support is <strong>[0.0 .. 1.0)</strong>: both <strong>0.0</strong> and <strong>1.0</strong> have the same fractional value of <strong>0</strong> and as such we cannot tell them apart without an extra bit to represent the integral part.</p>
<script src="https://gist.github.com/nfrechette/d25f334afa7da3a6650ab19d3e8dec17.js"></script>
<p>In order to properly support our full range of values, we must remap it with a multiplication.</p>
<script src="https://gist.github.com/nfrechette/8c47551ebb5a2047d0d8584879a502d8.js"></script>
<h2 id="fast-coercion-to-float32">Fast coercion to <em>float32</em></h2>
<p>The next hurdle I faced was how to convert the <em>fixed point</em> number into a <em>float32</em> value efficiently. I independently found a simple, fast, and elegant way and of course it turned out to be very popular for those very reasons.</p>
<script src="https://gist.github.com/nfrechette/eadfee26cc88ca4676c1cc8e96f53415.js"></script>
<p>For all of our possible values, we know their bit width and a shift can trivially be calculated to align them with the <em>float32</em> mantissa. All that remains is <strong>or-ing</strong> in the exponent and the sign. In our case, our values are between <strong>[0.0 .. 1.0)</strong> and thus by using a hex value of <strong>0x3F800000</strong> for <code class="language-plaintext highlighter-rouge">exponent_sign</code>, we end up with a <em>float32</em> in the range <strong>[1.0 .. 2.0)</strong>. A final subtraction yields us the range we want.</p>
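<p>Here is a minimal standalone sketch of the trick, assuming an unsigned fraction of at most <strong>23</strong> bits; it mirrors the gists in spirit but is not the actual ACL code:</p>
<pre><code class="language-cpp">#include &lt;cstdint&gt;
#include &lt;cstring&gt;

// Coerce a num_bits wide fixed point fraction (in [0 .. 2^num_bits)) into
// a float32 in [0.0 .. 1.0), assuming num_bits is at most 23.
inline float fixed_to_float(uint32_t value, uint32_t num_bits)
{
    // Align the fraction with the 23 bit float32 mantissa
    const uint32_t mantissa = value &lt;&lt; (23 - num_bits);
    // Or in the exponent and sign of 1.0f, yielding a value in [1.0 .. 2.0)
    const uint32_t float_bits = 0x3F800000u | mantissa;
    float result;
    std::memcpy(&amp;result, &amp;float_bits, sizeof(result)); // safe type punning
    return result - 1.0f; // the final subtraction remaps to [0.0 .. 1.0)
}
</code></pre>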
<p>Using this trick with the <em>float32</em> implementation gives us the following code:</p>
<script src="https://gist.github.com/nfrechette/d230108509c27bb648a89a4bc333c8db.js"></script>
<p>It does lose a tiny bit of accuracy but it is barely measurable. In order to be sure, I exhaustively tried all possible sample and segment range values up to a bit rate of <strong>16</strong> bits per component. The upside is obvious: it is <strong>14.9%</strong> faster!</p>
<h2 id="32-bit-vs-64-bit-variants"><em>32</em> bit vs <em>64</em> bit variants</h2>
<p>Many variants were implemented: some performed the segment range expansion with <em>fixed point</em> arithmetic and the clip range expansion with <em>float32</em> arithmetic and others do everything with <em>fixed point</em>. A mix of <em>32</em> bit and <em>64</em> bit arithmetic was also tried to compare the accuracy and performance tradeoff.</p>
<p>Generally, the <em>32</em> bit variants had a much higher accuracy loss, by <strong>1-2</strong> orders of magnitude. It is not clear how much this would impact the overall memory footprint on CMU and Paragon. The <em>64</em> bit variants had accuracy comparable to <em>float32</em> arithmetic but ended up using more registers and more instructions. This often degraded the performance to the point of making them entirely uncompetitive in this synthetic test. Only a single variant came close to the original <em>float32</em> performance but it could never beat the fast coercion derivative.</p>
<p>The fastest <em>32</em> bit variant is as follow:</p>
<script src="https://gist.github.com/nfrechette/6f17bba9035968ba47451c24b7519dc4.js"></script>
<p>Despite being <strong>3</strong> instructions shorter and using faster instructions, it was <strong>14.4%</strong> slower than the fast coercion <em>float32</em> variant. This is likely a result of pipelining not working out as well. It is entirely possible that in the real decompression code things could end up pipelining better making this a faster variant. Other processors such as those used in consoles and mobile devices also might perform differently and proper measuring will be required to get a definitive answer.</p>
<p>The general consensus seems to be that <em>fixed point</em> arithmetic can yield higher accuracy and performance but it is highly dependent on the data, the algorithm, and the processor it runs on. I can corroborate this and conclude that it might not help out all that much for animation compression and decompression.</p>
<p><img src="/public/acl/arithmetic_fixed_point_perf.png" alt="Fixed Point Performance" /></p>
<h1 id="next-steps">Next steps</h1>
<p>All of this work was performed in a branch that will <strong>NOT</strong> be merged into <em>develop</em>; however, some changes will be cherry-picked by hand. In the short term, the conclusions reached here will not be integrated just yet into the main branches. The primary reason for this is that while I have extensive scripts and tools to track the accuracy, memory footprint, and compression performance, I do not have robust tooling in place to track decompression performance on the various platforms that are important to us.</p>
<p>Once we are ready, the fast coercion variant will land first as it appears to be an obvious drop-in replacement and some <em>fixed point</em> variants will also be tried on various platforms.</p>
<p>The accuracy issues will have to be fixed some other way and I already have some good ideas how: <a href="https://github.com/nfrechette/acl/issues/19">idea 1</a>, <a href="https://github.com/nfrechette/acl/issues/20">idea 2</a>, <a href="https://github.com/nfrechette/acl/issues/47">idea 3</a>, <a href="https://github.com/nfrechette/acl/issues/50">idea 4</a>, <a href="https://github.com/nfrechette/acl/issues/51">idea 5</a>.</p>
Animation Compression Library: Paragon Results2017-12-05T00:00:00+00:00http://nfrechette.github.io/2017/12/05/acl_paragon<p>While working for Epic to improve Unreal 4’s own animation compression and decompression, I asked for permission to use the Paragon animations for research purposes and they generously agreed. Today I have the pleasure to report the findings from that new data set!</p>
<p>This is significant for two reasons:</p>
<ul>
<li>It allows for extensive stress testing with new data</li>
<li>Paragon is a real game with high animation quality</li>
</ul>
<p><a href="https://www.youtube.com/watch?v=3OCJCZJWA68" title="Paragon Official Trailer"><img src="https://i.ytimg.com/vi/3OCJCZJWA68/hqdefault.jpg" alt="Paragon Official Trailer" /></a></p>
<h1 id="carnegie-mellon-university">Carnegie-Mellon University</h1>
<p>Thus far, the <a href="http://mocap.cs.cmu.edu/">Carnegie-Mellon University</a> data set has been the performance benchmark.</p>
<p>The data set contains <strong>2534</strong> clips. Each clip contains an animated character with <strong>44</strong> bones. The version of the data that I found comes from the <a href="https://www.assetstore.unity3d.com/en/#!/content/19991">Unity store</a> where it is distributed in FBX form but sampled at <strong>24</strong> FPS. The total duration of the database is <strong>09h 49m 37.58s</strong>. It does not contain any 3D scale and its raw size is <strong>1429.38 MB</strong>. It exclusively contains motion capture animation data. It is publicly available and well known within the animation compression research community.</p>
<p>While the database is valuable, it is not entirely representative of all the animation assets that a AAA game might use for a few reasons:</p>
<ul>
<li>Most AAA games today have well over <strong>100</strong> bones per character and sometimes as high as <a href="/2017/10/05/acl_in_ue4/"><strong>500</strong></a></li>
<li>The sample rate is lower than the <strong>30</strong> FPS typically used in games</li>
<li>Motion capture data is often very noisy</li>
<li>Games often animate things other than characters such as cloth, objects, destruction, etc.</li>
<li>Many games make use of 3D scale</li>
</ul>
<p>For these reasons, this data set is wonderful for unit testing and establishing a baseline for comparison, but it falls a bit short of what I would ideally like.</p>
<p>You can see how Unreal and ACL compare against it <a href="https://github.com/nfrechette/acl/blob/develop/docs/cmu_performance.md">here</a>.</p>
<h1 id="paragon">Paragon</h1>
<p>The Paragon data set contains <strong>6558</strong> clips for a total duration of <strong>07h 00m 45.27s</strong> and a raw size of <strong>4276.11 MB</strong>. As you can see, despite being shorter than CMU, it is about <strong>3x</strong> larger in size.</p>
<p>The data set contains among other things:</p>
<ul>
<li>Lots of characters with varying number of bones</li>
<li>Animated objects of various shape and form</li>
<li>Very short and very long clips</li>
<li>Clips with unusual sample rate (as low as <strong>2</strong> FPS!)</li>
<li>World space clips</li>
<li>Lots of 3D scale</li>
<li>Lots of other exotic clips</li>
</ul>
<p>This is great to stress test any compression algorithm and the results will be very representative of what could be expected in a AAA game.</p>
<p>To extract the animation clips, I used the Unreal 4 animation recompression commandlet and modified it to skip clips that ACL does not yet support (e.g. additive animations). I did my best to retain as many clips as possible. Every clip was saved in the <a href="https://github.com/nfrechette/acl/blob/develop/docs/the_acl_file_format.md">ACL file format</a> allowing a binary exact representation.</p>
<p>Sadly, I am not at liberty to share this data set as I am only allowed to use it under a non-disclosure agreement. All hope is not lost though: Epic has expressed interest in perhaps making a small subset of the data publicly available for research purposes. Stay tuned!</p>
<h1 id="bugs">Bugs!</h1>
<p>The value of undertaking this quickly became obvious when an exotic clip from the data set highlighted a bug in the variable bit rate selection that ACL used. A fix was made and the results were breathtaking: CMU reduced in size by <strong>19%</strong> (and Paragon reduced by <strong>20%</strong>)! You can read about it <a href="/2017/11/23/acl_v0.5.0/">here</a> in my previous blog post.</p>
<p>Three clips stress tested the accuracy of ACL and ended up with an unacceptable error as a result. This will be made evident by the graphs and numbers below. I am hoping to fix a number of accuracy issues in the next ACL release now that I have new data to validate against.</p>
<p>The bugs I found were not exclusively within ACL: two were found in Unreal 4 and are still present in the latest version. Thankfully, I was able to get in touch with Epic and these should be fixed in a future release.</p>
<p>In order to make the comparison as fair as possible, I had to locally disable the down-sampling variants within the Unreal 4 automatic compression method. One of the two bugs caused these variants to sometimes crash. While down-sampling isn't often selected by the algorithm as the optimal choice for any given clip, disabling it means that compression is faster and the results are possibly a bit larger. Out of the <strong>600</strong> clips I managed to compress before finding the bug, only <strong>3</strong> ended up down-sampled. There are <strong>9</strong> down-sampled variants out of <strong>27</strong> in total (<strong>33%</strong>).</p>
<h1 id="bottom-line">Bottom line</h1>
<p>UE 4.15 took <strong>19h 56m 50.37s</strong> single threaded to compress. It yielded a compressed size of <strong>496.24 MB</strong> for a compression ratio of <strong>8.62 : 1</strong>. The max error is <strong>0.8619cm</strong>.</p>
<p>ACL 0.5 took <strong>19h 04m 25.11s</strong> single threaded to compress (<strong>01h 53m 42.84s</strong> with <strong>11</strong> threads). It yielded a compressed size of <strong>205.69 MB</strong> for a compression ratio of <strong>20.79 : 1</strong>. The max error is <strong>9.7920cm</strong>.</p>
<p>On the surface, the compression time remains faster with ACL even with a significant portion of the variants disabled in the Unreal automatic compression. However, the memory footprint is dramatically smaller, a whopping <strong>58.6%</strong> smaller! As will be made apparent in the graphs below, once again the maximum error proves to be a poor metric of the true performance: <strong>3</strong> clips have an error above <strong>0.8cm</strong> with ACL.</p>
<h1 id="the-results-in-images">The results in images</h1>
<p>All the results and many more images are also on GitHub <a href="https://github.com/nfrechette/acl/blob/develop/docs/paragon_performance.md">here</a> for Paragon just like they are for CMU <a href="https://github.com/nfrechette/acl/blob/develop/docs/cmu_performance.md">here</a>. I will only show a few selected images in this post for brevity.</p>
<p><img src="/public/acl/acl_paragon_v050_compression_ratio_distribution.png" alt="Compression ratio distribution" /></p>
<p>As expected, ACL outperforms Unreal by a significant margin. The chart is truncated on the right: a few exotic clips reach compression ratios as high as <strong>900 : 1</strong>, but those are likely very long with little to no animated data and aren't too interesting or representative.</p>
<p><img src="/public/acl/acl_paragon_v050_max_error_distribution.png" alt="Max error distribution" /></p>
<p>Here again ACL outperforms Unreal over the overwhelming majority of the data set. On the right there are a small number of clips that perform somewhat poorly with both compression methods: a total of <strong>101</strong> clips have an error above <strong>0.1cm</strong> with ACL and <strong>153</strong> clips for Unreal.</p>
<p><img src="/public/acl/acl_paragon_v050_exhaustive_error.png" alt="Distribution of the error for every bone at every key frame" /></p>
<p>As I have <a href="/2017/09/10/acl_v0.4.0/">previously</a> mentioned, the max clip error is a poor measure of accuracy. Once again the full picture is much better and tells a different story.</p>
<p>ACL continues to shine, crossing the <strong>0.01cm</strong> threshold at the <strong>99.23th</strong> percentile. Unreal crosses the same threshold at the <strong>89th</strong> percentile.</p>
<p>Despite a maximum error that is entirely unacceptable, it turns out that only <strong>0.77%</strong> of the compressed samples (out of <strong>112 million</strong>) exceed a sub-millimeter error threshold. Aside from the <strong>3</strong> worst offending clips, everything else is of cinematic, production quality. Not bad!</p>
<h1 id="conclusion">Conclusion</h1>
<p>As is apparent now, ACL performs admirably in a myriad of scenarios and continues to improve month after month. Real world data now confirms it. Half the memory footprint of Unreal is not insignificant even for a PC or PS4 game: less data to load into memory means faster streaming, less data to transfer means faster game download and installation times, and it can correlate with faster decompression performance too. For many PS4 and XB1 games, <strong>200 MB</strong> is perhaps small enough to load them all into memory up front and never stream them from disk afterwards.</p>
<p>As I continue to improve ACL, I will update the graphs and numbers with the latest significant releases. I also expect the improvements that I made to Unreal’s own animation compression over the last few months to be part of a future release and when that happens I will again update everything.</p>
<p>Special thanks to <a href="https://keybase.io/visualphoenix">Raymond Barbiero</a> for his very valuable feedback and to the continued support of many others!</p>
Animation Compression Library: Release 0.5.02017-11-23T00:00:00+00:00http://nfrechette.github.io/2017/11/23/acl_v0.5.0<p>Today marks the release of <a href="https://github.com/nfrechette/acl/releases/tag/v0.5.0">ACL v0.5</a>. Once again, <a href="https://github.com/nfrechette/acl/blob/develop/CHANGELOG.md">lots of great things</a> were included in this release but three things stand out:</p>
<ul>
<li>Full 3D scale support</li>
<li>Android support (tested within Unreal Engine 4.15)</li>
<li>A fix to the variable quantization optimization algorithm</li>
</ul>
<p>The third point in particular needs explaining. Initially, I did not intend to make significant changes in this release to the way compression was done beyond the scale support and whatever fixes Android required.
However, while investigating accuracy issues within an exotic clip, I noticed a bug. Upon fixing it (and very unexpectedly), everything kicked into overdrive.</p>
<h1 id="performance-results">Performance results</h1>
<p>On the <a href="http://mocap.cs.cmu.edu/">Carnegie-Mellon University (CMU)</a> data set, the memory footprint reduced by <strong>18.4%</strong> with little to no change to the accuracy of the overwhelming majority of clips and a slight accuracy increase to some of them!
Sadly, the compression speed suffered a bit as a result and it is now about <strong>1.5x</strong> slower than v0.4. In my opinion, this is an entirely acceptable trade-off!</p>
<p>Compared to UE 4.15, ACL now stands <strong>37.8%</strong> smaller and <strong>2.82x</strong> faster (single threaded) to compress on CMU. No small feat!</p>
<p>In light of these new numbers, all the charts have been updated and can be found <a href="https://github.com/nfrechette/acl/blob/develop/docs/cmu_performance.md">here</a>. Here are the most interesting:</p>
<p><img src="/public/acl/acl_cmu_v050_compression_ratio_distribution.png" alt="Compression ratio distribution" /></p>
<p><img src="/public/acl/acl_cmu_v050_max_error_distribution.png" alt="Max error distribution" /></p>
<p><img src="/public/acl/acl_cmu_v050_exhaustive_error.png" alt="Distribution of the error for every bone at every key frame" /></p>
<p>I also extracted two new charts: the distribution of clip durations within CMU and the distribution of which bit rates ended up selected by the algorithm. A bit rate of <strong>6</strong> means that <strong>6 bits</strong> per component are used. Every track (rotation, translation, and scale) sample has <strong>3</strong> components (X, Y, Z) which means <strong>18 bits</strong> per sample.</p>
<p><img src="/public/acl/acl_cmu_v050_clip_durations.png" alt="Clip duration distribution" /></p>
<p><img src="/public/acl/acl_cmu_v050_bit_rates.png" alt="Bit rate distribution" /></p>
<h1 id="next-steps">Next steps</h1>
<p>The focus of the next few months will be more platform support (Linux, OS X, and iOS in that order) as well as improving the accuracy. A new data set I got my hands on showed edge cases, not too uncommon in real video games, where the accuracy is not good enough. Part of the accuracy loss comes from storing the segment range with 8 bits per component and the fact that we use 32 bit floats to perform the decompression arithmetic. As such, a new research branch will be created to investigate performing the arithmetic with 64 bit floats as well as with a fixed point representation. A separate blog post will be written with the conclusions of this research.</p>
Animation Compression Library: Unreal 4 Integration2017-10-05T00:00:00+00:00http://nfrechette.github.io/2017/10/05/acl_in_ue4<p>As mentioned in my <a href="/2017/09/10/acl_v0.4.0/">previous post</a>, I started working on integrating <a href="https://github.com/nfrechette/acl">ACL</a> into Unreal 4.15 locally. Today I can finally confirm that not only does it work but it rocks!</p>
<h1 id="matinee-fight-scene">Matinee fight scene</h1>
<p>In order to stress test ACL in a real game engine with real content, I set out to test it on the <a href="https://www.unrealengine.com/en-US/blog/matinee-fight-scene-released-on-marketplace">Matinee fight scene</a> that can be found on the Unreal 4 Marketplace.</p>
<iframe width="854" height="480" src="https://www.youtube.com/embed/EO0k92iVMjE" frameborder="0" allowfullscreen=""></iframe>
<h2 id="acl-in-action">ACL in action</h2>
<p>This is a very complex sequence with fast movements and <strong>LOTS</strong> of data. The main character (the white trooper) has over <strong>540 bones</strong> because the whole cloth motion is baked. The sequence lasts about <strong>66 seconds</strong>. The secondary characters move in and out of view and overall spend the overwhelming majority of the sequence completely idle and off screen.</p>
<p>Here is a short extract of the sequence using ACL for every character. This marks the first visual test and confirmation that ACL works.</p>
<video width="854" height="480" controls="">
<source src="/public/acl/matinee_fight_acl_cut.mp4" type="video/mp4" />
Your browser does not support this video.
</video>
<h2 id="the-data">The data</h2>
<p>The video isn’t too interesting but once again the numbers tell a story of their own. <strong>Packaged As Is</strong> uses the default settings you get when you first open it up in the editor, as packaged on the marketplace. For ACL, the integration is somewhat dirty for now and uses the same settings as for the <a href="https://github.com/nfrechette/acl/blob/develop/docs/cmu_performance.md">CMU database</a>: the error is measured <strong>3cm</strong> away from the bones, the error threshold is <strong>0.1mm</strong>, and the segments are <strong>16 frames</strong> long.</p>
<p><img src="/public/acl/matinee_fight_stats.png" alt="Matinee fight scene stats" /></p>
<p>ACL completely surpassed my own expectations here. The whole sequence is <strong>59.5% smaller</strong>! The main trooper is a whopping <strong>64.5% smaller</strong>! That's nearly <strong>3x</strong> smaller! Compression time is also entirely reasonable, sitting at just over <strong>1 minute</strong>. While the packaged settings are decent here, sitting at around <strong>5 minutes</strong>, the automatic compression setting is not practical at almost <strong>3 hours</strong>. The error shown is what Unreal 4 reports in the dialog box after compression; it thus uses the Unreal 4 error metric, and here again we can see that ACL is superior.</p>
<p>However, ACL does not perform as well on the secondary characters and ends up significantly larger. This is because they are mostly idle. Idle bones compress extremely well with linear key reduction, but because ACL uses short segments, it is forced to retain at least a single key per segment. With some sort of automatic segment partitioning, or even simply by using larger segments, the memory footprint could be reduced quite a bit here.</p>
<h1 id="what-happens-now">What happens now?</h1>
<p>The integration that I have made will not be public or published for quite some time. Until we reach version <strong>1.0</strong>, I wouldn’t want to support actual games while I am still potentially making large changes to the library. Once ACL is production ready and robust, I will see with Epic how we can go about making ACL a first-class citizen in their engine. In the meantime, I will maintain it locally and use it to test and validate ACL on the other platforms supported by Unreal.</p>
<p>For the time being, all hope is not lost! For the past 2 months, I have been working with Epic on improving the stock Unreal 4 animation compression. Our primary focus has been to improve decompression speed and reduce the compression time without compromising the already excellent memory footprint and accuracy. If all goes well, these changes should make it into the next release and once that happens, I will update the relevant charts and graphs published here as well as in the ACL documentation.</p>
Animation Compression Library: Release 0.4.02017-09-10T00:00:00+00:00http://nfrechette.github.io/2017/09/10/acl_v0.4.0<p>This marks the fourth release of ACL. It contains a lot of <a href="https://github.com/nfrechette/acl/blob/develop/CHANGELOG.md">good stuff</a> but most notable is the addition of <a href="http://nfrechette.github.io/2016/11/10/anim_compression_uniform_segmenting/">segmenting support</a>. I have not had the chance to play with the settings much yet but using segments of <strong>16</strong> key frames reduces the memory footprint by about <strong>13%</strong> with variable quantization under uniform sampling. Adding range reduction on top of it (per segment), further reduces the memory footprint by another <strong>10%</strong>. This is very significant!</p>
<p>Some optimizations to compression also made it in, reducing the compression time by <strong>4.3x</strong> with no compromise to quality.</p>
<p>You can see the latest numbers <a href="https://github.com/nfrechette/acl/blob/develop/docs/cmu_performance.md">here</a> as well as how they compare against the previous releases <a href="https://github.com/nfrechette/acl/blob/develop/docs/cmu_performance_history.md">here</a>. Note that the documentation contains more graphs than I will share here.</p>
<p>This also represents the first release where graphs have been generated, allowing us an unprecedented view into how the ACL and Unreal algorithms perform. As such, I will detail what is noteworthy, and thus this blog post will be a bit long. Grab a coffee and buckle up!</p>
<p><strong>TL;DR:</strong></p>
<ul>
<li>ACL compresses better than Unreal for nearly every clip in the CMU database.</li>
<li>ACL is much smaller than Unreal (<strong>23.4%</strong>), is more accurate (<strong>2x+</strong>), and compresses much faster (<strong>4.68x</strong>).</li>
<li>ACL performs as expected and optimizes properly for the error threshold used, validating our assumptions.</li>
<li>A threshold of <strong>0.1cm</strong> is good enough for production use in Unreal as the overwhelming majority (<strong>98.15%</strong>) of the samples have an error smaller than <strong>0.02cm</strong>.</li>
</ul>
<h1 id="why-compare-against-unreal">Why compare against Unreal?</h1>
<p><a href="http://nfrechette.github.io/2017/01/11/anim_compression_unreal4/">As I have previously mentioned</a>, Unreal 4 has a very solid error metric and good implementations of common animation compression techniques. It most definitely is well representative of the state of animation compression in game engines everywhere.</p>
<p><strong>NOTE: In the images that follow, the results for an error threshold of UE4 @ 1.0cm were nearly identical to 0.1cm and were thus omitted for brevity</strong></p>
<h1 id="performance-results">Performance results</h1>
<p>ACL 0.4 compresses the <a href="http://mocap.cs.cmu.edu/">CMU</a> database down to <strong>82.25mb</strong> in <strong>50 minutes</strong> single-threaded and <strong>5 minutes</strong> multi-threaded with a maximum error of <strong>0.0635cm</strong>. Unreal 4.15 compresses it down to <strong>107.94mb</strong> in <strong>3 hours and 54 minutes</strong> single-threaded with a maximum error of <strong>0.0850cm</strong> (<strong>1.0cm</strong> threshold used). Importantly, this is achieved with no compromise to decompression speed (although not yet measured, decompression is estimated to be just as fast or faster with ACL).</p>
<p><img src="/public/acl/acl_cmu_v040_compression_ratio_vs_max_error.png" alt="Compression ratio VS max error per clip" /></p>
<p>As can be seen on the above image, ACL performs quite well here. The error is very low and the compression quite high in comparison to Unreal.</p>
<p><img src="/public/acl/acl_cmu_v040_compression_ratio_distribution.png" alt="Compression ratio distribution" /></p>
<p>Here we see the full distribution of the compression ratio over the CMU database. <strong>UE4 @ 0.01cm</strong> fails to do better than dropping the quaternion W and storing everything at full precision most of the time, which is why the compression ratio is so consistent. <strong>UE4 @ 0.1cm</strong> performs similarly in that key reduction fails very often on this database and as a result simple quantization is most often selected.</p>
<p><img src="/public/acl/acl_cmu_v040_compression_ratio_distribution_bottom_10.png" alt="Compression ratio distribution (bottom 10%)" /></p>
<p>Here is a snapshot of the bottom <strong>10%</strong> (<strong>10th</strong> percentile and lower). We can see some similarities in shape at the bottom and top 10%.</p>
<p><img src="/public/acl/acl_cmu_v040_compression_ratio_by_duration.png" alt="Compression ratio by clip duration" /></p>
<p>We can see on the above image that Unreal performs consistently regardless of the animation clip duration but ACL performs slightly better the longer the clip is. This is most likely a direct result of using range reduction twice: once per clip, and once per segment.</p>
<p><img src="/public/acl/acl_cmu_v040_compression_ratio_by_duration_shortest_100.png" alt="Compression ratio by clip duration (shortest 100)" /></p>
<p>Both algorithms perform similarly for the shortest clips.</p>
<h2 id="how-accurate-are-we">How accurate are we?</h2>
<p><img src="/public/acl/acl_cmu_v040_max_error_distribution.png" alt="Max error distribution" /></p>
<p>The above image gives a good view of how accurate the algorithms are. We can see <strong>ACL @ 0.01cm</strong> and <strong>UE4 @ 0.01cm</strong> quickly reach the error threshold and only about <strong>10%</strong> of the clips exceed it. <strong>UE4 @ 0.1cm</strong> is less accurate but still pretty good overall.</p>
<p>The biggest source of error in both ACL and Unreal comes from the usage of the simple quaternion format consisting of <a href="http://nfrechette.github.io/2016/10/27/anim_compression_data/">dropping the <strong>W</strong> component</a> to later reconstruct it at runtime. As it turns out, this is terribly inaccurate when that component is very small. Better formats exist and will be implemented later.</p>
<p>ACL performs worse on a larger number of clips, likely because range reduction sometimes causes a precision loss. At some point, ACL should be able to detect this and turn range reduction off when it isn't needed.</p>
<p><img src="/public/acl/acl_cmu_v040_max_clip_error_by_duration.png" alt="Max error by clip duration" /></p>
<p>There does not appear to be any correlation between the max error in a clip and its duration, as expected. One thing stands out though: the longer a clip is, the noisier the error appears to be. This is because the longer a clip is, the more likely it is to contain a bad quaternion W that fails to reconstruct properly.</p>
<p>Over the years, I’ve read my fair share of animation compression papers and posts. And while they all measure the error differently, the one thing they have in common is that they only talk about the worst error within a clip (or whole set of clips). <a href="http://nfrechette.github.io/2016/11/01/anim_compression_accuracy/">As I have previously mentioned</a>, how you measure the error is very important and must be done carefully, but that is not all. Using the worst error within a given clip does not give a full picture. What about the other bones in the clip? What about the other key frames? Do I have a single bone on a single key frame that violates my threshold or do I have many?</p>
<p>In order to get a full and clear picture, I dumped the error of every bone at every key frame in the original clips. This represents over <strong>37 million</strong> samples for the CMU database.</p>
<p><img src="/public/acl/acl_cmu_v040_exhaustive_error.png" alt="Distribution of the error for every bone at every key frame" /></p>
<p>The above image is amazing!</p>
<p><img src="/public/acl/acl_cmu_v040_exhaustive_error_top_10.png" alt="Distribution of the error for every bone at every key frame (top 10%)" /></p>
<p>The above two images clearly show how terrible the max clip error is at giving insight into the true error. Here are some numbers visible only in the exhaustive graphs:</p>
<ul>
<li><strong>ACL</strong> crosses the <strong>0.01cm</strong> error threshold at the <strong>99.85th</strong> percentile (only <strong>0.15%</strong> of our values exceed the threshold!)</li>
<li><strong>UE4 @ 0.01cm</strong> crosses <strong>0.01cm</strong> at the <strong>99.57th</strong> percentile, almost just as good</li>
<li><strong>UE4 @ 0.1cm</strong> crosses <strong>0.01cm</strong> at the <strong>49.8th</strong> percentile</li>
<li><strong>UE4 @ 0.1cm</strong> crosses <strong>0.02cm</strong> at the <strong>98.15th</strong> percentile</li>
</ul>
<p>This clearly shows why <strong>0.1cm</strong> might be good enough for production use in Unreal: half our values remain at or below <strong>0.01cm</strong> and <strong>98%</strong> of the values are below <strong>0.02cm</strong>.</p>
<p>The previous images also clearly show how aggressive ACL is at reducing the memory footprint and at maximizing the error up to the error threshold. Therefore, the error threshold must be very conservative, much more so than for Unreal.</p>
<h1 id="why-acl-is-re-inventing-the-wheel">Why ACL is re-inventing the wheel</h1>
<p>As some have <a href="https://twitter.com/bkaradzic/status/879585542538534913">commented in the past</a>, ACL is largely re-inventing the wheel here. As such, I will detail the rationale for it a bit further.</p>
<p>Writing a whole animation blending middleware such as <a href="http://www.radgametools.com/granny.html">Granny</a> or <a href="http://www.naturalmotion.com/middleware/morpheme">Morpheme</a> would not have been practical. Just matching the production quality implementations out there would have taken over a year part-time. Even assuming I could have managed to implement something compelling, the cost of switching to a new animation runtime is very, very high for a game team. Animators need to learn new tools and workflows, the engine integration might be tightly coupled, and there is no clear way to migrate old assets to the new format. Middlewares are also getting deprecated increasingly frequently. In that regard, the market has largely spoken: most games released today ship either with one of the major engines (Unreal, Unity, Lumberyard, Stingray, etc.) or from large studios such as Activision, Electronic Arts, and Ubisoft that routinely have in-house custom engines with their own custom animation runtime. Regardless of the quality or feature set, it would have been highly unlikely for it to ever be used for something significant.</p>
<p>On the other hand, animation compression is a much smaller problem. Integration is easy: everything is pure C++ headers and most engines out there already support more than one animation compression algorithm. This makes migrating existing assets a trivial task, provided the few required features are supported (e.g. 3D scale). Any engine or middleware could integrate ACL with few to no issues once it is production ready.</p>
<p>Animation compression is also a wheel that <strong>NEEDS</strong> re-inventing. Of all my blog posts, a single post receives the overwhelming majority of my traffic: <a href="http://nfrechette.github.io/2017/01/30/anim_compression_unity5/">animation compression in Unity</a>. Why is it so popular? Because as I mention in said post, accuracy issues will be common in Unity and the memory footprint large for high accuracy settings as a direct result of their error metric. Unity is also not alone: Stingray and Lumberyard both use the same metric. It is a <strong>VERY</strong> common error metric and it is terrible. Academic papers on this topic often use different and poor error metrics and show very little to no data to back their results and claims. This makes evaluating these papers for real world usage in games very problematic.</p>
<p>Take <a href="https://www.cs.ubc.ca/~van/papers/2007-gi-compression.pdf">this paper</a> for example. They use the CMU database as well. Their error metric uses the leaf bone positions in object/world space as a measure of accuracy. This entirely ignores the rotational error of the leaf bone. They show a single graph of their results and two short tables. They do not detail the data further. Compare this with the wealth of information I was able to pull out and publish here. Even though ACL is much stricter when measuring the error, it is obvious that wavelets fail terribly to compete at the same level of accuracy (which barely makes it in their published findings). Note that they make no mention of what is an acceptable quality level that one might be able to realistically use.</p>
<p>Here is <a href="http://dl.acm.org/citation.cfm?id=3102236">another recent paper</a> published by someone I have met and have great respect for. The paper does not mention which error metric was used to compare against what they had prior, nor does it mention how competitive their previous implementation was. It does not publish any concrete data either and only claims that the memory footprint reduces by 65% on average against their previous in-house techniques. It does provide a supplemental video which shows a small curated list of clips along with some statistics but without further information, it is impossible to objectively evaluate how it performs and where it lies on the spectrum of published techniques. Despite these shortcomings, it looks very promising (David knows his stuff!) and I am certainly looking forward to implementing this within ACL.</p>
<p>ACL does not only strive to improve on existing techniques; it will also establish a much-needed baseline to compare against and set a standard for how animation compression should be measured.</p>
<h1 id="next-steps">Next steps</h1>
<p>The results so far clearly show that ACL is one step closer to being production ready. The next few months will focus on bridging that gap towards reaching <strong>v1.0.0</strong>. In the coming releases, scale support will be added as well as support for other leading platforms. This will be done through a rudimentary Unreal 4 integration to make sure it is tested in a real engine and thus real world settings.</p>
<p>No further effort on my part will be made towards improving the above results until our first production release is made. However, <a href="https://github.com/CodyDWJones">Cody Jones</a> is working on integrating curve key reduction in the meantime.</p>
<p>Special thanks to Cody and <a href="https://github.com/tirpidz">Martin Turcotte</a> for their constant feedback and contributions!</p>
Math accuracy: Normalizing quaternions2017-08-30T00:00:00+00:00http://nfrechette.github.io/2017/08/30/math_accuracy_quat<p>While investigating precision issues with <a href="https://github.com/nfrechette/acl">ACL</a>, I ran into two problems that I hadn’t seen documented elsewhere and that slightly surprised me.</p>
<h1 id="dot-product">Dot product</h1>
<p>Calculating the dot product between two vectors is a very common operation used for all sorts of things. In an animation compression library, it’s primary use is normalizing <a href="https://en.wikipedia.org/wiki/Quaternion">quaternions</a>. Due to the nature of the code, accuracy is very important as it can impact the final compressed size as well as the resulting decompression error.</p>
<p><a href="https://en.wikipedia.org/wiki/SSE4">SSE 4</a> introduced a dot product instruction: <code class="language-plaintext highlighter-rouge">DPPS</code>. It allows the generated code to be more concise and compact by using fewer registers and instructions. I won’t speak to its performance here but sadly; its accuracy is not good enough for us by a tiny, yet important, sliver.</p>
<p>For the purpose of this blog post, we will use the following nearly normalized quaternion as an example: <code class="language-plaintext highlighter-rouge">{ X, Y, Z, W } = { -0.6767403483, 0.7361232042, 0.0120376134, -0.0006215832 }</code>. This is a real quaternion from a real clip of the <a href="http://mocap.cs.cmu.edu/">Carnegie-Mellon University (CMU) motion capture database</a> that proved to be problematic. With doubles, the dot product is <strong>1.0000001612809224</strong>.</p>
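<p>The difference is easy to reproduce. Here is a small sketch that accumulates the dot product in both precisions; the exact <em>float32</em> result can vary slightly with instruction ordering and compiler flags:</p>
<pre><code class="language-cpp">#include &lt;cstdio&gt;

int main()
{
    // The problematic quaternion from CMU mentioned above
    const float q[4] = { -0.6767403483f, 0.7361232042f, 0.0120376134f, -0.0006215832f };

    // Accumulate in float32, the way the runtime code does
    const float dot32 = q[0] * q[0] + q[1] * q[1] + q[2] * q[2] + q[3] * q[3];

    // Accumulate in float64 to obtain the reference value
    double dot64 = 0.0;
    for (int i = 0; i &lt; 4; ++i)
        dot64 += double(q[i]) * double(q[i]);

    printf("float32: %.9f\nfloat64: %.16f\n", dot32, dot64);
    return 0;
}
</code></pre>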
<p>Using plain C++ yields the following code and assembly (compiled with AVX support under Visual Studio 2015 with an x64 target):
<script src="https://gist.github.com/nfrechette/745866253ad7a664bd97af1e008060a2.js"></script></p>
<ul>
<li>The result is: <strong>1.00000024</strong>. Not quite the same but close.</li>
</ul>
<p>Using the SSE 4 dot product instruction yields the following code and assembly:
<script src="https://gist.github.com/nfrechette/22cc41715d593f1cd5fc60aaab1c37ef.js"></script></p>
<ul>
<li>The result is: <strong>1.00000024</strong>.</li>
</ul>
<p>Using a pure SSE 2 implementation yields the following assembly:
<script src="https://gist.github.com/nfrechette/e957d56c693228dd6c6f299ca68936e3.js"></script></p>
<ul>
<li>The result is: <strong>1.00000012</strong>.</li>
</ul>
<p>These are all nice but it isn’t immediately obvious how big the impact can be. Let’s see how they perform after taking the square root (note that the SSE 2 <code class="language-plaintext highlighter-rouge">SQRT</code> instruction is used here):</p>
<ul>
<li>C++: <strong>1.00000012</strong></li>
<li>SSE 4: <strong>1.00000012</strong></li>
<li>SSE 2: <strong>1.00000000</strong></li>
</ul>
<p>Again, these are all pretty much the same. What happens when we take the square root reciprocal after 2 <a href="https://en.wikipedia.org/wiki/Newton%27s_method">iterations of Newton-Raphson</a>?
<script src="https://gist.github.com/nfrechette/42c139a2ebac76976804d8bad1ff7e27.js"></script></p>
<ul>
<li>C++: <strong>0.999999881</strong></li>
<li>SSE 4: <strong>0.999999881</strong></li>
<li>SSE 2: <strong>0.999999940</strong></li>
</ul>
<p>With this square root reciprocal, here is how our quaternions look once normalized through multiplication, along with their associated dot products.</p>
<ul>
<li>C++: <code class="language-plaintext highlighter-rouge">{ -0.676740289, 0.736123145, 0.0120376116, -0.000621583138 }</code> = <strong>0.999999940</strong></li>
<li>SSE 4: <code class="language-plaintext highlighter-rouge">{ -0.676740289, 0.736123145, 0.0120376116, -0.000621583138 }</code> = <strong>1.00000000</strong></li>
<li>SSE 2: <code class="language-plaintext highlighter-rouge">{ -0.676740289, 0.736123145, 0.0120376125, -0.000621583138 }</code> = <strong>0.999999940</strong></li>
</ul>
<p>Here is the dot product calculated with doubles:</p>
<ul>
<li>C++: <strong>0.99999999381912441</strong></li>
<li>SSE 4: <strong>0.99999999381912441</strong></li>
<li>SSE 2: <strong>0.99999999384079208</strong></li>
</ul>
<p>And the new square root:</p>
<ul>
<li>C++: <strong>0.999999940</strong></li>
<li>SSE 4: <strong>1.00000000</strong></li>
<li>SSE 2: <strong>0.999999940</strong></li>
</ul>
<p>Now the new reciprocal square root:</p>
<ul>
<li>C++: <strong>1.00000000</strong></li>
<li>SSE 4: <strong>1.00000000</strong></li>
<li>SSE 2: <strong>1.00000000</strong></li>
</ul>
<p>After all of this, our delta from a true length of <strong>1.0</strong> (as calculated with doubles) was <strong>1.612809224e-7</strong> before normalization. Here is how they fare afterwards:</p>
<ul>
<li>C++: <strong>6.18087559e-9</strong></li>
<li>SSE 4: <strong>6.18087559e-9</strong></li>
<li>SSE 2: <strong>6.15920792e-9</strong></li>
</ul>
<p>And thus, the difference between using SSE 4 and SSE 2 is just <strong>2.166767e-11</strong>.</p>
<p>As it turns out, the SSE 2 implementation appears to be the most accurate and yields the lowest decompression error as well as a smaller memory footprint (by a tiny bit).</p>
<h1 id="normalizing-a-quaternion">Normalizing a quaternion</h1>
<p>There are two mathematically equivalent ways to normalize a quaternion: taking the dot product, calculating the square root, and dividing the quaternion with the result, or taking the dot product, calculating the reciprocal square root, and multiplying the quaternion with the result.</p>
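<p>In scalar form, the two methods look like the following sketch; the reciprocal square root helper stands in for the SSE estimate refined with two Newton-Raphson iterations discussed earlier, and none of this is the actual ACL math library:</p>
<pre><code class="language-cpp">#include &lt;cmath&gt;
#include &lt;xmmintrin.h&gt;

struct Quat { float x, y, z, w; };

// Reciprocal square root: hardware estimate refined with two
// Newton-Raphson iterations, x' = x * (1.5 - 0.5 * v * x * x),
// each iteration roughly doubling the number of accurate bits.
inline float rsqrt_nr2(float v)
{
    float x = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(v)));
    x = x * (1.5f - 0.5f * v * x * x);
    x = x * (1.5f - 0.5f * v * x * x);
    return x;
}

// Method 1: divide by the square root of the dot product
inline Quat quat_normalize_div(const Quat&amp; q)
{
    const float len = std::sqrt(q.x * q.x + q.y * q.y + q.z * q.z + q.w * q.w);
    return { q.x / len, q.y / len, q.z / len, q.w / len };
}

// Method 2: multiply by the reciprocal square root of the dot product
inline Quat quat_normalize_mul(const Quat&amp; q)
{
    const float inv_len = rsqrt_nr2(q.x * q.x + q.y * q.y + q.z * q.z + q.w * q.w);
    return { q.x * inv_len, q.y * inv_len, q.z * inv_len, q.w * inv_len };
}
</code></pre>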
<p>Are the two methods equivalent with floating point mathematics? Again, we will not discuss the performance implications as we are only concerned with accuracy here. Using the previous example quaternion and using the SSE 2 dot product yields the following result with the first method:</p>
<ul>
<li>Dot product: <strong>1.00000012</strong></li>
<li>Length: <code class="language-plaintext highlighter-rouge">sqrt(1.00000012)</code> = <strong>1.00000000</strong></li>
<li>Normalized quaternion using division: <code class="language-plaintext highlighter-rouge">{ -0.6767403483, 0.7361232042, 0.0120376134, -0.0006215832 }</code></li>
<li>New dot product: <strong>1.00000012</strong></li>
<li>New length: <strong>1.00000000</strong></li>
</ul>
<p>And now using the reciprocal square root with 2 Newton-Raphson iterations:</p>
<ul>
<li>Dot product: <strong>1.00000012</strong></li>
<li>Reciprocal square root: <strong>0.999999940</strong></li>
<li>Normalized quaternion using multiplication: <code class="language-plaintext highlighter-rouge">{ -0.676740289, 0.736123145, 0.0120376125, -0.000621583138 }</code></li>
<li>New dot product: <strong>0.999999940</strong></li>
<li>New length: <strong>0.999999940</strong></li>
<li>New reciprocal square root: <strong>1.00000000</strong></li>
</ul>
<p>By using the division, normalization fails to yield a more accurate quaternion because the square root is exactly <strong>1.0</strong>. The reciprocal square root instead allows us to obtain a more accurate quaternion, as demonstrated in the previous section.</p>
<h1 id="conclusion">Conclusion</h1>
<p>It is hard to tell whether the numerical difference is meaningful in isolation but over the entire CMU database, both tricks together help reduce the memory footprint by <strong>200 KB</strong> and lower our error slightly.</p>
<p>For most game purposes, the accuracy implication of these methods does not matter all that much and rarely has a measurable impact. Picking whichever method is fastest to execute might just be good enough.</p>
<p>But when accuracy is of a particular concern, special care must be taken to ensure every bit of precision is retained. This is one of the motivating reasons for ACL having its own internal math library: granular control over performance and accuracy.</p>
Animation Compression Library: Release 0.3.02017-07-29T00:00:00+00:00http://nfrechette.github.io/2017/07/29/acl_v0.3.0<p>This release marks an important milestone. The library now supports a fully variable bit rate and it performs admirably so far. <a href="https://github.com/nfrechette/acl/blob/develop/docs/cmu_performance.md">The numbers don’t lie.</a> Without using any form of key reduction, we match the compression ratio of Unreal 4 (which uses <a href="http://nfrechette.github.io/2017/01/11/anim_compression_unreal4/">a mix of linear key reduction with a form of variable quantization</a>) and many more tricks will follow to push this even further. It is worth noting that this new variable bit rate algorithm is entirely different from the one I presented at <a href="http://nfrechette.github.io/2017/03/08/anim_compression_gdc2017/">GDC 2017</a> and should outperform it. In due time, more stats and graphs will be published to outline how the data looks across the whole dataset.</p>
<p>While <a href="https://github.com/nfrechette/acl/releases/tag/v0.3.0">v0.3.0</a> remains a pre-release, we are quickly approaching a production ready state. Already, for the vast majority of clips, the error introduced is invisible to the naked eye and the performance is there to match. The major features missing to reach the production ready state are: scale support (sadly the Carnegie-Mellon data set does not contain any scale and, as such, testing this will be problematic), and proper multi-platform support (iOS, OS X, Android, clang, gcc, etc.). Both of these are easily solved problems, which is why they were deferred to future releases.</p>
<p>Version 0.4.0 will aim to introduce <a href="http://nfrechette.github.io/2016/11/10/anim_compression_uniform_segmenting/">clip segmenting</a> and hopefully <a href="http://nfrechette.github.io/2016/12/10/anim_compression_curve_fitting/">curve based key reduction</a>. Segmenting should improve our accuracy and at the same time reduce the memory footprint even further. Curve based key reduction should shrink the memory footprint as well, perhaps dramatically so. Stay tuned!</p>
Introducing ACL2017-06-25T00:00:00+00:00http://nfrechette.github.io/2017/06/25/introducing_acl<p>Over the years, I’ve had my fair share of discussions about animation compression and two things became obvious over time:
we were all (re-)doing similar things and none of us had access to a state of the art implementation to compare against.
This led to rampant speculation about which algorithm was superior or inferior.
Having implemented a few algorithms in the past, I have finally decided to redo all that work once more and in the open this time.
Say ‘Hello’ to the <a href="https://github.com/nfrechette/acl">Animation Compression Library</a> (<strong>ACL</strong> for short).</p>
<p>To quote the readme:</p>
<blockquote>
<p>This library has two primary goals:</p>
<ul>
<li>Implement state of the art and production ready animation compression algorithms</li>
<li>Serve as a benchmark to compare various techniques against one another</li>
</ul>
</blockquote>
<p>Over the next few months, I hope to implement state of the art versions of common algorithms and to surpass what current game engines offer.
It is my hope that this library can serve as the foundation for an industry standard so that together we may be able to move forward, well past the foot sliding issues of yesteryear!</p>
Optimizing 4x4 matrix multiplication2017-04-13T00:00:00+00:00http://nfrechette.github.io/2017/04/13/modern_simd_matrix_multiplication<p>In modern video games, the 4x4 matrix multiplication is an important cornerstone. It is used for a very long list of things: moving individual character joints, physics simulation, rendering, etc. To generate a single video game image (and we typically generate between 25 and 60 per second), several thousand matrix multiplications will take place. Today, we will take an in-depth look at such a fundamental piece of code.</p>
<p>As I will show, we can improve on some of the most common implementations out in the wild. I will use <a href="https://github.com/Microsoft/DirectXMath">DirectX Math</a> as a reference here but I have seen identical implementations in many state of the art game engines.</p>
<p>This blog post will be broken down into four sections:</p>
<ul>
<li><a href="#test_cases">The test cases that will guide our optimization</a></li>
<li><a href="#function_signatures">Various ways we can write our function signature and their implications</a></li>
<li><a href="#function_implementations">Six different implementations</a></li>
<li><a href="#results">The results</a></li>
</ul>
<p>Note that all the code for this can be found <a href="https://github.com/nfrechette/DirectXMathOptimizations">here</a>. Feel free to toy with it and replicate the results or contribute your own.</p>
<h1 id="our-test-cases"><a name="test_cases"></a>Our test cases</h1>
<p>In order to keep our observations grounded in reality, we will use three test cases that represent common heavy usage of 4x4 matrix multiplication. These are very synthetic in nature but they will make profiling and measuring immensely easier. Modern game engines do many things with many threads which can make profiling on PC somewhat more complicated, especially for something as short and simple as matrix multiplication.</p>
<h2 id="test-case-1">Test case #1</h2>
<p>Our first test case applies a constant matrix to an array of 64 matrices. Aside from our constant matrix, each input will be read from memory (here everything fits in our processor cache but it doesn’t matter much) and each output will be written back to memory. This code is meant to simulate the common operation of transforming an array of object space matrices into world space matrices, perhaps for skinning purposes.</p>
<script src="https://gist.github.com/nfrechette/e9c4d07695568480117c76848f24a0a8.js"></script>
<h2 id="test-case-2">Test case #2</h2>
<p>Our second test case transforms an array of 64 local space matrices into an array of 64 object space matrices. To perform this operation, each local space matrix is multiplied by the parent object space matrix. The root matrix is trivial and equal in both local and object space as it has no parent. In this contrived example the parent matrix is always the previous entry in the array but in practice it would be some arbitrary index previously transformed. This operation is common at the end of the animation runtime where the pose generated will typically be in local space.</p>
<script src="https://gist.github.com/nfrechette/4d54d507fa05018e7ec2a14d8ebdce4c.js"></script>
<h2 id="test-case-3">Test case #3</h2>
<p>Our third test case takes two constant matrices and writes the result to a static array. The array is made static to prevent the compiler from stripping the code. This code is synthetic and meant to profile the one-off multiplications that happen everywhere in gameplay code. We perform the operation 64 times to help us measure the impact since the code is very fast to begin with.</p>
<script src="https://gist.github.com/nfrechette/6059d406eb4e522d0cb484ab907b6692.js"></script>
<h1 id="function-signature-variations"><a name="function_signatures"></a>Function signature variations</h1>
<p>Our reference implementation taken from DirectX Math has the following signature:</p>
<script src="https://gist.github.com/nfrechette/d54f3d5bdb2bd78a2288d3ca5006d511.js"></script>
<p>There are a few things that are noteworthy here. The function is marked inline but due to its considerable size, the function is generally never inlined. It also uses the <a href="https://msdn.microsoft.com/en-us/library/dn375768.aspx">__vectorcall</a> calling convention with the macro <code class="language-plaintext highlighter-rouge">XM_CALLCONV</code>. This allows up to 6 SIMD input arguments to be passed by register (the default calling convention passes them by value on the stack, unlike with PowerPC) and the return value can also be up to 4 SIMD outputs passed by register. This also works for aggregate types such as <code class="language-plaintext highlighter-rouge">XMMATRIX</code>. The function takes 2 arguments: <strong>M1</strong> is passed by register with the help of <code class="language-plaintext highlighter-rouge">FXMMATRIX</code> and <strong>M2</strong> is passed by <code class="language-plaintext highlighter-rouge">const &</code> with the help of <code class="language-plaintext highlighter-rouge">CXMMATRIX</code>.</p>
<p>This function signature will be called in our data: <strong>reg</strong></p>
<p>We can vary the function signature in a number of ways and it will be interesting to compare the results. I came up with a number of variations. They are as follow.</p>
<h2 id="force-inline">Force inline</h2>
<p>As mentioned, since our function is very large, inlining will typically fail to happen. However, in very hot code it still makes sense to inline the function.</p>
<script src="https://gist.github.com/nfrechette/66c6e507c14c0d992eed5818a070343b.js"></script>
<p>This function signature will be called: <strong>inl</strong></p>
<h2 id="pass-everything-from-memory">Pass everything from memory</h2>
<p>An obvious change we can make is to pass both input arguments as <code class="language-plaintext highlighter-rouge">const &</code>. In many cases our matrices might not be cached in local registers to begin with and we have to load them from memory anyway (such as in test case #2).</p>
<script src="https://gist.github.com/nfrechette/d7160fb843b8e7f75078179f70545f65.js"></script>
<p>This function signature will be called: <strong>mem</strong></p>
<h2 id="flip-our-matrix-arguments">Flip our matrix arguments</h2>
<p>In our matrix implementation, the rows of <strong>M2</strong> are multiplied whole with each matrix element from <strong>M1</strong>. The code ends up looking like this:</p>
<script src="https://gist.github.com/nfrechette/782818d62e8e7d34587de66dab557e00.js"></script>
<p>This repeats 4 times, once for each row of <strong>M1</strong>. It is obvious that we can cache the 4 rows of <strong>M2</strong> and indeed the compiler typically does so for us in our reference implementation. Each of those 4 rows will be needed again and again but the same cannot be said of the 4 rows of <strong>M1</strong> which are only needed temporarily. It would thus make sense to pass the matrix arguments in the opposite order: <strong>M2</strong> first by register and <strong>M1</strong> second by <code class="language-plaintext highlighter-rouge">const &</code>.</p>
<script src="https://gist.github.com/nfrechette/bdf686b37b8dd564aeb5e47520954766.js"></script>
<p>Note that we use a macro to perform the flip cleanly. I would have preferred a force inlined function but the compiler was not generating clean assembly from it.</p>
<p>This function signature will be called: <strong>flip</strong></p>
<h2 id="expanded-matrix-argument">Expanded matrix argument</h2>
<p>Even though the __vectorcall calling convention conveniently passes our matrix in 4 registers, it might help the compiler make different decisions if we are explicit about our intentions.</p>
<script src="https://gist.github.com/nfrechette/773243b03b533a22086823679deeae38.js"></script>
<p>Our expanded variant will always use the flipped argument ordering. Measuring the non-flipped ordering is left as an exercise to the reader.</p>
<p>This function signature will be called: <strong>exp</strong></p>
<h2 id="return-value-by-argument">Return value by argument</h2>
<p>Another thing that is very common for a matrix multiplication implementation is to have the return value as a pointer or reference in a third argument.</p>
<script src="https://gist.github.com/nfrechette/30824a485577af1c8d3bb2e60e879926.js"></script>
<p>Again this might help the compiler make different optimization choices. Note as well that implementations with this variant must explicitly cache the rows of <strong>M2</strong> in order to produce the correct result in the case where the result is written to <strong>M2</strong>. It also improves the generated assembly as otherwise the output matrix would alias the arguments, causing the compiler to not perform the caching automatically for you.</p>
<p>This function signature can be applied to all our variants and it will add the suffix <strong>2</strong></p>
<h2 id="permute-all-the-things">Permute all the things!</h2>
<p>Taking all of this together and permuting everything yields 12 variants as follow:</p>
<script src="https://gist.github.com/nfrechette/8fbbd76a6551a4eca24bac721e4c2c36.js"></script>
<p>In our data, they are called:</p>
<ul>
<li>reg</li>
<li>reg2</li>
<li>reg_flip</li>
<li>reg_flip2</li>
<li>reg_exp</li>
<li>reg_exp2</li>
<li>mem</li>
<li>mem2</li>
<li>inl</li>
<li>inl2</li>
<li>inlexp</li>
<li>inlexp2</li>
</ul>
<p>Hopefully this covers a large extent of common and sensible variations.</p>
<h1 id="our-competing-implementations"><a name="function_implementations"></a>Our competing implementations</h1>
<p>I was able to come up with six distinct implementations of the matrix multiplication, including the original reference. Note that I did not attempt to make the fastest implementation possible; there are other things we could try to make these faster. I also made sure that each version gave a result that was exactly the same as the reference implementation, down to the last bit (binary exact).</p>
<h2 id="reference">Reference</h2>
<p>The reference implementation is quite large; as such I will not include the full source here, but the code can be found <a href="https://github.com/nfrechette/DirectXMathOptimizations/blob/master/Inc/MatrixMultiply/MatrixMultiply_Ref.h">here</a>.</p>
<p>The reference <strong>regexp2</strong> variant uses 10 XMM registers and totals 70 instructions.</p>
<h2 id="broadcast">Broadcast</h2>
<p>In our reference implementation, an important part can be tweaked a bit.</p>
<script src="https://gist.github.com/nfrechette/e697f4451e67f2568861a9cd5f14a7f8.js"></script>
<p>We load a row from <strong>M1</strong>, extract each component, and replicate it into the 4 lanes of our SIMD register. This will compile down to 1 load instruction followed by 4 shuffle instructions. This was very common on older consoles: loads from memory were very expensive and none of the other instructions could work directly from memory. However, on SSE and in particular with AVX, we can do a bit better. We can use the <a href="https://software.intel.com/en-us/node/524098"><strong>_mm_broadcast_ss</strong></a> instruction. It takes as input a pointer to a scalar floating point value and it will output the replicated value over our 4 SIMD lanes. We thus avoid the separate load instruction.</p>
<script src="https://gist.github.com/nfrechette/a5343e01b4820d63ae5d9a0a791033e1.js"></script>
<p>The code for this variant can be found <a href="https://github.com/nfrechette/DirectXMathOptimizations/blob/master/Inc/MatrixMultiply/MatrixMultiply_V0.h">here</a>.</p>
<p>The version 0 <strong>regexp2</strong> variant uses 7 XMM registers and totals 58 instructions.</p>
<h2 id="looping">Looping</h2>
<p>Another change we can perform is inspired by <a href="http://stackoverflow.com/questions/18499971/efficient-4x4-matrix-multiplication-c-vs-assembly">this post</a> on StackOverflow. I rewrote the assembly into C++ code that uses intrinsics to try and keep it comparable.</p>
<p>Two versions were written: version 2 uses load/shuffle (<a href="https://github.com/nfrechette/DirectXMathOptimizations/blob/master/Inc/MatrixMultiply/MatrixMultiply_V2.h">code here</a>) and version 1 uses broadcast (<a href="https://github.com/nfrechette/DirectXMathOptimizations/blob/master/Inc/MatrixMultiply/MatrixMultiply_V1.h">code here</a>).</p>
<p>Branching was notoriously slow on the old consoles; it will be interesting to see how newer hardware performs.</p>
<p>The version 1 <strong>regexp2</strong> variant uses 7 XMM registers and totals 23 instructions.
The version 2 <strong>regexp2</strong> variant uses 10 XMM registers and totals 37 instructions.</p>
<h2 id="handwritten-assembly">Handwritten assembly</h2>
<p>Similar to our looping versions, I also kept the handwritten assembly version referenced in the StackOverflow post. I made a few tweaks to make sure the results were binary exact. Sadly, the tweaks required the usage of one extra register. Having run out of volatile registers, I elected to load the first row of <strong>M2</strong> directly from memory with the multiply instruction during every iteration.</p>
<p>Only two variants were implemented: <strong>regexp2</strong> and <strong>mem2</strong>.</p>
<p>Two versions were written: version 3 uses load/shuffle (<a href="https://github.com/nfrechette/DirectXMathOptimizations/blob/master/Src/MatrixMultiply_V3.asm">code here</a>) and version 4 uses broadcast (<a href="https://github.com/nfrechette/DirectXMathOptimizations/blob/master/Src/MatrixMultiply_V4.asm">code here</a>).</p>
<p>The version 3 <strong>regexp2</strong> variant uses 5 XMM registers and totals 21 instructions.
The version 4 <strong>regexp2</strong> variant uses 5 XMM registers and totals 17 instructions.</p>
<h1 id="the-results"><a name="results"></a>The results</h1>
<p>For our purposes, each test will be run <strong>1000000</strong> times and the cumulative time will be considered to be <strong>1</strong> sample. We will repeat this to gather <strong>100</strong> samples. To avoid skewing in our data that might result from various external sources (CPU frequency changes, other OS work, etc.), we will retain and use the <strong>80th</strong> percentile from our dataset. Due to the simple nature of the code, this should be good enough for us to draw meaningful conclusions. All measurements are in <strong>milliseconds</strong>.</p>
<p>All of my raw results are parsed with a simple python script to extract the desired percentile and format it in a table form. The script can be found <a href="https://github.com/nfrechette/DirectXMathOptimizations/blob/master/parse_stats.py">here</a>.</p>
<p>I ran everything on my desktop computer which has an <a href="https://ark.intel.com/products/94188/Intel-Core-i7-6850K-Processor-15M-Cache-up-to-3_80-GHz">Intel i7-6850K processor</a>. While my CPU differs from what the Xbox One and PlayStation 4 use, they both support AVX and could benefit from the same changes.</p>
<p>I also measured running the test cases with multithreading enabled to see if the results were consistent. Since the results were indeed consistent, we will only talk about the single threaded case but all the data for the test results can be found here:</p>
<ul>
<li>Single threaded: <a href="/public/matrix_multiplication/results_sync_raw.csv">raw data</a>, <a href="/public/matrix_multiplication/results_sync.csv">parsed data</a>, <a href="/public/matrix_multiplication/results_sync.xlsx">charts</a></li>
<li>One thread per physical core: <a href="/public/matrix_multiplication/results_percore_raw.csv">raw data</a>, <a href="/public/matrix_multiplication/results_percore.csv">parsed data</a>, <a href="/public/matrix_multiplication/results_percore.xlsx">charts</a></li>
<li>One thread per logical core: <a href="/public/matrix_multiplication/results_saturate_raw.csv">raw data</a>, <a href="/public/matrix_multiplication/results_saturate.csv">parsed data</a>, <a href="/public/matrix_multiplication/results_saturate.xlsx">charts</a></li>
</ul>
<h2 id="test-case-results">Test case results</h2>
<p>Here are the results for our 3 test cases:</p>
<p><img src="/public/matrix_multiplication/testcase1_all.png" alt="Test Case #1 Results" />
<img src="/public/matrix_multiplication/testcase2_all.png" alt="Test Case #2 Results" />
<img src="/public/matrix_multiplication/testcase3_all.png" alt="Test Case #3 Results" /></p>
<p>A few things are immediately obvious:</p>
<ul>
<li>Versions 1 and 2, the looping intrinsic versions, are terribly slow. I moved them to the right so we can focus on the left part.</li>
<li>Version 0 is consistently faster than our reference implementation.</li>
</ul>
<p>Here are the same results but only considering the best 3 variants (<strong>regexp2</strong>, <strong>mem2</strong>, and <strong>inlexp2</strong>) and the best 4 versions (reference, version 0, version 3, and version 4).</p>
<p><img src="/public/matrix_multiplication/best_results.png" alt="Best Results" /></p>
<h2 id="loadshuffle-versus-broadcast">Load/shuffle versus broadcast</h2>
<p><img src="/public/matrix_multiplication/load_shuffle_vs_broadcast.png" alt="Load/Shuffle VS Broadcast" /></p>
<p>Overwhelmingly, we can see that the versions that use broadcast are faster than their counterpart that uses load/shuffle. This is not too surprising: we use fewer registers, as a result fewer registers spill on the stack, and fewer instructions are used. This is more significant when the function isn’t force inlined since in our test cases, whatever we spill on the stack ends up hoisted outside of our loops when inlined.</p>
<p>The fact that we use fewer registers and instructions also has other side effects, namely it can help the compiler inline functions. In particular, this is the case for versions 1 and 2: version 1 uses broadcast and gets inlined automatically while version 2 uses load/shuffle and does not get inlined.</p>
<h2 id="output-in-registers-versus-memory">Output in registers versus memory</h2>
<p><img src="/public/matrix_multiplication/output_register_vs_memory.png" alt="Output: Register VS Memory" /></p>
<p>For test cases #1 and #3, passing our return value as an argument is a net win when there is no inlining. This remains true to a lesser extent even when the functions are force inlined which means it helps the compiler make better choices.</p>
<p>However, for test case #2, it can sometimes be a bit slower. It seems that the assembly generated at the call site isn’t as clean as it could be. It’s possible that by tweaking the test case code a bit, performance could be improved.</p>
<h2 id="flipped-versus-expanded">Flipped versus expanded</h2>
<p><img src="/public/matrix_multiplication/flip_vs_expanded.png" alt="Flipped VS Expanded" /></p>
<p>Looking only at version 0, the behaviour seems to differ depending on whether the result is passed as an argument or by register. In the <strong>regflip</strong> and <strong>regexp</strong> variants, performance can be faster (test case #2), the same (test case #1), or slower (test case #3). It seems there is high variability in what the compiler chooses to do. On the other hand, with the <strong>regflip2</strong> and <strong>regexp2</strong> variants, performance is generally faster. Test case #2 has about equal performance but as we have seen, that test case seems to favour results being returned by register.</p>
<h2 id="inlining">Inlining</h2>
<p><img src="/public/matrix_multiplication/inlining_on_off.png" alt="Inlining On/Off" /></p>
<p>As it turns out, inlining sometimes gives a massive performance gain and sometimes it comes down to about the same. In general, it is best to let the compiler make inlining decisions but sometimes in very hot code, it is desirable to manually force the inlining for performance reasons. It thus makes sense to provide at least 2 versions of matrix multiplication: with and without force inlining.</p>
<h2 id="looping-1">Looping</h2>
<p><img src="/public/matrix_multiplication/looping_vs_unrolled.png" alt="Looping VS Unrolled" /></p>
<p>The looping versions are quite interesting. The 2 versions that use intrinsics perform absolutely terribly. They are worse by far, generally breaking out of the charts above. Strangely, they seem to benefit massively from passing the result as an argument (not shown on the graph above). Even with the handwritten assembly versions, we can see that they are generally slower than our unrolled intrinsic version 0. As it turns out, branching is still not a great idea in hot code even with modern hardware.</p>
<h2 id="is-handwritten-assembly-worth-it">Is handwritten assembly worth it?</h2>
<p><img src="/public/matrix_multiplication/handwritten_assembly_vs_intrinsics.png" alt="Handwritten Assembly VS Intrinsics" /></p>
<p>Looking at our looping versions, it is obvious that carefully crafting the assembly by hand can still give significant results. However, we must be careful when doing so. <a href="http://stackoverflow.com/questions/41208105/making-assembly-function-inline-in-x64-visual-studio">In particular, with Visual Studio, hand written assembly functions will never be inlined in x64, even by the linker</a>. Something to keep in mind.</p>
<h2 id="best-of-the-best">Best of the best</h2>
<p><img src="/public/matrix_multiplication/reference_vs_best.png" alt="Reference VS The Best" /></p>
<p>In our results, a clear winner stands above all others: version 0 <strong>inlexp2</strong>:</p>
<ul>
<li>In test case #1, it is <strong>34%</strong> faster than the reference implementation</li>
<li>In test case #2, it is <strong>16%</strong> faster than the reference implementation</li>
<li>In test case #3, it is <strong>31%</strong> faster than the reference implementation</li>
</ul>
<p>Even when it isn’t the fastest implementation, it is within measuring error of the leading alternative. And that leading alternative is always a variant of version 0.</p>
<h1 id="conclusion">Conclusion</h1>
<p>As demonstrated by our data, even a hyper optimized piece of code such as matrix multiplication can sometimes be improved by new hardware features such as the AVX broadcast instruction. In particular, the broadcast instruction allows us to reduce register pressure which avoids spilling registers on the stack and saves on the corresponding instructions that do so. On a platform such as x64, register pressure is a real and important problem that must be taken into account for complex and hot code.</p>
<p>From our results, it seems to make sense to provide 2 implementations for matrix multiplication:</p>
<ul>
<li>One of <strong>regflip2</strong>, <strong>regexp2</strong>, or <strong>mem2</strong> that does not force inlining, suitable for everyday usage</li>
<li><strong>inlexp2</strong> that forces inlining, perfect for that piece of hot code that needs to save every cycle</li>
</ul>
<p>This keeps things simple for the user: all variants return the result in a 3rd argument. Macros can be used to keep things clean and fast.</p>
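<p>As a hedged sketch, such a pair of entry points might be exposed like this; the names below are illustrative and not part of DirectX Math:</p>
<pre><code class="language-cpp">// Everyday variant: result returned via the 3rd argument, no forced inlining.
void XM_CALLCONV MatrixMultiply2(FXMMATRIX M2, CXMMATRIX M1, XMMATRIX& Result);

// Hot code variant: force inlined with expanded matrix arguments (inlexp2),
// wrapped in a macro to keep call sites clean.
#define XM_MATRIX_MULTIPLY(M2, M1, Result) \
	MatrixMultiplyInlExp2((M2).r[0], (M2).r[1], (M2).r[2], (M2).r[3], (M1), (Result))
</code></pre>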
<p>As always with optimizations, it is important to measure often and to never blindly make a change without measuring first.</p>
<p><em>DirectX Math commit <a href="https://github.com/Microsoft/DirectXMath/commit/d1aa00372040715cebdd5bf599314ef6a4fbe713">d1aa003</a></em></p>
Modern SIMD Programming in the SSE Era2017-04-11T00:00:00+00:00http://nfrechette.github.io/2017/04/11/modern_simd_hardware<p>For a very long time now, each new gaming console generation has come with a new set of hardware. New hardware meant new constraints which meant we had to revisit old optimization tricks, rules, and popular beliefs.</p>
<p>With the latest generation, both leading consoles (<a href="https://en.wikipedia.org/wiki/Xbox_One">Xbox One</a> and <a href="https://en.wikipedia.org/wiki/PlayStation_4">PlayStation 4</a>) have opted for <a href="https://en.wikipedia.org/wiki/X86-64">x64</a> and <a href="https://en.wikipedia.org/wiki/Advanced_Micro_Devices">AMD CPUs</a>. These are indeed significantly different from the old <a href="https://en.wikipedia.org/wiki/PowerPC">PowerPC CPUs</a> used by <a href="https://en.wikipedia.org/wiki/Xbox_360">Xbox 360</a> and <a href="https://en.wikipedia.org/wiki/PlayStation_3">PlayStation 3</a>.</p>
<p>I spent a significant amount of time last year optimizing cloth simulation code that had been written with the previous generation in mind and with a few tweaks I was able to get decent gains on the latest hardware. Unfortunately that code was proprietary and as such I cannot use it to show the changes and lessons I learned. Instead, I will take the decent and well respected <a href="https://github.com/Microsoft/DirectXMath">DirectX Math</a> public library and highlight as well as optimize specific examples.</p>
<p>Due to the large amount of code, charts, and information required to properly explain everything, this topic will be split into several parts:</p>
<ul>
<li><a href="#hardware_highlights">Hardware highlights</a></li>
<li><a href="/2017/04/13/modern_simd_matrix_multiplication/">Matrix multiplication</a></li>
<li>And much more to come…</li>
</ul>
<h1 id="hardware-highlights"><a name="hardware_highlights"></a>Hardware highlights</h1>
<h2 id="the-powerpc-era">The PowerPC era</h2>
<p>In the Xbox 360 and PlayStation 3 generation, both CPUs were quite different; in particular, the PlayStation 3 CPU was a significant departure from modern trends at the time. We will focus on it today only because it has been more publicly documented than its counterpart and the lessons we’ll learn from it apply to both consoles.</p>
<p>The PlayStation 3 console sported a fancy <a href="https://en.wikipedia.org/wiki/Cell_(microprocessor)">Cell microprocessor</a>. We will not focus on the multi-threading aspects in this post series and instead take a deeper look at the execution of a single thread at the assembly level. One important characteristic of PowerPC chips is that they are based on the <a href="https://en.wikipedia.org/wiki/Reduced_instruction_set_computing">RISC</a> model. RISC hardware is generally known to have more user addressable registers than <a href="https://en.wikipedia.org/wiki/Complex_instruction_set_computing">CISC</a>, the variant used by Intel and AMD for modern x64 CPUs. In fact, PowerPC CPUs have 32 general purpose registers, 32 floating point registers, and 32 <a href="https://en.wikipedia.org/wiki/AltiVec">AltiVec</a> SIMD registers (note that the Cell SPEs had 128 registers!). Both consoles also had cache lines that were 128 bytes wide and their CPUs did not support <a href="https://en.wikipedia.org/wiki/Out-of-order_execution">out-of-order execution</a>.</p>
<p>This gave rise to two common techniques to optimize code at the assembly level that we will revisit in this series.</p>
<p>First, the large number of registers meant that we could leverage them easily to mitigate the fact that loading values from memory was slow and could not be easily hidden due to the in-order nature of the execution. This led to register packing: a technique where values are loaded first in bulk as part of a vector value and specific components are extracted on demand.</p>
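<p>A minimal sketch of the idea, written with SSE intrinsics for familiarity (the AltiVec equivalents differ in syntax but not in spirit):</p>
<pre><code class="language-cpp">#include <xmmintrin.h>

// Register packing: one bulk vector load replaces four scalar loads and the
// individual components are extracted (replicated) on demand.
__m128 weighted_sum(__m128 a, __m128 b, const float* weights /* 4 floats, 16 byte aligned */)
{
	const __m128 packed = _mm_load_ps(weights);	// single load for all four weights

	const __m128 w0 = _mm_shuffle_ps(packed, packed, _MM_SHUFFLE(0, 0, 0, 0));
	const __m128 w1 = _mm_shuffle_ps(packed, packed, _MM_SHUFFLE(1, 1, 1, 1));

	return _mm_add_ps(_mm_mul_ps(a, w0), _mm_mul_ps(b, w1));
}
</code></pre>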
<p>Second, because the register sets for floating point and SIMD math were different, moving a value from one set to another involved storing the value in memory (generally the stack) and reloading it. This led to what is commonly known as <a href="http://assemblyrequired.crashworks.org/load-hit-stores-and-the-__restrict-keyword/">load-hit-store</a> stalls. To mitigate this, whenever SIMD math was mixed with scalar math it was generally best to treat simple scalar floating point math as if it were SIMD: the same scalar was replicated to every SIMD lane.</p>
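<p>The same pattern expressed with SSE intrinsics as a sketch (on PowerPC the splat would be an AltiVec instruction instead):</p>
<pre><code class="language-cpp">#include <xmmintrin.h>

// Treat the scalar blend factor as if it were SIMD: replicate it into every
// lane once, then remain in the SIMD register set for all subsequent math.
__m128 blend_positions(__m128 pos_a, __m128 pos_b, float blend_factor)
{
	const __m128 blend = _mm_set1_ps(blend_factor);		// splat into all 4 lanes
	const __m128 delta = _mm_sub_ps(pos_b, pos_a);
	return _mm_add_ps(pos_a, _mm_mul_ps(delta, blend));	// lerp entirely in SIMD
}
</code></pre>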
<p>It is also worth noting that the calling convention used on those consoles favours passing many arguments by register, including vector arguments. There are also many other peculiarities that affected how code should best be written for the hardware, such as avoiding stressing the poor branch predictor, but we will not cover these at this time.</p>
<h2 id="the-x64-sse-era">The x64 SSE era</h2>
<p>The modern Xbox One and PlayStation 4 consoles opted for x64 AMD CPUs. These are state of the art processors with out-of-order execution, powerful branch prediction, and a large set of instructions from SSE and AVX. These CPUs depart from the previous generation significantly in the number of registers they have: 16 general purpose registers and 16 SIMD registers. The cache lines are also standard in size: 64 bytes wide.</p>
<p>The greatly reduced number of registers means that x64 code is much more prone to register pressure issues. Whenever the number of registers needed goes above 16, registers will spill on the stack, degrading performance. In practice, it isn’t a huge issue in large part because x64 supports instructions that directly work from memory, avoiding the need for a separate load instruction (unlike PowerPC CPUs). For example, in the expression <code class="language-plaintext highlighter-rouge">C = add(A, B)</code>, <strong>A</strong> and <strong>B</strong> can both be residing in registers or <strong>B</strong> could optionally be a memory address. Internally the CPU has many more than 16 registers and thanks to register renaming and compound CISC instructions, our <code class="language-plaintext highlighter-rouge">add</code> instruction will end up performing a load for us behind the scenes. Leveraging this can be very important in hot code as we will see in this series.</p>
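<p>A small sketch of this in practice; with the pointer suitably aligned, the load below is typically folded directly into the add instruction:</p>
<pre><code class="language-cpp">#include <xmmintrin.h>

// On x64, the compiler can emit a single 'addps xmm0, [mem]' here instead of
// a separate load followed by an add (b_ptr is assumed 16 byte aligned).
__m128 add_from_memory(__m128 a, const float* b_ptr)
{
	return _mm_add_ps(a, _mm_load_ps(b_ptr));
}
</code></pre>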
<p>Another important characteristic of x64 is that scalar floating point math uses the SSE registers. This means that unlike with PowerPC, converting a floating point value into a SIMD value is very cheap (it could be free if you hand write assembly but compilers will generally generate a shuffle instruction regardless of whether or not you need all components). On the other hand, converting a SIMD value into a floating point value is entirely free and no instruction is generated by modern compilers. This is an important factor to take into account and as we will see later in this series, it can be leveraged to great effect.</p>
<h1 id="up-next">Up next</h1>
<p>In the next post we will take a deep look at how 4x4 matrix multiplication is performed in DirectX Math and how we can speed it up in various ways.</p>
Animation Compression: Advanced Quantization2017-03-12T00:00:00+00:00http://nfrechette.github.io/2017/03/12/anim_compression_advanced_quantization<p>I first spoke about this technique at the <a href="http://www.gdconf.com/">Game Developers Conference (GDC)</a> in 2017 and the slides for that presentation can be found <a href="/public/simple_and_powerful_animation_compression_gdc2017.pdf">here</a>. The idea is to take everything that is good about <a href="/2016/11/15/anim_compression_quantization/">simple quantization</a> and super charge it. Since simple quantization is generally a foundation for the <a href="/2016/11/13/anim_compression_families/">other algorithms</a>, this new variant can be used just as well for the same purpose.</p>
<p>It is worth noting that the algorithm I presented used <a href="/2016/11/10/anim_compression_uniform_segmenting/">uniform segmenting</a> as well as <a href="/2016/11/09/anim_compression_range_reduction/">range reduction</a> both per clip and per block.</p>
<p>The key insight is that with simple quantization, using a hard coded bit rate is very naive and overly simplistic. We can do better, here’s how!</p>
<h1 id="variable-bit-rate">Variable Bit Rate</h1>
<p>In the real world, the nature of the data can change quite dramatically from clip to clip, track to track, or even within a track through time. It is thus important to have a solution that can properly adapt to ever changing environmental conditions. A variable bit rate solves this problem for us.</p>
<p>It is ideal for our hierarchical data. As <a href="/2016/10/27/anim_compression_data/">previously mentioned</a>, our bone track data is stored in the local space of its parent bone. This means that each bone incurs a small amount of error as a result of the lossy compression and this error will accumulate as we go down the hierarchy. The error accumulates because we need our final bone transforms in world space to perform skinning and thus the final error is visible on our visual mesh. This makes it obvious that bones higher up in the hierarchy, such as the root and pelvis, will contribute more error and thus require more bits to retain high accuracy, while bones further down, such as finger tips or leaf bones, do not require as many bits to maintain the same accuracy. A variable bit rate nicely solves this for us.</p>
<p>It is not unusual for some tracks within a clip to be exotic or to vastly differ from the others. Those tracks might require many more bits to retain sufficient accuracy or perhaps they might need far fewer. Again, a variable bit rate is ideal for this.</p>
<p>Some tracks can vary quite a bit through time as well. For example, during a cinematic it is quite common for all characters to be animated simultaneously. However not all characters might be visible on the screen at the same time. Characters that are not visible would typically remain animated but simply not move and thus their animation data remains constant during this time. A variable bit rate again allows us to exploit temporal coherence.</p>
<h1 id="how-it-works">How It Works</h1>
<p>As we saw with simple quantization, it is very common for the bit rate to be hardcoded. Instead, we want our bit rate to vary and adapt and it thus makes sense to calculate it in some way.</p>
<h2 id="per-clip">Per Clip</h2>
<p>One morning, I realized that it might not be all that hard to try and brute force all bit rates within a certain range and find the smallest value that met some defined accuracy threshold. This allowed fast clips to use more bits when they were needed and slower clips to use fewer while still maintaining high accuracy.</p>
<p>A bit rate was considered superior to another if it was lower (yielding a lower memory footprint) and the accuracy as measured with our <a href="/2016/11/01/anim_compression_accuracy/">error function</a> remained within an acceptable threshold.</p>
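<p>In pseudo-C++, the brute force search might be sketched as follows; <code class="language-plaintext highlighter-rouge">compress_clip</code> and <code class="language-plaintext highlighter-rouge">measure_error</code> are illustrative placeholders for the real compression and error function:</p>
<pre><code class="language-cpp">struct RawClip {};
struct CompressedClip {};

// Illustrative placeholders for the real compression and error measurement.
CompressedClip compress_clip(const RawClip& clip, int bit_rate);
float measure_error(const RawClip& raw, const CompressedClip& compressed);

// Return the lowest bit rate whose error remains within the threshold.
int find_best_bit_rate(const RawClip& clip, float error_threshold, int lowest_bit_rate, int highest_bit_rate)
{
	for (int bit_rate = lowest_bit_rate; bit_rate <= highest_bit_rate; ++bit_rate)
		if (measure_error(clip, compress_clip(clip, bit_rate)) <= error_threshold)
			return bit_rate;

	return highest_bit_rate;	// nothing met the threshold, keep maximum accuracy
}
</code></pre>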
<h2 id="per-track-type">Per Track Type</h2>
<p>Not long after, I got thinking and realized that each track type is very different. Their units are different and the data also behaves very differently. It thus made sense to attempt to brute force every bit rate for each of our three track types: rotation, translation, and scale. I thus ended up with three bit rates per clip instead of a single value. It later became obvious that if I can do this per clip, I can also do it per block trivially and thus ended up with three bit rates per block. The added overhead was very small compared to the huge memory gains.</p>
<h2 id="per-track">Per Track</h2>
<p>It soon dawned on me that ideally, we wanted to find the best bit rate for each track within each block. The holy grail of our variable rate solution. Unfortunately, the search space for this solution is massive and it is no longer practical to perform a brute force search. A common character clip might easily have 50 to 80 tracks that are animated and each track can have one of several bit rates.</p>
<p>To solve this problem, we needed to find a smart heuristic.</p>
<h1 id="the-algorithm">The Algorithm</h1>
<p>To trim the search space, I opted for a two phase approach: the first phase finds an approximate solution while the second phase refines it to a local minimum. This is an important distinction: the following algorithm does not find the globally optimal solution; instead, it settles on an acceptable local minimum. It could perhaps be further improved.</p>
<h2 id="the-search-space">The Search Space</h2>
<p>In order to keep things simple, I settled on 16 possible bit rates that we used internally: 0, 3, 4, …, 16, and 23. These are the number of bits per track key component (e.g. per quaternion component). We need a higher value than 16 in some uncommon cases such as world space cinematic clips where the accuracy needs of the root are very high. As I mention in my presentation, 23 bits might have been chosen prematurely. The idea was to use the same number of bits as the floating point mantissa minus the sign bit but as it turns out, in order to de-quantize, we must be able to accurately and uniquely represent our quantized integers as 32 bit floating point numbers. Sadly, they can only accurately and uniquely represent integers up to 6 significant digits. The end result is thus that we have some rounding happening. I do not know whether or not this is a real issue. Perhaps 19 bits might be a more sensible choice. More work is required here to find the optimal value and evaluate the impact of the rounding.</p>
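<p>For context, here is a hedged sketch of the quantization round trip for a single normalized track component; this mirrors simple quantization and is not the actual ACL code:</p>
<pre><code class="language-cpp">#include <cstdint>

// Quantize a value in [0.0, 1.0] onto 'num_bits' bits; num_bits is one of the
// supported bit rates (3 to 16, or 23).
uint32_t quantize(float normalized_value, uint32_t num_bits)
{
	const uint32_t max_value = (1u << num_bits) - 1;
	return static_cast<uint32_t>(normalized_value * float(max_value) + 0.5f);
}

float dequantize(uint32_t quantized_value, uint32_t num_bits)
{
	const uint32_t max_value = (1u << num_bits) - 1;
	// At high bit rates, this float conversion and division are where the
	// rounding discussed above creeps in.
	return float(quantized_value) / float(max_value);
}
</code></pre>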
<p>We do have one edge case: when a track has 0 bits per component. This means our track is constant within our evaluation range and when this happens, we no longer need the track range information within that particular block since our range extent is now 0. We instead store our repeating constant key value within the 48 bits we had available for our range. This allows us to use 16 bits per component for our constant key.</p>
<h2 id="initial-state">Initial State</h2>
<p>In order to start our algorithm, we first initialize all our tracks to use the highest bit rate possible. The goal of the algorithm being to lower these values as much as possible while remaining within our accuracy threshold.</p>
<p><img src="/public/advanced_quantization_initial.jpg" alt="Algorithm Initial State" /></p>
<h2 id="first-phase">First Phase</h2>
<p>The first phase will lower the bit rate of every track in lock step as much as possible until we cannot do so without exceeding our error threshold. This allows us to quickly trim the search space by finding the best bit rate globally for a given block. However, there is one exception, during the duration of the first phase, the root tracks remain locked to the highest bit rate and thus the highest accuracy. This is done because some clips, such as a world space cinematic, have an unusually high accuracy requirement on the root tracks and if we process them as we do the others, they will negatively bias the rest of the tracks.</p>
<p><img src="/public/advanced_quantization_phase1.gif" alt="Algorithm Phase 1" /></p>
<h2 id="second-phase">Second Phase</h2>
<p>The second phase iterates over all of our tracks and aims to individually lower their bit rate as much as possible up until we reach our error threshold. The algorithm terminates once all tracks have been processed.</p>
<p><img src="/public/advanced_quantization_phase2.gif" alt="Algorithm Phase 2" /></p>
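<p>A high level sketch of the two phases in illustrative pseudo-C++ (bit rates are treated as contiguous integers here for simplicity, while the real set is 0, 3..16, and 23; this is not the actual implementation):</p>
<pre><code class="language-cpp">#include <vector>

struct Track { int bit_rate; bool is_root; };

// Placeholder: compresses with the given bit rates and checks the error function.
bool error_within_threshold(const std::vector<Track>& tracks);

void optimize_bit_rates(std::vector<Track>& tracks, int lowest_bit_rate)
{
	// Phase 1: lower every non-root track in lock step while the error allows it.
	while (true)
	{
		std::vector<Track> candidate = tracks;
		bool changed = false;
		for (Track& track : candidate)
			if (!track.is_root && track.bit_rate > lowest_bit_rate)
			{
				track.bit_rate--;
				changed = true;
			}

		if (!changed || !error_within_threshold(candidate))
			break;

		tracks = candidate;
	}

	// Phase 2: lower each track individually (e.g. iterating leaf bones first)
	// until any further reduction would exceed the error threshold.
	for (Track& track : tracks)
		while (track.bit_rate > lowest_bit_rate)
		{
			track.bit_rate--;
			if (!error_within_threshold(tracks))
			{
				track.bit_rate++;	// went too far, revert and stop
				break;
			}
		}
}
</code></pre>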
<p>The order in which the tracks are processed matters. I tried two different orderings: starting at the root bone and going down the hierarchy towards the children and the opposite ordering, starting at the leaf bones and going back up towards the root. As it turns out, the distribution of which bit rate was selected by our algorithm changed quite a bit. Note that in the image, the right-most bit rates are very uncommon but they do happen (they are too rare to show up).</p>
<p><img src="/public/advanced_quantization_bitrate_distribution.jpg" alt="Bit Rate Distribution" /></p>
<p>Starting at the leaf bones, a higher incidence of lower bit rates was selected. This intuitively makes sense since we have more children than we have parents in our hierarchy. On the other hand, we also have a higher incidence of higher bit rates. Again this makes sense since by lowering the children first, we force the parents to retain more bits in order to preserve the accuracy.</p>
<p>Overall, in the data that I had on hand, starting at the leaf bones and going towards the root resulted in a memory footprint that was roughly <strong>2%</strong> smaller without impacting compression time, accuracy, or decompression time.</p>
<h1 id="performance">Performance</h1>
<p>The compression speed remains acceptable and it can easily be optimized by evaluating all blocks to compress in parallel.</p>
<p>On the decompression side, everything we do is very simple and fast. Since our data is uniform, it is linearly contiguous in memory and a full key frame typically fits within a handful of cache lines (2-5). This is extremely dense and contributes to our very fast decompression. Sadly, due to our variable bit rates, we can easily end up performing unaligned reads but their performance remains entirely acceptable on x64, even with SSE. In fact, with minimal work, everything can be done in SSE and we could even use the streaming read (non-temporal read) intrinsics if we wanted to.</p>
<p><a href="/2016/11/17/anim_compression_sub_sampling/">Up next: Sub-sampling</a></p>
<p><a href="/2016/10/21/anim_compression_toc/"><strong>Back to table of contents</strong></a></p>
Animation Compression: GDC 2017 Presentation2017-03-08T00:00:00+00:00http://nfrechette.github.io/2017/03/08/anim_compression_gdc2017<p>In early 2016 I had the opportunity to write a novel animation compression algorithm for <a href="https://eidosmontreal.com/en">Eidos Montreal</a> as part of the game engine previously used by <a href="https://en.wikipedia.org/wiki/Rise_of_the_Tomb_Raider">Rise of the Tomb Raider (2015)</a>. Later, in February 2017, I had the pleasure of presenting it at the <a href="http://www.gdconf.com/">Game Developers Conference (GDC)</a>.</p>
<p>The slides for the presentation can be found <a href="/public/simple_and_powerful_animation_compression_gdc2017.pdf">here</a> and a blog post detailing the technique can be found <a href="/2017/03/12/anim_compression_advanced_quantization/">here</a>.</p>
<p><a href="/2016/10/21/anim_compression_toc/"><strong>Back to table of contents</strong></a></p>
Animation Compression: Unity 52017-01-30T00:00:00+00:00http://nfrechette.github.io/2017/01/30/anim_compression_unity5<p><a href="https://www.youtube.com/watch?v=AJ6Mkx1KEns" title="Unity 5 Official Trailer"><img src="https://img.youtube.com/vi/AJ6Mkx1KEns/0.jpg" alt="Unity 5 Official Trailer" /></a></p>
<p><a href="https://en.wikipedia.org/wiki/Unity_(game_engine)">Unity 5</a> is a very popular video game engine on mobile devices and other platforms. Being a state of the art game engine, it supports everything you might need when it comes to character animation including compression.</p>
<p>The relevant <a href="https://docs.unity3d.com/Manual/FBXImporter-Animations.html">FBX Importer</a> and <a href="https://docs.unity3d.com/Manual/class-AnimationClip.html">Animation Clip</a> documentation is very sparse. It’s worth mentioning that <a href="https://en.wikipedia.org/wiki/Unity_(game_engine)">Unity 5</a> is closed source software and as such, there is some amount of uncertainty and speculation. However, I was able to get in touch with an old colleague working at <a href="https://en.wikipedia.org/wiki/Unity_(game_engine)">Unity</a> to clarify what happens under the hood.</p>
<p>Before we dig into what each compression setting does we must first briefly cover the data representations that <a href="https://en.wikipedia.org/wiki/Unity_(game_engine)">Unity 5</a> uses internally.</p>
<h1 id="track-data-encoding">Track Data Encoding</h1>
<p>The engine uses one of three encodings to represent an animation track regardless of the track data type (quaternion, vector, float, etc.):</p>
<ul>
<li><a href="http://nfrechette.github.io/2017/01/30/anim_compression_unity5/#legacy_curve">Legacy Curve</a></li>
<li><a href="http://nfrechette.github.io/2017/01/30/anim_compression_unity5/#streaming_curve">Streaming Curve</a></li>
<li><a href="http://nfrechette.github.io/2017/01/30/anim_compression_unity5/#dense_curve">Dense Curve</a></li>
</ul>
<p>Rotation tracks are always encoded as four curves to represent a full quaternion (one curve per component). An obvious win here would be to instead encode rotations as quaternion logarithms, or to drop the quaternion <strong>W</strong> component or the largest component. This would immediately reduce the memory footprint of rotation tracks by <strong>25%</strong> at the expense of a few instructions to reconstruct the original quaternion.</p>
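<p>For example, here is a hedged sketch of dropping the <strong>W</strong> component and reconstructing it, relying on the quaternion being normalized and on flipping the sign so that <strong>W</strong> is non-negative (<code class="language-plaintext highlighter-rouge">q</code> and <code class="language-plaintext highlighter-rouge">-q</code> represent the same rotation):</p>
<pre><code class="language-cpp">#include <algorithm>
#include <cmath>

struct quat { float x, y, z, w; };

// Store only x, y, z with the sign flipped such that w >= 0.
void drop_w(const quat& q, float out_xyz[3])
{
	const float sign = q.w >= 0.0f ? 1.0f : -1.0f;
	out_xyz[0] = q.x * sign;
	out_xyz[1] = q.y * sign;
	out_xyz[2] = q.z * sign;
}

// Reconstruct w from the unit length constraint: w = sqrt(1 - x^2 - y^2 - z^2)
quat reconstruct_w(const float xyz[3])
{
	const float sq = xyz[0] * xyz[0] + xyz[1] * xyz[1] + xyz[2] * xyz[2];
	const float w = std::sqrt(std::max(0.0f, 1.0f - sq));	// clamp guards rounding error
	return quat{ xyz[0], xyz[1], xyz[2], w };
}
</code></pre>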
<h2 id="legacy-curve">Legacy Curve</h2>
<p>Legacy curves are a strange beast. The source data is sampled uniformly at a fixed interval such as <strong>30 FPS</strong> and is kept unsorted in full precision. During decompression, a <a href="https://en.wikipedia.org/wiki/Cubic_Hermite_spline">Hermite curve</a> is constructed on the fly from these discrete samples and interpolated. It is unclear to me how this format emerged but it has since been superseded by the other two and it is not frequently used.</p>
<p>It must have been quite slow to decompress and should probably be avoided.</p>
<h2 id="streaming-curve">Streaming Curve</h2>
<p>Streaming curves are proper curves that use <a href="https://en.wikipedia.org/wiki/Cubic_Hermite_spline">Hermite coefficients</a>. A track is split into intervals and each interval is encoded as a distinct spline. This allows discontinuities between intervals. For example, a camera cut or teleporting the root in a cinematic. Each interval has a small header of <strong>8</strong> bytes and each control point is stored in full floating point precision plus an index. This is likely overkill. Full floating point precision is typically far too much for encoding rotations and using <a href="http://nfrechette.github.io/2016/11/15/anim_compression_quantization/">simple quantization</a> to store them on 16 bits per component or less could provide significant memory savings.</p>
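<p>For reference, evaluating a cubic Hermite spline within an interval uses the standard basis functions shown below; this is the textbook formulation, not Unity’s internal code, and the tangents are assumed pre-scaled by the interval duration:</p>
<pre><code class="language-cpp">// Cubic Hermite interpolation between values p0 and p1 with tangents m0 and m1,
// where t is the normalized time in [0, 1] within the interval.
float hermite(float p0, float m0, float p1, float m1, float t)
{
	const float t2 = t * t;
	const float t3 = t2 * t;
	const float h00 =  2.0f * t3 - 3.0f * t2 + 1.0f;
	const float h10 =         t3 - 2.0f * t2 + t;
	const float h01 = -2.0f * t3 + 3.0f * t2;
	const float h11 =         t3 -        t2;
	return h00 * p0 + h10 * m0 + h01 * p1 + h11 * m1;
}
</code></pre>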
<p>The resulting control points are sorted by time followed by track to make them as cache efficient as possible, which is a very good thing. At decompression, a cursor or cache is used to avoid repeatedly searching for our control points when playback is continuous and predictable. For these two reasons streaming curves are very fast to decompress in the average use case.</p>
<h2 id="dense-curve">Dense Curve</h2>
<p>What <a href="https://en.wikipedia.org/wiki/Unity_(game_engine)">Unity 5</a> calls a dense curve I would call a raw format. The original source data is sampled at a fixed interval such as <strong>30 FPS</strong> and nothing more is done to it as far as I am aware. The data is sorted to make it cache efficient by time and track. No <a href="http://nfrechette.github.io/2016/12/07/anim_compression_key_reduction/">linear key reduction</a> is performed or attempted. The sampled values are not quantized and are simply stored with full precision.</p>
<p>Dense curves will typically have a smaller memory footprint than <a href="http://nfrechette.github.io/2017/01/30/anim_compression_unity5/#streaming_curve">streaming curves</a> only for very short tracks or for tracks where the data is very noisy, such as motion capture data. For this reason, they are unlikely to be used in practice.</p>
<p>Overall their implementation is simple but perhaps a bit naive. Using <a href="http://nfrechette.github.io/2016/11/15/anim_compression_quantization/">simple quantization</a> would give significant memory gains here without degrading decompression performance and might even speed it up! On the upside, decompression speed is very likely to be faster than with streaming curves.</p>
<h1 id="compression-settings">Compression Settings</h1>
<p>At the time of writing, <a href="https://en.wikipedia.org/wiki/Unity_(game_engine)">Unity 5</a> supports three compression settings:</p>
<ul>
<li><a href="http://nfrechette.github.io/2017/01/30/anim_compression_unity5/#no_compression">No Compression</a></li>
<li><a href="http://nfrechette.github.io/2017/01/30/anim_compression_unity5/#keyframe_reduction">Keyframe Reduction</a></li>
<li><a href="http://nfrechette.github.io/2017/01/30/anim_compression_unity5/#optimal">Optimal</a></li>
</ul>
<h2 id="no-compression">No Compression</h2>
<p>The most detailed quote from the documentation about what this setting does is:</p>
<blockquote>
<p>Disables animation compression. This means that Unity doesn’t reduce keyframe count on import, which leads to the highest precision animations, but slower performance and bigger file and runtime memory size. It is generally not advisable to use this option - if you need higher precision animation, you should enable keyframe reduction and lower allowed Animation Compression Error values instead.</p>
</blockquote>
<p>From what I could gather, the originally imported clip is sampled uniformly (e.g. <strong>30 FPS</strong>) and each track is converted into a <a href="http://nfrechette.github.io/2017/01/30/anim_compression_unity5/#streaming_curve">streaming curve</a>. This ensures everything remains smooth and accurate but the overhead can be very significant since all samples are retained. To make things worse, nothing is done for constant tracks with this setting.</p>
<h2 id="keyframe-reduction">Keyframe Reduction</h2>
<p>The most detailed quote from the documentation is:</p>
<blockquote>
<p>Removes redundant keyframes.</p>
</blockquote>
<p>When this setting is used, constant tracks will be collapsed to a single value and redundant control points in animated tracks will be removed within the specified error threshold. This setting uses <a href="http://nfrechette.github.io/2017/01/30/anim_compression_unity5/#streaming_curve">streaming curves</a> to represent track data.</p>
<p>Only three thresholds are exposed for this compression setting, one for each track type: rotation, translation, and scale. This is very likely to lead to the problems discussed in my post on <a href="http://nfrechette.github.io/2016/11/01/anim_compression_accuracy/">measuring accuracy</a>. And indeed, a quick search yields <a href="https://forum.unity3d.com/threads/animation-cost-key-reducer.42841/">this gem</a>. Even though it dates from <a href="https://en.wikipedia.org/wiki/Unity_(game_engine)">Unity 3 (2010)</a>, I doubt the animation compression has changed much. Unfortunately, the problems it raises are both very telling and common with local space error metric functions. Here are some relevant excerpts:</p>
<blockquote>
<p>Now, you may be asking yourself, why would this guy turn off the key reducer in the first place? The answer is simple. The key reducer sucks. Here’s why.</p>
<p>Every animation I have completed for this project uses planted keys to anchor the feet (and sometimes hands) to the floor. This allows me to grab any part of the body and animate it, knowing that the feet will not move. When I export the FBX, the keys stay intact. I can bring the animation back into Max or into Maya using the keyframe reducer for either software, and the feet remain anchored. When I bring the same FBX into Unity the feet slide around. Often quite noticably. The only way to stop the feet from sliding is to turn off the key reducer.</p>
</blockquote>
<p>This is a very common problem with local space error functions. Tweaking them is hard! The end result is that very often a weaker compression or no compression at all is used when issues are found on a clip by clip basis. I have seen this exact behavior from animators working on <a href="https://en.wikipedia.org/wiki/Unreal_Engine#Unreal_Engine_3">Unreal 3</a> back in the day and very recently in a proprietary AAA game engine. Even though from the user’s perspective the issue is the animation compression algorithm, in reality, the issue is almost entirely due to the error function.</p>
<blockquote>
<p>What I would really like to see is some options within Unity’s animation importer. A couple ideas:</p>
<p>1) Max’s FBX keyframe reduction has several precision threshold settings that dictate how accurate the keyframe reduction should be. In Unity, it’s all or nothing. I would love the ability to adjust the threshold in which a keyframe gets added. I could turn up the sensitivity on some animations to reduce sliding and possibly turn it down on areas that need fewer keys than are given by the current value.</p>
<p>2) I’m not sure if this is possible, but it would be great to set keyframe reductions on some bones and not others. That way I can keep the arm chain in the proper location without having to bloat the keyframes of every bone in the whole skeleton.</p>
</blockquote>
<p>Exposing a single error threshold per track type is very common and is a recurring source of frustration for animators. They often know which bones need higher accuracy but are unable to properly tweak per bone thresholds. Sadly, when this feature is present, the settings often end up copy & pasted with overly conservative values, yielding a larger memory footprint than needed. Nobody has time to tweak a large number of thresholds repeatedly.</p>
<blockquote>
<p>Unity 3 actually corrects the problem by giving us direct control over the keyframe reduction vs allowable error. If you find your animation is sliding to much, dial down the Position Error and/or Rotation Error settings in the animation import settings.</p>
<p>Unfortunately, I didn’t find any satisfying setup :/</p>
<p>I got some characters that move super fast in their anim, and I need the player to see the move correctly for gameplay / balance reasons.</p>
<p>So it can works for some anims, but not for others (making them feel like they are teleporting).</p>
<p>And under a certain reduction threshold, the memory size benefit is too small to resolve loading times problem :/</p>
<p>In fact, the only reduction setting I found that didn’t caused teleportations was :</p>
<p>Position : 0.1<br />
Rotation : 0.1<br />
Scale : 0 (as there is never any animated scale)</p>
<p>But this is still causing huge file sizes :(</p>
</blockquote>
<p>A single error threshold per track type also means that the threshold has to be as low as your most sensitive bone requires. This will, in turn, retain higher accuracy than might otherwise be needed, again yielding a higher memory footprint that is often unacceptable.</p>
<h2 id="optimal">Optimal</h2>
<p>The most detailed quote from the documentation is:</p>
<blockquote>
<p>Let unity decide how to compress. Either by keyframe reduction or by using dense format. Unity will pick the most optimal of the two.</p>
</blockquote>
<p>This is very vague and judging from <a href="http://answers.unity3d.com/questions/822418/what-are-the-differences-between-animation-compres.html">the results</a> of a <a href="http://answers.unity3d.com/questions/1180871/what-did-unitys-animation-compression-options-opti.html">quick search</a>, a lot of people are curious.</p>
<p>Thankfully, I was able to get some answers!</p>
<p>If a track is very short or very noisy (which could happen with motion capture clips or baked simulations), the key reduction algorithm might not give appreciable gains and it is possible that a <a href="http://nfrechette.github.io/2017/01/30/anim_compression_unity5/#dense_curve">dense curve</a> might end up with a smaller memory footprint than a <a href="http://nfrechette.github.io/2017/01/30/anim_compression_unity5/#streaming_curve">streaming curve</a>. When this happens for a particular track, Unity uses the curve with the smaller memory footprint. As such, within a single clip, we can have a mix of <a href="http://nfrechette.github.io/2017/01/30/anim_compression_unity5/#dense_curve">dense</a> and <a href="http://nfrechette.github.io/2017/01/30/anim_compression_unity5/#streaming_curve">streaming</a> curves.</p>
<h1 id="conclusion">Conclusion</h1>
<p>The <a href="https://en.wikipedia.org/wiki/Unity_(game_engine)">Unity 5</a> documentation is sparse and at times unclear. It leads to rampant speculation as to what might be going on under the hood and a lot of confusing results.</p>
<p>Its error function is poor, exposing a single value per track type. This leads to classic issues such as turning compression off to retain accuracy and using an overly conservative threshold to retain accuracy at the expense of the memory footprint. It perpetuates the stigma that animation compression can be painful to work with and can easily butcher an animator’s work without manual tweaking. Fixing the error function could be a reasonably simple task.</p>
<p>The optimal compression setting seems to be a very reasonable default value but it is not clear why the other two are exposed at all. Users are very likely to use one of the other settings instead of tweaking the error function thresholds which is probably a bad idea.</p>
<p>All curve types encode the data in full precision with <strong>32-bit</strong> floating point numbers. This is likely overkill in a very large number of scenarios and implementing some form of <a href="http://nfrechette.github.io/2016/11/15/anim_compression_quantization/">simple quantization</a> could provide huge memory gains with little work. Due to the reduced memory footprint, decompression timings might even improve.</p>
<p>Furthermore, rotation tracks could be encoded in a better format than a full quaternion further reducing the memory footprint for minimal work.</p>
<p>From what I could find, nobody seemed to complain about animation decompression performance at runtime. This is most likely a direct result of the cache friendly data format and the usage of a cursor for <a href="http://nfrechette.github.io/2017/01/30/anim_compression_unity5/#streaming_curve">streaming curves</a>.</p>
<p><a href="/2017/03/08/anim_compression_gdc2017/">Up next: GDC 2017 Presentation</a></p>
<p><a href="/2016/10/21/anim_compression_toc/"><strong>Back to table of contents</strong></a></p>
Animation Compression: Unreal 42017-01-11T00:00:00+00:00http://nfrechette.github.io/2017/01/11/anim_compression_unreal4<p><a href="https://www.youtube.com/watch?v=WC6Xx_jLXmg" title="Unreal 4 2017"><img src="https://img.youtube.com/vi/WC6Xx_jLXmg/0.jpg" alt="Unreal 4 2017" /></a></p>
<p><a href="https://en.wikipedia.org/wiki/Unreal_Engine#Unreal_Engine_4">Unreal 4</a> offers a wide range of <a href="https://docs.unrealengine.com/latest/INT/Engine/Animation/Sequences/">documented</a> character animation compression settings.</p>
<p>It offers the following compression algorithms:</p>
<ul>
<li><a href="http://nfrechette.github.io/2017/01/11/anim_compression_unreal4/#least_destructive">Least Destructive</a></li>
<li><a href="http://nfrechette.github.io/2017/01/11/anim_compression_unreal4/#second_key">Remove Every Second Key</a></li>
<li><a href="http://nfrechette.github.io/2017/01/11/anim_compression_unreal4/#trivial_keys">Remove Trivial Keys</a></li>
<li><a href="http://nfrechette.github.io/2017/01/11/anim_compression_unreal4/#bitwise_only">Bitwise Compress Only</a></li>
<li><a href="http://nfrechette.github.io/2017/01/11/anim_compression_unreal4/#remove_linear">Remove Linear Keys</a></li>
<li><a href="http://nfrechette.github.io/2017/01/11/anim_compression_unreal4/#compress_independently">Compress each track independently</a></li>
<li><a href="http://nfrechette.github.io/2017/01/11/anim_compression_unreal4/#automatic_setting">Automatic</a></li>
</ul>
<p><em>Note: The following review was done with Unreal 4.15</em></p>
<h2 id="least-destructive"><a name="least_destructive"></a>Least Destructive</h2>
<blockquote>
<p>Reverts any animation compression, restoring the animation to the raw data.</p>
</blockquote>
<p>This setting ensures we use raw data. It sets all tracks to use no compression, which means they retain full floating point precision: <strong>None</strong> is used for translation and <strong>Float 96</strong> for rotation, both explained <a href="http://nfrechette.github.io/2017/01/11/anim_compression_unreal4/#bitwise_only">below</a>. Interestingly, it does not appear to revert scale to a sensible raw compression default, which is likely a bug.</p>
<h2 id="remove-every-second-key"><a name="second_key"></a>Remove Every Second Key</h2>
<blockquote>
<p>Keyframe reduction algorithm that simply removes every second key.</p>
</blockquote>
<p>This setting does exactly what it claims; it removes every other key. It is a variation of <a href="http://nfrechette.github.io/2016/11/17/anim_compression_sub_sampling/">sub-sampling</a>. Each remaining key is further <a href="http://nfrechette.github.io/2017/01/11/anim_compression_unreal4/#bitwise_only">bitwise compressed</a>.</p>
<p>Note that this compression setting also removes the <a href="http://nfrechette.github.io/2017/01/11/anim_compression_unreal4/#trivial_keys">trivial keys</a>.</p>
<h2 id="remove-trivial-keys"><a name="trivial_keys"></a>Remove Trivial Keys</h2>
<blockquote>
<p>Removes trivial frames of tracks when position or orientation is constant over the entire animation from the raw animation data.</p>
</blockquote>
<p>This takes advantage of the fact that very often <a href="http://nfrechette.github.io/2016/11/03/anim_compression_constant_tracks/">tracks are constant</a> and when it happens a single key is kept and the rest is discarded.</p>
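<p>As a minimal sketch of the idea on a single float track (Unreal operates on full transform tracks, and the tolerance parameter here is a hypothetical stand-in for its internal thresholds):</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cmath>
#include <vector>

// Returns true when every key is within 'tolerance' of the first key.
bool is_track_constant(const std::vector<float>& keys, float tolerance)
{
    for (float key : keys)
        if (std::fabs(key - keys[0]) > tolerance)
            return false;
    return true;
}

// A constant track only needs its first key; the rest is discarded.
void remove_trivial_keys(std::vector<float>& keys, float tolerance)
{
    if (!keys.empty() && is_track_constant(keys, tolerance))
        keys.resize(1);
}
</code></pre></div></div>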
<h2 id="bitwise-compress-only"><a name="bitwise_only"></a>Bitwise Compress Only</h2>
<blockquote>
<p>Bitwise animation compression only; performs no key reduction.</p>
</blockquote>
<p>This compression setting aims to retain every single key and encodes them with various variations. It is a custom flavor of <a href="http://nfrechette.github.io/2016/11/15/anim_compression_quantization/">simple quantization</a> and will use the same format for every track in the clip. You can select one format per track type: rotation, translation, and scale.</p>
<p>These are the possible formats:</p>
<ul>
<li><strong>None:</strong> Full precision.</li>
<li><strong>Float 96:</strong> Full precision is used for translation and scale. For rotations, a quaternion component is dropped and the rest are stored with full precision.</li>
<li><strong>Fixed 48:</strong> For rotations, a quaternion component is dropped and the rest is stored on <strong>16</strong> bits per component.</li>
<li><strong>Interval Fixed 32:</strong> Stores the value on <strong>11-11-10</strong> bits per component with range reduction. For rotations, a quaternion component is dropped.</li>
<li><strong>Fixed 32:</strong> For rotations, a quaternion component is dropped and the rest is stored on <strong>11-11-10</strong> bits per component.</li>
<li><strong>Float 32:</strong> This appears to be a semi-deprecated format where a quaternion component is dropped and the rest is encoded onto a custom float format with <strong>6</strong> or <strong>7</strong> bits for the mantissa and <strong>3</strong> bits for the exponent.</li>
<li><strong>Identity:</strong> The identity quaternion is always returned for the rotation track.</li>
</ul>
<p>Only three formats are supported for translation and scale: <strong>None, Interval Fixed 32</strong>, and <strong>Float 96</strong>. All formats are supported for rotations.</p>
<p>If a track has a single key, it is always stored with full precision.</p>
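<p>As a rough sketch of the idea behind <strong>Interval Fixed 32</strong> (the exact bit layout and rounding Unreal uses are assumptions on my part), each component is normalized within its track range and packed on <strong>11-11-10</strong> bits:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <algorithm>
#include <cstdint>

// Quantizes a value within [min, max] onto 'num_bits' bits with range reduction.
static uint32_t quantize(float value, float min, float max, uint32_t num_bits)
{
    const float max_int = float((1u << num_bits) - 1);
    float normalized = max > min ? (value - min) / (max - min) : 0.0f;
    normalized = std::clamp(normalized, 0.0f, 1.0f);
    return uint32_t(normalized * max_int + 0.5f);   // arithmetic rounding
}

// Packs three components onto 11-11-10 bits within a single 32 bit value.
uint32_t pack_interval_fixed_32(float x, float y, float z,
                                const float min[3], const float max[3])
{
    return (quantize(x, min[0], max[0], 11) << 21)
         | (quantize(y, min[1], max[1], 11) << 10)
         | quantize(z, min[2], max[2], 10);
}
</code></pre></div></div>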
<p>When a component is dropped from a rotation quaternion, <strong>W</strong> is always selected. This is safe and simple since it can be reconstructed with a square root as long as our quaternion is normalized. The sign is forced to be positive during compression by negating the quaternion if <strong>W</strong> was originally negative. A quaternion and its negated opposite <a href="https://en.wikipedia.org/wiki/Quaternions_and_spatial_rotation#The_hypersphere_of_rotations">represent the same rotation on the hypersphere</a>.</p>
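<p>Reconstructing the dropped component then looks like this; the clamp guards against a slightly negative value introduced by quantization error:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cmath>

// Reconstructs W from a normalized quaternion where W was dropped and
// guaranteed positive during compression.
float reconstruct_quat_w(float x, float y, float z)
{
    const float w_squared = 1.0f - (x * x + y * y + z * z);
    return std::sqrt(std::fmax(w_squared, 0.0f));   // clamp quantization error
}
</code></pre></div></div>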
<p>Sadly, the format selection is done manually and no heuristic is used to pick one automatically with this setting.</p>
<p>Note that this compression setting also removes the <a href="http://nfrechette.github.io/2017/01/11/anim_compression_unreal4/#trivial_keys">trivial keys</a>.</p>
<h2 id="remove-linear-keys"><a name="remove_linear"></a>Remove Linear Keys</h2>
<blockquote>
<p>Keyframe reduction algorithm that simply removes keys which are linear interpolations of surrounding keys.</p>
</blockquote>
<p>This is a classic variation on <a href="http://nfrechette.github.io/2016/12/07/anim_compression_key_reduction/">linear key reduction</a> with a few interesting twists.</p>
<p>They use a <a href="http://nfrechette.github.io/2016/11/01/anim_compression_accuracy/">local space error metric coupled with a partial virtual vertex error metric</a>. The local space metric is used classically to reject key removal beyond some error threshold. On the other hand, they check the impacted end effector (e.g. leaf) bones that are children of a given track. To do so, a virtual vertex is used, but no attempt is made to ensure it isn’t co-linear with the rotation axis. Bones in between the two bones (the one being modified and the leaf) are not checked at all; should they end up with unacceptable error, it will be invisible and not accounted for by the metric.</p>
<p>There is also a bias applied on tracks to attempt to remove similar keys on children and their parent bone tracks. I’m not entirely certain why they would opt to do this but I doubt it can harm the results.</p>
<p>On the decompression side, no cursor or acceleration structure is used, meaning that each time we sample the clip we need to search for which keys to interpolate between, and we do so per track.</p>
<p>Note that this compression setting also removes the <a href="http://nfrechette.github.io/2017/01/11/anim_compression_unreal4/#trivial_keys">trivial keys</a>.</p>
<h2 id="compress-each-track-independently"><a name="compress_independently"></a>Compress each track independently</h2>
<blockquote>
<p>Keyframe reduction algorithm that removes keys which are linear interpolations of surrounding keys, as well as choosing the best bitwise compression for each track independently.</p>
</blockquote>
<p>This compression setting combines the <a href="http://nfrechette.github.io/2016/12/07/anim_compression_key_reduction/">linear key reduction</a> algorithm along with the <a href="http://nfrechette.github.io/2016/11/15/anim_compression_quantization/">simple quantization</a> bitwise compression algorithm.</p>
<p>It uses various custom <a href="http://nfrechette.github.io/2016/11/01/anim_compression_accuracy/">error metrics</a> which appear local in nature with some adaptive bias. Each encoding format will be tested for each track and the best result will be used based on the desired error threshold.</p>
<h2 id="automatic"><a name="automatic_setting"></a>Automatic</h2>
<blockquote>
<p>Animation compression algorithm that is just a shell for trying the range of other compression schemes and picking the smallest result within a configurable error threshold.</p>
</blockquote>
<p>A large mix of the previously mentioned algorithms are tested internally. In particular, various mixes of <a href="http://nfrechette.github.io/2016/11/17/anim_compression_sub_sampling/">sub-sampling</a> are tested along with <a href="http://nfrechette.github.io/2016/12/07/anim_compression_key_reduction/">linear key reduction</a> and <a href="http://nfrechette.github.io/2016/11/15/anim_compression_quantization/">simple quantization</a>. The best result is selected given a specific error threshold. This is likely to be somewhat slow due to the large number of variations tested (<strong>35</strong> at the time of writing!).</p>
<h1 id="conclusion">Conclusion</h1>
<p>In memory, every compression setting uses a naive layout with the data sorted by track and then by key. This means that during decompression, each track will incur a cache miss to read its compressed value.</p>
<p>Performance-wise, decompression times are likely to be high in <a href="https://en.wikipedia.org/wiki/Unreal_Engine#Unreal_Engine_4">Unreal 4</a>, in large part due to the memory layout not being cache friendly and the lack of an acceleration structure such as a cursor when linear key reduction is used (and it is likely used often through the <em>Automatic</em> setting). This is an obvious area where it could improve.</p>
<p>Memory-wise, the linear key reduction algorithm should be fairly conservative, but it could be improved with minimal effort by ditching the local space <a href="http://nfrechette.github.io/2016/11/01/anim_compression_accuracy/">error metric</a>. Coupled with the bitwise compression, it should yield a reasonable memory footprint, although it could be further improved by using more <a href="/2017/03/12/anim_compression_advanced_quantization/">advanced quantization</a> techniques.</p>
<p><a href="/2017/01/30/anim_compression_unity5/">Up next: Unity 5</a></p>
<p><a href="/2016/10/21/anim_compression_toc/"><strong>Back to table of contents</strong></a></p>
Animation Compression: Case Studies2016-12-23T00:00:00+00:00http://nfrechette.github.io/2016/12/23/anim_compression_case_studies<p>In order to keep the material concrete, we will take a deeper look at some popular video game engines. We will highlight the techniques that they use for character animation compression and where they shine or fall short.</p>
<ul>
<li><a href="/2017/01/11/anim_compression_unreal4/">Unreal 4</a></li>
<li><a href="/2017/01/30/anim_compression_unity5/">Unity 5</a></li>
</ul>
<p><a href="/2017/01/11/anim_compression_unreal4/">Up next: Unreal 4</a></p>
<p><a href="/2016/10/21/anim_compression_toc/"><strong>Back to table of contents</strong></a></p>
Animation Compression: Error Compensation2016-12-22T00:00:00+00:00http://nfrechette.github.io/2016/12/22/anim_compression_error_compensation<p>Unless you use the curves directly authored by the animator for your animation clip, the animation compression you use will be lossy in nature and some amount of fidelity will be lost (whether visible to the naked eye or not). To combat the loss of accuracy, three error compensation techniques emerged over the years to help push the memory footprint further down while keeping the visual results acceptable: in-place correction, additive correction, and <a href="https://en.wikipedia.org/wiki/Inverse_kinematics">inverse kinematic</a> correction.</p>
<h1 id="in-place-correction">In-place Correction</h1>
<p>This technique has already been detailed in <a href="http://www.amazon.com/Game-Programming-Gems-Series/dp/1584505273">Game Programming Gems 7</a>. As such, I won’t go into implementation details unless there is significant interest.</p>
<p>As we have <a href="/2016/10/27/anim_compression_data/">seen previously</a>, our animated bone data is stored in local space, which means that any error on a parent bone will propagate to its children. Fundamentally, this technique compensates for our compression error by applying a small correction to each track to stop the error from propagating down our hierarchy. For example, if we have two parented bones, a small error in the parent bone will offset the child in object space. To compensate, we can apply a small offset to our child in local space such that the end result in object space is as close to the original animation as possible.</p>
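<p>To make the idea concrete, here is a rotation-only sketch of my own (not the Gems 7 implementation), assuming unit quaternions and the convention that the child object space rotation is the parent object space rotation times the child local rotation; translation would be handled with the same logic:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct Quat { float x, y, z, w; };

// For unit quaternions, the conjugate is the inverse.
Quat conjugate(const Quat& q) { return { -q.x, -q.y, -q.z, q.w }; }

Quat mul(const Quat& a, const Quat& b)
{
    return {
        a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y,
        a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x,
        a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w,
        a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z,
    };
}

// Picks the child local rotation that, once combined with the lossy
// parent object space rotation, reproduces the original child object
// space rotation.
Quat correct_child_local_rotation(const Quat& lossy_parent_object_rotation,
                                  const Quat& original_child_object_rotation)
{
    return mul(conjugate(lossy_parent_object_rotation), original_child_object_rotation);
}
</code></pre></div></div>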
<h2 id="up-sides">Up Sides</h2>
<p>The single most important up side of this technique is that it adds little to no overhead to the runtime decompression. The bulk of the overhead remains entirely on the offline process of compression. We only add overhead on the decompression if we elect to introduce tracks to compensate for the error.</p>
<h2 id="down-sides">Down Sides</h2>
<p>There are four issues with this technique.</p>
<h3 id="missing-tracks">Missing Tracks</h3>
<p>We can only apply a correction on a particular child bone if it is animated. If it is not animated, adding our correction will end up adding more data to compress and we might end up increasing the memory footprint. If the translation track is not animated, our correction will be partial: with rotation alone, we cannot exactly match the object space position of the original clip.</p>
<h3 id="noise">Noise</h3>
<p>Adding a correction for every key frame will tend to yield noisy tracks. Each key frame ends up with a micro-correction which, in turn, needs somewhat higher accuracy to remain reliable. For this reason, tracks with corrections in them will tend to compress a bit more poorly.</p>
<h3 id="compression-overhead">Compression Overhead</h3>
<p>To properly calculate the correction to apply for every bone track, we must calculate the object space transforms of our bones. This adds extra overhead to our compression time. It may or may not end up being a huge deal; your mileage may vary.</p>
<h3 id="track-ranges">Track Ranges</h3>
<p>Because we add corrections to our animated tracks, it is probable that our track ranges will change as we compress and correct our data. This needs to be properly taken into account, adding further complexity if range reduction is used.</p>
<h1 id="additive-correction">Additive Correction</h1>
<p>Additive correction is very similar to the in-place correction mentioned above. Rather than modifying our track data in-place by incorporating the correction, we store the correction separately as extra additive tracks and combine it at decompression time.</p>
<p>This variation offers a number of interesting trade-offs which are worth considering:</p>
<ul>
<li>Our compressed tracks do not change and will not become noisy nor will their range change</li>
<li>Missing tracks are not an issue since we always add separate additive tracks</li>
<li>Adding the correction at runtime is very fast and simple</li>
<li>Additive tracks are compressed separately and can benefit from a different accuracy threshold</li>
</ul>
<p>However, by its nature, the memory overhead will most likely end up higher than with the in-place variant.</p>
<h1 id="inverse-kinematic-correction">Inverse Kinematic Correction</h1>
<p>The last form of error compensation leverages <a href="https://en.wikipedia.org/wiki/Inverse_kinematics">inverse kinematics</a>. The idea is to store extra object space translation tracks for certain high accuracy bones such as feet and hands. Bones that come into contact with things such as the environment tend to make compression inaccuracy very obvious. Using these high accuracy tracks, we run our <a href="https://en.wikipedia.org/wiki/Inverse_kinematics">inverse kinematic algorithm</a> to calculate the desired transforms of a few parent bones to match our desired pose. This will tend to spread the error of our parent bones, making it less obvious while keeping our point of contact fixed and accurate.</p>
<p>Besides allowing our compression to be more aggressive, this technique does not have a lot of up sides. It does have a number of down sides though:</p>
<ul>
<li>Extra tracks means extra data to compress and decompress</li>
<li>Even a simple <a href="http://mrl.nyu.edu/~perlin/gdc/ik/ik.java.html">2-bone inverse kinematic algorithm</a> will end up slowing down our decompression since we need to calculate object space transforms for our bones involved</li>
<li>By its very nature, our parent bones will no longer closely match the original clip, only the general feel might remain depending on how far the <a href="https://en.wikipedia.org/wiki/Inverse_kinematics">inverse kinematic</a> correction ends up putting us.</li>
</ul>
<h1 id="conclusion">Conclusion</h1>
<p>All three forms of error correction can be used with any compression algorithm but they all have a number of important down sides. For this reason, unless you need the compression to be very aggressive, I would advise against using these techniques. If you choose to do so, the first two appear to be the most appropriate due to their reduced runtime overhead. Note that if you really wanted to, all three techniques could be used simultaneously but that would most likely be very extreme.</p>
<p>Note: I profiled with and without error compensation in Unreal Engine 4 and the results were underwhelming, see <a href="/2019/07/31/pitfalls_linear_reduction_part4/">here</a>.</p>
<p><a href="/2016/12/23/anim_compression_case_studies/">Up next: Case Studies</a></p>
<p><a href="/2016/10/21/anim_compression_toc/"><strong>Back to table of contents</strong></a></p>
Animation Compression: Signal Processing2016-12-19T00:00:00+00:00http://nfrechette.github.io/2016/12/19/anim_compression_signal_processing<p><a href="https://en.wikipedia.org/wiki/Signal_processing">Signal processing</a> algorithm variants come in many forms but the most common and popular approach is to use <a href="https://en.wikipedia.org/wiki/Wavelet">Wavelets</a>. Having utilized this method for all character animation clips on all supported platforms for <a href="https://en.wikipedia.org/wiki/Thief_(2014_video_game)">Thief (2014)</a>, I have a fair amount to share with you.</p>
<p>Other signal processing algorithm variants include <a href="https://en.wikipedia.org/wiki/Discrete_cosine_transform">Discrete Cosine Transform</a>, <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">Principal Component Analysis</a>, and <a href="https://en.wikipedia.org/wiki/Principal_geodesic_analysis">Principal Geodesic Analysis</a>.</p>
<p>The latter two variants are commonly used alongside clustering and database approaches which I’ll explore if enough interest is expressed but I’ll be focusing on <a href="https://en.wikipedia.org/wiki/Wavelet">Wavelets</a> here.</p>
<h1 id="how-it-works">How It Works</h1>
<p>At their core implementations that leverage wavelets for compression will be split into four distinct steps:</p>
<ul>
<li>Pre-processing</li>
<li>The wavelet transform</li>
<li>Quantization</li>
<li>Entropy coding</li>
</ul>
<p>The flow of information can be illustrated like this:</p>
<p><img src="/public/wavelet_compression_process.jpg" alt="Wavelet Algorithm Layout" /></p>
<p>The most important step is, of course, the wavelet function, around which everything is centered. Covering the wavelet function first will help clarify the purpose of every other step.</p>
<p>Aside from quantization, all of the steps involved are effectively lossless and only suffer from minor floating point rounding. By altering how many bits we use for the quantization step, we can control how aggressive we want the compression to be.</p>
<p>Decompression is simply the same steps performed in reverse order.</p>
<h1 id="wavelet-basics">Wavelet Basics</h1>
<p>We will avoid going too in-depth on this topic in this series; instead, we will focus on the wavelet properties and what they mean for us with respect to character animation and compression in general. A good starting point for the curious is the <a href="https://en.wikipedia.org/wiki/Haar_wavelet">Haar wavelet</a>, the simplest of wavelet functions; however, it is generally avoided for compression.</p>
<p>By definition, wavelet functions are recursive. Each application of the function is referred to as a sub-band and outputs an equal number of scale and coefficient values, each exactly half the original input size. In turn, we can recursively apply the function on the resulting scale values of the previous sub-band. The end result is a single scale value and <code class="language-plaintext highlighter-rouge">N - 1</code> coefficients where <code class="language-plaintext highlighter-rouge">N</code> is the input size.</p>
<p><img src="/public/wavelet_recursive_1d.png" alt="Recursive Wavelet" /></p>
<p><a href="https://en.wikipedia.org/wiki/Haar_wavelet">Haar wavelet</a> scale is simply the sum of two input values and the coefficient represents their difference. As far as I know most wavelet functions function similarly which yield coefficients that are as close to zero as possible and exactly zero for a constant input signal.</p>
<p>The reason the <a href="https://en.wikipedia.org/wiki/Haar_wavelet">Haar wavelet</a> is not suitable for compression is that it has a single vanishing moment. This means input data is processed in pairs, each outputting a single scale and a single coefficient. The pairs never overlap, which means that a discontinuity between two pairs will not be taken into account, yielding undesirable artifacts if the coefficients are not accurate. A decent alternative is the <a href="https://en.wikipedia.org/wiki/Daubechies_wavelet">Daubechies D4 wavelet</a>. This is the function I used on <a href="https://en.wikipedia.org/wiki/Thief_(2014_video_game)">Thief (2014)</a> and it turned out quite decently for our purposes.</p>
<p>The wavelet transform can be entirely lossless by using an integer variant but in practice, an ordinary floating point variant is appropriate since compression is lossy by nature and the rounding will not measurably impact the results.</p>
<p>Since a wavelet function decomposes a signal onto an <a href="https://en.wikipedia.org/wiki/Orthonormal_basis">orthonormal basis</a>, we can achieve the highest compression by considering as much of the signal as possible, not unlike <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">principal component analysis</a>: simply concatenate all tracks together into a single 1D signal. The upside of considering all the data as a whole is that a single <a href="https://en.wikipedia.org/wiki/Orthonormal_basis">orthonormal basis</a> lets us quantize more aggressively, but with a larger signal to transform, decompression speed suffers. To keep the process reasonably fast on modern hardware, each track would likely be processed independently in a small power-of-two block, such as 16 keys at a time. For <a href="https://en.wikipedia.org/wiki/Thief_(2014_video_game)">Thief (2014)</a>, all rotation tracks and translation tracks were aggregated independently up to a maximum segment size of 64 KB. We ran the wavelet transform once for rotation tracks and once for translation tracks.</p>
<h1 id="pre-processing">Pre-processing</h1>
<p>Because wavelet functions are recursive, the size of the input data needs to be a power of two. If our size doesn’t match, we will need to introduce some form of padding:</p>
<ul>
<li>Pad with zeroes</li>
<li>Repeat the last value</li>
<li>Mirror the signal</li>
<li>Loop the signal</li>
<li>Something even more creative?</li>
</ul>
<p>Which padding approach you choose is likely to have a fairly minimal impact on compression. Your guess is as good as mine regarding which is best. In practice, it’s best to avoid padding as much as possible by keeping input sizes fairly small and processing the input data in blocks or segments.</p>
<p>The scale of output coefficients is a function of the scale and smoothness of our input values. As such it makes sense to perform <a href="http://nfrechette.github.io/2016/11/09/anim_compression_range_reduction/">range reduction</a> and to normalize our input values.</p>
<h1 id="quantization">Quantization</h1>
<p>After applying the wavelet transform the number of output values will match the number of input values. No compression has happened yet.</p>
<p>As mentioned previously, our output values will be partitioned into sub-bands plus a single scale value, with coefficients somewhat centered around zero (both positive and negative). Each sub-band will end up with a different range of values. Larger sub-bands resulting from the first applications of the wavelet function will hold the high-frequency information while the smaller sub-bands will hold the low-frequency information. This is important: it means that a single low-frequency coefficient will impact a larger range of values after performing the inverse wavelet transform. Because of this, low-frequency coefficients need higher accuracy than high-frequency coefficients.</p>
<p>To achieve compression, we quantize our coefficients onto a reduced number of bits while keeping the single scale value at full precision. Due to the nature of the data, we perform range reduction per sub-band and normalize our values within <code class="language-plaintext highlighter-rouge">[-1.0, 1.0]</code>. We only need to keep the range extent for reconstruction and simply assume that the range is centered around zero. Quantization might not make sense for the lower frequency sub-bands with 1, 2, or 4 coefficients due to the extra overhead of the range extent. Once our values are normalized, we can quantize them. To choose how many bits to use per coefficient, we can simply hard code a number such as 16 or 12 bits, or experiment with values in an attempt to meet an error threshold. Depending on the number of input values being processed (for example, 16 keys at a time), quantization could also be performed globally instead of per sub-band to reduce the range information overhead.</p>
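<p>A minimal sketch of this quantization step, keeping one full precision range extent per sub-band and assuming the range is centered around zero (the struct and function names are mine, for illustration only):</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizedSubBand
{
    float range_extent;                 // full precision, needed to reconstruct
    std::vector<int32_t> coefficients;  // each within a signed N bit range
};

QuantizedSubBand quantize_sub_band(const std::vector<float>& coefficients, uint32_t num_bits)
{
    QuantizedSubBand result;

    // The range extent is the largest absolute coefficient in the sub-band.
    result.range_extent = 0.0f;
    for (float value : coefficients)
        result.range_extent = std::fmax(result.range_extent, std::fabs(value));

    // Normalize within [-1.0, 1.0] and quantize onto 'num_bits' bits.
    const float max_int = float((1u << (num_bits - 1)) - 1);
    for (float value : coefficients)
    {
        const float normalized = result.range_extent > 0.0f ? value / result.range_extent : 0.0f;
        result.coefficients.push_back(int32_t(std::round(normalized * max_int)));
    }
    return result;
}
</code></pre></div></div>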
<h1 id="entropy-coding">Entropy Coding</h1>
<p>In order to be competitive with other techniques, we need to push compression further using <a href="https://en.wikipedia.org/wiki/Entropy_encoding">entropy coding</a> which is an entirely lossless compression step.</p>
<p>After quantization, we obtain a number of integer values all centered around zero, plus a single scale. The most obvious thing we can exploit is that we have very few large values. To leverage this, we apply a <a href="https://wiki.multimedia.cx/index.php/Zigzag_Reordering">zigzag transform</a> on our data, mapping negative integers to positive unsigned integers such that values closest to zero remain closest to zero. We still end up with very few large values, which is significant because most of our values as represented in memory now have many leading zeroes.</p>
<p>For example, suppose we quantize everything onto 16 bit signed integers: <code class="language-plaintext highlighter-rouge">-50, 50, 32760</code>. In memory, these values are represented with <a href="https://en.wikipedia.org/wiki/Two's_complement">two’s complement</a>: <code class="language-plaintext highlighter-rouge">0xFFCE, 0x0032, 0x7FF8</code>. This is not great and how to compress it further is not immediately obvious. If we apply the <a href="https://wiki.multimedia.cx/index.php/Zigzag_Reordering">zigzag transform</a>, our signed integers map to the unsigned integers <code class="language-plaintext highlighter-rouge">100, 99, 65519</code>, represented in memory as <code class="language-plaintext highlighter-rouge">0x0064, 0x0063, 0xFFEF</code>. An easily predictable pattern emerges: smaller values with many leading zeroes, which will compress well.</p>
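<p>A minimal sketch matching the convention of the example above, where positive values map to odd unsigned integers and negative values to even ones (INT16_MIN is assumed absent since it has no positive counterpart):</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstdint>

// -50 maps to 100, 50 maps to 99, and 32760 maps to 65519.
uint16_t zigzag_encode(int16_t value)
{
    return value > 0 ? uint16_t(value * 2 - 1) : uint16_t(-int32_t(value) * 2);
}

int16_t zigzag_decode(uint16_t value)
{
    return (value & 1) != 0 ? int16_t((value + 1) / 2) : int16_t(-int32_t(value / 2));
}
</code></pre></div></div>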
<p>At this point, a generic entropy coding algorithm is used like <a href="https://en.wikipedia.org/wiki/Zlib">zlib</a>, <a href="https://en.wikipedia.org/wiki/Huffman_coding">Huffman</a>, or some custom <a href="https://en.wikipedia.org/wiki/Arithmetic_coding">arithmetic coding algorithm</a>. Luke Mamacos gives a <a href="https://worldoffries.wordpress.com/2015/08/25/simple-entropy-coder/">decent example</a> of a wavelet arithmetic encoder that takes advantage of leading zeros.</p>
<p>It’s worth noting that if you process a very large input in a single block you will likely end up with lots of padding at the end. This typically ends up as all zero values after the quantization step and it can be beneficial to use <a href="https://en.wikipedia.org/wiki/Run-length_encoding">run length encoding</a> to compress those before the entropy coding phase.</p>
<h1 id="in-the-wild">In The Wild</h1>
<p>Signal processing algorithms tend to be the most complex to understand while requiring the most code. This makes maintenance a challenge, which is reflected in their decreased use in the wild.</p>
<p>While these compression methods can be used competitively if the right entropy coding algorithm is used, they tend to be far too slow to decompress, too complex to implement, and too challenging to maintain for the results that they yield.</p>
<p>Due to its popularity at the time, I introduced wavelet compression to <a href="https://en.wikipedia.org/wiki/Thief_(2014_video_game)">Thief (2014)</a> to replace the <a href="http://nfrechette.github.io/2016/12/07/anim_compression_key_reduction/">linear key reduction</a> algorithm used in <a href="https://en.wikipedia.org/wiki/Unreal_Engine#Unreal_Engine_3">Unreal 3</a>. <a href="http://nfrechette.github.io/2016/12/07/anim_compression_key_reduction/">Linear key reduction</a> was very hard to tweak properly due to the naive error function it used, resulting in a large memory footprint or inaccurate animation clips. The wavelet implementation ended up being faster to compress with and yielded a smaller memory footprint with good accuracy.</p>
<h1 id="performance">Performance</h1>
<p>Fundamentally, the wavelet decomposition allows us to exploit temporal coherence in our animation clip data, but this comes at a price: in order to sample a single keyframe, we must reverse everything. Meaning, if we process 16 keys at a time, we must decompress all 16 keys to sample a single one of them (or two if we linearly interpolate as we normally would when sampling our clip). For this reason, wavelet implementations are terribly slow to decompress and speeds end up not being competitive at all, which only gets worse as you process a larger input signal. On <a href="https://en.wikipedia.org/wiki/Thief_(2014_video_game)">Thief (2014)</a>, full decompression on the <a href="https://en.wikipedia.org/wiki/PlayStation_3">PlayStation 3</a> <a href="https://en.wikipedia.org/wiki/Cell_(microprocessor)#Synergistic_Processing_Elements_.28SPE.29">SPU</a> took between <strong>800us</strong> and <strong>1300us</strong> for blocks of data up to 64 KB.</p>
<p>Obviously, this is entirely unacceptable with other techniques in the range of <strong>30us</strong> and <strong>200us</strong>. To mitigate this and keep it competitive an intermediate cache is necessary.</p>
<p>The idea of the cache is to perform the expensive decompression once for a block of data (e.g. 16 keys) and re-use it in the future. At 30 FPS our 16 keys will be usable for roughly 0.5 seconds. This, of course, comes with a cost as we now need to implement and maintain an entirely new layer of complexity. We must first decompress into the cache and then interpolate our keys from it. The decompression can typically be launched early to avoid stalls when interpolating but it is not always possible. This is particularly problematic on the first frame of gameplay where a large number of animations will start to play at the same time while our cache is empty or stale. For similar reasons, the same issue happens when a cinematic moment starts or any moment in gameplay with major or abrupt change.</p>
<p>On the upside, as we decompress only once into the cache, we can also take a bit of time to swizzle our data and sort it by key and bone such that our data per key frame is contiguous. Sampling from our cache then becomes more or less equivalent to sampling with <a href="http://nfrechette.github.io/2016/11/15/anim_compression_quantization/">simple quantization</a>. For this reason, sampling from the cache is extremely fast and competitive (as fast as <a href="http://nfrechette.github.io/2016/11/15/anim_compression_quantization/">simple quantization</a>).</p>
<p>Our small cache for <a href="https://en.wikipedia.org/wiki/Thief_(2014_video_game)">Thief (2014)</a> was held in main memory while our wavelet compressed data was held in video memory on the <a href="https://en.wikipedia.org/wiki/PlayStation_3">PlayStation 3</a>. This played very well in our favor: the rare decompressions did not impact the rendering bandwidth as much and interpolation remained fast. It also contributed to slower decompression times, but decompression was still faster than on the <a href="https://en.wikipedia.org/wiki/Xbox_360">Xbox 360</a>.</p>
<p>In conclusion, signal processing algorithms should be avoided in favor of simpler algorithms that are easier to implement and maintain, and that end up just as competitive when properly implemented.</p>
<p><a href="/2016/12/22/anim_compression_error_compensation/">Up next: Error Compensation</a></p>
<p><a href="/2016/10/21/anim_compression_toc/"><strong>Back to table of contents</strong></a></p>
Animation Compression: Curve Fitting2016-12-10T00:00:00+00:00http://nfrechette.github.io/2016/12/10/anim_compression_curve_fitting<p><a href="https://en.wikipedia.org/wiki/Curve_fitting">Curve fitting</a> builds on what we last saw with <a href="/2016/12/07/anim_compression_key_reduction/">linear key reduction</a>. There, we leveraged linear interpolation to remove keys that could easily be predicted. Curve fitting achieves the same feat by using a different interpolation method: a spline function.</p>
<h1 id="how-it-works">How It Works</h1>
<p>The algorithm is already fairly well described by <a href="https://engineering.riotgames.com/news/compressing-skeletal-animation-data">Riot Games</a> and Bitsquid (now called Stingray) in <a href="http://bitsquid.blogspot.ca/2009/11/bitsquid-low-level-animation-system.html">part 1</a> and <a href="http://bitsquid.blogspot.ca/2011/10/low-level-animation-part-2.html">part 2</a>, and as such I will not go further into details at this time.</p>
<p><a href="https://en.wikipedia.org/wiki/Cubic_Hermite_spline#Catmull.E2.80.93Rom_spline">Catmull-Rom</a> splines are a fairly common and a solid choice to represent our curves.</p>
<h1 id="in-the-wild">In The Wild</h1>
<p>This algorithm is again fairly popular and is used by most animation authoring software and many game engines. Sadly, I never had the chance to get my hands on a state-of-the-art implementation of this algorithm and as such I can’t go as far in depth as I would otherwise like.</p>
<h1 id="performance">Performance</h1>
<p>Most character animated tracks move very smoothly and approximating them with a curve is a very natural choice. In fact, clips that are authored by hand are often encoded and manipulated as a curve in Maya (or 3DS Max). If the original curves are available, we can use them as-is. This also makes the information very dense and compact. The memory footprint of curve fitting should be considerably lower than with <a href="/2016/12/07/anim_compression_key_reduction/">linear key reduction</a> but I do not have access to competitive implementations of both algorithms to make a fair comparison.</p>
<p>For example, take this screen capture from some animation curves in Unity:</p>
<p><img src="/public/unity_curves.jpg" alt="Animation Curves" /></p>
<p>We can easily see that each track has <strong>five</strong> control points but with a total clip duration of <strong>2.5 seconds</strong> (note that the image uses a sample rate of 25 FPS which makes the numbering a bit quirky) we would need <code class="language-plaintext highlighter-rouge">2.5 seconds * 30 frames/second = 75 frames</code> to represent the same data. Even after using <a href="/2016/12/07/anim_compression_key_reduction/">linear key reduction</a>, the number of keys would remain higher than <strong>five</strong>.</p>
<p>As with <a href="/2016/12/07/anim_compression_key_reduction/">linear key reduction</a>, our spline control points will have time markers and most of what was mentioned previously will apply to curve fitting as well: we need to search for our neighbour control points, we need to sort our data to be cache efficient, etc.</p>
<p>One important distinction is that while <a href="/2016/12/07/anim_compression_key_reduction/">linear key reduction</a> only needs two keys per track to reconstruct our desired value at a particular time <code class="language-plaintext highlighter-rouge">T</code>, with curve fitting we might need more. For example, <a href="https://en.wikipedia.org/wiki/Cubic_Hermite_spline#Catmull.E2.80.93Rom_spline">Catmull-Rom</a> splines require four control points. This makes it more likely to increase the number of cache lines we need to read when we sample our clip. For this reason, and because a spline interpolation function is more expensive to execute, decompression should be slower than with <a href="/2016/12/07/anim_compression_key_reduction/">linear key reduction</a>, but without access to a solid implementation, this is only an educated guess at this point.</p>
<h1 id="additional-reading">Additional Reading</h1>
<ul>
<li><a href="http://jamie-wong.com/post/bezier-curves/">Bezier Curves</a></li>
</ul>
<p><a href="/2016/12/19/anim_compression_signal_processing/">Up next: Signal Processing</a></p>
<p><a href="/2016/10/21/anim_compression_toc/"><strong>Back to table of contents</strong></a></p>
Animation Compression: Linear Key Reduction2016-12-07T00:00:00+00:00http://nfrechette.github.io/2016/12/07/anim_compression_key_reduction<p>With <a href="/2016/11/15/anim_compression_quantization/">simple key quantization</a>, if we needed to sample a certain time <code class="language-plaintext highlighter-rouge">T</code> for which we did not have a key (e.g. in between two existing keys), we linearly interpolated between the two.</p>
<p>A natural extension of this is of course to remove keys or key frames which can be entirely linearly interpolated from their neighbour keys as long as we introduce minimal or no visible error.</p>
<h2 id="how-it-works">How It Works</h2>
<p>The process to remove keys is fairly straight forward:</p>
<ul>
<li>Pick a key</li>
<li>Calculate the value it would have if we linearly interpolated it from its neighbours</li>
<li>If the resulting track error is acceptable, remove it</li>
</ul>
<p>The above algorithm continues until nothing further can be removed. How you pick keys may or may not significantly impact the results. I personally only ever came across implementations that iterated over all keys linearly forward in time. However, in theory you could iterate in any number of ways: random, key with smallest error first, etc. It would be interesting to try various iteration methods.</p>
<p>It is worth pointing out that you need to check the error at a higher level than the individual key you are removing since removing it might impact other removed keys by changing the neighbour used to reconstruct them. As such, you need to evaluate your error metric and not just the key value delta.</p>
<p>Removing keys is not without side effects: now that our data is no longer uniform, calculating the proper interpolation alpha to reconstruct our value at time <code class="language-plaintext highlighter-rouge">T</code> is no longer trivial. To be able to calculate it, we must introduce in our data a time marker per remaining key (or key frame). This marker of course adds overhead to our animation data and while in the general case it is a memory win, it can increase the overall size if the data is very noisy and few or no keys can be removed.</p>
<p>A simple formula is then used to reconstruct the proper interpolation alpha:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>TP = Time of Previous key
TN = Time of Next key
Interpolation Alpha = (Sample Time - TP) / (TN - TP)
</code></pre></div></div>
<p>Another important side effect of introducing time markers is that when we sample a certain time <code class="language-plaintext highlighter-rouge">T</code>, we must now search to find between which two keys to interpolate. This of course adds some overhead to our decompression speed.</p>
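<p>Putting it together, here is a minimal sketch of sampling a track with non-uniform keys; a linear scan is shown for clarity, while real implementations use a binary search or a cursor as discussed below:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstddef>
#include <vector>

// Assumes at least one key, key times sorted ascending, and a sample
// time within the clip.
float sample_track(const std::vector<float>& key_times,
                   const std::vector<float>& key_values,
                   float sample_time)
{
    // Find the first key at or past the sample time.
    size_t next = 0;
    while (next < key_times.size() - 1 && key_times[next] < sample_time)
        ++next;
    if (next == 0)
        return key_values[0];

    // Reconstruct the interpolation alpha from the time markers and lerp.
    const size_t prev = next - 1;
    const float alpha = (sample_time - key_times[prev]) / (key_times[next] - key_times[prev]);
    return key_values[prev] + (key_values[next] - key_values[prev]) * alpha;
}
</code></pre></div></div>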
<p>The removal is typically done in one of two ways:</p>
<ul>
<li>Removal of whole key frames that can be linearly interpolated</li>
<li>Removal of independent keys that can be linearly interpolated</li>
</ul>
<p>While the first is less aggressive and will generally yield a higher memory footprint, the decompression speed will be faster due to needing to search only once to calculate our interpolation alpha.</p>
<p>For example, suppose we have the following track and keys:
<img src="/public/key_reduction.png" alt="Some Keys" /></p>
<p>The key <strong>#3</strong> is of particular interest:
<img src="/public/key_reduction_3.png" alt="Key #3" /></p>
<p>As we can see, we can easily recover the interpolation alpha from its neighbours: <code class="language-plaintext highlighter-rouge">alpha = (3 - 2) / (4 - 2) = 0.5</code>. With it, we can perfectly reconstruct the missing key: <code class="language-plaintext highlighter-rouge">value = lerp(0.35, 0.85, alpha) = 0.6</code>.</p>
<p>Another interesting key is <strong>#4</strong>:
<img src="/public/key_reduction_4.png" alt="Key #4" /></p>
<p>It lies somewhat close to the value we could linearly interpolate from its neighbours: <code class="language-plaintext highlighter-rouge">value = lerp(0.6, 0.96, 0.5) = 0.78</code>. Whether the error introduced by removing it is acceptable or not is determined by our error metric function.</p>
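<p>A minimal sketch of this check for a single key, using a plain value delta for clarity; as pointed out earlier, a real implementation must evaluate the error metric over the whole track:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cmath>

// Returns true when the key can be reconstructed from its neighbours
// within the error threshold. For key #3 above: alpha = 0.5 and the
// reconstructed value is exactly 0.6, so the key can be removed.
bool can_remove_key(float prev_time, float prev_value,
                    float next_time, float next_value,
                    float key_time, float key_value,
                    float error_threshold)
{
    const float alpha = (key_time - prev_time) / (next_time - prev_time);
    const float reconstructed = prev_value + (next_value - prev_value) * alpha;
    return std::fabs(reconstructed - key_value) <= error_threshold;
}
</code></pre></div></div>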
<h2 id="in-the-wild">In The Wild</h2>
<p>This algorithm is perhaps the most common and popular out there. Unreal 4 and Unity 5, as well as many other popular game engines, support this format. They all use slight variations, mostly in their error metric function, but the principle remains the same. Sadly, most implementations out there use a poorly implemented error metric, which yields bad results in many instances. This typically stems from using a local error metric where each track type has a single error threshold. Of course, the problem with this is that due to the hierarchical nature of our bone data, some bones need higher accuracy (e.g. pelvis, root). Some engines mitigate this by allowing a threshold per track or per bone, but this requires some amount of tweaking to get right, which is often undesirable and sub-optimal.</p>
<p>Twice in my career I had to implement a new animation compression algorithm and both times were to replace bad linear key reduction implementations.</p>
<p>From the implementations I have seen in the wild, it seems more popular to remove individual keys as opposed to removing whole key frames.</p>
<h2 id="performance">Performance</h2>
<p>Sadly, due to the loss of data uniformity, the cache locality of the data we need suffers. Unlike for <a href="/2016/11/15/anim_compression_quantization/">simple key quantization</a>, we can no longer simply sort by key frame to keep things cache efficient if we remove individual keys (you still can if you remove whole key frames).</p>
<p>Although I have not personally witnessed it, I suspect it should be possible to use a variation of a technique used by <a href="/2016/12/10/anim_compression_curve_fitting/">curve fitting</a> to sort our data in a cache friendly way. It is well described <a href="http://bitsquid.blogspot.ca/2011/10/low-level-animation-part-2.html">here</a> and we’ll come back to it when we cover <a href="/2016/12/10/anim_compression_curve_fitting/">curve fitting</a>.</p>
<p>The need to constantly search for which neighbour keys to use when interpolating quickly adds up since it scales poorly. The longer our clip, the wider the range we need to search; the more tracks we have, the more searching needs to happen. I have seen two ways to mitigate this: partitioning our clip or using a cursor.</p>
<p>Partitioning our clip data as we discussed with <a href="/2016/11/10/anim_compression_uniform_segmenting/">uniform segmenting</a> helps reduce the range to search in as our clip length increases. If the number of keys per block is sufficiently small, searching can be made very efficient with a sorting network or similar strategy. The use of blocks will also decrease the need for precision in our time markers by using a similar form of <a href="/2016/11/09/anim_compression_range_reduction/">range reduction</a> which allows us to use fewer bits to store them.</p>
<p>Using a cursor is conceptually very simple. Most clips play linearly and predictably (either forward or backward in time). We can leverage this fact to speed up our search by caching which time we sampled last and which neighbour keys were used to kickstart our search. The cursor overhead is very low if we remove whole key frames but the overhead is a function of the number of animated tracks if we remove individual keys.</p>
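<p>A minimal sketch of such a cursor for a single track (per-track cursors like this one are what make the overhead scale with track count when individual keys are removed):</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstddef>
#include <vector>

struct TrackCursor
{
    size_t last_key = 0;    // where the previous sample left off
};

// Returns the index of the last key at or before the sample time,
// resuming from the cursor. Assumes a non-empty, sorted track.
size_t find_prev_key(const std::vector<float>& key_times, float sample_time, TrackCursor& cursor)
{
    size_t key = cursor.last_key;
    if (key_times[key] > sample_time)
        key = 0;    // we jumped backward in time, restart the search

    while (key + 1 < key_times.size() && key_times[key + 1] <= sample_time)
        ++key;

    cursor.last_key = key;
    return key;
}
</code></pre></div></div>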
<p>Note that the above sorting trick might also speed up the search, but I cannot speak to the accuracy of this statement at this time.</p>
<p>Even though we can reach a smaller memory footprint with linear key reduction compared to <a href="/2016/11/15/anim_compression_quantization/">simple key quantization</a>, the number of cache lines we need to touch when decompressing is most likely going to be higher. Along with the need to search for key neighbours, these facts make it slower to decompress using this algorithm. It remains popular due to the reduced memory footprint, which was very important on older consoles (e.g. the PS2 and PS3 era), as well as due to its obvious simplicity.</p>
<p>See the following posts for more details:</p>
<ul>
<li><a href="/2019/07/23/pitfalls_linear_reduction_part1/">Pitfalls of linear sample reduction: Part 1</a></li>
<li><a href="/2019/07/25/pitfalls_linear_reduction_part2/">Pitfalls of linear sample reduction: Part 2</a></li>
<li><a href="/2019/07/29/pitfalls_linear_reduction_part3/">Pitfalls of linear sample reduction: Part 3</a></li>
<li><a href="/2019/07/31/pitfalls_linear_reduction_part4/">Pitfalls of linear sample reduction: Part 4</a></li>
</ul>
<p><a href="/2016/12/10/anim_compression_curve_fitting/">Up next: Curve Fitting</a></p>
<p><a href="/2016/10/21/anim_compression_toc/"><strong>Back to table of contents</strong></a></p>
Animation Compression: Sub-sampling2016-11-17T00:00:00+00:00http://nfrechette.github.io/2016/11/17/anim_compression_sub_sampling<p>Once upon a time, sub-sampling was a very common compression technique but it is now mostly relegated to history books (and my boring blog!).</p>
<h1 id="how-it-works">How It Works</h1>
<p>It is conceptually very simple:</p>
<ul>
<li>Take your source data, either from Maya (3DS Max, etc.) or already sampled data at some sample rate</li>
<li>Sample (or re-sample) your data at a lower sample rate</li>
</ul>
<p>Traditionally, character animations have a sample rate of 30 FPS. This means that for any given animated track, we end up with 30 keys per second of animation.</p>
<p>Sub-sampling works because, in practice, most animations don’t move all that fast and a lower sampling rate is just fine; 15-20 FPS is generally good enough.</p>
<h1 id="edge-cases">Edge Cases</h1>
<p>Now of course, this fails miserably if this assumption does not hold true or if a particular key is very important. It can often be the case that an important key is removed with this technique and there is sadly not much that can be done to avoid this issue short of selecting another sampling rate.</p>
<p>It is also worth mentioning that not all sampling rates are necessarily equal. If your source data is already discretized at some original sample rate, sample rates that retain whole keys are generally superior to sample rates that force the generation of new keys by interpolating their neighbours.</p>
<p>For example, if my source animation track is sampled at 30 FPS, I have a key every <code class="language-plaintext highlighter-rouge">1s / 30 = 0.033s</code>. If I sub-sample it at 18 FPS, I have a key every <code class="language-plaintext highlighter-rouge">1s / 18 = 0.055s</code>. This means every key I need is not in sync with my original data and thus new keys must be generated. This will yield some loss of accuracy.</p>
<p>On the other hand, if I sub-sample at 15 FPS, I have a key every <code class="language-plaintext highlighter-rouge">1s / 15 = 0.066s</code>. This means every other key in my original data can be discarded and the remaining keys are identical to my original keys.</p>
<p>Another good example is sub-sampling at 20 FPS which will yield a key every <code class="language-plaintext highlighter-rouge">1s / 20 = 0.05s</code>. This means every 3rd key will be retained from the original data (<strong>0.0</strong> … 0.033 … 0.066 … <strong>0.1</strong> … 0.133 … 0.166 … <strong>0.2</strong> …). The other keys do not line up and will be artificially generated from our original neighbour keys.</p>
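<p>A minimal sketch of the lossless case, where the target rate evenly divides the source rate and we simply keep every Nth key:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstddef>
#include <vector>

// Keeps every Nth key, e.g. 30 FPS down to 15 FPS keeps every other key.
// Assumes target_fps evenly divides source_fps; other rates need new
// keys generated by interpolation, as described above.
std::vector<float> sub_sample(const std::vector<float>& keys, int source_fps, int target_fps)
{
    const int stride = source_fps / target_fps;
    std::vector<float> result;
    for (size_t i = 0; i < keys.size(); i += stride)
        result.push_back(keys[i]);
    return result;
}
</code></pre></div></div>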
<p>The problem of keys not lining up is of course absent if your source animation data is not already discretized. If you have access to the Maya (or 3DS Max) curves, the sub-sampling will retain higher accuracy.</p>
<h1 id="in-the-wild">In The Wild</h1>
<p>In the wild, this used to be a very popular technique on older generation hardware such as the Xbox 360 and the PlayStation 3 (and older). It was very common to keep most of your main character animations with a high sample rate of say 30 FPS, while keeping most of your NPC animations at a lower sample rate of say 15 FPS. Any specific animation that required high accuracy would not be sub-sampled and this selection process was done by hand, making its usage somewhat error prone.</p>
<p>Due to its simplicity, it is also commonly used alongside other compression techniques (e.g. <a href="/2016/12/07/anim_compression_key_reduction/">linear key reduction</a>) to further reduce the memory footprint.</p>
<p>However, nowadays this technique is seldom used in large part because we aren’t as constrained by the memory footprint as we used to be and in part because we strive to push the animation quality ever higher.</p>
<p><a href="/2016/12/07/anim_compression_key_reduction/">Up next: Linear Key Reduction</a></p>
<p><a href="/2016/10/21/anim_compression_toc/"><strong>Back to table of contents</strong></a></p>
Animation Compression: Simple Quantization2016-11-15T00:00:00+00:00http://nfrechette.github.io/2016/11/15/anim_compression_quantization<p>Simple quantization is perhaps the simplest compression technique out there. It is used for pretty much everything, not just for character animation compression.</p>
<h1 id="how-it-works">How It Works</h1>
<p>It is fundamentally very simple:</p>
<ul>
<li>Take some floating point value</li>
<li>Normalize it within some range</li>
<li>Scale it to the range of a <strong>N</strong> bit integer</li>
<li>Round and convert to a <strong>N</strong> bit integer</li>
</ul>
<p>That’s it!</p>
<p>For example, suppose we have some rotation component value within the range <code class="language-plaintext highlighter-rouge">[-PI, PI]</code> and we wish to represent it as a signed <strong>16</strong> bit integer:</p>
<ul>
<li>Our value is: <code class="language-plaintext highlighter-rouge">0.25</code></li>
<li>Our normalized value is thus: <code class="language-plaintext highlighter-rouge">0.25 / PI = 0.08</code></li>
<li>Our scaled value is thus: <code class="language-plaintext highlighter-rouge">0.08 * 32767.0 = 2607.51</code></li>
<li>We perform arithmetic rounding: <code class="language-plaintext highlighter-rouge">Floor(2607.51 + 0.5) = 2608</code></li>
</ul>
<p>Reconstruction is the reverse process and is again very simple:</p>
<ul>
<li>Our normalized value becomes: <code class="language-plaintext highlighter-rouge">2608.0 / 32767.0 = 0.08</code></li>
<li>Our de-quantized value is: <code class="language-plaintext highlighter-rouge">0.08 * PI = 0.25</code></li>
</ul>
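<p>A minimal sketch of the round trip above, assuming the input lies within <code class="language-plaintext highlighter-rouge">[-PI, PI]</code>; clamping and edge cases are omitted:</p>
<pre><code class="language-cpp">#include &lt;cmath&gt;
#include &lt;cstdint&gt;

constexpr float k_pi = 3.14159265f;

int16_t quantize(float value)
{
    const float scaled = (value / k_pi) * 32767.0f;
    // Symmetrical arithmetic rounding: round half away from zero.
    return int16_t(scaled &gt;= 0.0f ? std::floor(scaled + 0.5f)
                                  : std::ceil(scaled - 0.5f));
}

float dequantize(int16_t value)
{
    return (float(value) / 32767.0f) * k_pi;
}
</code></pre>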
<p>If we need to sample a certain time <code class="language-plaintext highlighter-rouge">T</code> for which we do not have a key (e.g. in between two existing keys), we typically linearly interpolate between the two. This is just fine because key frames are supposed to be fairly close in time and the interpolation is unlikely to introduce any visible error.</p>
<h1 id="edge-cases">Edge Cases</h1>
<p>There is of course some subtlety to consider. For one, proper rounding needs to be considered. Not all <a href="http://number-none.com/product/Scalar%20Quantization/">rounding modes are equivalent</a> and there are <a href="http://www.eetimes.com/document.asp?doc_id=1274485">so many</a>!</p>
<p>The safest bet, and a good default for compression purposes, is to use symmetrical arithmetic rounding. This is particularly important when quantizing to a signed integer value but if you only use unsigned integers, asymmetrical arithmetic rounding is just fine.</p>
<p>Another subtlety is whether to use signed integers or unsigned integers. In practice, due to the way two’s complement works and the need for us to represent the 0 value accurately, when using signed integers, the positive half and the negative half do not have the same size. For example, with <strong>16</strong> bits, the negative half ranges from <code class="language-plaintext highlighter-rouge">[-32768, 0]</code> while the positive half ranges from <code class="language-plaintext highlighter-rouge">[0, 32767]</code>. This asymmetry is generally undesirable and as such using unsigned integers is often favoured. The conversion is fairly simple and only requires doubling our range (<code class="language-plaintext highlighter-rouge">2 * PI</code> in the example above) and offsetting it by the signed range (<code class="language-plaintext highlighter-rouge">PI</code> in the example above).</p>
<ul>
<li>Our normalized value is thus: <code class="language-plaintext highlighter-rouge">(0.25 + PI) / (2.0 * PI) = 0.54</code></li>
<li>Our scaled value is thus: <code class="language-plaintext highlighter-rouge">0.54 * 65535.0 = 35375.05</code></li>
<li>With arithmetic rounding: <code class="language-plaintext highlighter-rouge">Floor(35375.05 + 0.5) = 35375</code></li>
</ul>
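<p>And a sketch of this unsigned variant, with illustrative names:</p>
<pre><code class="language-cpp">#include &lt;cmath&gt;
#include &lt;cstdint&gt;

constexpr float k_pi = 3.14159265f;

uint16_t quantize_unsigned(float value)
{
    // The [-PI, PI] input is offset and doubled onto [0, 2 * PI] before normalizing.
    const float normalized = (value + k_pi) / (2.0f * k_pi);  // [0.0, 1.0]
    // Asymmetrical arithmetic rounding is fine for unsigned integers.
    return uint16_t(std::floor(normalized * 65535.0f + 0.5f));
}

float dequantize_unsigned(uint16_t value)
{
    return (float(value) / 65535.0f) * (2.0f * k_pi) - k_pi;
}
</code></pre>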
<p>Another point to consider is: what is the maximum number of bits we can use with this implementation? A single precision floating point value can only accurately represent 6-9 significant digits; as such, the upper bound appears to be around 19 bits for an unsigned integer. Beyond this, the quantization method will need to change to an exponential scale, similar to how <a href="http://aras-p.info/blog/2009/07/30/encoding-floats-to-rgba-the-final/">depth buffers are often encoded in RGB textures</a>. Further research is needed on the nuances of both approaches.</p>
<h1 id="in-the-wild">In The Wild</h1>
<p>The main question remaining is how many bits to use for our integers. Most standalone implementations in the wild use the simple quantization to replace a purely raw floating point format. Tracks will often be encoded on a hardcoded number of bits, typically <strong>16</strong> which is generally enough for rotation tracks as well as range reduced translation and scale tracks.</p>
<p>Most implementations that don’t exclusively rely on simple quantization but use it for extra memory gains again typically use a hardcoded number of bits. Here again <strong>16</strong> bits is a popular choice but sometimes as low as <strong>12</strong> bits is used. This is common for <a href="/2016/12/07/anim_compression_key_reduction/">linear key reduction</a> and <a href="/2016/12/10/anim_compression_curve_fitting/">curve fitting</a>. Wavelet based implementations will typically use anywhere between <strong>8</strong> and <strong>16</strong> bits per coefficient quantized and these will vary per sub-band as we will see when we cover this topic.</p>
<p>The resulting memory footprint and the error introduced are both a function of the number of bits used by the integer representation.</p>
<p>This technique is also very commonly used alongside other compression techniques. For example, the remaining keys after <a href="/2016/12/07/anim_compression_key_reduction/">linear key reduction</a> will generally be quantized as will curve control points after <a href="/2016/12/10/anim_compression_curve_fitting/">curve fitting</a>.</p>
<h1 id="performance">Performance</h1>
<p>This algorithm is very fast to decompress. Only two key frames need to be decompressed and interpolated. Each individual track is trivially decompressed and interpolated. It is also terribly easy to make the data very processor cache friendly: all we need to do is sort our data by key frame, followed by sorting it by track. With such a layout, all our data for a single key frame will be contiguous and densely packed, and the next key frame we interpolate with will follow contiguously in memory. This makes it very easy for prefetching to happen either within the hardware or manually in software.</p>
<p>For example, suppose we have <strong>50</strong> animated tracks (a reasonable number for a normal AAA character), each with <strong>3</strong> components, stored on 16 bits (<strong>2</strong> bytes) per component. We end up with <code class="language-plaintext highlighter-rouge">50 * 3 * 2 = 300 bytes</code> for a single key frame. This yields a small number of cache lines on a modern processor: <code class="language-plaintext highlighter-rouge">ceil(300 / 64) = 5</code>. Twice that amount needs to be read to sample our clip.</p>
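<p>A minimal sketch of what the sampling loop might look like with such a layout; the quantization range carries over from the earlier example and all names are illustrative:</p>
<pre><code class="language-cpp">#include &lt;cstddef&gt;
#include &lt;cstdint&gt;

constexpr float k_pi = 3.14159265f;

// Carried over from the quantization example: [-PI, PI] stored on 16 bits.
inline float dequantize(uint16_t value)
{
    return (float(value) / 65535.0f) * (2.0f * k_pi) - k_pi;
}

// Interpolate every component between two contiguous key frames.
// Both input streams are read sequentially, which keeps the hardware
// prefetcher happy.
void sample_key_frames(const uint16_t* frame0, const uint16_t* frame1,
                       size_t num_components, float alpha, float* out)
{
    for (size_t i = 0; i &lt; num_components; ++i)
    {
        const float value0 = dequantize(frame0[i]);
        const float value1 = dequantize(frame1[i]);
        out[i] = value0 + (value1 - value0) * alpha;
    }
}
</code></pre>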
<p>Due in large part to the cache friendly nature of this algorithm, it is quite possibly the fastest to decompress.</p>
<p><em>My GDC presentation goes in further depth on this topic, its content will find its way here in due time.</em></p>
<p><a href="/2017/03/12/anim_compression_advanced_quantization/">Up next: Advanced Quantization</a></p>
<p><a href="/2016/10/21/anim_compression_toc/"><strong>Back to table of contents</strong></a></p>
Animation Compression: Main Compression Families2016-11-13T00:00:00+00:00http://nfrechette.github.io/2016/11/13/anim_compression_families<p>Over the years, a number of techniques have emerged to address the problem of character animation compression. These can be roughly broken down into a handful of general families:</p>
<ul>
<li><a href="/2016/11/15/anim_compression_quantization/">Simple Quantization</a>: Simply storing our key values on fewer bits</li>
<li><a href="/2017/03/12/anim_compression_advanced_quantization/">Advanced Quantization</a>: Super charge our simple quantization with a variable bit rate</li>
<li><a href="/2016/11/17/anim_compression_sub_sampling/">Sub-sampling</a>: Varying the sampling rate to reduce the number of key frames</li>
<li><a href="/2016/12/07/anim_compression_key_reduction/">Linear Key Reduction</a>: Removing keys that can be linearly interpolated from their neighbours</li>
<li><a href="/2016/12/10/anim_compression_curve_fitting/">Curve Fitting</a>: Calculating a curve that approximates our keys</li>
<li><a href="/2016/12/19/anim_compression_signal_processing/">Signal Processing</a>: Using <a href="https://en.wikipedia.org/wiki/Signal_processing">signal processing mathematical tools</a> such as <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">PCA</a> and <a href="https://en.wikipedia.org/wiki/Wavelet">Wavelets</a></li>
</ul>
<p>Note that <a href="/2016/11/15/anim_compression_quantization/">simple quantization</a> and <a href="/2016/11/17/anim_compression_sub_sampling/">sub-sampling</a> can be and often are used in tandem with the other three compression families although they can be used entirely on their own.</p>
<p><a href="/2016/11/15/anim_compression_quantization/">Up next: Simple Quantization</a></p>
<p><a href="/2016/10/21/anim_compression_toc/"><strong>Back to table of contents</strong></a></p>
Animation Compression: Uniform Segmenting2016-11-10T00:00:00+00:00http://nfrechette.github.io/2016/11/10/anim_compression_uniform_segmenting<p>Uniform segmenting is performed by splitting a clip into a number of blocks of equal or approximately equal size. This is done for a number of reasons.</p>
<h1 id="range-reduction">Range Reduction</h1>
<p>Splitting large clips (such as cinematics) into smaller blocks allows us to use <a href="/2016/11/09/anim_compression_range_reduction/">range reduction</a> on each track within a block. This can often help reach higher levels of compression especially on longer clips. This is typically done on top of the clip range reduction to further increase our compression accuracy.</p>
<h1 id="easier-seeking">Easier Seeking</h1>
<p>For some compression algorithms, seeking is slow because it involves searching for the keys or control points surrounding the particular time <code class="language-plaintext highlighter-rouge">T</code> we are trying to sample. Techniques exist to speed this up with optimal sorting but they add complexity. With sufficiently small blocks (e.g. 16 key frames), neither optimal sorting nor a cursor might be required (cursors are used by these algorithms to keep track of the last sampling performed in order to accelerate the search for the next one). See the posts about <a href="/2016/12/07/anim_compression_key_reduction/">linear key reduction</a> and <a href="/2016/12/10/anim_compression_curve_fitting/">curve fitting</a> for details.</p>
<p>With uniform segmenting, finding which blocks we need is very fast in large part because there are typically few blocks for most clips.</p>
<h1 id="streaming-and-tlb-friendly">Streaming and <a href="https://en.wikipedia.org/wiki/Translation_lookaside_buffer">TLB</a> Friendly</h1>
<p>The easier seeking also means the clips are much easier to stream. Everything you need to sample a number of key frames is bundled together into a single contiguous block. This makes it easy to cache them or prefetch them during playback.</p>
<p>For similar reasons, this makes our clip data more TLB friendly. All our relevant data needed for a number of key frames will use the same contiguous memory pages.</p>
<h1 id="required-for-wavelets">Required For Wavelets</h1>
<p>Wavelet functions typically work on data whose size is a power of two. Some form of padding is used to reach a power of two when the size doesn’t match. To avoid excessive padding, clips are split into uniform blocks. See the post about wavelet compression for details.</p>
<h1 id="downsides">Downsides</h1>
<p>Segmenting a clip into blocks does add some amount of memory overhead:</p>
<ul>
<li>Each clip will now need to store a mapping of which block is where in memory.</li>
<li>If <a href="/2016/11/09/anim_compression_range_reduction/">range reduction</a> is done per block as well, each block now includes range information.</li>
<li>Blocks might need a header with some flags or other relevant information.</li>
<li>And for some compression algorithms such as <a href="/2016/12/10/anim_compression_curve_fitting/">curve fitting</a>, it might force us to insert extra control points or to retain more key frames.</li>
</ul>
<p>However, the upsides are almost always worth the hassle and overhead.</p>
<p><a href="/2016/11/13/anim_compression_families/">Up next: Main Compression Families</a></p>
<p><a href="/2016/10/21/anim_compression_toc/"><strong>Back to table of contents</strong></a></p>
Animation Compression: Range Reduction2016-11-09T00:00:00+00:00http://nfrechette.github.io/2016/11/09/anim_compression_range_reduction<p>Range reduction is all about exploiting the fact that the values we compress typically have a much smaller range in practice than in theory.</p>
<p>For example, in theory our translation track range is infinite but in practice, for any given clip, the range is fixed and generally small. This generalizes to rotation and scale tracks as well.</p>
<h1 id="how-it-works">How It Works</h1>
<p>First we start with some animated translation track.</p>
<p><img src="/public/range_reduction_track.png" alt="Animated Track" /></p>
<p>Using this track as an example, we first find the minimum and maximum value that will bound our range: <code class="language-plaintext highlighter-rouge">[16, 40]</code></p>
<p>Using these, we can calculate the range extent: <code class="language-plaintext highlighter-rouge">40 - 16 = 24</code></p>
<p>Now, we have everything we need to re-normalize our track. For every key: <code class="language-plaintext highlighter-rouge">normalized value = (input value - range minimum) / range extent</code></p>
<p><img src="/public/range_reduction_normalized_track.png" alt="Normalized Track" /></p>
<p>Reconstructing our input value becomes trivial: <code class="language-plaintext highlighter-rouge">input value = (normalized value * range extent) + range minimum</code></p>
<p>We thus represent our range as a tuple: <code class="language-plaintext highlighter-rouge">(range minimum, range extent) = (16, 24)</code></p>
<p>This representation has the advantage that reconstructing our input value is very efficient and can use a single fused multiply-add instruction when it is available (and it is on modern CPUs that support FMA as well as modern GPUs).</p>
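<p>A minimal sketch of both operations; the struct name is illustrative:</p>
<pre><code class="language-cpp">// Per track range information: (range minimum, range extent).
struct track_range { float minimum; float extent; };

float normalize_value(float value, track_range range)
{
    return (value - range.minimum) / range.extent;
}

// Maps to a single fused multiply-add instruction where supported.
float reconstruct_value(float normalized, track_range range)
{
    return normalized * range.extent + range.minimum;
}
</code></pre>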
<p>This increases the information we need to keep around by an extra tuple for every track. For example, a translation track would require a pair of Vector3 values (6 floats) to encode our 3D range.</p>
<p>This extra information allows us to increase our accuracy (sometimes dramatically). While both representations are 100% equivalent mathematically, this only holds true if we have infinite precision. In practice, single precision floating point values only have 6 to 9 significant decimal digits of precision. Typically, our original input range will be much larger than our normalized range and as such its precision will be lower.</p>
<p>This is most easily visible when our track needs to be quantized on a fixed number of bits. For example, our example translation track would typically be bounded by a large hardcoded number such as 100 centimetres. Our hardcoded theoretical track range thus becomes: <code class="language-plaintext highlighter-rouge">[-100, 100]</code>. Tracks with values outside of this range can’t be represented and might need a different base range or a raw un-quantized format. The above range is thus represented by the tuple: <code class="language-plaintext highlighter-rouge">(-100, 200)</code>. If we quantize this range on 16 bits, our input range of 200cm is evenly divided by 65536 which gives us a precision of: <code class="language-plaintext highlighter-rouge">200cm / 65536 = 0.003cm</code>. However, because we know the actual range is much smaller, using the same number of bits yields a precision of: <code class="language-plaintext highlighter-rouge">24cm / 65536 = 0.00037cm</code>. Our accuracy has increased by 8x!</p>
<p>In practice, rotation tracks are generally bounded by <code class="language-plaintext highlighter-rouge">[-PI, PI]</code> but animated bones will only use a very small portion of that range on any given clip. Similarly, translation tracks are typically bounded by some hardcoded range but only a small portion is ever used by most bones. This translates naturally to scale tracks as well.</p>
<p>A side effect of range reduction is that all tracks end up being normalized. This is important for the simple quantization compression algorithms as well as when wavelets are used on an aggregate of tracks (as opposed to per track). Future posts will go further into detail.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Range reduction allows us to focus our precision on the range of values that actually matters to us. This increased accuracy very often allows us to be more aggressive in our compression, reducing our overall size despite any extra overhead.</p>
<p>Do note that the extra range information does add up and for very short clips (e.g. 4 key frames) this extra overhead might yield a higher overall memory footprint. These are typically uncommon enough that it is simpler to accept the occasional loss in exchange for simpler decompression code that can assume all tracks use range reduction.</p>
<p><a href="/2016/11/10/anim_compression_uniform_segmenting/">Up next: Uniform Segmenting</a></p>
<p><a href="/2016/10/21/anim_compression_toc/"><strong>Back to table of contents</strong></a></p>
Animation Compression: Constant Tracks2016-11-03T00:00:00+00:00http://nfrechette.github.io/2016/11/03/anim_compression_constant_tracks<p>One of the primary advantages of storing bone transforms in local space as opposed to object space is that it dramatically reduces the amount of data that changes. For example, if we have two connected bones and their local rotations are animated, we end up with two animated tracks in local space, but three tracks in object space. This is because rotation changes on the root bone cause a translation offset on the second bone.</p>
<p>It is also the case that many bones are never animated by a majority of clips. For example, inverse kinematic (IK) bones are only ever animated when they are required (e.g. object pick up) and facial bones are typically used by cinematic clips or overlay clips. When bones aren’t animated, they will be by definition constant in our raw data.</p>
<p>Consequently, it is often the case that tracks are constant. In general, character animations animate many more rotation tracks than translation or scale tracks. As such, translation and scale tracks often end up entirely constant within a clip. Taking advantage of this can be done in one of two ways:</p>
<ul>
<li>Implicitly: linear key reduction will typically reduce such tracks to two keys since everything else can be interpolated (curve fitting and wavelets benefit similarly from this)</li>
<li>Explicitly: by using a bit set that marks tracks as constant or variable</li>
</ul>
<p>Typically, using a bit set and explicitly dropping constant tracks is the superior method to reduce the memory footprint but it does add a bit more complexity to the decompression algorithm. It is entirely worth it in my opinion.</p>
<p>Constant tracks will come in two forms:</p>
<ul>
<li>Constant and equal to the default track value (typically equal to the bind pose or identity)</li>
<li>Constant and equal to some arbitrary track value</li>
</ul>
<p>For obvious reasons, these tracks compress very well. In the first case, we only need to store a single bit per track. If a track is constant, we easily know which value it should have based on the track type (rotation, translation, or scale). In the second case, in addition to our single bit per track, we also need to store the single repeated key value. This single key can be stored raw since its overall memory footprint will remain small regardless, or it can be quantized on fewer bits.</p>
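<p>Here is a minimal sketch of how decompression might consume such a bit set; the data layout and names are assumptions for illustration:</p>
<pre><code class="language-cpp">#include &lt;cstddef&gt;
#include &lt;cstdint&gt;

// Returns true when the bit for 'track_index' is set in the constant bit set.
bool is_track_constant(const uint32_t* constant_bitset, size_t track_index)
{
    return ((constant_bitset[track_index / 32] &gt;&gt; (track_index % 32)) &amp; 1) != 0;
}

// Constant track values are stored once per clip while animated track
// values are decompressed for the current frame; two cursors walk both.
void unpack_tracks(const uint32_t* constant_bitset,
                   const float* constant_values,
                   const float* animated_values,
                   size_t num_tracks, float* out)
{
    size_t constant_cursor = 0;
    size_t animated_cursor = 0;
    for (size_t i = 0; i &lt; num_tracks; ++i)
    {
        if (is_track_constant(constant_bitset, i))
            out[i] = constant_values[constant_cursor++];
        else
            out[i] = animated_values[animated_cursor++];
    }
}
</code></pre>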
<p><em>My GDC presentation will include real numbers on the frequency of constant bones, constant tracks, and constant track components from a modern untitled AAA video game but I cannot include them here just yet.</em></p>
<p>In my experience, the number of constant tracks is closely correlated with the number of constant track components (e.g. a constant scale track is equivalent to three constant scale track components). As such, any extra gains we might get from using a bit set per track component (instead of per track) are likely to be offset by the larger bit set size, and the extra complexity will likely make decompression slower.</p>
<p><a href="{ post_url 2020-08-09-animation_data_numbers }">See also Animation Data in Numbers</a></p>
<p><a href="/2016/11/09/anim_compression_range_reduction/">Up next: Range Reduction</a></p>
<p><a href="/2016/10/21/anim_compression_toc/"><strong>Back to table of contents</strong></a></p>
Animation Compression: Measuring Accuracy2016-11-01T00:00:00+00:00http://nfrechette.github.io/2016/11/01/anim_compression_accuracy<p>Every compression algorithm covered herein is lossy in nature, so how we measure accuracy is critically important. Measured deviation from source data needs to be representative of the visual differences observed.</p>
<p>It is also important to compare algorithms against each other. It is surprising to learn that many academic papers on this topic use varying error metrics making meaningful comparison quite difficult.</p>
<p>Error measuring needs to meet a number of important criteria:</p>
<ul>
<li>Account for the hierarchical nature of the data since errors accumulate down the hierarchy when using local space transforms</li>
<li>Account for errors in all three tracks for every bone</li>
<li>Account for the fact that the visual mesh almost never overlaps with its skeleton and that the further away a vertex is from its bone, the larger the error ends up being</li>
</ul>
<h1 id="object-space-vs-local-space">Object Space vs Local Space</h1>
<p>For compression reasons, transforms are usually stored in local space and poses are blended in local space as well. However, due to the hierarchical nature of the data, a small error on a parent bone can cause a much larger error on a child bone further down. For this reason, errors are best measured in object space.</p>
<p>Surprisingly, it is quite common for compression algorithms to measure the error in local space instead, notably in many linear key reduction and curve fitting implementations.</p>
<h1 id="approximating-the-visual-error">Approximating the Visual Error</h1>
<p>Since a skeleton is generally never directly visible, errors show up on the visual mesh instead. This means the most accurate way to measure error is to perform skinning for every key frame and compare the vertices before and after compression.</p>
<p>Sadly, this would be terribly slow even with help from the GPU in many instances.</p>
<p>Instead, we can use virtual or fake vertices at a fixed distance from each bone. This is both intuitive and easy to achieve: for every object space bone transform, we simply apply the transform to a local vertex position. For example, a vertex at a fixed local offset of 10cm from its bone ends up in object space once the object space bone transform is applied. This step is done with both the source animation and the compressed animation, and we compare the distance between the two transformed vertices to measure the error.</p>
<p>For this to work, we need two virtual vertices per bone and they must be orthogonal to each other. Otherwise, if our single vertex happens to be co-linear with the bone’s rotation axis, the error could end up less visible or entirely invisible. For this reason, two vertices are used and we keep the maximum of the two error deltas. In practice, we would use something like <code class="language-plaintext highlighter-rouge">vec3(0, 0, distance)</code> and <code class="language-plaintext highlighter-rouge">vec3(0, distance, 0)</code>.</p>
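<p>A minimal sketch of this error metric, assuming object space transforms stored as row-major 3x4 affine matrices; all names are illustrative:</p>
<pre><code class="language-cpp">#include &lt;algorithm&gt;
#include &lt;cmath&gt;

struct float3 { float x, y, z; };

// Row-major 3x4 affine transform: rotation/scale in m[0..8], translation in m[9..11].
struct transform_3x4 { float m[12]; };

float3 transform_point(const transform_3x4&amp; t, float3 p)
{
    return {
        t.m[0] * p.x + t.m[1] * p.y + t.m[2] * p.z + t.m[9],
        t.m[3] * p.x + t.m[4] * p.y + t.m[5] * p.z + t.m[10],
        t.m[6] * p.x + t.m[7] * p.y + t.m[8] * p.z + t.m[11],
    };
}

float distance(float3 a, float3 b)
{
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// 'raw' and 'lossy' are the object space transforms of the same bone before
// and after compression; we keep the worst deviation of the two vertices.
float measure_bone_error(const transform_3x4&amp; raw, const transform_3x4&amp; lossy,
                         float virtual_distance)
{
    const float3 vtx0 = { 0.0f, 0.0f, virtual_distance };
    const float3 vtx1 = { 0.0f, virtual_distance, 0.0f };
    const float error0 = distance(transform_point(raw, vtx0), transform_point(lossy, vtx0));
    const float error1 = distance(transform_point(raw, vtx1), transform_point(lossy, vtx1));
    return std::max(error0, error1);
}
</code></pre>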
<p>This properly takes into account all three bone tracks and it approximately takes into account the fact that the visual mesh is always some distance away from the skeleton.</p>
<p>Which virtual distance you use is important. If it is much smaller than the distance of the real vertices, the error might still end up visible since the error increases with distance for rotation and scale. If the distance is much larger, we might end up retaining more accuracy than we otherwise need.</p>
<p>For most character bones a distance in the range of 2-10cm makes sense. Special bones like the root and camera might require higher accuracy and as such should probably use a distance closer to 1-10m.</p>
<p>Large animated objects also occasionally benefit from vertex to bone distances of 1-10m and either need extra bones added to reduce maximum distance, or the virtual distance needs to be adjusted correspondingly.</p>
<p>It’s also worth mentioning that we could probably extract the furthest skinned vertex per bone from the visual mesh and use that instead but keep in mind some bones might not have skinned vertices such as the root and camera bones.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Accuracy is tuned with a single intuitive number, the virtual vertex distance, and is measured with a single value, the object space distance.</p>
<p>Both Unity and Unreal measure accuracy in ways that are sub-optimal and fail to take into account all three properties outlined above. From what I have seen their techniques are also representative of a lot of game engines out there. Future posts will explore the solutions those game engines provide.</p>
<p>A number of compression algorithms, such as linear key reduction, use the error metric function to converge on a solution. If the metric is imprecise or poorly represents the true error, this will very likely be reflected in the end result. Many linear key reduction algorithms use a local space error metric which sometimes leads to important keys being removed, with the resulting error showing up on children far from the point of error. For example, a removed pelvis key can translate into a large error at the fingertip. The typical way to combat this side effect is to use overly conservative error thresholds but that translates directly into an increased memory footprint.</p>
<p>See also: <a href="/2023/02/26/dominance_based_error_estimation/">Dominance based error estimation</a></p>
<p><a href="/2016/11/03/anim_compression_constant_tracks/">Up next: Constant Tracks</a></p>
<p><a href="/2016/10/21/anim_compression_toc/"><strong>Back to table of contents</strong></a></p>
Animation Compression: Animation Data2016-10-27T00:00:00+00:00http://nfrechette.github.io/2016/10/27/anim_compression_data<p>Animation clips more or less always boil down to the same thing: a series of animated tracks. There might also be other pre-processed data (such as the total root displacement, etc.) but we won’t concern ourselves with this since their impact on compression is usually limited. Instead, we will focus on bone transform tracks: rotation, translation, and scale.</p>
<p>While we can represent all three tracks within a single affine 4x4 matrix, for compression purposes it is usually best to treat them separately. For one thing, a full column of the matrix doesn’t need to be compressed as it never changes and we can typically encode the rotation in a form much more compact than a 3x3 matrix.</p>
<h1 id="rotation">Rotation</h1>
<p>Rotations in general can be represented by many forms and choosing an appropriate representation is important for reducing our memory footprint while maintaining high accuracy.</p>
<p>Common representations include:</p>
<ul>
<li>3x3 matrix (9 float values)
<ul>
<li>wasteful, don’t use this for compression…</li>
</ul>
</li>
<li>quaternion (4 float values)
<ul>
<li>high accuracy, fast decompression, a bit larger than the others</li>
</ul>
</li>
<li>quaternion with dropped W component (3 float values)
<ul>
<li>W can be reconstructed with a square root since the quaternion can be flipped to keep W in the positive hemisphere (see the sketch after this list)</li>
</ul>
</li>
<li>quaternion with dropped largest component (3 float values + 2 bit)
<ul>
<li>similar to dropping W, but offers higher accuracy</li>
</ul>
</li>
<li>quaternion logarithm (3 float values)
<ul>
<li>this is equivalent to the rotation axis multiplied by the rotation angle around our axis</li>
</ul>
</li>
<li>axis & angle (4 float values)
<ul>
<li>a bit larger than most alternatives</li>
</ul>
</li>
<li>polar axis & angle (3 float values)
<ul>
<li>high accuracy but slower decompression</li>
</ul>
</li>
<li>Euler angles (3 float values)
<ul>
<li>best avoided due to <a href="https://en.wikipedia.org/wiki/Gimbal_lock">Gimbal lock</a> issues near the poles if quaternions are expected at runtime</li>
</ul>
</li>
</ul>
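<p>As an example of the tradeoffs involved, here is a minimal sketch of the dropped W reconstruction mentioned in the list above; the names are illustrative:</p>
<pre><code class="language-cpp">#include &lt;cmath&gt;

struct quat { float x, y, z, w; };

// Reconstruct W from a quaternion whose W component was dropped. Rotation
// quaternions are unit length and q and -q represent the same rotation,
// so we can always flip the quaternion such that W &gt;= 0 before dropping it.
quat reconstruct_w(float x, float y, float z)
{
    const float w_squared = 1.0f - (x * x + y * y + z * z);
    // Guard against small negative values introduced by quantization error.
    const float w = std::sqrt(w_squared &gt; 0.0f ? w_squared : 0.0f);
    return { x, y, z, w };
}
</code></pre>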
<p>Which representation is best depends in large part on what the animation runtime uses internally for blending clips. There are three common formats used at runtime:</p>
<ul>
<li>4x4 affine matrix</li>
<li>quaternion</li>
<li>quaternion logarithm</li>
</ul>
<p>If the compressed format doesn’t match the runtime format, a conversion will be required and it has an associated cost (typically sin & cos are involved).</p>
<p>Some formats offer higher accuracy than others at an equal number of bits. A separate post will go further in depth about their various error/size metrics.</p>
<p>Generally speaking, the formats with 3 float values are a good starting point to build compression on. Which of them you use might not have a very dramatic impact on the resulting memory footprint but I could be wrong. Obviously, using a format with 4 floats or more will result in an increased memory footprint but beyond that, it might not matter all that much.</p>
<p>Rotations are often animated but their range of motion is typically fairly small in any given clip and the total range is bounded by the unit sphere (-PI/PI for angles, and -1/+1 for axis/quaternion components).</p>
<p>Because rotations work on the unit sphere, the error needs to be measured on the circumference traced by the associated children. For example, if I have an error of 1 degree on my parent bone rotation, the error at the position of a short child bone will be less than that of a longer child bone: the arc length formed by a 1 degree rotation depends on the sphere radius. This needs to be taken into account by our error measuring function as we will see later in a separate post.</p>
<h1 id="translation">Translation</h1>
<p>Translations are typically encoded as 3 floats. Translations are generally not animated except for a few bones such as the pelvis. They tend to be constant and often match the bind pose translation. Sadly, they are typically not bounded as some bones can be animated in world space (e.g. root bone in a cinematic) or they can move very far from their parent bone (e.g. camera).</p>
<p>They could also be encoded as an axis and a length (4 floats) or in the polar equivalent (3 floats). Whether the polar form is a better tradeoff or not, I cannot say at this time without measuring the impact on the accuracy, memory footprint, and decompression speed. It might not work out so great for very large translation values (e.g. world space).</p>
<h1 id="scale">Scale</h1>
<p>Scale is not common in character animation but it does turn up. For our purposes we will only consider non-uniform scaling with 3 floats; uniform scaling with a single float would be a natural extension.</p>
<p>Scale is very rarely animated and as such it will generally match the bind pose scale. Unfortunately, the range of possible values is unbounded with the minor exception that a scale of 0.0 is generally invalid.</p>
<p>Because the scale affects the rotation part of a 4x4 affine matrix, the same issues pertaining to the error are present.</p>
<p>Similar to translations, it might be possible to encode it in axis/length form but whether or not this is a good idea remains unclear to me at this time.</p>
<h1 id="the-skeleton-hierarchy">The skeleton hierarchy</h1>
<p>Due to its hierarchical nature, we can infer a number of properties about our data:</p>
<ul>
<li>In local space, bones higher up in the hierarchy contribute more to the overall error</li>
<li>In local space, a parent bone and its children can be animated independently, leading to improved compression but reduced accuracy since any error will accumulate down the hierarchy</li>
<li>In object space, if a parent is animated, all children will in turn be animated as well but this also means their error does not propagate</li>
</ul>
<p>As we will see in a later post, these properties can be leveraged to reduce our overall error.</p>
<h1 id="bone-velocity">Bone velocity</h1>
<p>Generally, we will only compress sampled keys from our original source sequence and as such the velocity information is partially lost. However, it remains relevant for a number of reasons:</p>
<ul>
<li>High velocity sequences (or moments) will typically compress poorly and these are quite common when they come from an offline simulation (e.g. cloth or hair simulation) or motion capture (e.g. facial animation).</li>
<li>Smooth and low velocity clips will generally compress best and thankfully these are also the most common. Most hand authored sequences will fall in this category.</li>
<li>Most character animation sequences have low velocity motion for most bones (they move slowly and smoothly)</li>
<li>Many bone tracks have no velocity due to the fact that they are constant and it is also common for bones to be animated only for certain parts of a clip</li>
</ul>
<p>To properly adapt to these conditions, our compression algorithm needs to be adaptive in nature: it needs to use more bits when they are needed for high velocity moments, and use fewer bits when they aren’t needed. As we will see later, there are various techniques to tackle this.</p>
<p><a href="{ post_url 2020-08-09-animation_data_numbers }">See also Animation Data in Numbers</a></p>
<p><a href="/2016/11/01/anim_compression_accuracy/">Up next: Measuring Accuracy</a></p>
<p><a href="/2016/10/21/anim_compression_toc/"><strong>Back to table of contents</strong></a></p>
Animation Compression: Terminology2016-10-26T00:00:00+00:00http://nfrechette.github.io/2016/10/26/anim_compression_terminology<h1 id="animation-sequence-or-clip">Animation Sequence or Clip</h1>
<p>Central to character animation is our animation sequence. These are also commonly called animation clips and I will use both terms interchangeably.</p>
<p>A clip is composed of several bones and standalone tracks. This series will not focus on standalone tracks since they can typically be represented as fake bones and, regardless, all the techniques we will cover can compress them as a natural extension.</p>
<h1 id="skeleton">Skeleton</h1>
<p>Our character animation sequences will always play back over a rigid skeleton. The skeleton defines a number of bones and their hierarchical relationship (parent/child).</p>
<p>A skeleton has a well defined bind pose. The bind pose represents the default reference pose of the skeleton. For example, the pelvis bone would have a bind pose with a fixed translation offset of 60cm above the ground. The first spine bone would have a small offset above the parent pelvis bone, etc.</p>
<p>A skeleton has a root bone. The root bone is the bone highest in the hierarchy on which all other bones are parented. This is typically the bone that is animated to move characters about the world when the motion is animation driven.</p>
<p><img src="/public/unity_skeleton.jpg" alt="Hierarchical Skeleton" /></p>
<h1 id="bone">Bone</h1>
<p>A bone is commonly composed of at least 2 tracks: rotation and translation. The next post will go into further details about how these are represented in the data.</p>
<p>Also very common is for bones to have a scale value associated. When it is present, bones have a 3rd scale track.</p>
<p>All bones have exactly one parent bone (except the root bone which has none) and optional children.</p>
<h1 id="bone-transform">Bone transform</h1>
<p>A bone transform can be represented in a number of ways but for our intents and purposes, we can assume it is a 4x4 <a href="https://en.wikipedia.org/wiki/Affine_transformation">affine matrix</a>. These support rotation, translation, and non-uniform scale.</p>
<p>A bone transform can be either in object space or local space.</p>
<h1 id="track">Track</h1>
<p>A track is composed of 1+ keys. All tracks in a raw sequence will have the same number of keys. A sequence with a single key represents a static pose and consequently has an effective length in time of 0 seconds.</p>
<h1 id="track-key">Track key</h1>
<p>A track key is composed of 1+ components. For example, a translation track would generally have 3 components: X, Y, and Z.</p>
<h1 id="key-component">Key component</h1>
<p>A key component is a 32 bit float value in its raw form. In practice, standalone tracks could have any value or type (bool, int, etc.) but we will mostly focus on rotation, translation, and scale in this series and these will all use floats.</p>
<h1 id="key-interpolation">Key interpolation</h1>
<p>Key interpolation is the act of taking two neighbour keys and interpolating a new key at some time T in between. For example, if I have a key at time 0.0s and another at time 0.5s, I can trivially reconstruct a key at time 0.25s by interpolating my two keys.</p>
<p>Because video games are realtime, it is very rare for the game to sync up perfectly with animation sequences; as such, nearly every frame will end up interpolating between 2 keys. Consequently, when we sample our clips for playback, we will typically always sample 2 keys to interpolate in between.</p>
<p>Generally speaking, the interpolation will always be linear even for rotations. This is typically safe since keys are assumed to be somewhat close to one another such that linear interpolation will correspond approximately to the shortest path.</p>
<p>Note that some algorithms perform this interpolation implicitly when you sample them: linear key reduction, curve fitting, etc.</p>
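<p>A minimal sketch of interpolating a new key at some time between two neighbour keys:</p>
<pre><code class="language-cpp">// Interpolate a new key at 'time' between two neighbour keys.
// With key0 at 0.0s, key1 at 0.5s, and time at 0.25s, alpha is 0.5.
float interpolate_key(float key0, float time0, float key1, float time1, float time)
{
    const float alpha = (time - time0) / (time1 - time0);
    return key0 + (key1 - key0) * alpha;
}
</code></pre>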
<h1 id="object-space">Object space</h1>
<p>Bone transforms in object space are relative to the whole skeleton. For certain animation clips such as cinematics, they are sometimes animated in world space in which case object space will correspond to world space. When this distinction is relevant, we will explicitly mention it.</p>
<h1 id="local-space">Local space</h1>
<p>Bone transforms in local space are relative to their parent bone. For the root bone, the local space is equivalent to the object space since it has no parent.</p>
<p>Converting from local space to object space entails multiplying the object space transform of our parent bone by the local space transform of our current bone. Converting from object space to local space is the opposite and requires multiplying the inverse object space transform of our parent bone by the object space transform of our current bone.</p>
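<p>A minimal sketch of the local space to object space conversion over a whole pose, assuming parent bones are sorted before their children and using 4x4 affine matrices for simplicity; the names are illustrative:</p>
<pre><code class="language-cpp">#include &lt;cstddef&gt;
#include &lt;cstdint&gt;
#include &lt;vector&gt;

struct matrix4x4 { float m[4][4]; };

matrix4x4 mul(const matrix4x4&amp; a, const matrix4x4&amp; b)
{
    matrix4x4 r = {};
    for (int i = 0; i &lt; 4; ++i)
        for (int j = 0; j &lt; 4; ++j)
            for (int k = 0; k &lt; 4; ++k)
                r.m[i][j] += a.m[i][k] * b.m[k][j];
    return r;
}

void local_to_object_space(const std::vector&lt;matrix4x4&gt;&amp; local_pose,
                           const std::vector&lt;int32_t&gt;&amp; parent_indices,
                           std::vector&lt;matrix4x4&gt;&amp; object_pose)
{
    for (size_t i = 0; i &lt; local_pose.size(); ++i)
    {
        const int32_t parent = parent_indices[i];  // -1 for the root bone
        object_pose[i] = parent &lt; 0 ? local_pose[i]
                                    : mul(object_pose[parent], local_pose[i]);
    }
}
</code></pre>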
<h1 id="rotation">Rotation</h1>
<p>A rotation represents an orientation at a given point in time. It can be represented in many ways but the two most common are 3x3 matrices and quaternions. Other format exist and will be covered to some extent in future posts.</p>
<h1 id="translation">Translation</h1>
<p>A translation represents an offset at a given point in time. It is typically represented at a vector with 3 components.</p>
<h1 id="scale">Scale</h1>
<p>A scale is composed of either a single scalar value or a vector with 3 components. The latter is used to support non-uniform scaling in which skew and shear are possible. For our purposes, scale will always be a vector with 3 components.</p>
<h1 id="frame">Frame</h1>
<p>A frame represents a unit of time. Most games are played at 30 frames per second (FPS); as such, every frame has a length of time equal to 1.0s / 30.0 = 0.03333s.</p>
<p>A clip with 2 keys consequently has 1 unit of time that elapses in between. If the first key is at 0.0s and the second key is at 0.5s, we can reconstruct the key position at any time in between by linearly interpolating our keys. A clip with 11 keys has 10 frames, etc.</p>
<p>The frame rate of the game does not need to match the sample rate of animation clips.</p>
<p>Note that in the literature, the term ‘frame’ is sometimes used to mean a transform (e.g. frame of reference).</p>
<h1 id="sample-rate">Sample rate</h1>
<p>The sample rate dictates the frequency at which the keys are sampled in the original raw clip. A sample rate of 30 frames per second translates into 1 second of time being divided into 30 individual frames. For a sequence with a length of 1 second, the resulting clip will have 30 frames and 31 keys.</p>
<p>If our game runs at a higher frame rate than our sequence sample rate (e.g. 60 FPS), we will end up interpolating between 2 keys twice as often. If in turn our game runs slower than our sample rate (e.g. 15 FPS), some keys will be skipped during playback.</p>
<p>Common sample rates are: 15 FPS, 20 FPS, and 30 FPS.
Most games run at a frame rate of 30 FPS or 60 FPS.</p>
<h1 id="affine-transform">Affine transform</h1>
<p>An <a href="https://en.wikipedia.org/wiki/Affine_transformation">affine transform</a> is a 4x4 matrix that can represent simultaneously rotation, translation, and non-uniform scaling.</p>
<h1 id="quaternion">Quaternion</h1>
<p>A <a href="https://en.wikipedia.org/wiki/Quaternions_and_spatial_rotation">quaternion</a> is a 4D complex number that can efficiently and conveniently represent a rotation in 3D space. We will not go further in depth but I will do my best to explain the relevant bits as we encounter them. Since we will mostly use them for storage, we will do very little math involving them.</p>
<p>The most important thing to know about them is that for a quaternion to represent a rotation, it must be normalized (unit length). And obviously, it is made up of 4 scalar float numbers.</p>
<p><a href="/2016/10/27/anim_compression_data/">Up next: Animation Data</a></p>
<p><a href="/2016/10/21/anim_compression_toc/"><strong>Back to table of contents</strong></a></p>
Animation Compression: Preface2016-10-25T00:00:00+00:00http://nfrechette.github.io/2016/10/25/anim_compression_preface<p>This post is meant as a preface to explain the overall context of what is character animation, how we use it, and why it needs special consideration.</p>
<p>This series of posts will only cover the compression aspect of animation data, we will not look into everything else that goes into writing an animation runtime library such as Morpheme.</p>
<h1 id="modern-character-animation">Modern character animation</h1>
<p>For the purpose of this series, we will only refer to character animation sequences that play back over a rigid skeleton (also called a rig). In Maya (and other similar products), an animator will typically author the animation sequence from a set of curves by manipulating the skeleton to achieve the desired motion.</p>
<p><img src="/public/unity_curves.jpg" alt="Unity Animation Curves" /></p>
<p>This information is used at runtime to animate the rigid skeleton which in turn deforms the visual mesh bound to it. Note that animation sequences can be authored in 3 main ways: by hand, from motion capture, or procedurally (for example, a cloth simulation).</p>
<p>Central to all this is our rigid skeleton. The skeleton is composed of a number of bones, typically in the range of 100-200 for main characters (but this number can be much larger, up to 1000 in some cases). Some of these will almost always be animated (such as the pelvis), some are animated procedurally at runtime and thus never authored, while others are only animated in a small subset of animations such as facial bones.</p>
<p><img src="/public/unity_skeleton.jpg" alt="Hierarchical Skeleton" height="500" /></p>
<p>The actual process to generate the final skeleton pose will vary but it generally involves blending a number of animation sequences together. The number of poses blended will vary greatly at runtime, and for a main character it can often range from 6 to 20 poses. Although an interesting topic, we will not go into further detail.</p>
<p>The final visual mesh deformation process is commonly referred to as ‘skinning’. Depending on the video game engine, skinning is typically done either by interpolating 4x4 matrices or dual quaternions. The former is the most common but can yield some artifacts while the latter approach is the mathematically correct way to do skinning. Both have their pros and cons but we will not go further into detail since it is not immediately relevant to our topic at hand.</p>
<p><img src="/public/unity_mesh.jpg" alt="Skeleton & Visual Mesh" height="500" /></p>
<h1 id="how-is-it-used-in-aaa-video-games">How is it used in AAA video games?</h1>
<p>For a long time now (at least since the original PlayStation), 3D character animation has been used by a large array of video game titles. Over time the techniques used to author the animation sequences have evolved and the amount of content has gradually crept up. Today, it isn’t unusual for a AAA video game to have several thousands of individual character animations.</p>
<p>Animation sequences are used for all sort of things beyond just animating characters: they can animate cameras, vegetation, animals or critters, objects (e.g. door, chest), etc.</p>
<p>As a result of this, it is not unusual to end up having between 20-200MB worth of animations at runtime. On disk, the size is of course even larger.</p>
<h1 id="its-unique-requirements">Its unique requirements</h1>
<p>To generate a single image (or frame), it is not uncommon to end up sampling over 100 animation sequences. As such, it is critically important for this operation to be very fast. Typically on a modern console, we are targeting 50-100us or lower to sample a single sequence.</p>
<p>Due to the large amount of animation sequences, it is also very important for their size to remain small. This is both important on disk, which impacts our streaming speed, and in memory, which impacts the maximum amount of sequences we can hold.</p>
<p>As we will see later, not all animation sequences are created equal, as some will require higher accuracy than others.</p>
<p>These things must all be taken into account when designing or selecting a compression algorithm. The compression algorithms we will be discussing will all be lossy in nature as that is the most common industry practice.</p>
<p><em>Note: All images taken from Unity 5</em></p>
<p><a href="/2016/10/26/anim_compression_terminology/">Up next: Terminology</a></p>
<p><a href="/2016/10/21/anim_compression_toc/"><strong>Back to table of contents</strong></a></p>
Animation Compression: Table of Contents2016-10-21T00:00:00+00:00http://nfrechette.github.io/2016/10/21/anim_compression_toc<p>Character animation compression is an exotic topic that seems to need to be re-approached only about once per decade. Over the course of 2016 I’ve been developing a novel twist to an older technique and I was lucky enough to be invited to talk about it at the <a href="http://www.gdconf.com/">2017 Game Developers Conference (GDC)</a>. This gave me the sufficient motivation to write this series.</p>
<p>The amount of material available regarding animation compression is surprisingly thin and often poorly detailed. My hope for this series is to serve as a reference for anyone looking to improve upon or implement efficient character animation compression techniques.</p>
<p>Each post is self-contained and detailed enough to be approachable to a wide audience.</p>
<h2 id="table-of-contents">Table of Contents</h2>
<ul>
<li><a href="/2016/10/25/anim_compression_preface/">Preface</a></li>
<li><a href="/2016/10/26/anim_compression_terminology/">Terminology</a></li>
<li><a href="/2016/10/27/anim_compression_data/">Animation Data</a>
<ul>
<li><a href="/2020/08/09/animation_data_numbers/">Animation Data in Numbers</a></li>
</ul>
</li>
<li>Measuring Compression Accuracy
<ul>
<li><a href="/2016/11/01/anim_compression_accuracy/">Simulating mesh skinning</a></li>
<li><a href="/2023/02/26/dominance_based_error_estimation/">Dominance based error estimation</a></li>
</ul>
</li>
<li><a href="/2016/11/03/anim_compression_constant_tracks/">Constant Tracks</a></li>
<li><a href="/2016/11/09/anim_compression_range_reduction/">Range Reduction</a></li>
<li><a href="/2016/11/10/anim_compression_uniform_segmenting/">Uniform Segmenting</a></li>
<li><a href="/2016/11/13/anim_compression_families/">Main Compression Families</a>
<ul>
<li><a href="/2016/11/15/anim_compression_quantization/">Simple Quantization</a></li>
<li><a href="/2017/03/12/anim_compression_advanced_quantization/">Advanced Quantization</a></li>
<li><a href="/2016/11/17/anim_compression_sub_sampling/">Sub-sampling</a></li>
<li><a href="/2016/12/07/anim_compression_key_reduction/">Linear Sample Reduction</a>
<ul>
<li><a href="/2019/07/23/pitfalls_linear_reduction_part1/">Pitfalls of linear sample reduction: Part 1</a></li>
<li><a href="/2019/07/25/pitfalls_linear_reduction_part2/">Pitfalls of linear sample reduction: Part 2</a></li>
<li><a href="/2019/07/29/pitfalls_linear_reduction_part3/">Pitfalls of linear sample reduction: Part 3</a></li>
<li><a href="/2019/07/31/pitfalls_linear_reduction_part4/">Pitfalls of linear sample reduction: Part 4</a></li>
</ul>
</li>
<li><a href="/2016/12/10/anim_compression_curve_fitting/">Curve Fitting</a></li>
<li><a href="/2016/12/19/anim_compression_signal_processing/">Signal Processing</a></li>
</ul>
</li>
<li><a href="/2016/12/22/anim_compression_error_compensation/">Error Compensation</a></li>
<li>Bind Pose Optimizations
<ul>
<li><a href="/2018/05/08/anim_compression_additive_bind/">Bind Pose Additive</a></li>
<li><a href="/2022/01/23/anim_compression_bind_pose_stripping/">Bind Pose Stripping</a></li>
</ul>
</li>
<li><a href="/2020/08/11/clip_metadata_packing/">Clip metadata packing</a></li>
<li><a href="/2020/05/04/morph_target_compresion/">Morph Target Animation Compression</a></li>
<li><a href="/2021/01/17/progressive_animation_streaming/">Progressive Animation Streaming</a></li>
<li><a href="/2022/04/03/anim_compression_looping/">Compressing looping animations</a></li>
<li><a href="/2022/06/05/anim_compression_rounding_time/">Manipulating the sampling time for fun and profit</a></li>
<li><a href="/2016/12/23/anim_compression_case_studies/">Case Studies</a>
<ul>
<li><a href="/2017/01/11/anim_compression_unreal4/">Unreal 4</a></li>
<li><a href="/2017/01/30/anim_compression_unity5/">Unity 5</a></li>
</ul>
</li>
<li><a href="/2017/03/08/anim_compression_gdc2017/">GDC 2017 Presentation</a></li>
</ul>
<h2 id="acl-an-open-source-solution">ACL: An open source solution</h2>
<p>The lack of open source implementations of the above algorithms and topics prompted me to start my own. See all the code and much more over on GitHub for the <a href="https://github.com/nfrechette/acl">Animation Compression Library</a> as well as its <em>Unreal Engine</em> <a href="https://github.com/nfrechette/acl-ue4-plugin">plugin</a>!</p>
Memory Allocators: Table of Contents2016-10-18T00:00:00+00:00http://nfrechette.github.io/2016/10/18/memory_allocators_toc<p>It is perhaps a bit late but I am starting to have quite a few posts on the topic of memory allocators and navigation was beginning to be a bit complex.</p>
<p>I thought it would be a good idea to have a central place that links everything together.</p>
<p>All the relevant code for this lives under <a href="/2015/05/03/introducing_gin/">Gin</a> (<a href="https://github.com/nfrechette/gin">code</a>) on GitHub.</p>
<h1 id="table-of-content">Table of Content</h1>
<ul>
<li>Linear allocators
<ul>
<li><a href="/2015/05/21/linear_allocator/">Classic linear allocator</a></li>
<li><a href="/2015/06/11/vmem_linear_allocator/">Virtual memory aware linear allocator</a></li>
<li>Thread safe linear allocator</li>
</ul>
</li>
<li><a href="/2016/05/08/stack_frame_allocators/">Stack frame allocators</a>
<ul>
<li><a href="/2016/05/09/greedy_stack_frame_allocator/">Greedy stack allocator</a></li>
<li><a href="/2016/10/17/vmem_stack_frame_allocator/">Virtual memory aware stack allocator</a></li>
</ul>
</li>
<li>Circular allocators
<ul>
<li>Classic circular allocator</li>
<li>Circular frame allocator</li>
</ul>
</li>
<li>Miscellaneous
<ul>
<li><a href="/2015/06/25/out_of_memory/">Dealing with being out of memory</a></li>
</ul>
</li>
</ul>
Virtual Memory Aware Stack Frame Allocator2016-10-17T00:00:00+00:00http://nfrechette.github.io/2016/10/17/vmem_stack_frame_allocator<p>Today we cover the second variant: the virtual memory aware stack frame allocator (<a href="https://github.com/nfrechette/gin/blob/master/include/gin/vmem_stack_frame_allocator.h">code</a>). This is a fairly popular incarnation of the <a href="/2016/05/08/stack_frame_allocators/">stack frame allocator pattern</a> and I have seen quite a few implementations in the wild in line with this toy implementation. Similar in spirit to the <a href="/2015/06/11/vmem_linear_allocator/">virtual memory aware linear allocator</a>, here again we gain in simplicity and performance by leveraging the virtual memory system.</p>
<h3 id="how-it-works">How it works</h3>
<p>Not a whole lot differs from the previous implementation aside from the fact that we are no longer segmented. In AAA video games, it is quite common for these allocators to have a known (or easily approximated) upper bound and as such we can simplify the implementation quite a bit by having a single segment. To avoid wasting memory, we simply reserve a virtual address range and commit/decommit memory as needed.</p>
<p>With a single segment, a lot of logic is removed and simplified. When we allocate, we commit more memory gradually in blocks of our page size. In a real implementation, this block size would likely be larger than the page size; 64KB would be a decent size to use.</p>
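<p>A minimal sketch of the commit-on-demand allocation path; <code class="language-plaintext highlighter-rouge">commit_memory</code> stands in for a platform wrapper over VirtualAlloc/mmap, alignment handling is omitted, and the commit granularity is assumed to be a power of two. The linked implementation is the reference; this is only illustrative:</p>
<pre><code class="language-cpp">#include &lt;cstddef&gt;
#include &lt;cstdint&gt;

bool commit_memory(void* address, size_t size);  // assumed platform wrapper

struct vmem_stack_allocator
{
    uint8_t* base;               // start of the reserved virtual range
    size_t   allocated;          // current top of the stack
    size_t   committed;          // how much of the range is backed by pages
    size_t   commit_granularity; // e.g. 64KB, assumed power of two
};

void* allocate(vmem_stack_allocator&amp; alloc, size_t size)
{
    const size_t new_top = alloc.allocated + size;
    if (new_top &gt; alloc.committed)
    {
        // Round up to the commit granularity and commit more pages.
        const size_t needed = new_top - alloc.committed;
        const size_t to_commit = (needed + alloc.commit_granularity - 1)
                                 &amp; ~(alloc.commit_granularity - 1);
        if (!commit_memory(alloc.base + alloc.committed, to_commit))
            return nullptr;  // out of memory
        alloc.committed += to_commit;
    }
    void* result = alloc.base + alloc.allocated;
    alloc.allocated = new_top;
    return result;
}
</code></pre>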
<p>Popping is very easy and fast and it never causes memory to decommit. This is done to avoid repeated commit/decommit calls and having to manage some hysteresis. In practice, these allocators often have a known fixed lifetime. For example, we might have an instance of the allocator for main thread allocations during a single frame. As such, it is easy at the end of the frame to decommit once and retain some amount of slack. It is also very common to have 2 or 3 instances for rendering purposes and while they might be double (or triple) buffered, the principle remains.</p>
<h3 id="what-can-we-use-it-for">What can we use it for</h3>
<p>Similar to the classic segmented implementation, this variant is almost a drop in replacement. It is best suited when the upper bound on the usage is known. It would not be unusual for this implementation to live alongside the segmented implementation: during development, the segmented version is used such that it never fails, and closer to releasing the product, the upper bound is calculated and the virtual memory aware variant is used instead.</p>
<h3 id="what-we-cant-use-it-for">What we can’t use it for</h3>
<p>For obvious reasons, this variant is not ideal when the maximum memory footprint is not known or if it needs to grow unbounded. While on a 64 bit system you could technically reserve 4TB of address space and it would likely be enough, in practice another variant might be a better fit.</p>
<h3 id="edge-cases">Edge cases</h3>
<p>Unlike the classic greedy variant, popping will not decommit memory; as such, if memory pressure is high, special care needs to be taken to avoid issues. Under some circumstances, if an out of memory situation arises with some other allocator, decommitting the slack might allow you to retry the allocation.</p>
<h3 id="performance">Performance</h3>
<p>The performance of this allocator is very good. Allocation is <em>O(1)</em> and deallocation is <em>O(1)</em>. Both operations are very cheap if no call to the kernel is required to commit memory.</p>
<h3 id="conclusion">Conclusion</h3>
<p>This is our second usable allocator and its usage is quite common. Single segment implementations might not always be virtual memory aware, but they remain the same in nature. Many AAA video game titles have been released with very similar implementations to great success.</p>
<p>Next up, we will revisit the linear memory allocator family to cover an important addition: the thread safe linear allocator. This will be our first thread safe implementation and it is a variant that is very common in multithreaded rendering code.</p>
<p><a href="https://www.reddit.com/r/programming/comments/57y0iv/memory_allocators_explained_the_virtual_memory/">Reddit thread</a></p>
<p><a href="/2016/10/18/memory_allocators_toc/"><strong>Back to table of contents</strong></a></p>
Greedy Stack Frame Allocator2016-05-09T00:00:00+00:00http://nfrechette.github.io/2016/05/09/greedy_stack_frame_allocator<p>Today we cover the first stack frame allocator variant: the greedy stack frame allocator (<a href="https://github.com/nfrechette/gin/blob/master/include/gin/stack_frame_allocator.h">code</a>). It takes the term greedy from how it manages free segments after popping a frame. To better understand how this allocator works, you should first read up on the <a href="/2016/05/08/stack_frame_allocators/">stack frame allocator family</a>.</p>
<h3 id="how-it-works">How it works</h3>
<p>This allocator uses memory segments chained together. When an allocation is made, we find a free segment that can accommodate our allocation request. The first segment we check is the current live segment (the segment last used). Should it be full or the remaining space too small, we then look in the free segment list and pick (or allocate) a suitable candidate. Finally, we update our current live segment to the new used segment.</p>
<p>A segment behaves much like our <a href="/2015/05/21/linear_allocator/">linear allocator</a> previously seen (also commonly called a memory region). Allocation is very cheap since it only involves bumping a pointer forward.</p>
<p>When we pop a frame, all segments that were used and are no longer required or alive are appended to a free list. This is where the allocator is greedy: it will only free the segments once the allocator is destroyed. The upside of this is that we will rarely need to allocate new segments, which can involve an expensive malloc/kernel call.</p>
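<p>A minimal sketch of the pop logic illustrates the greedy behaviour. The names and layout are illustrative; the real implementation linked above differs in its details:</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">// Illustrative sketch of the greedy pop; segments are recycled, never freed
// until the allocator itself is destroyed.
#include <cstddef>

struct Segment
{
    Segment* m_previous;        // the segment that was live before this one
    size_t   m_allocatedSize;   // bump offset within this segment
    size_t   m_segmentSize;
};

struct GreedyStackFrameAllocator
{
    Segment* m_liveSegment;     // segment we currently bump allocate from
    Segment* m_freeSegments;    // the greedy free list

    void PopFrame(Segment* frameSegment, size_t frameOffset)
    {
        // Every segment used since the frame was pushed goes back on the free list
        while (m_liveSegment != frameSegment)
        {
            Segment* segment = m_liveSegment;
            m_liveSegment = segment->m_previous;

            segment->m_allocatedSize = 0;       // reset for re-use
            segment->m_previous = m_freeSegments;
            m_freeSegments = segment;
        }

        // Restore the live segment offset the frame was pushed with
        m_liveSegment->m_allocatedSize = frameOffset;
    }
};</code></pre></figure>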
<p>Supporting <code class="language-plaintext highlighter-rouge">realloc</code> is trivial and works as expected.</p>
<p>The allocator also supports supplying it with pre-allocated segments that are externally managed. This is handy if you want to prime it with a larger segment which will thus become the minimum managed size. For example, you might know that the allocator usually averages around 2MB; instead of letting it allocate segments of 64KB on demand, you register a single 2MB segment at initialization and, if it later needs more memory, it will allocate 64KB segments on demand.</p>
<h3 id="what-can-we-use-it-for">What can we use it for</h3>
<p>The greedy nature of the algorithm makes it ideal when the upper bound on memory usage is known or at the very least manageable and we can afford to keep the memory allocated. This is suitable for many applications and is indeed used in a similar form in Unreal 3 under the name of <code class="language-plaintext highlighter-rouge">FMemStack</code>. Many AAA games have shipped making extensive use of it.</p>
<h3 id="what-we-cant-use-it-for">What we can’t use it for</h3>
<p>The greedy nature makes it far less ideal when the upper bound isn’t known ahead of time or is possibly unbounded.</p>
<h3 id="edge-cases">Edge cases</h3>
<p>Again, the usual edge cases here involve overflow caused by the size or alignment requested and must be carefully dealt with. Another edge case specific to this allocator remains: because of the linear allocation nature of the algorithm, live segments might not be 100% used and will naturally keep some unused slack. Generally speaking this isn’t such a big deal but it can trivially be tracked if required.</p>
<h3 id="potential-optimizations">Potential optimizations</h3>
<p>Here again we can omit the overflow checks. In practice they are rarely required and could be left to assert instead. This is generally the approach taken in AAA game implementations.</p>
<p>Reallocate support is optional here as well and could be stripped if needed.</p>
<p>The implementation uses a <code class="language-plaintext highlighter-rouge">FrameDescription</code> in order to keep track of the order in which the frames are pushed to ensure we pop them back in reverse order. This is entirely for safety and in practice it could be omitted in a shipping build.</p>
<h3 id="performance">Performance</h3>
<p>The performance of this allocator is very good. Allocation is <em>O(1)</em> and deallocation is a noop. Frame pushing is <em>O(1)</em> and popping is <em>O(n)</em> where <em>n</em> is the number of live segments freed. All of these operations are very cheap.</p>
<p>Frames are very compact and their footprint is split between <code class="language-plaintext highlighter-rouge">AllocatorFrame</code> and a frame description allocated internally.</p>
<p>Segments have a small header (at most 24 bytes) required to chain them as well as to keep track of the allocated size, segment size, and some flags.</p>
<p>Generally, a stack frame allocator will often manage less than 4GB of memory; as such, it can be templated to use <code class="language-plaintext highlighter-rouge">uint32_t</code> internally. This yields a footprint of 40 bytes for the allocator on 64 bit platforms (it fits within a single cache line). This ensures that when we allocate, only 2 cache lines will usually be touched (the allocator instance and the live segment header).</p>
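<p>A hypothetical sketch of what such templating might look like; the member layout shown is illustrative only, not the actual one:</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">// Illustrative layout only; the real allocator members differ, but the idea
// is the same: with 32 bit sizes the whole instance fits within a cache line.
#include <cstdint>

struct Segment;     // as in the sketch above

template<typename SizeType>
struct StackFrameAllocatorT
{
    Segment* m_liveSegment;       // 8 bytes on 64 bit platforms
    Segment* m_freeSegments;      // 8 bytes
    Segment* m_externalSegments;  // 8 bytes
    SizeType m_defaultSegmentSize;
    SizeType m_liveSegmentOffset;
    SizeType m_frameCount;
    SizeType m_flags;             // 4 x 4 bytes with uint32_t
};  // 40 bytes total when SizeType is uint32_t

// Manages up to 4GB with one cache line worth of allocator state
typedef StackFrameAllocatorT<uint32_t> StackFrameAllocator32;</code></pre></figure>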
<h3 id="conclusion">Conclusion</h3>
<p>This is our first usable allocator and it is a powerful one. On older, slower hardware where taking locks and using a general malloc/free was expensive, I used a thread local stack allocator to great success in the performance sensitive portions of the code. Its simplicity and flexibility make it ideal to replace allocators of containers that end up on the stack and might often reallocate memory.</p>
<p>Next up, we will cover a <a href="/2016/10/17/vmem_stack_frame_allocator/">virtual memory aware variant</a> better suited for high performance AAA video games where the upper bound is generally known or can be reliably approximated. This second variant will be the last variant we will cover in this allocator family.</p>
<p>Note that it is often desirable to prevent the allocator from growing unbounded in memory in the presence of spikes; a common way to deal with this is to free some segments when we pop frames by keeping a free list with a small number of entries (slack). This is easily achieved with a hysteresis constraint. Alternatively, at a specific point in the code, you could call a function on the allocator to free some of its slack (e.g: at the start or end of a frame).</p>
<p><a href="https://www.reddit.com/r/programming/comments/4ihr38/memory_allocators_explained_the_greedy_stack/">Reddit thread</a></p>
<p><a href="/2016/10/18/memory_allocators_toc/"><strong>Back to table of contents</strong></a></p>
Stack Frame Allocators2016-05-08T00:00:00+00:00http://nfrechette.github.io/2016/05/08/stack_frame_allocators<p>This allocator family is quite possibly one of the most common memory management techniques.</p>
<p>Note that many variants of this allocator are possible. As such we will only cover a few to give a general overview of what it can look like. Actual production versions in the wild will often be custom and tailored to the specific application needs. Unlike our <a href="/2015/05/21/linear_allocator/">linear allocator</a> previously covered, this allocation technique is very common and in use in many applications.</p>
<h3 id="how-it-works">How it works</h3>
<p>A topic that comes very early when introducing many programming languages is the call stack. It is central to C, C++, and many other languages. The execution stack used for this mechanic operates as a stack frame based memory management algorithm which is handled by the compiler and the hardware (on x86 and x64).</p>
<p>Every C++ programmer is familiar with the stack and all it has to offer. Each function call pushes a new frame on the stack when it enters and later pops it when it exits. In this context, a frame is a section of memory that holds our local function variables that cannot be stored in registers.</p>
<p>Our stack frame allocator works in much the same way except that pushing and popping of frames is made explicit. This makes it a very powerful and familiar tool.</p>
<p>Our frame based allocators will use a common <a href="https://github.com/nfrechette/gin/blob/master/include/gin/allocator_frame.h">AllocatorFrame</a> class and the usage is simple:</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="n">AllocatorFrame</span> <span class="nf">frame</span><span class="p">(</span><span class="n">allocatorInstance</span><span class="p">);</span>
<span class="c1">// Allocate here with ‘allocatorInstance’</span></code></pre></figure>
<p>Popping of the frame is automatic when the destructor is called or it can be popped manually with <code class="language-plaintext highlighter-rouge">AllocatorFrame::Pop()</code>.</p>
<p>In a frame based allocator such as this, when the frame is popped, all allocations made within our frame can be freed and the memory reclaimed. Different allocator variants will handle this differently as we will later see. The bottom line is that you should not keep pointers to that memory since it is no longer valid. On the plus side, freeing memory is very fast as a bulk operation.</p>
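<p>To illustrate the lifetime rules, here is a short hypothetical example. The allocator type name and the <code class="language-plaintext highlighter-rouge">Allocate</code> entry point are stand-ins for the real API:</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">// Hypothetical usage sketch; frames nest and pop in LIFO order.
void Update(StackFrameAllocator& allocator)
{
    AllocatorFrame outerFrame(allocator);

    void* frameData = allocator.Allocate(4 * 1024);   // lives until outerFrame pops

    {
        AllocatorFrame innerFrame(allocator);
        void* scratch = allocator.Allocate(16 * 1024);
        // ... use scratch ...
    }   // innerFrame pops here: 'scratch' is reclaimed in bulk

    // 'frameData' is still valid here, 'scratch' is not
}   // outerFrame pops here: 'frameData' is reclaimed</code></pre></figure>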
<h3 id="what-can-we-use-it-for">What can we use it for</h3>
<p>It allows us to allocate memory dynamically and return it from a function. This is very common and handy when we do not know the maximum number of potential results returned (maybe we have a rare worst case scenario) and we want to avoid growing the call stack.</p>
<p>It supports <code class="language-plaintext highlighter-rouge">realloc</code> for the topmost allocation, useful when appending to vectors.</p>
<p>When using a thread local allocator, we can avoid expensive heap accesses and gain performance through reduced cache and TLB usage. Execution stack memory is often forced to use 4KB pages (this is often controlled by the kernel) but our parallel stack can use any page size it wants (when such control is available to user space).</p>
<p>Moving large allocations (such as strings) off the call stack reduces the risks of a malicious user causing the return address to be overwritten and abused in the presence of buffer overflow bugs. It can trivially replace <code class="language-plaintext highlighter-rouge">alloca</code> and everything it is used for.</p>
<p>Depending on the variant, it can access all system memory unlike our linear allocator.</p>
<p>Because it is often faster than a generalized heap allocator (potentially shared between multiple threads), if our allocated data is thread local (e.g: a vector on the stack), we can trivially migrate the allocations to use our frame allocator and save on allocation, deallocation, and reallocation.</p>
<h3 id="what-we-cant-use-it-for">What we can’t use it for</h3>
<p>Much like our linear allocator, stack frame allocators do not support generalized deallocation of memory. This is largely mitigated by the fact that freeing is still supported but now happens in a last in/first out order due to our stack frames.</p>
<h3 id="performance">Performance</h3>
<p>Performance of stack frame based allocators is generally very good due in large part to their simplicity when allocating and the bulk free operation. It is very common to have a stack frame allocator per thread, which also avoids a lock since they generally do not need to be shared between threads.</p>
<p>Generally speaking, the performance is better than that of a generalized heap allocator because by design we use a smaller virtual memory range, or we guarantee that we can re-use a previously used range, keeping our TLB usage optimal. This also helps the CPU cache.</p>
<h3 id="conclusion">Conclusion</h3>
<p>Up next we will cover a simple variant: <a href="/2016/05/09/greedy_stack_frame_allocator/">the greedy stack frame allocator</a>.</p>
<h3 id="alternate-names">Alternate names</h3>
<p>This memory allocator is often abbreviated to simply ‘Stack Allocator’. So many variants are possible that there are probably just as many names.</p>
<p><em>Note that if you know a better name or alternate names for this allocator, feel free to contact me.</em></p>
<p><a href="https://www.reddit.com/r/programming/comments/4ie1xv/memory_allocators_explained_the_stack_frame/">Reddit thread</a></p>
<p><a href="/2016/10/18/memory_allocators_toc/"><strong>Back to table of contents</strong></a></p>
GPGPU woes part 42016-04-14T00:00:00+00:00http://nfrechette.github.io/2016/04/14/gpgpu_woes_part_4<p>After fixing the <a href="/2015/07/11/gpgpu_woes_part_3/">previous</a> addressing issue, we move on to the next.</p>
<h3 id="inconsistencies-and-limitations">Inconsistencies and limitations</h3>
<p>Here are the outputs for the various OpenCL driver values on my laptop:</p>
<h4 id="windows-81-x64">Windows 8.1 x64</h4>
<ul>
<li>
<p>Device: Intel(R) Iris(TM) Graphics 5100</p>
<ul>
<li>Max workgroup size: 512</li>
<li>Num compute units: 40</li>
<li>Local mem size: 65536</li>
<li>Global mem size: 1837105152</li>
<li>Is memory unified: true</li>
<li>Cache line size: 64</li>
<li>Global cache size: 2097152</li>
<li>Address space size: 64</li>
<li>Kernel local workgroup size: 512</li>
<li>Kernel local mem size: 0</li>
</ul>
</li>
<li>
<p>Device: Intel(R) Core(TM) i5-4278U CPU @ 2.60GHz</p>
<ul>
<li>Max workgroup size: 1024</li>
<li>Num compute units: 4</li>
<li>Local mem size: 32768</li>
<li>Global mem size: 1837105152</li>
<li>Is memory unified: true</li>
<li>Cache line size: 64</li>
<li>Global cache size: 262144</li>
<li>Address space size: 64</li>
<li>Kernel local workgroup size: 1024</li>
<li>Kernel local mem size: 12288</li>
</ul>
</li>
</ul>
<h4 id="os-x-x64">OS X x64</h4>
<ul>
<li>
<p>Device: Iris</p>
<ul>
<li>Max workgroup size: 512</li>
<li>Num compute units: 40</li>
<li>Local mem size: 65536</li>
<li>Global mem size: 1610612736</li>
<li>Is memory unified: true</li>
<li>Cache line size: 0</li>
<li>Global cache size: 0</li>
<li>Address space size: 64</li>
<li>Kernel local workgroup size: 512</li>
<li>Kernel local mem size: 30720</li>
</ul>
</li>
<li>
<p>Device: Intel(R) Core(TM) i5-4278U CPU @ 2.60GHz</p>
<ul>
<li>Max workgroup size: 1024</li>
<li>Num compute units: 4</li>
<li>Local mem size: 32768</li>
<li>Global mem size: 0</li>
<li>Is memory unified: true</li>
<li>Cache line size: 3145728</li>
<li>Global cache size: 64</li>
<li>Address space size: 64</li>
<li>Kernel local workgroup size: 1</li>
<li>Kernel local mem size: 0</li>
</ul>
</li>
</ul>
<p>As we can see, many things differ and both drivers seem to report bad values for a few things.</p>
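<p>For reference, values like these are obtained through standard OpenCL queries, roughly as follows:</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">// Standard OpenCL queries for the values listed above.
#include <CL/cl.h>   // <OpenCL/opencl.h> on OS X

void print_device_caps(cl_device_id device, cl_kernel kernel)
{
    size_t max_workgroup_size = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(max_workgroup_size), &max_workgroup_size, NULL);

    cl_uint num_compute_units = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(num_compute_units), &num_compute_units, NULL);

    cl_ulong local_mem_size = 0;
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(local_mem_size), &local_mem_size, NULL);

    cl_bool unified_memory = CL_FALSE;
    clGetDeviceInfo(device, CL_DEVICE_HOST_UNIFIED_MEMORY, sizeof(unified_memory), &unified_memory, NULL);

    cl_uint address_bits = 0;
    clGetDeviceInfo(device, CL_DEVICE_ADDRESS_BITS, sizeof(address_bits), &address_bits, NULL);

    // Kernel specific values come from a different query
    size_t kernel_workgroup_size = 0;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(kernel_workgroup_size), &kernel_workgroup_size, NULL);

    cl_ulong kernel_local_mem_size = 0;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE, sizeof(kernel_local_mem_size), &kernel_local_mem_size, NULL);
}</code></pre></figure>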
<p>An important difference between the CPU and GPU is that CPU local storage is limited to 32KB. But why? Couldn’t the CPU variant simply consider local storage to be the same as normal memory and thus only be bounded by virtual (or physical) memory?</p>
<p>This seemingly arbitrary limitation makes a lot of sense when you consider the fact that local storage is supposed to be used to avoid touching main memory by keeping intermediate results in fast memory. The <a href="http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html">CUDA programming guide</a> states that they <a href="http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability-2-x">expose functions</a> to let you tweak how the GPU L1 is split between real L1 usage and local storage; and L1 is fast, really fast.</p>
<p>Sadly, on the CPU we usually have no such control over how the L1 is used, and it is also quite small: usually 32KB for code and 32KB for data. Note that some exotic platforms do allow you some minimal control over the CPU caches, such as the Wii cache line locking instructions. However, generally speaking, it is very hard to control effectively by hand.</p>
<p>With most CPUs (including the one I am using) having very small L1 caches, it makes sense to limit local storage to the size of the L1. After all, if you design an algorithm around the use of very fast memory, performance could be adversely affected should you have less available in reality.</p>
GPGPU woes part 32015-07-11T00:00:00+00:00http://nfrechette.github.io/2015/07/11/gpgpu_woes_part_3<p>After fixing the <a href="/2015/07/07/gpgpu_woes_part_2/">previous</a> mind bending issue, we move on to the next issue.</p>
<h3 id="on-the-gpu-the-address-space-size-can-differ">On the GPU, the address space size can differ</h3>
<p>Virtually every programmer out there knows that the address space size on the CPU can differ based on the hardware, executable, and kernel. The two most common sizes are 32 bits and 64 bits (the latter partially true since in practice only 48 bits are used for now). Some older programmers will also remember the days of 16 bit address spaces.</p>
<p>What few programmers realize is that other chips in their computer might have a different address space size; as such, sharing binary structures that contain pointers with them is generally unsafe.</p>
<h3 id="case-in-point-my-gpu-has-a-64-bits-address-space">Case in point: my GPU has a 64 bits address space.</h3>
<p>What this means is that if I run an i386 executable in either OS X or Windows, the pointer size will be 4 bytes on the CPU but the GPU will use 8 bytes for pointers. This means that shared structures like this one will not work:</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="k">struct</span> <span class="nc">dom_node</span>
<span class="p">{</span>
<span class="k">struct</span> <span class="nc">dom_node</span> <span class="o">*</span><span class="n">parent</span><span class="p">;</span>
<span class="n">cl_int</span> <span class="n">id</span><span class="p">;</span>
<span class="n">cl_int</span> <span class="n">tag_name</span><span class="p">;</span>
<span class="n">cl_int</span> <span class="n">class_count</span><span class="p">;</span>
<span class="n">cl_int</span> <span class="n">first_class</span><span class="p">;</span>
<span class="n">cl_int</span> <span class="n">style</span><span class="p">[</span><span class="n">MAX_STYLE_PROPERTIES</span><span class="p">];</span>
<span class="p">};</span></code></pre></figure>
<p>The above code caused reads and writes of memory far past the buffer that contained these structures when running on the GPU. This caused the output to differ from the expected result of the CPU version.</p>
<p>In particular, it ended up corrupting read-only buffers that the kernel was using (the read-only stylesheet) causing further confusion since the output differed from run to run. I was lucky enough that I didn’t end up corrupting anything else more critical as it could have made debugging the issue much harder (e.g: driver crash).</p>
<p>The fix is to ensure there is proper padding by wrapping the pointer in a union:</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="k">struct</span> <span class="nc">dom_node</span>
<span class="p">{</span>
<span class="k">union</span>
<span class="p">{</span>
<span class="k">struct</span> <span class="nc">dom_node</span> <span class="o">*</span><span class="n">parent</span><span class="p">;</span>
<span class="n">cl_ulong</span> <span class="n">pad_parent</span><span class="p">;</span>
<span class="p">};</span>
<span class="n">cl_int</span> <span class="n">id</span><span class="p">;</span>
<span class="n">cl_int</span> <span class="n">tag_name</span><span class="p">;</span>
<span class="n">cl_int</span> <span class="n">class_count</span><span class="p">;</span>
<span class="n">cl_int</span> <span class="n">first_class</span><span class="p">;</span>
<span class="n">cl_int</span> <span class="n">style</span><span class="p">[</span><span class="n">MAX_STYLE_PROPERTIES</span><span class="p">];</span>
<span class="p">};</span></code></pre></figure>
<p>OpenCL has integer types with the <code class="language-plaintext highlighter-rouge">cl_</code> prefix in order to ensure their size does not differ between the CPU and GPU but sadly no such trick is possible with pointers: we have to resort to our manual union hack. In practice we could probably introduce a macro to wrap this and make it cleaner.</p>
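<p>Such a macro could look something like this; it is illustrative only and not from the original code:</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">// One possible way to wrap the union hack behind a macro (illustrative only).
#define CL_SHARED_PTR(type, name) \
    union \
    { \
        type name; \
        cl_ulong pad_##name; \
    }

struct dom_node
{
    CL_SHARED_PTR(struct dom_node*, parent);
    cl_int id;
    cl_int tag_name;
    cl_int class_count;
    cl_int first_class;
    cl_int style[MAX_STYLE_PROPERTIES];
};</code></pre></figure>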
<p>Generally speaking, sharing structures that contain pointers with the GPU isn’t such a good idea. Unless memory is unified, the GPU will not be able to access that memory and as such it is wasteful. Even if the memory is unified, typically the GPU accessible memory must be allocated with particular flags and depending on the platform it might only be possible through the driver.</p>
GPGPU woes part 22015-07-07T00:00:00+00:00http://nfrechette.github.io/2015/07/07/gpgpu_woes_part_2<p>After fixing the <a href="/2015/07/02/gpgpu_woes_part_1/">previous</a> painful issue, we move on to the next strange issue (and much worse!).</p>
<h3 id="on-the-gpu-a-null-pointer-isnt-always-null">On the GPU, a NULL pointer isn’t always NULL</h3>
<p>Despite the code being very simple, the CPU and GPU versions produced very different outputs for me in OS X (not verified in Windows 8.1). For the life of me I could not figure it out as it proved even stranger than the previous issue.</p>
<p>Ultimately, the issue was here:</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="c1">// Inlined the CSS_CUCKOO_HASH_FIND macro and minor formatting cleanup
</span>
<span class="n">__global</span> <span class="k">const</span> <span class="k">struct</span> <span class="nc">css_rule</span> <span class="o">*</span> <span class="n">rule</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">hash_</span><span class="o">-></span><span class="n">left</span><span class="p">[</span><span class="n">left_index_</span><span class="p">].</span><span class="n">type</span> <span class="o">!=</span> <span class="mi">0</span> <span class="o">&&</span> <span class="n">hash_</span><span class="o">-></span><span class="n">left</span><span class="p">[</span><span class="n">left_index_</span><span class="p">].</span><span class="n">value</span> <span class="o">==</span> <span class="n">value_</span><span class="p">)</span> <span class="p">{</span>
<span class="n">rule</span> <span class="o">=</span> <span class="o">&</span><span class="n">hash_</span><span class="o">-></span><span class="n">left</span><span class="p">[</span><span class="n">left_index_</span><span class="p">];</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">hash_</span><span class="o">-></span><span class="n">right</span><span class="p">[</span><span class="n">right_index_</span><span class="p">].</span><span class="n">type</span> <span class="o">!=</span> <span class="mi">0</span> <span class="o">&&</span> <span class="n">hash_</span><span class="o">-></span><span class="n">right</span><span class="p">[</span><span class="n">right_index_</span><span class="p">].</span><span class="n">value</span> <span class="o">==</span> <span class="n">value_</span><span class="p">)</span> <span class="p">{</span>
<span class="n">rule</span> <span class="o">=</span> <span class="o">&</span><span class="n">hash_</span><span class="o">-></span><span class="n">right</span><span class="p">[</span><span class="n">right_index_</span><span class="p">];</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">rule_set</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// code
</span>
<span class="p">}</span></code></pre></figure>
<p>Can you see it? If you can’t, I don’t think anybody could blame you.</p>
<p>After painfully narrowing it down to this precise piece of code, it turns out that the <code class="language-plaintext highlighter-rouge">rule_set != 0</code> check was failing when no rule was found (neither <code class="language-plaintext highlighter-rouge">if</code> statement was taken) on the GPU (and was obviously working as expected on the CPU).</p>
<p>This is mere speculation but I have a gut feeling that this issue might be caused by some bits of the memory address being used internally to tell global memory apart from local memory on the GPU.</p>
<p>It is entirely possible that a NULL pointer for global memory might not equal a NULL pointer for local memory (or constant memory). In such a scenario, comparing both would return <code class="language-plaintext highlighter-rouge">false</code>. Perhaps the memory qualifier information was lost and the optimizer was left to perform a comparison with the literal <em>0</em> value.</p>
<p>The only other alternative would be a compiler bug. Sadly I could not take a look at the generated assembly to double check what was happening.</p>
<p>The fix was simply to introduce a boolean/integer variable that we could safely compare against. Again, note that I was more concerned with getting the code to work while keeping it as close to the original as possible than with making it fast. It is also possible that force casting the <em>0</em> literal to the pointer type might have worked. This is left as an exercise to the reader.</p>
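<p>A sketch of what the boolean fix looks like; the details are illustrative:</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">// Sketch of the fix: track whether a rule was found with an integer flag
// instead of comparing the pointer against the literal 0.
__global const struct css_rule * rule_set = 0;
int rule_found = 0;
if (hash_->left[left_index_].type != 0 && hash_->left[left_index_].value == value_) {
    rule_set = &hash_->left[left_index_];
    rule_found = 1;
}
if (hash_->right[right_index_].type != 0 && hash_->right[right_index_].value == value_) {
    rule_set = &hash_->right[right_index_];
    rule_found = 1;
}
if (rule_found != 0) {
    // code
}</code></pre></figure>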
GPGPU woes part 12015-07-02T00:00:00+00:00http://nfrechette.github.io/2015/07/02/gpgpu_woes_part_1<p>Curiosity about the <a href="https://github.com/servo/servo">Servo</a> project from Mozilla finally got the best of me and I looked around to see if I could contribute somehow. A particular exploratory task caught my eye that involved running the CSS rule matching on the GPU. I forked the <a href="https://github.com/nfrechette/selectron">original sample code</a> and got to work.</p>
<p>Little did I know I would hit some very weird and peculiar issues related to OpenCL on my Intel Iris 5100 inside my MacBook Pro. These issues are so exotic and rarely discussed that I figure they warranted their own blog posts.</p>
<p>Just getting the sample to work reliably on OS X and Windows 8.1 while attempting to get identical results in x86, x64, CPU, and GPU versions proved to take considerable time due to various issues.</p>
<h3 id="moving-pieces">Moving pieces</h3>
<p>GPGPU has a lot of moving pieces that can easily cause havoc. Code is typically written in a C like dialect (OpenCL, CUDA) and it is easy to make mistakes if you are not careful. If the old adage of blowing your foot off with C is true on the CPU, writing C on the GPU is probably akin to blowing up your whole neighbourhood along with your foot.</p>
<p>The first moving piece is the driver you use. Different platforms have different drivers; they also differ by hardware, and how they update differs as well. Bugs in the drivers are not unheard of and are in fact quite frequent since the hardware is still rapidly evolving and the range of supported devices grows every day. For example, OS X provides the drivers as opposed to the manufacturer providing them like they do on Windows. This means the update process is much slower.</p>
<p>The second moving piece is the hardware itself. Even from a single manufacturer, there is considerable variation: from the number of compute units, the size of local storage, and the size of the address space, all the way to whether memory is unified with the CPU or not.</p>
<p>This brings us to our first issue.</p>
<h3 id="on-the-gpu-there-is-no-stack">On the GPU, there is no stack</h3>
<p>The first exotic issue I hit was that the original code would not run for me. On Windows 8.1, the GPU would hit 100% utilization and cause the driver to time out (sometimes forcing me to power cycle). On OS X, the kernel would return after 5 or 10 seconds of runtime and attempting to run it a second time would cause the program to crash (after modifying it to run the kernel more than once to gather average timings).</p>
<p>After hunting for several hours, I finally found the culprit: a C style stack array with 16 elements. The total size of this array was 3 integers times 16 or 192 bytes. This seems fairly small but it fails to take into account how the GPU and generated assembly handle C style stack arrays.</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="k">struct</span> <span class="nc">css_matched_property</span>
<span class="p">{</span>
<span class="n">cl_int</span> <span class="n">specificity</span><span class="p">;</span>
<span class="n">cl_int</span> <span class="n">property_index</span><span class="p">;</span>
<span class="n">cl_int</span> <span class="n">property_count</span><span class="p">;</span>
<span class="p">};</span>
<span class="c1">// Later inside the kernel function
</span>
<span class="k">struct</span> <span class="nc">css_matched_property</span> <span class="n">matched_properties</span><span class="p">[</span><span class="mi">16</span><span class="p">];</span></code></pre></figure>
<p>On the GPU, there is no stack. From past experience, the generated assembly will attempt to keep everything in registers instead of putting it in local or global storage (since local and global storage usage require explicit keywords). For this reason, it will also fail to spill into local storage or global memory if we run out of registers. In practice the driver could probably spill to memory when this happens but the performance would be terrible.</p>
<p>According to <a href="https://software.intel.com/sites/default/files/managed/f3/13/Compute_Architecture_of_Intel_Processor_Graphics_Gen7dot5_Aug2014.pdf">the hardware specifications</a> of my GPU, each thread has 128 registers that each store 32 bytes (SIMD 8 elements of 32 bits). The above array requires 48 such registers if the data is not coalesced into fewer registers. Since we use a struct with 3 individual integers, this is a reasonable assumption. Along with everything else going on in the function, my kernel would in all likelihood (due to the nature of the crash, I failed to get exact measurements for the number of registers used) exhaust all available registers.</p>
<p>This marks the second time I see a GPU crash caused by the driver attempting to run a kernel that requires more registers than are available.</p>
<p>These sorts of issues are nasty since the same kernel would work on hardware with more registers. The code also looks clean and simple if you aren’t aware of what happens behind the curtains.</p>
<p>The fix, as is probably obvious now, was to keep the array in shared local memory. This ensures we can calculate how much actual memory we require for this and based on the amount available on the given hardware, it caps the maximum number of threads we can execute in a group to avoid running out.</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="k">const</span> <span class="n">cl_int</span> <span class="n">MAX_NUM_WORKITEMS_PER_GROUP</span> <span class="o">=</span> <span class="mi">320</span><span class="p">;</span>
<span class="n">__local</span> <span class="k">struct</span> <span class="nc">css_matched_property</span> <span class="n">matched_properties</span><span class="p">[</span><span class="mi">16</span> <span class="o">*</span> <span class="n">MAX_NUM_WORKITEMS_PER_GROUP</span><span class="p">];</span>
<span class="n">cl_int</span> <span class="n">matched_properties_base_offset</span> <span class="o">=</span> <span class="mi">16</span> <span class="o">*</span> <span class="nf">get_local_id</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span></code></pre></figure>
<p>Keep in mind that at this stage of development, I was more concerned with getting the code running correctly than with getting it to run fast. There is no point in having fast code that does not do what you want.</p>
Out of Memory2015-06-25T00:00:00+00:00http://nfrechette.github.io/2015/06/25/out_of_memory<p>I have been playing <em>Game of War: Fire Age</em> for some time now and a peculiar issue keeps recurring which prompted this post: the game often runs out of memory.</p>
<p>This is a rarely discussed issue and the solutions to it are seldom discussed as well. This post is an attempt to document its various causes and solutions and to offer some insight into them.</p>
<h3 id="the-causes">The causes</h3>
<p>In video games, the memory workload is typically very predictable: we have hard limits on many features (e.g: max number of players) and generally fixed data. These two things coupled together imply that out of memory situations are generally quite rare. The typical causes are as follows:</p>
<blockquote>
<p>An excessively large memory size is requested and cannot be serviced.</p>
</blockquote>
<p>I mean by this that 2GB or more might be requested on a device with very little memory (e.g: 256MB). This is generally caused by attempting to allocate memory with a negative size: <code class="language-plaintext highlighter-rouge">size_t</code> is typically used in C++ to represent allocation sizes and it is unsigned, but depending on the warning level and the compiler, automatic (or sometimes programmer forced) coercion can happen. This is generally an error due to unforeseen circumstances; it rarely happens in a released title but happens from time to time during development.</p>
<blockquote>
<p>More memory is allocated than the system allows.</p>
</blockquote>
<p>Again, due to the predictable memory footprint of things in video games, it generally happens during development and very rarely in a released title.</p>
<blockquote>
<p>Over time, due to memory leaking, you run out of memory.</p>
</blockquote>
<p>This can happen in released titles and more than a few have shipped with memory leaks.</p>
<blockquote>
<p>Memory fragmentation.</p>
</blockquote>
<p>If the memory pressure is high and fragmentation is present, even though free memory might exist to service a particular allocation request, the system might fail due to fragmentation (either in user space or due to physical memory fragmentation on some embedded devices). Fragmentation is a real and painful issue to deal with when it creeps up. It will often remain hidden during development until very late primarily due to two things: final content often comes very late in production and the game will often not run for more than one hour until late in production as well. Out of memory situations can happen in released titles on memory constrained devices. I see it at least twice a day in <em>Game of War</em> on my Android tablet.</p>
<h3 id="how-to-deal-with-it">How to deal with it</h3>
<p>On memory constrained devices, if you have memory fragmentation it is a fact of life that you will hit out of memory situations. This is even more likely if your software might run on devices below your minimum specifications (e.g: mobile Android devices). When this happens, there are a few ways to deal with this. Here are the ones that come to mind, in increasing order of complexity:</p>
<blockquote>
<p>Do nothing and let it crash and burn.</p>
</blockquote>
<p>Many games go this route and it costs literally nothing to adopt this strategy (if you can call it that). Sadly, not all crashes will be equal in impact. It is quite common for save games to generate quite a few memory allocations and crashing while the game is saving can often result in corrupted save games. For obvious reasons, this is very bad for the user experience.</p>
<blockquote>
<p>Let it crash but do so in a controlled manner.</p>
</blockquote>
<p>Games that realize they might crash and opt to handle this with the least amount of effort will typically poll how much free memory remains and, when it passes a threshold, crash in a controlled and safe manner. Typically this implies doing so when the game isn’t doing anything important (such as saving the game) and presenting some kind of fatal error message to the user. As far as I can recall, a version of <em>Gears of War</em> running on <em>Unreal 3</em> simply displayed a dirty disk error. This is generally considered acceptable since while it isn’t ideal for the user, at least nothing of value will be lost and ultimately it will remain a minor annoyance (depending on the frequency of course).</p>
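<p>In its simplest form, such a watchdog amounts to something like this. All the function names here are hypothetical placeholders for platform and engine specific calls:</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">// Hypothetical sketch of a controlled crash on low memory.
#include <cstddef>

// Placeholders for engine/platform specific calls
size_t QueryFreeMemory();
bool IsSavingGame();
void ShowFatalErrorScreen(const char* message);
void TerminateProcessSafely();

static const size_t LOW_MEMORY_THRESHOLD = 10 * 1024 * 1024;  // e.g. 10MB

void PollMemoryWatchdog()
{
    if (QueryFreeMemory() < LOW_MEMORY_THRESHOLD && !IsSavingGame())
    {
        // Crash on our own terms, never in the middle of writing a save game
        ShowFatalErrorScreen("A fatal error has occurred.");
        TerminateProcessSafely();
    }
}</code></pre></figure>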
<blockquote>
<p>Deal with it in a clever way.</p>
</blockquote>
<p><em>Game of War</em> is a good example of this. When the game runs out of memory, it sends you back to your home city screen and flashes some colours briefly. (I do not have access to the source code to confirm this but it appears to me to be the cause of this peculiar behaviour.) This can happen almost anywhere except when in the city screen. This is likely because the city screen has a low or very predictable memory footprint. This is superior to the previous approaches since while it remains a minor annoyance, at least you remain within the game and presumably you can continue playing for a little while longer.</p>
<blockquote>
<p>Fix the underlying issues.</p>
</blockquote>
<p>This often requires the largest time investment. Not only does it require extensive testing to make sure even under all your imposed hard limits you do not exceed the maximum memory allowed (e.g: 16 vs 16 players) but it often requires dealing with memory fragmentation and making sure that there is none or very little.</p>
<h3 id="case-study-aaa-title-2014">Case study: <em>AAA title (2014)</em></h3>
<p>During the development of <em>AAA title (2014)</em>, our final content for most maps came in very late in development and it all came at once. This made testing everything very hard. We knew very early on that a single platform would struggle with memory pressure and that prediction proved very accurate: our <em>PlayStation 3</em> title suffered from rampant out of memory situations.</p>
<p>A number of factors lead to this:</p>
<ul>
<li>64KB page sizes meant that our memory allocator had to deal properly with virtual memory to avoid fragmentation.</li>
<li>The PS3 has ~213MB of usable main memory and ~256MB of usable video memory. While you can use video memory as general purpose memory, accessing it from the CPU is very slow and is generally not recommended. This makes it the platform with the least amount of general purpose memory.</li>
<li>With high memory pressure comes memory fragmentation issues.</li>
</ul>
<p>While we made sure to perform memory optimizations throughout development to reduce our footprint, ultimately it proved not to be sufficient when we neared our release date. The final major memory optimizations (both code and data) came in about 6 months before we released our title. Around that time, memory fragmentation reared its ugly head and the battle began.</p>
<p>Fighting memory fragmentation is hard and painful. Even though I had knowledge of how it happens prior to facing it, I had never had actual experience dealing with it.</p>
<p>The battle raged on for 6 months before we finally eradicated it for good. We ultimately released the game with much more free memory than we anticipated: our efforts finally paid off.</p>
<p>But that is not the whole story. Dealing with memory fragmentation is complicated and is best left to a future blog post. I will however discuss our plan to deal with our worst case scenario: failure to fix our memory fragmentation issues.</p>
<p>Few people on <em>AAA title (2014)</em> really knew how bad it got. At one point I had over 100 separate bug reports of out of memory issues: so many that whenever I would make an improvement, all I could do was claim everything as fixed and see what came back. It became a running gag that I had over half the bug reports assigned to me.</p>
<p>At some point, about 2 months before our release date, we could play the game for about an hour or two before running out of memory due to memory fragmentation. It was bad and we weren’t sure if we were going to be able to fix the issue in time for the release or even in time for our first patch. To prepare for this scenario, we took a similar approach to <em>Game of War</em> and when we detected a low memory situation, we would force a map transition into the <em>Player Hub</em> level. This was a small level that you would return to in between story arcs. This made it the perfect place. It was expected that the map transition would always succeed (at least before the user got tired!) due to the fact that whatever was unloading was larger than what we loaded. The map transition would also save your progress ensuring that your save game would never corrupt due to this.</p>
<p>It was a horrible hack, it was ugly, but it was necessary. With this, we knew that it was better than crashing and that if the user continued to play after this and fragmentation became really bad, at worst they would not be able to leave that level without reloading into it; they would eventually get the message and restart the title. A necessary evil.</p>
<p>Ultimately, only 2 or 3 bug reports ever spoke of this weird behaviour and by the time we released our game, our memory fragmentation issues were fixed and it became unnecessary. In the end, we removed the hack from the final product since it was only for a single platform and now unnecessary. I personally played the game for over 4 consecutive hours on the night prior to our release and made sure our free memory would never dip below our acceptable threshold: 10MB. Most maps ended up with 15-20MB of free memory with our biggest maps closer to 10MB.</p>
<p>These hacks are a poor substitute for a real fix but with the pressures of the real world, they are often a necessary and realistic option. Do you have a similar war story?</p>
<p><a href="/2016/10/18/memory_allocators_toc/"><strong>Back to table of contents</strong></a></p>
Virtual Memory Aware Linear Allocator2015-06-11T00:00:00+00:00http://nfrechette.github.io/2015/06/11/vmem_linear_allocator<p>This allocator is a variant of the <a href="/2015/05/21/linear_allocator/">linear allocator</a> we covered last time and again it serves to introduce a few important concepts to allocators. Today we cover the virtual memory aware linear memory allocator (<a href="https://github.com/nfrechette/gin/blob/master/include/gin/vmem_linear_allocator.h">code</a>).</p>
<h3 id="how-it-works">How it works</h3>
<p>The internal logic is nearly identical to the linear allocator with a few important tweaks:</p>
<ul>
<li>We <code class="language-plaintext highlighter-rouge">Initialize</code> the allocator with a buffer size. The allocator will use this size to reserve virtual memory but it will not commit any physical memory to it until it is needed.</li>
<li><code class="language-plaintext highlighter-rouge">Allocate</code> will commit physical memory on demand as we allocate.</li>
<li><code class="language-plaintext highlighter-rouge">Deallocate</code> remains unchanged and does nothing.</li>
<li><code class="language-plaintext highlighter-rouge">Reallocate</code> remains largely unchanged and like <code class="language-plaintext highlighter-rouge">Allocate</code> it will commit physical memory when needed.</li>
<li><code class="language-plaintext highlighter-rouge">Reset</code> remains largely the same but it now decommits all previously committed physical memory.</li>
</ul>
<p>Much like the vanilla linear allocator, the buffer is not modified by the allocator and there is no memory overhead per allocation.</p>
<p>There are two things currently missing from the implementation. The first is that we do not specify how eagerly we commit physical memory. There is a cost associated with committing (and decommitting) physical memory since we must call into the kernel to update the TLB page tables (and invalidate the relevant TLB entries). Depending on the usage scenario, this can have an important cost. Currently we commit memory with a granularity of 4KB (the default page size on most platforms). In practice, even if the system uses pages of 4KB, we could commit memory with any higher multiple (e.g: commit in blocks of 128KB).</p>
<p>The second missing detail is that we simply decommit everything when we <code class="language-plaintext highlighter-rouge">Reset</code>. In practice, some sort of policy would be used here in order to manage slack. This policy would be required at initialization as well as when we reset. For the reasons stated above, committing and decommitting have costs and reducing that overhead is important; in many cases it would make sense to keep some amount of the memory always committed. While these two details are not necessarily critical in this allocator variant, in many other allocators they are of crucial importance. Decommitting memory is often very important to fight fragmentation and is critical when multiple allocators must work alongside each other.</p>
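<p>As an example of such a policy, a <code class="language-plaintext highlighter-rouge">Reset</code> that keeps a fixed amount of committed slack might look like this. This is a sketch only: it assumes a Windows style <code class="language-plaintext highlighter-rouge">VirtualFree</code>, the class and member names stand in for the real ones, and the slack amount would be a tuning parameter:</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">// Illustrative sketch: keep some slack committed across resets to avoid
// paying the commit cost again on the next use. The members mirror the
// allocator being discussed; the class name is a placeholder.
static const size_t COMMIT_SLACK = 256 * 1024;  // tuning parameter: keep 256KB committed

void VMemLinearAllocator::Reset()
{
    m_allocatedSize = 0;

    if (m_committedSize > COMMIT_SLACK)
    {
        // Decommit everything past the slack; the address range itself stays reserved
        VirtualFree(m_buffer + COMMIT_SLACK, m_committedSize - COMMIT_SLACK, MEM_DECOMMIT);
        m_committedSize = COMMIT_SLACK;
    }
}</code></pre></figure>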
<h3 id="what-can-we-use-it-for">What can we use it for</h3>
<p>Unlike the vanilla linear allocator, because we commit physical memory on demand, this allocator is better suited for large buffers or for buffers where the used size is not known ahead of time. The only requirement is that we know an upper bound.</p>
<p>In the past, I have used something very similar to manage video game checkpoints. It is often known or enforced by the platform that a save game or checkpoint does not have a size larger than a known constant. However, it is very rare that we know the exact size it will have when the buffer must be created. To create your save game, you can simply employ this linear allocator variant with an upper bound, use it behind a stream type class to facilitate serialization and be done with it. You can sleep well at night knowing that no realloc will happen and no memory will be needlessly copied.</p>
<h3 id="what-we-cant-use-it-for">What we can’t use it for</h3>
<p>Much like the vanilla linear allocator, this allocator is ill suited if freeing memory at the pointer granularity is required.</p>
<h3 id="edge-cases">Edge cases</h3>
<p>The edge cases of this allocator are identical to the vanilla linear allocator with the exception of two new ones we add to the list.</p>
<p>When we first initialize the allocator, virtual memory must be reserved. Virtual memory is a finite resource and can run out like everything else. On 32 bit systems, this value is typically 2 to 4GB. On 64 bit systems, this value is very large since typically 40 bits are used to represent it.</p>
<p>When we commit physical memory, we might end up running out. In this scenario, we still have free reserved virtual memory but we are out of physical memory. This can happen if the physical memory becomes fragmented and mixed page sizes are used by the application. For example, consider an allocator <em>A</em> using pages of 2MB while another allocator <em>B</em> uses pages of 4KB. It might be possible for the system to end up with holes that are smaller than 2MB. Since the TLB must refer to a contiguous region of physical memory when large pages are used, this is bad. On many platforms, the kernel will defragment physical memory if this happens by copying memory around and remapping TLB entries. However, not all platforms will do this and some will simply bail out on you.</p>
<h3 id="potential-optimizations">Potential optimizations</h3>
<p>Much like the vanilla linear allocator, all previously observed optimization opportunities are also available here. Also, as previously discussed, depending on how greedy we are with committing and how much slack we keep when decommitting, we can tune the performance quite effectively.</p>
<p>One notable other optimization avenue that is not for the faint of heart is that we can remove the check to commit memory inside the <code class="language-plaintext highlighter-rouge">Allocate</code> function and instead let the system produce an invalid access fault. By modifying the handler function and registering our allocator, we could commit memory when this happens and retry the faulting instruction. Depending on the granularity of the pages used to commit memory, this could reduce somewhat the overhead required for allocation by removing the branch required for this check.</p>
<p>While this implementation uses a variable to keep track of how much memory is committed, depending on the actual policy used, it could potentially be dropped as well. It was added for simplicity not out of necessity since in the current implementation, the allocated size could be rounded up to a multiple of the page size used for the purpose of tracking the committed memory.</p>
<h3 id="performance">Performance</h3>
<p>Due to its simplicity, it offers great performance. All allocation and deallocation operations are <em>O(1)</em> and only amount to a few instructions. However, resetting and destroying the allocator now have the added cost of decommitting physical memory and will thus be linearly dependent on the amount of committed memory.</p>
<p>On most 32 bit platforms, the size of an instance should be 28 bytes if <code class="language-plaintext highlighter-rouge">size_t</code> is used. On 64 bit platforms, the size should be 56 bytes with <code class="language-plaintext highlighter-rouge">size_t</code>. Both versions can be made even smaller with smaller integral types such as <code class="language-plaintext highlighter-rouge">uint32_t</code> or by stripping support for <code class="language-plaintext highlighter-rouge">Reallocate</code>. As such, either version will comfortably fit inside a single typical cache line of 64 bytes.</p>
<h3 id="conclusion">Conclusion</h3>
<p>Once again, this is a very simple and perhaps even toy allocator. However, it serves as an important building block to discuss the various implications of committing and decommitting memory as well as the general implementation details surrounding these things. Manual virtual memory management is a classic and important tool of modern memory allocators and seeing it put to use in a simple context serves as a learning example.</p>
<p>Fundamentally, linear allocators are a variation of a much more interesting and important allocator: the stack frame allocator. In essence, linear allocators are stack frame allocators where only a single frame is supported. Pushing of the frame happens at initialization and popping happens when we reset or at destruction.</p>
<p>Next up, we will cover the ever so useful: <a href="/2016/05/08/stack_frame_allocators/">stack frame allocators</a>.</p>
<h3 id="alternate-names">Alternate names</h3>
<p>To my knowledge, there is no alternate name for this allocator since it isn’t really an allocator one would see in the wild.</p>
<p><em>Note that if you know a better name or alternate names for this allocator, feel free to contact me.</em></p>
<p><a href="http://www.reddit.com/r/programming/comments/39gl0d/memory_allocators_explained_the_virtual_memory/">Reddit thread</a></p>
<p><a href="/2016/10/18/memory_allocators_toc/"><strong>Back to table of contents</strong></a></p>
How to legally generate the Windows ISO from a Mac or Linux PC2015-05-27T00:00:00+00:00http://nfrechette.github.io/2015/05/27/generate_windows_iso_mac<p>Surprisingly, it turns out that getting your hands on a legal Windows ISO after purchasing it is not as easy as it might seem. After purchasing a digital copy of Windows in the Microsoft store, you will be able to download a Windows application that downloads the ISO for you after entering your product key.</p>
<p>Therein lies the issue: to create Windows install media, you need access to a PC with Windows already installed. If like me all you have on hand is a Mac or Linux PC without easy or quick access to a PC with Windows (or one that you can trust with your product key), you are left in a tight spot. Even Microsoft support will not be able to help you and will simply point out that it can’t be that hard to find a PC with Windows.</p>
<p>Worse still is that not all versions of Windows are supported; for example, the versions that run on the free Amazon Web Services tier are not supported, so this avenue is not open to us either (at least at the time of writing).</p>
<p>The only viable alternative appears to be to download the ISO from an untrusted torrent site. If like me this leaves you uneasy, read on.</p>
<p>As it turns out, an unrelated tool that Microsoft releases will allow us to execute that ISO generating executable in a safe and legal way.</p>
<h3 id="step-1">Step 1</h3>
<p>Install <a href="https://www.virtualbox.org/">Virtual Box</a>. Virtual Box is a virtual machine application from Oracle that allows you to run an operating system in a virtualized environment as a guest.</p>
<h3 id="step-2">Step 2</h3>
<p>Download a Virtual Box image from the <a href="https://developer.microsoft.com/en-us/microsoft-edge/tools/vms/">Internet Explorer Developer website</a>. These images are legal versions of Windows provided by Microsoft to allow developers to test various Internet Explorer versions on various Windows versions. They expire after 90 days and cannot be activated but that doesn’t matter to us since we’ll only really use it for as long as the download of the ISO takes.</p>
<p>Simply grab the image for the platform that you have (Mac or Linux). Note that most images are 32 bit which will only allow you to later download the 32 bit ISO. At the time of writing, the Windows 10 image is 64 bit but requires <a href="https://stackoverflow.com/questions/30212542/the-ie11-windows-10-vm-for-virtualbox-on-osx-doesnt-start">the following fix</a> to work properly.</p>
<h3 id="step-3">Step 3</h3>
<p>Inside Virtual Box, import the Windows image downloaded at the previous step. The documentation recommends you set at least 2GB of RAM for the virtual machine to use. This is fine since we’ll only run it for a short amount of time anyway.</p>
<h3 id="step-4">Step 4</h3>
<p>Use Virtual Box to <a href="https://www.virtualbox.org/manual/ch04.html#sharedfolders">share a directory</a> with the virtual machine with write access and make sure it automatically mounts (easier for us). This is the directory where we will copy the ISO file into. At the time of writing it does not work with the Windows 10 image yet and as such you will need to plug in a USB key and share it with the virtual machine from the settings or share a network folder (I had more luck sharing the folder inside the Windows guest virtual machine and accessing it from my Mac).</p>
<h3 id="step-5">Step 5</h3>
<p>Launch the virtual machine instance and use it to run the executable from the Windows store. If you do not have access to it, I have not tested but presume that you might be able to use <a href="http://windows.microsoft.com/en-us/windows-8/create-reset-refresh-media">this utility</a> as well.</p>
<h3 id="step-6">Step 6</h3>
<p>Follow the instructions and download the ISO to some directory (e.g: the desktop). For some reason Virtual Box did not allow me to download directly to the shared directory we setup in <em>Step 4</em>. Either way, once the download terminates, simply copy the ISO to the shared directory.</p>
<h3 id="complications">Complications</h3>
<p>Sadly, the Windows 10 image is a bit finicky. You can’t shut it down properly and install the updates or it won’t boot again and you’ll have to start over. I could not manage to get my USB stick to work either and as such I had to create a shared directory inside the Windows guest. In order to be able to access the shared directory, I had to hot swap the virtual machine network card from NAT to Host bridge (you will need to add a host adapter in the Virtual Box preferences). I also needed to add a default gateway in the IPv4 settings of the Windows guest for my Mac to be able to access it and I also needed to allow Guest access in the network settings.</p>
<p>Next, due to our previous hack to change the time to allow Windows to boot, the executable will refuse to download the ISO and complain it can’t connect to the internet. To resolve this I had to change the time inside the guest manually to the current date. I also needed to hot swap the network card back to the NAT setting and removed the default gateway.</p>
<p>Last but not least in order to copy the ISO out, I had to hot swap the network card once again to Host bridge and add back the default gateway.</p>
<h3 id="profit">Profit!</h3>
<p>That’s it! You are done and can now get rid of Virtual Box if you wish. You can then continue the steps to create your bootable DVD or USB stick from the ISO.</p>
<p><a href="http://www.tomshardware.com/answers/id-1800781/windows-iso-file.html">There</a> <a href="http://kb.parallels.com/en/121009">are</a> <a href="http://log.maniacalrage.net/post/78047230047/how-to-install-windows-8-1-on-a-new-mac-pro">so</a> <a href="http://forums.whirlpool.net.au/archive/2155844">many</a> threads out there asking how to do this that I hope this will be able to help someone avoid the hassle I went through to find this. You would think Microsoft would make it easier for you to install their operating system…</p>
Linear Allocator2015-05-21T00:00:00+00:00http://nfrechette.github.io/2015/05/21/linear_allocator<p>The first memory allocator we will cover is by far the simplest and serves as an introduction to the material. Today we cover the linear memory allocator (<a href="https://github.com/nfrechette/gin/blob/master/include/gin/linear_allocator.h">code</a>).</p>
<h3 id="how-it-works">How it works</h3>
<p>The internal logic is fairly simple:</p>
<ul>
<li>We initialize the allocator with a pre-allocated memory buffer and its size.</li>
<li><code class="language-plaintext highlighter-rouge">Allocate</code> simply increments a value indicating the current buffer offset.</li>
<li><code class="language-plaintext highlighter-rouge">Deallocate</code> does nothing.</li>
<li><code class="language-plaintext highlighter-rouge">Reallocate</code> first checks whether the allocation being resized is the most recent allocation; if it is, we return the same pointer and update our current allocation offset.</li>
<li><code class="language-plaintext highlighter-rouge">Reset</code> is used to free all allocated memory and return the allocator to its original initialized state.</li>
</ul>
<p>There is very little work to do in all these functions, which makes the allocator very fast. <code class="language-plaintext highlighter-rouge">Deallocate</code> does nothing because in practice it makes little sense to support it; other allocators are better suited when freeing memory at the pointer level is required. The only sane implementation we could provide is similar to how <code class="language-plaintext highlighter-rouge">Reallocate</code> works: check whether the memory being freed is the last allocation (last in, first out). Because we work with a pre-allocated memory buffer, <code class="language-plaintext highlighter-rouge">Reallocate</code> does not need to perform a copy when the last allocation is resized and there is enough free space left, regardless of whether it grows or shrinks.</p>
<p>Interestingly, you can almost free memory at the pointer level by calling <code class="language-plaintext highlighter-rouge">Reallocate</code> with a new size of 0, but in practice any padding that was added to satisfy the original alignment would remain lost forever.</p>
<p>Critical to this allocator is that it adds no overhead per allocation and it does not modify the pre-allocated memory buffer. Not all allocators will have these properties and I will always mention this important bit. This makes it ideal for low level systems or for working with read-only memory.</p>
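<p>To make this concrete, here is a minimal sketch of the core logic, assuming power-of-two alignments (the names are illustrative; this is not the exact <code class="language-plaintext highlighter-rouge">gin</code> implementation linked above, and the overflow checks are deferred to the <em>Edge cases</em> section below):</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">#include <cstddef>
#include <cstdint>

class LinearAllocator
{
public:
    LinearAllocator(void* buffer, size_t bufferSize)
        : m_buffer(reinterpret_cast<uintptr_t>(buffer))
        , m_bufferSize(bufferSize)
        , m_offset(0)
    {}

    void* Allocate(size_t size, size_t alignment)
    {
        // Round the current offset up to the requested alignment (power of two assumed)
        size_t alignedOffset = (m_offset + alignment - 1) & ~(alignment - 1);
        if (alignedOffset + size > m_bufferSize)
            return nullptr;    // out of memory (overflow checks omitted here)

        m_offset = alignedOffset + size;
        return reinterpret_cast<void*>(m_buffer + alignedOffset);
    }

    void Deallocate(void* /*ptr*/, size_t /*size*/)
    {
        // Intentionally a no-op, as discussed above
    }

    void Reset()
    {
        // Frees everything in one shot
        m_offset = 0;
    }

private:
    uintptr_t m_buffer;        // pre-allocated, not owned, never modified
    size_t m_bufferSize;
    size_t m_offset;           // current allocation offset
};</code></pre></figure>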
<h3 id="what-can-we-use-it-for">What can we use it for</h3>
<p>Despite being a very simple allocator, it has a few uses. I have used it in the past with success to clean up code dealing with a lot of pointer arithmetic. The general idea is that if you have a memory buffer representing a custom binary format with most fields having a variable size and requiring alignment, you will end up with a lot of logic to take your raw buffer and split it into the various internal bits.</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="kt">uintptr_t</span> <span class="n">buffer</span><span class="p">;</span> <span class="c1">// user supplied
</span>
<span class="kt">size_t</span> <span class="n">bufferOffset</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kt">uint32_t</span><span class="o">*</span> <span class="n">numValue1</span> <span class="o">=</span> <span class="k">reinterpret_cast</span><span class="o"><</span><span class="kt">uint32_t</span><span class="o">*></span><span class="p">(</span><span class="n">buffer</span><span class="p">);</span> <span class="c1">// assume buffer is properly aligned for first value
</span>
<span class="n">bufferOffset</span> <span class="o">=</span> <span class="n">AlignTo</span><span class="p">(</span><span class="n">bufferOffset</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">*</span> <span class="nf">sizeof</span><span class="p">(</span><span class="kt">uint32_t</span><span class="p">),</span> <span class="k">alignof</span><span class="p">(</span><span class="kt">uint8_t</span><span class="p">));</span>
<span class="kt">uint8_t</span><span class="o">*</span> <span class="n">values1</span> <span class="o">=</span> <span class="k">reinterpret_cast</span><span class="o"><</span><span class="kt">uint8_t</span><span class="o">*></span><span class="p">(</span><span class="n">buffer</span> <span class="o">+</span> <span class="n">bufferOffset</span><span class="p">);</span>
<span class="n">bufferOffset</span> <span class="o">=</span> <span class="n">AlignTo</span><span class="p">(</span><span class="n">bufferOffset</span> <span class="o">+</span> <span class="o">*</span><span class="n">numValue1</span> <span class="o">*</span> <span class="nf">sizeof</span><span class="p">(</span><span class="kt">uint8_t</span><span class="p">),</span> <span class="k">alignof</span><span class="p">(</span><span class="kt">uint16_t</span><span class="p">));</span>
<span class="kt">uint16_t</span><span class="o">*</span> <span class="n">numValue2</span> <span class="o">=</span> <span class="k">reinterpret_cast</span><span class="o"><</span><span class="kt">uint16_t</span><span class="o">*></span><span class="p">(</span><span class="n">buffer</span> <span class="o">+</span> <span class="n">bufferOffset</span><span class="p">);</span>
<span class="n">bufferOffset</span> <span class="o">=</span> <span class="n">AlignTo</span><span class="p">(</span><span class="n">bufferOffset</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">*</span> <span class="nf">sizeof</span><span class="p">(</span><span class="kt">uint16_t</span><span class="p">),</span> <span class="k">alignof</span><span class="p">(</span><span class="kt">float</span><span class="p">));</span>
<span class="kt">float</span><span class="o">*</span> <span class="n">values2</span> <span class="o">=</span> <span class="k">reinterpret_cast</span><span class="o"><</span><span class="kt">float</span><span class="o">*></span><span class="p">(</span><span class="n">buffer</span> <span class="o">+</span> <span class="n">bufferOffset</span><span class="p">);</span></code></pre></figure>
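<p>The <code class="language-plaintext highlighter-rouge">AlignTo</code> helper is left undefined in the snippet above; a typical implementation, assuming power-of-two alignments, might look like this:</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">#include <cstddef>

// Rounds 'value' up to the next multiple of 'alignment'.
// Only valid when 'alignment' is a power of two.
inline size_t AlignTo(size_t value, size_t alignment)
{
    return (value + alignment - 1) & ~(alignment - 1);
}</code></pre></figure>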
<p>Using a linear allocator to wrap the buffer allows you to manipulate it in an elegant and efficient manner.</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="n">LinearAllocator</span> <span class="nf">buffer</span><span class="p">(</span><span class="cm">/* user supplied */</span><span class="p">);</span>
<span class="kt">uint32_t</span><span class="o">*</span> <span class="n">numValue1</span> <span class="o">=</span> <span class="k">new</span><span class="p">(</span><span class="n">buffer</span><span class="p">)</span> <span class="kt">uint32_t</span><span class="p">;</span>
<span class="kt">uint8_t</span><span class="o">*</span> <span class="n">values1</span> <span class="o">=</span> <span class="k">new</span><span class="p">(</span><span class="n">buffer</span><span class="p">)</span> <span class="kt">uint8_t</span><span class="p">[</span><span class="o">*</span><span class="n">numValue1</span><span class="p">];</span>
<span class="kt">uint16_t</span><span class="o">*</span> <span class="n">numValue2</span> <span class="o">=</span> <span class="k">new</span><span class="p">(</span><span class="n">buffer</span><span class="p">)</span> <span class="kt">uint16_t</span><span class="p">;</span>
<span class="kt">float</span><span class="o">*</span> <span class="n">values2</span> <span class="o">=</span> <span class="k">new</span><span class="p">(</span><span class="n">buffer</span><span class="p">)</span> <span class="kt">float</span><span class="p">[</span><span class="o">*</span><span class="n">numValue2</span><span class="p">];</span></code></pre></figure>
<p>Note that the above two versions are not 100% equivalent because C++11 offers no way to access the required alignment of the requested type when implementing the <code class="language-plaintext highlighter-rouge">new</code> operator. However, with macro support, the original intent above can be expressed just as clearly while handling alignment properly. I plan to cover this important bit in a later post.</p>
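<p>One possible shape for such a macro (hypothetical, and not necessarily the form I will cover later) routes the allocation through the allocator explicitly so that <code class="language-plaintext highlighter-rouge">alignof</code> can be used:</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">#include <new>    // placement new

// Hypothetical helper: allocates properly aligned storage from 'allocator'
// and placement-news a 'type' into it. Error handling is elided and an
// array variant would be defined along the same lines.
#define ALLOCATE_NEW(allocator, type) \
    new ((allocator).Allocate(sizeof(type), alignof(type))) type</code></pre></figure>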
<p>I wrote earlier that this allocator can be used with read-only memory which is a strange property for a memory allocator. Indeed, since it mostly abstracts away pointer arithmetic when partitioning a raw memory region without modifying it, we can use it to do just that over read-only memory. In the example above, this means that we can easily use it for writing and reading our custom binary format.</p>
<h3 id="what-we-cant-use-it-for">What we can’t use it for</h3>
<p>Because we add no per allocation overhead, we cannot properly support freeing memory at the pointer level while also supporting variable alignment. When support for freeing memory is needed, other allocators are better suited.</p>
<p>This allocator is also generally a poor fit for very large memory buffers. Because the buffer must be pre-allocated up front, we bear its full cost regardless of how much of it we actually allocate internally.</p>
<h3 id="edge-cases">Edge cases</h3>
<p>There are two important edge cases with this allocator and they are shared by all allocators: overflow and out of memory conditions.</p>
<p>We can cause arithmetic overflow in two ways in this allocator: first by supplying a large alignment value, and second by attempting to allocate a large amount of memory. This is fairly simple to test, but there is one small bit we must be careful with: if the alignment provided is very large, the arithmetic can wrap around in such a way that the resulting pointer lands back inside our buffer, either at a lower memory address if the allocation size is also very large, or at a higher memory address if the allocation size is small. The proper way to deal with this is to check for overflow after applying the alignment and again after adding the new allocation size.</p>
<p>We eventually run out of memory if we attempt to allocate more than our pre-allocated buffer owns.</p>
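<p>Building on the earlier sketch, an <code class="language-plaintext highlighter-rouge">Allocate</code> hardened against both failure modes might look like this:</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">void* LinearAllocator::Allocate(size_t size, size_t alignment)
{
    size_t alignedOffset = (m_offset + alignment - 1) & ~(alignment - 1);
    if (alignedOffset < m_offset)
        return nullptr;    // arithmetic overflow caused by the alignment

    size_t newOffset = alignedOffset + size;
    if (newOffset < alignedOffset)
        return nullptr;    // arithmetic overflow caused by the allocation size

    if (newOffset > m_bufferSize)
        return nullptr;    // out of memory

    m_offset = newOffset;
    return reinterpret_cast<void*>(m_buffer + alignedOffset);
}</code></pre></figure>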
<h3 id="potential-optimizations">Potential optimizations</h3>
<p>While the implementation I provide aims for safety first, in practice a linear allocator should never run out of memory nor overflow if the logic using it is correct. By providing a pre-allocated buffer, we assume we will not need more memory; either that assumption holds or it is wrong. In the latter case, it is highly likely that we do not check the return value of allocations anyway, in which case a safe allocator is of little help. Usage scenarios for this allocator are also generally simple in logic, with few unknown variables to cause havoc.</p>
<p>In this light, a number of improvements can be made by stripping the overflow and out of memory checks and simply keeping asserts around. I may end up providing a way to do just that with template arguments or macros in the future.</p>
<p>Another option is to remove the branches added by the overflow and out of memory checks and simply ensure the internal state does not change instead. There is so little logic that an early out branch saves very little in the rare case it is taken, and since it is rarely taken, we end up performing most of the logic anyway.</p>
<p>Last but not least, <code class="language-plaintext highlighter-rouge">Reallocate</code> support is often not required and could trivially be stripped as well.</p>
<h3 id="performance">Performance</h3>
<p>Due to its simplicity, it offers great performance. All operations are <em>O(1)</em> and only amount to a few instructions.</p>
<p>On most 32 bit platforms, the size of an instance should be 24 bytes if <code class="language-plaintext highlighter-rouge">size_t</code> is used. On 64 bit platforms, the size should be 48 bytes with <code class="language-plaintext highlighter-rouge">size_t</code>. Both versions can be made even smaller with smaller integral types such as <code class="language-plaintext highlighter-rouge">uint16_t</code> (which would be very appropriate since this allocator is predominantly used with small buffers) or by stripping support for <code class="language-plaintext highlighter-rouge">Reallocate</code>. As such, either version will comfortably fit inside a single typical cache line of 64 bytes.</p>
<h3 id="conclusion">Conclusion</h3>
<p>Despite its simple internals, the linear allocator is an important building block. It serves as an ideal example for a number of sibling allocators we will see in the next few posts which involve similar internal logic and edge cases.</p>
<p>Next up, we will cover a variation: <a href="/2015/06/11/vmem_linear_allocator/">the virtual memory aware linear allocator</a>.</p>
<h3 id="alternate-names">Alternate names</h3>
<p>The most common alternate name for this allocator is the <em>arena</em> allocator. However, that name is overloaded and is also often associated with a number of other allocators which manage a fixed memory region.</p>
<p><em>Note that if you know a better name or alternate names for this allocator, feel free to contact me.</em></p>
<p><a href="/2016/10/18/memory_allocators_toc/"><strong>Back to table of contents</strong></a></p>
Virtual memory explained2015-05-12T00:00:00+00:00http://nfrechette.github.io/2015/05/12/virtual_memory_explained<p>Virtual memory is present on most hardware platforms and it is surprisingly simple. However, it is often poorly understood. It works so well and seamlessly that few inquire about its true nature.</p>
<p>Today we will take an in-depth look at why we have virtual memory and how it works under the hood.</p>
<h3 id="physical-memory-a-scarce-ressource">Physical memory: a scarce resource</h3>
<p>In the earlier days of computing, there was no virtual memory. The computing needs were simple and single process operating systems were common (batch processing was the norm). As the demand grew for computers, so did the complexity of the software running on them.</p>
<p>Eventually the need to run multiple processes concurrently appeared and soon became mainstream. The problem with multiple processes with regard to memory is three-fold:</p>
<ul>
<li>Physical memory must be shared somehow</li>
<li>We must be able to access more than 64KB of memory using 16 bit registers</li>
<li>We must protect the memory to prevent tampering (malicious or accidental)</li>
</ul>
<h3 id="memory-segments">Memory segments</h3>
<p><a href="http://en.wikipedia.org/wiki/X86_memory_segmentation">On x86</a>, the first and second points were addressed first with <em>real mode</em> (1MB range) and later, memory protection was introduced with <em>protected mode</em> (16MB range). Virtual memory was born, but not in the form most commonly seen today: early x86 used a segment based virtual memory model.</p>
<p>Segments were complex to manage but served their original purpose. Coupled with secondary storage, the operating system was now able to share the physical memory by swapping entire segments in and out of memory. On 16 bit processors, segments had a fixed size of 64KB; later, when 32 bit processors emerged, segments grew to a variable size with a maximum of 16MB. This latter development had an unfortunate side-effect: due to the variable size of segments, physical memory fragmentation could now occur.</p>
<p>To address this, paging was introduced. Paging, like segmentation before it, divides physical memory into fixed size blocks, and it adds another level of indirection. With paging, segments are further divided into pages and each segment contains a page table to resolve the mapping between segment relative addresses and effective physical addresses. Because pages do not need to be contiguous within a given segment, this resolves the memory fragmentation issues.</p>
<p>At this point in time, memory accesses contain two indirections: first we must construct the segment relative address using a 16 bit segment index, reading the associated 32 bit segment base address and adding a 32 bit segment offset (386 CPUs shifted from 24 bit base addresses and offsets to 32 bit at the same time paging was introduced). This yields a memory address that we must then look up in the segment page table to ultimately find the physical memory page (often called a frame) that contains what we are looking for.</p>
<h3 id="modern-virtual-memory">Modern virtual memory</h3>
<p>Things now look much closer to modern virtual memory. With 32 bit processors common enough, there is no longer a need for segments and paging alone can be used. The x86 hardware of the time already used 32 bit segment base addresses and 32 bit segment offsets. Memory becomes far easier to manage if we treat it as a single memory segment with paging, and doing so lets us drop one level of indirection.</p>
<p>Most of this memory address translation logic now happens inside the MMU (memory management unit) and is helped by an internal cache, more commonly called the Translation Look-aside Buffer, or TLB for short.</p>
<p><a href="http://www.rcollins.org/ddj/May96/">Earlier x86 processors</a> only supported 4KB memory pages. To accommodate this, a virtual memory address was split into three parts (decoded in code right after this list):</p>
<ul>
<li>The first 10 bits represent an index in a page directory that is used to look up a page table.</li>
<li>The following 10 bits represent an index in the previously found page table that is used to look up a physical page frame.</li>
<li>The remaining 12 least significant bits are the final offset into the physical page frame leading to the desired memory address.</li>
</ul>
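<p>In code, the split can be expressed as follows (a sketch of what the MMU does in hardware; the names are illustrative):</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">#include <cstdint>

struct DecodedAddress
{
    uint32_t pageDirectoryIndex;    // selects a page table
    uint32_t pageTableIndex;        // selects a physical page frame
    uint32_t pageOffset;            // offset within the 4KB frame
};

DecodedAddress Decode(uint32_t virtualAddress)
{
    DecodedAddress result;
    result.pageDirectoryIndex = (virtualAddress >> 22) & 0x3FF;    // top 10 bits
    result.pageTableIndex = (virtualAddress >> 12) & 0x3FF;        // next 10 bits
    result.pageOffset = virtualAddress & 0xFFF;                    // low 12 bits
    return result;
}</code></pre></figure>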
<p>A dedicated register points to a page directory in memory. Both page directory entries and page table entries are 32 bits and contain flags to indicate memory protection and other attributes (cacheable, write combine, etc.). Another dedicated register holds the current process identifier, which is used to tell which TLB entries are valid (and avoids the need to flush the entire TLB when context switching between processes).</p>
<p>As memory grew, 4KB pages had difficulty scaling. Servicing large allocation requests requires mapping large numbers of 4KB pages, putting more and more pressure on the TLB. The bookkeeping overhead also grows along with the number of pages used. Eventually, <a href="http://en.wikipedia.org/wiki/Page_Size_Extension">larger pages (4MB)</a> were introduced to address these issues.</p>
<p>And then memory sizes grew even further. Soon enough, an address space of 4GB with 32 bit pointers became too confining, and <a href="http://en.wikipedia.org/wiki/Physical_Address_Extension">physical address extension</a> was introduced to extend the addressable range to 64GB. This forced large pages to shrink to a size of 2MB, since some bits were now needed for another indirection level in the page table walk.</p>
<h3 id="virtual-memory-today">Virtual memory today</h3>
<p>Today, x64 hardware most commonly supports pages of the following sizes: 4KB, 2MB, and sometimes 1GB. Typically, only 48 bits are used, limiting the addressable memory to 256TB.</p>
<p>Another important aspect of virtual memory today is how it interacts with virtualization (when present). Because physical memory must be shared between the running operating systems, page directory entries and page table entries now point into virtual memory instead of physical memory and require further TLB look-ups to resolve the complete effective physical address. Much like process identifiers, a virtualization instance identifier has been introduced for the same purpose: avoiding full TLB flushes when context switching.</p>
<p>Modern hardware will generally have separate caches per page size, which means that to get the most performance, <a href="http://lwn.net/Articles/379748/">which page sizes are used for what data must be carefully planned</a>. For example, on certain embedded platforms, it is not uncommon to have 4KB pages used for code segments, data segments, and the stack while encouraging programmers to use 2MB pages inside their programs. It is also not uncommon for virtual pages to be introduced: a virtual page is composed of several smaller hardware pages. For example, you might request allocations to use 64KB or 4MB pages even though the underlying hardware only supports 4KB and 2MB pages. The distinction mostly matters for the kernel since managing larger pages implies lower bookkeeping overhead and faster servicing.</p>
<p>An important point bears mentioning: when pages larger than 4KB are used, the kernel must find contiguous physical memory to back them. This can be a problem when page sizes are mixed since it opens the door for fragmentation to rear its ugly head. When the kernel fails to find enough contiguous space even though enough total space remains, it has two choices:</p>
<ul>
<li>Bail out and return stating that you have run out of memory.</li>
<li>Defragment the physical memory by copying memory around and remapping the pages.</li>
</ul>
<p>If mixed page sizes are used, it is generally recommended to allocate large pages as early as possible in the process’s life to avoid the above problem.</p>
<h3 id="virtual-memory-secondary-storage">Virtual memory secondary storage</h3>
<p>The fact that modern hardware allows virtual memory to be mapped in your process without all the required pages being mapped to physical memory enables modern kernels to spill memory onto secondary storage, artificially increasing the amount of memory available up to the limits of that secondary storage.</p>
<p>As is common knowledge, the most common form of secondary storage is the swap file on your hard drive.</p>
<p>However, the kernel is free to place that memory anywhere, and implementations exist where the memory is distributed across a network or backed by some other medium (e.g: memory mapped files).</p>
<h3 id="conclusion">Conclusion</h3>
<p>This concludes our introduction to virtual memory. In later blog posts we will explore the TLB in further detail along with the implications virtual memory has on the CPU cache. Until then, here are some additional links:</p>
<ul>
<li><a href="http://www.cs.princeton.edu/courses/archive/fall11/cos318/lectures/L15_VMPaging.pdf">http://www.cs.princeton.edu/courses/archive/fall11/cos318/lectures/L15_VMPaging.pdf</a></li>
<li><a href="http://www.it.uu.se/edu/course/homepage/dark/ht11/dark15-vm2.pdf">http://www.it.uu.se/edu/course/homepage/dark/ht11/dark15-vm2.pdf</a></li>
<li><a href="http://www.informit.com/articles/article.aspx?p=29961&seqNum=4">http://www.informit.com/articles/article.aspx?p=29961&seqNum=4</a></li>
<li><a href="http://lwn.net/Articles/250967/">http://lwn.net/Articles/250967/</a></li>
</ul>
A memory allocator interface2015-05-11T00:00:00+00:00http://nfrechette.github.io/2015/05/11/memory_allocator_interface<p><a href="/2015/05/01/cpp_stl_container_deficiencies/">As previously discussed</a>, the memory allocator integration into C++ STL containers is far from ideal.</p>
<p><a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2554.pdf">Attempts to improve on it</a> have been made but I have yet to meet anyone satisfied with the current state of things.</p>
<p>Memory in large applications will typically come from two major places: new/malloc calls and container allocations. Today we will only discuss the latter.</p>
<h3 id="where-are-we-at">Where are we at?</h3>
<p>There appear to be two main approaches when it comes to memory allocator interfaces:</p>
<ul>
<li>Templating the container with the allocator used (C++ STL, Unreal 3)</li>
<li>Implementing an interface and perform a virtual function call (Bitsquid)</li>
</ul>
<p>The first, as previously discussed, has a number of downsides, but on the upside it is the fastest since allocation calls are likely to be inlined or at least go through a static branch.</p>
<p>The second, while much more flexible, introduces indirection and, as <a href="/2015/05/05/caches_everywhere/">previously discussed</a>, is slower and generally less performant. Not only do we introduce an extra cache miss (and likely a TLB miss) but we also introduce an indirect branch which the CPU will not be able to prefetch.</p>
<p>Can we do better?</p>
<h3 id="the-compromise">The compromise</h3>
<p>It seems that the cost of added flexibility is to use an indirect branch. This is unavoidable. However, we can remove the extra cache and TLB miss by defining a partial inline virtual function dispatch table in the allocator interface.</p>
<p>The idea is simple:</p>
<ul>
<li>There are typically few allocator instances (fewer than 100 instances isn’t unusual)</li>
<li>Allocator instances are often small in memory footprint and will often fit within one or two cache lines (even when they manage a lot of memory, the managed memory generally lies somewhere else and not inline with the allocator)</li>
<li>Most container allocations can be implemented with a single function: <code class="language-plaintext highlighter-rouge">realloc</code></li>
</ul>
<p>We can thus conclude that a viable alternative is to store a function pointer to a <code class="language-plaintext highlighter-rouge">realloc</code> function inline within the allocator and call it for all our needs.</p>
<p>Here is the code (<a href="https://github.com/nfrechette/gin/blob/master/include/gin/allocator.h">which is also on github</a>):</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="k">class</span> <span class="nc">Allocator</span>
<span class="p">{</span>
<span class="nl">protected:</span>
<span class="k">typedef</span> <span class="kt">void</span><span class="o">*</span> <span class="p">(</span><span class="o">*</span><span class="n">ReallocateFun</span><span class="p">)(</span><span class="n">Allocator</span><span class="o">*</span><span class="p">,</span> <span class="kt">void</span><span class="o">*</span><span class="p">,</span> <span class="kt">size_t</span><span class="p">,</span> <span class="kt">size_t</span><span class="p">,</span> <span class="kt">size_t</span><span class="p">);</span>
<span class="kr">inline</span> <span class="n">Allocator</span><span class="p">(</span><span class="n">ReallocateFun</span> <span class="n">reallocateFun</span><span class="p">);</span>
<span class="nl">public:</span>
<span class="k">virtual</span> <span class="o">~</span><span class="n">Allocator</span><span class="p">()</span> <span class="p">{}</span>
<span class="k">virtual</span> <span class="kt">void</span><span class="o">*</span> <span class="n">Allocate</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">alignment</span><span class="p">)</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">virtual</span> <span class="kt">void</span> <span class="n">Deallocate</span><span class="p">(</span><span class="kt">void</span><span class="o">*</span> <span class="n">ptr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">)</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kr">inline</span> <span class="kt">void</span><span class="o">*</span> <span class="n">Reallocate</span><span class="p">(</span><span class="kt">void</span><span class="o">*</span> <span class="n">oldPtr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">oldSize</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">newSize</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">alignment</span><span class="p">);</span>
<span class="k">virtual</span> <span class="kt">bool</span> <span class="n">IsOwnerOf</span><span class="p">(</span><span class="kt">void</span><span class="o">*</span> <span class="n">ptr</span><span class="p">)</span> <span class="k">const</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="nl">protected:</span>
<span class="n">ReallocateFun</span> <span class="n">m_reallocateFun</span><span class="p">;</span>
<span class="p">};</span>
<span class="kt">void</span><span class="o">*</span> <span class="n">Allocator</span><span class="o">::</span><span class="n">Reallocate</span><span class="p">(</span><span class="kt">void</span><span class="o">*</span> <span class="n">oldPtr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">oldSize</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">newSize</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">alignment</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">m_reallocateFun</span><span class="p">);</span>
<span class="k">return</span> <span class="p">(</span><span class="o">*</span><span class="n">m_reallocateFun</span><span class="p">)(</span><span class="k">this</span><span class="p">,</span> <span class="n">oldPtr</span><span class="p">,</span> <span class="n">oldSize</span><span class="p">,</span> <span class="n">newSize</span><span class="p">,</span> <span class="n">alignment</span><span class="p">);</span>
<span class="p">}</span></code></pre></figure>
<p>A number of things stand out and are important:</p>
<ul>
<li>This is not a proper interface but instead an abstract base class.</li>
<li>We provide some virtual functions for common things and more will come later (debugging features, etc.)</li>
<li>Reallocate is inline and <em>NOT</em> virtual</li>
<li>Deallocate/Reallocate follow the latest C++ standard and include the <em>size</em> used when the original allocation was made to allow further optimizations within allocator implementations.</li>
<li>Alignment must be provided, which is important for AAA video games and more generally for SIMD code.</li>
<li>The first argument to the <code class="language-plaintext highlighter-rouge">ReallocateFun</code> is an instance of the base class itself. This is important because the pointer refers to a static free standing function: when we call <code class="language-plaintext highlighter-rouge">Reallocate</code>, the implicit <code class="language-plaintext highlighter-rouge">this</code> present as the first argument is simply forwarded as is with no extra register shuffling (at least on x64), even if the call ends up not inlined, and it allows the implementation to call a member function without shuffling registers as well.</li>
</ul>
<p>Usage is very simple:</p>
<ul>
<li>An allocator simply derives from this base class, implements the necessary functions, and initializes the base class with a function pointer to a suitable reallocate function (see the sketch below).</li>
<li>Containers simply call <code class="language-plaintext highlighter-rouge">Reallocate(nullptr, 0, size, alignment)</code> to allocate, <code class="language-plaintext highlighter-rouge">Reallocate(ptr, size, 0, alignment)</code> to deallocate, and supply all arguments to reallocate.</li>
</ul>
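<p>As an illustration, here is what a trivial implementation might look like (a hypothetical malloc backed allocator, not part of <code class="language-plaintext highlighter-rouge">gin</code>; over-aligned requests and error handling are ignored for brevity):</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">#include <cstdlib>

class SystemAllocator : public Allocator
{
public:
    SystemAllocator() : Allocator(&ReallocateImpl) {}

    virtual void* Allocate(size_t size, size_t /*alignment*/) override
    {
        return malloc(size);    // alignment above the default is ignored in this sketch
    }

    virtual void Deallocate(void* ptr, size_t /*size*/) override
    {
        free(ptr);
    }

    virtual bool IsOwnerOf(void* /*ptr*/) const override
    {
        return true;    // cannot really tell with raw malloc
    }

private:
    // Static free standing function: 'allocator' is the implicit 'this'
    // forwarded by Allocator::Reallocate
    static void* ReallocateImpl(Allocator* /*allocator*/, void* oldPtr,
        size_t /*oldSize*/, size_t newSize, size_t /*alignment*/)
    {
        if (newSize == 0)
        {
            free(oldPtr);
            return nullptr;
        }

        return realloc(oldPtr, newSize);
    }
};</code></pre></figure>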
<p>It is often the case that containers will simply reallocate (e.g: vector<..> with POD) and as such this interface is ideally suited for them. With this implementation, when a container performs an allocation or deallocation, the cache miss to access the reallocate function pointer will load relevant allocator data into its cache line, leading to fewer wasted cycles than the virtual function dispatch approach while maintaining all the flexibility afforded by the indirection and the interface.</p>
<p>In the coming blog posts, I will introduce a number of memory allocators and containers that will use this new interface.</p>
Caches everywhere2015-05-05T00:00:00+00:00http://nfrechette.github.io/2015/05/05/caches_everywhere<p>The modern computer is littered with caches to help improve performance. Each cache has its own unique role. A solid understanding of these caches is critical in writing high performance software in any programming language. I will describe the most important ones present on modern hardware, but this is not meant to be an exhaustive list.</p>
<p>The modern computer is heavily based on the <a href="http://en.wikipedia.org/wiki/Von_Neumann_architecture">Von Neumann architecture</a>. This basically boils down to the fact that we need to get data into a memory of sorts before our CPU can process it.</p>
<p>This data will generally come from one of these sources: a chip (such as ROM), a hard drive or a network and this data must make it somehow all the way to the processor. Note that other external devices can DMA data into memory as well but the above three are the most common and relevant to today’s discussion.</p>
<p>I will not dive into the networking stack and the caches involved, but I will mention that there are a few and that understanding them can yield <a href="http://blog.superpat.com/2010/06/01/zero-copy-in-linux-with-sendfile-and-splice/">good gains</a>.</p>
<h3 id="the-kernel-page-cache">The kernel page cache</h3>
<p>Hard drives are <a href="https://gist.github.com/jboner/2841832">notoriously slow</a> compared to memory. To speed things up, a number of caches lie in between your application and the file on disk you are accessing. From the disk up to your application, the caches are in order: <a href="http://en.wikipedia.org/wiki/Disk_buffer">disk embedded cache</a>, <a href="http://en.wikipedia.org/wiki/Disk_array_controller">disk controller cache</a>, <a href="http://en.wikipedia.org/wiki/Page_cache">kernel page cache</a>, and last but not least, whatever <a href="http://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html">intermediate buffers</a> you might have in your user space application.</p>
<p>For the vast majority of applications dealing with large files, the single most important optimization you can perform at this level is to use <code class="language-plaintext highlighter-rouge">mmap</code> to access your files. You can find an excellent write up about it <a href="http://duartes.org/gustavo/blog/post/page-cache-the-affair-between-memory-and-files/">here</a>. I will not repeat what he says, but I will add that this is a massive win not only because it avoids needless copying of memory but also because you can <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/hh780543%28v=vs.85%29.aspx">prefetch easily</a>, something not so easily achieved with <code class="language-plaintext highlighter-rouge">fread</code> and the likes.</p>
<p>If you are writing an application that does a lot of IO or where IO needs to happen as fast as possible (such as console/mobile games), you should ideally be using memory mapped files where they are supported (<a href="http://developer.android.com/reference/java/nio/channels/FileChannel.html">Android supports it</a>, <a href="http://stackoverflow.com/questions/13425558/why-does-mmap-fail-on-ios">iOS supports it but with a 700MB virtual memory limit on 32 bit processors</a>, and both Windows and Linux obviously support it). I am not 100% certain, but in all likelihood the newest game consoles (Xbox One and PlayStation 4) should support it as well. However, note that not all of these platforms necessarily have a kernel page cache; it is a safe bet that going forward, newer devices will support it.</p>
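<p>For reference, here is a minimal POSIX sketch of memory mapping a file for reading (error handling elided; the file name is hypothetical):</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void ReadWholeFile()
{
    int fd = open("my_large_file.bin", O_RDONLY);
    struct stat fileInfo;
    fstat(fd, &fileInfo);

    // The kernel page cache backs the mapping directly: no intermediate copies
    void* data = mmap(nullptr, fileInfo.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    // ... read from 'data' as if it were an ordinary in-memory buffer ...

    munmap(data, fileInfo.st_size);
    close(fd);
}</code></pre></figure>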
<h3 id="the-cpu-cache">The CPU cache</h3>
<p>The next level of caches lies much closer to the CPU and takes the form of the popular <a href="http://en.wikipedia.org/wiki/CPU_cache">L1, L2, and L3 caches</a>. Not all processors will have these, and when they do, they might only have one or two levels. Higher levels are generally inclusive of the lower levels but <a href="http://stackoverflow.com/questions/19775173/inclusive-or-exclusive-l1-l2-cache-in-intel-core-ivybridge-processor">not always</a>. For example, while the L3 will generally be inclusive of the L2, the L2 might not be inclusive of the L1. This is because some instructions such as <a href="http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0438d/BABJCGGE.html">prefetches</a> will fill the L2 but not the L1 and can thus cause eviction of cache lines that are in the L1.</p>
<p>The CPU cache always moves data in and out of memory in units called <a href="http://stackoverflow.com/questions/3928995/how-do-cache-lines-work">cache lines</a> (they will always be aligned in memory to a multiple of the cache line size). Cache line sizes vary depending on the hardware. Popular values are: 32 bytes on some older mobile processors, 64 bytes on most modern mobile, desktop, and game console processors, and 128 bytes on some PowerPC processors (notably the Xbox 360 and the PlayStation 3).</p>
<p>This cache level will contain code that is executing, data being processed, and translation look-aside buffer entries (for both code and data, see next section). They might be dedicated to either code or data (e.g: L1) or be inclusive of both (e.g: L2/L3).</p>
<p>The CPU cache is usually N-way set associative. You can find a good write up about this <a href="http://www.hardwaresecrets.com/article/How-The-Memory-Cache-Works/481/8">here</a>. Note that depending on the cache level, where the cache line index and tag come from might differ. For example, the <a href="http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0360f/CHDEJHGD.html">code L1 might use the physical memory address to calculate both the index and the tag</a> while the <a href="http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0360f/CBAHAAGB.html">data L1 might use the virtual memory address for both the index and the tag</a>. The L2 is also free to choose whatever arrangement it pleases and, in theory, the two could be mixed (tag from virtual, index from physical).</p>
<p>When the CPU requests a memory address, it is said to <em>hit</em> the cache if the desired cache line is inside a particular cache level, and to <em>miss</em> if it is not, in which case we must look either in a higher cache level or, worse, main memory.</p>
<p>This level of caching is the primary reason why packing things as tightly as possible in memory is important for high performance: fetching a cache line from main memory is slow, and when you do, you want most of that cache line to contain useful information to reduce waste. The larger the cache line, the more waste will generally be present. Things can be pushed further by aligning your data along cache lines when you know a cache miss will happen for a particular data element and by grouping relevant members next to it. This should not be done carelessly, as it can easily degrade performance if you are not careful.</p>
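<p>In C++11, aligning a hot structure to a cache line boundary is a one-liner (the 64 byte figure and the structure itself are illustrative; the right value depends on the target hardware):</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">#include <cstdint>

// Assume 64 byte cache lines
struct alignas(64) HotData
{
    float positionX[4];
    float positionY[4];
    float positionZ[4];
    uint32_t flags[4];
};

static_assert(sizeof(HotData) % 64 == 0, "HotData should span whole cache lines");</code></pre></figure>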
<p>This level of caching is also an important reason why calling a virtual function is often painted as slow: not only must you incur a potential cache miss to read the pointer to the virtual function table, but you must also incur a potential cache miss for the pointer to the actual function to call afterwards. It is also generally the case that these two memory accesses will not be on the same cache line as they will generally live in different regions in memory.</p>
<p>This level of caching also partly explains why reading and writing unaligned values is slower (when the CPU supports it at all). Not only must the CPU split or reconstruct the value from potentially two different cache lines, but each access is potentially a separate cache miss.</p>
<p>Generally speaking, all cache levels discussed up to here are shared between processes that are currently executed by the operating system.</p>
<h3 id="the-translation-look-aside-buffer">The translation look-aside buffer</h3>
<p>The next level of caching is the <a href="http://en.wikipedia.org/wiki/Translation_lookaside_buffer">translation look-aside buffer</a>, or TLB for short.</p>
<p>The TLB is responsible for caching the results of translating virtual memory addresses into physical memory addresses. Translation happens at the granularity of a fixed page size, and modern processors generally support two or more page sizes, the most popular on x64 hardware being 4KB and 2MB. Modern CPUs will often support 1GB pages as well, but using anything but 4KB pages in a user space application is sometimes not trivial. However, on game console hardware, it is common for the kernel to expose this, and it is an important tool for high performance.</p>
<p>The TLB will generally have separate caches for the different page sizes and it often has several levels of caching as well (L1 and L2, note that these are separate from the CPU caches mentioned above). When the CPU requests the translation of a virtual memory address, since it does not yet know the page size used, it will look in all TLB L1 caches for all page sizes and attempt to find a match. If it fails, it will look in all L2 caches. These operations are often done in parallel at every step. If it fails to find a cached result, a table walk will generally follow or a callback into the kernel happens. This step is potentially <a href="https://www.kernel.org/doc/gorman/html/understand/understand006.html">very expensive</a>! On x64 with 4KB pages, typically four memory accesses will need to happen to find which physical memory frame (frame is generally the word used to refer to the unit of memory management the hardware and kernel use) contains the data pointed to and a fifth memory access to finally load that data into the CPU cache. Using 2MB pages will remove one memory access reducing the total from five to four. Note that each time a TLB entry is accessed in memory, it will be cached in the CPU cache like any other code or data.</p>
<p>This sounds very scary but in practice, things are not as dire as they may seem. Since all the memory accesses are cached, a TLB miss does not generally result in five cache misses. Pages cover a large range of memory addresses, and the top levels of the page table hierarchy each cover a large range of virtual memory; as such, the less virtual memory touched, the less data the TLB needs to manage. However, a TLB entry miss at a level N will generally guarantee that all lower TLB entry accesses result in cache misses as well. In essence, not all TLB misses are equal.</p>
<p>As must be obvious by now, every scenario where I mentioned the potential for CPU cache misses in the previous section is potentially even worse if it results in a TLB miss as well. For example, as previously mentioned, virtual tables require an extra memory access. This extra memory access, by virtue of being in a separate memory region (the virtual table itself is read only and laid out by the linker while the pointers to said virtual table could be on the heap, the stack, or part of the data segment), will typically require a separate cache line and separate TLB entries. It is clear that the indirection has a very real cost, not only in terms of the branch the CPU must take and cannot predict, but also in the extra pressure on the CPU and TLB caches.</p>
<p>Note that when a hypervisor is used with multiple operating systems running concurrently, since the physical memory is shared between all of these, generally speaking when looking up a TLB entry, an additional virtual memory address translation might need to take place to find the true location. Depending on the security restrictions of the hypervisor, it might elect to share the TLB entries between guests and the hypervisor or not.</p>
<h3 id="the-micro-tlb">The micro TLB</h3>
<p>Yet another popular cache that is not present (as far as I know) in x64 hardware but common on PowerPC and ARM processors is a higher level cache for the TLB. ARM calls this the micro TLB while PowerPC calls it ERAT.</p>
<p>This cache level, like the previous one, caches the result of translation of virtual addresses into physical addresses but it uses a different granularity from the TLB. For example, while the <a href="http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0535b/CACCAFHH.html">ARM processors</a> will generally support pages of 4KB, 64KB, 1MB, and sometimes 16MB; the micro TLB will generally have a granularity of 4KB or 1MB. What this means is that the micro TLB will miss more often but will often hit in the TLB afterwards if a larger page is used.</p>
<p>This cache level will generally be split for code and data memory accesses but will generally contain mixed page sizes and it is often fully associative due to its reduced size.</p>
<p>The CPU, TLB, and micro TLB caches are not only shared by the currently running processes of the current operating system; when running in a virtualized environment, they are also shared between all the other operating systems. When the hardware does not support this sharing by means of registers holding a process identifier and a virtual environment identifier, these caches must generally be flushed or cleared when a switch happens.</p>
<h3 id="the-cpu-register">The CPU register</h3>
<p>The last important cache level in a computer I will discuss today is the CPU register. The register is the basic unit modern processors use to manipulate data. As has been outlined so far, getting data here was not easy and the journey was long. It is no surprise that at this level, everything is now very fast and as such packing information in a register can yield good performance gains.</p>
<p>Values are loaded into and out of registers directly from the L1, assisted by the TLB. Register sizes keep growing over the years: 16bit is a thing of the past, 32bit is still very common on mobile processors, 64bit is now the norm on newer mobile and desktop processors, and processors with multimedia or vector processing capability will often have 128bit or even 256bit registers. In this latter case, we only need two 256bit registers to hold an entire cache line (generally 64 bytes on these processors).</p>
<h3 id="conclusion">Conclusion</h3>
<p>This last point hammers in the overall high performance mantra: don’t touch what you don’t need and do the most work you can with as little memory as possible.</p>
<p>This means loading as little as possible from a hard drive or the network when such accesses are time sensitive, or looking into compression: it is not unusual for simple compression or packing algorithms to improve throughput significantly.</p>
<p>Use as little virtual memory as you can and ideally use large pages if you can to reduce TLB pressure. Data used together should ideally reside close together in memory. Note that virtual memory exists primarily to help reduce physical memory fragmentation, but the larger the pages you use, the less it helps you.</p>
<p>Pack as much relevant data as possible together to help ensure that it ends up on the same cache line or better yet, align it explicitly instead of leaving it to chance.</p>
<p>Pack as much relevant data as possible in register wide values and manipulate them as an aggregate to avoid individual memory accesses (bit packing).</p>
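<p>As a tiny example of that last point (the field layout is illustrative):</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">#include <cstdint>

// Four 16 bit fields packed into a single 64 bit register wide value
inline uint64_t Pack(uint16_t a, uint16_t b, uint16_t c, uint16_t d)
{
    return (uint64_t(a) << 48) | (uint64_t(b) << 32) | (uint64_t(c) << 16) | uint64_t(d);
}

inline uint16_t UnpackA(uint64_t packed) { return uint16_t(packed >> 48); }</code></pre></figure>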
<p>Ultimately, each cache level is like an instrument in an orchestra: they must play in concert to sound good. Each has its part and purpose. You can tune each one individually all you like, but if the overall coordination is not there, it will not sound good. It is thus not about the mastery of any one topic but about understanding how to make them work well together.</p>
<p>This post is merely an overview of the various caches and their individual impacts. A lot of information is incomplete or missing to keep this post concise (I tried…). I hope to revisit each of these topics in separate posts when time allows.</p>
Introducing the 'gin' library2015-05-03T00:00:00+00:00http://nfrechette.github.io/2015/05/03/introducing_gin<p><code class="language-plaintext highlighter-rouge">gin</code> aims to be a playground for high performance C++ ideas all under the MIT license. <code class="language-plaintext highlighter-rouge">gin</code> is currently empty but it will not remain so for long!</p>
<p>Here is an overview of the things I intend to cover in the coming posts and that will be included in the library:</p>
<ul>
<li>A convenient memory allocator interface suitable for high performance and containers</li>
<li>Several high performance memory allocators (linear, frame, small block, heap, external, etc.)</li>
<li>A number of containers using the above technology (all sorts of array variants, bit sets, etc.)</li>
<li>And much more!</li>
</ul>
<p>I will use C++11 features where relevant and keep things in headers as much as I can for easy integration as well as <a href="http://bitsquid.blogspot.ca/2012/11/bitsquid-foundation-library.html">all the popular reasons</a>.</p>
<p>I chose the name <code class="language-plaintext highlighter-rouge">gin</code> for a number of reasons: it is a refreshing drink for the upcoming summer months, there is no associated programming material with that word, it is short, and if I ever need a side project, I can cleverly name it <code class="language-plaintext highlighter-rouge">tonic</code>.</p>
<p>The code can be found <a href="https://github.com/nfrechette/gin">here</a> on github.</p>
C++ STL container deficiencies2015-05-01T00:00:00+00:00http://nfrechette.github.io/2015/05/01/cpp_stl_container_deficiencies<p>In the AAA video game community, it is common to reimplement containers to fix a number of deficiencies, with EA STL being a very good example of this. You can find a document <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2271.html">here</a> that explains in great detail the reasoning behind it. Some of the points are a bit dated and have been fixed with C++11 or will be fixed in C++14. I will not repeat them here but I will add some of my own.</p>
<h3 id="some-function-names-are-un-intuitive">Some function names are un-intuitive</h3>
<p><code class="language-plaintext highlighter-rouge">deque</code> and <code class="language-plaintext highlighter-rouge">list</code> have <code class="language-plaintext highlighter-rouge">push/pop</code>, which are symmetric, and this is good. But <code class="language-plaintext highlighter-rouge">insert/erase</code> seems a poor choice of words, and the <code class="language-plaintext highlighter-rouge">emplace</code> variants overlap both functionalities.
<code class="language-plaintext highlighter-rouge">vector</code> has a mix of queue/stack like functions (<code class="language-plaintext highlighter-rouge">push/pop</code> and <code class="language-plaintext highlighter-rouge">front/back</code>), but this choice of words is a poor fit when talking about arrays.</p>
<p>Naming things is always a sensitive subject; in the interest of discussion I give the following two examples: Java uses <code class="language-plaintext highlighter-rouge">add/remove</code> for all container operations while C# uses <code class="language-plaintext highlighter-rouge">Add/Remove</code> and <code class="language-plaintext highlighter-rouge">Insert/RemoveAt</code>.</p>
<h3 id="vector-is-a-poor-container-name">Vector is a poor container name</h3>
<p>In mathematical applications, vectors are N dimensional, which the STL container name reflects, but in video games vectors are almost always 2D, 3D, or 4D. Ultimately, this can cause confusion. Even in the <code class="language-plaintext highlighter-rouge">vector</code> documentation, the container is explained in terms of arrays, while <code class="language-plaintext highlighter-rouge">array</code> is reserved for an array of fixed size that also happens to be inline. In practice, there are many useful variants of array implementations, and adding a simple qualifier to the name makes everything much clearer: <code class="language-plaintext highlighter-rouge">FixedArray</code>, <code class="language-plaintext highlighter-rouge">InlineArray</code>, <code class="language-plaintext highlighter-rouge">DynamicArray</code>, <code class="language-plaintext highlighter-rouge">HybridArray</code>, etc.</p>
<h3 id="the-meaning-of-the-word-size-is-overloaded-and-confusing">The meaning of the word ‘size’ is overloaded and confusing</h3>
<p>Size in C++ means many things. <code class="language-plaintext highlighter-rouge">size_t</code> is defined as the type of the result of the <code class="language-plaintext highlighter-rouge">sizeof</code> family of operators. On most systems it will be the same size as the largest integral register (32 or 64 bits) and will typically match <code class="language-plaintext highlighter-rouge">uintptr_t</code> for this reason. The <code class="language-plaintext highlighter-rouge">sizeof</code> operators also return a size and this size is measured in bytes (which can be 8 or more bits depending on the platform). For the containers, <code class="language-plaintext highlighter-rouge">size()</code> returns the number of elements.</p>
<p>Size is thus a type, a number of bytes and a number of elements depending on the context.</p>
<h3 id="templating-the-allocator-on-the-container-type-is-a-bad-idea">Templating the allocator on the container type is a bad idea</h3>
<p>EA STL touches on most of the issues related to the poor STL memory allocator support, but it misses an important point: the speed gained by templating on the allocator type makes refactoring code much harder. It is not unusual to write a function that takes a container as argument and, at a later time, for the need to change the allocator type to arise. This forces the user to change every site where the allocator type appears, which can end up being many functions. I have witnessed optimizations that could not be performed because the amount of work would be too great due to this. It is also often desirable to write allocator agnostic functions that require allocation/deallocation without templating the whole function on the allocator type.</p>
<p>A common fix for this is to use an allocator interface with virtual functions which is simple and clean. The bitsquid engine <a href="http://bitsquid.blogspot.ca/2010/09/custom-memory-allocation-in-c.html">does this</a>. It is not without its faults but it is ultimately much more flexible. I plan to discuss this in further detail in later posts.</p>
<h3 id="containers-are-often-larger-than-they-need-to-be">Containers are often larger than they need to be</h3>
<p>It is not unusual to end up with hundreds of thousands or even millions of array containers and the likes. In this scenario, every byte counts and this can end up bloating objects that hold them. Generally, <code class="language-plaintext highlighter-rouge">size_t</code> is far too large to keep track of sizes and in fact, many video game engines will simply hardcode a <code class="language-plaintext highlighter-rouge">uint32_t</code> instead (only the most recent game consoles have more than 4GB of memory, and just barely). However, even this is often wasteful as the norm is to have tiny arrays rather than large ones. It would be much more flexible if we had a base container templated on the type used to track the sizes, with all popular variants defined: <code class="language-plaintext highlighter-rouge">Vector8</code>, <code class="language-plaintext highlighter-rouge">Vector16</code>, <code class="language-plaintext highlighter-rouge">Vector32</code>, <code class="language-plaintext highlighter-rouge">Vector64</code>, etc. The container should then check for overflow and assert/fail if it occurs. We could have a default <code class="language-plaintext highlighter-rouge">Vector</code> without a size specified that defaults to a sensible configurable value (e.g: <code class="language-plaintext highlighter-rouge">Vector32</code> or <code class="language-plaintext highlighter-rouge">Vector64</code>).</p>
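<p>A sketch of the idea (the <code class="language-plaintext highlighter-rouge">VectorImpl</code> name and layout are illustrative):</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">#include <cstdint>

template<typename T, typename SizeType>
class VectorImpl
{
    // ... the usual dynamic array implementation, asserting/failing if the
    // number of elements would overflow SizeType ...
    T* m_data;
    SizeType m_size;
    SizeType m_capacity;
};

template<typename T> using Vector8  = VectorImpl<T, uint8_t>;     // up to 255 elements
template<typename T> using Vector16 = VectorImpl<T, uint16_t>;    // up to 65535 elements
template<typename T> using Vector32 = VectorImpl<T, uint32_t>;
template<typename T> using Vector64 = VectorImpl<T, uint64_t>;</code></pre></figure>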
<p>Another source of waste is the allocator contained inside the containers. For reasons discussed above, it is desirable to store a pointer to the allocator inside containers. Since a container that requires allocation already holds a pointer to the allocated memory, and since the container only ever has two states (unallocated and allocated), we can store the allocator pointer inside the data pointer while unallocated (using either a flag in the LSB or the capacity of the container to tell the two states apart) and simply append it at the front or back of the allocated memory once allocated. This saves a few bytes in the container object itself, reducing the size of the holding object. I will discuss this again in a future blog post.</p>
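<p>A sketch of the pointer packing trick, assuming allocators are at least 2 byte aligned so the LSB is free to use as a state flag (names are hypothetical):</p>
<div class="language-cpp highlighter-rouge"><pre class="highlight"><code>#include <cstdint>

class Allocator;    // the virtual allocator interface discussed above

class PackedVector
{
public:
    explicit PackedVector(Allocator* allocator)
        // Unallocated state: the allocator pointer lives in the data pointer,
        // tagged with a flag in the LSB
        : m_data(reinterpret_cast<uintptr_t>(allocator) | 1)
    {
    }

    Allocator* allocator() const
    {
        if (m_data & 1)
            // Unallocated: strip the flag to recover the allocator pointer
            return reinterpret_cast<Allocator*>(m_data & ~uintptr_t(1));

        // Allocated: the allocator pointer was appended at the front of the
        // allocated block, just before the element data
        return reinterpret_cast<Allocator* const*>(m_data)[-1];
    }

private:
    uintptr_t m_data;   // data pointer or tagged allocator pointer
};
</code></pre></div>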
<h3 id="removing-items-from-an-unsorted-array">Removing items from an unsorted array</h3>
<p>It is generally the norm that array containers will hold their contents unsorted. When this is the case, upon removing an item, we can simply overwrite it with the last item of the array instead of shifting all of the following elements left. This turns an O(n) shift into a single O(1) move and gives the operation a much more deterministic runtime cost.</p>
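<p>A minimal sketch of the idea on top of <code class="language-plaintext highlighter-rouge">std::vector</code>:</p>
<div class="language-cpp highlighter-rouge"><pre class="highlight"><code>#include <cstddef>
#include <utility>
#include <vector>

// Removes the element at 'index' in O(1) by overwriting it with the last
// element instead of shifting everything left in O(n). Only valid when the
// element order does not matter.
template<typename T>
void unordered_remove(std::vector<T>& values, std::size_t index)
{
    values[index] = std::move(values.back());
    values.pop_back();
}
</code></pre></div>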
<h3 id="containers-hide-everything">Containers hide everything</h3>
<p>Sometimes, having a C style API with a raw view is beneficial. <a href="http://bitsquid.blogspot.ca/2012/11/bitsquid-foundation-library.html">The bitsquid foundation library</a> explains and uses this. At the same time, in the vast majority of cases, it does not make sense to expose programmers directly to this as it can be error prone (it becomes easy to leave a container out of sync or corrupt it). For example, bit sets often come in two variants: an inline array of bits and a dynamically allocated array of bits, and the code is often duplicated to support both cases. At times a third case arises: when you need your dynamic bit array to live inline as part of a larger memory block (e.g.: imagine a dynamic array where each slot also stores a bit; with the STL containers, we would need two dynamic allocations when one would suffice). In the latter scenario, none of the bit set functions can be reused and they must again be duplicated. In practice, all three cases can be implemented by reusing functions that take as input a raw view of the bit set along with the number of bits.</p>
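<p>A sketch of the raw view idea: free functions that operate on a bare word array, which the inline, dynamic, and embedded variants can all wrap (names are illustrative):</p>
<div class="language-cpp highlighter-rouge"><pre class="highlight"><code>#include <cstddef>
#include <cstdint>

// Raw view functions: no ownership, no allocation knowledge, just bits
inline void bit_set(uint32_t* words, std::size_t bit_index)
{
    words[bit_index / 32] |= uint32_t(1) << (bit_index % 32);
}

inline bool bit_test(const uint32_t* words, std::size_t bit_index)
{
    return (words[bit_index / 32] & (uint32_t(1) << (bit_index % 32))) != 0;
}

// The inline variant simply wraps a fixed array...
template<std::size_t NumBits>
struct InlineBitSet
{
    uint32_t m_words[(NumBits + 31) / 32] = {};

    void set(std::size_t bit_index) { bit_set(m_words, bit_index); }
    bool test(std::size_t bit_index) const { return bit_test(m_words, bit_index); }
};

// ...while the dynamic and embedded variants would call the same free
// functions on whatever memory block they own or live inside.
</code></pre></div>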
<p>I plan to look at a number of these topics more in depth in future blog posts, stay tuned!</p>
Virtual function dispatching2015-04-30T00:00:00+00:00http://nfrechette.github.io/2015/04/30/virtual_function_dispatching<p>Very often, <a href="http://programmers.stackexchange.com/questions/80591/why-is-no-c-interview-complete-if-it-does-not-have-vtable-questions">C++ interviews ask about the virtual table</a>: how it works, how it is implemented, and the tricky bits around constructors & destructors. However, the following questions are rarely asked: why is it the way it is and are there alternatives?</p>
<p>Seeing how the C++ standard does not mandate how virtual function dispatching should be implemented, it is interesting that most compilers ended up converging on the current table approach. The <a href="http://en.wikipedia.org/wiki/Virtual_method_table">wikipedia article</a> does a fairly good job of explaining how it works and gives a good idea of the problem it is trying to solve. However, it fails to address the assumptions and performance constraints that likely shaped the current design.</p>
<p>For simplicity’s sake, we will only consider the case of single inheritance but keep in mind all methods discussed here can be extended to support multiple inheritance.</p>
<p>Let us go back to the drawing board and take a closer look at the problem we are trying to solve, what our tools are and our performance constraints.</p>
<h3 id="the-problem">The problem:</h3>
<p>It is often desirable to derive from a base class and override part of its behaviour. This is best achieved with a specialized function that the base class (or, for that matter, any code) calls when present, falling back to the base implementation when, at a particular function call site, our only knowledge is the type of the base class.</p>
<h3 id="the-tools">The tools:</h3>
<p>All classes and functions are known at compile time and ideally we want to leverage that as much as possible.</p>
<h3 id="the-performance-constraints">The performance constraints:</h3>
<p>Memory and CPU cycles are slow and scarce, and we want this mechanism to be as fast and as compact in memory as possible. We can safely assume that some classes will have a large number of such virtual functions and, likewise, that some classes will have a large number of instances.</p>
<h3 id="the-inline-solution">The inline solution:</h3>
<p>The simplest solution is to store a function pointer for every such virtual function inside the object itself. At object construction, we set these pointers; later, when we need to invoke a virtual function, we call the corresponding pointer.</p>
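<p>A sketch of the idea with a hypothetical <code class="language-plaintext highlighter-rouge">Shape</code> type:</p>
<div class="language-cpp highlighter-rouge"><pre class="highlight"><code>struct Shape
{
    // One function pointer per 'virtual' function, stored inline in the object
    float (*get_area)(const Shape* shape);
    float (*get_perimeter)(const Shape* shape);

    float width;
    float height;
};

static float rectangle_area(const Shape* shape)
{
    return shape->width * shape->height;
}

static float rectangle_perimeter(const Shape* shape)
{
    return 2.0f * (shape->width + shape->height);
}

// 'Construction' simply sets the pointers
inline void make_rectangle(Shape& shape, float width, float height)
{
    shape.get_area = &rectangle_area;
    shape.get_perimeter = &rectangle_perimeter;
    shape.width = width;
    shape.height = height;
}

// A virtual call is a single indirect call through the object itself:
// float area = shape.get_area(&shape);
</code></pre></div>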
<p>The upside of such a simple method is that the information we need is right there inside the object and we can access it directly. The function pointer will also share whatever <a href="http://en.wikipedia.org/wiki/CPU_cache">cache line</a> is loaded to access it with the adjacent object data. By identifying which data is used by which function, we could move relevant data close to that function pointer and improve locality. Calling such a virtual function thus incurs a single data cache miss (to access the pointer).</p>
<p>The downside is that it does not scale. The more virtual functions we add, the larger our objects will end up being. This is made even worse on platforms with 64 bit pointers. The cost to initialize those data members also increases with each new virtual function. Finally, if enough virtual functions are added, our cache line might end up being completely filled by them such that no other immediately useful information ends up being loaded anyway.</p>
<p>Ultimately, this method will not degrade gracefully when either the number of object instances increases or the number of virtual functions rises.</p>
<p>This method is ideal when:</p>
<ul>
<li>the number of virtual functions per class is low</li>
<li>meaningful data can be moved close to the virtual function pointer (eg: a getter and the returned value)</li>
<li>the number of class instances is low</li>
</ul>
<h3 id="the-dispatch-function">The dispatch function:</h3>
<p>A more flexible approach is for our objects to hold a single function pointer to a per class dispatch function. This function is called in a fashion similar to syscall(..): an extra argument indicates which virtual function is requested. On object construction, as with the previous approach, we simply set the proper dispatch function.</p>
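<p>A sketch of the mechanism (all names here are hypothetical):</p>
<div class="language-cpp highlighter-rouge"><pre class="highlight"><code>// Each 'virtual' function gets a compile time identifier
enum class FunctionId { GetArea, GetPerimeter };

using AnyFn = void (*)();

struct Shape
{
    // A single per class dispatch function pointer per object
    AnyFn (*dispatch)(Shape* shape, FunctionId function_id);

    float width;
    float height;
};

static float rectangle_area(Shape* shape) { return shape->width * shape->height; }
static float rectangle_perimeter(Shape* shape) { return 2.0f * (shape->width + shape->height); }

// The per class dispatch function maps an identifier to the actual function.
// A derived class version would defer to its base class for unhandled ids.
static AnyFn rectangle_dispatch(Shape* /*shape*/, FunctionId function_id)
{
    switch (function_id)
    {
    case FunctionId::GetArea:       return reinterpret_cast<AnyFn>(&rectangle_area);
    case FunctionId::GetPerimeter:  return reinterpret_cast<AnyFn>(&rectangle_perimeter);
    }
    return nullptr;
}

// Call site: resolve the identifier, cast back to the real signature, call
// using AreaFn = float (*)(Shape*);
// AreaFn get_area = reinterpret_cast<AreaFn>(shape.dispatch(&shape, FunctionId::GetArea));
// float area = get_area(&shape);
</code></pre></div>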
<p>The upside is that our cost per object is now a single function pointer and we gain a lot of flexibility. At runtime we could add extra data to change the dispatching behaviour (and indeed many dynamic languages use a mechanism similar to this for duck typing and the like) and we are also very flexible in how we store the virtual function pointers: we can either store them in an array somewhere along with the identifiers that we need to match, or we can store them inline in the assembly stream. The latter approach is also interesting because it could end up more compact than the former on architectures where instruction sizes vary and that support relative jumps (e.g.: x64). Another point in favour of the latter approach is that the code cache is generally under less pressure than the data cache, by virtue of the fact that code execution is largely linear and more easily prefetched, and that there is less code than data overall. Last but not least, a dispatch function for a given class only needs to test the virtual functions it implements and can defer resolution to its base class should it not handle the call.</p>
<p>The downside is that finding the correct virtual function pointer to call requires extra branching which could end up being poorly predicted. Various methods can mitigate this: sorting the virtual functions in a tree structure to guarantee O(log(n)) access time (it could even be updated at runtime to favour hot functions) or introducing a bloom filter as an early out mechanism. Another problem is the calling convention: for efficiency reasons we need a register to hold the virtual function identifier when we call our dispatch function, and that register must be reserved to avoid shuffling all the registers and the stack when the effective virtual function is called. While this approach scales well as the number of object instances increases, it does not scale as well when the number of virtual functions rises.</p>
<p>Compared to the previous approach, we will incur one data cache miss for the dispatch function pointer and one (or more) code cache misses for the dispatching code. The compiler could be smart here and position the dispatch code close to the virtual functions of the same class in the hope that it ends up on the same page, thus avoiding a <a href="http://en.wikipedia.org/wiki/Translation_lookaside_buffer">TLB</a> miss when the actual function is called.</p>
<p>Much like the previous approach, this method does not degrade gracefully when the number of virtual functions increases.</p>
<p>This method is ideal when:</p>
<ul>
<li>The dispatching behaviour needs to be altered at runtime</li>
<li>The dispatching logic is complicated</li>
<li>The number of virtual functions per class is low</li>
<li>The number of class instances is high</li>
</ul>
<h3 id="the-virtual-table">The virtual table:</h3>
<p>Finally we reach the currently popular implementation: a single pointer to an array of virtual function pointers. As with the previous methods, on object construction we simply set the proper array pointer. For this method to work, the array must contain function pointers for all virtual functions of the current class as well as all of its base classes. When we perform a virtual function call, since we know at the call site which function is being called, we can infer its offset in the array and simply call the correct virtual function pointer.</p>
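<p>A sketch of the layout a compiler might generate, written out by hand (names are hypothetical):</p>
<div class="language-cpp highlighter-rouge"><pre class="highlight"><code>struct Shape;

// One entry per virtual function of the class and all of its base classes.
// A derived class provides its own table with overriding pointers at the
// same offsets.
struct ShapeVTable
{
    float (*get_area)(const Shape* shape);
    float (*get_perimeter)(const Shape* shape);
};

struct Shape
{
    const ShapeVTable* vtable;  // a single pointer per object, one table per class

    float width;
    float height;
};

static float rectangle_area(const Shape* shape) { return shape->width * shape->height; }
static float rectangle_perimeter(const Shape* shape) { return 2.0f * (shape->width + shape->height); }

static const ShapeVTable rectangle_vtable = { &rectangle_area, &rectangle_perimeter };

// Construction sets the table pointer; a call site knows the offset of the
// function it needs at compile time, making the lookup O(1):
// float area = shape.vtable->get_area(&shape);
</code></pre></div>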
<p>The upside is that like the previous approach, this scales well to a large number of object instances due to the low per object overhead but unlike the previous approach, it also scales well to a large number of virtual functions since finding the correct one is O(1).</p>
<p>There are no obvious downsides and in the end, this method will degrade gracefully.</p>
<p>However, things are not perfect: the array of virtual functions must live somewhere. When we call a virtual function, we thus incur two data cache misses: one for the pointer to the array and another for the actual virtual function pointer used. This second cache miss may very well not contain relevant information in the cache line besides the few bytes we need for the pointer. This will of course depend heavily on the usage but in general, much of that cache line will be wasted. Because we can’t position the array in a meaningful location, it is also possible that the page where it resides contains no other relevant information, putting pressure on the TLB. In practice, other virtual function arrays are likely to end up in that page and a large number of them can fit inside. Thus a program whose virtual function arrays total 16KB could very easily end up permanently using four 4KB TLB entries (it is typical for most kernels to use 4KB pages for this sort of data). Incidentally, with increased TLB pressure, we also put increased pressure on higher level caches such as ERAT on PowerPC and the <a href="http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0360e/CHDDIJBD.html">micro TLB</a> on ARM chips.</p>
<p>This method is ideal when:</p>
<ul>
<li>The number of class instances is high</li>
<li>The number of virtual functions per class is high</li>
</ul>
<h3 id="the-conclusion">The conclusion:</h3>
<p>Depending on the scenario and usage, each variant could end up being the superior choice but ultimately, if a language does not expose control over which method is used for this, it must make a safe choice that supports its relevant use cases (e.g.: duck typing at runtime) and degrades as gracefully as possible under unknown scenarios.</p>
The need for speed2015-04-26T00:00:00+00:00http://nfrechette.github.io/2015/04/26/the_need_for_speed<p>The art of writing fast software is subtle and ever changing. Fundamentally, it is all about where your data comes from and how you access it.</p>
<p>At a high level, you have the entire field of algorithmic complexity dedicated to touching the least amount of data for a given task. At this level, the actual hardware that executes the algorithm hardly matters since we deal in very large numbers, but this is not to say that it is irrelevant; faster hardware will always help.</p>
<p>As the amount of data manipulated shrinks, the gains from an optimal algorithm versus a slightly less optimal one will blur and a new perspective is necessary.</p>
<p>At a lower level, knowing how to leverage the hardware becomes of paramount importance. The operating system and the hardware go to great lengths to cache data in various ways and knowing how they are used is key in squeezing the very last drop of performance.</p>
<p>Last but not least, the modern processor is really a <a href="http://blog.erratasec.com/2015/03/x86-is-high-level-language.html#.VUGC8q3BzGc">tiny virtual computer</a>. At this micro level, the instruction stream ordering and what instructions are used can still make a significant difference on performance.</p>
<p>Each tier has its own optimization specialists but they do not all enjoy the same amount of attention. The reason for this is simple: they are not all equal in value. Out in the real world, most programmers are impacted largely by the first and second tiers, in that order. It isn’t unusual to work with thousands of elements or more and at these scales, algorithmic complexity and good cache usage will always dominate. The last tier is mostly reserved for extreme low level programmers: hardware designers, embedded programmers, compiler programmers and specialized library programmers.</p>
<p>The secret to writing fast software is to make your assumptions explicit and measure early and often. Explicit assumptions are key to validating and documenting that the choices you make as a result are sound. Should your assumptions change or prove wrong, bad things could happen or new opportunities might arise.</p>
<blockquote>
<p>It is not unusual for this to happen due to changes in technology. Consider how performance assumptions had to be revised after SSDs were introduced or when C++11 added move semantics.</p>
</blockquote>
<p>However, high performance software is slightly different. Much like race cars, <a href="http://hacksoflife.blogspot.ca/2015/01/high-performance-code-is-designed-not.html">performance is built in</a> from the ground up; it is intrinsic to its DNA. A Toyota Camry is a good car but fitting a Formula One engine into it and dropping it on a race track will yield disappointing results compared to a true thoroughbred. The same applies to software and the same could be said of the <a href="/2015/04/25/all_about_data/">many goals</a> previously discussed (as Adobe Flash found out, security cannot easily be retrofitted).</p>
All about data2015-04-25T00:00:00+00:00http://nfrechette.github.io/2015/04/25/all_about_data<p>The modern computers of today are an amazing and complex creation. From the smallest cellphone up to your supercharged desktop PC, each and every one of them is in reality a mixture of many smaller specialized computers working in concert.</p>
<p>However, at the end of the day, they all do the same thing: they munch on data. That is their only purpose, and from the meaning inscribed in that data emerge the complex behaviours that we see in everyday programs and devices.</p>
<p>Programming is the art of organizing that data in meaningful ways in order to achieve a specific end result. Many criteria exist to take into consideration when designing such systems:</p>
<ul>
<li>User friendly: how do we show the data and how do we allow the data to be manipulated by the user?</li>
<li>Development friendly: how do we organize the data (and code) for it to be easily manipulated in the ways that are required by the product?</li>
<li>Hardware friendly: how do we make sure that the data is laid out in the optimal way for the hardware underneath?</li>
<li>Communication friendly: how do we make sure that the data is easily communicated in between various systems either internally in the computer or externally?</li>
<li>Resilient to damage: how do we make the data safe from hardware failures?</li>
<li>Tamper proof: how do we make the data safe from malicious tampering?</li>
<li>Secure: how do we make sure that data in transit isn’t intercepted by a third party?</li>
</ul>
<p>I am sure there are many more; these are simply a small subset.</p>
<p>What is often less discussed is that very often these goals conflict with one another. In such scenarios, it is paramount to clearly identify the main goals of the software in order to make and validate our assumptions. These should guide all the important decisions regarding how the software is built and how the data is dealt with.</p>
<blockquote>
<p>For example, in the context of AAA video games, with the hardware being largely fixed and the demand for ever higher & prettier visuals, the two most important criteria are almost always user friendliness and hardware friendliness for single player games. For multiplayer games, the complexity increases and communication & tampering become important. Last but not least, for games with micro transactions, security comes into play. Depending on the platform and the game, the risks will also vary in scale and scope.</p>
<p>In contrast, in the context of mobile games, development friendliness is king, to allow churning out updates, content, and new games as quickly as possible. It isn’t unusual to find mobile games with poor user interfaces, bad performance, and that are easily tampered with or compromised. In the top tier of games, user & hardware friendliness are again very important and show up in very polished games such as Clash of Clans and Candy Crush.</p>
</blockquote>
<p>The single most important thing for a team is to be aligned on these goals and to consider them when making every decision. Failure to do this exercise can have disastrous consequences: a change might easily introduce a performance or security issue if one isn’t careful and paying attention to these goals. Ultimately, the entire product can be put at risk.</p>
<p>This is the reality in software design as much as it is in any other industry where teams must juggle multiple goals.</p>