Animation Compression: Linear Key Reduction

With simple key quantization, if we needed to sample a certain time T for which we did not have a key (e.g. in between two existing keys), we linearly interpolated between the two.

A natural extension of this is of course to remove keys or key frames which can be entirely linearly interpolated from their neighbour keys, as long as we introduce minimal or no visible error.

How It Works

The process to remove keys is fairly straightforward:

  • Pick a key
  • Calculate the value it would have if we linearly interpolated it from its neighbours
  • If the resulting track error is acceptable, remove it

The above algorithm continues until nothing further can be removed. How you pick keys may or may not significantly impact the results. I personally only ever came across implementations that iterated over all keys linearly forward in time. However, in theory you could iterate in any number of ways: random order, key with smallest error first, etc. It would be interesting to try various iteration methods.

It is worth pointing out that you need to check the error at a higher level than the individual key you are removing: removing it can change the neighbours used to reconstruct keys that were already removed. As such, you need to look at your error metric and not just the key value delta.
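To make this concrete, here is a minimal sketch of such a greedy removal loop for a single scalar track. The Key struct and helper names are hypothetical, and a production error metric would typically measure object space error across the bone hierarchy rather than a raw value delta.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical scalar track key; a real track stores vectors or quaternions.
struct Key { float time; float value; };

static float Lerp(float a, float b, float alpha) { return a + (b - a) * alpha; }

// Reconstructs the value at 'time' from the remaining keys by finding the surrounding
// pair and linearly interpolating between them.
static float SampleTrack(const std::vector<Key>& keys, float time)
{
    if (time <= keys.front().time) return keys.front().value;
    if (time >= keys.back().time) return keys.back().value;

    std::size_t next = 1;
    while (keys[next].time < time) ++next;
    const Key& prev = keys[next - 1];

    const float alpha = (time - prev.time) / (keys[next].time - prev.time);
    return Lerp(prev.value, keys[next].value, alpha);
}

// Error measured over the whole track after a tentative removal, not just the local value delta.
static float MeasureTrackError(const std::vector<Key>& original, const std::vector<Key>& reduced)
{
    float maxError = 0.0f;
    for (const Key& key : original)
        maxError = std::max(maxError, std::fabs(SampleTrack(reduced, key.time) - key.value));
    return maxError;
}

// Greedy removal: keep stripping keys that can be linearly interpolated from their
// neighbours within the error threshold until nothing further can be removed.
static void RemoveLinearKeys(const std::vector<Key>& original, std::vector<Key>& keys, float errorThreshold)
{
    bool removedSomething = true;
    while (removedSomething)
    {
        removedSomething = false;
        for (std::size_t i = 1; i + 1 < keys.size(); ++i)
        {
            std::vector<Key> candidate = keys;
            candidate.erase(candidate.begin() + static_cast<std::ptrdiff_t>(i));

            if (MeasureTrackError(original, candidate) <= errorThreshold)
            {
                keys = std::move(candidate);  // the key can be reconstructed from its neighbours
                removedSomething = true;
                --i;                          // everything after it shifted left by one
            }
        }
    }
}
```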

Removing keys is not without side effects: now that our data is no longer uniform, calculating the proper interpolation alpha to reconstruct our value at time T is no longer trivial. To be able to calculate it, we must introduce a time marker per remaining key (or key frame) into our data. This marker of course adds overhead to our animation data; while it is a win memory-wise in the general case, it can increase the overall size if the data is very noisy and few or no keys can be removed.

A simple formula is then used to reconstruct the proper interpolation alpha:

TP = Time of Previous key
TN = Time of Next key
Interpolation Alpha = (Sample Time - TP) / (TN - TP)

Another important side effect of introducing time markers is that when we sample a certain time T, we must now search to find between which two keys we must interpolate. This of course adds some overhead to our decompression speed.
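A sketch of that search and alpha computation, reusing the hypothetical Key layout from above; since the remaining keys stay sorted by time, a binary search does the job:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Key { float time; float value; };  // hypothetical layout, as before

// Finds the keys surrounding 'sampleTime' in a track sorted by time and computes the
// interpolation alpha. The search is the price we pay for the per key time markers.
float ComputeAlpha(const std::vector<Key>& keys, float sampleTime,
                   std::size_t& outPrev, std::size_t& outNext)
{
    const auto it = std::lower_bound(keys.begin(), keys.end(), sampleTime,
                                     [](const Key& key, float time) { return key.time < time; });

    outNext = std::min<std::size_t>(static_cast<std::size_t>(it - keys.begin()), keys.size() - 1);
    outPrev = outNext > 0 ? outNext - 1 : 0;
    if (outPrev == outNext)
        return 0.0f;  // clamped at the start of the clip

    const float tp = keys[outPrev].time;  // TP: time of previous key
    const float tn = keys[outNext].time;  // TN: time of next key
    const float alpha = (sampleTime - tp) / (tn - tp);
    return std::min(std::max(alpha, 0.0f), 1.0f);
}
```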

The removal is typically done in one of two ways:

  • Removal of whole key frames that can be linearly interpolated
  • Removal of independent keys that can be linearly interpolated

While the first is less aggressive and will generally yield a higher memory footprint, decompression will be faster because we only need to search once to calculate our interpolation alpha.

For example, suppose we have a track with keys at regular time intervals, and consider key #3. We can easily recover its interpolation alpha from its neighbours at times 2 and 4: alpha = (3 - 2) / (4 - 2) = 0.5. With it, we can perfectly reconstruct the missing key: value = lerp(0.35, 0.85, alpha) = 0.6.

Another interesting key is #4. It lies somewhat close to the value we could linearly interpolate from its neighbours: value = lerp(0.6, 0.96, 0.5) = 0.78. Whether the error introduced by removing it is acceptable or not is determined by our error metric function.

In The Wild

This algorithm is perhaps the most common and popular out there. Both Unreal 4 and Unity 5, as well as many other popular game engines, support this format. They all use slight variations, mostly in their error metric function, but the principle remains the same. Sadly, most implementations out there use a poorly implemented error metric, which yields bad results in many instances. This typically stems from using a local error metric where each track type has a single error threshold. Of course the problem with this is that, due to the hierarchical nature of our bone data, some bones need higher accuracy (e.g. pelvis, root). Some engines mitigate this by allowing a threshold per track or per bone, but this requires some amount of tweaking to get right, which is often undesirable and sub-optimal.

Twice in my career I had to implement a new animation compression algorithm and both times were to replace bad linear key reduction implementations.

From the implementations I have seen in the wild, it seems more popular to remove individual keys as opposed to removing whole key frames.

Performance

Sadly, due to the loss of data uniformity, the cache locality of the data we need suffers. Unlike with simple key quantization, if we remove individual keys we can no longer simply sort by key frame to keep things cache efficient (you still can if you remove whole key frames, though).

Although I have not personally witnessed it, I suspect it should be possible to use a variation of a technique used by curve fitting to sort our data in a cache friendly way. It is well described here and we’ll come back to it when we cover curve fitting.

The need to constantly search for which neighbour keys to use when interpolating quickly adds up since it scales poorly: the longer our clip is, the wider the range we need to search, and the more tracks we have, the more searching needs to happen. I have seen two ways to mitigate this: partitioning our clip or using a cursor.

Partitioning our clip data as we discussed with uniform segmenting helps reduce the range to search in as our clip length increases. If the number of keys per block is sufficiently small, searching can be made very efficient with a sorting network or similar strategy. The use of blocks will also decrease the need for precision in our time markers by using a similar form of range reduction which allows us to use fewer bits to store them.

Using a cursor is conceptually very simple. Most clips play linearly and predictably (either forward or backward in time). We can leverage this fact by caching which time we sampled last and which neighbour keys were used, and using that to kickstart our next search. The cursor overhead is very low if we remove whole key frames, but it scales with the number of animated tracks if we remove individual keys.
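A minimal per track cursor might look like the sketch below (again with the hypothetical Key layout from earlier). If whole key frames are removed a single cursor per clip suffices; if individual keys are removed you need one per track, which is where the overhead comes from.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Key { float time; float value; };  // hypothetical layout, as before

// Per track cursor: remembers where the last sample landed so the next, mostly
// sequential, sample only scans a key or two instead of searching the whole track.
struct TrackCursor { std::size_t prev = 0; };

float SampleWithCursor(const std::vector<Key>& keys, float sampleTime, TrackCursor& cursor)
{
    std::size_t prev = cursor.prev;

    // Playback usually moves forward in time: advance while the next key is still behind us.
    while (prev + 2 < keys.size() && keys[prev + 1].time <= sampleTime)
        ++prev;
    // Handle the occasional backward seek (e.g. looping) by walking back.
    while (prev > 0 && keys[prev].time > sampleTime)
        --prev;
    cursor.prev = prev;

    const Key& a = keys[prev];
    const Key& b = keys[prev + 1];
    const float alpha = std::min(std::max((sampleTime - a.time) / (b.time - a.time), 0.0f), 1.0f);
    return a.value + (b.value - a.value) * alpha;
}
```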

Note that it is also quite possible that the sorting trick mentioned above could speed up the search, but I cannot speak to the accuracy of this statement at this time.

Even though we can reach a smaller memory footprint with linear key reduction than with simple key quantization, the number of cache lines we'll need to touch when decompressing is most likely going to be higher. Along with the need to search for key neighbours, this makes decompression slower with this algorithm. It remains popular due to its reduced memory footprint, which was very important on older consoles (e.g. the PS2 and PS3 era), as well as due to its obvious simplicity.

Up next: Curve Fitting

Back to table of contents

Animation Compression: Sub-sampling

Once upon a time, sub-sampling was a very common compression technique but it is now mostly relegated to history books (and my boring blog!).

How It Works

It is conceptually very simple:

  • Take your source data, either from Maya (3DS Max, etc.) or already sampled data at some sample rate
  • Sample (or re-sample) your data at a lower sample rate

Traditionally, character animations have a sample rate of 30 FPS. This means that for any given animated track, we end up with 30 keys per second of animation.

Sub-sampling works because in practice most animations don't move all that fast and a lower sampling rate is just fine: 15-20 FPS is generally good enough.
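As a sketch, re-sampling an already discretized track at a lower rate amounts to evaluating the source track at the new key times. The UniformTrack layout and names below are hypothetical:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical uniform scalar track: values[i] holds the key at time i / sampleRate.
struct UniformTrack { float sampleRate; std::vector<float> values; };

static float SampleUniform(const UniformTrack& track, float time)
{
    const float position = time * track.sampleRate;
    const std::size_t prev = std::min(static_cast<std::size_t>(position), track.values.size() - 1);
    const std::size_t next = std::min(prev + 1, track.values.size() - 1);
    const float alpha = position - static_cast<float>(prev);
    return track.values[prev] + (track.values[next] - track.values[prev]) * alpha;
}

// Re-samples the source track at a lower rate. When a new key time lines up with an
// original key the value is copied as-is; otherwise it is interpolated, losing some accuracy.
static UniformTrack SubSample(const UniformTrack& source, float newSampleRate)
{
    const float duration = static_cast<float>(source.values.size() - 1) / source.sampleRate;
    const std::size_t numKeys = static_cast<std::size_t>(std::round(duration * newSampleRate)) + 1;

    UniformTrack result{ newSampleRate, {} };
    result.values.reserve(numKeys);
    for (std::size_t i = 0; i < numKeys; ++i)
        result.values.push_back(SampleUniform(source, static_cast<float>(i) / newSampleRate));
    return result;
}
```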

Edge Cases

Now of course, this fails miserably if this assumption does not hold true or if a particular key is very important. It can often be the case that an important key is removed with this technique and there is sadly not much that can be done to avoid this issue short of selecting another sampling rate.

It is also worth mentioning that not all sampling rates are necessarily equal. If your source data is already discretized at some original sample rate, sample rates that retain whole keys are generally superior to sample rates that force the generation of new keys by interpolating their neighbours.

For example, if my source animation track is sampled at 30 FPS, I have a key every 1s / 30 = 0.033s. If I sub-sample it at 18 FPS, I have a key every 1s / 18 = 0.055s. This means most of the keys I need are not in sync with my original data and thus new keys must be generated. This will yield some loss of accuracy.

On the other hand, if I sub-sample at 15 FPS, I have a key every 1s / 15 = 0.066s. This means every other key in my original data can be discarded and the remaining keys are identical to my original keys.

Another good example is sub-sampling at 20 FPS, which will yield a key every 1s / 20 = 0.05s. This means every 3rd key from the original data will be retained (0.0 … 0.1 … 0.2 …), while the new keys in between (0.05, 0.15, etc.) do not line up with any original key and will be artificially generated from our original neighbour keys.
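Whether the new keys line up is just arithmetic on the two sample rates; here is a small sketch using the rates from the examples above:

```cpp
#include <cstdio>

int main()
{
    const int sourceRate = 30;                 // original sample rate in FPS
    const int targetRates[] = { 15, 18, 20 };  // candidate sub-sampling rates

    for (const int targetRate : targetRates)
    {
        // A new key at time i / targetRate lines up with an original key exactly when
        // i * sourceRate is a whole multiple of targetRate.
        int aligned = 0;
        for (int i = 0; i < targetRate; ++i)  // one second worth of new keys
            if ((i * sourceRate) % targetRate == 0)
                ++aligned;

        std::printf("%d FPS: %d out of %d new keys line up with original keys\n",
                    targetRate, aligned, targetRate);
    }
    return 0;
}
```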

The problem of keys not lining up is of course absent if your source animation data is not already discretized. If you have access to the Maya (or 3DS Max) curves, the sub-sampling will retain higher accuracy.

In The Wild

In the wild, this used to be a very popular technique on older generation hardware such as the Xbox 360 and the PlayStation 3 (and older). It was very common to keep most of your main character animations with a high sample rate of say 30 FPS, while keeping most of your NPC animations at a lower sample rate of say 15 FPS. Any specific animation that required high accuracy would not be sub-sampled and this selection process was done by hand, making its usage somewhat error prone.

Due to its simplicity, it is also commonly used alongside other compression techniques (e.g. linear key reduction) to further reduce the memory footprint.

However, nowadays this technique is seldom used in large part because we aren’t as constrained by the memory footprint as we used to be and in part because we strive to push the animation quality ever higher.

Up next: Linear Key Reduction

Back to table of contents

Animation Compression: Simple Quantization

Simple quantization is perhaps the simplest compression technique out there. It is used for pretty much everything, not just for character animation compression.

How It Works

It is fundamentally very simple:

  • Take some floating point value
  • Normalize it within some range
  • Scale it to the range of a N bit integer
  • Round and convert to a N bit integer

That’s it!

For example, suppose we have some rotation component value within the range [-PI, PI] and we wish to represent it as a signed 16 bit integer:

  • Our value is: 0.25
  • Our normalized value is thus: 0.25 / PI = 0.08
  • Our scaled value is thus: 0.08 * 32767.0 = 2607.51
  • We perform arithmetic rounding: Floor(2607.51 + 0.5) = 2608

Reconstruction is the reverse process and is again very simple:

  • Our normalized value becomes: 2608.0 / 32767.0 = 0.08
  • Our de-quantized value is: 0.08 * PI = 0.25
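Here is a minimal sketch of this round trip using the range and bit width from the example; real code would also clamp out-of-range inputs, and the rounding subtlety is covered below:

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

constexpr float kPi = 3.14159265358979f;

// Quantizes a value in [-PI, PI] to a signed 16 bit integer, mirroring the example above.
int16_t Quantize(float value)
{
    const float normalized = value / kPi;                    // [-1.0, 1.0]
    const float scaled = normalized * 32767.0f;              // [-32767.0, 32767.0]
    return static_cast<int16_t>(std::floor(scaled + 0.5f));  // arithmetic rounding, as in the example
}

// Reconstruction is simply the reverse process.
float Dequantize(int16_t quantized)
{
    return (static_cast<float>(quantized) / 32767.0f) * kPi;
}

int main()
{
    const int16_t quantized = Quantize(0.25f);  // 2608
    std::printf("quantized: %d, reconstructed: %f\n", quantized, Dequantize(quantized));
    return 0;
}
```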

If we need to sample a certain time T for which we do not have a key (e.g. in between two existing keys), we typically linearly interpolate between the two. This is just fine because key frames are supposed to be fairly close in time and the interpolation is unlikely to introduce any visible error.

Edge Cases

There is of course some subtlety to consider. For one, rounding needs care: not all rounding modes are equivalent and there are so many!

The safest bet, and a good default for compression purposes, is to use symmetrical arithmetic rounding. This is particularly important when quantizing to a signed integer value, but if you only use unsigned integers, asymmetrical arithmetic rounding is just fine.
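To make the distinction concrete, here is a sketch of both variants; std::round gives the symmetrical behaviour as well:

```cpp
#include <cmath>

// Asymmetrical arithmetic rounding: halfway cases always round up, towards +infinity.
float RoundAsymmetrical(float value)
{
    return std::floor(value + 0.5f);
}

// Symmetrical arithmetic rounding: halfway cases round away from zero, so positive and
// negative values behave the same way.
float RoundSymmetrical(float value)
{
    return value >= 0.0f ? std::floor(value + 0.5f) : std::ceil(value - 0.5f);
}
```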

Another subtlety is whether to use signed integers or unsigned integers. In practice, due to the way two's complement works and the need for us to represent the 0 value accurately, when using signed integers the positive half and the negative half do not have the same size. For example, with 16 bits, the negative half spans [-32768, 0] while the positive half spans [0, 32767]. This asymmetry is generally undesirable and as such using unsigned integers is often favoured. The conversion is fairly simple and only requires doubling our range (2 * PI in the example above) and offsetting our value by the signed range (PI in the example above).

  • Our normalized value is thus: (0.25 + PI) / (2.0 * PI) = 0.54
  • Our scaled value is thus: 0.54 * 65535.0 = 35375.05
  • With arithmetic rounding: Floor(35375.05 + 0.5) = 35375
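The unsigned variant only changes the normalization step; a sketch matching the numbers above:

```cpp
#include <cmath>
#include <cstdint>

constexpr float kPi = 3.14159265358979f;

// Quantizes a value in [-PI, PI] to an unsigned 16 bit integer by first shifting the
// range to [0, 2 * PI], avoiding the signed asymmetry mentioned above.
uint16_t QuantizeUnsigned(float value)
{
    const float normalized = (value + kPi) / (2.0f * kPi);  // [0.0, 1.0]
    return static_cast<uint16_t>(std::floor(normalized * 65535.0f + 0.5f));
}

float DequantizeUnsigned(uint16_t quantized)
{
    return (static_cast<float>(quantized) / 65535.0f) * (2.0f * kPi) - kPi;
}
```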

Another point to consider is: what is the maximum number of bits we can use with this implementation? A single precision floating point value can only accurately represent up to 6-9 significant digits; as such, the upper bound appears to be around 19 bits for an unsigned integer. Higher than this and the quantization method will need to change to an exponential scale, similar to how depth buffers are often encoded in RGB textures. Further research is needed on the nuances of both approaches.

In The Wild

The main question remaining is how many bits to use for our integers. Most standalone implementations in the wild use simple quantization to replace a purely raw floating point format. Tracks will often be encoded with a hardcoded number of bits, typically 16, which is generally enough for rotation tracks as well as range reduced translation and scale tracks.

Most implementations that don't exclusively rely on simple quantization but use it for extra memory gains again typically use a hardcoded number of bits. Here again 16 bits is a popular choice but sometimes as low as 12 bits is used. This is common for linear key reduction and curve fitting. Wavelet based implementations will typically use anywhere between 8 and 16 bits per quantized coefficient, and these will vary per sub-band as we will see when we cover this topic.

The resulting memory footprint and the error introduced are both a function of the number of bits used by the integer representation.

This technique is also very commonly used alongside other compression techniques. For example, the remaining keys after linear key reduction will generally be quantized as will curve control points after curve fitting.

Performance

This algorithm is very fast to decompress. Only two key frames need to be decompressed and interpolated. Each individual track is trivially decompressed and interpolated. It is also terribly easy to make the data very processor cache friendly: all we need to do is sort our data by key frame, followed by sorting it by track. With such a layout, all our data for a single key frame will be contiguous and densely packed, and the next key frame we will interpolate will follow contiguously in memory as well. This makes it very easy for prefetching to happen, either within the hardware or manually in software.

For example, suppose we have 50 animated tracks (a reasonable number for a normal AAA character), each with 3 components, stored on 16 bits (2 bytes) per component. We end up with 50 * 3 * 2 = 300 bytes for a single key frame. This yields a small number of cache lines on a modern processor: ceil(300 / 64) = 5. Twice that amount needs to be read to sample our clip.
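Here is a sketch of what that sorted layout implies for indexing, with hypothetical struct and field names and the sizes taken from the example above:

```cpp
#include <cstddef>
#include <cstdint>

// Data sorted by key frame first, then by track: a single key frame is one contiguous,
// densely packed run of quantized components.
struct QuantizedClip
{
    std::size_t numTracks;              // e.g. 50
    std::size_t numComponentsPerTrack;  // e.g. 3
    const std::uint16_t* quantizedData; // all key frames stored back to back
};

// Returns a pointer to the start of a key frame; the next key frame to interpolate
// follows immediately in memory, which is what makes prefetching so easy.
const std::uint16_t* GetKeyFrame(const QuantizedClip& clip, std::size_t keyFrameIndex)
{
    const std::size_t keyFrameSize = clip.numTracks * clip.numComponentsPerTrack;
    return clip.quantizedData + keyFrameIndex * keyFrameSize;
}
```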

Due in large part to the cache friendly nature of this algorithm, it is quite possibly the fastest to decompress.

My GDC presentation goes into further depth on this topic; its content will find its way here in due time.

Up next: Advanced Quantization

Back to table of contents

Animation Compression: Main Compression Families

Over the years, a number of techniques have emerged to address the problem of character animation compression. These can be roughly broken down into a handful of general families:

  • Simple quantization
  • Sub-sampling
  • Linear key reduction
  • Curve fitting
  • Wavelet compression

Note that simple quantization and sub-sampling can be (and often are) used in tandem with the other three compression families, although they can be used entirely on their own.

Up next: Simple Quantization

Back to table of contents

Animation Compression: Uniform Segmenting

Uniform segmenting is performed by splitting a clip into a number of blocks of equal or approximately equal size. This is done for a number of reasons.

Range Reduction

Splitting large clips (such as cinematics) into smaller blocks allows us to use range reduction on each track within a block. This can often help reach higher levels of compression especially on longer clips. This is typically done on top of the clip range reduction to further increase our compression accuracy.
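As a sketch, per block range reduction simply stores each track's minimum and extent within the block and quantizes relative to them (the names below are hypothetical):

```cpp
#include <algorithm>
#include <vector>

// Per track range information stored alongside each block.
struct TrackRange { float minimum; float extent; };

// Computes the range of one track's values within a block; quantization then operates
// on (value - minimum) / extent, which needs fewer bits for the same accuracy.
TrackRange ComputeBlockRange(const std::vector<float>& blockValues)
{
    const auto [minIt, maxIt] = std::minmax_element(blockValues.begin(), blockValues.end());
    const float extent = std::max(*maxIt - *minIt, 1.0e-8f);  // guard against a constant track
    return { *minIt, extent };
}
```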

Easier Seeking

For some compression algorithms, seeking is slow because it involves searching for the keys or control points surrounding the particular time T we are trying to sample. Techniques exist to speed this up by using optimal sorting, but they add complexity. With sufficiently small blocks (e.g. 16 key frames), neither optimal sorting nor a cursor might be required (cursors are used by these algorithms to keep track of the last sample performed in order to accelerate the search for the next one). See the posts about linear key reduction and curve fitting for details.

With uniform segmenting, finding which blocks we need is very fast in large part because there are typically few blocks for most clips.
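As a sketch, with a fixed number of key frames per block the lookup is just a division and a remainder (names below are hypothetical):

```cpp
#include <cstddef>

// With a fixed number of key frames per block, finding the block (and the key frame
// within it) for a given sample time is constant time: no searching is involved.
struct BlockLocation { std::size_t blockIndex; std::size_t keyFrameInBlock; };

BlockLocation FindBlock(float sampleTime, float sampleRate, std::size_t keyFramesPerBlock)
{
    const std::size_t keyFrameIndex = static_cast<std::size_t>(sampleTime * sampleRate);
    return { keyFrameIndex / keyFramesPerBlock, keyFrameIndex % keyFramesPerBlock };
}
```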

Streaming and TLB Friendly

The easier seeking also means the clips are much easier to stream. Everything you need to sample a number of key frames is bundled together into a single contiguous block. This makes it easy to cache them or prefetch them during playback.

For similar reasons, this makes our clip data more TLB friendly. All our relevant data needed for a number of key frames will use the same contiguous memory pages.

Required For Wavelets

Wavelet functions typically work on data whose size is a power of two. Some form of padding is used to reach a power of two when the size doesn't match. To avoid excessive padding, clips are split into uniform blocks. See the post about wavelet compression for details.

Downsides

Segmenting a clip into blocks does add some amount of memory overhead:

  • Each clip will now need to store a mapping of which block is where in memory.
  • If range reduction is done per block as well, each block now includes range information.
  • Blocks might need a header with some flags or other relevant information.
  • And for some compression algorithms such as curve fitting, it might force us to insert extra control points or to retain more key frames.

However, the upsides are almost always worth the hassle and overhead.

Up next: Main Compression Families

Back to table of contents