11 Jan 2017
Unreal 4 offers a wide range of documented character animation compression settings. It supports the following compression algorithms, each reviewed in turn below.
Note: The following review was done with Unreal 4.15.
Least Destructive
Reverts any animation compression, restoring the animation to the raw data.
This setting ensures we use the raw data. It sets every track to use no compression, meaning full floating point precision: None is used for translation and Float 96 for rotation (both formats are explained below). Interestingly, it does not appear to revert the scale track to a sensible default raw compression setting, which is likely a bug.
Remove Every Second Key
Keyframe reduction algorithm that simply removes every second key.
This setting does exactly what it claims; it removes every other key. It is a variation of sub-sampling. Each remaining key is further bitwise compressed.
Note that this compression setting also removes the trivial keys.
Remove Trivial Keys
Removes trivial frames of tracks when position or orientation is constant over the entire animation from the raw animation data.
This takes advantage of the fact that tracks are very often constant; when that happens, a single key is kept and the rest are discarded.
Bitwise Compress Only
Bitwise animation compression only; performs no key reduction.
This compression setting retains every single key and encodes them with one of several formats. It is a custom flavor of simple quantization and uses the same format for every track in the clip: you select one format per track type (rotation, translation, and scale).
These are the possible formats:
- None: Full precision.
- Float 96: Full precision is used for translation and scale. For rotations, a quaternion component is dropped and the rest are stored with full precision.
- Fixed 48: For rotations, a quaternion component is dropped and the rest is stored on 16 bits per component.
- Interval Fixed 32: Stores the value on 11-11-10 bits per component with range reduction. For rotations, a quaternion component is dropped.
- Fixed 32: For rotations, a quaternion component is dropped and the rest is stored on 11-11-10 bits per component.
- Float 32: This appears to be a semi-deprecated format where a quaternion component is dropped and the rest is encoded onto a custom float format with 6 or 7 bits for the mantissa and 3 bits for the exponent.
- Identity: The identity quaternion is always returned for the rotation track.
Only three formats are supported for translation and scale: None, Interval Fixed 32, and Float 96. All formats are supported for rotations.
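To make the range-reduced formats more concrete, here is a minimal sketch in the spirit of Interval Fixed 32: each component is normalized within its per-track range and packed on 11-11-10 bits. The bit layout, helper names, and rounding choices here are illustrative assumptions, not Unreal's actual implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical per-track range kept alongside the compressed data for reconstruction.
struct TrackRange { float min[3]; float extent[3]; };

inline uint32_t QuantizeComponent(float value, float min, float extent, uint32_t num_bits)
{
    const float normalized = extent > 0.0f ? (value - min) / extent : 0.0f; // range reduction to [0, 1]
    const float clamped = std::min(std::max(normalized, 0.0f), 1.0f);
    const float max_value = float((1u << num_bits) - 1u);
    return uint32_t(std::round(clamped * max_value));
}

// Pack a 3-component value on 11-11-10 bits after range reduction.
inline uint32_t PackIntervalFixed32(const float value[3], const TrackRange& range)
{
    const uint32_t x = QuantizeComponent(value[0], range.min[0], range.extent[0], 11);
    const uint32_t y = QuantizeComponent(value[1], range.min[1], range.extent[1], 11);
    const uint32_t z = QuantizeComponent(value[2], range.min[2], range.extent[2], 10);
    return (x << 21) | (y << 10) | z;
}
```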
If a track has a single key it is always stored with full precision.
When a component is dropped from a rotation quaternion, W is always selected. This is safe and simple since it can be reconstructed with a square root as long as our quaternion is normalized. The sign is forced to be positive during compression by negating the quaternion if W was originally negative. A quaternion and its negated opposite represent the same rotation on the hypersphere.
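A minimal sketch of that reconstruction, assuming the quaternion was normalized and negated as needed so that W was non-negative before the component was dropped:

```cpp
#include <cmath>

struct Quat { float x, y, z, w; };

// Reconstruct the dropped W component of a unit quaternion from X, Y, Z.
inline Quat ReconstructW(float x, float y, float z)
{
    const float w_squared = 1.0f - (x * x + y * y + z * z);
    // Guard against small negative values introduced by quantization error.
    const float w = w_squared > 0.0f ? std::sqrt(w_squared) : 0.0f;
    return Quat{ x, y, z, w };
}
```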
Sadly, the selection of which format to use is entirely manual; this setting uses no heuristic to perform any automatic selection.
Note that this compression setting also removes the trivial keys.
Remove Linear Keys
Keyframe reduction algorithm that simply removes keys which are linear interpolations of surrounding keys.
This is a classic variation on linear key reduction with a few interesting twists.
They use a local space error metric coupled with a partial virtual vertex error metric. The local space metric is used classically to reject key removal beyond some error threshold. In addition, they check the impacted end effector (i.e. leaf) bones that are children of a given track. To do so, a virtual vertex is used, but no attempt is made to ensure it isn't co-linear with the rotation axis. Bones between the modified bone and the leaf are not checked at all; should they end up with an unacceptable amount of error, the metric will not see it.
There is also a bias applied to tracks in an attempt to remove similar keys from a parent bone track and its children. I'm not entirely certain why they opted to do this, but I doubt it can harm the results.
On the decompression side, no cursor or acceleration structure is used, meaning that each time we sample the clip we must search for which keys to interpolate between, and we do so per track.
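To make that cost concrete, here is a sketch of what sampling without a cursor implies: for every track, every sample repeats a binary search over the key times followed by a linear interpolation. The track layout and names here are hypothetical, not Unreal's.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical track layout: sorted key times with matching values.
struct FloatTrack
{
    std::vector<float> times;   // sorted, one entry per retained key
    std::vector<float> values;  // same size as times
};

// Without a cursor, this search is repeated for every track on every sample.
inline float SampleTrack(const FloatTrack& track, float sample_time)
{
    // Find the first key at or after the sample time.
    const auto it = std::lower_bound(track.times.begin(), track.times.end(), sample_time);
    if (it == track.times.begin())
        return track.values.front();
    if (it == track.times.end())
        return track.values.back();

    const size_t key1 = size_t(it - track.times.begin());
    const size_t key0 = key1 - 1;
    const float alpha = (sample_time - track.times[key0]) / (track.times[key1] - track.times[key0]);
    return track.values[key0] + (track.values[key1] - track.values[key0]) * alpha;
}
```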
Note that this compression setting also removes the trivial keys.
Compress each track independently
Keyframe reduction algorithm that removes keys which are linear interpolations of surrounding keys, as well as choosing the best bitwise compression for each track independently.
This compression setting combines the linear key reduction algorithm along with the simple quantization bitwise compression algorithm.
It uses various custom error metrics which appear local in nature with some adaptive bias. Each encoding format will be tested for each track and the best result will be used based on the desired error threshold.
Automatic
Animation compression algorithm that is just a shell for trying the range of other compression schemes and picking the smallest result within a configurable error threshold.
A large mix of the previously mentioned algorithms are tested internally. In particular, various mixes of sub-sampling are tested along with linear key reduction and simple quantization. The best result is selected given a specific error threshold. This is likely to be somewhat slow due to the large number of variations tested (35 at the time of writing!).
In memory, every compression setting uses a naive layout with the data sorted by track and then by key. This means that during decompression, each track will incur a cache miss to read its compressed value.
Performance-wise, decompression times are likely to be high in Unreal 4, in large part because the memory layout is not cache friendly and because no acceleration structure (such as a cursor) is used when linear key reduction is in play, which is likely often the case with the Automatic setting. This is an obvious area where it could improve.
Memory-wise, the linear key reduction algorithm should be fairly conservative, but it could be minimally improved by ditching the local space error metric. Coupled with the bitwise compression, it should yield a reasonable memory footprint, although it could be further improved by using more advanced quantization techniques.
Up next: Unity 5
Back to table of contents
23 Dec 2016
In order to keep the material concrete, we will take a deeper look at some popular video game engines. We will highlight the techniques they use for character animation compression and where they shine or fall short.
Up next: Unreal 4
Back to table of contents
22 Dec 2016
Unless you use the curves directly authored by the animator for your animation clip, the animation compression you use will be lossy in nature and some amount of fidelity will be lost (whether visible to the naked eye or not). To combat this loss of accuracy, three error compensation techniques have emerged over the years to help push the memory footprint further down while keeping the visual results acceptable: in-place correction, additive correction, and inverse kinematic correction.
In-place Correction
This technique has already been detailed in Game Programming Gems 7. As such, I won’t go into implementation details unless there is significant interest.
As we have seen previously, our animated bone data is stored in local space, which means that any error on a parent bone will propagate to its children. Fundamentally, this technique aims to compensate for our compression error by applying a small correction to each track, helping to stop the error from propagating down our hierarchy. For example, if we have two parented bones, a small error on the parent bone will offset the child in object space. To compensate, we can apply a small offset to our child in local space such that the end result in object space is as close to the original animation as possible.
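A minimal sketch of the core idea, for the translation part only and ignoring scale: given the parent's lossy object space rotation and translation and the original child object space position, the corrected local translation is the delta expressed in the lossy parent's space. The math types here are hypothetical placeholders, not from any engine.

```cpp
// Minimal math types for the sketch (not from any engine).
struct Vec3 { float x, y, z; };
struct Quat { float x, y, z, w; };  // assumed normalized

inline Vec3 operator-(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }

// Rotate a vector by the conjugate (inverse) of a unit quaternion.
inline Vec3 InverseRotate(const Quat& q, const Vec3& v)
{
    // v' = v + 2 * u x (u x v + w * v), with u = -q.xyz
    const Vec3 u{ -q.x, -q.y, -q.z };
    const Vec3 c1{ u.y * v.z - u.z * v.y, u.z * v.x - u.x * v.z, u.x * v.y - u.y * v.x };
    const Vec3 t{ c1.x + q.w * v.x, c1.y + q.w * v.y, c1.z + q.w * v.z };
    const Vec3 c2{ u.y * t.z - u.z * t.y, u.z * t.x - u.x * t.z, u.x * t.y - u.y * t.x };
    return { v.x + 2.0f * c2.x, v.y + 2.0f * c2.y, v.z + 2.0f * c2.z };
}

// In-place correction sketch: compute the child's local translation such that,
// combined with the lossy parent, it restores the original object space position.
inline Vec3 CorrectChildLocalTranslation(
    const Quat& lossy_parent_object_rotation,
    const Vec3& lossy_parent_object_translation,
    const Vec3& original_child_object_position)
{
    const Vec3 delta = original_child_object_position - lossy_parent_object_translation;
    return InverseRotate(lossy_parent_object_rotation, delta);
}
```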
The single most important upside of this technique is that it adds little to no overhead to the runtime decompression. The bulk of the overhead lies entirely in the offline compression process. We only add decompression overhead if we elect to introduce new tracks to compensate for the error.
There are four issues with this technique.
We can only apply a correction to a particular child bone if it is animated. If it is not animated, adding our correction ends up adding more data to compress, and we might increase the memory footprint. If the translation track is not animated, our correction will only be partial: with rotation alone, we cannot exactly match the object space position of the original clip.
Adding a correction for every key frame will tend to yield noisy tracks. Each key frame will end up with a micro-correction which in turn will need somewhat higher accuracy to keep it reliable. For this reason, tracks with correction in them will tend to compress a bit more poorly.
To properly calculate the correction to apply for every bone track, we must calculate the object space transforms of our bones. This adds extra overhead to our compression time. It may or may not end up being a huge deal; your mileage may vary.
Because we add corrections to our animated tracks, it is probable that our track ranges will change as we compress and correct our data. This needs to be properly taken into account, adding further complexity if range reduction is used.
Additive Correction
Additive correction is very similar to the in-place correction mentioned above. Instead of modifying our track data in place by incorporating the correction, we can store the correction separately as extra additive tracks and combine it during runtime decompression.
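A minimal sketch of that runtime combine step, assuming hypothetical math types: the rotation correction is applied by quaternion multiplication and the translation correction by addition. Whether the delta is applied on the left or the right is a convention choice and an assumption here.

```cpp
// Hypothetical types; quaternions assumed normalized.
struct Vec3 { float x, y, z; };
struct Quat { float x, y, z, w; };

inline Quat operator*(const Quat& a, const Quat& b)
{
    return {
        a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y,
        a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x,
        a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w,
        a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z
    };
}

struct BoneTransform { Quat rotation; Vec3 translation; };

// Combine the decompressed base pose with its separately decompressed correction tracks.
inline BoneTransform ApplyAdditiveCorrection(const BoneTransform& base, const BoneTransform& correction)
{
    BoneTransform result;
    result.rotation = correction.rotation * base.rotation;  // small rotation delta applied on the left (convention)
    result.translation = { base.translation.x + correction.translation.x,
                           base.translation.y + correction.translation.y,
                           base.translation.z + correction.translation.z };
    return result;
}
```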
This variation offers a number of interesting trade-offs which are worth considering:
- Our compressed tracks do not change and will not become noisy nor will their range change
- Missing tracks are not an issue since we always add separate additive tracks
- Adding the correction at runtime is very fast and simple
- Additive tracks are compressed separately and can benefit from a different accuracy threshold
However by its nature, the memory overhead will most likely end up being higher than with the in-place variant.
Inverse Kinematic Correction
The last form of error compensation leverages inverse kinematics. The idea is to store extra object space translation tracks for certain high accuracy bones such as feet and hands. Bones that come into contact with things such as the environment tend to make compression inaccuracy very obvious. Using these high accuracy tracks, we run our inverse kinematic algorithm to calculate the desired transforms of a few parent bones to match our desired pose. This will tend to spread the error of our parent bones, making it less obvious while keeping our point of contact fixed and accurate.
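To give a sense of the runtime work involved, here is a minimal two-bone IK sketch using the law of cosines. It only computes the two bend angles from the bone lengths and the distance to the target; aiming the chain at the target and converting the angles back into bone rotations is left out. This is a generic textbook solver offered as an illustration, not any engine's implementation.

```cpp
#include <algorithm>
#include <cmath>

// Angles for a 2-bone chain (e.g. hip/knee/ankle): the angle at the root joint
// relative to the root->target direction, and the bend angle at the middle joint.
struct TwoBoneIKAngles { float root_angle; float middle_angle; };

inline TwoBoneIKAngles SolveTwoBoneIK(float upper_length, float lower_length, float target_distance)
{
    // Clamp the distance to what the chain can actually reach.
    const float min_reach = std::abs(upper_length - lower_length) + 1.0e-4f;
    const float max_reach = (upper_length + lower_length) - 1.0e-4f;
    const float distance = std::min(std::max(target_distance, min_reach), max_reach);

    // Law of cosines on the triangle formed by the two bones and the root->target segment.
    const float cos_root = (upper_length * upper_length + distance * distance - lower_length * lower_length)
        / (2.0f * upper_length * distance);
    const float cos_middle = (upper_length * upper_length + lower_length * lower_length - distance * distance)
        / (2.0f * upper_length * lower_length);

    TwoBoneIKAngles result;
    result.root_angle = std::acos(std::min(std::max(cos_root, -1.0f), 1.0f));
    // The interior triangle angle is acos(cos_middle); the joint bend is its supplement.
    result.middle_angle = 3.14159265f - std::acos(std::min(std::max(cos_middle, -1.0f), 1.0f));
    return result;
}
```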
Besides allowing our compression to be more aggressive, this technique does not have a lot of upsides. It does have a number of downsides though:
- Extra tracks means extra data to compress and decompress
- Even a simple 2-bone inverse kinematic algorithm will end up slowing down our decompression since we need to calculate object space transforms for the bones involved
- By its very nature, our parent bones will no longer closely match the original clip; only the general feel might remain, depending on how far the inverse kinematic correction ends up moving them.
All three forms of error correction can be used with any compression algorithm, but they all have a number of important downsides. For this reason, unless you need the compression to be very aggressive, I would advise against using these techniques. If you choose to do so, the first two appear to be the most appropriate due to their reduced runtime overhead. Note that if you really wanted to, all three techniques could be used simultaneously, but that would most likely be very extreme.
Up next: Case Studies
Back to table of contents
19 Dec 2016
Signal processing algorithm variants come in many forms, but the most common and popular approach is to use Wavelets. Having used this method for all character animation clips on all supported platforms for Thief (2014), I have a fair amount to share with you.
Other signal processing algorithm variants include Discrete Cosine Transform, Principal Component Analysis, and Principal Geodesic Analysis.
The latter two variants are commonly used alongside clustering and database approaches which I’ll explore if enough interest is expressed but I’ll be focusing on Wavelets here.
How It Works
At their core, implementations that leverage wavelets for compression are split into four distinct steps:
- Pre-processing (range reduction and normalization)
- The wavelet transform
- Quantization
- Entropy coding
The most important step is, of course, the wavelet function, around which everything else is centered. Covering the wavelet function first will help clarify the purpose of every other step.
Aside from quantization, all of the steps involved are effectively lossless and only suffer from minor floating point rounding. By altering how many bits we use for the quantization step, we can control how aggressive we want the compression to be.
Decompression simply performs the same steps in reverse order.
We will avoid going too far in depth on this topic in this series; instead, we will focus on the wavelet properties and what they mean for us with respect to character animation and compression in general. A good starting point for the curious is the Haar wavelet, which is the simplest wavelet function; however, it's generally avoided for compression.
By definition, wavelet functions are recursive. Each application of the function is referred to as a sub-band and will output an equal number of scale and coefficient values, each exactly half the original input size. In turn, we can recursively apply the function on the resulting scale values of the previous sub-band. The end result is a single scale value and N - 1 coefficients, where N is the input size.
The Haar wavelet scale is simply the sum of two input values and the coefficient is their difference. As far as I know, most wavelet functions work similarly, yielding coefficients that are as close to zero as possible and exactly zero for a constant input signal.
The reason the Haar wavelet is not suitable for compression is that it has a single vanishing moment. This means the input data is processed in pairs, each outputting a single scale and a single coefficient. The pairs never overlap, which means that a discontinuity between two pairs will not be taken into account and will yield undesirable artifacts if the coefficients are not accurate. A decent alternative is the Daubechies D4 wavelet. This is the function I used on Thief (2014) and it turned out quite decently for our purposes.
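As a concrete illustration of the recursive decomposition, here is a minimal, unnormalized Haar transform sketch: each pass turns pairs of values into a scale (sum) and a coefficient (difference), and we recurse on the scales. A production implementation would use a better wavelet such as Daubechies D4, but the overall structure is the same.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Minimal, unnormalized Haar wavelet transform sketch. 'signal.size()' must be a
// power of two. After the call, signal[0] holds the final scale value and the
// remaining N - 1 entries hold the coefficients of every sub-band.
inline void HaarTransform(std::vector<float>& signal)
{
    std::vector<float> scratch(signal.size());
    for (size_t length = signal.size(); length > 1; length /= 2)
    {
        const size_t half = length / 2;
        for (size_t i = 0; i < half; ++i)
        {
            const float a = signal[i * 2 + 0];
            const float b = signal[i * 2 + 1];
            scratch[i] = a + b;         // scale: the next pass recurses on these
            scratch[half + i] = a - b;  // coefficient: zero for a constant signal
        }
        std::copy(scratch.begin(), scratch.begin() + length, signal.begin());
    }
}
```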
The wavelet transform can be made entirely lossless by using an integer variant, but in practice an ordinary floating point variant is appropriate since compression is lossy by nature and the rounding will not measurably impact the results.
Since the wavelet function decomposes a signal on an orthonormal basis, we will achieve the highest compression by considering as much of the signal as possible, not unlike principal component analysis: simply concatenate all tracks together into a single 1D signal. The upside is that by considering all the data as a whole we can find a single orthonormal basis which will allow us to quantize more aggressively, but with a larger signal to transform, the decompression speed will suffer. To keep the process reasonably fast on modern hardware, in practice each track would likely be processed independently in blocks of a small power of two, such as 16 keys at a time. For Thief (2014), all rotation tracks and all translation tracks were aggregated separately up to a maximum segment size of 64 KB, and we ran the wavelet transform once for the rotation tracks and once for the translation tracks.
Because wavelet functions are recursive the size of the input data needs to be a power of two. If our size doesn’t match we will need to introduce some form of padding:
- Pad with zeroes
- Repeat the last value
- Mirror the signal
- Loop the signal
- Something even more creative?
Which padding approach you choose is likely to have a fairly minimal impact on compression. Your guess is as good as mine regarding which is best. In practice, it’s best to avoid padding as much as possible by keeping input sizes fairly small and processing the input data in blocks or segments.
The scale of output coefficients is a function of the scale and smoothness of our input values. As such it makes sense to perform range reduction and to normalize our input values.
After applying the wavelet transform the number of output values will match the number of input values. No compression has happened yet.
As mentioned previously, our output values will be partitioned into sub-bands plus a single scale value, all somewhat centered around zero (both positive and negative). Each sub-band will end up with a different range of values. The larger sub-bands resulting from the first applications of the wavelet function will be filled with high-frequency information, while the smaller sub-bands will hold the low-frequency information. This is important: it means that a single low-frequency coefficient will impact a larger range of values after performing the inverse wavelet transform. Because of this, low-frequency coefficients need higher accuracy than high-frequency coefficients.
To achieve compression, we quantize our coefficients onto a reduced number of bits while keeping the single scale value at full precision. Due to the nature of the data, we perform range reduction per sub-band and normalize our values into the range [-1.0, 1.0]. We only need to keep the range extent for reconstruction and simply assume that the range is centered around zero. Quantization might not make sense for the lower frequency sub-bands with 1, 2, or 4 coefficients due to the extra overhead of the range extent. Once our values are normalized, we can quantize them. To choose how many bits to use per coefficient, we can simply hard code a reasonably high number such as 16 or 12 bits, or alternatively search over bit counts in an attempt to meet an error threshold. Depending on the number of input values being processed (for example, 16 keys at a time), quantization could also be performed globally instead of per sub-band, to reduce the range information overhead.
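A sketch of that per-sub-band quantization: each sub-band is normalized by its extent (assumed centered around zero) and quantized onto a fixed number of bits. The choice of 16 bits and the helper name are arbitrary assumptions for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Quantize one sub-band of wavelet coefficients onto signed 16 bit integers.
// The extent is kept at full precision for decompression; the range is assumed
// to be centered around zero ([-extent, +extent]).
inline float QuantizeSubBand(const std::vector<float>& coefficients, std::vector<int16_t>& out_quantized)
{
    float extent = 0.0f;
    for (float value : coefficients)
        extent = std::max(extent, std::abs(value));

    out_quantized.resize(coefficients.size());
    const float scale = extent > 0.0f ? 32767.0f / extent : 0.0f;
    for (size_t i = 0; i < coefficients.size(); ++i)
        out_quantized[i] = int16_t(std::round(coefficients[i] * scale));

    return extent;  // stored alongside the sub-band for reconstruction
}
```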
In order to be competitive with other techniques, we need to push compression further using entropy coding, which is an entirely lossless compression step.
After quantization, we obtain a number of integer values all centered around zero and a single scale. The most obvious thing we can exploit now is the fact that we have very few large values. To leverage this, we apply a zigzag transform to our data, mapping signed integers to unsigned integers such that values closest to zero remain closest to zero. This transforms our data in such a way that we still end up with very few large values, which is significant because most of our values, as represented in memory, now have many leading zeroes.
For example, suppose we quantize everything onto 16-bit signed integers: -50, 50, 32760. In memory, these values are represented with two's complement as 0xFFCE, 0x0032, 0x7FF8. This is not great, and how to compress it further is not immediately obvious. If we apply the zigzag transform and map our signed integers to unsigned integers, we get 100, 99, 65519, which are represented in memory as 0x0064, 0x0063, 0xFFEF. An easily predictable pattern emerges: smaller values have a lot of leading zeroes, which will compress well.
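Here is a small sketch of the zigzag mapping using the same convention as the example above (positive n maps to 2n - 1, negative or zero n maps to -2n), along with its inverse. Note that with this convention INT16_MIN would not round-trip through 16 bits, so an encoder would either exclude it during quantization or use the more common 2n / -2n - 1 variant.

```cpp
#include <cstdint>

// Zigzag encode: positive n -> 2n - 1 (odd), negative or zero n -> -2n (even).
inline uint16_t ZigZagEncode(int16_t value)
{
    const int32_t n = value;
    return uint16_t(n > 0 ? 2 * n - 1 : -2 * n);
}

// Zigzag decode: odd values map back to positives, even values to negatives or zero.
inline int16_t ZigZagDecode(uint16_t value)
{
    const int32_t n = value;
    return int16_t((n & 1) ? (n + 1) / 2 : -(n / 2));
}
```

With the values above, ZigZagEncode(-50) yields 100, ZigZagEncode(50) yields 99, and ZigZagEncode(32760) yields 65519, matching the example.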
At this point, a generic entropy coding algorithm is used, such as zlib, Huffman coding, or a custom arithmetic coder. Luke Mamacos gives a decent example of a wavelet arithmetic encoder that takes advantage of leading zeros.
It’s worth noting that if you process a very large input in a single block you will likely end up with lots of padding at the end. This typically ends up as all zero values after the quantization step and it can be beneficial to use run length encoding to compress those before the entropy coding phase.
In The Wild
Signal processing algorithms tend to be the most complex to understand while requiring the most code. This makes maintenance a challenge, which is reflected in their decreased use in the wild.
While these compression methods can be used competitively if the right entropy coding algorithm is used, they tend to be far too slow to decompress, too complex to implement, and too challenging to maintain for the results that they yield.
Due to its popularity at the time, I introduced wavelet compression on Thief (2014) to replace the linear key reduction algorithm used in Unreal 3. Linear key reduction was very hard to tweak properly due to the naive error function it used, resulting in either a large memory footprint or inaccurate animation clips. The wavelet implementation ended up being faster to compress with and yielded a smaller memory footprint with good accuracy.
Fundamentally, the wavelet decomposition allows us to exploit temporal coherence in our animation clip data, but this comes at a price: in order to sample a single key frame, we must reverse the entire process. If we process 16 keys at a time, we must decompress all 16 keys to sample a single one of them (or two if we linearly interpolate, as we normally would when sampling our clip). For this reason, wavelet implementations are terribly slow to decompress and their speed ends up not being competitive at all, which only gets worse as you process a larger input signal. On Thief (2014), a full decompression on the PlayStation 3 SPU took between 800us and 1300us for blocks of data up to 64 KB.
Obviously, this is entirely unacceptable when other techniques are in the range of 30us to 200us. To mitigate this and keep it competitive, an intermediate cache is necessary.
The idea of the cache is to perform the expensive decompression once for a block of data (e.g. 16 keys) and re-use it in the future. At 30 FPS, our 16 keys will be usable for roughly 0.5 seconds. This, of course, comes at a cost, as we now need to implement and maintain an entirely new layer of complexity. We must first decompress into the cache and then interpolate our keys from it. The decompression can typically be launched early to avoid stalls when interpolating, but that is not always possible. This is particularly problematic on the first frame of gameplay, where a large number of animations start to play at the same time while our cache is empty or stale. The same issue arises when a cinematic starts or during any gameplay moment with a major or abrupt change.
On the upside, as we decompress only once into the cache we can also take a bit of time to swizzle our data and sort it by key and bone such that our data per key frame is now contiguous. Sampling from our cache then becomes more or less equivalent to sampling with simple quantization. For this reason sampling from the cache is extremely fast and competitive (as fast as simple quantization).
Our small cache for Thief (2014) was held in main memory while our wavelet compressed data was held in video memory on the PlayStation 3. This played very well in our favor: the rare decompressions did not impact the rendering bandwidth as much, and interpolation stayed fast. It also contributed to slower decompression times, but it was still faster than on the Xbox 360.
In conclusion, signal processing algorithms should be avoided in favor of simpler algorithms that are easier to implement and maintain, and that end up just as competitive when properly implemented.
Up next: Error Compensation
Back to table of contents
10 Dec 2016
Curve fitting builds on what we last saw with linear key reduction. There, we leveraged linear interpolation to remove keys that could easily be predicted. Curve fitting achieves the same feat by using a different interpolation method: a spline function.
How It Works
The algorithm is already fairly well described by Riot Games and Bitsquid (now called Stingray) in part 1 and part 2, and as such I will not go further into details at this time.
Catmull-Rom splines are a fairly common and solid choice to represent our curves.
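For reference, here is a minimal sketch of evaluating a uniform Catmull-Rom segment for a single float channel; p1 and p2 are the control points being interpolated between, p0 and p3 are their neighbours, and t is in [0, 1]. This is the standard textbook formulation, not any particular engine's implementation.

```cpp
// Uniform Catmull-Rom evaluation between p1 and p2; p0 and p3 are the
// neighbouring control points and t is in [0, 1].
inline float CatmullRom(float p0, float p1, float p2, float p3, float t)
{
    const float t2 = t * t;
    const float t3 = t2 * t;
    return 0.5f * ((2.0f * p1)
        + (-p0 + p2) * t
        + (2.0f * p0 - 5.0f * p1 + 4.0f * p2 - p3) * t2
        + (-p0 + 3.0f * p1 - 3.0f * p2 + p3) * t3);
}
```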
In The Wild
This algorithm is again fairly popular and is used by most animation authoring software and many game engines. Sadly, I never had the chance to get my hands on a state of the art implementation of this algorithm and as such I can’t quite go as far in depth as I would otherwise like to do.
Most character animated tracks move very smoothly and approximating them with a curve is a very natural choice. In fact, clips that are authored by hand are often encoded and manipulated as a curve in Maya (or 3DS Max). If the original curves are available, we can use them as-is. This also makes the information very dense and compact. The memory footprint of curve fitting should be considerably lower than with linear key reduction but I do not have access to competitive implementations of both algorithms to make a fair comparison.
For example, take this screen capture from some animation curves in Unity:
We can easily see that each track has five control points, but with a total clip duration of 2.5 seconds (note that the image uses a sample rate of 25 FPS, which makes the numbering a bit quirky) we would need 2.5 seconds * 30 frames/second = 75 frames to represent the same data. Even after using linear key reduction, the number of keys would remain higher than five.
As with linear key reduction, our spline control points will have time markers and most of what was mentioned previously will apply to curve fitting as well: we need to search for our neighbour control points, we need to sort our data to be cache efficient, etc.
One important distinction is that while linear key reduction only needs two keys per track to reconstruct our desired value at a particular time T, with curve fitting we might need more. For example, Catmull-Rom splines require four control points. This makes it more likely to increase the number of cache lines we need to read when we sample our clip. For this reason, and because a spline interpolation function is more expensive to evaluate, decompression should be slower than with linear key reduction; without access to a solid implementation, however, this remains an educated guess.
Up next: Signal Processing
Back to table of contents