Smaller, faster: ACL lets you cut your animation costs in half

I am excited to announce the open source Animation Compression Library has reached v1.1 along with an updated Unreal Engine 4 plugin v0.2.

ACL now beats Unreal Engine 4 on all the important metrics. The plugin is 1.7x smaller, 3x more accurate, 2.5x faster to compress, and up to 4.8x faster to decompress!

What’s new

The latest release focused on decompression performance. ACL v1.1 is about 30% faster than the previous version on every platform when decompressing a whole pose. Decompressing a single bone, a common operation in Unreal, is now about 3-4x faster. Also, ARM-based products will now use NEON SIMD acceleration when available.

The UE4 plugin was reworked to be more tightly integrated and is about twice as fast compared to the previous version.

ACL has now reached a point where I can confidently say that it is the best overall animation compression algorithm in the video games industry. While other techniques might beat ACL on some of these metrics, beating it simultaneously on speed, size, and accuracy will prove to be very challenging. In particular, unlike other algorithms that offer very fast decompression, ACL has no extra runtime memory cost beyond the compressed clip data.

New data!

One year ago, Epic generously agreed to let me use the Paragon animations for research purposes. This helped me find and fix bugs in Unreal Engine and ACL, and see how well both animation compression approaches perform in a real game. Paragon also allows each release to be rigorously tested against a large, relevant, and varied data set.

I am excited to announce that Epic is allowing me to use Fortnite to further my research as well! While Paragon will continue to play its role in tracking compression performance and regression testing, Fortnite will allow me to measure decompression performance in real world scenarios much more easily. Testing with Fortnite should highlight new ways ACL can be improved further.

What’s next

I am shifting my focus to add animation compression plugin support to UE4 during the next few months. If everything goes well, when UE 4.22 is released next year, I will be able to add the ACL plugin to the Unreal Engine Marketplace for everyone to use, for free.

Proper plugin support will remove overhead and help make ACL’s in-game decompression faster still.

Due to the rigorous testing and extensive statistics extraction every release now requires, I expect the release cycle to slow down. I will aim to perform non-bug fix releases about twice a year.

Compression performance overview

Here is a quick glance at how well it performs on the animations from Paragon:

                                          ACL Plugin v0.2.0    UE v4.19.2
Compressed size                           226.09 MB            392.97 MB
Compression ratio                         18.91 : 1            10.88 : 1
Compression time                          6h 4m 18.21s         15h 10m 23.56s
Bone error 99th percentile                0.0116 cm            0.0328 cm
Samples below 0.01 cm error threshold     98.85 %              84.88 %

The 99th percentile and the number of samples below the 0.01 cm error threshold are calculated by measuring the world-space error of every bone at every sample in each of the 6558 animation clips. To put this into perspective, over 99 % of the compressed data has an error lower than the width of a human hair. More details on how the error is measured can be found here.
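The aggregation itself is straightforward. Here is a minimal sketch (the function names are mine, not ACL's) of how a 99th percentile and a below-threshold ratio can be computed from a flat list of per-sample world-space errors, using the nearest-rank method:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Illustrative sketch: given the world-space error of every bone at every
// sample (one value per bone per sample, in centimeters), compute a
// percentile with the nearest-rank method and the ratio of samples below
// an error threshold.
double error_percentile(std::vector<double> errors, double percentile)
{
	std::sort(errors.begin(), errors.end());
	// Nearest rank: the k-th smallest value with k = ceil(p * n).
	size_t rank = static_cast<size_t>(std::ceil(percentile * errors.size()));
	return errors[std::min(rank, errors.size()) - 1];
}

double ratio_below_threshold(const std::vector<double>& errors, double threshold)
{
	size_t count = 0;
	for (double error : errors)
		if (error < threshold)
			count++;
	return static_cast<double>(count) / errors.size();
}
```
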

Decompression performance overview

Decompression performance is currently tracked with the Matinee fight scene. The troopers have around 70 bones each while the main trooper has 541.

Matinee S8 Median Performance

Much care was taken to ensure that ACL delivers consistent decompression performance. The following two images show the time taken to decompress a pose at every point of the Matinee fight scene, highlighting how little it varies.

Matinee UE4 S8 Performance Variance Matinee ACL S8 Performance Variance

ACL also offers consistent decompression performance regardless of the playback direction, and it works on every modern platform, making it a safe choice as the default algorithm in your games.

Overall, ACL is ideal for games with large amounts of animations playing concurrently such as those with large crowds, MMOs, and e-sports as well as those that run on mobile or slower platforms.

Animation Compression Library: Release 1.0.0

The long awaited ACL v1.0 release is finally here! And it comes with the brand new Unreal Engine 4 plugin v0.1! It took over 15 months of late nights, days off, and weekends to reach this point and I couldn’t be more pleased with the results.

Recap

The core idea behind ACL was to explore a different way to perform animation compression, one that departed from classic methods. Unlike the vast majority of algorithms in the wild, it uses bit aligned values as opposed to naturally aligned integers. This is slower to unpack but I hoped to compensate by not performing any sort of key reduction. By retaining every sample, the data is uniform in memory and offsets are trivially calculated, keeping things fast, the memory touched contiguous, and the hardware happy. While the technique itself isn’t novel and is often used with compression algorithms in other fields, to my knowledge it had never been tried to the extent ACL pushes it with animation compression, at least not publicly.
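To illustrate the idea, here is a hedged sketch (not ACL's actual code) of reading a bit-aligned value from a packed buffer on a little-endian machine. Because every sample is retained and has the same bit width, the bit offset of sample `i` is simply `i * bit_width`; no key table or search is required:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Illustrative sketch: samples are packed back to back, LSB first, at an
// arbitrary bit width. Assumes a little-endian CPU and that the buffer is
// padded so an 8-byte read never runs past the end; real code also has to
// handle wider values and alignment more carefully.
uint32_t unpack_bits(const uint8_t* buffer, uint64_t bit_offset, uint32_t bit_width)
{
	// Load 8 bytes starting at the byte containing the first bit.
	uint64_t value = 0;
	std::memcpy(&value, buffer + bit_offset / 8, sizeof(value));
	// Discard the bits below our offset, then mask off our sample.
	value >>= bit_offset % 8;
	return static_cast<uint32_t>(value & ((uint64_t(1) << bit_width) - 1));
}
```

Contrast this with key reduction: the retained keys are sorted by time, so finding the pair that brackets the sample time requires a search and the memory touched is no longer a handful of contiguous streams.
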

Very early, the technique proved competitive and over time it emerged as a superior alternative over traditional techniques involving key reduction. I then spent about 8 months writing the necessary infrastructure to make ACL not only production ready but production quality: unit tests were written, extensive regression tests were introduced, documentation was added as well as comments, scripts to replicate the results, cross platform support (ACL now runs on every platform!), etc. All that good stuff that one would expect from a professional product.

But don’t take my word for it! Check out the 100% C++ code (MIT license), the statistics below, and take the plugin out for a spin!

Performance

While ACL provides various synthetic test harnesses to benchmark and extract statistics, nothing beats running it within a real game engine. This is where the UE4 plugin comes in and really shines. Just as with ACL, three data sets are measured: CMU, Paragon, and the Matinee fight scene.

Note that there are small differences between measuring with the UE4 plugin and with the ACL test harnesses due to implementation choices in the plugin.

Carnegie-Mellon University (CMU)

                     ACL Plugin v0.1.0    UE v4.19.2
Compressed size      70.60 MB             99.94 MB
Compression ratio    20.25 : 1            14.30 : 1
Max error            0.0722 cm            0.0996 cm
Compression time     34m 30.51s           1h 27m 40.15s

ACL was smaller for 2532 clips (99.92 %)
ACL was more accurate for 2486 clips (98.11 %)
ACL had faster compression for 2534 clips (100.00 %)
ACL was smaller, better, and faster for 2484 clips (98.03 %)

Had the ACL Plugin been included in the Automatic Compression permutations tried, it would have won for 2534 clips (100.00 %).

Data tracked here by the plugin, and here by ACL.

Paragon

                     ACL Plugin v0.1.0    UE v4.19.2
Compressed size      226.02 MB            392.97 MB
Compression ratio    18.92 : 1            10.88 : 1
Max error            0.8566 cm            0.6424 cm
Compression time     6h 35m 03.24s        15h 10m 23.56s

ACL was smaller for 6413 clips (97.79 %)
ACL was more accurate for 4972 clips (75.82 %)
ACL had faster compression for 5948 clips (90.70 %)
ACL was smaller, better, and faster for 4499 clips (68.60 %)

Had the ACL Plugin been included in the Automatic Compression permutations tried, it would have won for 6098 clips (92.99 %).

Data tracked here by the plugin, and here by ACL.

Matinee fight scene

                     ACL Plugin v0.1.0    UE v4.19.2
Compressed size      8.67 MB              23.67 MB
Compression ratio    7.20 : 1             2.63 : 1
Max error            0.0674 cm            0.0672 cm
Compression time     52.44s               54m 03.18s

ACL was smaller for 1 clip (20 %)
ACL was more accurate for 4 clips (80 %)
ACL had faster compression for 5 clips (100 %)
ACL was smaller, better, and faster for 0 clips (0 %)

Had the ACL Plugin been included in the Automatic Compression permutations tried, it would have won for 3 clips (60 %).

Data tracked here by the plugin, and here by ACL.

Decompression performance

Matinee S8 Median Performance

Playground S8 Median Performance

Data tracked here by the plugin, and here by ACL (they also include other platforms and more data).

Performance summary

As the numbers clearly show, ACL beats UE4 across every compression metric, sometimes by a significant margin: it is MUCH faster to compress, the quality is just as good, and the memory footprint is significantly reduced. ACL achieves all of this with default settings that animators rarely if ever need to tweak. What’s not to love?

However, the ACL decompression performance is sometimes ahead, sometimes behind, and sometimes on par. There are a few reasons for this, most of which I hope to fix in the next version to take the lead: NEON (SIMD) is not yet used on ARM, the ACL plugin needlessly performs MUCH more work than UE4 when decompressing, and many low-hanging fruit optimizations were deliberately left for after the 1.0 release.

ACL is just getting started!

How to use the ACL Plugin

As the documentation states here, a few minor engine changes are required in order to support the ACL plugin. These changes mostly consist of bug fixes and changes to expose the necessary hooks to plugins.

For the time being, the plugin is not yet on the marketplace as it is not fully plug-and-play. However, this summer I am working with Epic to introduce the necessary changes in order to publish the ACL plugin on the marketplace. Stay tuned!

Note that the ACL Plugin will reach v1.0 once it can be published on the marketplace but it is production ready regardless.

What’s new in ACL v1.0

Few things actually changed in between v0.8 and v1.0. Most of the changes revolved around minor additions, documentation updates, etc. There are two notable changes:

  • The first is visible in the decompression graphs: we now yield the thread before measuring every sample. This helps ensure more stable results by reducing the likelihood that the kernel will swap out the thread and interrupt it while executing the decompression code.
  • The second is visible in the compression stats for Paragon: a bug was causing the visible error to sometimes be partially hidden when 3D scale is present. While the new version is not less accurate than the previous, the measured error can be higher in very rare cases (only 1 clip is higher).

Regardless, the measuring should now be much more stable.

What’s next

The next release of ACL will focus on improving the compression and decompression performance. While ACL was built from the ground up to be fast to decompress, so far the focus has been on making sure things function properly and safely to establish a solid baseline to work with. Now that this work is done, the fun part can begin: making it the best it can be! I have many improvements planned and while some of them will make it into v1.1, others will have to wait for future versions.

Special care will be taken to make sure ACL performs at its best in UE4 but there is no reason why it couldn't be used in your own favorite game engine or animation middleware. Developing with UE4 is easier for me in large part because of my past experience with it, my relationship with Epic, and the fact that it is open source. Other game engines like Unity explicitly forbid their use for benchmarking purposes in their EULA, which prevents me from publishing any results without prior written agreement from their legal department. Furthermore, without access to the source code, creating a plugin for them requires a lot more work. In due time, I hope to support Unity, Godot, and anyone else willing to try it out.

Animation Compression Library: Release 0.8.0

Today marks the v0.8 release of the Animation Compression Library. It contains lots of goodies, but by far the most significant is that ACL has now reached feature parity with Unreal 4. For the first time, Unreal 4 games should be able to run exclusively with ACL. The focus for the next two months will be to validate this with my custom UE 4.15 integration, implement whatever might be missing, and create a proper, free plugin to bring ACL to the marketplace.

While I have already published some decompression performance numbers earlier this week, once a proper integration has been made, new numbers will be published to showcase how ACL performs against Unreal 4 within the game engine itself. The existing numbers for the Carnegie-Mellon University database, Paragon, and the Matinee fight scene already clearly show ACL to be ahead in terms of compression time, compression ratio, and accuracy. However, while it remains to be seen if it will also be ahead with its decompression performance, I fully expect that it will.

ACL decompression performance

At long last I finally got around to measuring the decompression performance of ACL. This blog post will detail the baseline performance from which we will measure future progress. As I have previously mentioned, no effort has been made so far to optimize the decompression and I hope to remedy that following the v1.0 release scheduled around June 2018.

In order to establish a reliable data set to measure against, I use the same 42 clips used for regression testing plus 5 more from the Matinee fight scene. To keep things interesting, I measure performance on everything I have on hand:

The first two use both x86 and x64 while the latter two use armv7-a and arm64 respectively. Furthermore, on the desktop I also compare VS 2015, VS 2017, GCC 7, and Clang 5. The more data, the merrier!

Decompression is measured both with a warm CPU cache to remove the memory fetches as much as possible from the equation as well as with a cold CPU cache to simulate a more realistic game engine playback scenario.

Three forms of playback are measured: forward, backward, and random.

Each clip is sampled 3 times at every key frame based on the clip sample rate and the smallest value is retained for that key.
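That best-of-three scheme can be sketched as follows (illustrative names, not the actual profiling harness). Keeping the smallest of several attempts filters out one-off interruptions such as context switches, which only ever make a measurement slower, never faster:

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Illustrative sketch: time a function several times and keep the smallest
// duration, in milliseconds.
double min_time_ms(const std::function<void()>& fn, int attempts = 3)
{
	double best = 1e300;
	for (int i = 0; i < attempts; ++i)
	{
		auto start = std::chrono::high_resolution_clock::now();
		fn();
		auto end = std::chrono::high_resolution_clock::now();
		double elapsed = std::chrono::duration<double, std::milli>(end - start).count();
		if (elapsed < best)
			best = elapsed;
	}
	return best;
}
```
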

Finally, two ways to decompress are profiled: decompressing a whole pose in one go (decompress_pose), and decompressing a whole pose bone by bone (decompress_bone).

The profiling harness is not perfect but I hope the extensive data pulled from it will be sufficient for our purposes.

Playback direction

In a real game, the overwhelming majority of clips play forward in time. Some clips play backwards (e.g. opening and closing a chest might use the same animation played in reverse) and a few others play randomly (e.g. driven by a thumb stick).

Not all algorithms exhibit the same performance regardless of playback direction. In particular, forms of delta encoding, as well as any caching of the last played position, will degrade severely when the playback direction isn't the one they were optimized for (as is often the case with key reduction techniques due to the data being sorted by time).

ACL currently uses the uniformly sampled algorithm which offers consistent performance regardless of the playback direction. To validate this claim, I hand picked 3 clips that are fairly long: 104_30 (44 bones, 11 seconds) from CMU, and Trooper_1 (71 bones, 66 seconds) and Trooper_Main (541 bones, 66 seconds) from the Matinee fight scene. To visualize the performance, I used a box and whiskers chart which shows concisely the min/max as well as the quartiles. Forward playback is shown in Red, backward in Green, and random in Blue.

VS 2015 x64 Playback Performance

As we can see, the performance is identical for all intents and purposes regardless of the playback direction on my desktop with VS 2015 x64. Let’s see if this claim holds true on my iPad as well.

iOS arm64 Playback Performance

Here again we see that the performance is consistent. One thing that shows up on this chart is that, surprisingly, the iPad performance is often better than my desktop's! That is INSANE and I nearly fell off my chair when I first saw this. Not only is the CPU clocked at a lower frequency, but the desktop code makes use of SSE and AVX where it can for all basic vector arithmetic while there is currently no corresponding NEON SIMD support. I double and triple checked the numbers and the code. Suspecting that the compiler might be playing a big part in this, I set out to dump all the compiler stats on desktop, something I did not originally intend to do. Read on!

The CPU cache

Because animation clips are typically sampled once per rendered image, the CPU cache will generally always be cold during decompression. Fortunately for us, modern CPUs offer hardware prefetching which greatly helps when reads are linear. The uniformly sampled algorithm ACL uses is uniquely optimized for this with ALL reads being linear and split into 4 streams: constant track values, clip range data, segment range data, and the animated segment data.

Notes: ACL does not currently have any software prefetching and the constant track and clip range data will later be merged into a single stream since a track is one of three types: default (in which case there is neither constant nor range data), constant with no range data, or animated with range data and thus not constant.
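As a rough illustration of that three-way split (the names and threshold are mine, and real tracks hold rotations, translations, and scales rather than scalars), a track could be bucketed like this:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Illustrative sketch of the three track types mentioned above: 'default'
// if every sample matches the default value (e.g. the identity), 'constant'
// if every sample matches the first one, 'animated' otherwise. Assumes a
// non-empty track.
enum class TrackType { Default, Constant, Animated };

TrackType classify_track(const std::vector<float>& samples, float default_value, float threshold)
{
	bool is_default = true;
	bool is_constant = true;
	for (float sample : samples)
	{
		if (std::fabs(sample - default_value) > threshold)
			is_default = false;
		if (std::fabs(sample - samples[0]) > threshold)
			is_constant = false;
	}
	if (is_default)
		return TrackType::Default;
	if (is_constant)
		return TrackType::Constant;
	return TrackType::Animated;
}
```

Default tracks need no stored data at all, constant tracks need a single value, and only animated tracks carry per-sample data plus range information.
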

For this reason, a cold cache is what will most interest us. That being said, I also measured with a warm CPU cache. This will allow us to see how much time is spent waiting on memory versus executing instructions. It will also allow us to compare the various platforms in terms of CPU and memory speed.

In the following graphs, the x86 performance was omitted because for every compiler it is slower than x64 (ranging from 25 % to 200 % slower), except on my OS X laptop where the performance was nearly identical. I also omitted the VS 2017 performance because it was identical to VS 2015. Forward playback is used along with decompress_pose. The median decompression time is shown.

Two new clips were added to the graphs to get a better picture.

Cold CPU Cache Performance Cold CPU Cache Performance cont.

Again, we can see that the iPad outperforms almost everything with a cold cache except on the desktop with GCC 7 and Clang 5. It is clear that Clang does an outstanding job and plays an integral part in the surprising iPad performance. Another point worth noting is that its memory is faster than what I have in my desktop. My iPad has memory clocked at 1600 MHz (25 GB/s) while my desktop has its memory clocked at 1067 MHz (16.6 GB/s).

And now with a warm cache:

Warm CPU Cache Performance Warm CPU Cache Performance cont.

We can see that the iPad now loses out to VS 2015 with one exception: Trooper_Main. Why is that? That particular clip should easily fit within the CPU cache: only about 40KB is touched when sampling (or about 650 cache lines). Further research led to another interesting fact: the iPad A10X processor has a 64KB L1 data cache per core (and 8 MB L2 shared) while my i7-6850K has a 32KB L1 data cache and a 256KB L2 (with 15MB L3 shared). The clip thus fits entirely within the L1 on the iPad but needs to be fetched from the L2 on desktop.

Another takeaway from these graphs is that GCC 7 beats VS 2015 and Clang 5 beats both hands down on my desktop.

Finally, my Nexus 5X is really slow. On all the graphs, it exceeded any reasonable scale and I had to truncate it. I included it for the sake of completeness and to get a sense of how much slower it was.

Decompression method

ACL currently offers two ways to decompress: decompress_pose and decompress_bone. The former is more efficient if the whole pose is required but in practice it is very common to decompress specific bones individually or to decompress a pose bone by bone.

The following charts use the median decompression time with a cold CPU cache and forward playback.

Function Performance on CMU Function Performance on the Matinee Fight

Once more, we see very clearly how outstanding and consistent the iPad performance is. The numbers for the Nexus 5X are very noisy in comparison in large part because of the slower memory and larger footprint of some clips (decompress_bone is not shown for Android because it was far too slow and prevented a clean view of everything else).

We can clearly see that decompressing each bone separately is much slower, entirely because, at the time of writing, every bone that is not required must be skipped over instead of being located directly with an offset. This will be optimized soon and the performance should end up much closer.

Conclusion

Despite having no external reference frame to compare them against, I could confirm and validate my hunches as well as observe a few interesting things:

  • My Nexus 5X is really slow …
  • Both GCC 7 and Clang 5 generate much better code than VS 2017
  • decompress_bone is much slower than it needs to be
  • The playback direction has no impact on performance

By far the most surprising thing to me was the iPad performance. Even though what I measure is not representative of ordinary application code, the numbers clearly demonstrate that the single core decompression performance matches that of a modern desktop. It might even exceed the single core performance of an Xbox One or PlayStation 4! Wow!!

I do have some baseline Unreal 4 numbers on hand but this blog post is already getting long and the next ACL version aims to be integrated into a native Unreal 4 plugin which will allow for a superior comparison to be made. However, they do show that ACL will be very close and will likely exceed the UE 4.15 decompression performance; stay tuned!

How much does additive bind pose help?

A common trick when compressing an animation clip is to store it relative to the bind pose. The conventional wisdom is that this reduces the range of motion of many bones, increasing the accuracy and the likelihood that constant bones will turn into the identity, thus allowing a lower memory footprint. I have implemented this specific feature many times in the past and the results were consistent: a memory reduction of 3-5 % was generally observed.

Now that the Animation Compression Library supports additive animation clips, I thought it would be a good idea to test this claim once more.

How it works

The concept is very simple to implement:

  • Before compression happens, the bind pose is removed from the clip by subtracting it from every key.
  • Then, the clip is compressed as usual.
  • Finally, after we decompress a pose, we simply add back the bind pose.
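Restricted to the translation track for brevity, the three steps above could be sketched like this (the `Vec3` type and function names are illustrative; the full transform math for rotations and scale is covered by the spaces described below):

```cpp
#include <cassert>
#include <cmath>

// Illustrative sketch: a translation key stored relative to the bind pose.
struct Vec3 { float x, y, z; };

// Before compression: subtract the bind pose from every key.
Vec3 remove_bind_pose(const Vec3& key, const Vec3& bind)
{
	return { key.x - bind.x, key.y - bind.y, key.z - bind.z };
}

// After decompression: add the bind pose back to recover the original key.
Vec3 add_bind_pose(const Vec3& key, const Vec3& bind)
{
	return { key.x + bind.x, key.y + bind.y, key.z + bind.z };
}
```

The round trip is lossless up to floating point precision, which is exactly the property the technique relies on.
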

The transformation is lossless aside from whatever loss happens as a result of floating point precision. It has two primary side effects.

The first is that bone translations end up with a much shorter range of motion. For example, a walking character might have the pelvic bone about 60 cm up from the ground (and root bone). The range of motion will thus hover around this large value for the whole track. Removing the bind pose brings the track closer to zero since the bind pose value of that bone is likely very near 60 cm. Smaller floating point values generally retain higher accuracy. The principle is identical to normalizing a track within its range.
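As a reminder of that principle, here is a minimal sketch of range normalization (illustrative names): the track is remapped into [0, 1] within its [min, max] range so the quantized values stay small and use all of the available precision.

```cpp
#include <cassert>
#include <cmath>

// Illustrative sketch: remap a sample into [0, 1] within its track range.
float normalize_sample(float value, float range_min, float range_max)
{
	return (value - range_min) / (range_max - range_min);
}

// Inverse mapping, applied at decompression time.
float denormalize_sample(float normalized, float range_min, float range_max)
{
	return normalized * (range_max - range_min) + range_min;
}
```
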

The second impacts constant tracks. If the pelvic bone is not animated in a clip, it will retain some constant value. This value is often the bind pose itself. When this happens, removing the bind pose yields the identity rotation and translation. Since these values are trivial to reconstruct at runtime, instead of having to store the constant floating point values, we can store a simple bit set.

As a result, hand animated clips with the bind pose removed often find themselves with a lower memory footprint following compression.

Mathematically speaking, how the bind pose is added or removed can be done in a number of ways, much like additive animation clips. While additive animation clips heavily depend on the animation runtime, ACL now supports three variants:

  • Relative space
  • Additive space 0
  • Additive space 1

The last two names are not very creative or descriptive… Suggestions welcome!

Relative space

In this space, the clip is reconstructed by multiplying the bind pose with a normal transform_mul operation. For example, this is the same operation used to convert from local space to object space. Performance wise, this is the slowest: to reconstruct our value we end up having to perform 3 quaternion multiplications and if negative scale is present in the clip, it is even slower (extra code not shown below, see here).

Transform transform_mul(const Transform& lhs, const Transform& rhs)
{
	// Rotations and scales combine multiplicatively while the translation is
	// scaled and rotated by the base transform before being offset.
	Quat rotation = quat_mul(lhs.rotation, rhs.rotation);
	Vector4 translation = vector_add(quat_rotate(rhs.rotation, vector_mul(lhs.translation, rhs.scale)), rhs.translation);
	Vector4 scale = vector_mul(lhs.scale, rhs.scale);
	return transform_set(rotation, translation, scale);
}

Additive space 0

This is the first of the two classic additive spaces. It simply multiplies the rotations, it adds the translations, and multiplies the scales. The animation runtime ozz-animation uses this format. Performance wise, this is the fastest implementation.

Transform transform_add0(const Transform& base, const Transform& additive)
{
	// Each component combines independently: rotations and scales multiply,
	// translations add.
	Quat rotation = quat_mul(additive.rotation, base.rotation);
	Vector4 translation = vector_add(additive.translation, base.translation);
	Vector4 scale = vector_mul(additive.scale, base.scale);
	return transform_set(rotation, translation, scale);
}

Additive space 1

This last additive space combines the base pose in the same way as the previous except for the scale component. This is the format used by Unreal 4. Performance wise, it is very close to the previous space but requires an extra instruction or two.

Transform transform_add1(const Transform& base, const Transform& additive)
{
	// Same as transform_add0 except the additive scale is stored as a delta
	// from 1.0, hence the extra add before multiplying.
	Quat rotation = quat_mul(additive.rotation, base.rotation);
	Vector4 translation = vector_add(additive.translation, base.translation);
	Vector4 scale = vector_mul(vector_add(vector_set(1.0f), additive.scale), base.scale);
	return transform_set(rotation, translation, scale);
}

It is worth noting that because these two additive spaces differ only by how they handle scale, if the animation clip has none, both methods will yield identical results.

Results

Measuring the impact is straightforward: I enabled all three modes one by one and compressed the entire Carnegie-Mellon University motion capture database as well as the entire Paragon data set. Decompression performance was not measured on its own but the compression time will serve as a hint as to how it would perform.

Everything has been measured with my desktop using Visual Studio 2015 with AVX support enabled with up to 4 clips being compressed in parallel. All measurements were performed with the upcoming ACL v0.8 release.

CMU Results

CMU has no scale and it is thus no surprise that the two additive formats perform the same. The memory footprint and the max error remain overall largely identical but as expected the compression time degrades. No gain is observed from this technique which further highlights how this data set differs from hand authored animations.

Paragon Results

Paragon shows the results I was expecting. The memory footprint reduces by about 7.9% which is quite significant and the max error improves as well. Again, we can see both additive methods performing equally well. The relative space clearly loses out here and fails to show significant gains to compensate for the dramatically worse compression performance.

Conclusion

Overall it seems clear that any potential gains from this technique are heavily data dependent. A nearly 8 % smaller memory footprint is nothing to sneeze at but in the grand scheme of things, it might no longer be worth it in 2018 when decompression performance is likely much more important, especially on mobile devices. It is not immediately clear to me if the reduction in memory footprint could save enough to translate into fewer cache lines being fetched, but even so it seems unlikely that it would offset the extra cost of the math involved.

Back to table of contents