The Animation Compression Library just got even faster

Slowly but surely, the Animation Compression Library has now reached v1.2 along with an updated v0.3 Unreal Engine 4 plugin. The most notable changes in this release are as follows:

  • More compilers and architectures added to continuous integration
  • Accuracy bug fixes
  • Floating point sample rate support
  • Dramatically faster compression through the introduction of a compression level setting

TL;DR: Compared to UE 4.19.2, the ACL plugin compresses up to 1.7x smaller, is up to 3x more accurate, up to 158x faster to compress, and up to 7.5x faster to decompress (results may vary depending on the platform and data).

Note that UE 4.21 introduced changes that significantly sped up compression with its Automatic Compression codec but I haven’t had the time to set up a new branch with it to measure.

UE 4 plugin support and progress

Now that ACL properly supports a floating point sample rate, the UE4 plugin has reached feature parity with the stock codecs.

As announced at GDC 2019, work is ongoing to refactor Unreal Engine to natively support animation compression plugins and it is currently on track to land with UE 4.23. Once it does, the plugin will be updated once more, finally reaching v1.0 on the Unreal marketplace for free.

Lightning fast compression

The most common feedback I received from those who use ACL in the wild (both within UE4 and outside) was the desire for faster compression. The optimization algorithm is very aggressive and, despite its impressive performance overall (as highlighted in prior releases), some clips with deep bone hierarchies could take a prohibitively long time to compress.

In order to address this, a new compression level was introduced in the compression settings to better control how much time should be spent attempting to find an optimal bit rate. Higher levels take more time but yield a lower memory footprint. A total of five levels were introduced, although the lowest three currently behave the same: Lowest, Low, Medium, High, Highest. The Highest level corresponds to what prior releases did by default. After carefully reviewing the impact of each level, a decision was made to make Medium the default level instead. This translates into dramatically faster compression and identical accuracy, with a very small and acceptable increase in memory footprint. This should provide a much better experience for animators during production. Once the game is ready to be released, the animations can easily and safely be recompressed with the Highest setting in order to squeeze out every byte.
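As a rough illustration of how a client might choose between levels, here is a minimal sketch. The enum simply mirrors the five level names listed above; the `compression_level` type and the `pick_level` helper are hypothetical names used for illustration and are not taken from the ACL API.

```cpp
#include <cstdint>

// Hypothetical enum mirroring the five levels named in the post, ordered from
// the least to the most time spent searching for an optimal bit rate.
enum class compression_level : std::uint8_t
{
    lowest,   // currently behaves the same as medium
    low,      // currently behaves the same as medium
    medium,   // new default: much faster, slightly larger output
    high,
    highest,  // what prior releases did by default
};

// Hypothetical helper: iterate quickly during production and squeeze out
// every byte once the game is ready to be released.
inline compression_level pick_level(bool is_shipping_build)
{
    return is_shipping_build ? compression_level::highest : compression_level::medium;
}
```

With this sketch, `pick_level(false)` yields the Medium default used during production while `pick_level(true)` selects Highest for the final recompression pass.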

In order to extract the following results, I compressed the Carnegie-Mellon University motion capture database, Paragon, and Fortnite in parallel with 4 threads using ACL standalone. Numbers in parentheses represent the delta against Highest.

| Compressed Size | Highest | High | Medium |
| --- | --- | --- | --- |
| CMU | 67.05 MB | 68.85 MB (+2.7%) | 71.01 MB (+5.9%) |
| Paragon | 206.87 MB | 211.81 MB (+2.4%) | 218.58 MB (+5.7%) |
| Fortnite | 491.79 MB | 497.60 MB (+1.2%) | 507.11 MB (+3.1%) |

| Compression Time | Highest | High | Medium |
| --- | --- | --- | --- |
| CMU | 24m 57.59s | 11m 51.48s | 6m 20.89s |
| Paragon | 4h 55m 42.57s | 1h 19m 36.01s | 29m 21.65s |
| Fortnite | 8h 13m 1.66s | 2h 29m 59.37s | 1h 3m 18.17s |

| Compression Speed | Highest | High | Medium |
| --- | --- | --- | --- |
| CMU | 977.36 KB/sec | 2057.24 KB/sec (+2.1x) | 3842.79 KB/sec (+3.9x) |
| Paragon | 246.79 KB/sec | 916.82 KB/sec (+3.7x) | 2485.58 KB/sec (+10.1x) |
| Fortnite | 613.56 KB/sec | 2016.82 KB/sec (+3.3x) | 4778.65 KB/sec (+7.8x) |

And here are the default settings in action on the animations from Paragon with the ACL plugin inside UE4:

| | ACL Plugin v0.3.0 | ACL Plugin v0.2.0 | UE v4.19.2 |
| --- | --- | --- | --- |
| Compressed size | 234.76 MB | 226.09 MB | 392.97 MB |
| Compression ratio | 18.22 : 1 | 18.91 : 1 | 10.88 : 1 |
| Compression time | 30m 14.69s | 6h 4m 18.21s | 15h 10m 23.56s |
| Compression speed | 2412.94 KB/sec | 200.32 KB/sec | 80.16 KB/sec |
| Max ACL error | 0.8623 cm | 0.8590 cm | 0.8619 cm |
| Max UE4 error | 0.8601 cm | 0.8566 cm | 0.6424 cm |
| ACL Error 99th percentile | 0.0094 cm | 0.0116 cm | 0.0328 cm |
| Samples below ACL error threshold | 99.19 % | 98.85 % | 84.88 % |

The 99th percentile and the number of samples below the 0.01 cm error threshold are calculated by measuring the error of every bone at every sample in each of the 6558 animation clips. More details on how the error is measured can be found here.
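To make those two statistics concrete, here is a minimal sketch of how they could be computed from a flat list of per-bone, per-sample errors. The nearest-rank percentile method and the function names are assumptions made for this example; the post does not say which percentile definition is used.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Nearest-rank percentile (an assumed definition). `errors` must not be empty.
// It is taken by value because std::nth_element reorders its input.
inline float percentile(std::vector<float> errors, double pct)
{
    const std::size_t rank = static_cast<std::size_t>(std::ceil(pct / 100.0 * errors.size()));
    const std::size_t index = rank == 0 ? 0 : rank - 1;
    std::nth_element(errors.begin(), errors.begin() + static_cast<std::ptrdiff_t>(index), errors.end());
    return errors[index];
}

// Percentage of samples whose error is strictly below `threshold`.
inline double percent_below(const std::vector<float>& errors, float threshold)
{
    const auto count = std::count_if(errors.begin(), errors.end(),
        [threshold](float error) { return error < threshold; });
    return 100.0 * static_cast<double>(count) / static_cast<double>(errors.size());
}
```

Calling `percentile(errors, 99.0)` and `percent_below(errors, 0.01f)` on the errors gathered from every bone at every sample would produce the kind of numbers shown in the last two rows of the table.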

In this new release, the decompression performance remains largely unchanged. It is worth noting that a month ago my Google Nexus 5X died abruptly and as such performance numbers will no longer be tracked on it. Instead, my new Google Pixel 3 will be used from here on out.

What’s next

The next release, v1.3, is currently scheduled for Fall 2019 and will aim to tackle commonly requested features:

  • Faster decompression in long clips by optimizing seeking
  • Multiple root transform support (e.g. rigid body simulation compression)
  • Scalar track support (e.g. float curves for blend shapes)
  • Faster compression in part by using multiple threads to compress a single clip (which will help the UE4 plugin a lot)

If you use ACL and would like to help prioritize the work I do, feel free to reach out and provide feedback or requests!

Compressing Fortnite Animations

New year, new stats! A few months ago, Epic agreed to let me use their Fortnite animations for my open source research with the Animation Compression Library (ACL). Following months of work to refactor Unreal Engine 4 in order to natively support animation compression plugins, it has finally entered the review stage on Epic’s end. While I had hoped the changes could make it in time for Unreal Engine 4.22, due to unforeseen delays, 4.23 seems a much more likely candidate.

Even though the code isn’t public yet, the new updated ACL plugin kicks ass and Fortnite is a great title to showcase it with. The real game uses the classic UE4 codecs but I recompressed everything with the latest and greatest. After spending several hundred hours compressing the animations, fixing bugs, and iterating I can finally present the results.

TL;DR: Inside Fortnite, ACL shines bright with 2x faster compression, 2x smaller memory footprint, and higher accuracy. Decompression is 1.6x faster on desktop, 2.3x faster on a Samsung S8, and 1.2x faster on the Xbox One.

Methodology

For the UE4 measurements I used a modified UE 4.21 with its default Automatic Compression. It tries a list of codecs in parallel and selects the optimal result by considering both size and accuracy.

ACL uses a modified version of the open source ACL Plugin v0.2.2. It uses its own default compression settings and in the rare event where the error is above 1cm, it falls back automatically to safer settings.
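The fallback behavior described above can be sketched as follows. This is only an illustration of the control flow: the `compress` callback, the `compression_result` struct, and the function name are hypothetical and do not reflect the plugin's actual code.

```cpp
#include <functional>

struct compression_result
{
    double max_error_cm;   // max error measured after compression, in centimeters
    bool used_fallback;    // true if the safer settings were needed
};

// `compress` is a hypothetical callback that compresses a clip with either the
// default settings (false) or the safer settings (true) and returns the max error.
inline compression_result compress_with_fallback(
    const std::function<double(bool use_safe_settings)>& compress,
    double error_threshold_cm = 1.0)
{
    const double default_error = compress(false);
    if (default_error <= error_threshold_cm)
        return { default_error, false };

    // Rare path: the default settings were not accurate enough,
    // so the clip is compressed a second time with safer settings.
    return { compress(true), true };
}
```

One consequence of this design is that a clip which triggers the fallback pays for two compression passes.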

Although the UE4 refactor doesn’t change the legacy codecs, it does speed up their decompression a bit compared to previous UE4 releases. That is one of many benefits everyone will get to enjoy as a result of my refactor work regardless of which codec is used.

Error measurements

While the UE4 and ACL error measurements never exactly matched, they historically have been very close for every single clip I had tried, until Fortnite. As it turns out, some exotic animations brought to light the fact that some subtle differences in how they both measure the error can lead to some large perceived discrepancies. This has now been documented in the plugin here.

Three differences stand out: how the error is measured, where the error is measured in space, and where the error is measured in time. You can follow the link above for the gory details but the gist is that ACL is more conservative and more accurate in how it measures the error and it should always be trusted over what UE4 reports in case of doubt or disagreement.

It is worth noting that because ACL does not currently support a floating point sample rate (e.g. 28.3 FPS), those clips (and there are many) have a higher reported error with UE4 because, by rounding the sample rate, we are effectively time stretching those clips a tiny bit. They still look just as good though. This will be fixed in the next version.
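To get a feel for how small that time stretch is, here is a short worked sketch. It assumes the sample rate is rounded to the nearest integer and that the first sample sits at time zero; both are assumptions made for the example, and the function names are made up.

```cpp
#include <cmath>
#include <cstdint>

// Duration in seconds of a clip with uniformly spaced samples,
// assuming the first sample sits at time 0.
inline double clip_duration(std::uint32_t num_samples, double sample_rate)
{
    return num_samples > 1 ? (num_samples - 1) / sample_rate : 0.0;
}

// Ratio between the duration played back at the rounded sample rate
// and the duration at the true sample rate.
inline double time_stretch_factor(double sample_rate)
{
    return sample_rate / std::round(sample_rate);
}
```

For a 28.3 FPS clip, `time_stretch_factor(28.3)` is 28.3 / 28, or about 1.011: the clip plays roughly 1.1% longer than authored, which is enough to shift samples in time and raise the error UE4 reports.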

The animations

I extracted all the non-additive animations regardless of whether they were used by the game or not: a grand total of 8304 clips, with a total raw size of 17 GB and roughly 17.5 hours’ worth of playback.

Fortnite has a surprising number of exotic clips. Some take hours to compress with UE4 and others have a range of motion as wide as the distance between the earth and the moon! These allowed me to identify a number of very subtle bugs in ACL and to fix them.

Compression stats

| | ACL Plugin | UE4 |
| --- | --- | --- |
| Compressed size | 498.21 MB | 1011.84 MB |
| Compression ratio | 35.55 : 1 | 17.50 : 1 |
| Compression time | 12h 38m 04.99s | 23h 8m 58.94s |
| Compression speed | 398.72 KB/sec | 217.62 KB/sec |
| Max ACL error | 0.9565 cm | 8392339 cm |
| Max UE4 error | 108904.6797 cm | 8397727 cm |
| ACL Error 99th percentile | 0.0309 cm | 2.1856 cm |
| Samples below ACL error threshold | 97.71 % | 77.37 % |

Once again, ACL performs admirably: compression is nearly twice as fast (1.83x), the memory footprint is cut in half (2.03x smaller), and the accuracy is right where we want it. This is also in line with the previous results from Paragon.

[Figure: Fortnite max error distribution]

UE4’s accuracy struggles a bit with a few clips but in practice the error might not be visible as the overwhelming majority of samples are very accurate. This is consistent as well with previous results.

[Image: UE4 import comic]

A handful of clips contribute to a large portion of the UE4 compression time and its high error. One clip in particular stands out: it has 1167 bones, 8371 samples at 120 FPS, and a total raw size of 372.66 MB. Its range of motion peaks at 477000 kilometers away from the origin! It truly pushes the codecs to their absolute limits.

| | ACL Plugin | UE4 |
| --- | --- | --- |
| Compressed size | 71.53 MB | 220.87 MB |
| Compression ratio | 5.21 : 1 | 1.69 : 1 |
| Compression time | 1m 38.07s | 4h 51m 59.13s |
| Compression speed | 3891.19 KB/sec | 21.78 KB/sec |
| Max ACL error | 0.0625 cm | 8392339 cm |
| Max UE4 error | 108904.6797 cm | 8397727 cm |

It takes almost 5 hours to compress with UE4! In comparison, ACL zips through in well under 2 minutes. While ACL tries its best with the default settings, it ultimately ends up using the safety fallback and thus compresses the clip twice in that amount of time.

Overall, if you added the ACL codec to the Automatic Compression list, here is how it would perform:

  • ACL is smaller for 7711 clips (92.86 %)
  • ACL is more accurate for 7576 clips (91.23 %)
  • ACL has faster compression for 5704 clips (68.69 %)
  • ACL is smaller, better, and faster for 5017 clips (60.42 %)
  • ACL wins Automatic Compression for 7863 clips (94.69 %)

Decompression stats

Fortnite has the handy ability to create replays. These make gathering deterministic profiling numbers a breeze. The numbers that follow are from a 50 vs 50 replay. On each platform, a section of the replay with some high intensity action was profiled.

Desktop

[Figure: Fortnite desktop decompression time]

The performance on desktop looks pretty good. ACL is consistently faster, about 38% on average. It also appears a bit less noisy, a direct benefit of the improved cache friendliness of its algorithm.

Samsung S8

[Figure: Fortnite Samsung S8 decompression time]

[Figure: Fortnite Samsung S8 ACL decompression time]

ACL really shines on mobile. On average it is 56% faster but that is only part of the story. On the S8 it appears that the core is hyperthreaded and another thread does heavy work and applies cache pressure. This causes all sorts of spikes with UE4 but in comparison, ACL’s cache awareness allows it to maintain clean and consistent performance.

Hyperthreading on the CPU (and the GPU) works, roughly speaking, by the processor switching to another thread already executing when it notices that the current thread is stalled waiting on a slow operation, typically when memory needs to be pulled into the cache. Both threads are executing in the sense that they have data being held in registers but only one of them advances at a time on that core. When one stalls, the other executes.

When you have a piece of code that triggers a lot of cache misses, such as some of the legacy UE4 codecs, the processor will be more likely to switch to the other hyperthread. When this happens, execution is suspended and it will only resume once the other thread stalls or the time slice expires. This could be a long time especially if the other hyperthread is executing cache friendly code and doesn’t otherwise stall often.

This translates into the type of graph above where there is heavy fluctuation as the execution time varies widely from the noise of the neighbor hyperthread.

On the other hand, when the code is cache friendly, it doesn’t give the chance to the other thread to run. This gives a nice and smooth graph for that current thread as the risk of long interruptions is reduced. When the code is that optimized, hyperthreading typically doesn’t help speed things up much as both threads compete for the same time slice with few opportunities to hide stalling latency. This is also what I observed when measuring the compression performance. In theory due to the higher cache pressure, performance could even degrade with hyperthreading but in practice I haven’t observed it, not with ACL at least.

Xbox One

[Figure: Fortnite Xbox One decompression time]

On the Xbox One ACL is about 13% faster on average. Both lines seem to have very similar shapes unlike the previous two platforms due in large part to the absence of hyperthreading. There are a few possibilities as to why the gain isn’t as significant on this platform:

  • The MSVC compiler does not generate assembly that is as clean as what it generates on PC and it is certainly sub-optimal on a few points: it fails to inline many trivial functions and it leaves around unnecessary shuffle instructions.
  • Perhaps the other threads that share the L2 thrash the hardware prefetcher, preventing it from kicking in. ACL benefits heavily from hardware prefetching.
  • The CPU is quite slow compared to the speed of its memory. This reduces the benefit of cache locality as it keeps L2 cache misses fairly cheap in comparison.

The last two points seem the most likely culprits. ACL does a bit more work per sample decompressed than UE4 but everything is cache friendly. This gives it a massive edge when memory is slow compared to the CPU clock as is the case on my desktop, the Samsung S8, and lots of other platforms.

Conclusion

With faster compression, faster decompression on every platform, a smaller memory footprint, and higher accuracy, ACL continues to shine in UE4 and it won’t be long now before everyone can find it on the marketplace for free.

In the meantime, in the next few months I will perform another release of ACL and its plugin with all the latest fixes made possible with Fortnite’s data.

Status update

To my knowledge, the first game released with ACL came out in November 2018 with the public UE4 plugin: OVERKILL’s The Walking Dead. I was told it reduced their animation memory footprint by over 50% helping them fit within their console budgets.

A number of people have also integrated it into their own custom game engines and although I have no idea if they are using it or not, Remedy Entertainment has forked ACL!

Last but not least, I’d like to extend a special shout-out to Epic for allowing me to do this and to the ACL contributors!

Introducing Realtime Math v1.0

Almost two years ago now, I began writing the Animation Compression Library. I set out to build it to be production quality which meant I needed a whole lot of optimized math. At the time, I took a look at the landscape of math libraries and I opted to roll out my own. It has served me well, propelling ACL to success with some of the fastest compression and decompression performance in the industry. I am now proud to announce that the code has been refactored out into its own open source library: Realtime Math v1.0 (RTM) (MIT license).

There were a few reasons that motivated the choice to move the code out on its own:

  • A significant amount of the ACL Continuous Integration build time is compiling and running the math unit tests which slows things down a bit more than I’d like
  • It decouples code that will benefit from being on its own
  • I believe it has its place in the landscape of math libraries out there

In order to support that last point, I reviewed 9 other popular and usable math libraries for realtime applications. I looked at these through the lens of my own needs and experience; your mileage may vary.

Disclaimer: the list of reviewed libraries is in no way exhaustive but I believe it is representative. Note that Unreal Engine 4 is included for informational purposes as it isn’t really usable on its own. Libraries are listed in no particular order and I tried to be as objective as possible. If you spot any inaccuracies, don’t hesitate to reach out.

The list: Realtime Math, MathFu, vectorial, VectorialPlusPlus, C OpenGL Graphics Math (CGLM), OpenGL Graphics Math (GLM), Industrial Light & Magic Base (ILMBase), DirectX Math, and Unreal Engine 4.

TL;DR: How Realtime Math stands out

I believe Realtime Math stands out for a few reasons.

It is geared for high performance, deeply hot code. Most functions will end up inlined but the price to pay is an API that is a bit more verbose as a result of being C-style. When the need arises to use intrinsics, it gets out of the way and lets you do your thing. Only two libraries had what I would call optimal inlinability: Realtime Math and DirectX Math. Only those two libraries properly support the __vectorcall calling convention explicitly and only RTM handles GCC and Clang argument passing explicitly.

While it still needs a bit of love, quaternions are a first class citizen and it is the only standalone open source library I could find that supports QVV transforms (a rotation quaternion, a 3d scale vector, and a translation vector).

Realtime Math uses a coding style similar to the C++ standard library and feels clean and natural to read and write.

It consists entirely of C++11 headers, it runs almost everywhere, it supports 64 bit floating point arithmetic, and it sports a very permissive MIT license.

License

ACL is open source and uses the MIT license. I am never keen on adding dependencies and if I really have to, I want a permissive license free of constraints.

Library License
Realtime Math MIT
MathFu Apache 2.0
vectorial BSD 2-clause
VectorialPlusPlus BSD 2-clause
CGLM MIT
GLM Modified MIT
ILMBase Custom but permissive
DirectX Math MIT
Unreal Engine 4 UE4 EULA

Header only

For simplicity and ease of integration, I want ACL to be entirely made of C++11 headers. This also constrains any dependencies to the same requirement.

Library Header Only
Realtime Math Yes
MathFu Yes
vectorial Yes
VectorialPlusPlus Yes
CGLM Yes (optional lib)
GLM Yes
ILMBase No
DirectX Math Yes
Unreal Engine 4 No

Verbosity, readability, and power

An important requirement for a math library is to be reasonably concise with average code without getting in the way if the need arises to dive right into raw intrinsics. In my experience, general math type abstractions take you very far but in order to squeeze out every cycle it is sometimes necessary to write custom per platform code. When this is required, it is important for the library to not hide its internals and leave the door open.

I am personally more a fan of C-style interfaces for a math library for various reasons: I can infer very well what happens under the hood (I have seen many libraries make fancy use of some operators that leave many newcomers to wonder what they do) and they are optimal for performance as we will discuss later. The downside of course is that they tend to be a bit more verbose. However, this largely boils down to a matter of personal taste.

vectorial is one of the few libraries that offers both a C-style interface and C++ wrappers and at the other end of the spectrum DirectX Math has both a namespace and a prefix for every type, constant and function.

Library Verbosity
Realtime Math Medium (C-style)
MathFu Light (C++ wrappers)
vectorial Light (C++ wrappers) and Medium (C-style)
VectorialPlusPlus Light (C++ wrappers)
CGLM Medium (C-style)
GLM Light (C++ wrappers)
ILMBase Light (C++ wrappers)
DirectX Math Medium++ (C-style with prefix and namespace)
Unreal Engine 4 Light (C++ wrappers)

It is very common for C-style math APIs to typedef their types to the underlying SIMD type. Realtime Math, DirectX Math, and many others do this. While this is great for performance, it does raise one problem: type safety is reduced. While usually those interfaces will opt to not expose proper vector2 or vector3 types and instead rely on functions that simply ignore the extra components, it doesn’t work so well when vector4 and quaternions are mixed. Only Realtime Math, DirectX Math and CGLM have quaternions with C-style interfaces but only the first two have a distinct type for quaternions when SIMD intrinsics are disabled. This somewhat mitigates the issue because with both Realtime Math and DirectX Math you can compile without intrinsics and still have type safety validated there. Although at the end of the day, all three have functions with distinct prefixes for vector and quaternion math and as such type safety is unlikely to be an issue.
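The type safety trade-off can be illustrated with a small sketch. Every name below is hypothetical: `simd_register` stands in for a platform type such as `__m128` or `float32x4_t`, and the two scalar structs mimic what a library might define when SIMD intrinsics are disabled.

```cpp
#include <type_traits>

// Stand-in for a platform SIMD register type such as __m128 or float32x4_t.
struct simd_register { float lanes[4]; };

// C-style APIs commonly alias every 4-wide type to the register type...
using vector4f = simd_register;
using quatf = simd_register;

// ...so a function written for vectors silently accepts a quaternion as well.
inline float vector_get_x(vector4f v) { return v.lanes[0]; }

// Without intrinsics, distinct structs restore compile-time checking.
struct scalar_vector4f { float x, y, z, w; };
struct scalar_quatf { float x, y, z, w; };

static_assert(std::is_same<vector4f, quatf>::value,
    "aliases collapse to a single type");
static_assert(!std::is_same<scalar_vector4f, scalar_quatf>::value,
    "distinct structs stay distinct");
```

Passing a `quatf` to `vector_get_x` compiles without complaint here, whereas a function taking `scalar_vector4f` would reject a `scalar_quatf`. Compiling once without intrinsics is therefore a cheap way to catch this class of mix-up.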

Type and feature support

By virtue of being an animation compression library, ACL’s needs are a bit different from a traditional realtime application. This dictated the need I had for specific types and features. I had no need for general 3x3 or 4x4 matrices as well as 2D vectors which are more commonly used in gameplay and rendering. However, 3x4 affine matrices, 3D and 4D vectors, quaternions, and QVV transforms (a quaternion, a vector3 translation, and a vector3 scale) are of critical importance. Those types are front and center in an animation runtime and I needed them to be fully featured and fast. Most of the libraries under review had way more features than I cared for (mostly for rendering) but generally missed proper or any support for quaternions and QVV transforms.

MathFu appears to have a bug where the Matrix 4x4 SIMD template specialization isn’t included by default and its quaternions are 32 bytes instead of the ideal 16 due to alignment constraints.

VectorialPlusPlus quaternions also take 32 bytes instead of 16 due to alignment constraints and most of their quaternion code appears to be scalar.

UE 4 is notable for being the only other library to support QVV and it does offer a VectorRegister type to support SIMD for Vector2/3/4 although most of the code written in the engine uses the scalar version.

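| Library | Vector2 | Vector3 | Vector4 | Quaternion | Matrix 3x3 | Matrix 4x4 | Matrix 3x4 | QVV |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Realtime Math | | SIMD | SIMD | SIMD | SIMD | SIMD | SIMD | SIMD |
| MathFu | SIMD | SIMD | SIMD | Partial SIMD | Scalar | SIMD | Scalar | |
| vectorial | | SIMD | SIMD | | | SIMD | | |
| VectorialPlusPlus | SIMD | SIMD | SIMD | Scalar | SIMD | SIMD | | |
| CGLM | | SIMD | SIMD | SIMD | SIMD | SIMD | SIMD | |
| GLM | SIMD | SIMD | SIMD | Partial SIMD | SIMD | SIMD | SIMD | |
| ILMBase | Scalar | Scalar | Scalar | Scalar | Scalar | Scalar | | |
| DirectX Math | SIMD | SIMD | SIMD | SIMD | | SIMD | | |
| Unreal Engine 4 | Scalar | Scalar | Scalar | SIMD | | SIMD | | SIMD |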

SIMD architecture support

Equally important was the SIMD architecture support. I want to run ACL everywhere with the best performance possible, especially on mobile. SSE, AVX, and NEON are all equally important to me.

It is worth noting that 2 years ago, DirectX Math NEON support appeared to be almost exclusively for Windows on ARM and I have no idea if it runs on iOS or Android even today.

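| Library | SSE | AVX | NEON |
| --- | --- | --- | --- |
| Realtime Math | Yes | Yes | Yes |
| MathFu | Yes | | Yes |
| vectorial | Yes | | Yes |
| VectorialPlusPlus | Yes | | Partial |
| CGLM | Yes | Yes | Partial |
| GLM | Yes | | |
| ILMBase | | | |
| DirectX Math | Yes | Yes | Yes |
| Unreal Engine 4 | Yes | Yes | Yes |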

Platform and compiler support

Here things are a bit more complicated as libraries will list platforms but not compilers or compilers but not platforms. I need ACL to run everywhere and this means limiting myself to C++11 features.

  • Realtime Math: Windows (VS2015, VS2017) x86 and x64, Linux (gcc5, gcc6, gcc7, gcc8, clang4, clang5, clang6) x86 and x64, OS X (Xcode 8.3, Xcode 9.4, Xcode 10.1) x86 and x64, Android clang ARMv7-A and ARM64, iOS (Xcode 8.3, Xcode 9.4, Xcode 10.1) ARM64
  • MathFu: Windows, Linux, OS X, Android
  • vectorial: Unlisted but probably Windows, Linux, OS X, Android, and iOS
  • VectorialPlusPlus: Unlisted but probably Windows
  • CGLM: Windows, Unix, and probably everywhere
  • GLM: VS2013+, Apple Clang 6, GCC 4.7+, ICC XE 2013+, LLVM 3.4+, CUDA 7+
  • ILMBase: Unlisted but probably Windows, Linux, OS X
  • DirectX Math: VS2015 and VS2017, possibly elsewhere
  • Unreal Engine 4: Windows (VS2015, VS2017) x64, Linux x64, OS X x64, Android ARMv7-A (no NEON) and ARM64, iOS ARM64

Continuous integration support

Continuous integration is a critical part of modern software development especially with C++ when multiple platforms are supported and maintained.

Library Continuous Integration
Realtime Math Yes
MathFu No
vectorial No
VectorialPlusPlus No
CGLM Yes
GLM Yes
ILMBase No
DirectX Math No
Unreal Engine 4 Not public

Dependencies

I’m not personally a big fan of pulling in tons of dependencies, especially for a math library. As mentioned earlier, the Unreal Engine 4 math library isn’t really usable on its own because of this but is included regardless.

| Library | Dependencies |
| --- | --- |
| Realtime Math | |
| MathFu | vectorial (BSD 2-clause) |
| vectorial | |
| VectorialPlusPlus | HandyCPP (custom license) |
| CGLM | |
| GLM | |
| ILMBase | |
| DirectX Math | |
| Unreal Engine 4 | Unreal Engine 4 |

Floating point support

When I got started with ACL, I wasn’t sure at the time if 64 bit floating point arithmetic might offer superior accuracy or not and if it would be worth using. As a result, I needed the math code to support both float32 and float64 types for everything with a seamless API between the two for quick testing. It later turned out that the extra floating point precision isn’t helping enough to be worth using.

| Library | Float 32 Support | Float 64 Support |
| --- | --- | --- |
| Realtime Math | Yes | Yes (partial SIMD) |
| MathFu | Yes | Yes (no SIMD) |
| vectorial | Yes | |
| VectorialPlusPlus | Yes | Yes (partial SIMD) |
| CGLM | Yes | |
| GLM | Yes | |
| ILMBase | Yes (no SIMD) | Yes (no SIMD) |
| DirectX Math | Yes | |
| Unreal Engine 4 | Yes | |

Inlinability

Due to the critical need for ACL to be as fast as possible on every platform, having the bulk of the math operations be inline is very important. Many things impact whether a function is inlined by the compiler but two stand out:

  • Simple and short functions inline better
  • Passing arguments by register needs fewer instructions which inlines better

Thankfully, most math functions are fairly simple and short: add, mul, div, etc. C-style functions will generally have a slight advantage over C++ wrappers, mainly because the wrappers must also track the implicit this pointer being passed around even if ultimately it is optimized out inside the caller. When the compiler needs to determine if it can inline a function, it uses a heuristic and the size of the intermediate assembly/IR/AST most likely plays a role. Generally speaking, C++ wrapper functions that are short will inline just fine but some operations have a harder time due to their size: matrix 4x4 multiplication, quaternion multiplication, and quaternion interpolation. For this reason, I personally favor a C-style API for this sort of code.

The second point is not to be underestimated. Most of the libraries in the list either take the arguments by value or by const reference. While passing SIMD types by value does the right thing on ARM and passes them by register (up to 4), it does not work for aggregate types like matrices and it does not work with the default x64 calling convention with MSVC. In order to be able to pass SIMD types by register with MSVC, you must use its __vectorcall calling convention. It also works for aggregate and wrapper types. Up to 6 registers can be used for this. On desktop and Xbox One, using __vectorcall is critical for high performance code and sadly, most libraries do not support it explicitly (and not all support it implicitly if the whole compilation unit is forced to use that calling convention). With Visual Studio 2015, __vectorcall is the difference between having quaternion interpolation getting inlined or not. When I added support for it in ACL, I measured a roughly 5% speedup during the decompression.

Note that once a function is inlined, whether the arguments are passed by register or not typically does not impact the generated assembly although it sometimes does (at least with MSVC especially when AVX is enabled).

Some libraries which use a generic vector template class with specializations for SIMD (like MathFu) sometimes end up passing float32 arguments by const-reference instead of by value, which is often suboptimal when not inlined.
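To illustrate what explicit support for register passing can look like, here is a minimal sketch. The `SKETCH_SIMD_CALL` and `SKETCH_HAS_SSE2` macros and the function names are hypothetical and chosen for this example only; the macro expands to `__vectorcall` with MSVC-compatible compilers on x86/x64 and to nothing elsewhere, where passing by value is assumed to already do the right thing.

```cpp
// Hypothetical macro: only MSVC-compatible compilers on x86/x64 need the
// explicit calling convention to pass SIMD values by register.
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    #define SKETCH_SIMD_CALL __vectorcall
#else
    #define SKETCH_SIMD_CALL
#endif

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    #include <emmintrin.h>
    #define SKETCH_HAS_SSE2
#endif

#if defined(SKETCH_HAS_SSE2)
// C-style API: the vector type is a plain alias of the SSE register type.
using vector4f = __m128;

inline vector4f SKETCH_SIMD_CALL vector_set(float x, float y, float z, float w)
{
    return _mm_set_ps(w, z, y, x);  // _mm_set_ps takes its lanes in reverse order
}

// Arguments are taken by value so that they can travel in registers.
inline vector4f SKETCH_SIMD_CALL vector_add(vector4f lhs, vector4f rhs)
{
    return _mm_add_ps(lhs, rhs);
}

inline float SKETCH_SIMD_CALL vector_get_x(vector4f v)
{
    return _mm_cvtss_f32(v);
}
#else
// Scalar fallback with the same interface for platforms without SSE2.
struct vector4f { float x, y, z, w; };

inline vector4f vector_set(float x, float y, float z, float w)
{
    return vector4f{ x, y, z, w };
}

inline vector4f vector_add(vector4f lhs, vector4f rhs)
{
    return vector4f{ lhs.x + rhs.x, lhs.y + rhs.y, lhs.z + rhs.z, lhs.w + rhs.w };
}

inline float vector_get_x(vector4f v)
{
    return v.x;
}
#endif
```

Client code stays identical on every platform, e.g. `vector_get_x(vector_add(a, b))`, while the calling convention is handled in one place.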

| Library | Inlinability | Register Passing |
| --- | --- | --- |
| Realtime Math | Optimal (C-style + by register) | Explicit (everywhere) |
| MathFu | Decent (C++ wrappers) | None |
| vectorial | Good (C-style), Decent (C++ wrappers) | Implicit (C-style and ARM only) |
| VectorialPlusPlus | Decent (C++ wrappers) | None |
| CGLM | Good (C-style) | None |
| GLM | Decent (C++ wrappers) | None |
| ILMBase | Decent (C++ wrappers) | None |
| DirectX Math | Optimal (C-style + by register) | Explicit (vectorcall and ARM only) |
| Unreal Engine 4 | Decent (C++ wrappers) | None |

Multiplication order

An important point of contention is how things are multiplied. As the list below shows, the OpenGL way is by far the most popular for open source math libraries.

It all boils down to whether vectors are represented as a row or as a column. In the former case, multiplication with a matrix takes the form v' = vM while in the latter case we have v' = Mv. Linear algebra typically treats vectors as columns and OpenGL opted to use that convention for that reason. If you think of matrices as functions that modify an input and return an output it ends up reading like this: result = object_to_world(local_to_object(input)). This reads right-to-left as is common with nested function evaluation. In my opinion, this is quite awkward to work with as most modern programming languages (and western languages) read left-to-right. Most linear algebra formulas use abstract letters and names for things which somewhat hides this nuance but when I write code, I try to keep my matrix names as clear as possible: what space are the input and output in. While you could technically reverse the naming result = world_from_object * object_from_local * input so it at least reads decently right-to-left, it’s still harder to reason with because just about everything we work with in the world goes from somewhere to somewhere else and not the other way around: trains, buses, planes, Monday to Friday, 5@7, etc.

On the other hand, DirectX uses row vectors and ends up with the much more natural: result = input * local_to_object * object_to_world. Your input is in local space, it gets transformed into object space before finally ending up in world space. Clean, clear, and readable. If you instead multiply the two matrices together on their own, you get the clear local_to_world = local_to_object * object_to_world instead of the awkward local_to_world = object_to_world * local_to_object you would get with OpenGL and column vectors.
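The two conventions can be compared side by side with a small 2x2 sketch. The types and function names are made up for this example and do not come from any of the reviewed libraries. The same transform expressed for column vectors is the transpose of the row vector version, which is why the multiplication order flips.

```cpp
struct vec2 { float x, y; };
struct mat2 { float m[2][2]; };  // m[row][column]

// Row vector convention (DirectX style): v' = v * M
inline vec2 mul_row(const vec2& v, const mat2& m)
{
    return vec2{ v.x * m.m[0][0] + v.y * m.m[1][0], v.x * m.m[0][1] + v.y * m.m[1][1] };
}

// Column vector convention (OpenGL style): v' = M * v
inline vec2 mul_col(const mat2& m, const vec2& v)
{
    return vec2{ m.m[0][0] * v.x + m.m[0][1] * v.y, m.m[1][0] * v.x + m.m[1][1] * v.y };
}

// Matrix product a * b.
inline mat2 mul(const mat2& a, const mat2& b)
{
    mat2 r;
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            r.m[i][j] = a.m[i][0] * b.m[0][j] + a.m[i][1] * b.m[1][j];
    return r;
}

inline mat2 transpose(const mat2& m)
{
    mat2 r;
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            r.m[i][j] = m.m[j][i];
    return r;
}
```

With row vectors, `mul_row(mul_row(input, local_to_object), object_to_world)` reads left-to-right and equals `mul_row(input, mul(local_to_object, object_to_world))`. With column vectors and the transposed matrices, the same result is obtained with `mul_col(object_to_world_t, mul_col(local_to_object_t, input))`, which reads right-to-left.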

At the end of the day, which way you choose largely boils down to a personal choice (or whatever library you use for rendering) as I don’t think there’s a big performance difference between the two on modern hardware. For ACL, all its output data is in local space and although we evaluate the error in world space internally, this is entirely transparent to the client application and it is free to use either convention.

Library Multiplication Style
Realtime Math DirectX
MathFu OpenGL
vectorial OpenGL
VectorialPlusPlus OpenGL
CGLM OpenGL
GLM OpenGL
ILMBase OpenGL
DirectX Math DirectX
Unreal Engine 4 DirectX

Conclusion

Ultimately, which math library you choose for a particular project boils down to a matter of personal preference to a large extent. For the vast majority of the code you’ll write, the performance and code generation is likely to be very close if not identical. Two years ago, I knew regardless of which option I picked I would have to do a lot of work to add what was missing. This greatly motivated me to just start from scratch as many middleware do and I do not regret the experience or results.

My top two favorite libraries are Realtime Math and DirectX Math. Both are quite similar today although DirectX Math wasn’t quite as attractive when I started.

Next steps

Over the next few days I will populate various issues on GitHub to document things that are missing or that could benefit from some love.

A core part that is partially missing at the moment is the quantization and packing logic that ACL already contains. I have not migrated that code yet in large part because I am not sure how to best expose it in a clean and consistent API. I do believe it belongs in RTM where everyone can benefit from it.

ACL does not yet use RTM but that migration is planned for ACL v2.0.

Smaller, faster: ACL lets you cut your animation costs in half

I am excited to announce the open source Animation Compression Library has reached v1.1 along with an updated Unreal Engine 4 plugin v0.2.

ACL now beats Unreal Engine 4 on all the important metrics. The plugin is 1.7x smaller, 3x more accurate, 2.5x faster to compress, and up to 4.8x faster to decompress!

What’s new

The latest release focused on decompression performance. ACL v1.1 is about 30% faster than the previous version on every platform when decompressing a whole pose. Decompressing a single bone, a common operation in Unreal, is now about 3-4x faster. Also, ARM-based products will now use NEON SIMD acceleration when available.

The UE4 plugin was reworked to be more tightly integrated and is about twice as fast compared to the previous version.

ACL has now reached a point where I can confidently say that it is the best overall animation compression algorithm in the video games industry. While other techniques might beat ACL on some of these metrics, beating it simultaneously on speed, size, and accuracy will prove to be very challenging. In particular, unlike other algorithms that offer very fast decompression, ACL has no extra runtime memory cost beyond the compressed clip data.

New data!

One year ago, Epic generously agreed to let me use the Paragon animations for research purposes. This helped me find and fix bugs in Unreal Engine and ACL, and see how well both animation compression approaches perform in a real game. Paragon also allows each release to be rigorously tested against a large, relevant, and varied data set.

I am excited to announce that Epic is allowing me to use Fortnite to further my research as well! While Paragon will continue to play its role in tracking compression performance and regression testing, Fortnite will allow me to measure decompression performance in real world scenarios much more easily. Testing with Fortnite should highlight new ways ACL can be improved further.

What’s next

I am shifting my focus to add animation compression plugin support to UE4 during the next few months. If everything goes well, when UE 4.22 is released next year, I will be able to add the ACL plugin to the Unreal Engine Marketplace for everyone to use, for free.

Proper plugin support will remove overhead and help make ACL’s in-game decompression faster still.

Due to the rigorous testing and extensive statistics extraction every release now requires, I expect the release cycle to slow down. I will aim to perform non-bug fix releases about twice a year.

Compression performance overview

Here is a quick glance at how well it performs on the animations from Paragon:

Metric | ACL Plugin v0.2.0 | UE v4.19.2
Compressed size | 226.09 MB | 392.97 MB
Compression ratio | 18.91 : 1 | 10.88 : 1
Compression time | 6h 4m 18.21s | 15h 10m 23.56s
Bone Error 99th percentile | 0.0116 cm | 0.0328 cm
Samples below 0.01 cm error threshold | 98.85 % | 84.88 %
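
The headline factors quoted for this release (1.7x smaller, 3x more accurate, 2.5x faster to compress) can be re-derived from the Paragon numbers above. The small sketch below does that arithmetic; the helper names are hypothetical and only exist for this illustration. With these inputs the factors come out to about 1.74 for size, about 2.83 for the 99th percentile error, and about 2.50 for compression time.

```cpp
#include <cassert>

// Converts an "Xh Ym Z.ZZs" duration into seconds.
double to_seconds(int hours, int minutes, double seconds) {
    assert(hours >= 0 && minutes >= 0 && seconds >= 0.0);
    return hours * 3600.0 + minutes * 60.0 + seconds;
}

// How many times larger `baseline` is than `candidate`.
// For size, error, and time, a value above 1 means the candidate is better.
double improvement_factor(double baseline, double candidate) {
    assert(candidate > 0.0);
    return baseline / candidate;
}

// Paragon numbers from the table: UE v4.19.2 is the baseline, the ACL plugin the candidate.
double paragon_size_factor() { return improvement_factor(392.97, 226.09); }
double paragon_error_factor() { return improvement_factor(0.0328, 0.0116); }
double paragon_time_factor() {
    return improvement_factor(to_seconds(15, 10, 23.56), to_seconds(6, 4, 18.21));
}
```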

The 99th percentile and the number of samples below the 0.01 cm error threshold are calculated by measuring the world-space error of every bone at every sample in each of the 6558 animation clips. To put this into perspective, over 99 % of the compressed data has an error lower than the width of a human hair. More details on how the error is measured can be found here.
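
As a rough sketch of how these two statistics can be computed once every per-bone, per-sample error has been gathered into one list: the function names are hypothetical, and the nearest-rank percentile definition is my assumption here since several definitions exist. This is not the code ACL or the plugin actually runs.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Nearest-rank percentile (assumed definition): the smallest sample such that
// at least `percent` % of all samples are less than or equal to it.
double percentile_nearest_rank(std::vector<double> errors, double percent) {
    assert(!errors.empty());
    assert(percent > 0.0 && percent <= 100.0);
    std::sort(errors.begin(), errors.end());
    // Multiply before dividing so integral percentages stay exact.
    const double exact_rank = std::ceil(percent * static_cast<double>(errors.size()) / 100.0);
    const std::size_t rank = std::max<std::size_t>(static_cast<std::size_t>(exact_rank), 1);
    return errors[rank - 1];
}

// Fraction (0.0 to 1.0) of samples whose error is strictly below `threshold`.
double fraction_below(const std::vector<double>& errors, double threshold) {
    assert(!errors.empty());
    const auto count = std::count_if(errors.begin(), errors.end(),
                                     [threshold](double e) { return e < threshold; });
    return static_cast<double>(count) / static_cast<double>(errors.size());
}
```

Calling percentile_nearest_rank(errors, 99.0) and fraction_below(errors, 0.01) on errors expressed in centimeters would yield values comparable to the last two rows of the table, assuming the same percentile definition.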

Decompression performance overview

Decompression performance is currently tracked with the Matinee fight scene. The troopers have around 70 bones each while the main trooper has 541.

[Figure: Matinee S8 Median Performance]

Much care was taken to ensure that ACL has consistent decompression performance. The following two images show the time taken to decompress a pose at every point of the Matinee fight scene which highlights how regular ACL is.

[Figures: Matinee UE4 S8 Performance Variance; Matinee ACL S8 Performance Variance]

It also has consistent decompression performance regardless of the playback direction and it works on every modern platform making it a safe choice when using it as the default algorithm in your games.

Overall, ACL is ideal for games with large amounts of animations playing concurrently such as those with large crowds, MMOs, and e-sports as well as those that run on mobile or slower platforms.

Animation Compression Library: Release 1.0.0

The long awaited ACL v1.0 release is finally here! And it comes with the brand new Unreal Engine 4 plugin v0.1! It took over 15 months of late nights, days off, and weekends to reach this point and I couldn’t be more pleased with the results.

Recap

The core idea behind ACL was to explore a different way to perform animation compression, one that departed from classic methods. Unlike the vast majority of algorithms in the wild, it uses bit aligned values as opposed to naturally aligned integers. This is slower to unpack but I hoped to compensate by not performing any sort of key reduction. By retaining every sample, the data is uniform in memory and offsets are trivially calculated, keeping things fast, the memory touched contiguous, and the hardware happy. While the technique itself isn’t novel and is often used with compression algorithms in other fields, to my knowledge it had never been tried to the extent ACL pushes it with animation compression, at least not publicly.
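
To illustrate what bit aligned values with uniform samples look like in practice, here is a minimal sketch. It is not ACL's actual implementation: the function names and the MSB-first layout are assumptions made for this example, and it moves one bit at a time for clarity where real code would use wider loads and shifts. The point is that when every sample is retained and stored with the same number of bits, finding a sample is a single multiplication.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Writes the low `num_bits` of `value` starting at `bit_offset`, most significant bit first.
void write_bits(std::vector<std::uint8_t>& buffer, std::size_t bit_offset,
                std::uint32_t value, unsigned num_bits) {
    assert(num_bits >= 1 && num_bits <= 32);
    assert(bit_offset + num_bits <= buffer.size() * 8);
    for (unsigned i = 0; i < num_bits; ++i) {
        const std::size_t pos = bit_offset + i;
        const std::uint8_t mask = static_cast<std::uint8_t>(0x80u >> (pos % 8));
        const bool bit = ((value >> (num_bits - 1 - i)) & 1u) != 0;
        if (bit)
            buffer[pos / 8] = static_cast<std::uint8_t>(buffer[pos / 8] | mask);
        else
            buffer[pos / 8] = static_cast<std::uint8_t>(buffer[pos / 8] & ~mask);
    }
}

// Reads `num_bits` starting at `bit_offset`, most significant bit first.
std::uint32_t read_bits(const std::vector<std::uint8_t>& buffer, std::size_t bit_offset,
                        unsigned num_bits) {
    assert(num_bits >= 1 && num_bits <= 32);
    assert(bit_offset + num_bits <= buffer.size() * 8);
    std::uint32_t value = 0;
    for (unsigned i = 0; i < num_bits; ++i) {
        const std::size_t pos = bit_offset + i;
        const std::uint32_t bit = (buffer[pos / 8] >> (7 - pos % 8)) & 1u;
        value = (value << 1) | bit;
    }
    return value;
}

// With no key reduction and a fixed width per sample, the offset needs no search or lookup table.
std::size_t sample_bit_offset(std::size_t sample_index, unsigned bits_per_sample) {
    return sample_index * bits_per_sample;
}
```

For example, with 11 bits per sample, sample 3 starts at bit 33, in the middle of a byte, which is exactly the kind of unaligned access that naturally aligned integers avoid at the cost of wasted bits.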

Very early, the technique proved competitive and over time it emerged as a superior alternative over traditional techniques involving key reduction. I then spent about 8 months writing the necessary infrastructure to make ACL not only production ready but production quality: unit tests were written, extensive regression tests were introduced, documentation was added as well as comments, scripts to replicate the results, cross platform support (ACL now runs on every platform!), etc. All that good stuff that one would expect from a professional product.

But don’t take my word for it! Check out the 100% C++ code (MIT license), the statistics below, and take the plugin out for a spin!

Performance

While ACL provides various synthetic test harnesses to benchmark and extract statistics, nothing beats running it within a real game engine. This is where the UE4 plugin comes in and really shines. Just as with ACL, three data sets are measured: CMU, Paragon, and the Matinee fight scene.

Note that there are small differences between measuring with the UE4 plugin and with the ACL test harnesses due to implementation choices in the plugin.

Carnegie-Mellon University (CMU)

Metric | ACL Plugin v0.1.0 | UE v4.19.2
Compressed size | 70.60 MB | 99.94 MB
Compression ratio | 20.25 : 1 | 14.30 : 1
Max error | 0.0722 cm | 0.0996 cm
Compression time | 34m 30.51s | 1h 27m 40.15s

ACL was smaller for 2532 clips (99.92 %)
ACL was more accurate for 2486 clips (98.11 %)
ACL has faster compression for 2534 clips (100.00 %)
ACL was smaller, better, and faster for 2484 clips (98.03 %)

Had the ACL Plugin been included in the Automatic Compression permutations tried, it would have won for 2534 clips (100.00 %)

Data tracked here by the plugin, and here by ACL.

Paragon

Metric | ACL Plugin v0.1.0 | UE v4.19.2
Compressed size | 226.02 MB | 392.97 MB
Compression ratio | 18.92 : 1 | 10.88 : 1
Max error | 0.8566 cm | 0.6424 cm
Compression time | 6h 35m 03.24s | 15h 10m 23.56s

ACL was smaller for 6413 clips (97.79 %)
ACL was more accurate for 4972 clips (75.82 %)
ACL has faster compression for 5948 clips (90.70 %)
ACL was smaller, better, and faster for 4499 clips (68.60 %)

Had the ACL Plugin been included in the Automatic Compression permutations tried, it would have won for 6098 clips (92.99 %)

Data tracked here by the plugin, and here by ACL.

Matinee fight scene

Metric | ACL Plugin v0.1.0 | UE v4.19.2
Compressed size | 8.67 MB | 23.67 MB
Compression ratio | 7.20 : 1 | 2.63 : 1
Max error | 0.0674 cm | 0.0672 cm
Compression time | 52.44s | 54m 03.18s

ACL was smaller for 1 clip (20 %)
ACL was more accurate for 4 clips (80 %)
ACL has faster compression for 5 clips (100 %)
ACL was smaller, better, and faster for 0 clips (0 %)

Had the ACL Plugin been included in the Automatic Compression permutations tried, it would have won for 3 clips (60 %)

Data tracked here by the plugin, and here by ACL.

Decompression performance

[Figure: Matinee S8 Median Performance]

[Figure: Playground S8 Median Performance]

Data tracked here by the plugin, and here by ACL (they also include other platforms and more data).

Performance summary

As the numbers clearly show, ACL beats UE4 across every compression metric, sometimes by a significant margin: it is MUCH faster to compress, the quality is just as good, and the memory footprint is significantly reduced. ACL achieves all of this with default settings that animators rarely if ever need to tweak. What’s not to love?

However, the ACL decompression performance is sometimes ahead, sometimes behind, or the same. There are a few reasons for this, most of which I am hoping to fix in the next version to take the lead: NEON (SIMD) is not yet used on ARM, the ACL plugin needlessly performs MUCH more work than UE4 when decompressing, and much low-hanging fruit was left to be picked post-1.0 release.

ACL is just getting started!

How to use the ACL Plugin

As the documentation states here, a few minor engine changes are required in order to support the ACL plugin. These changes mostly consist of bug fixes and changes to expose the necessary hooks to plugins.

For the time being, the plugin is not yet on the marketplace as it is not fully plug-and-play. However, this summer I am working with Epic to introduce the necessary changes in order to publish the ACL plugin on the marketplace. Stay tuned!

Note that the ACL Plugin will reach v1.0 once it can be published on the marketplace but it is production ready regardless.

What’s new in ACL v1.0

Few things actually changed in between v0.8 and v1.0. Most of the changes revolved around minor additions, documentation updates, etc. There are two notable changes:

  • The first is visible in the decompression graphs: we now yield the thread before measuring every sample. This helps ensure more stable results by reducing the likelihood that the kernel will swap out the thread and interrupt it while executing the decompression code.
  • The second is visible in the compression stats for Paragon: a bug was causing the visible error to sometimes be partially hidden when 3D scale is present. While the new version is not less accurate than the previous, the measured error can be higher in very rare cases (only 1 clip is higher).

Regardless, the measuring should now be much more stable.
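
The idea of yielding before each measurement can be sketched as follows. This is a simplified illustration with hypothetical function names, built only on the standard library, and not the actual ACL profiling code. The thread gives up the remainder of its time slice right before the timed region starts, which makes it less likely that the scheduler interrupts it in the middle of a measurement.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Times `work` once per sample, yielding the thread before every measurement.
// Returns the elapsed time of each sample in microseconds.
template <typename Work>
std::vector<double> measure_samples_us(Work&& work, std::size_t num_samples) {
    assert(num_samples > 0);
    std::vector<double> samples;
    samples.reserve(num_samples);
    for (std::size_t i = 0; i < num_samples; ++i) {
        // Hint to the scheduler that now is a good time to run something else,
        // so that we are less likely to be swapped out while timing.
        std::this_thread::yield();

        const auto start = std::chrono::steady_clock::now();
        work();
        const auto end = std::chrono::steady_clock::now();

        samples.push_back(std::chrono::duration<double, std::micro>(end - start).count());
    }
    return samples;
}

// Median of the samples (the upper middle element when the count is even).
double median(std::vector<double> samples) {
    assert(!samples.empty());
    const auto mid = samples.begin() + static_cast<std::ptrdiff_t>(samples.size() / 2);
    std::nth_element(samples.begin(), mid, samples.end());
    return *mid;
}
```

Reporting the median of many such samples, rather than the mean, further reduces the influence of the occasional interrupted measurement.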

What’s next

The next release of ACL will focus on improving the compression and decompression performance. While ACL was built from the ground up to be fast to decompress, so far the focus has been on making sure things function properly and safely to establish a solid baseline to work with. Now that this work is done, the fun part can begin: making it the best it can be! I have many improvements planned and while some of them will make it in v1.1, others will have to wait for future versions.

Special care will be taken to make sure ACL performs at its best in UE4 but there is no reason why it couldn’t be used in your own favorite game engine or animation middleware. Developing with UE4 is easier for me in large part because of my past experience with it, my relationship with Epic, and the fact that it is open source. Other game engines like Unity explicitly forbid their use for benchmarking purposes in their EULA, which prevents me from publishing any results without prior written agreement from their legal department. Furthermore, without access to the source code, creating a plugin for it requires a lot more work. In due time, I hope to support Unity, Godot, and anyone else willing to try it out.