Comparing past and present Unreal Engine 4 animation compression

Ever since the Animation Compression Library UE4 Plugin was released in July 2018, I have been comparing and tracking its progress against Unreal Engine 4.19.2. For the sake of consistency, even when newer UE versions came out, I did not integrate ACL and measure from scratch. This was convenient and practical since animation compression doesn’t change very often in UE and the numbers remain fairly close over time. However, in UE 4.21 significant changes were made and comparing against an earlier release was no longer a fair and accurate comparison.

As such, I have updated my baseline to UE 4.22.2 and am releasing a new version of the plugin to go along with it: v0.4. This new plugin release does not bring significant improvements on the previous one (they both still use ACL v1.2) but it does bring the necessary changes to integrate cleanly with the newer UE API. One of the most notable changes in this release is the introduction of git patches for the custom engine changes required to support the ACL plugin. Note that earlier engine versions are not officially supported although if there is interest, feel free to reach out to me.

One benefit of the thorough measurements that I regularly perform is that not only can I track ACL’s progress over time but I can also do the same with Unreal Engine. Today we’ll talk about Unreal a bit more.

What changed in UE 4.21

Two years ago Epic asked me to improve their own animation compression. The primary focus was on improving the compression speed of their automatic compression codec while maintaining the existing memory footprint and decompression speed. Improvements to the other metrics was considered a stretch goal.

The automatic compression codec in UE 4.20 and earlier tried 35+ codec permutations and picked the best overall. Understandably, this could be quite slow in many cases.

To speed it up, two important changes were made:

  • The codecs tried were distilled into a white list of the 11 most commonly used codecs
  • The codecs are now evaluated in parallel

This brought a significant win. As mentioned at the Games Developer Conference 2019, the introduction of the white list brought the time to compress Fortnite (while cooking for the Xbox One) from 6h 25mins down to 1h 50mins, a speed up of 3.5x. Executing them in parallel lowered this even more, down to 40mins or about 2.75x faster. Overall, it ended up 9.6x faster.

While none of this impacted the codecs or the API in any way, UE 4.21 also saw significant changes to the codec API. The reason for this will be the subject of my next blog post: failed optimization attempts and the lessons learned from them. In particular, I will show why removing samples that can be reconstructed through interpolation has significant drawbacks. Sometimes despite your best intentions and research, things just don’t pan out.

The juicy numbers

For the GDC presentation we only measured a few metrics and only while cooking Fortnite. However, every release I use a much more extensive set of animations to track progress and many more metrics. Here are all the numbers.

Note: The ACL plugin v0.3 numbers are from its integration in UE 4.19.2 while v0.4 is with UE 4.22.2. Because the Unreal codecs didn’t change, the decompression performance remained roughly the same. The ACL plugin accuracy numbers are slightly different due to fixes in the commandlet used to extract them. The compression speedup for the ACL plugin largely comes from the switch to Visual Studio 2017 that came with UE 4.20.

Carnegie-Mellon University database performance

For details about this data set and the metrics used, see here.

  ACL Plugin v0.4.0 ACL Plugin v0.3.0 UE v4.22.2 UE v4.19.2
Compressed size 74.42 MB 74.42 MB 100.15 MB 99.94 MB
Compression ratio 19.21 : 1 19.21 : 1 14.27 : 1 14.30 : 1
Compression time 5m 10s 6m 24s 11m 11s 1h 27m 40s
Compression speed 4712 KB/sec 3805 KB/sec 2180 KB/sec 278 KB/sec
Max ACL error 0.0968 cm 0.0702 cm 0.1675 cm 0.1520 cm
Max UE4 error 0.0816 cm 0.0816 cm 0.0995 cm 0.0996 cm
ACL Error 99th percentile 0.0089 cm 0.0088 cm 0.0304 cm 0.0271 cm
Samples below ACL error threshold 99.90 % 99.93 % 47.81 % 49.34 %

Paragon database performance

For details about this data set and the metrics used, see here.

  ACL Plugin v0.4.0 ACL Plugin v0.3.0 UE v4.22.2 UE v4.19.2
Compressed size 234.76 MB 234.76 MB 380.37 MB 392.97 MB
Compression ratio 18.22 : 1 18.22 : 1 11.24 : 1 10.88 : 1
Compression time 23m 58s 30m 14s 2h 5m 11s 15h 10m 23s
Compression speed 3043 KB/sec 2412 KB/sec 582 KB/sec 80 KB/sec
Max ACL error 0.8623 cm 0.8623 cm 0.8619 cm 0.8619 cm
Max UE4 error 0.8601 cm 0.8601 cm 0.6424 cm 0.6424 cm
ACL Error 99th percentile 0.0100 cm 0.0094 cm 0.0438 cm 0.0328 cm
Samples below ACL error threshold 99.00 % 99.19 % 81.75 % 84.88 %

Matinee fight scene performance

For details about this data set and the metrics used, see here.

  ACL Plugin v0.4.0 ACL Plugin v0.3.0 UE v4.22.2 UE v4.19.2
Compressed size 8.77 MB 8.77 MB 23.67 MB 23.67 MB
Compression ratio 7.11 : 1 7.11 : 1 2.63 : 1 2.63 : 1
Compression time 16s 20s 9m 36s 54m 3s
Compression speed 3790 KB/sec 3110 KB/sec 110 KB/sec 19 KB/sec
Max ACL error 0.0588 cm 0.0641 cm 0.0426 cm 0.0671 cm
Max UE4 error 0.0617 cm 0.0617 cm 0.0672 cm 0.0672 cm
ACL Error 99th percentile 0.0382 cm 0.0382 cm 0.0161 cm 0.0161 cm
Samples below ACL error threshold 94.61 % 94.52 % 94.23 % 94.22 %

Conclusion and what comes next

Overall, the new parallel white list is a clear winner. It is dramatically faster to compress and none of the other metrics measurably suffered. However, despite this massive improvement ACL remains much faster.

For the past few months I have been working with Epic to refactor the engine side of things to ensure that animation compression plugins are natively supported by the engine. This effort is ongoing and will hopefully land soon in an Unreal Engine near you.

Faster arithmetic by flipping signs

Over the years, I picked up a number of tricks to optimize code. Today I’ll talk about one of them.

I first picked it up a few years ago when I was tasked with optimizing the cloth simulation code in Shadow of the Tomb Raider. It had been fine tuned extensively with PowerPC intrinsics for the Xbox 360 but its performance was lacking on XboxOne (x64). While I hoped to talk about it 2 years ago, I ended up side tracked with the Animation Compression Library (ACL). However, last week while introducing ARM64 fused muptiply-add support into ACL and the Realtime Math (RTM) library, I noticed an opportunity to use this trick when linearly interpolating quaternions and it all came back to me.

TL;DR: Flipping the sign of some arithmetic operations can lead to shorter and faster assembly.

Flipping for fun and profit on x64

To understand how this works, we need to give a bit of context first.

On PowerPC, ARM, and many other platforms, before an instruction can use a value it must first be loaded from memory into a register explicitly with a load type instruction. This is not always the case with x86 and x64 where many instructions can take either a register or a memory address as input. In the later case, the load still happens behind the scenes and while it isn’t really much faster by itself it does have a few benefits.

Not having an explicit load instruction means that we do not use one of the precious named registers. While under the hood the CPU has tons of registers (e.g modern processors have 50+ XMM registers), the instructions can only reference a few of them: only 16 named XMM registers can be referenced. This can be very important if a piece of code places considerable pressure on the amount of registers it needs. Fewer named registers used means a reduced likelihood that registers have to spill on the stack and spilling introduces quite a few instructions as well. Altogether, removing that single load instruction can considerably improve the chances of a function getting inlined.

Fewer instructions also means a lower code cache footprint and better overall performance although in practice, I have found this impact very hard to measure.

Two things are important to keep in mind:

  • An instruction that operates directly from memory will be slower than the same instruction working from a register: it has to perform the load operation and even if the value resides in the CPU cache, it still takes time.
  • Most arithmetic instructions that take two inputs only support having one of them come from memory: addition, subtraction, multiplication, etc. Typically, only the second input (on the right) can come from memory.

To illustrate how flipping the sign of arithmetic operations can lead to improved code generation, we will use the following example:

Both MSVC 19.20 and GCC 9 generate the above assembly. The 1.0f constant must be loaded from memory because subss only supports the second argument coming from memory. Interestingly, Clang 8 figures out that it can use the sign flip trick all on its own:

Because we multiply our input value by a constant, we are in control of its sign and we can leverage that fact to change our subss instruction into an addss instruction that can work with another constant from memory directly. Both functions are mathematically equivalent and their results are identical down to every last bit.

Short and sweet!

Not every mainstream compiler today is smart enough to do this on its own especially in more complex cases where other instructions will end up in between the two sign flips. Doing it by hand ensures that they have all the tools to do it right. Should the compiler think that performance will be better by loading the constant into a register anyway, it will also be able to do so (for example if the function is inlined in a loop).

But wait, there’s more!

While that trick is very handy on x64, it can also be used in a different context and I found a good example on ARM64: when linearly interpolating quaternions.

Typically linear interpolation isn’t recommended with quaternions as it is not guaranteed to give very accurate results if the two quaternions are far apart. However, in the context of animation compression, successive quaternions are often very close to one another and linear interpolation works just fine. Here is what the function looked like before I used the trick:

The code is fairly simple:

  • We calculate a bias and if both quaternions are on opposite sides of the hypersphere (negative dot product), we apply a bias to flip one of them to make sure that they lie on the same side. This guarantees that the path taken when interpolating will be the shortest.
  • We linearly interpolate: (end - start) * alpha + start
  • Last but not least, we normalize our quaternion to make sure that it represents a valid 3D rotation.

When I introduced the fused multiply-add support to ARM64, I looked at the above code and its generated assembly and noticed that we had a multiplication instruction followed by a subtraction instruction before our final fused multiply-add instruction. Can we do better?

While FMA3 has a myriad of instructions for all sorts of fused multiply-add variations, ARM64 does not: it only has 2 such instructions: fmla ((a * b) + c) and fmls (-(a * b) + c).

Here is the interpolation broken down a bit more clearly:

  • x = (end * bias) - start (fmul, fsub)
  • result = (x * alpha) + start (fmla)

After a bit of thinking, it becomes obvious what the solution is:

  • -x = -(end * bias) + start (fmls)
  • result = -(-x * alpha) + start (fmls)

By flipping the sign of our intermediate value with the fmls instruction, we can flip it again and cancel it out by using it once more while removing an instruction in the process. This simple change resulted in a 2.5 % speedup during animation decompression (which also does a bunch of other stuff) on my iPad.

Note: because fmls reuses one of the input registers for its output, a new mov instruction was added in the final inlined decompression code but it executes much faster than a floating point instruction.

You can find the change in RTM here.

The Animation Compression Library just got even faster

Slowly but surely, the Animation Compression Library has now reached v1.2 along with an updated v0.3 Unreal Engine 4 plugin. The most notable changes in this release are as follow:

  • More compilers and architectures added to continuous integration
  • Accuracy bug fixes
  • Floating point sample rate support
  • Dramatically faster compression through the introduction of a compression level setting

TL;DR: Compared to UE 4.19.2, the ACL plugin compresses up to 1.7x smaller, is up to 3x more accurate, up to 158x faster to compress, and up to 7.5x faster to decompress (results may vary depending on the platform and data).

Note that UE 4.21 introduced changes that significantly sped up the compression with its Automatic Compression codec but I haven’t had the time to setup a new branch with it to measure.

UE 4 plugin support and progress

Now that ACL properly supports a floating point sample rate, the UE4 plugin has reached feature parity with the stock codecs.

As announced at the GDC 2019, work is ongoing to refactor the Unreal Engine to natively support animation compression plugins and is currently on track to land with UE 4.23. Once it does, the plugin will be updated once more, finally reaching v1.0 on the Unreal marketplace for free.

Lighting fast compression

One of the most common feedback I received from those that use ACL in the wild (both within UE4 and outside) was the desire for faster compression. The optimization algorithm is very aggressive and despite its impressive performance overall (as highlighted in prior releases), some clips with deep bone hierarchies could take a very long time to compress, prohibitively so.

In order to address this, a new compression level was introduced in the compression settings to better control how much time should be spent attempting to find an optimal bit rate. Higher levels take more time but yield a lower memory footprint. A total of five levels were introduced but the lowest three currently behave the same for now: Lowest, Low, Medium, High, Highest. The Highest level corresponds to what prior releases did by default. After carefully reviewing the impact of each level, a decision was made to make the default level be Medium instead. This translates in dramatically faster compression, identical accuracy, with a very small and acceptable increase in memory footprint. This should provide for a much better experience for animators during production. Once the game is ready to be released, the animations can easily and safely be recompressed with the Highest setting in order to squeeze out every byte.

In order to extract the following results, I compressed the Carnegie-Mellon University motion capture database, Paragon, and Fortnite in parallel with 4 threads using ACL standalone. Numbers in parenthesis represent the delta again Highest.

Compressed Size Highest High Medium
CMU 67.05 MB 68.85 MB (+2.7%) 71.01 MB (+5.9%)
Paragon 206.87 MB 211.81 MB (+2.4%) 218.58 MB (+5.7%)
Fortnite 491.79 MB 497.60 MB (+1.2%) 507.11 MB (+3.1%)
Compression Time Highest High Medium
CMU 24m 57.59s 11m 51.48s 6m 20.89s
Paragon 4h 55m 42.57s 1h 19m 36.01s 29m 21.65s
Fortnite 8h 13m 1.66s 2h 29m 59.37s 1h 3m 18.17s
Compression Speed Highest High Medium
CMU 977.36 KB/sec 2057.24 KB/sec (+2.1x) 3842.79 KB/sec (+3.9x)
Paragon 246.79 KB/sec 916.82 KB/sec (+3.7x) 2485.58 KB/sec (+10.1x)
Fortnite 613.56 KB/sec 2016.82 KB/sec (+3.3x) 4778.65 KB/sec (+7.8x)

And here are the default settings in action on the animations from Paragon with the ACL plugin inside UE4:

  ACL Plugin v0.3.0 ACL Plugin v0.2.0 UE v4.19.2
Compressed size 234.76 MB 226.09 MB 392.97 MB
Compression ratio 18.22 : 1 18.91 : 1 10.88 : 1
Compression time 30m 14.69s 6h 4m 18.21s 15h 10m 23.56s
Compression speed 2412.94 KB/sec 200.32 KB/sec 80.16 KB/sec
Max ACL error 0.8623 cm 0.8590 cm 0.8619 cm
Max UE4 error 0.8601 cm 0.8566 cm 0.6424 cm
ACL Error 99th percentile 0.0094 cm 0.0116 cm 0.0328 cm
Samples below ACL error threshold 99.19 % 98.85 % 84.88 %

The 99th percentile and the number of samples below the 0.01 cm error threshold are calculated by measuring the error of every bone at every sample in each of the 6558 animation clips. More details on how the error is measured can be found here.

In this new release, the decompression performance remains largely unchanged. It is worth noting that a month ago my Google Nexus 5X died abruptly and as such performance numbers will no longer be tracked on it. Instead, my new Google Pixel 3 will be used from here on out.

What’s next

The next release v1.3 currently scheduled for the Fall 2019 will aim to tackle commonly requested features:

  • Faster decompression in long clips by optimizing seeking
  • Multiple root transform support (e.g. rigid body simulation compression)
  • Scalar track support (e.g. float curves for blend shapes)
  • Faster compression in part by using multiple threads to compress a single clip (which will help the UE4 plugin a lot)

If you use ACL and would like to help prioritize the work I do, feel free to reach out and provide feedback or requests!

Compressing Fortnite Animations

New year, new stats! A few months ago, Epic agreed to let me use their Fortnite animations for my open source research with the Animation Compression Library (ACL). Following months of work to refactor Unreal Engine 4 in order to natively support animation compression plugins, it has finally entered the review stage on Epic’s end. While I had hoped the changes could make it in time for Unreal Engine 4.22, due to unforeseen delays, 4.23 seems a much more likely candidate.

Even though the code isn’t public yet, the new updated ACL plugin kicks ass and Fortnite is a great title to showcase it with. The real game uses the classic UE4 codecs but I recompressed everything with the latest and greatest. After spending several hundred hours compressing the animations, fixing bugs, and iterating I can finally present the results.

TL;DR: Inside Fortnite, ACL shines bright with 2x faster compression, 2x smaller memory footprint, and higher accuracy. Decompression is 1.6x faster on desktop, 2.3x faster on a Samsung S8, and 1.2x faster on the Xbox One.

Methodology

For the UE4 measurements I used a modified UE 4.21 with its default Automatic Compression. It tries a list of codecs in parallel and selects the optimal result by considering both size and accuracy.

ACL uses a modified version of the open source ACL Plugin v0.2.2. It uses its own default compression settings and in the rare event where the error is above 1cm, it falls back automatically to safer settings.

Although the UE4 refactor doesn’t change the legacy codecs, it does speed up their decompression a bit compared to previous UE4 releases. That is one of many benefits everyone will get to enjoy as a result of my refactor work regardless of which codec is used.

Error measurements

While the UE4 and ACL error measurements never exactly matched, they historically have been very close for every single clip I had tried, until Fortnite. As it turns out, some exotic animations brought to light the fact that some subtle differences in how they both measure the error can lead to some large perceived discrepancies. This has now been documented in the plugin here.

Three differences stand out: how the error is measured, where the error is measured in space, and where the error is measured in time. You can follow the link above for the gory details but the jist is that ACL is more conservative and more accurate in how it measures the error and it should always be trusted over what UE4 reports in case of doubt or disagreement.

It is worth noting that because ACL does not currently support a floating point sample rate (e.g 28.3 FPS), those clips (and there are many) have a higher reported error with UE4 because by rounding, we are effectively time stretching those clips a tiny bit. They still look just as good though. This will be fixed in the next version.

The animations

I extracted all the non-additive animations regardless of whether they were used by the game or not: a grand total of 8304 clips! A total raw size of 17 GB and roughly 17.5 hours worth of playback.

Fortnite has a surprising number of exotic clips. Some take hours to compress with UE4 and others have a range of motion as wide as the distance between the earth and the moon! These allowed me to identify a number of very subtle bugs in ACL and to fix them.

Compression stats

  ACL Plugin UE4
Compressed size 498.21 MB 1011.84 MB
Compression ratio 35.55 : 1 17.50 : 1
Compression time 12h 38m 04.99s 23h 8m 58.94s
Compression speed 398.72 KB/sec 217.62 KB/sec
Max ACL error 0.9565 cm 8392339 cm
Max UE4 error 108904.6797 cm 8397727 cm
ACL Error 99th percentile 0.0309 cm 2.1856 cm
Samples below ACL error threshold 97.71 % 77.37 %

Once again, ACL performs admirably: the compression speed is twice as fast (1.83x), the memory footprint reduces in half (2.03x smaller), and the accuracy is right where we want it. This is also in line with the previous results from Paragon.

Fortnite Max Error Distribution

UE4’s accuracy struggles a bit with a few clips but in practice the error might not be visible as the overwhelming majority of samples are very accurate. This is consistent as well with previous results.

UE4 Import Comic

A handful of clips contribute to a large portion of the UE4 compression time and its high error. One clip in particular stands out: it has 1167 bones, 8371 samples at 120 FPS, and a total raw size of 372.66 MB. Its range of motion peaks at 477000 kilometers away from the origin! It truly pushes the codecs to their absolute limits.

  ACL Plugin UE4
Compressed size 71.53 MB 220.87 MB
Compression ratio 5.21 : 1 1.69 : 1
Compression time 1m 38.07s 4h 51m 59.13s
Compression speed 3891.19 KB/sec 21.78 KB/sec
Max ACL error 0.0625 cm 8392339 cm
Max UE4 error 108904.6797 cm 8397727 cm

It takes almost 5 hours to compress with UE4! In comparison, ACL zips through in well under 2 minutes. While it tries its best with the default settings it ultimately ends up using the safety fallback and thus compresses twice in that amount of time.

Overall, if you added the ACL codec to the Automatic Compression list, here is how it would perform:

  • ACL is smaller for 7711 clips (92.86 %)
  • ACL is more accurate for 7576 clips (91.23 %)
  • ACL has faster compression for 5704 clips (68.69 %)
  • ACL is smaller, better, and faster for 5017 clips (60.42 %)
  • ACL wins Automatic Compression for 7863 clips (94.69 %)

Decompression stats

Fortnite has the handy ability to create replays. These make gathering deterministic profiling numbers a breeze. The numbers that follow are from a 50 vs 50 replay. On each platform, a section of the replay with some high intensity action was profiled.

Desktop

Fortnite Desktop Decompression Time

The performance on desktop looks pretty good. ACL is consistently faster, about 38% on average. It also appears a bit less noisy, a direct benefit of the improved cache friendliness of its algorithm.

Samsung S8

Fortnite Samsung S8 Decompression Time

Fortnite Samsung S8 ACL Decompression Time

ACL really shines on mobile. On average it is 56% faster but that is only part of the story. On the S8 it appears that the core is hyperthreaded and another thread does heavy work and applies cache pressure. This causes all sorts of spikes with UE4 but in comparison, the cache aware ACL allows it to maintain a clean and consistent performance.

Hyperthreading on the CPU (and the GPU) works, roughly speaking, by the processor switching to another thread already executing when it notices that the current thread is stalled waiting on a slow operation, typically when memory needs to be pulled into the cache. Both threads are executing in the sense that they have data being held in registers but only one of them advances at a time on that core. When one stalls, the other executes.

When you have a piece of code that triggers a lot of cache misses, such as some of the legacy UE4 codecs, the processor will be more likely to switch to the other hyperthread. When this happens, execution is suspended and it will only resume once the other thread stalls or the time slice expires. This could be a long time especially if the other hyperthread is executing cache friendly code and doesn’t otherwise stall often.

This translates into the type of graph above where there is heavy fluctuation as the execution time varies widely from the noise of the neighbor hyperthread.

On the other hand, when the code is cache friendly, it doesn’t give the chance to the other thread to run. This gives a nice and smooth graph for that current thread as the risk of long interruptions is reduced. When the code is that optimized, hyperthreading typically doesn’t help speed things up much as both threads compete for the same time slice with few opportunities to hide stalling latency. This is also what I observed when measuring the compression performance. In theory due to the higher cache pressure, performance could even degrade with hyperthreading but in practice I haven’t observed it, not with ACL at least.

Xbox One

Fortnite Xbox One Decompression Time

On the Xbox One ACL is about 13% faster on average. Both lines seem to have very similar shapes unlike the previous two platforms due in large part to the absence of hyperthreading. There are a few possibilities as to why the gain isn’t as significant on this platform:

  • The MSVC compiler does not generate assembly that is as clean as it generates on PC, it’s certainly sub-optimal on a few points. It fails to inline many trivial functions and it leaves around unnecessary shuffle instructions.
  • Perhaps the other threads that share the L2 thrash the hardware prefetcher, preventing it from kicking in. ACL benefits heavily from hardware prefetching.
  • The CPU is quite slow compared to the speed of its memory. This reduces the benefit of cache locality as it keeps L2 cache misses fairly cheap in comparison.

The last two points seems the most likely culprits. ACL does a bit more work per sample decompressed than UE4 but everything is cache friendly. This gives it a massive edge when memory is slow compared to the CPU clock as is the case on my desktop, the Samsung S8, and lots of other platforms.

Conclusion

With faster compression, faster decompression on every platform, a smaller memory footprint, and higher accuracy, ACL continues to shine in UE4 and it won’t be long now before everyone can find it on the marketplace for free.

In the meantime, in the next few months I will perform another release of ACL and its plugin with all the latest fixes made possible with Fortnite’s data.

Status update

To my knowledge, the first game released with ACL came out in November 2018 with the public UE4 plugin: OVERKILL’s The Walking Dead. I was told it reduced their animation memory footprint by over 50% helping them fit within their console budgets.

A number of people have also integrated it into their own custom game engines and although I have no idea if they are using it or not, Remedy Entertainment has forked ACL!

Last but not least, I’d like to extend a special shout-out to Epic for allowing me to do this and to the ACL contributors!

Introducing Realtime Math v1.0

Almost two years ago now, I began writing the Animation Compression Library. I set out to build it to be production quality which meant I needed a whole lot of optimized math. At the time, I took a look at the landscape of math libraries and I opted to roll out my own. It has served me well, propelling ACL to success with some of the fastest compression and decompression performance in the industry. I am now proud to announce that the code has been refactored out into its own open source library: Realtime Math v1.0 (RTM) (MIT license).

There were a few reasons that motivated the choice to move the code out on its own:

  • A significant amount of the ACL Continuous Integration build time is compiling and running the math unit tests which slows things down a bit more than I’d like
  • It decouples code that will benefit from being on its own
  • I believe it has its place in the landscape of math libraries out there

In order to support that last point, I reviewed 9 other popular and usable math libraries for realtime applications. I looked at these with the lenses of my own needs and experience, your mileage may vary.

Disclaimer: the list of reviewed libraries is in no way exhaustive but I believe it is representative. Note that Unreal Engine 4 is included for informational purposes as it isn’t really usable on its own. Libraries are listed in no particular order and I tried to be as objective as possible. If you spot any inaccuracies, don’t hesitate to reach out.

The list: Realtime Math, MathFu, vectorial, VectorialPlusPlus, C OpenGL Graphics Math (CGLM), OpenGL Graphics Math (GLM), Industrial Light & Magic Base (ILMBase), DirectX Math, and Unreal Engine 4.

TL;DR: How Realtime Math stands out

I believe Realtime Math stands out for a few reasons.

It is geared for high performance, deeply hot code. Most functions will end up inlined but the price to pay is an API that is a bit more verbose as a result of being C-style. When the need arises to use intrinsics, it gets out of the way and lets you do your thing. Only two libraries had what I would call optimal inlinability: Realtime Math and DirectX Math. Only those two libraries properly support the __vectorcall calling convention explicitly and only RTM handles GCC and Clang argument passing explicitly.

While it still needs a bit of love, quaternions are a first class citizen and it is the only standalone open source library I could find that supports QVV transforms (a rotation quaternion, a 3d scale vector, and a translation vector).

Realtime Math uses a coding style similar to the C++ standard library and feels clean and natural to read and write.

It consists entirely of C++11 headers, it runs almost everywhere, it supports 64 bit floating point arithmetic, and it sports a very permissive MIT license.

License

ACL is open source and uses the MIT license. I am never keen on adding dependencies and if I really have to, I want a permissive license free of constraints.

Library License
Realtime Math MIT
MathFu Apache 2.0
vectorial BSD 2-clause
VectorialPlusPlus BSD 2-clause
CGLM MIT
GLM Modified MIT
ILMBase Custom but permissive
DirectX Math MIT
Unreal Engine 4 UE4 EULA

Header only

For simplicity and ease of integration, I want ACL to be entirely made of C++11 headers. This also constrains any dependencies to the same requirement.

Library Header Only
Realtime Math Yes
MathFu Yes
vectorial Yes
VectorialPlusPlus Yes
CGLM Yes (optional lib)
GLM Yes
ILMBase No
DirectX Math Yes
Unreal Engine 4 No

Verbosity, readability, and power

An important requirement for a math library is to be reasonably concise with average code without getting in the way if the need arises to dive right into raw intrinsics. In my experience, general math type abstractions take you very far but in order to squeeze out every cycle it is sometimes necessary to write custom per platform code. When this is required, it is important for the library to not hide its internals and leave the door open.

I am personally more a fan of C-style interfaces for a math library for various reasons: I can infer very well what happens under the hood (I have seen many libraries make fancy use of some operators that leave many newcomers to wonder what they do) and they are optimal for performance as we will discuss later. The downside of course is that they tend to be a bit more verbose. However, this largely boils down to a matter of personal taste.

vectorial is one of the few libraries that offers both a C-style interface and C++ wrappers and at the other end of the spectrum DirectX Math has both a namespace and a prefix for every type, constant and function.

Library Verbosity
Realtime Math Medium (C-style)
MathFu Light (C++ wrappers)
vectorial Light (C++ wrappers) and Medium (C-style)
VectorialPlusPlus Light (C++ wrappers)
CGLM Medium (C-style)
GLM Light (C++ wrappers)
ILMBase Light (C++ wrappers)
DirectX Math Medium++ (C-style with prefix and namespace)
Unreal Engine 4 Light (C++ wrappers)

It is very common for C-style math APIs to typedef their types to the underlying SIMD type. Realtime Math, DirectX Math, and many others do this. While this is great for performance, it does raise one problem: type safety is reduced. While usually those interfaces will opt to not expose proper vector2 or vector3 types and instead rely on functions that simply ignore the extra components, it doesn’t work so well when vector4 and quaternions are mixed. Only Realtime Math, DirectX Math and CGLM have quaternions with C-style interfaces but only the first two have a distinct type for quaternions when SIMD intrinsics are disabled. This somewhat mitigates the issue because with both Realtime Math and DirectX Math you can compile without intrinsics and still have type safety validated there. Although at the end of the day, all three have functions with distinct prefixes for vector and quaternion math and as such type safety is unlikely to be an issue.

Type and feature support

By virtue or being an animation compression library, ACL’s needs are a bit different from a traditional realtime application. This dictated the need I had for specific types and features. I had no need for general 3x3 or 4x4 matrices as well as 2D vectors which are more commonly used in gameplay and rendering. However, 3x4 affine matrices, 3D and 4D vectors, quaternions, and QVV transforms (a quaternion, a vector3 translation, and a vector3 scale) are of critical importance. Those types are front and center in an animation runtime and I needed them to be fully featured and fast. Most of the libraries under review had way more features than I cared for (mostly for rendering) but generally missed proper or any support for quaternions and QVV transforms.

MathFu appears to have a bug where the Matrix 4x4 SIMD template specialization isn’t included by default and its quaternions are 32 bytes instead of the ideal 16 due to alignment constraints.

VectorialPlusPlus quaternions also take 32 bytes instead of 16 due to alignment constraints and most of their quaternion code appears to be scalar.

UE 4 is notable for being the only other library to support QVV and it does offer a VectorRegister type to support SIMD for Vector2/3/4 although most of the code written in the engine uses the scalar version.

Library Vector2 Vector3 Vector4 Quaternion Matrix 3x3 Matrix 4x4 Matrix 3x4 QVV
Realtime Math   SIMD SIMD SIMD SIMD SIMD SIMD SIMD
MathFu SIMD SIMD SIMD Partial SIMD Scalar SIMD Scalar  
vectorial   SIMD SIMD     SIMD    
VectorialPlusPlus SIMD SIMD SIMD Scalar SIMD SIMD    
CGLM   SIMD SIMD SIMD SIMD SIMD SIMD  
GLM SIMD SIMD SIMD Partial SIMD SIMD SIMD SIMD  
ILMBase Scalar Scalar Scalar Scalar Scalar Scalar    
DirectX Math SIMD SIMD SIMD SIMD   SIMD    
Unreal Engine 4 Scalar Scalar Scalar SIMD   SIMD   SIMD

SIMD architecture support

Equally important was the SIMD architecture support. I want to run ACL everywhere with the best performance possible, especially on mobile. SSE, AVX, and NEON are all equally important to me.

Worth noting that 2 years ago DirectX NEON support appeared almost exclusively to be for Windows ARM NEON and I have no idea if it runs on iOS or Android even today.

Library SSE AVX NEON
Realtime Math Yes Yes Yes
MathFu Yes   Yes
vectorial Yes   Yes
VectorialPlusPlus Yes   Partial
CGLM Yes Yes Partial
GLM Yes    
ILMBase      
DirectX Math Yes Yes Yes
Unreal Engine 4 Yes Yes Yes

Platform and compiler support

Here things are a bit more complicated as libraries will list platforms but not compilers or compilers but not platforms. I need ACL to run everywhere and this means limiting myself to C++11 features.

  • Realtime Math: Windows (VS2015, VS2017) x86 and x64, Linux (gcc5, gcc6, gcc7, gcc8, clang4, clang5, clang6) x86 and x64, OS X (Xcode 8.3, Xcode 9.4, Xcode 10.1) x86 and x64, Android clang ARMv7-A and ARM64, iOS (Xcode 8.3, Xcode 9.4, Xcode 10.1) ARM64
  • MathFu: Windows, Linux, OS X, Android
  • vectorial: Unlisted but probably Windows, Linux, OS X, Android, and iOS
  • VectorialPlusPlus: Unlisted but probably Windows
  • CGLM: Windows, Unix, and probably everywhere
  • GLM: VS2013+, Apple Clang 6, GCC 4.7+, ICC XE 2013+, LLVM 3.4+, CUDA 7+
  • ILMBase: Unlisted but probably Windows, Linux, OS X
  • DirectX Math: VS2015 and VS2017, possibly elsewhere
  • Unreal Engine 4: Windows (VS2015, VS2017) x64, Linux x64, OS X x64, Android ARMv7-A (no NEON) and ARM64, iOS ARM64

Continuous integration support

Continuous integration is a critical part of modern software development especially with C++ when multiple platforms are supported and maintained.

Library Continuous Integration
Realtime Math Yes
MathFu No
vectorial No
VectorialPlusPlus No
CGLM Yes
GLM Yes
ILMBase No
DirectX Math No
Unreal Engine 4 Not public

Dependencies

I’m not personally a big fan of pulling in tons of dependencies, especially for a math library. As mentioned earlier, the Unreal Engine 4 math library isn’t really usable on its own because of this but is included regardless.

Library Dependencies
Realtime Math  
MathFu vectorial (BSD 2-clause)
vectorial  
VectorialPlusPlus HandyCPP (custom license)
CGLM  
GLM  
ILMBase  
DirectX Math  
Unreal Engine 4 Unreal Engine 4

Floating point support

When I got started with ACL, I wasn’t sure at the time if 64 bit floating point arithmetic might offer superior accuracy or not and if it would be worth using. As a result, I needed the math code to support both float32 and float64 types for everything with a seamless API between the two for quick testing. It later turned out that the extra floating point precision isn’t helping enough to be worth using.

Library Float 32 Support Float 64 Support
Realtime Math Yes Yes (partial SIMD)
MathFu Yes Yes (no SIMD)
vectorial Yes  
VectorialPlusPlus Yes Yes (partial SIMD)
CGLM Yes  
GLM Yes  
ILMBase Yes (no SIMD) Yes (no SIMD)
DirectX Math Yes  
Unreal Engine 4 Yes  

Inlinability

Due to the critical need for ACL to be as fast as possible on every platform, having the bulk of the math operations be inline is very important. Many things impact whether a function is inlined by the compiler but two stand out:

  • Simple and short functions inline better
  • Passing arguments by register needs fewer instructions which inlines better

Thankfully, most math function are fairly simple and short: add, mul, div, etc. C-style functions will generally have a slight advantage over C++ wrappers mainly because they also must track the implicit this pointer being passed around even if ultimately it is optimized out inside the caller. When the compiler needs to determine if it can inline a function, it uses a heuristic and the size of the intermediate assembly/IR/AST most likely plays a role. Generally speaking, C++ wrapper functions that are short will inline just fine but some operations have a harder time due to their size: matrix 4x4 multiplication, quaternion multiplication, and quaternion interpolation. For this reason, I personally favor a C-style API for this sort of code.

The second point is not to be underestimated. Most of the libraries in the list either take the arguments by value or by const reference. While passing SIMD types by value does the right thing on ARM and passes them by register (up to 4), it does not work for aggregate types like matrices and it does not work with the default x64 calling convention with MSVC. In order to be able to pass SIMD types by register with MSVC, you must use its __vectorcall calling convention. It also works for aggregate and wrapper types. Up to 6 registers can be used for this. On desktop and Xbox One, using __vectorcall is critical for high performance code and sadly, most libraries do not support it explicitly (and not all support it implicitly if the whole compilation unit is forced to use that calling convention). With Visual Studio 2015, __vectorcall is the difference between having quaternion interpolation getting inlined or not. When I added support for it in ACL, I measured a roughly 5% speedup during the decompression.

Note that once a function is inlined, whether the arguments are passed by register or not typically does not impact the generated assembly although it sometimes does (at least with MSVC especially when AVX is enabled).

Some libraries which use a generic vector template class with specializations for SIMD (like MathFu) sometime end up passing *float32 arguments by const-reference instead of by value which is often suboptimal when not inlined.*

Library Inlinability Register Passing
Realtime Math Optimal (C-style + by register) Explicit (everywhere)
MathFu Decent (C++ wrappers) None
vectorial Good (C-style), Decent (C++ wrappers) Implicit (C-style and ARM only)
VectorialPlusPlus Decent (C++ wrappers) None
CGLM Good (C-style) None
GLM Decent (C++ wrappers) None
ILMBase Decent (C++ wrappers) None
DirectX Math Optimal (C-style + by register) Explicit (vectorcall and ARM only)
Unreal Engine 4 Decent (C++ wrappers) None

Multiplication order

An important point of contention is how things are multiplied. As the list below shows, the OpenGL way is by far the most popular for open source math libraries.

It all boils down to whether vectors are represented as a row or as a column. In the former case, multiplication with a matrix takes the form v' = vM while in the later case we have v' = Mv. Linear algebra typically treats vectors as columns and OpenGL opted to use that convention for that reason. If you think of matrices as functions that modify an input and return an output it ends up reading like this: result = object_to_world(local_to_object(input)). This reads right-to-left as is common with nested function evaluation. In my opinion, this is quite awkward to work with as most modern programming languages (and western languages) read left-to-right. Most linear algebra formulas use abstract letters and names for things which somewhat hides this nuance but when I write code, I try to keep my matrix names as clear as possible: what space are the input and output in. While you could technically reverse the naming result = world_from_object * object_from_local * input so it at least reads decently right-to-left, it’s still harder to reason with because just about everything we work with in the world goes from somewhere to somewhere else and not the other way around: trains, buses, planes, Monday to Friday, 5@7, etc.

On the other hand, DirectX uses row vectors and ends up with the much more natural: result = input * local_to_object * object_to_world. Your input is in local space, it gets transformed into object space before finally ending up in world space. Clean, clear, and readable. If you instead multiply the two matrices together on their own, you get the clear local_to_world = local_to_object * object_to_world instead of the awkward local_to_world = object_to_world * local_to_object you would get with OpenGL and column vectors.

At the end of the day, which way you choose largely boils down to a personal choice (or whatever library you use for rendering) as I don’t think there’s a big performance difference between the two on modern hardware. For ACL, all its output data is in local space and although we evaluate the error in world space internally, this is entirely transparent to the client application and it is free to use either convention.

Library Multiplication Style
Realtime Math DirectX
MathFu OpenGL
vectorial OpenGL
VectorialPlusPlus OpenGL
CGLM OpenGL
GLM OpenGL
ILMBase OpenGL
DirectX Math DirectX
Unreal Engine 4 DirectX

Conclusion

Ultimately, which math library you choose for a particular project boils down to a matter of personal preference to a large extent. For the vast majority of the code you’ll write, the performance and code generation is likely to be very close if not identical. Two years ago, I knew regardless of which option I picked I would have to do a lot of work to add what was missing. This greatly motivated me to just start from scratch as many middleware do and I do not regret the experience or results.

My top two favorite libraries are Realtime Math and DirectX Math. Both are quite similar today although DirectX Math wasn’t quite as attractive when I started.

Next steps

Over the next few days I will populate various issues on GitHub to document things that are missing or that could benefit from some love.

A core part that is partially missing at the moment is the quantization and packing logic that ACL already contains. I have not migrated that code yet in large part because I am not sure how to best expose it in a clean and consistent API. I do believe it belongs in RTM where everyone can benefit from it.

ACL does not yet use RTM but that migration is planned for ACL v2.0.