02 Apr 2018
Almost a year ago, I began working on ACL and it is now one step closer to being production ready with the new v0.7 release!
This new release is significant for several reasons but two stand out above all others:
- Exhaustive automated testing
- Full multi-platform support
Unlike previous releases, the performance remained unchanged since v0.6 but I went ahead and updated the stats and graphs regardless.
Testing, Testing, One, Two
This new release introduces extensive unit testing for all the core and math functions on top of which ACL is built. There is still lots of room for improvement here, contributions welcome! Continuous integration also executes the unit tests for every platform except iOS and Android where they must be executed manually for now.
More significant is the addition of exhaustive regression testing. A total of 42 clips from the Carnegie-Mellon University motion capture database are each compressed under a mix of 7 configurations. At the moment, these must be run manually for every platform but scripts are present to automate the whole process. The primary reason why it remains manual is that the data is too large for GitHub and I do not have a webserver to host it. The instructions can be found here.
It runs everywhere
ACL now officially supports 12 different compiler toolchains and all the major platforms: Windows, Linux, OS X, iOS, and Android. Both compression and decompression are supported and can easily be tested with the provided unit and regression tests.
But this is only the list of platforms I can reliably and easily test. In practice, since all the code is now pure C++11, if it compiles it should run just fine as-is. Although I cannot test them yet, I fully expect all major consoles to work out of the box: Nintendo Switch (ARM), PlayStation 4 (x64), and Xbox One (x64).
The data I obtained from Paragon under NDA last year may or may not differ from what has now been publicly released by Epic. As soon as I get the chance, I will update the published stats with the new public data. This also means that I will be able to include Paragon clips into the regression tests as well to increase our coverage.
The next v0.8 release aims to achieve three goals (roadmap):
- Document as much as possible
- Add the remaining features to support real games
- Add the necessary decompression profiling infrastructure
Decompression performance is one of the most important metric for modern games on both mobile and consoles. Measuring it accurately and reliably in an environment that is as close to a real game as possible is challenging which is why it was left last. However, ACL was built from the ground up in order to decompress as fast as possible: all memory reads are contiguous and linear and writes can be too depending on the host game engine integration. I am quite confident it will end up competitive with the state of the art codecs within UE4 and there are many opportunities left to optimize that I have delayed until I can measure their individual impact properly.
This upcoming release is likely to be the last before the first production release which aims to be a drop-in replacement within UE4. If everything goes according to plan and no delays surface, at the current pace, I should be able to reach this milestone around June 2018.
10 Jan 2018
Hot off the press, ACL v0.6 has just been released and contains lots of great things!
This release focused primarily on extending the platform support as well as improving the accuracy. Proper Linux and OS X support was added as well as the x86 architecture. As always, the list of supported platforms is in the readme. This was made possible thanks to continuous build integration which has been added and contributed in part by Michał Janiszewski!
Another notable mention is that the backlog and roadmap have been migrated to GitHub Issues. This ensures complete transparency with where the project is going.
Compiler battle royal
Now that we have all of these compilers and platforms supported, I thought it would make sense to measure everything at least once on the full data set from Carnegie-Mellon University.
Another thing I wanted to measure is how much do we gain from hyper-threading and last but not least, I thought it would be interesting to include x86 as well as x64.
Here is my setup to measure:
- Windows 10 running on an Intel i7-6850K with 6 physical cores and 12 logical cores
- Ubuntu 16.04 running in VirtualBox with 6 cores assigned
- OS X running on an Intel i5-4288U with 2 physical cores and 4 logical cores
The acl_compressor.py script is used to compress multiple clips in parallel in independent processes. Each clip runs in its own process.
Every platform used a Release build with AVX enabled. The wall clock time is the cummulative time it took to run everything: compression, decompression to measure accuracy, reading the clip, writing the stats, etc. On the other hand, the total thread time measures the total sum of time the threads each spent on compression.
A number of things stand out:
- x86 is slower for VS 2015 (66.5% slower) , VS 2017 (64.0% slower with 11 cores, 108.8% slower with 3 cores), and Clang 5 (36.0% slower) but it seems to be faster for GCC 5 (10.9% faster)
- Hyper-threading barely helps at all: going from 6 cores to 11 with VS 2017 was only 7.8% faster but the total thread time increases by 69.9%
- Clang 5 with x64 wins hands down, it is 25.2% faster than VS 2017 and 220.8% faster than GCC 5
GCC 5 performs so bad here that I am wondering if the default CMake compiler flags for Release builds are sane or if I made a mistake somewhere. Clang 5 really blew me away: despite running in a VM it significantly outperforms all the other compilers with both x86 and x64.
As expected, hyper-threading does not help all that much. When clips are compressed, most of the data manipulated can fit within the L2 or L3 caches. With so little IO made, animation compression is primarily CPU bound. Overall this leaves very little opportunity for a neighbor thread to execute since they hardly ever stall on expensive operations.
As I mentioned when the Paragon data set was announced, some exotic clips brought to the surface some unusual accuracy issues. These were all investigated and they generally fell into one or both of these categories:
- Very small and very large scale coupled with very large translation caused unacceptable accuracy loss when using affine matrices to calculate the error
- Very large translations in a long bone chain can lead to significant accuracy loss
In order to fix the first issue, how we handle the error metric was refactored to better allow a game engine to supply their own. This is documented here. Ultimately what is most important about the error metric is that it closely approximates how the error will look in the host game engine. Some game engines use affine matrices to convert the local space bone transform into object or world space while others use Vector-Quaternion-Vector (VQV). ACL now supports both ways to calculate the error and the default we will be using for all of our statistics is the later as it more closely matches what Unreal 4 does. This did not measurably impact the compression results but it did improve the accuracy of the more exotic clips and the overall compression time is faster.
However, the problem of large translations in long bone chains has not been addressed. I compared how the error looked in Unreal 4 and it does a much better job than ACL for the time being on those few clips. This is because they implement error compensation which is something that ACL has not implemented yet. In the meantime, ACL is perfectly safe for production use and if these rare clips with a visible error do pop up, less aggressive compression settings can be used. Only 3 clips within the Paragon data set suffer from this.
Ultimately a lot of the error introduced for both ACL and Unreal 4 comes from the rotation format we use internally: we drop the quaternion W component. This works well enough when its value is close to 1.0 as the square-root used to reconstruct it is accurate in that range but it fails spectacularly when the value is very small and close to 0.0. I already have plans to try two other rotation formats to help resolve this issue: dropping the largest quaternion component and using the quaternion logarithm instead.
While investigating the accuracy issues and comparing against Unreal 4 I noticed that a fix I previously made locally was partially incorrect and in rare cases could lead to bad things happening. This has been fixed and the statistics and graphs for UE 4.15 were updated for CMU and Paragon. The results are very close to what they were before.
The accuracy improvements from this release are a bit more visible on the Paragon data set.
At this point, I can pretty confidently say that ACL is ready for production use but many things are still missing for the library to be of production quality. While the performance and accuracy are good enough, iOS support is still missing, support for additive animations is missing, as well as lots of unit testing, documentation, and clean up.
The next release will focus on:
- Cleaning up
- Adding lots of unit tests
- iOS support
- Better Android support
- Many other things
29 Dec 2017
As I mentioned in my previous post, ACL still suffers from some less then ideal accuracy in some exotic situations. Since the next release will have a strong focus on fixing this, I wanted to investigate using float64 and fixed point arithmetic. It is general knowledge that float32 arithmetic incurs rounding and can lead to severe accuracy loss in some cases. The question I hoped to answer was whether or not this had a significant impact on ACL. Originally ACL performed the compression entirely with float64 arithmetic but this was removed because it caused more issues than it was worth but I did not publish numbers to back this claim up. Now we revisit it once and for all.
To this end, the first research branch was created. Research branches will play an important role in ACL. Their aim is to explore small and large ideas that we might not want to support in their entirety in the main branches while keeping them close. Unless otherwise specified, research branches will not be actively maintained. Once their purpose is complete, they will live on to document what worked and didn’t work and serve as a starting point for anyone hoping to investigate them further.
float64 vs float32 arithmetic
In order to fully test float64 arithmetic, I templated a few things to abstract the arithmetic used between float32 and float64. This allowed easy conversion and support of both with nearly the same code path. The results proved somewhat underwhelming:
As it turns out, the small accuracy loss from float32 arithmetic has a barely measurable impact on the memory footprint for CMU and a 0.6% reduction for Paragon. However, the compression (and decompression) time is much faster.
With float64, the max error for CMU and Paragon is slightly improved for the majority of the clips but not by a significant margin and 4 exotic Paragon clips end up with a worse error.
Consequently, it is my opinion that float32 is the superior choice between the two. The small increase in accuracy and reduction in memory footprint is not significant enough to outweigh the degradation of the performance. Even though the float64 code path isn’t as optimized, it will remain slower due to the individual instructions being slower and the increased number of registers needed. It’s possible the performance might be improved considerably with AVX and so this is something we’ll keep in mind going forward.
Fixed point arithmetic
Another popular alternative to floating point arithmetic is fixed point arithmetic. Depending on the situation it can yield higher accuracy and depending on the hardware it can also be faster. Prior to this, I had never worked with fixed point arithmetic. There was a bit of a learning curve but it proved intuitive soon enough.
I will not explain in great detail how it works but intuitively, it is the same as floating point arithmetic minus the exponent part. For our purposes, during the decompression (and part of the compression), most of the values we work with are normalized and unsigned. This means that the range is known ahead of time and fixed which makes it a good candidate for fixed point arithmetic.
Sadly, it differs so much from floating point arithmetic that I could not as easily support it in parallel with the other two. Instead, I created an arithmetic_playground and tried a whole bunch of things within.
I focused on reproducing the decompression logic as close as possible. The original high level logic to decompress a single value is simple enough to include here:
Not quite 1.0
The first obstacle to using fixed point arithmetic is the fact that our quantized values do not map 1:1. Many engines dequantize with code that looks like this (including Unreal 4 and ACL):
This is great in that it allows us to exactly represent both 0.0 and 1.0, we can support the full range we care about: [0.0 .. 1.0]. A case could be made to use a multiplication instead but it doesn’t matter all that much for the present discussion. With fixed point arithmetic we want to use all of our bits to represent the fractional part between those two values. This means the range of values we support is: [0.0 … 1.0). This is because both 0.0 and 1.0 have the same fractional value of 0 and as such we cannot tell them apart without an extra bit to represent the integral part.
In order to properly support our full range of values, we must remap it with a multiplication.
Fast coercion to float32
The next hurdle I faced was how to convert the fixed point number into a float32 value efficiently. I independently found a simple, fast, and elegant way and of course it turned out to be very popular for those very reasons.
For all of our possible values, we know their bit width and a shift can trivially be calculated to align it with the float32 mantissa. All that remains is or-ing the exponent and the sign. In our case, our values are between [0.0 … 1.0[ and thus by using a hex value of 0x3F800000 for
exponent_sign, we end up with a float32 in the range of [1.0 … 2.0[. A final subtraction yields us the range we want.
Using this trick with the float32 implementation gives us the following code:
It does lose out a tiny bit of accuracy but it is barely measurable. In order to be sure, I tried exhaustively all possible sample and segment range values up to a bit rate of 16 bits per component. The up side is obvious, it is 14.9% faster!
32 bit vs 64 bit variants
Many variants were implemented: some performed the segment range expansion with fixed point arithmetic and the clip range expansion with float32 arithmetic and others do everything with fixed point. A mix of 32 bit and 64 bit arithmetic was also tried to compare the accuracy and performance tradeoff.
Generally, the 32 bit variants had a much higher loss of accuracy by 1-2 orders of magnitude. It isn’t clear how much this would impact the overall memory footprint on CMU and Paragon. The 64 bit variants had comparable accuracy to float32 arithmetic but ended up using more registers and more instructions. This often degraded the performance to the point of making them entirely uncompetitive in this synthetic test. Only a single variant came close to the original float32 performance but it could never beat the fast coercion derivative.
The fastest 32 bit variant is as follow:
Despite being 3 instructions shorter and using faster instructions, it was 14.4% slower than the fast coercion float32 variant. This is likely a result of pipelining not working out as well. It is entirely possible that in the real decompression code things could end up pipelining better making this a faster variant. Other processors such as those used in consoles and mobile devices also might perform differently and proper measuring will be required to get a definitive answer.
The general consensus seems to be that fixed point arithmetic can yield higher accuracy and performance but it is highly dependent on the data, the algorithm, and the processor it runs on. I can corroborate this and conclude that it might not help out all that much for animation compression and decompression.
All of this work was performed in a branch that will NOT be merged into develop. However, some changes will be cherry picked by hand. In the short term, the conclusions reached here will not be integrated just yet into the main branches. The primary reason for this is that while I have extensive scripts and tools to track the accuracy, memory footprint, and compression performance; I do not have robust tooling in place to track decompression performance on the various platforms that are important to us.
Once we are ready, the fast coercion variant will land first as it appears to be an obvious drop-in replacement and some fixed point variants will also be tried on various platforms.
The accuracy issues will have to be fixed some other way and I already have some good ideas how: idea 1, idea 2, idea 3, idea 4, idea 5.
05 Dec 2017
While working for Epic to improve Unreal 4’s own animation compression and decompression, I asked for permission to use the Paragon animations for research purposes and they generously agreed. Today I have the pleasure to report the findings from that new data set!
This is significant for two reasons:
- It allows for extensive stress testing with new data
- Paragon is a real game with high animation quality
Thus far, the Carnegie-Mellon University data set has been the performance benchmark.
The data set contains 2534 clips. Each clip contains an animated character with 44 bones. The version of the data that I found comes from the Unity store where it is distributed in FBX form but sampled at 24 FPS. The total duration of the database is 09h 49m 37.58s. It does not contain any 3D scale and its raw size is 1429.38 MB. It exclusively contains motion capture animation data. It is publicly available and well known within the animation compression research community.
While the database is valuable, it is not entirely representative of all the animation assets that a AAA game might use for a few reasons:
- Most AAA games today have well over 100 bones per character and sometimes as high as 500
- The sample rate is lower than the 30 FPS typically used in games
- Motion capture data is often very noisy
- Games often animate things other than characters such as cloth, objects, destruction, etc.
- Many games make use of 3D scale
For these reasons, this data set is wonderful for unit testing and establishing a baseline for comparison but it falls a bit short with what I would ideally like.
You can see how Unreal and ACL compare against it here.
The Paragon data set contains 6558 clips for a total duration of 07h 00m 45.27s and a raw size of 4276.11 MB. As you can see, despite being shorter than CMU, it is about 3x larger in size.
The data set contains among other things:
- Lots of characters with varying number of bones
- Animated objects of various shape and form
- Very short and very long clips
- Clips with unusual sample rate (as low as 2 FPS!)
- World space clips
- Lots of 3D scale
- Lots of other exotic clips
This is great to stress test any compression algorithm and the results will be very representative of what could be expected in a AAA game.
To extract the animation clips, I used the Unreal 4 animation recompression commandlet and modified it to skip clips that ACL does not yet support (e.g. additive animations). I did my best to retain as many clips as possible. Every clip was saved in the ACL file format allowing a binary exact representation.
Sadly, I am not at liberty to share this data set as I am only allowed to use it under a non-disclosure agreement. All hope is not lost though, Epic has expressed interest in perhaps making a small subset of the data publicly available for research purposes. Stay tuned!
The value of undertaking this quickly became obvious when an exotic clip from the data set highlighted a bug in the variable bit rate selection that ACL used. A fix was made and the results were breathtaking: CMU reduced in size by 19% (and Paragon reduced by 20%)! You can read about it here in my previous blog post.
Three clips stress tested the accuracy of ACL and ended up with an unacceptable error as a result. This will be made evident by the graphs and numbers below. I am hoping to fix a number of accuracy issues in the next ACL release now that I have new data to validate against.
The bugs I found were not exclusively within ACL: two were found and still present in the latest Unreal 4 version. Thankfully, I was able to get in touch with Epic and these should be fixed in a future release.
In order to make the comparison as fair as possible, I had to locally disable the down-sampling variants within the Unreal 4 automatic compression method. One of the two bugs caused these variants to sometime crash. While down-sampling isn’t often selected by the algorithm as the optimal choice for any given clip, disabling it means that compression is faster and possibly a bit larger as a result. Out of the 600 clips I managed to compress before finding the bug, only 3 ended up down-sampled. There are 9 down-sampled variants out of 27 in total (33%).
UE 4.15 took 19h 56m 50.37s single threaded to compress. It yielded a compressed size of 496.24 MB for a compression ratio of 8.62 : 1. The max error is 0.8619cm.
ACL 0.5 took 19h 04m 25.11s single threaded to compress (01h 53m 42.84s with 11 threads). It yielded a compressed size of 205.69 MB for a compression ratio of 20.79 : 1. The max error is 9.7920cm.
On the surface, the compression time remains faster with ACL even with a significant portion of the variants disabled in the Unreal automatic compression. However, the memory footprint is dramatically smaller, a whooping 58.6% smaller! As will be made apparent in the graphs below, once again the maximum error proves to be a poor metric of the true performance: 3 clips have an error above 0.8cm with ACL.
The results in images
All the results and many more images are also on GitHub here for Paragon just like they are for CMU here. I will only show a few selected images in this post for brevity.
As expected, ACL outperforms Unreal by a significant margin. Some clips on the right are truncated with unusually high compression ratios as high as 900 : 1 for some exotic clips but those are likely very long with little to no animated data and aren’t too interesting or representative.
Here again ACL outperforms Unreal over the overwhelming majority of the data set. On the right there are a small number of clips that perform somewhat poorly with both compression methods: a total of 101 clips have an error above 0.1cm with ACL and 153 clips for Unreal.
As I have previously mentioned, the max clip error is a poor measure of accuracy. Once again the full picture is much better and tells a different story.
ACL continues to shine, crossing the 0.01cm threshold at the 99.23th percentile. Unreal crosses the same threshold at the 89th percentile.
Despite having a maximum error that is entirely unacceptable, it turns out that only 0.77% of the compressed samples (out of 112 million) exceed a sub-millimeter threshold. Aside from the 3 worst offending clips, everything else is cinematic and production quality. Not bad!
As is apparent now, ACL performs admirably in a myriad of scenarios and continues to improve month after month. Real world data now confirms it. Half the memory footprint of Unreal is not insignificant even for a PC or PS4 game: less data to load into memory means faster streaming, less data to transfer means faster game download and installation times, and it can correlate with faster decompression performance too. For many PS4 and XB1 games, 200 MB is perhaps small enough to load them all into memory up front and never stream them from disk afterwards.
As I continue to improve ACL, I will update the graphs and numbers with the latest significant releases. I also expect the improvements that I made to Unreal’s own animation compression over the last few months to be part of a future release and when that happens I will again update everything.
Special thanks to Raymond Barbiero for his very valuable feedback and to the continued support of many others!
23 Nov 2017
Today marks the release of ACL v0.5. Once again, lots of great things were included in this release but three things stand out:
- Full 3D scale support
- Android support (tested within Unreal Engine 4.15)
- A fix to the variable quantization optimization algorithm
The third point in particular needs explaining. Initially, I did not intend to make significant changes in this release to the way compression was done beyond the scale support and whatever fixes Android required.
However, while investigating accuracy issues within an exotic clip, I noticed a bug. Upon fixing it (and very unexpectedly), everything kicked into overdrive.
On the Carnegie-Mellon University (CMU) data set, the memory footprint reduced by 18.4% with little to no change to the accuracy of the overwhelming majority of clips and a slight accuracy increase to some of them!
Sadly, the compression speed suffered a bit as a result and it is now about 1.5x slower than v0.4. In my opinion, this is an entirely acceptable trade-off!
Compared to UE 4.15, ACL now stands 37.8% smaller and 2.82x faster (single threaded) to compress on CMU. No small feat!
In light of these new numbers, all the charts have been updated and can be found here. Here are the most interesting:
I also extracted two new charts: the distribution of clip durations within CMU and the distribution of which bit rates ended up selected by the algorithm. A bit rate of 6 means that 6 bits per component are used. Every track (rotation, translation, and scale) sample has 3 components (X, Y, Z) which means 18 bits per sample.
The focus of the next few months will be more platform support (Linux, OS X, and iOS in that order) as well as improving the accuracy. A new data set I got my hands on showed edge cases that are not too uncommon from real video games where the accuracy is not good enough. Part of the accuracy loss comes from storing the segment range on 8 bits per component and the fact that we use 32 bit floats to perform the decompression arithmetic. As such, a new research branch will be created to investigate using 64 bit floats to perform the arithmetic and a fixed point represetation as well. A separate blog post will be written with the conclusion of this research.