Arithmetic Accuracy and Performance
29 Dec 2017

As I mentioned in my previous post, ACL still suffers from some less than ideal accuracy in some exotic situations. Since the next release will have a strong focus on fixing this, I wanted to investigate using float64 and fixed point arithmetic. It is general knowledge that float32 arithmetic incurs rounding and can lead to severe accuracy loss in some cases. The question I hoped to answer was whether or not this had a significant impact on ACL. Originally, ACL performed the compression entirely with float64 arithmetic, but this was removed because it caused more issues than it was worth, and I did not publish numbers to back that claim up. Now we revisit it once and for all.
To this end, the first research branch was created. Research branches will play an important role in ACL. Their aim is to explore small and large ideas that we might not want to support in their entirety in the main branches while keeping them close. Unless otherwise specified, research branches will not be actively maintained. Once their purpose is complete, they will live on to document what worked and didn’t work and serve as a starting point for anyone hoping to investigate them further.
float64 vs float32 arithmetic
In order to fully test float64 arithmetic, I templated a few things to abstract the arithmetic used between float32 and float64. This allowed easy conversion and support of both with nearly the same code path. The results proved somewhat underwhelming.
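Before digging into them, here is a rough sketch of what the templated abstraction looked like in spirit (a hypothetical illustration with made-up names, not ACL's actual API):

```cpp
#include <cstdint>

// Hypothetical sketch: the same dequantization code path can service float32
// and float64 simply by swapping the template argument.
template<typename float_type>
inline float_type dequantize(uint32_t quantized_value, uint32_t num_bits)
{
    const float_type max_value = float_type((uint64_t(1) << num_bits) - 1);
    return float_type(quantized_value) / max_value;
}

// Usage: dequantize<float>(value, 16) or dequantize<double>(value, 16)
```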
As it turns out, the small accuracy loss from float32 arithmetic has a barely measurable impact on the memory footprint for CMU and a 0.6% reduction for Paragon. However, compression (and decompression) is much faster with float32.
With float64, the max error for CMU and Paragon is slightly improved for the majority of the clips, but not by a significant margin, and 4 exotic Paragon clips end up with a worse error.
Consequently, it is my opinion that float32 is the superior choice of the two. The small increase in accuracy and reduction in memory footprint are not significant enough to outweigh the performance degradation. Even though the float64 code path isn't as optimized, it will remain slower due to the individual instructions being slower and the increased number of registers needed. It's possible that performance could improve considerably with AVX, so this is something we'll keep in mind going forward.
Fixed point arithmetic
Another popular alternative to floating point arithmetic is fixed point arithmetic. Depending on the situation it can yield higher accuracy and depending on the hardware it can also be faster. Prior to this, I had never worked with fixed point arithmetic. There was a bit of a learning curve but it proved intuitive soon enough.
I will not explain in great detail how it works, but intuitively it is the same as floating point arithmetic minus the exponent part. For our purposes, during the decompression (and part of the compression), most of the values we work with are normalized and unsigned. This means that the range is known ahead of time and fixed, which makes it a good candidate for fixed point arithmetic.
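To make the representation concrete, here is a small example of my own (not taken from the playground code):

```cpp
#include <cstdint>

// An unsigned 0.32 fixed point value spends all 32 bits on the fraction,
// since our normalized values need neither an integral part nor an exponent.
// For example, 0.75 is stored as 0.75 * 2^32 = 0xC0000000.
inline uint32_t to_fixed_0_32(double value)        // value in [0.0 .. 1.0)
{
    return uint32_t(value * 4294967296.0);         // value * 2^32
}

// Multiplying two 0.32 values needs a 64 bit intermediate product followed by
// a shift to drop the extra fractional bits.
inline uint32_t mul_fixed_0_32(uint32_t lhs, uint32_t rhs)
{
    return uint32_t((uint64_t(lhs) * uint64_t(rhs)) >> 32);
}
```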
Sadly, it differs so much from floating point arithmetic that I could not as easily support it in parallel with the other two. Instead, I created an arithmetic_playground and tried a whole bunch of things within.
I focused on reproducing the decompression logic as closely as possible. The original high level logic to decompress a single value is simple enough to include here:
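In rough form, it boils down to this (an illustrative reconstruction with made-up names rather than the verbatim snippet):

```cpp
#include <cstdint>

// Illustrative reconstruction; the names are not the exact ACL identifiers.
float decompress_value(uint32_t quantized_value, uint32_t num_bits,
                       float segment_min, float segment_extent,
                       float clip_min, float clip_extent)
{
    // Dequantize the raw N bit sample into a normalized [0.0 .. 1.0] value.
    const float normalized = float(quantized_value) / float((1 << num_bits) - 1);

    // Expand with the segment range, then with the clip range.
    const float segment_value = normalized * segment_extent + segment_min;
    return segment_value * clip_extent + clip_min;
}
```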
Not quite 1.0
The first obstacle to using fixed point arithmetic is the fact that our quantized values do not map 1:1. Many engines dequantize with code that looks like this (including Unreal 4 and ACL):
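Paraphrased (this is not the verbatim snippet from either code base):

```cpp
#include <cstdint>

// Dividing by the largest representable value maps 0 to 0.0 and
// ((1 << num_bits) - 1) to exactly 1.0.
inline float dequantize(uint32_t value, uint32_t num_bits)
{
    return float(value) / float((1 << num_bits) - 1);
}
```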
This is great in that it allows us to exactly represent both 0.0 and 1.0, so we can support the full range we care about: [0.0 .. 1.0]. A case could be made to use a multiplication instead, but it doesn't matter all that much for the present discussion. With fixed point arithmetic we want to use all of our bits to represent the fractional part between those two values. This means the range of values we support is [0.0 .. 1.0). This is because both 0.0 and 1.0 have the same fractional value of 0 and as such we cannot tell them apart without an extra bit to represent the integral part.
In order to properly support our full range of values, we must remap it with a multiplication.
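One way to see this (in my notation): interpreting the quantized integer q as a pure fraction divides it by 2^N, while the dequantization above divides by 2^N - 1, and the two are related by

q / (2^N - 1) = (q / 2^N) * (2^N / (2^N - 1))

The correction factor 2^N / (2^N - 1) depends only on the bit rate, so it can be precomputed and applied with a single multiplication.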
Fast coercion to float32
The next hurdle I faced was how to convert the fixed point number into a float32 value efficiently. I independently found a simple, fast, and elegant way, and of course it turned out to already be popular for those very reasons.
For all of our possible values, we know their bit width, and a shift can trivially be calculated to align them with the float32 mantissa. All that remains is or-ing in the exponent and the sign. In our case, our values lie in [0.0 .. 1.0), and thus by using a hex value of 0x3F800000 for exponent_sign, we end up with a float32 in the range [1.0 .. 2.0). A final subtraction yields the range we want.
Using this trick with the float32 implementation gives us the following code:
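It looked roughly like this (an illustrative reconstruction rather than the exact snippet):

```cpp
#include <cstdint>
#include <cstring>

// Illustrative reconstruction of the fast coercion idea.
inline float dequantize_fast(uint32_t value, uint32_t num_bits)
{
    // Align the quantized bits with the float32 mantissa (23 bits) and or in
    // an exponent_sign of 0x3F800000, yielding a float32 in [1.0 .. 2.0).
    const uint32_t float_bits = (value << (23 - num_bits)) | 0x3F800000;
    float result;
    std::memcpy(&result, &float_bits, sizeof(result));

    // Shift down to [0.0 .. 1.0) and remap so that the largest quantized
    // value lands exactly on 1.0.
    const float scale = float(1 << num_bits) / float((1 << num_bits) - 1);
    return (result - 1.0f) * scale;
}
```

In practice, the final multiplication could presumably be folded into a later range expansion multiply.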
It does lose a tiny bit of accuracy, but the loss is barely measurable. To be sure, I exhaustively tried all possible sample and segment range values up to a bit rate of 16 bits per component. The upside is obvious: it is 14.9% faster!
32 bit vs 64 bit variants
Many variants were implemented: some performed the segment range expansion with fixed point arithmetic and the clip range expansion with float32 arithmetic, while others did everything with fixed point. A mix of 32 bit and 64 bit arithmetic was also tried to compare the accuracy and performance tradeoff.
Generally, the 32 bit variants lost much more accuracy, by 1-2 orders of magnitude. It isn't clear how much this would impact the overall memory footprint on CMU and Paragon. The 64 bit variants had comparable accuracy to float32 arithmetic but ended up using more registers and more instructions. This often degraded the performance to the point of making them entirely uncompetitive in this synthetic test. Only a single variant came close to the original float32 performance, but it could never beat the fast coercion derivative.
The fastest 32 bit variant was as follows:
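In spirit it was something like this (an illustrative 32 bit fixed point sketch with made-up names, not the verbatim playground code):

```cpp
#include <cstdint>
#include <cstring>

// Every value below is an unsigned 0.32 fixed point fraction.
inline uint32_t fixed_mul_0_32(uint32_t lhs, uint32_t rhs)
{
    // 32x32 -> 64 bit multiply, keeping the top 32 fractional bits.
    return uint32_t((uint64_t(lhs) * uint64_t(rhs)) >> 32);
}

inline float decompress_fixed32(uint32_t normalized,
                                uint32_t segment_min, uint32_t segment_extent,
                                uint32_t clip_min, uint32_t clip_extent)
{
    // Segment then clip range expansion, entirely in 32 bit fixed point.
    const uint32_t segment_value = fixed_mul_0_32(normalized, segment_extent) + segment_min;
    const uint32_t clip_value = fixed_mul_0_32(segment_value, clip_extent) + clip_min;

    // Coerce the top 23 bits into the float32 mantissa with the same trick as
    // before: build a value in [1.0 .. 2.0) and subtract 1.0.
    const uint32_t float_bits = (clip_value >> 9) | 0x3F800000;
    float result;
    std::memcpy(&result, &float_bits, sizeof(result));
    return result - 1.0f;
}
```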
Despite being 3 instructions shorter and using faster instructions, it was 14.4% slower than the fast coercion float32 variant. This is likely a result of pipelining not working out as well. It is entirely possible that in the real decompression code things could end up pipelining better making this a faster variant. Other processors such as those used in consoles and mobile devices also might perform differently and proper measuring will be required to get a definitive answer.
The general consensus seems to be that fixed point arithmetic can yield higher accuracy and performance but it is highly dependent on the data, the algorithm, and the processor it runs on. I can corroborate this and conclude that it might not help out all that much for animation compression and decompression.
Next steps
All of this work was performed in a branch that will NOT be merged into develop. However, some changes will be cherry-picked by hand. In the short term, the conclusions reached here will not be integrated into the main branches just yet. The primary reason for this is that while I have extensive scripts and tools to track the accuracy, memory footprint, and compression performance, I do not have robust tooling in place to track decompression performance on the various platforms that are important to us.
Once we are ready, the fast coercion variant will land first, as it appears to be an obvious drop-in replacement, and some fixed point variants will also be tried on various platforms.
The accuracy issues will have to be fixed some other way, and I already have some good ideas how: idea 1, idea 2, idea 3, idea 4, idea 5.