Animation Compression Library: Release 0.4.0

This marks the fourth release of ACL. It contains a lot of good stuff but most notable is the addition of segmenting support. I have not had the chance to play with the settings much yet but using segments of 16 key frames reduces the memory footprint by about 13% with variable quantization under uniform sampling. Adding range reduction on top of it (per segment), further reduces the memory footprint by another 10%. This is very significant!

Some optimizations to the compression time also made it in, reducing it by 4.3x with no compromise to quality.

You can see the latest numbers here as well as how they compare against the previous releases here. Note that the documentation contains more graphs than I will share here.

This also represents the first release where graphs have been generated, allowing us an unprecedented view into how the ACL and Unreal algorithms perform. As such, I will detail what is noteworthy, and thus this blog post will be a bit long. Grab a coffee and buckle up!


  • ACL compresses better than Unreal for nearly every clip in the CMU database.
  • ACL is much smaller than Unreal (23.4%), is more accurate (2x+), and compresses much faster (4.68x).
  • ACL performs as expected and optimizes properly for the error threshold used, validating our assumptions.
  • A threshold of 0.1cm is good enough for production use in Unreal as the overwhelming majority (98.15%) of the samples have an error smaller than 0.02cm.

Why compare against Unreal?

As I have previously mentioned, Unreal 4 has a very solid error metric and good implementations of common animation compression techniques. It is most definitely representative of the state of animation compression in game engines everywhere.

NOTE: In the images that follow, the results for UE4 @ 1.0cm were nearly identical to those at 0.1cm and were thus omitted for brevity.

Performance results

ACL 0.4 compresses the CMU database down to 82.25mb in 50 minutes single-threaded and 5 minutes multi-threaded with a maximum error of 0.0635cm. Unreal 4.15 compresses it down to 107.94mb in 3 hours and 54 minutes single-threaded with a maximum error of 0.0850cm (1.0cm threshold used). Importantly, this is achieved with no compromise to decompression speed (although not yet measured, it is estimated to be as fast or faster with ACL).

Compression ratio VS max error per clip

As can be seen on the above image, ACL performs quite well here. The error is very low and the compression quite high in comparison to Unreal.

Compression ratio distribution

Here we see the full distribution of the compression ratio over the CMU database. UE4 @ 0.01cm fails to do better than dropping the quaternion W and storing everything at full precision most of the time, which is why the compression ratio is so consistent. UE4 @ 0.1cm performs similarly in that key reduction fails very often on this database and as a result simple quantization is most often selected.

Compression ratio distribution (bottom 10%)

Here is a snapshot of the bottom 10% (10th percentile and lower). We can see some similarities in shape at the bottom and top 10%.

Compression ratio by clip duration

We can see on the above image that Unreal performs consistently regardless of the animation clip duration but ACL performs slightly better the longer the clip is. This is most likely a direct result of using range reduction twice: once per clip, and once per segment.

Compression ratio by clip duration (shortest 100)

Both algorithms perform similarly for the shortest clips.

How accurate are we?

Max error distribution

The above image gives a good view of how accurate the algorithms are. We can see ACL @ 0.01cm and UE4 @ 0.01cm quickly reach the error threshold and only about 10% of the clips exceed it. UE4 @ 0.1cm is less accurate but still pretty good overall.

The biggest source of error in both ACL and Unreal comes from the usage of the simple quaternion format consisting of dropping the W component to later reconstruct it at runtime. As it turns out, this is terribly inaccurate when that component is very small. Better formats exist and will be implemented later.
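The reconstruction in question can be sketched in a few lines (a minimal illustration of the technique, not the actual ACL or Unreal code):

```cpp
#include <cmath>

// Reconstruct the dropped W component of a unit quaternion from X, Y, Z.
// Assumes the quaternion was normalized and W was made non-negative
// before it was dropped (flipping the sign yields the same rotation).
inline float reconstruct_w(float x, float y, float z)
{
    const float w_squared = 1.0f - (x * x + y * y + z * z);
    // Clamp to guard against tiny negative values caused by rounding.
    return std::sqrt(w_squared > 0.0f ? w_squared : 0.0f);
}
```

When W is large the reconstruction is stable, but when W is near zero, `1.0f - (x*x + y*y + z*z)` is the difference of two nearly equal numbers and most of the significant bits cancel away, which is exactly the inaccuracy described above.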

ACL performs worse on a larger number of clips, likely as a result of range reduction causing a precision loss on some clips. At some point, ACL should be able to detect this and turn range reduction off when it isn’t needed.

Max error by clip duration

There does not appear to be any correlation between the max error in a clip and its duration, as expected. One thing stands out though: the longer a clip is, the noisier the error appears to be. This is because a longer clip is more likely to contain a bad quaternion W that fails to reconstruct properly.

Over the years, I’ve read my fair share of animation compression papers and posts. And while they all measure the error differently, the one thing they have in common is that they only talk about the worst error within a clip (or the whole set of clips). As I have previously mentioned, how you measure the error is very important and must be done carefully, but that is not all. Using the worst error within a given clip does not give a full picture. What about the other bones in the clip? What about the other key frames? Do I have a single bone on a single key frame that violates my threshold, or do I have many?

In order to get a full and clear picture, I dumped the error of every bone at every key frame in the original clips. This represents over 37 million samples for the CMU database.

Distribution of the error for every bone at every key frame

The above image is amazing!

Distribution of the error for every bone at every key frame (top 10%)

The above two images clearly show how terrible the max clip error is at giving insight into the true error. Here are some numbers visible only in the exhaustive graphs:

  • ACL crosses the 0.01cm error threshold at the 99.85th percentile (only 0.15% of our values exceed the threshold!)
  • UE4 @ 0.01cm crosses 0.01cm at the 99.57th percentile, almost just as good
  • UE4 @ 0.1cm crosses 0.01cm at the 49.8th percentile
  • UE4 @ 0.1cm crosses 0.02cm at the 98.15th percentile

This clearly shows why 0.1cm might be good enough for production use in Unreal: half our values remain at or below 0.01cm and 98% of the values are below 0.02cm.

The previous images also clearly show how aggressive ACL is at reducing the memory footprint and at maximizing the error up to the error threshold. Therefore, the error threshold must be very conservative, much more so than for Unreal.

Why ACL is re-inventing the wheel

As some have commented in the past, ACL is largely re-inventing the wheel here. As such, I will detail the rationale for it a bit further.

Writing a whole animation blending middleware such as Granny or Morpheme would not have been practical. Just matching the production quality implementations out there would have taken over a year part time. Even assuming I could have managed to implement something compelling, the cost of switching to a new animation runtime is very high for a game team: animators need to learn new tools and workflows, the engine integration might be tightly coupled, and there is no clear way to migrate old assets to the new format. Middleware is also being deprecated increasingly frequently. In that regard, the market has largely spoken: most games released today use one of the major engines (Unreal, Unity, Lumberyard, Stingray, etc.), while large studios such as Activision, Electronic Arts, and Ubisoft routinely maintain in-house engines with their own custom animation runtimes. Regardless of the quality or feature set, it would have been highly unlikely for a new middleware to ever be used for something significant.

On the other hand, animation compression is a much smaller problem. Integration is easy: everything is pure C++ headers and most engines out there already support more than one animation compression algorithm. This makes migrating existing assets a trivial task provided the few required features are supported (e.g. 3D scale). Any engine or middleware could integrate ACL with few to no issues to be expected once it is production ready.

Animation compression is also a wheel that NEEDS re-inventing. Of all my blog posts, a single post receives the overwhelming majority of my traffic: animation compression in Unity. Why is it so popular? Because as I mention in said post, accuracy issues will be common in Unity and the memory footprint large for high accuracy settings, as a direct result of their error metric. Unity is also not alone: Stingray and Lumberyard both use the same metric. It is a VERY common error metric and it is terrible. Academic papers on this topic often use different, poor error metrics and show little to no data to back their results and claims. This makes evaluating these papers for real world usage in games very problematic.

Take this paper for example. They use the CMU database as well. Their error metric uses the leaf bone positions in object/world space as a measure of accuracy, which entirely ignores the rotational error of the leaf bones. They show a single graph of their results and two short tables, and do not detail the data further. Compare this with the wealth of information I was able to pull out and publish here. Even though ACL is much stricter when measuring the error, it is obvious that wavelets fail terribly to compete at the same level of accuracy (a result that barely comes through in their published findings). Note that they make no mention of what quality level one might realistically be able to use.

Here is another recent paper published by someone I have met and have great respect for. The paper does not mention which error metric was used to compare against what they had prior, nor does it mention how competitive their previous implementation was. It does not publish any concrete data either, only claiming that the memory footprint was reduced by 65% on average compared to their previous in-house techniques. It does provide a supplemental video showing a small curated list of clips along with some statistics, but without further information it is impossible to objectively evaluate how it performs and where it lies on the spectrum of published techniques. Despite these shortcomings, it looks very promising (David knows his stuff!) and I am certainly looking forward to implementing this within ACL.

ACL does not only strive to improve on existing techniques; it will also establish a much-needed baseline to compare against and set a standard for how animation compression should be measured.

Next steps

The results so far clearly show that ACL is one step closer to being production ready. The next few months will focus on bridging that gap towards reaching v1.0.0. In the coming releases, scale support will be added as well as support for other leading platforms. This will be done through a rudimentary Unreal 4 integration to make sure it is tested in a real engine and thus real world settings.

No further effort on my part will be made towards improving the above results until our first production release is made. However, Cody Jones is working on integrating curve key reduction in the meantime.

Special thanks to Cody and Martin Turcotte for their constant feedback and contributions!

Math accuracy: Normalizing quaternions

While investigating precision issues with ACL, I ran into two problems that I hadn’t seen documented elsewhere and that slightly surprised me.

Dot product

Calculating the dot product between two vectors is a very common operation used for all sorts of things. In an animation compression library, its primary use is normalizing quaternions. Due to the nature of the code, accuracy is very important as it can impact the final compressed size as well as the resulting decompression error.

SSE 4 introduced a dot product instruction: DPPS. It allows the generated code to be more concise and compact by using fewer registers and instructions. I won’t speak to its performance here, but sadly its accuracy is not good enough for us, by a tiny yet important sliver.

For the purpose of this blog post, we will use the following nearly normalized quaternion as an example: { X, Y, Z, W } = { -0.6767403483, 0.7361232042, 0.0120376134, -0.0006215832 }. This is a real quaternion from a real clip of the Carnegie-Mellon University (CMU) motion capture database that proved to be problematic. With doubles, the dot product is 1.0000001612809224.

Using plain C++ yields the following code and assembly (compiled with AVX support under Visual Studio 2015 with an x64 target):

  • The result is: 1.00000024. Not quite the same but close.

Using the SSE 4 dot product instruction yields the following code and assembly:

  • The result is: 1.00000024.

Using a pure SSE 2 implementation yields the following assembly:

  • The result is: 1.00000012.

These are all nice but it isn’t immediately obvious how big the impact can be. Let’s see how they perform after taking the square root (note that the SSE 2 SQRT instruction is used here):

  • C++: 1.00000012
  • SSE 4: 1.00000012
  • SSE 2: 1.00000000

Again, these are all pretty much the same. What happens when we take the square root reciprocal after 2 iterations of Newton-Raphson?

  • C++: 0.999999881
  • SSE 4: 0.999999881
  • SSE 2: 0.999999940

With this square root reciprocal, here is how our quaternions look after being multiplied by it to normalize them, along with their associated dot products.

  • C++: { -0.676740289, 0.736123145, 0.0120376116, -0.000621583138 } = 0.999999940
  • SSE 4: { -0.676740289, 0.736123145, 0.0120376116, -0.000621583138 } = 1.00000000
  • SSE 2: { -0.676740289, 0.736123145, 0.0120376125, -0.000621583138 } = 0.999999940

Here is the dot product calculated with doubles:

  • C++: 0.99999999381912441
  • SSE 4: 0.99999999381912441
  • SSE 2: 0.99999999384079208

And the new square root:

  • C++: 0.999999940
  • SSE 4: 1.00000000
  • SSE 2: 0.999999940

Now the new reciprocal square root:

  • C++: 1.00000000
  • SSE 4: 1.00000000
  • SSE 2: 1.00000000

After all of this, our delta from a true length of 1.0 (as calculated with doubles) was 1.612809224e-7 before normalization. Here is how they fare afterwards:

  • C++: 6.18087559e-9
  • SSE 4: 6.18087559e-9
  • SSE 2: 6.15920792e-9

And thus, the difference between using SSE 4 and SSE 2 is just 2.166767e-11.

As it turns out, the SSE 2 implementation appears to be the most accurate one and yields the lowest decompression error as well as a smaller memory footprint (by a tiny bit).

Normalizing a quaternion

There are two mathematically equivalent ways to normalize a quaternion: taking the dot product, calculating the square root, and dividing the quaternion by the result; or taking the dot product, calculating the reciprocal square root, and multiplying the quaternion by the result.
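Both methods can be sketched as follows (plain scalar code with a naive reciprocal square root; the text uses SIMD code and Newton-Raphson iterations instead):

```cpp
#include <cmath>

struct Quat { float x, y, z, w; };

inline float quat_dot(Quat q)
{
    return q.x * q.x + q.y * q.y + q.z * q.z + q.w * q.w;
}

// Method 1: divide the quaternion by the square root of the dot product.
inline Quat quat_normalize_div(Quat q)
{
    const float length = std::sqrt(quat_dot(q));
    return { q.x / length, q.y / length, q.z / length, q.w / length };
}

// Method 2: multiply the quaternion by the reciprocal square root.
inline Quat quat_normalize_mul(Quat q)
{
    const float inv_length = 1.0f / std::sqrt(quat_dot(q));
    return { q.x * inv_length, q.y * inv_length, q.z * inv_length, q.w * inv_length };
}
```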

Are the two methods equivalent with floating point mathematics? Again, we will not discuss the performance implications as we are only concerned with accuracy here. Using the previous example quaternion and using the SSE 2 dot product yields the following result with the first method:

  • Dot product: 1.00000012
  • Length: sqrt(1.00000012) = 1.00000000
  • Normalized quaternion using division: { -0.6767403483, 0.7361232042, 0.0120376134, -0.0006215832 }
  • New dot product: 1.00000012
  • New length: 1.00000000

And now using the reciprocal square root with 2 Newton-Raphson iterations:

  • Dot product: 1.00000012
  • Reciprocal square root: 0.999999940
  • Normalized quaternion using multiplication: { -0.676740289, 0.736123145, 0.0120376125, -0.000621583138 }
  • New dot product: 0.999999940
  • New length: 0.999999940
  • New reciprocal square root: 1.00000000

By using the division, normalization fails to yield a more accurate quaternion because the square root is 1.0. The reciprocal square root instead allows us to obtain a more accurate quaternion, as demonstrated in the previous section.


It is hard to see if the numerical difference is meaningful but over the entire CMU database, both tricks together help reduce the memory footprint by 200 KB and lower our error by a tiny bit.

For most game purposes, the accuracy implication of these methods does not matter all that much and rarely has a measurable impact. Picking whichever method is fastest to execute might just be good enough.

But when accuracy is of a particular concern, special care must be taken to ensure every bit of precision is retained. This is one of the motivating reasons for ACL having its own internal math library: granular control over performance and accuracy.

Animation Compression Library: Release 0.3.0

This release marks an important milestone. It now supports a fully variable bit rate and it performs admirably so far. The numbers don’t lie. Without using any form of key reduction, we match the compression ratio of Unreal 4 (which uses a mix of linear key reduction with a form of variable quantization) and many more tricks will follow to push this even further. It is worth noting that this new variable bit rate algorithm is entirely different from the one I presented at GDC 2017 and it should outperform it. In due time, more stats and graphs will be published to outline how the data looks across the whole dataset.

While v0.3.0 remains a pre-release, we are quickly approaching a production ready state. Already, for the vast majority of clips, the error introduced is invisible to the naked eye and the performance is there to match. The major features missing to reach the production ready state are scale support (sadly, the Carnegie-Mellon data set does not contain any scale, so testing this will be problematic) and proper multi-platform support (iOS, OS X, Android, clang, gcc, etc.). Both of these are easily solved problems, which is why they were deferred to future releases.

Version 0.4.0 will aim to introduce clip segmenting and hopefully curve based key reduction. Segmenting should improve our accuracy and at the same time reduce the memory footprint. Curve key reduction will of course reduce the memory footprint further as well, perhaps dramatically so. Stay tuned!

Introducing ACL

Over the years, I’ve had my fair share of discussions about animation compression and two things became obvious over time: we were all (re-)doing similar things and none of us had access to a state of the art implementation to compare against. This led to rampant speculation about which algorithm was superior or inferior. Having implemented a few algorithms in the past, I have finally decided to redo all that work once more, in the open this time. Say ‘Hello’ to the Animation Compression Library (ACL for short).

To quote the readme:

This library has two primary goals:

  • Implement state of the art and production ready animation compression algorithms
  • Serve as a benchmark to compare various techniques against one another

Over the next few months, I hope to implement state of the art versions of common algorithms and to surpass what game engines currently offer. It is my hope that this library can serve as the foundation for an industry standard so that together we may move forward, well past the foot sliding issues of yesteryear!

Optimizing 4x4 matrix multiplication

In modern video games, the 4x4 matrix multiplication is an important cornerstone. It is used for a very long list of things: moving individual character joints, physics simulation, rendering, etc. To generate a single video game image (and we typically generate between 25 and 60 per second), several thousand matrix multiplications will take place. Today, we will take an in-depth look at such a fundamental piece of code.

As I will show, we can improve on some of the most common implementations out in the wild. I will use DirectX Math as a reference here but I have seen identical implementations in many state of the art game engines.

This blog post will be broken down into four sections:

Note that all the code for this can be found here. Feel free to toy with it and replicate the results or contribute your own.

Our test cases

In order to keep our observations grounded in reality, we will use three test cases that represent common heavy usage of 4x4 matrix multiplication. These are very synthetic in nature but they will make profiling and measuring immensely easier. Modern game engines do many things with many threads, which can make profiling on PC somewhat more complicated, especially for something as short and simple as matrix multiplication.

Test case #1

Our first test case applies a constant matrix to an array of 64 matrices. Aside from our constant matrix, each input will be read from memory (here everything fits in our processor cache but it doesn’t matter much) and each output will be written back to memory. This code is meant to simulate the common operation of transforming an array of object space matrices into world space matrices, perhaps for skinning purposes.

Test case #2

Our second test case transforms an array of 64 local space matrices into an array of 64 object space matrices. To perform this operation, each local space matrix is multiplied by the parent object space matrix. The root matrix is trivial and equal in both local and object space as it has no parent. In this contrived example the parent matrix is always the previous entry in the array but in practice it would be some arbitrary index previously transformed. This operation is common at the end of the animation runtime where the pose generated will typically be in local space.

Test case #3

Our third test case takes two constant matrices and writes the result to a static array. The array is made static to prevent the compiler from stripping the code. This code is synthetic and meant to profile the one-off multiplications that happen everywhere in gameplay code. We perform the operation 64 times to help us measure the impact since the code is very fast to begin with.

Function signature variations

Our reference implementation taken from DirectX Math has the following signature:
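The declaration, as found in DirectXMath’s headers:

```cpp
// Simplified from DirectXMath: XM_CALLCONV expands to __vectorcall where
// supported, FXMMATRIX passes the matrix by register, CXMMATRIX by const&.
inline XMMATRIX XM_CALLCONV XMMatrixMultiply(FXMMATRIX M1, CXMMATRIX M2);
```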

There are a few things that are noteworthy here. The function is marked inline but due to its considerable size, the function is generally never inlined. It also uses the __vectorcall calling convention with the macro XM_CALLCONV. This allows up to 6 SIMD input arguments to be passed by register (the default calling convention passes them by value on the stack, unlike on PowerPC) and the return value can also be up to 4 SIMD outputs passed by register. This also works for aggregate types such as XMMATRIX. The function takes 2 arguments: M1 is passed by register with the help of FXMMATRIX and M2 is passed by const & with the help of CXMMATRIX.

In our data, this function signature will be called: reg

We can vary the function signature in a number of ways and it will be interesting to compare the results. I came up with a number of variations. They are as follows.

Force inline

As mentioned, since our function is very large, inlining will typically fail to happen. However, in very hot code, it still makes sense to inline the function.

This function signature will be called: inl

Pass everything from memory

An obvious change we can make is to pass both input arguments as const &. In many cases our matrices might not be cached in local registers to begin with and we have to load them from memory anyway (such as in test case #2).

This function signature will be called: mem

Flip our matrix arguments

In our matrix implementation, the rows of M2 are multiplied whole with each matrix element from M1. The code ends up looking like this:
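A scalar sketch of that pattern, using hypothetical Vec4/Mat4 helper types (the real implementation performs the same steps with SIMD registers and shuffles):

```cpp
struct Vec4 { float x, y, z, w; };   // hypothetical helper types
struct Mat4 { Vec4 r[4]; };          // row-major 4x4 matrix

// acc + v * s, applied to all 4 components.
inline Vec4 vec_mul_add(Vec4 v, float s, Vec4 acc)
{
    return { v.x * s + acc.x, v.y * s + acc.y, v.z * s + acc.z, v.w * s + acc.w };
}

// One output row: each element of M1's row scales a whole row of M2.
inline Vec4 mul_row(Vec4 m1_row, const Mat4& m2)
{
    Vec4 result = { 0.0f, 0.0f, 0.0f, 0.0f };
    result = vec_mul_add(m2.r[0], m1_row.x, result);
    result = vec_mul_add(m2.r[1], m1_row.y, result);
    result = vec_mul_add(m2.r[2], m1_row.z, result);
    result = vec_mul_add(m2.r[3], m1_row.w, result);
    return result;
}
```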

This repeats 4 times, once for each row of M1. It is obvious that we can cache our 4 rows from M2, and indeed the compiler typically does so for us in our reference implementation. Each of those 4 rows will be needed again and again, but the same cannot be said of the rows of M1, which are only needed temporarily. It would thus make sense to pass the matrix arguments in the opposite order: M2 first by register and M1 second by const &.

Note that we use a macro to perform the flip cleanly. I would have preferred a force inlined function but the compiler was not generating clean assembly from it.

This function signature will be called: flip

Expanded matrix argument

Even though the __vectorcall calling convention conveniently passes our matrix in 4 registers, it might help the compiler make different decisions if we are explicit about our intentions.

Our expanded variant will always use the flipped argument ordering. Measuring the non-flipped ordering is left as an exercise to the reader.

This function signature will be called: exp

Return value by argument

Another thing that is very common for a matrix multiplication implementation is to have the return value as a pointer or reference in a third argument.

Again, this might help the compiler make different optimization choices. Note as well that implementations with this variant must explicitly cache the rows of M2 in order to produce the correct result in the case where the result is written to M2. It also improves the generated assembly, as otherwise the output matrix could alias the arguments, preventing the compiler from performing the caching automatically for you.
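A scalar sketch of why the explicit caching matters, with a hypothetical Mat4f type (if the output argument aliases M2, the rows of M2 must be copied before any output element is written):

```cpp
struct Mat4f { float m[4][4]; };  // hypothetical row-major 4x4 matrix

// Result returned through the third argument; 'result' may alias 'm2'.
inline void mat_mul(const Mat4f& m1, const Mat4f& m2, Mat4f& result)
{
    const Mat4f m2_copy = m2;  // explicit caching: safe even if &result == &m2
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
        {
            float sum = 0.0f;
            for (int i = 0; i < 4; ++i)
                sum += m1.m[row][i] * m2_copy.m[i][col];
            result.m[row][col] = sum;
        }
}

// Tiny demo: the result aliases the second argument.
inline float mat_mul_alias_demo()
{
    // m1 scales row 0 by 2 and adds row 0 into row 1.
    Mat4f m1 = {{ {2, 0, 0, 0}, {1, 1, 0, 0}, {0, 0, 1, 0}, {0, 0, 0, 1} }};
    Mat4f value = {{ {1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}, {13, 14, 15, 16} }};
    mat_mul(m1, value, value);
    // Correct row 1 is original row 0 + row 1 = { 6, 8, 10, 12 }; without the
    // explicit copy, row 1 would read the already-overwritten row 0 instead.
    return value.m[1][1];
}
```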

This function signature can be applied to all our variants and it will add the suffix 2.

Permute all the things!

Taking all of this together and permuting everything yields 12 variants, as follows:

In our data, they are called:

  • reg
  • reg2
  • reg_flip
  • reg_flip2
  • reg_exp
  • reg_exp2
  • mem
  • mem2
  • inl
  • inl2
  • inlexp
  • inlexp2

Hopefully this covers a large extent of common and sensible variations.

Our competing implementations

I was able to come up with six distinct implementations of the matrix multiplication, including the original reference. Note that I did not attempt to make the fastest implementation possible; there are other things we could try to make them faster. I also made sure that each version gave a result exactly the same as the reference implementation, down to the last bit (binary exact).


The reference implementation is quite large; as such, I will not include the full source here, but the code can be found here.

The reference regexp2 variant uses 10 XMM registers and totals 70 instructions.


In our reference implementation, an important part can be tweaked a bit.

We load a row from M1, extract each component, and replicate it into the 4 lanes of our SIMD register. This will compile down to 1 load instruction followed by 4 shuffle instructions. This was very common on older consoles: loads from memory were very expensive and none of the other instructions could work directly from memory. However, on SSE and in particular with AVX, we can do a bit better. We can use the _mm_broadcast_ss instruction. It takes as input a pointer to a scalar floating point value and it will output the replicated value over our 4 SIMD lanes. We thus avoid the separate load instruction entirely.
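The difference can be sketched with intrinsics (the SSE2 shuffle version is shown; the AVX broadcast that replaces it is noted in the comment):

```cpp
#include <xmmintrin.h>  // SSE

// Classic splat: load the row once, then shuffle each component across
// all 4 lanes. This costs 1 load + 4 shuffles per row of M1.
inline __m128 splat_y(const float* m1_row)
{
    const __m128 row = _mm_loadu_ps(m1_row);
    return _mm_shuffle_ps(row, row, _MM_SHUFFLE(1, 1, 1, 1));
}

// With AVX, the separate load disappears; the component is replicated
// straight from memory (requires <immintrin.h> and AVX support):
//   __m128 y = _mm_broadcast_ss(&m1_row[1]);

// Tiny demo: splat the Y component of { 1, 2, 3, 4 } and read back lane 0.
inline float splat_y_demo()
{
    const float row[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    return _mm_cvtss_f32(splat_y(row));
}
```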

The code for this variant can be found here.

The version 0 regexp2 variant uses 7 XMM registers and totals 58 instructions.


Another change we can perform is inspired by this post on StackOverflow. I rewrote the assembly into C++ code that uses intrinsics to try and keep it comparable.

Two versions were written: version 2 uses load/shuffle (code here) and version 1 uses broadcast (code here).

Branching was notoriously slow on the old consoles; it will be interesting to see how newer hardware performs.

The version 1 regexp2 variant uses 7 XMM registers and totals 23 instructions. The version 2 regexp2 variant uses 10 XMM registers and totals 37 instructions.

Handwritten assembly

Similar to our looping versions, I also kept the referenced handwritten assembly version. I made a few tweaks to make sure the results were binary exact. Sadly, the tweaks required the usage of one extra register. Having run out of volatile registers, I elected to load the first row of M2 directly from memory with the multiply instruction during every iteration.

Only two variants were implemented: regexp2 and mem2.

Two versions were written: version 3 uses load/shuffle (code here) and version 4 uses broadcast (code here).

The version 3 regexp2 variant uses 5 XMM registers and totals 21 instructions. The version 4 regexp2 variant uses 5 XMM registers and totals 17 instructions.

The results

For our purposes, each test will be run 1000000 times and the cumulative time will be considered to be 1 sample. We will repeat this to gather 100 samples. To avoid skewing in our data that might result from various external sources (CPU frequency changes, other OS work, etc.), we will retain and use the 80th percentile from our dataset. Due to the simple nature of the code, this should be good enough for us to draw meaningful conclusions. All measurements are in milliseconds.
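The percentile selection itself can be sketched in a few lines (nearest-rank method; a hypothetical helper for illustration, not the actual parsing code):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns the sample at the given percentile (0-100), nearest-rank style.
inline double percentile(std::vector<double> samples, double pct)
{
    std::sort(samples.begin(), samples.end());
    const std::size_t index =
        static_cast<std::size_t>(pct / 100.0 * (samples.size() - 1));
    return samples[index];
}
```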

All of my raw results are parsed with a simple python script to extract the desired percentile and format it in a table form. The script can be found here.

I ran everything on my desktop computer which has an Intel i7-6850K processor. While my CPU differs from what the Xbox One and PlayStation 4 use, they both support AVX and could benefit from the same changes.

I also measured running the test cases with multithreading enabled to see if the results were consistent. Since the results were indeed consistent, we will only talk about the single threaded case but all the data for the test results can be found here:

Test case results

Here are the results for our 3 test cases:

Test Case #1 Results

Test Case #2 Results

Test Case #3 Results

A few things are immediately obvious:

  • Versions 1 and 2, the looping intrinsic versions, are terribly slow. I moved them to the right so we can focus on the left part.
  • Version 0 is consistently faster than our reference implementation.

Here are the same results but only considering the best 3 variants (regexp2, mem2, and inlexp2) and the best 4 versions (reference, version 0, version 3, and version 4).

Best Results

Load/shuffle versus broadcast

Load/Shuffle VS Broadcast

Overwhelmingly, we can see that the versions that use broadcast are faster than their counterparts that use load/shuffle. This is not too surprising: we use fewer registers and fewer instructions, and as a result fewer registers spill onto the stack. This is more significant when the function isn’t force inlined since, in our test cases, whatever we spill on the stack ends up hoisted outside of our loops when inlined.

The fact that we use fewer registers and instructions also has other side effects, namely it can help the compiler to inline functions. In particular, this is the case for version 1 and 2: version 1 uses broadcast and gets inlined automatically while version 2 uses load/shuffle and does not get inlined.

Output in registers versus memory

Output: Register VS Memory

For test cases #1 and #3, passing our return value as an argument is a net win when there is no inlining. This remains true to a lesser extent even when the functions are force inlined which means it helps the compiler make better choices.

However, for test case #2, it can sometimes be a bit slower. It seems that the assembly generated at the call site isn’t as clean as it could be. It’s possible that by tweaking the test case code a bit, performance could be improved.

Flipped versus expanded

Flipped VS Expanded

Looking only at version 0, the behaviour seems to differ depending on whether the result is passed as an argument or by register. In the regflip and regexp variants, performance can be faster (test case #2), the same (test case #1), or slower (test case #3). It seems there is high variability in what the compiler chooses to do. On the other hand, with the regflip2 and regexp2 variants, performance is generally faster. Test case #2 has about equal performance but as we have seen, that test case seems to favour results being returned by register.


Inlining On/Off

As it turns out, inlining sometimes gives a massive performance gain and sometimes it comes down to about the same. In general, it is best to let the compiler make inlining decisions but sometimes in very hot code, it is desirable to manually force the inlining for performance reasons. It thus makes sense to provide at least 2 versions of matrix multiplication: with and without force inlining.


Looping VS Unrolled

The looping versions are quite interesting. The 2 versions that use intrinsics perform absolutely terribly. They are worse by far, generally breaking out of the charts above. Strangely, they seem to benefit massively from passing the result as an argument (not shown on the graph above). Even with the handwritten assembly versions, we can see that they are generally slower than our unrolled intrinsic version 0. As it turns out, branching is still not a great idea in hot code even with modern hardware.

Is handwritten assembly worth it?

Handwritten Assembly VS Intrinsics

Looking at our looping versions, it is obvious that carefully crafting the assembly by hand can still give significant results. However, we must be careful when doing so. In particular, with Visual Studio, handwritten assembly functions will never be inlined in x64, even by the linker. Something to keep in mind.

Best of the best

Reference VS The Best

In our results, a clear winner stands above all others: version 0 inlexp2:

  • In test case #1, it is 34% faster than the reference implementation
  • In test case #2, it is 16% faster than the reference implementation
  • In test case #3, it is 31% faster than the reference implementation

Even when it isn’t the fastest implementation, it is within measuring error of the leading alternative. And that leading alternative is always a variant of version 0.


As demonstrated by our data, even a hyper optimized piece of code such as matrix multiplication can sometimes be improved by new hardware features such as the AVX broadcast instruction. In particular, the broadcast instruction allows us to reduce register pressure which avoids spilling registers on the stack and saves on the corresponding instructions that do so. On a platform such as x64, register pressure is a real and important problem that must be taken into account for complex and hot code.

From our results, it seems to make sense to provide 2 implementations for matrix multiplication:

  • One of regflip2, regexp2, or mem2 that does not force inlining, suitable for everyday usage
  • inlexp2 that forces inlining, perfect for that piece of hot code that needs to save every cycle

This keeps things simple for the user: all variants return the result in a 3rd argument. Macros can be used to keep things clean and fast.

As always with optimizations, it is important to measure often and to never blindly make a change without measuring first.

DirectX Math commit d1aa003