Introducing Realtime Math v1.0
19 Jan 2019Almost two years ago now, I began writing the Animation Compression Library. I set out to build it to be production quality which meant I needed a whole lot of optimized math. At the time, I took a look at the landscape of math libraries and I opted to roll out my own. It has served me well, propelling ACL to success with some of the fastest compression and decompression performance in the industry. I am now proud to announce that the code has been refactored out into its own open source library: Realtime Math v1.0 (RTM) (MIT license).
There were a few reasons that motivated the choice to move the code out on its own:
- A significant amount of the ACL Continuous Integration build time is compiling and running the math unit tests which slows things down a bit more than I’d like
- It decouples code that will benefit from being on its own
- I believe it has its place in the landscape of math libraries out there
In order to support that last point, I reviewed 9 other popular and usable math libraries for realtime applications. I looked at these with the lenses of my own needs and experience, your mileage may vary.
Disclaimer: the list of reviewed libraries is in no way exhaustive but I believe it is representative. Note that Unreal Engine 4 is included for informational purposes as it isn’t really usable on its own. Libraries are listed in no particular order and I tried to be as objective as possible. If you spot any inaccuracies, don’t hesitate to reach out.
The list: Realtime Math, MathFu, vectorial, VectorialPlusPlus, C OpenGL Graphics Math (CGLM), OpenGL Graphics Math (GLM), Industrial Light & Magic Base (ILMBase), DirectX Math, and Unreal Engine 4.
TL;DR: How Realtime Math stands out
I believe Realtime Math stands out for a few reasons.
It is geared for high performance, deeply hot code. Most functions will end up inlined but the price to pay is an API that is a bit more verbose as a result of being C-style. When the need arises to use intrinsics, it gets out of the way and lets you do your thing. Only two libraries had what I would call optimal inlinability: Realtime Math and DirectX Math. Only those two libraries properly support the __vectorcall calling convention explicitly and only RTM handles GCC and Clang argument passing explicitly.
While it still needs a bit of love, quaternions are a first class citizen and it is the only standalone open source library I could find that supports QVV transforms (a rotation quaternion, a 3d scale vector, and a translation vector).
Realtime Math uses a coding style similar to the C++ standard library and feels clean and natural to read and write.
It consists entirely of C++11 headers, it runs almost everywhere, it supports 64 bit floating point arithmetic, and it sports a very permissive MIT license.
License
ACL is open source and uses the MIT license. I am never keen on adding dependencies and if I really have to, I want a permissive license free of constraints.
Library | License |
---|---|
Realtime Math | MIT |
MathFu | Apache 2.0 |
vectorial | BSD 2-clause |
VectorialPlusPlus | BSD 2-clause |
CGLM | MIT |
GLM | Modified MIT |
ILMBase | Custom but permissive |
DirectX Math | MIT |
Unreal Engine 4 | UE4 EULA |
Header only
For simplicity and ease of integration, I want ACL to be entirely made of C++11 headers. This also constrains any dependencies to the same requirement.
Library | Header Only |
---|---|
Realtime Math | Yes |
MathFu | Yes |
vectorial | Yes |
VectorialPlusPlus | Yes |
CGLM | Yes (optional lib) |
GLM | Yes |
ILMBase | No |
DirectX Math | Yes |
Unreal Engine 4 | No |
Verbosity, readability, and power
An important requirement for a math library is to be reasonably concise with average code without getting in the way if the need arises to dive right into raw intrinsics. In my experience, general math type abstractions take you very far but in order to squeeze out every cycle it is sometimes necessary to write custom per platform code. When this is required, it is important for the library to not hide its internals and leave the door open.
I am personally more a fan of C-style interfaces for a math library for various reasons: I can infer very well what happens under the hood (I have seen many libraries make fancy use of some operators that leave many newcomers to wonder what they do) and they are optimal for performance as we will discuss later. The downside of course is that they tend to be a bit more verbose. However, this largely boils down to a matter of personal taste.
vectorial is one of the few libraries that offers both a C-style interface and C++ wrappers and at the other end of the spectrum DirectX Math has both a namespace and a prefix for every type, constant and function.
Library | Verbosity |
---|---|
Realtime Math | Medium (C-style) |
MathFu | Light (C++ wrappers) |
vectorial | Light (C++ wrappers) and Medium (C-style) |
VectorialPlusPlus | Light (C++ wrappers) |
CGLM | Medium (C-style) |
GLM | Light (C++ wrappers) |
ILMBase | Light (C++ wrappers) |
DirectX Math | Medium++ (C-style with prefix and namespace) |
Unreal Engine 4 | Light (C++ wrappers) |
It is very common for C-style math APIs to typedef their types to the underlying SIMD type. Realtime Math, DirectX Math, and many others do this. While this is great for performance, it does raise one problem: type safety is reduced. While usually those interfaces will opt to not expose proper vector2 or vector3 types and instead rely on functions that simply ignore the extra components, it doesn’t work so well when vector4 and quaternions are mixed. Only Realtime Math, DirectX Math and CGLM have quaternions with C-style interfaces but only the first two have a distinct type for quaternions when SIMD intrinsics are disabled. This somewhat mitigates the issue because with both Realtime Math and DirectX Math you can compile without intrinsics and still have type safety validated there. Although at the end of the day, all three have functions with distinct prefixes for vector and quaternion math and as such type safety is unlikely to be an issue.
Type and feature support
By virtue or being an animation compression library, ACL’s needs are a bit different from a traditional realtime application. This dictated the need I had for specific types and features. I had no need for general 3x3 or 4x4 matrices as well as 2D vectors which are more commonly used in gameplay and rendering. However, 3x4 affine matrices, 3D and 4D vectors, quaternions, and QVV transforms (a quaternion, a vector3 translation, and a vector3 scale) are of critical importance. Those types are front and center in an animation runtime and I needed them to be fully featured and fast. Most of the libraries under review had way more features than I cared for (mostly for rendering) but generally missed proper or any support for quaternions and QVV transforms.
MathFu appears to have a bug where the Matrix 4x4 SIMD template specialization isn’t included by default and its quaternions are 32 bytes instead of the ideal 16 due to alignment constraints.
VectorialPlusPlus quaternions also take 32 bytes instead of 16 due to alignment constraints and most of their quaternion code appears to be scalar.
UE 4 is notable for being the only other library to support QVV and it does offer a VectorRegister type to support SIMD for Vector2/3/4 although most of the code written in the engine uses the scalar version.
Library | Vector2 | Vector3 | Vector4 | Quaternion | Matrix 3x3 | Matrix 4x4 | Matrix 3x4 | QVV |
---|---|---|---|---|---|---|---|---|
Realtime Math | SIMD | SIMD | SIMD | SIMD | SIMD | SIMD | SIMD | |
MathFu | SIMD | SIMD | SIMD | Partial SIMD | Scalar | SIMD | Scalar | |
vectorial | SIMD | SIMD | SIMD | |||||
VectorialPlusPlus | SIMD | SIMD | SIMD | Scalar | SIMD | SIMD | ||
CGLM | SIMD | SIMD | SIMD | SIMD | SIMD | SIMD | ||
GLM | SIMD | SIMD | SIMD | Partial SIMD | SIMD | SIMD | SIMD | |
ILMBase | Scalar | Scalar | Scalar | Scalar | Scalar | Scalar | ||
DirectX Math | SIMD | SIMD | SIMD | SIMD | SIMD | |||
Unreal Engine 4 | Scalar | Scalar | Scalar | SIMD | SIMD | SIMD |
SIMD architecture support
Equally important was the SIMD architecture support. I want to run ACL everywhere with the best performance possible, especially on mobile. SSE, AVX, and NEON are all equally important to me.
Worth noting that 2 years ago DirectX NEON support appeared almost exclusively to be for Windows ARM NEON and I have no idea if it runs on iOS or Android even today.
Library | SSE | AVX | NEON |
---|---|---|---|
Realtime Math | Yes | Yes | Yes |
MathFu | Yes | Yes | |
vectorial | Yes | Yes | |
VectorialPlusPlus | Yes | Partial | |
CGLM | Yes | Yes | Partial |
GLM | Yes | ||
ILMBase | |||
DirectX Math | Yes | Yes | Yes |
Unreal Engine 4 | Yes | Yes | Yes |
Platform and compiler support
Here things are a bit more complicated as libraries will list platforms but not compilers or compilers but not platforms. I need ACL to run everywhere and this means limiting myself to C++11 features.
- Realtime Math: Windows (VS2015, VS2017) x86 and x64, Linux (gcc5, gcc6, gcc7, gcc8, clang4, clang5, clang6) x86 and x64, OS X (Xcode 8.3, Xcode 9.4, Xcode 10.1) x86 and x64, Android clang ARMv7-A and ARM64, iOS (Xcode 8.3, Xcode 9.4, Xcode 10.1) ARM64
- MathFu: Windows, Linux, OS X, Android
- vectorial: Unlisted but probably Windows, Linux, OS X, Android, and iOS
- VectorialPlusPlus: Unlisted but probably Windows
- CGLM: Windows, Unix, and probably everywhere
- GLM: VS2013+, Apple Clang 6, GCC 4.7+, ICC XE 2013+, LLVM 3.4+, CUDA 7+
- ILMBase: Unlisted but probably Windows, Linux, OS X
- DirectX Math: VS2015 and VS2017, possibly elsewhere
- Unreal Engine 4: Windows (VS2015, VS2017) x64, Linux x64, OS X x64, Android ARMv7-A (no NEON) and ARM64, iOS ARM64
Continuous integration support
Continuous integration is a critical part of modern software development especially with C++ when multiple platforms are supported and maintained.
Library | Continuous Integration |
---|---|
Realtime Math | Yes |
MathFu | No |
vectorial | No |
VectorialPlusPlus | No |
CGLM | Yes |
GLM | Yes |
ILMBase | No |
DirectX Math | No |
Unreal Engine 4 | Not public |
Dependencies
I’m not personally a big fan of pulling in tons of dependencies, especially for a math library. As mentioned earlier, the Unreal Engine 4 math library isn’t really usable on its own because of this but is included regardless.
Library | Dependencies |
---|---|
Realtime Math | |
MathFu | vectorial (BSD 2-clause) |
vectorial | |
VectorialPlusPlus | HandyCPP (custom license) |
CGLM | |
GLM | |
ILMBase | |
DirectX Math | |
Unreal Engine 4 | Unreal Engine 4 |
Floating point support
When I got started with ACL, I wasn’t sure at the time if 64 bit floating point arithmetic might offer superior accuracy or not and if it would be worth using. As a result, I needed the math code to support both float32 and float64 types for everything with a seamless API between the two for quick testing. It later turned out that the extra floating point precision isn’t helping enough to be worth using.
Library | Float 32 Support | Float 64 Support |
---|---|---|
Realtime Math | Yes | Yes (partial SIMD) |
MathFu | Yes | Yes (no SIMD) |
vectorial | Yes | |
VectorialPlusPlus | Yes | Yes (partial SIMD) |
CGLM | Yes | |
GLM | Yes | |
ILMBase | Yes (no SIMD) | Yes (no SIMD) |
DirectX Math | Yes | |
Unreal Engine 4 | Yes |
Inlinability
Due to the critical need for ACL to be as fast as possible on every platform, having the bulk of the math operations be inline is very important. Many things impact whether a function is inlined by the compiler but two stand out:
- Simple and short functions inline better
- Passing arguments by register needs fewer instructions which inlines better
Thankfully, most math function are fairly simple and short: add, mul, div, etc. C-style functions will generally have a slight advantage over C++ wrappers mainly because they also must track the implicit this pointer being passed around even if ultimately it is optimized out inside the caller. When the compiler needs to determine if it can inline a function, it uses a heuristic and the size of the intermediate assembly/IR/AST most likely plays a role. Generally speaking, C++ wrapper functions that are short will inline just fine but some operations have a harder time due to their size: matrix 4x4 multiplication, quaternion multiplication, and quaternion interpolation. For this reason, I personally favor a C-style API for this sort of code.
The second point is not to be underestimated. Most of the libraries in the list either take the arguments by value or by const reference. While passing SIMD types by value does the right thing on ARM and passes them by register (up to 4), it does not work for aggregate types like matrices and it does not work with the default x64 calling convention with MSVC. In order to be able to pass SIMD types by register with MSVC, you must use its __vectorcall calling convention. It also works for aggregate and wrapper types. Up to 6 registers can be used for this. On desktop and Xbox One, using __vectorcall is critical for high performance code and sadly, most libraries do not support it explicitly (and not all support it implicitly if the whole compilation unit is forced to use that calling convention). With Visual Studio 2015, __vectorcall is the difference between having quaternion interpolation getting inlined or not. When I added support for it in ACL, I measured a roughly 5% speedup during the decompression.
Note that once a function is inlined, whether the arguments are passed by register or not typically does not impact the generated assembly although it sometimes does (at least with MSVC especially when AVX is enabled).
Some libraries which use a generic vector template class with specializations for SIMD (like MathFu) sometime end up passing *float32 arguments by const-reference instead of by value which is often suboptimal when not inlined.*
Library | Inlinability | Register Passing |
---|---|---|
Realtime Math | Optimal (C-style + by register) | Explicit (everywhere) |
MathFu | Decent (C++ wrappers) | None |
vectorial | Good (C-style), Decent (C++ wrappers) | Implicit (C-style and ARM only) |
VectorialPlusPlus | Decent (C++ wrappers) | None |
CGLM | Good (C-style) | None |
GLM | Decent (C++ wrappers) | None |
ILMBase | Decent (C++ wrappers) | None |
DirectX Math | Optimal (C-style + by register) | Explicit (vectorcall and ARM only) |
Unreal Engine 4 | Decent (C++ wrappers) | None |
Multiplication order
An important point of contention is how things are multiplied. As the list below shows, the OpenGL way is by far the most popular for open source math libraries.
It all boils down to whether vectors are represented as a row or as a column. In the former case, multiplication with a matrix takes the form v' = vM
while in the later case we have v' = Mv
. Linear algebra typically treats vectors as columns and OpenGL opted to use that convention for that reason. If you think of matrices as functions that modify an input and return an output it ends up reading like this: result = object_to_world(local_to_object(input))
. This reads right-to-left as is common with nested function evaluation. In my opinion, this is quite awkward to work with as most modern programming languages (and western languages) read left-to-right. Most linear algebra formulas use abstract letters and names for things which somewhat hides this nuance but when I write code, I try to keep my matrix names as clear as possible: what space are the input and output in. While you could technically reverse the naming result = world_from_object * object_from_local * input
so it at least reads decently right-to-left, it’s still harder to reason with because just about everything we work with in the world goes from somewhere to somewhere else and not the other way around: trains, buses, planes, Monday to Friday, 5@7, etc.
On the other hand, DirectX uses row vectors and ends up with the much more natural: result = input * local_to_object * object_to_world
. Your input is in local space, it gets transformed into object space before finally ending up in world space. Clean, clear, and readable. If you instead multiply the two matrices together on their own, you get the clear local_to_world = local_to_object * object_to_world
instead of the awkward local_to_world = object_to_world * local_to_object
you would get with OpenGL and column vectors.
At the end of the day, which way you choose largely boils down to a personal choice (or whatever library you use for rendering) as I don’t think there’s a big performance difference between the two on modern hardware. For ACL, all its output data is in local space and although we evaluate the error in world space internally, this is entirely transparent to the client application and it is free to use either convention.
Library | Multiplication Style |
---|---|
Realtime Math | DirectX |
MathFu | OpenGL |
vectorial | OpenGL |
VectorialPlusPlus | OpenGL |
CGLM | OpenGL |
GLM | OpenGL |
ILMBase | OpenGL |
DirectX Math | DirectX |
Unreal Engine 4 | DirectX |
Conclusion
Ultimately, which math library you choose for a particular project boils down to a matter of personal preference to a large extent. For the vast majority of the code you’ll write, the performance and code generation is likely to be very close if not identical. Two years ago, I knew regardless of which option I picked I would have to do a lot of work to add what was missing. This greatly motivated me to just start from scratch as many middleware do and I do not regret the experience or results.
My top two favorite libraries are Realtime Math and DirectX Math. Both are quite similar today although DirectX Math wasn’t quite as attractive when I started.
Next steps
Over the next few days I will populate various issues on GitHub to document things that are missing or that could benefit from some love.
A core part that is partially missing at the moment is the quantization and packing logic that ACL already contains. I have not migrated that code yet in large part because I am not sure how to best expose it in a clean and consistent API. I do believe it belongs in RTM where everyone can benefit from it.
ACL does not yet use RTM but that migration is planned for ACL v2.0.