Realtime Math 2.0 is out, cleaner, and faster!

28 Jun 2020

A lot of work went into this latest release and here is the gist of what changed:

Added support for GCC10, clang8, clang9, clang10, VS 2019 clang, and emscripten
Added a lot of matrix math
Added trigonometric functions (scalar and vector)
Angle types have been removed
Lots of optimizations and improvements
Tons of cleanup to ensure a consistent API

It should now contain everything needed by most realtime applications. The one critical feature that is missing at the moment, is a proper browsable documentation. While every function is currently documented, the lack of a web browsable documentation makes using it a bit more difficult than I’d like. Hopefully I can remedy this in the coming months.

Migrating from 1.x

Most of the APIs haven’t changed materially and simply recompiling should work depending on what you use. Where compilation fails, a few minor fixes might be required. There are two main reasons why this release is a major one:

The anglef and angled types have been removed
Extensive usage of return type overloading

The angle types have been removed because I could not manage to come up with a clean API for angular constants that would work well without introducing a LOT more complexity while remaining optimal for code generation. Angular constants (and constants in general) are used with all sorts of code. In particular, SIMD code (such as SSE2 or NEON) often ends up needing to use them and I wanted to be able to efficiently do so. As such they are now simple typedefs for floating point types and can easily be used with ordinary scalar or SIMD code. The pattern used for constants is inspired from Boost.

I had originally introduced them in hope of providing added type safety but the constants weren’t really usable in RTM 1.1. For now, it is easier to document that all angles are represented as radians. The typedef remains to clarify the API.

Return type overloading

C++ doesn’t really have return type overloading but it can be faked. It looks like this in action:

vector4f vec1 = vector_set(1.0f, 2.0f, 3.0f);
vector4f vec2 = vector_set(5.0f, 6.0f, 7.0f);

scalarf dot_sse2 = vector_dot3(vec1, vec2);
float dot_x87 = vector_dot3(vec1, vec2);
vector4f dot_broadcast = vector_dot3(vec1, vec2);

Usage is very clean and the compiler can figure out what to do fairly easily in most cases. The implementation behind the scene is a bit complicated but it is worth it for the flexibility it provides:

// A few things omited for brevity
struct dot3_helper
{
    inline operator float()
    {
        return do_the_math_here1();
    }

    inline operator scalarf()
    {
        return do_the_math_here2();
    }

    inline operator vector4f()
    {
        return do_the_math_here3();
    }

    vector4f left_hand_side;
    vector4f right_hand_side;
};

constexpr dot3_helper vector_dot3(vector4f left_hand_side, vector4f right_hand_side)
{
    return dot3_helper{ left_hand_side, right_hand_side };
}

One motivating reason for this pattern is that very often we perform some operation and return a scalar value. Depending on the architecture, it might be optimal to return it as a SIMD register type instead of a regular float as those do not always mix well. ARM NEON doesn’t suffer from this issue and for that platform, scalarf is a typedef for float. But for x86 with SSE2 and for some PowerPC processors, this distinction is very important in order to achieve optimal performance. It doesn’t stop there though, even when floating point arithmetic uses the same registers as SIMD arithmetic (such as x64 with SSE2), there is sometimes a benefit to having a different type in order to improve code generation. VS2019 still struggles today to avoid extra shuffles when ordinary scalar and SIMD code are mixed. The type distinction allows for improved performance.

This pattern was present from day one inside RTM but it wasn’t as widely used. Usage of scalarf wasn’t as widespread. The latest release pushes its usage much further and as such a lot of code was modified to support both return types. This can sometime lead to ambiguous function calls (and those will need fixing in user code) but it is fairly rare in practice. It forces the programmer to be explicit about what types are used which is in line with RTM’s philosophy.

Quaternion math improvements

The Animation Compression Library (ACL) heavily relies on quaternions and as such I spend a good deal of time trying to optimize them. This release introduces the important quat_slerp function as well as many optimizations for ARM processors.

ARM NEON performance can be surprising

RTM supports both ARMv7 and ARM64 and very often what is optimal for one isn’t optimal for the other. Worse, different devices disagree about what code is optimal, sometimes by quite a bit.

I spent a good deal of time trying to optimize two functions: quaternion multiplication and rotating a 3D vector with a quaternion. Rotating a 3D vector uses two quaternion multiplications.

For quaternion multiplication, I tried a few variations:

The first two implementations are inspired from the classic SSE2 implementation. This is the same code used by DirectX Math on SSE2 and ARM as well.

The third implementation is a bit more clever. Instead of using constants that must be loaded and applied in order to align our signs to leverage fused-multiply-add, we use the floating point negation instruction. This is done once and mixed in with the various swizzle instructions that NEON supports. This ends up being extremely compact and uses only 12 instructions with ARM64!

I measured extensively using micro benchmarks (with Google Benchmark) as well as within ACL. The results turned out to be quite interesting.

On a Pixel 3 android phone, with ARMv7 the scalar version was fastest. It beat the multiplication variant by 1.05x and the negation variant by 1.10x. However, with ARM64, the negation variant was best. It beat the multiplication variant by 1.05x and the scalar variant by 1.16x.

On a Samsung S8 android phone, the results were similar: scalar wins with ARMv7 and negation wins with ARM64 (both by a significant margin again).

On an iPad Pro with ARM64 the results agreed again with the negation variant being fastest.

I hadn’t seen that particular variant used anywhere else so I was quite pleased to see it perform so well with ARM64. In light of these results, RTM now uses the scalar version with ARMv7 and the negation version with ARM64.

Since rotating a 3D vector with a quaternion is two quaternion multiplications back-to-back, I set out to use the same tricks as above with one addition.

vector4f quat_mul_vector3(vector4f vector, quatf rotation)
{
    quatf vector_quat = quat_set_w(vector_to_quat(vector), 0.0f);
    quatf inv_rotation = quat_conjugate(rotation);
    return quat_to_vector(quat_mul(quat_mul(inv_rotation, vector_quat), rotation));
}

We first extend our vector3 into a proper vector4 by padding it with 0.0. Using this information, we can strip out a few operations from the first quaternion multiplication.

Again, I tested all four variants and surprisingly, the scalar variant won out every time both with ARMv7 and ARM64 on both android devices. The iPad saw the negation variant as fastest. Code generation was identical yet it seems that the iPad CPU has very different performance characteristics. As a compromise, the scalar variant is used with all ARM flavors. It isn’t optimal on the iPad but it remains much better than the reference implementation.

I suspect that the scalar implementation performs better because more operations are independent. Despite having way more instructions, there must be fewer stalls and this leads to an overall win. It is possible that this information can be better leveraged to further improve things but that is a problem for another day.

Compiler bugs bonanza

Realtime Math appears to put considerable stress on compilers and often ends up breaking them. In the first 10 years of my career, I found maybe 2-3 C++ compiler bugs. Here are just some of the bugs I remember from the past year:

And those are just the ones I remember from the top of my head. I also found one or two with ACL not in the above list. Some of those will never get fixed because the compiler versions are too old but thankfully the Microsoft Visual Studio team has been very quick to address some of the above issues.

Nicholas Frechette's Blog Raw bits

Realtime Math 2.0 is out, cleaner, and faster!

Migrating from 1.x

Return type overloading

Quaternion math improvements

ARM NEON performance can be surprising

Compiler bugs bonanza

Related Posts

Christmas came early: ACL 2.1 is out! 16 Dec 2023

The Animation Compression Library in Unreal Engine 5.3 17 Sep 2023

Dominance based error estimation 26 Feb 2023