Animation clip metadata packing

In my previous blog post, I analyzed over 15000 animations. This offered a number of key insights about what animation data looks like for compression purposes. If you haven’t read it yet, I suggest you do so first as it covers the terminology and basics for this post. I’ll also be using the same datasets as described in the previous post.

As mentioned, most clips are fairly short and the metadata we retain ends up having a disproportionate impact on the memory footprint. This is because long and short clips have the same amount of metadata everything else being equal.

Packing range values

Each animated track has its samples normalized within the full range of motion for each clip. This ends up being stored as a minimum value and a range extent. Both are three full precision 32 bit floats. Reconstructing our original sample is done like this:

sample = (normalized sample * range extent) + range minimum

This is quick and efficient.

Originally, I decided to retain full precision here out of convenience and expediency during the original implementation. But for a while now, I’ve been wondering if perhaps we can quantize the range values as well on fewer bits. In particular, each sample is also normalized a second time within the range of motion of the segment it lies in and those per segment range values are quantized to 8 bits per component. This works great as range values do not need to be all that accurate. In fact, over two years ago I tried to quantize the segment range values on 16 bits instead to see if things would improve (accuracy or memory footprint) and to my surprise, the result was about the same. The larger metadata footprint did allow higher accuracy and fewer bits retained per animated sample but over a large dataset, the two canceled out.

In order to quantize our range values, we must first extract the total range: every sample from every track. This creates a sort of Axis Aligned Bounding Box for rotation, translation, and scale. Ideally we want to treat those separately since by their very nature, their accepted range of values can differ by quite a bit. For translation and scale, things are a bit complicated as some tracks require full precision and the range can be very dynamic from track to track. In order to test out this optimization idea, I opted to try with rotations first. Rotations are much easier to handle since quaternions have their components already normalized within [-1.0, 1.0]. I went ahead and quantized each component to 8 bits with padding to maintain the alignment. Instead of storing 6x float32 (24 bytes), we are storing 8x uint8 (8 bytes). This represents a 3x reduction in size.

Here are the results:

CMU Before After
Compressed size 70.61 MB 70.08 MB
Compressed size 50th percentile 15.35 KB 15.20 KB
Compression ratio 20.24 : 1 20.40 : 1
Max error 0.0725 cm 0.0741 cm
Track error 99th percentile 0.0089 cm 0.0089 cm
Error threshold percentile rank (0.01 cm) 99.86 % 99.86 %
Paragon Before After
Compressed size 208.04 MB 205.92 MB
Compressed size 50th percentile 15.83 KB 15.12 KB
Compression ratio 20.55 : 1 20.77 : 1
Max error 2.8824 cm 3.5543 cm
Track error 99th percentile 0.0099 cm 0.0111 cm
Error threshold percentile rank (0.01 cm) 99.04 % 98.89 %
Fortnite Before After
Compressed size 482.65 MB 500.95 MB
Compressed size 50th percentile 9.69 KB 9.65 KB
Compression ratio 36.72 : 1 35.38 : 1
Max error 69.375 cm 69.375 cm
Track error 99th percentile 0.0316 cm 0.0319 cm
Error threshold percentile rank (0.01 cm) 97.69 % 97.62 %

At first glance, it appears a small net win with CMU and Paragon but then everything goes downhill with Fortnite. Even though all three datasets see a win in the compressed size for 50% of their clips, the end result is a significant net loss for Fortnite. The accuracy is otherwise slightly lower. As I’ve mentioned before, the max error although interesting can be very misleading.

It is clear that for some clips this is a win, but not always nor overall. Due to the added complexity and the small gain for CMU and Paragon, I’ve opted not to enable this optimization nor push further at this time. It requires more nuance to get it right but it is regardless worth revisiting at some point in the future. In particular, I want to wait until I have rewritten how constant tracks are identified. Nearly every animation compression implementation out there that detects constant tracks (ACL included) does so by using a local space error threshold. This means that it ignores the object space error that it contributes to. In turn, this sometimes causes undesirable artifacts in very exotic cases where a track needs to be animated below the threshold where it is detected to be constant. I plan to handle this more holistically by integrating it as part of the global optimization algorithm: a track will be constant for the clip only if it contributes an acceptable error in object space.

Packing constant samples

Building on the previous range packing work, we can also use the same trick to quantize our constant track samples. Here however, 8 bits is too little so I quantized the constant rotation components to 16 bits. Instead of storing 3x float32 (12 bytes) for each constant rotation sample, we’ll be storing 4x uint16 (8 bytes): a 1.33x reduction in size.

Before results contain packed range values as described above.

CMU Before After
Compressed size 70.08 MB 72.54 MB
Compressed size 50th percentile 15.20 KB 15.72 KB
Compression ratio 20.40 : 1 19.70 : 1
Max error 0.0741 cm 0.0734 cm
Track error 99th percentile 0.0089 cm 0.0097 cm
Error threshold percentile rank (0.01 cm) 99.86 % 99.38 %
Paragon Before After
Compressed size 205.92 MB 213.17 MB
Compressed size 50th percentile 15.12 KB 15.43 KB
Compression ratio 20.77 : 1 20.06 : 1
Max error 3.5543 cm 5.8224 cm
Track error 99th percentile 0.0111 cm 0.0344 cm
Error threshold percentile rank (0.01 cm) 98.89 % 96.84 %
Fortnite Before After
Compressed size 500.95 MB 663.01 MB
Compressed size 50th percentile 9.65 KB 9.83 KB
Compression ratio 35.38 : 1 26.73 : 1
Max error 69.375 cm 5537580.0 cm
Track error 99th percentile 0.0319 cm 0.9272 cm
Error threshold percentile rank (0.01 cm) 97.62 % 88.53 %

Even though our clip metadata size does reduce considerably, overall it yields a significant net loss. The reduced accuracy forces animated samples to retain more bits. It seems that lossless compression techniques might work better here although it would still be quite hard since each constant sample is disjoint: there is little redundancy to take advantage of.

With constant rotation tracks, the quaternion W component is dropped just like for animated samples. I also tried to retain the W component with full precision along with the other three. The idea being that if reducing accuracy increases the footprint, would increasing the accuracy reduce it? Sadly, it doesn’t. The memory footprint ended up being marginally higher. It seems like the sweet spot is to drop one of the quaternion components.

Is there any hope left?

Although both of these optimizations turned out to be failures, I thought it best to document them here anyway. With each idea I try, whether it pans out or not I learn more about the domain and I grow wiser.

There still remains opportunities to optimize the clip metadata but they require a bit more engineering to test out. For one, many animation clips will have constant tracks in common. For example, if the same character is animated in many different ways over several animations, each of them might find that many sub-tracks are not animated. In particular, translation is rarely animated but very often constant as it often holds the bind pose. To better optimize these, animation clips must be compressed as a whole in a database of sorts. It gives us the opportunity to identity redundancies across many clips.

In a few weeks I’ll begin implementing animation streaming and to do so I’ll need to create such a database. This will open the door to these kind of optimizations. Stay tuned!

Animation Compression Table of Contents

Animation data in numbers

For a while now, I’ve been meaning to take some time to instrument the Animation Compression Library (aka ACL) and look more in depth at the animations I have. Over the years I’ve had a lot of ideas on what could be improved and good data is critical in deciding where my time is best spent optimizing. I’m not sure if this will be of interest to many people out there but since I went ahead and did all that work, I might as well publish it!

The basics

An animation clip is made up of a number of tracks each containing an equal number of transform or scalar samples (we use uniform sampling). Transform tracks are further broken down into sub-tracks for rotation, translation, and scale. Each sub-track is in one of three states:

  • Animated: every sample is retained and quantized
  • Constant: a single sample is repeating and retained
  • Default: a single repeating sample equal to the sub-track identity, no sample retained

Each sub-track type has its own identity. Rotations have the quaternion identity, translations have 0.0, 0.0, 0.0, and scale tracks have 0.0, 0.0, 0.0 or 1.0, 1.0, 1.0 depending on whether the clip is additive or not and its additive type.

Being able to collapse constant and default tracks is an important optimization. They are fairly common and allow us to considerably trim down on the number of samples we need to compress.

Sub-tracks that are animated end up having their samples normalized within the full range of values within the clip. This range information is later used to reconstruct our original sample.

Note that constant sample values and range values are currently stored with full float32 precision.

The compression format

When an animation is compressed with ACL, it is first broken down into segments of approximately 16 samples per track. As such, we end up with data that is needed regardless where we sample our clip and data that is needed only when we need a particular segment.

The memory layout roughly breaks down like this:

  • Per clip metadata
    • Headers
    • Offsets
    • Track sub-type states
    • Per clip constant samples
    • Per clip animated sample ranges
  • Our segments
  • Optional metadata

Each segment ends up containing the following:

  • Per segment metadata
    • Number of bits used per sub-track
    • Sub-track range information (we also normalize our samples per segment)
  • The animated samples packed on a variable number of bits

In order to figure out what to try and optimize next, I wanted to see where the memory goes within the above categories.

The datasets

I use three datasets for regression testing and for research and development:

We’ll focus more on Paragon and Fortnite since they are more representative and substantial but CMU is included regardless.

Special thanks to Epic for letting me use their animations for research purposes!

When calculating the raw size of an animation clip, I assume that each track is animated and that nothing is stripped. As such, the raw size can be calculated easily: raw size = num bones * num samples * 10 * sizeof(float). Each sample is made up of 10 floats: 4x for the rotation, 3x for the translation, and 3x for the scale.

Here is how they look at a glance:

  CMU Paragon Fortnite
Number of clips 2534 6558 8310
Raw size 1429.38 MB 4276.11 MB 17724.75 MB
Compressed size 71.00 MB 208.39 MB 483.54 MB
Compression ratio 20.13 : 1 20.52 : 1 36.66 : 1
Number of tracks 111496 816700 1559545
Number of sub-tracks 334488 2450100 4678635

Sub-track breakdown

  CMU Paragon Fortnite
Number of default sub-tracks 34.09% 37.26% 42.01%
Number of constant sub-tracks 52.29% 43.95% 50.33%
Number of animated sub-tracks 13.62% 18.78% 7.66%
CMU Default Constant Animated
Number of rotation sub-tracks 0.00% 64.41% 38.59%
Number of translation sub-tracks 2.27% 95.46% 2.27%
Number of scale sub-tracks 100.00% 0.00% 0.00%
Paragon Default Constant Animated
Number of rotation sub-tracks 10.85% 47.69% 41.45%
Number of translation sub-tracks 3.32% 82.98% 13.70%
Number of scale sub-tracks 97.62% 1.19% 1.19%
Fortnite Default Constant Animated
Number of rotation sub-tracks 21.15% 62.84% 16.01%
Number of translation sub-tracks 7.64% 86.78% 5.58%
Number of scale sub-tracks 97.23% 1.38% 1.39%

Overall, across all three data sets, about half the tracks are constant. Translation tracks tend to be constant much more often. Most tracks aren’t animated but rotation tracks tend to be animated the most. Rotation tracks are 3x more likely to be animated than translation tracks and scale tracks are very rarely animated (~1.3% of the time). As such, segment data mostly contains animated rotation data.

Segment breakdown

Number of segments CMU Paragon Fortnite
Total 51909 49213 121175
50th percentile 10 3 2
85th percentile 42 9 18
99th percentile 116 117 187

Half the clips are very short and only contain 2 or 3 segments for Paragon and Fortnite. Those clips are likely to be 50 frames or less, or about 1-2 seconds at 30 FPS. For Paragon, 85% of the clips have 9 segments or less and in Fortnite we have 18 segments or less.

Compressed size breakdown

Compressed size CMU Paragon Fortnite
Total 71.00 MB 208.39 MB 483.54 MB
50th percentile 15.42 KB 15.85 KB 9.71 KB
85th percentile 56.36 KB 48.18 KB 72.22 KB
99th percentile 153.68 KB 354.54 KB 592.13 KB
Clip metadata size CMU Paragon Fortnite
Total size 4.22 MB (5.94%) 24.38 MB (11.70%) 38.32 MB (7.92%)
50th percentile 9.73% 22.03% 46.06%
85th percentile 18.68% 46.06% 97.43%
99th percentile 37.38% 98.48% 98.64%
Segment metadata size CMU Paragon Fortnite
Total size 6.23 MB (8.78%) 22.61 MB (10.85%) 54.21 MB (11.21%)
50th percentile 8.07% 6.88% 0.75%
85th percentile 9.28% 11.37% 10.95%
99th percentile 10.59% 21.00% 26.21%
Segment animated data size CMU Paragon Fortnite
Total size 60.44 MB (85.13%) 160.98 MB (77.25%) 390.15 MB (80.69%)
50th percentile 81.92% 70.55% 48.22%
85th percentile 87.08% 79.62% 81.65%
99th percentile 88.55% 87.93% 89.25%

From this data, we can conclude that our efforts might be best spent optimizing the clip metadata where the constant track data and the animated range data will contribute much more relative to the overall footprint. Short clips have less animated data but just as much metadata as longer clips with an equal number of bones. Even though overall the clip metadata doesn’t contribute much, for the vast majority of clips it does contribute a significant amount (for half the Fortnite clips, clip metadata represented 46.06% or more of the total clip size).

The numbers are somewhat skewed by the fact that a few clips are very long. Their animated footprint ends up dominating the overall numbers hence why breaking things down by percentile is insightful here.

The clip metadata isn’t as optimized and it contains more low hanging fruits to attack. I’ve spent most of my time focusing on the segment animated data but as a result, pushing things further on that front is much harder and requires more work and a higher risk that what I try won’t pan out.

Animated data breakdown

  CMU Paragon Fortnite
Total compressed size 71.00 MB 208.39 MB 483.54 MB
Animated data size 60.44 MB (85.13%) 160.98 MB (77.25%) 390.15 MB (80.69%)
70% of animated data removed 28.69 MB (2.47x smaller) 95.70 MB (2.18x smaller) 210.44 MB (2.30x smaller)
50% of animated data removed 40.78 MB (1.74x smaller) 127.90 MB (1.63x smaller) 288.47 MB (1.68x smaller)
25% of animated data removed 55.89 MB (1.27x smaller) 168.15 MB (1.24x smaller) 386.00 MB (1.25x smaller)

A common optimization is to strip a number of frames from the animation data (aka sub-sampling or frame stripping). This is very destructive but can yield good memory savings. Since we know how much animated data we have and its relative footprint, we can compute ballpark numbers for how much smaller removing 70%, 50%, or 25% of our animated data might be. The numbers above represent the total compressed size after stripping and the reduction factor.

In my next post, I’ll explore the results of quantizing the constant sample values and the clip range values, stay tuned!

Animation Compression Table of Contents

Zero overhead backward compatibility

The Animation Compression Library finally supports backward compatibility (going forward once 2.0 comes out). I’m really happy with how it turned out so I thought I would write a bit about how the ACL decompression is designed.


At runtime, decompressing animation data could not be easier:

acl::decompression_context<custom_decompression_settings> context;

// Seek 1.0 second into our compressed animation
// and don't use rounding to interpolate, acl::sample_rounding_policy::none);

custom_writer writer(output_data);

A small context object is created and bound to our compressed data. Its construction is cheap enough that it can be done on the stack on demand. It can then subsequently be used (and re-used) to seek and decompress.

A key design goal is to have as little overhead as possible: pay only for what you use. This is achieved through templating in two ways:

  • The custom_decompression_settings argument on the context object controls what features are enabled or disabled.
  • The custom_writer wraps whatever container you might be using in your game engine to represent the animation data. This is to make sure no extra copying is required.

The decompression settings are where the magic happens.

Compile time user control

There are many game engines out there each handling animation in their own specific way. In order to be able to integrate as seamlessly as possible, ACL exposes a small struct that can be overriden to control its behavior. By leveraging constexpr and templating, features that aren’t used or needed can be removed entirely at compile time to ensure zero cost at runtime.

Here is how it looks:

struct decompression_settings
  // Whether or not to clamp the sample time when `seek(..)`
  // is called. Defaults to true.
  static constexpr bool clamp_sample_time() { return true; }

  // Whether or not the specified track type is supported.
  // Defaults to true.
  // If a track type is statically known not to be supported,
  // the compiler can strip the associated code.
  static constexpr bool is_track_type_supported(track_type8 /*type*/)
  { return true; }

  // Other stuff ...

  // Which version we should optimize for.
  // If 'any' is specified, the decompression context will
  // support every single version with full backwards
  // compatibility.
  // Using a specific version allows the compiler to
  // statically strip code for all other versions.
  // This allows the creation of context objects specialized
  // for specific versions which yields optimal performance.
  static constexpr compressed_tracks_version16 version_supported()
  { return compressed_tracks_version16::any; }

  // Whether the specified rotation/translation/scale format
  // are supported or not.
  // Use this to strip code related to formats you do not need.
  static constexpr bool is_rotation_format_supported(rotation_format8 /*format*/)
  { return true; }

  // Other stuff ...

  // Whether rotations should be normalized before being
  // output or not. Some animation runtimes will normalize
  // in a separate step and do not need the explicit
  // normalization.
  // Enabled by default for safety.
  static constexpr bool normalize_rotations() { return true; }

Extending this is simple and clean:

struct default_transform_decompression_settings : public decompression_settings
  // Only support transform tracks
  static constexpr bool is_track_type_supported(track_type8 type)
  { return type == track_type8::qvvf; }

  // By default, we only support the variable bit rates as
  // they are generally optimal
  static constexpr bool is_rotation_format_supported(rotation_format8 format)
  { return format == rotation_format8::quatf_drop_w_variable; }

  // Other stuff ...

A new struct is created to inherit from the desired decompression settings and specific functions are defined to hide the base implementations, thus replacing them.

By templating the decompression_context with the settings structure, it can be used to determine everything needed at compile time:

  • Which memory representation is needed depending on whether we are decompressing scalar or transform tracks.
  • Which algorithm version to support and optimize for.
  • Which features to strip when they aren’t needed.

This is much nicer than the common C way to use #define macros. By using a template argument, multiple setting objects can easily be created (with type safety) and used within the same application or file.

Backward compatibility

By using the decompression_settings, we can specify which version we optimize for. If no specific version is provided (the default behavior), we will branch and handle all supported versions. However, if a specific version is provided, we can strip the code for all other versions removing any runtime overhead. This is clean and simple thanks to templates.

template<compressed_tracks_version16 version>
struct decompression_version_selector {};

// Specialize for ACL 2.0's format
template<> struct
  static bool is_version_supported(compressed_tracks_version16 version)
  { return version == compressed_tracks_version16::v02_00_00; }

  template<class decompression_settings_type, class context_type>
  ACL_FORCE_INLINE static bool initialize(context_type& context, const compressed_tracks& tracks)
    return acl_impl::initialize_v0<decompression_settings_type>(context, tracks);

  // Other stuff ...

// Specialize to support all versions
template<> struct
  static bool is_version_supported(compressed_tracks_version16 version)
    return version >= compressed_tracks_version16::first && version <= compressed_tracks_version16::latest;

  template<class decompression_settings_type, class context_type>
  static bool initialize(context_type& context, const compressed_tracks& tracks)
    // TODO: Note that the `any` decompression can be optimized further to avoid a complex switch on every call.
    const compressed_tracks_version16 version = tracks.get_version();
    switch (version)
    case compressed_tracks_version16::v02_00_00:
      return decompression_version_selector<compressed_tracks_version16::v02_00_00>::initialize<decompression_settings_type>(context, tracks);
      ACL_ASSERT(false, "Unsupported version");
      return false;

  // Other stuff ...

This is ideal for many game engines. For example Unreal Engine 4 always compresses locally and caches the result in its Derived Data Cache. This means that the compressed format is always the latest one used by the plugin. As such, UE4 only needs to support a single version and it can do so without any overhead.

Other game engines might choose to support the latest two versions, emiting a warning to recompress old animations while still being able to support them with very little overhead: a single branch to pick which context to use.

More general applications might opt to support every version (e.g. a glTF viewer).

Note that backward compatibility will only be supported for official releases as the develop branch is constantly subject to change.


This C++ customization pattern is clean and simple to use and it allows a compact API with a rich feature set. It was present in a slightly different form ever since ACL 0.1 and the more I use it, the more I love it.

In fact, in my opinion the Animation Compression Library and Realtime Math contain some of the best code (quality wise) that I’ve ever written in my career. Free from time or budget constraints, I can carefully craft each facet to the best of my ability.

ACL 2.0 continues to progress nicely. It is still missing a few features but it is already an amazing step up from 1.3.

Realtime Math 2.0 is out, cleaner, and faster!

A lot of work went into this latest release and here is the gist of what changed:

  • Added support for GCC10, clang8, clang9, clang10, VS 2019 clang, and emscripten
  • Added a lot of matrix math
  • Added trigonometric functions (scalar and vector)
  • Angle types have been removed
  • Lots of optimizations and improvements
  • Tons of cleanup to ensure a consistent API

It should now contain everything needed by most realtime applications. The one critical feature that is missing at the moment, is a proper browsable documentation. While every function is currently documented, the lack of a web browsable documentation makes using it a bit more difficult than I’d like. Hopefully I can remedy this in the coming months.

Migrating from 1.x

Most of the APIs haven’t changed materially and simply recompiling should work depending on what you use. Where compilation fails, a few minor fixes might be required. There are two main reasons why this release is a major one:

  • The anglef and angled types have been removed
  • Extensive usage of return type overloading

The angle types have been removed because I could not manage to come up with a clean API for angular constants that would work well without introducing a LOT more complexity while remaining optimal for code generation. Angular constants (and constants in general) are used with all sorts of code. In particular, SIMD code (such as SSE2 or NEON) often ends up needing to use them and I wanted to be able to efficiently do so. As such they are now simple typedefs for floating point types and can easily be used with ordinary scalar or SIMD code. The pattern used for constants is inspired from Boost.

I had originally introduced them in hope of providing added type safety but the constants weren’t really usable in RTM 1.1. For now, it is easier to document that all angles are represented as radians. The typedef remains to clarify the API.

Return type overloading

C++ doesn’t really have return type overloading but it can be faked. It looks like this in action:

vector4f vec1 = vector_set(1.0f, 2.0f, 3.0f);
vector4f vec2 = vector_set(5.0f, 6.0f, 7.0f);

scalarf dot_sse2 = vector_dot3(vec1, vec2);
float dot_x87 = vector_dot3(vec1, vec2);
vector4f dot_broadcast = vector_dot3(vec1, vec2);

Usage is very clean and the compiler can figure out what to do fairly easily in most cases. The implementation behind the scene is a bit complicated but it is worth it for the flexibility it provides:

// A few things omited for brevity
struct dot3_helper
    inline operator float()
        return do_the_math_here1();

    inline operator scalarf()
        return do_the_math_here2();

    inline operator vector4f()
        return do_the_math_here3();

    vector4f left_hand_side;
    vector4f right_hand_side;

constexpr dot3_helper vector_dot3(vector4f left_hand_side, vector4f right_hand_side)
    return dot3_helper{ left_hand_side, right_hand_side };

One motivating reason for this pattern is that very often we perform some operation and return a scalar value. Depending on the architecture, it might be optimal to return it as a SIMD register type instead of a regular float as those do not always mix well. ARM NEON doesn’t suffer from this issue and for that platform, scalarf is a typedef for float. But for x86 with SSE2 and for some PowerPC processors, this distinction is very important in order to achieve optimal performance. It doesn’t stop there though, even when floating point arithmetic uses the same registers as SIMD arithmetic (such as x64 with SSE2), there is sometimes a benefit to having a different type in order to improve code generation. VS2019 still struggles today to avoid extra shuffles when ordinary scalar and SIMD code are mixed. The type distinction allows for improved performance.

This pattern was present from day one inside RTM but it wasn’t as widely used. Usage of scalarf wasn’t as widespread. The latest release pushes its usage much further and as such a lot of code was modified to support both return types. This can sometime lead to ambiguous function calls (and those will need fixing in user code) but it is fairly rare in practice. It forces the programmer to be explicit about what types are used which is in line with RTM’s philosophy.

Quaternion math improvements

The Animation Compression Library (ACL) heavily relies on quaternions and as such I spend a good deal of time trying to optimize them. This release introduces the important quat_slerp function as well as many optimizations for ARM processors.

ARM NEON performance can be surprising

RTM supports both ARMv7 and ARM64 and very often what is optimal for one isn’t optimal for the other. Worse, different devices disagree about what code is optimal, sometimes by quite a bit.

I spent a good deal of time trying to optimize two functions: quaternion multiplication and rotating a 3D vector with a quaternion. Rotating a 3D vector uses two quaternion multiplications.

For quaternion multiplication, I tried a few variations:

The first two implementations are inspired from the classic SSE2 implementation. This is the same code used by DirectX Math on SSE2 and ARM as well.

The third implementation is a bit more clever. Instead of using constants that must be loaded and applied in order to align our signs to leverage fused-multiply-add, we use the floating point negation instruction. This is done once and mixed in with the various swizzle instructions that NEON supports. This ends up being extremely compact and uses only 12 instructions with ARM64!

I measured extensively using micro benchmarks (with Google Benchmark) as well as within ACL. The results turned out to be quite interesting.

On a Pixel 3 android phone, with ARMv7 the scalar version was fastest. It beat the multiplication variant by 1.05x and the negation variant by 1.10x. However, with ARM64, the negation variant was best. It beat the multiplication variant by 1.05x and the scalar variant by 1.16x.

On a Samsung S8 android phone, the results were similar: scalar wins with ARMv7 and negation wins with ARM64 (both by a significant margin again).

On an iPad Pro with ARM64 the results agreed again with the negation variant being fastest.

I hadn’t seen that particular variant used anywhere else so I was quite pleased to see it perform so well with ARM64. In light of these results, RTM now uses the scalar version with ARMv7 and the negation version with ARM64.

Since rotating a 3D vector with a quaternion is two quaternion multiplications back-to-back, I set out to use the same tricks as above with one addition.

vector4f quat_mul_vector3(vector4f vector, quatf rotation)
    quatf vector_quat = quat_set_w(vector_to_quat(vector), 0.0f);
    quatf inv_rotation = quat_conjugate(rotation);
    return quat_to_vector(quat_mul(quat_mul(inv_rotation, vector_quat), rotation));

We first extend our vector3 into a proper vector4 by padding it with 0.0. Using this information, we can strip out a few operations from the first quaternion multiplication.

Again, I tested all four variants and surprisingly, the scalar variant won out every time both with ARMv7 and ARM64 on both android devices. The iPad saw the negation variant as fastest. Code generation was identical yet it seems that the iPad CPU has very different performance characteristics. As a compromise, the scalar variant is used with all ARM flavors. It isn’t optimal on the iPad but it remains much better than the reference implementation.

I suspect that the scalar implementation performs better because more operations are independent. Despite having way more instructions, there must be fewer stalls and this leads to an overall win. It is possible that this information can be better leveraged to further improve things but that is a problem for another day.

Compiler bugs bonanza

Realtime Math appears to put considerable stress on compilers and often ends up breaking them. In the first 10 years of my career, I found maybe 2-3 C++ compiler bugs. Here are just some of the bugs I remember from the past year:

And those are just the ones I remember from the top of my head. I also found one or two with ACL not in the above list. Some of those will never get fixed because the compiler versions are too old but thankfully the Microsoft Visual Studio team has been very quick to address some of the above issues.

Keep an eye out for buffer security checks

By default, when compiling C++, Visual Studio enables the /GS flag for buffer security checks.

In functions that the compiler deems vulnerable to the stack getting overwritten, it adds buffer security checks. To detect if the stack has been tampered with during execution of the function, it first writes a sentinel value past the end of the reserved space the function needs. This sentinel value is random per process to avoid an attacker guessing its value. Just before the function exits, it calls a small function that validates the sentinel value: __security_check_cookie().

The rules on what can trigger this are as follow (from the MSDN documentation):

  • The function contains an array on the stack that is larger than 4 bytes, has more than two elements, and has an element type that is not a pointer type.
  • The function contains a data structure on the stack whose size is more than 8 bytes and contains no pointers.
  • The function contains a buffer allocated by using the _alloca function.
  • The function contains a data structure that contains another which triggers one of the above checks.

This is all fine and well and you should never disable it program/file wide. But you should keep an eye out. In very performance critical code, this small overhead can have a sizable impact. I’ve observed this over the years a few times but it now popped up somewhere I didn’t expect it: my math library Realtime Math (RTM).

Non-zero cost abstractions

The SSE2 headers define a few types. Of interest to us today is __m128 but others suffer from the same issue as well (including wider types such as __m256). Those define a register wide value suitable for SIMD intrinsics: __m128 contains four 32 bit floats. As such, it takes up 16 bytes.

Because it is considered a native type by the compiler, it does not trigger buffer security checks to be inserted despite being larger than 8 bytes without containing pointers.

However, the same is not true if you wrap it in a struct or class.

struct scalarf
    __m128 value;

The above struct might trigger security checks to be inserted: it is a struct larger than 8 bytes that does not contain pointers.

Similarly, many other common math types suffer from this:

struct matrix3x3f
    __m128 x_axis;
    __m128 y_axis;
    __m128 z_axis;

Many years ago, in a discussion about an unrelated compiler bug, someone at Microsoft mentioned to me that it is best to typedef SIMD types than it is to wrap them in a concrete type; it should lead to better code generation. They didn’t offer any insights as to why that might be (and I didn’t ask) and honestly I had never noticed any difference until security buffer checks came into play, last Friday. Their math library DirectX Math uses a typedef for its vector type and so does RTM everywhere it can. But sometimes it can’t be avoided.

RTM also extensively uses a helper struct pattern to help keep the code clean and flexible. Some code such as a vector dot product returns a scalar value. But on some architectures, it isn’t desirable to treat it as a float for performance reasons (PowerPC, x86, etc). For example, with x86 float arithmetic does not use SSE2 unless you explicitly use intrinsics for it: by default it uses the x87 floating point stack (with MSVC at least). If this value is later needed as part of more SSE2 vector code (such as vector normalization), the value will be calculated from two SSE2 vectors, be stored on the x87 float stack, only to be passed back to SSE2. To avoid this roundtrip when using SSE2 with x86, RTM exposes the scalarf type. However, sometimes you really need the value as a float. The usage dictates what you need. To support both variants with as little syntactic overhead as possible, RTM leverages return type overloading (it’s not really a thing in C++ but it can be faked with implicit coercion). It makes the following possible:

vector4f vec1 = vector_set(1.0f, 2.0f, 3.0f);
vector4f vec2 = vector_set(5.0f, 6.0f, 7.0f);

scalarf dot_sse2 = vector_dot3(vec1, vec2);
float dot_x87 = vector_dot3(vec1, vec2);
vector4f dot_broadcast = vector_dot3(vec1, vec2);

This is very clean to use and the compiler can figure out what code to call easily. But it is ugly to implement; a small price to pay for readability.

// A few things omited for brevity
struct dot3_helper
    inline operator float()
        return do_the_math_here1();

    inline operator scalarf()
        return do_the_math_here2();

    inline operator vector4f()
        return do_the_math_here3();

    vector4f left_hand_side;
    vector4f right_hand_side;

constexpr dot3_helper vector_dot3(vector4f left_hand_side, vector4f right_hand_side)
    return dot3_helper{ left_hand_side, right_hand_side };

Side note, on ARM scalarf is a typedef for float in order to achieve optimal performance.

There is a lot of boilerplate but the code is very simple and either constexpr or marked inline. We create a small struct and return it, and at the call site the compiler invokes the right implicit coercion operator. It works just fine and the compiler optimizes everything away to yield the same lean assembly you would expect. We leverage inlining to create what should be a zero cost abstraction. Except, in rare cases (at least with Visual Studio as recent as 2019), inlining fails and everything goes wrong.

Side note, the above is one example why scalarf cannot be a typedef because we need it distinct from vector4f both of which are represented in SSE2 as a __m128. To avoid this issue, vector4f is a typedef while scalarf is a wrapping struct.

Many math libraries out there wrap SIMD types in a proper type and use similar patterns. And while generally most math functions are small and get inlined fine, it isn’t always the case. In particular, these security buffer checks can harm the ability of the compiler to inline while at the same time degrading performance of perfectly safe code.

8 instructions is too much

All of this worked well until I noticed, out of the blue, a performance regression in the Animation Compression Library when I updated to the latest RTM version. This was very strange as it should have only contained performance optimizations.

Here is where the code generation changed:

// A few things omited for brevity
__declspec(safebuffers) rtm::scalarf calculate_error_no_scale(const calculate_error_args& args)
    const rtm::qvvf& raw_transform_ = *static_cast<const rtm::qvvf*>(args.transform0);
    const rtm::qvvf& lossy_transform_ = *static_cast<const rtm::qvvf*>(args.transform1);

    const rtm::vector4f vtx0 = args.shell_point_x;
    const rtm::vector4f vtx1 = args.shell_point_y;

    const rtm::vector4f raw_vtx0 = rtm::qvv_mul_point3_no_scale(vtx0, raw_transform_);
    const rtm::vector4f raw_vtx1 = rtm::qvv_mul_point3_no_scale(vtx1, raw_transform_);

    const rtm::vector4f lossy_vtx0 = rtm::qvv_mul_point3_no_scale(vtx0, lossy_transform_);
    const rtm::vector4f lossy_vtx1 = rtm::qvv_mul_point3_no_scale(vtx1, lossy_transform_);

    const rtm::scalarf vtx0_error = rtm::vector_distance3(raw_vtx0, lossy_vtx0);
    const rtm::scalarf vtx1_error = rtm::vector_distance3(raw_vtx1, lossy_vtx1);

    return rtm::scalar_max(vtx0_error, vtx1_error);

Side note, as part of a prior effort to optimize that performance critical function, I had already disabled buffer security checks which are unnecessary here.

The above code is fairly simple. We take two 3D vertices in local space and transform them to world space. We do this for our source data (which is raw) and for our compressed data (which is lossy). We calculate the 3D distance between the raw and lossy vertices and it yields our compression error. We take the maximum value as our final result.

qvv_mul_point3_no_scale is fairly heavy instruction wise and it doesn’t get fully inlined. Some of it does but the quat_mul_vector3 it contains does not inline.

By the time the compiler gets to the vector_distance3 calls, the compiler struggles. Both VS2017 and VS2019 fail to inline 8 instructions (the vector_length3 it contains).

// const rtm::scalarf vtx0_error = rtm::vector_distance3(raw_vtx0, lossy_vtx0);
vsubps      xmm1,xmm0,xmm6
vmovups     xmmword ptr [rsp+20h],xmm1
lea         rcx,[rsp+20h]
call        rtm::rtm_impl::vector4f_vector_length3::operator rtm::scalarf

Side note, when AVX is enabled, Visual Studio often ends up attempting to use wider registers when they aren’t needed, causing the addition of vzeroupper and other artifacts that can degrade performance.

// inline operator scalarf() const
// {
sub         rsp,18h
mov         rax,qword ptr [__security_cookie]
xor         rax,rsp
mov         qword ptr [rsp],rax
// const scalarf len_sq = vector_length_squared3(input);
vmovups     xmm0,xmmword ptr [rcx]
vmulps      xmm2,xmm0,xmm0
vshufps     xmm1,xmm2,xmm2,1
vaddss      xmm0,xmm2,xmm1
vshufps     xmm2,xmm2,xmm2,2
vaddss      xmm0,xmm0,xmm2
// return scalar_sqrt(len_sq);
vsqrtss     xmm0,xmm0,xmm0
// }
mov         rcx,qword ptr [rsp]
xor         rcx,rsp
call        __security_check_cookie
add         rsp,18h

And that is where everything goes wrong. Those 8 instructions calculate the 3D dot product and the square root required to get the length of the vector between our two points. They balloon up to 16 instructions because of the security check overhead. Behind the scenes, VS doesn’t fail to inline 8 instructions, it fails to inline something larger than we see when everything goes right.

This was quite surprising to me, because until now I had never seen this behavior kick in this way. Indeed, try as I might, I could not reproduce this behavior in a small playground (yet).

In light of this, and because the math code within Realtime Math is very simple, I have decided to annotate every function to explicitly disable buffer security checks. Not every function requires this but it is easier to be consistent for maintenance. This restores the optimal code generation and vector_distance3 finally inlines properly.

I am now wondering if I should explicitly force inline short functions…