Zero overhead backward compatibility

The Animation Compression Library finally supports backward compatibility (going forward once 2.0 comes out). I’m really happy with how it turned out so I thought I would write a bit about how the ACL decompression is designed.

The API

At runtime, decompressing animation data could not be easier:

acl::decompression_context<custom_decompression_settings> context;
context.initialize(compressed_data);

// Seek 1.0 second into our compressed animation
// and don't use rounding to interpolate
context.seek(1.0f, acl::sample_rounding_policy::none);

custom_writer writer(output_data);
context.decompress_tracks(writer);

A small context object is created and bound to our compressed data. Its construction is cheap enough that it can be done on the stack on demand. It can then subsequently be used (and re-used) to seek and decompress.

A key design goal is to have as little overhead as possible: pay only for what you use. This is achieved through templating in two ways:

  • The custom_decompression_settings argument on the context object controls what features are enabled or disabled.
  • The custom_writer wraps whatever container you might be using in your game engine to represent the animation data. This is to make sure no extra copying is required.

The decompression settings are where the magic happens.

Compile time user control

There are many game engines out there each handling animation in their own specific way. In order to be able to integrate as seamlessly as possible, ACL exposes a small struct that can be overriden to control its behavior. By leveraging constexpr and templating, features that aren’t used or needed can be removed entirely at compile time to ensure zero cost at runtime.

Here is how it looks:

struct decompression_settings
{
  // Whether or not to clamp the sample time when `seek(..)`
  // is called. Defaults to true.
  static constexpr bool clamp_sample_time() { return true; }

  // Whether or not the specified track type is supported.
  // Defaults to true.
  // If a track type is statically known not to be supported,
  // the compiler can strip the associated code.
  static constexpr bool is_track_type_supported(track_type8 /*type*/)
  { return true; }

  // Other stuff ...

  // Which version we should optimize for.
  // If 'any' is specified, the decompression context will
  // support every single version with full backwards
  // compatibility.
  // Using a specific version allows the compiler to
  // statically strip code for all other versions.
  // This allows the creation of context objects specialized
  // for specific versions which yields optimal performance.
  static constexpr compressed_tracks_version16 version_supported()
  { return compressed_tracks_version16::any; }

  // Whether the specified rotation/translation/scale format
  // are supported or not.
  // Use this to strip code related to formats you do not need.
  static constexpr bool is_rotation_format_supported(rotation_format8 /*format*/)
  { return true; }

  // Other stuff ...

  // Whether rotations should be normalized before being
  // output or not. Some animation runtimes will normalize
  // in a separate step and do not need the explicit
  // normalization.
  // Enabled by default for safety.
  static constexpr bool normalize_rotations() { return true; }
};

Extending this is simple and clean:

struct default_transform_decompression_settings : public decompression_settings
{
  // Only support transform tracks
  static constexpr bool is_track_type_supported(track_type8 type)
  { return type == track_type8::qvvf; }

  // By default, we only support the variable bit rates as
  // they are generally optimal
  static constexpr bool is_rotation_format_supported(rotation_format8 format)
  { return format == rotation_format8::quatf_drop_w_variable; }

  // Other stuff ...
};

A new struct is created to inherit from the desired decompression settings and specific functions are defined to hide the base implementations, thus replacing them.

By templating the decompression_context with the settings structure, it can be used to determine everything needed at compile time:

  • Which memory representation is needed depending on whether we are decompressing scalar or transform tracks.
  • Which algorithm version to support and optimize for.
  • Which features to strip when they aren’t needed.

This is much nicer than the common C way to use #define macros. By using a template argument, multiple setting objects can easily be created (with type safety) and used within the same application or file.

Backward compatibility

By using the decompression_settings, we can specify which version we optimize for. If no specific version is provided (the default behavior), we will branch and handle all supported versions. However, if a specific version is provided, we can strip the code for all other versions removing any runtime overhead. This is clean and simple thanks to templates.

template<compressed_tracks_version16 version>
struct decompression_version_selector {};

// Specialize for ACL 2.0's format
template<> struct
decompression_version_selector<compressed_tracks_version16::v02_00_00>
{
  static bool is_version_supported(compressed_tracks_version16 version)
  { return version == compressed_tracks_version16::v02_00_00; }

  template<class decompression_settings_type, class context_type>
  ACL_FORCE_INLINE static bool initialize(context_type& context, const compressed_tracks& tracks)
  {
    return acl_impl::initialize_v0<decompression_settings_type>(context, tracks);
  }

  // Other stuff ...
};

// Specialize to support all versions
template<> struct
decompression_version_selector<compressed_tracks_version16::any>
{
  static bool is_version_supported(compressed_tracks_version16 version)
  {
    return version >= compressed_tracks_version16::first && version <= compressed_tracks_version16::latest;
  }

  template<class decompression_settings_type, class context_type>
  static bool initialize(context_type& context, const compressed_tracks& tracks)
  {
    // TODO: Note that the `any` decompression can be optimized further to avoid a complex switch on every call.
    const compressed_tracks_version16 version = tracks.get_version();
    switch (version)
    {
    case compressed_tracks_version16::v02_00_00:
      return decompression_version_selector<compressed_tracks_version16::v02_00_00>::initialize<decompression_settings_type>(context, tracks);
    default:
      ACL_ASSERT(false, "Unsupported version");
      return false;
    }
  }

  // Other stuff ...
};

This is ideal for many game engines. For example Unreal Engine 4 always compresses locally and caches the result in its Derived Data Cache. This means that the compressed format is always the latest one used by the plugin. As such, UE4 only needs to support a single version and it can do so without any overhead.

Other game engines might choose to support the latest two versions, emiting a warning to recompress old animations while still being able to support them with very little overhead: a single branch to pick which context to use.

More general applications might opt to support every version (e.g. a glTF viewer).

Note that backward compatibility will only be supported for official releases as the develop branch is constantly subject to change.

Conclusion

This C++ customization pattern is clean and simple to use and it allows a compact API with a rich feature set. It was present in a slightly different form ever since ACL 0.1 and the more I use it, the more I love it.

In fact, in my opinion the Animation Compression Library and Realtime Math contain some of the best code (quality wise) that I’ve ever written in my career. Free from time or budget constraints, I can carefully craft each facet to the best of my ability.

ACL 2.0 continues to progress nicely. It is still missing a few features but it is already an amazing step up from 1.3.

Realtime Math 2.0 is out, cleaner, and faster!

A lot of work went into this latest release and here is the gist of what changed:

  • Added support for GCC10, clang8, clang9, clang10, VS 2019 clang, and emscripten
  • Added a lot of matrix math
  • Added trigonometric functions (scalar and vector)
  • Angle types have been removed
  • Lots of optimizations and improvements
  • Tons of cleanup to ensure a consistent API

It should now contain everything needed by most realtime applications. The one critical feature that is missing at the moment, is a proper browsable documentation. While every function is currently documented, the lack of a web browsable documentation makes using it a bit more difficult than I’d like. Hopefully I can remedy this in the coming months.

Migrating from 1.x

Most of the APIs haven’t changed materially and simply recompiling should work depending on what you use. Where compilation fails, a few minor fixes might be required. There are two main reasons why this release is a major one:

  • The anglef and angled types have been removed
  • Extensive usage of return type overloading

The angle types have been removed because I could not manage to come up with a clean API for angular constants that would work well without introducing a LOT more complexity while remaining optimal for code generation. Angular constants (and constants in general) are used with all sorts of code. In particular, SIMD code (such as SSE2 or NEON) often ends up needing to use them and I wanted to be able to efficiently do so. As such they are now simple typedefs for floating point types and can easily be used with ordinary scalar or SIMD code. The pattern used for constants is inspired from Boost.

I had originally introduced them in hope of providing added type safety but the constants weren’t really usable in RTM 1.1. For now, it is easier to document that all angles are represented as radians. The typedef remains to clarify the API.

Return type overloading

C++ doesn’t really have return type overloading but it can be faked. It looks like this in action:

vector4f vec1 = vector_set(1.0f, 2.0f, 3.0f);
vector4f vec2 = vector_set(5.0f, 6.0f, 7.0f);

scalarf dot_sse2 = vector_dot3(vec1, vec2);
float dot_x87 = vector_dot3(vec1, vec2);
vector4f dot_broadcast = vector_dot3(vec1, vec2);

Usage is very clean and the compiler can figure out what to do fairly easily in most cases. The implementation behind the scene is a bit complicated but it is worth it for the flexibility it provides:

// A few things omited for brevity
struct dot3_helper
{
    inline operator float()
    {
        return do_the_math_here1();
    }

    inline operator scalarf()
    {
        return do_the_math_here2();
    }

    inline operator vector4f()
    {
        return do_the_math_here3();
    }

    vector4f left_hand_side;
    vector4f right_hand_side;
};

constexpr dot3_helper vector_dot3(vector4f left_hand_side, vector4f right_hand_side)
{
    return dot3_helper{ left_hand_side, right_hand_side };
}

One motivating reason for this pattern is that very often we perform some operation and return a scalar value. Depending on the architecture, it might be optimal to return it as a SIMD register type instead of a regular float as those do not always mix well. ARM NEON doesn’t suffer from this issue and for that platform, scalarf is a typedef for float. But for x86 with SSE2 and for some PowerPC processors, this distinction is very important in order to achieve optimal performance. It doesn’t stop there though, even when floating point arithmetic uses the same registers as SIMD arithmetic (such as x64 with SSE2), there is sometimes a benefit to having a different type in order to improve code generation. VS2019 still struggles today to avoid extra shuffles when ordinary scalar and SIMD code are mixed. The type distinction allows for improved performance.

This pattern was present from day one inside RTM but it wasn’t as widely used. Usage of scalarf wasn’t as widespread. The latest release pushes its usage much further and as such a lot of code was modified to support both return types. This can sometime lead to ambiguous function calls (and those will need fixing in user code) but it is fairly rare in practice. It forces the programmer to be explicit about what types are used which is in line with RTM’s philosophy.

Quaternion math improvements

The Animation Compression Library (ACL) heavily relies on quaternions and as such I spend a good deal of time trying to optimize them. This release introduces the important quat_slerp function as well as many optimizations for ARM processors.

ARM NEON performance can be surprising

RTM supports both ARMv7 and ARM64 and very often what is optimal for one isn’t optimal for the other. Worse, different devices disagree about what code is optimal, sometimes by quite a bit.

I spent a good deal of time trying to optimize two functions: quaternion multiplication and rotating a 3D vector with a quaternion. Rotating a 3D vector uses two quaternion multiplications.

For quaternion multiplication, I tried a few variations:

The first two implementations are inspired from the classic SSE2 implementation. This is the same code used by DirectX Math on SSE2 and ARM as well.

The third implementation is a bit more clever. Instead of using constants that must be loaded and applied in order to align our signs to leverage fused-multiply-add, we use the floating point negation instruction. This is done once and mixed in with the various swizzle instructions that NEON supports. This ends up being extremely compact and uses only 12 instructions with ARM64!

I measured extensively using micro benchmarks (with Google Benchmark) as well as within ACL. The results turned out to be quite interesting.

On a Pixel 3 android phone, with ARMv7 the scalar version was fastest. It beat the multiplication variant by 1.05x and the negation variant by 1.10x. However, with ARM64, the negation variant was best. It beat the multiplication variant by 1.05x and the scalar variant by 1.16x.

On a Samsung S8 android phone, the results were similar: scalar wins with ARMv7 and negation wins with ARM64 (both by a significant margin again).

On an iPad Pro with ARM64 the results agreed again with the negation variant being fastest.

I hadn’t seen that particular variant used anywhere else so I was quite pleased to see it perform so well with ARM64. In light of these results, RTM now uses the scalar version with ARMv7 and the negation version with ARM64.

Since rotating a 3D vector with a quaternion is two quaternion multiplications back-to-back, I set out to use the same tricks as above with one addition.

vector4f quat_mul_vector3(vector4f vector, quatf rotation)
{
    quatf vector_quat = quat_set_w(vector_to_quat(vector), 0.0f);
    quatf inv_rotation = quat_conjugate(rotation);
    return quat_to_vector(quat_mul(quat_mul(inv_rotation, vector_quat), rotation));
}

We first extend our vector3 into a proper vector4 by padding it with 0.0. Using this information, we can strip out a few operations from the first quaternion multiplication.

Again, I tested all four variants and surprisingly, the scalar variant won out every time both with ARMv7 and ARM64 on both android devices. The iPad saw the negation variant as fastest. Code generation was identical yet it seems that the iPad CPU has very different performance characteristics. As a compromise, the scalar variant is used with all ARM flavors. It isn’t optimal on the iPad but it remains much better than the reference implementation.

I suspect that the scalar implementation performs better because more operations are independent. Despite having way more instructions, there must be fewer stalls and this leads to an overall win. It is possible that this information can be better leveraged to further improve things but that is a problem for another day.

Compiler bugs bonanza

Realtime Math appears to put considerable stress on compilers and often ends up breaking them. In the first 10 years of my career, I found maybe 2-3 C++ compiler bugs. Here are just some of the bugs I remember from the past year:

And those are just the ones I remember from the top of my head. I also found one or two with ACL not in the above list. Some of those will never get fixed because the compiler versions are too old but thankfully the Microsoft Visual Studio team has been very quick to address some of the above issues.

Keep an eye out for buffer security checks

By default, when compiling C++, Visual Studio enables the /GS flag for buffer security checks.

In functions that the compiler deems vulnerable to the stack getting overwritten, it adds buffer security checks. To detect if the stack has been tampered with during execution of the function, it first writes a sentinel value past the end of the reserved space the function needs. This sentinel value is random per process to avoid an attacker guessing its value. Just before the function exits, it calls a small function that validates the sentinel value: __security_check_cookie().

The rules on what can trigger this are as follow (from the MSDN documentation):

  • The function contains an array on the stack that is larger than 4 bytes, has more than two elements, and has an element type that is not a pointer type.
  • The function contains a data structure on the stack whose size is more than 8 bytes and contains no pointers.
  • The function contains a buffer allocated by using the _alloca function.
  • The function contains a data structure that contains another which triggers one of the above checks.

This is all fine and well and you should never disable it program/file wide. But you should keep an eye out. In very performance critical code, this small overhead can have a sizable impact. I’ve observed this over the years a few times but it now popped up somewhere I didn’t expect it: my math library Realtime Math (RTM).

Non-zero cost abstractions

The SSE2 headers define a few types. Of interest to us today is __m128 but others suffer from the same issue as well (including wider types such as __m256). Those define a register wide value suitable for SIMD intrinsics: __m128 contains four 32 bit floats. As such, it takes up 16 bytes.

Because it is considered a native type by the compiler, it does not trigger buffer security checks to be inserted despite being larger than 8 bytes without containing pointers.

However, the same is not true if you wrap it in a struct or class.

struct scalarf
{
    __m128 value;
};

The above struct might trigger security checks to be inserted: it is a struct larger than 8 bytes that does not contain pointers.

Similarly, many other common math types suffer from this:

struct matrix3x3f
{
    __m128 x_axis;
    __m128 y_axis;
    __m128 z_axis;
};

Many years ago, in a discussion about an unrelated compiler bug, someone at Microsoft mentioned to me that it is best to typedef SIMD types than it is to wrap them in a concrete type; it should lead to better code generation. They didn’t offer any insights as to why that might be (and I didn’t ask) and honestly I had never noticed any difference until security buffer checks came into play, last Friday. Their math library DirectX Math uses a typedef for its vector type and so does RTM everywhere it can. But sometimes it can’t be avoided.

RTM also extensively uses a helper struct pattern to help keep the code clean and flexible. Some code such as a vector dot product returns a scalar value. But on some architectures, it isn’t desirable to treat it as a float for performance reasons (PowerPC, x86, etc). For example, with x86 float arithmetic does not use SSE2 unless you explicitly use intrinsics for it: by default it uses the x87 floating point stack (with MSVC at least). If this value is later needed as part of more SSE2 vector code (such as vector normalization), the value will be calculated from two SSE2 vectors, be stored on the x87 float stack, only to be passed back to SSE2. To avoid this roundtrip when using SSE2 with x86, RTM exposes the scalarf type. However, sometimes you really need the value as a float. The usage dictates what you need. To support both variants with as little syntactic overhead as possible, RTM leverages return type overloading (it’s not really a thing in C++ but it can be faked with implicit coercion). It makes the following possible:

vector4f vec1 = vector_set(1.0f, 2.0f, 3.0f);
vector4f vec2 = vector_set(5.0f, 6.0f, 7.0f);

scalarf dot_sse2 = vector_dot3(vec1, vec2);
float dot_x87 = vector_dot3(vec1, vec2);
vector4f dot_broadcast = vector_dot3(vec1, vec2);

This is very clean to use and the compiler can figure out what code to call easily. But it is ugly to implement; a small price to pay for readability.

// A few things omited for brevity
struct dot3_helper
{
    inline operator float()
    {
        return do_the_math_here1();
    }

    inline operator scalarf()
    {
        return do_the_math_here2();
    }

    inline operator vector4f()
    {
        return do_the_math_here3();
    }

    vector4f left_hand_side;
    vector4f right_hand_side;
};

constexpr dot3_helper vector_dot3(vector4f left_hand_side, vector4f right_hand_side)
{
    return dot3_helper{ left_hand_side, right_hand_side };
}

Side note, on ARM scalarf is a typedef for float in order to achieve optimal performance.

There is a lot of boilerplate but the code is very simple and either constexpr or marked inline. We create a small struct and return it, and at the call site the compiler invokes the right implicit coercion operator. It works just fine and the compiler optimizes everything away to yield the same lean assembly you would expect. We leverage inlining to create what should be a zero cost abstraction. Except, in rare cases (at least with Visual Studio as recent as 2019), inlining fails and everything goes wrong.

Side note, the above is one example why scalarf cannot be a typedef because we need it distinct from vector4f both of which are represented in SSE2 as a __m128. To avoid this issue, vector4f is a typedef while scalarf is a wrapping struct.

Many math libraries out there wrap SIMD types in a proper type and use similar patterns. And while generally most math functions are small and get inlined fine, it isn’t always the case. In particular, these security buffer checks can harm the ability of the compiler to inline while at the same time degrading performance of perfectly safe code.

8 instructions is too much

All of this worked well until I noticed, out of the blue, a performance regression in the Animation Compression Library when I updated to the latest RTM version. This was very strange as it should have only contained performance optimizations.

Here is where the code generation changed:

// A few things omited for brevity
__declspec(safebuffers) rtm::scalarf calculate_error_no_scale(const calculate_error_args& args)
{
    const rtm::qvvf& raw_transform_ = *static_cast<const rtm::qvvf*>(args.transform0);
    const rtm::qvvf& lossy_transform_ = *static_cast<const rtm::qvvf*>(args.transform1);

    const rtm::vector4f vtx0 = args.shell_point_x;
    const rtm::vector4f vtx1 = args.shell_point_y;

    const rtm::vector4f raw_vtx0 = rtm::qvv_mul_point3_no_scale(vtx0, raw_transform_);
    const rtm::vector4f raw_vtx1 = rtm::qvv_mul_point3_no_scale(vtx1, raw_transform_);

    const rtm::vector4f lossy_vtx0 = rtm::qvv_mul_point3_no_scale(vtx0, lossy_transform_);
    const rtm::vector4f lossy_vtx1 = rtm::qvv_mul_point3_no_scale(vtx1, lossy_transform_);

    const rtm::scalarf vtx0_error = rtm::vector_distance3(raw_vtx0, lossy_vtx0);
    const rtm::scalarf vtx1_error = rtm::vector_distance3(raw_vtx1, lossy_vtx1);

    return rtm::scalar_max(vtx0_error, vtx1_error);
}

Side note, as part of a prior effort to optimize that performance critical function, I had already disabled buffer security checks which are unnecessary here.

The above code is fairly simple. We take two 3D vertices in local space and transform them to world space. We do this for our source data (which is raw) and for our compressed data (which is lossy). We calculate the 3D distance between the raw and lossy vertices and it yields our compression error. We take the maximum value as our final result.

qvv_mul_point3_no_scale is fairly heavy instruction wise and it doesn’t get fully inlined. Some of it does but the quat_mul_vector3 it contains does not inline.

By the time the compiler gets to the vector_distance3 calls, the compiler struggles. Both VS2017 and VS2019 fail to inline 8 instructions (the vector_length3 it contains).

// const rtm::scalarf vtx0_error = rtm::vector_distance3(raw_vtx0, lossy_vtx0);
vsubps      xmm1,xmm0,xmm6
vmovups     xmmword ptr [rsp+20h],xmm1
lea         rcx,[rsp+20h]
vzeroupper
call        rtm::rtm_impl::vector4f_vector_length3::operator rtm::scalarf

Side note, when AVX is enabled, Visual Studio often ends up attempting to use wider registers when they aren’t needed, causing the addition of vzeroupper and other artifacts that can degrade performance.

// inline operator scalarf() const
// {
sub         rsp,18h
mov         rax,qword ptr [__security_cookie]
xor         rax,rsp
mov         qword ptr [rsp],rax
// const scalarf len_sq = vector_length_squared3(input);
vmovups     xmm0,xmmword ptr [rcx]
vmulps      xmm2,xmm0,xmm0
vshufps     xmm1,xmm2,xmm2,1
vaddss      xmm0,xmm2,xmm1
vshufps     xmm2,xmm2,xmm2,2
vaddss      xmm0,xmm0,xmm2
// return scalar_sqrt(len_sq);
vsqrtss     xmm0,xmm0,xmm0
// }
mov         rcx,qword ptr [rsp]
xor         rcx,rsp
call        __security_check_cookie
add         rsp,18h
ret

And that is where everything goes wrong. Those 8 instructions calculate the 3D dot product and the square root required to get the length of the vector between our two points. They balloon up to 16 instructions because of the security check overhead. Behind the scenes, VS doesn’t fail to inline 8 instructions, it fails to inline something larger than we see when everything goes right.

This was quite surprising to me, because until now I had never seen this behavior kick in this way. Indeed, try as I might, I could not reproduce this behavior in a small playground (yet).

In light of this, and because the math code within Realtime Math is very simple, I have decided to annotate every function to explicitly disable buffer security checks. Not every function requires this but it is easier to be consistent for maintenance. This restores the optimal code generation and vector_distance3 finally inlines properly.

I am now wondering if I should explicitly force inline short functions…

ACL right in your browser

This week, the first beta release of the Animation Compression Library JavaScript module has been released. You can find it on NPM here and you can try it live in your browser right here by drag and dropping glTF files.

The library should be usable but keep in mind until ACL reaches version 2.0 with backwards compatibility you might have to recompress when you upgrade to future versions.

The module uses the powerful emscripten compiler toolchain to take C++ and compile it into WebAssembly. That means that as part of this effort, ACL and Realtime Math have been upgraded to support emscripten as well. Note that WASM SIMD isn’t supported yet (contributions welcome).

Both compression and decompression are supported although if you enable dead code stripping in your JavaScript bundler, you should be able to only pay for what you use.

Special thanks to Arseny Kapoulkine and his excellent meshoptimizer library. Using his blog posts and his code as a guide, I was able to get up and running fairly quickly.

Next steps

Progress continues towards ACL 2.0 but I will take a few weeks to finish up RTM 2.0 first. It is now used extensively by the main ACL development branch. A few new features will be introduced, some cleanup remains, and I want to double check some optimizations before releasing it.

Morph target animation compression

The Animation Compression Library v1.3 introduced support for compressing animated floating point tracks but it wasn’t until now that I managed to get around to trying it inside the Unreal Engine 4 Plugin. Scalar tracks are used in various ways from animating the Field Of View to animating Inverse Kinematic blend targets. While they have many practical uses in modern video games, they are most commonly used to animate morph targets (aka blend shapes) in facial and cloth animations.

TL;DR: By using the morph target deformation information, we can better control how much precision each blend weight needs while yielding a lower memory footprint.

How do morph targets work?

Morph targets are essentially a set of meshes that get blended together per vertex. It is a way to achieve vertex based animation. Let’s use a simple triangle as an example:

Morph Targets Explained

In blue we have our reference mesh and in green our morph target. To animate our vertices, we apply a scaled displacement to every vertex. This scale factor is called the blend weight. Typically it lies between 0.0 (our reference mesh) and 1.0 (our target mesh). Values in between end up being a linear combination of both:

final vertex = reference vertex + (target vertex - reference vertex) * blend weight

Each morph target is thus controlled by a single blend weight and each vertex can end up having a contribution from multiple targets. With our blend weights being independent, we can compress them as a set of floating point curves. This is typically done by specifying a desired level of precision to retain on each curve. However, in practice this can be hard to control: blend weights have no units.

At the time of writing:

  • Unreal Engine 4 leaves its curves in a compact raw form (a spline) by default but they can be optionally compressed with various codecs.
  • Unity, Lumberyard, and CRYENGINE do not document how they are stored and they do not appear to expose a way to further compress them (as far as I know).

All of this animated data does add up and it can benefit from being compressed. In order to achieve this, the UE4 ACL plugin uses an intuitive technique to control the resulting quality.

Compressing blend weights

Our animated blend weights ultimately yield a mesh deformation. Can we use that information to our advantage? Let’s take a closer look at the math behind morph targets. Here is how each vertex is transformed:

final vertex = reference vertex + (target vertex - reference vertex) * blend weight

As we saw earlier, (target vertex - reference vertex) is the displacement delta to apply to achieve our deformation. Let’s simplify our equation a bit:

final vertex = reference vertex + vertex delta * blend weight

Let’s plug in some displacement numbers with some units and see what happens.

Reference Position Target Position Delta
5 cm 5 cm 0 cm
5 cm 5.1 cm 0.1 cm
5 cm 50 cm 45 cm

Let us assume that at a particular moment in time, our raw blend weight is 0.2. Due to compression, that value will change by the introduction of a small amount of imprecision. Let’s see what happens if our lossy blend weight is 0.22 instead.

Delta Delta * Raw Blend Weight Delta * Lossy Blend Weight Error
0 cm 0 cm 0 cm 0 cm
0.1 cm 0.02 cm 0.022 cm 0.002 cm
45 cm 9 cm 9.9 cm 0.9 cm

From this, we can observe some important facts:

  • When our vertex doesn’t move and has no displacement delta, the error introduced doesn’t matter.
  • The error introduced is linearly proportional to the displacement delta: when the error is fixed, a small delta will have a small amount of error while a large delta will have a large amount of error.

This means that it is not suitable to set a single precision value (in blend weight space) for all our blend weights. Some morph targets require more precision than others and ideally we would like to take advantage of that fact.

Specifying the amount of precision that a blend weight should have isn’t easy if we don’t know how much deformation it ultimately drives. A better way to specify how much precision we need is by specifying it in meaningful units. If we assume that we want our vertices to have a precision of 0.01 cm instead, how much precision do their blend weights need?

blend weight precision = vertex precision / vertex displacement delta

Notice how the vertex precision and vertex displacement delta units cancel out.

Delta Blend Weight Precision
0 cm Division by zero!
0.1 cm 0.1
45 cm 0.00022

When a vertex doesn’t move, it doesn’t matter what precision our blend weight retains since the final vertex position will be identical regardless. However, smaller displacements require less precision while larger displacements require more precision retained.

This is what the ACL plugin does: when a Skeletal Mesh is provided to the compression codec, ACL will use a vertex displacement precision value instead of a generic scalar track precision value. This is intuitive to tune for an artist because the units now have meaning: we can easily find out how much 0.01 cm represents within the context of our 3D model. Underneath the hood, this translates into unique precision requirements tailored to each morph target blend weight curve. This allows ACL to attain an even lower memory footprint while providing a strict guarantee on the visual fidelity of the animated deformations.

Curve Compression Example

A Boy and His Kite

In order to test this out, I used the GDC 2015 demo from Epic: A Boy and His Kite. The demo is available for free on the Unreal Marketplace under the Learning section.

It shows a boy running through a landscape along with his kite. He contains 811 animated curves within each of the 31 animation sequences that comprise the full cinematic. 692 of those curves end up driving morph targets for his facial and clothing deformations. Some of the shots have the camera very close to his face and as such retaining as much quality as possible is critical.

I decided to compare 5 codecs:

  • Compressed Rich Curves with an error threshold of 0.0 (default within UE 4.25)
  • Compressed Rich Curves with an error threshold of 0.001
  • Uniform Sampling
  • ACL with a generic precision of 0.001
  • ACL with a generic precision of 0.001 and a morph target deformation precision of 0.01 cm

The Compressed Rich Curves and Uniform Sampling codecs are built into UE4 and provide a trade-off between memory and speed. The rich curves will tend to have a lower memory footprint but evaluating them at runtime is slower than when uniform sampling is used.

ACL uses uniform sampling internally for very fast evaluation but it is also much more aggressive with its compression.

  Compressed Size Compression Rate
Compressed Rich Curves 0.0 3620.66 KB 1.0x (baseline)
Compressed Rich Curves 0.001 1458.77 KB 2.5x smaller
Uniform Sampling 2052.82 KB 1.8x smaller
ACL 0.001 540.24 KB 6.7x smaller
ACL with morph 0.01 cm 381.10 KB 9.5x smaller

I manually inspected the visual fidelity of each codec and everything looked flawless. By leveraging our knowledge of the individual morph target deformations, ACL manages to reduce the memory footprint by an extra 159.14 KB (29%).

The new curve compression will be available shortly in the ACL plugin v0.6 (suitable for UE 4.24) as well as v1.0 (suitable for UE 4.25 and later). The ACL plugin v1.0 isn’t out yet but it will come to the Unreal Marketplace as soon as UE 4.25 is released.