How to legally generate the Windows ISO from a Mac or Linux PC

27 May 2015

Surprisingly, it turns out that getting your hands on a legal Windows ISO after purchasing it is not as easy as it might seem. After purchasing a digital copy of Windows in the Microsoft store, you will be able to download a Windows application that downloads the ISO for you after entering your product key.

Therein lies the issue: to create a Windows install media, you need access to a PC with Windows already installed. If like me all you have on hand is a Mac or Linux PC without easy or quick access to a PC with Windows (or one that you can trust with your product key), you are left in a tight spot. Even the Microsoft support will not be able to help you and will simply point out that it can’t be that hard to find a PC with Windows.

Worse still is that not all versions of Windows are supported and for example, all versions that run on the Free Amazon Web Services are not supported so this avenue is not open to us either (at least at the time of writing).

The only viable alternative appears to be to download the ISO from an untrusted torrent site. If like me this leaves you uneasy, read on.

As it turns out, an unrelated tool that Microsoft releases will allow us to execute that ISO generating executable in a safe and legal way.

Step 1

Install Virtual Box. Virtual Box is a virtual machine application from Oracle that allows you to run an operating system in a virtualized environment as a guest.

Step 2

Download a Virtual Box image from the Internet Explorer Developer website. These images are legal versions of Windows provided by Microsoft to allow developers to test various Internet Explorer versions on various Windows versions. They expire after 90 days and cannot be activated but that doesn’t matter to us since we’ll only really use it for as long as the download of the ISO takes.

Simply grab the image for the platform that you have (Mac or Linux). Note that most images are 32 bit which will only allow you to later download the 32 bit ISO. At the time of writing, the Windows 10 image is 64 bit but requires the following fix to work properly.

Step 3

Inside Virtual Box, import the Windows image downloaded at the previous step. The documentation recommends you set at least 2GB of RAM for the virtual machine to use. This is fine since we’ll only run it for a short amount of time anyway.

Step 4

Use Virtual Box to share a directory with the virtual machine with write access and make sure it automatically mounts (easier for us). This is the directory where we will copy the ISO file into. At the time of writing it does not work with the Windows 10 image yet and as such you will need to plug in a USB key and share it with the virtual machine from the settings or share a network folder (I had more luck sharing the folder inside the Windows guest virtual machine and accessing it from my Mac).

Step 5

Launch the virtual machine instance and use it to run the executable from the Windows store. If you do not have access to it, I have not tested but presume that you might be able to use this utility as well.

Step 6

Follow the instructions and download the ISO to some directory (e.g: the desktop). For some reason Virtual Box did not allow me to download directly to the shared directory we setup in Step 4. Either way, once the download terminates, simply copy the ISO to the shared directory.

Complications

Sadly, the Windows 10 image is a bit finicky. You can’t shutdown properly and install the updates or it won’t boot again and you’ll have to start over. I could not manage to get my USB stick to work either and as such I had to create a shared directory inside the Windows guest. In order to be able to access the shared directory, I had to hot swap the virtual machine network card from NAT to Host bridge (you will need to add a host adapter in the Virtual Box preferences). I also needed to add a default gateway in the IPv4 settings of the Windows guest for my mac to be able to access it and I also needed to allow Guest access in the network settings.

Next, due to our previous hack to change the time to allow Windows to boot, the executable will refuse to download the ISO and complain it can’t connect to the internet. To resolve this I had to change the time inside the guest manually to the current date. I also needed to hot swap the network card back to the NAT setting and removed the default gateway.

Last but not least in order to copy the ISO out, I had to hot swap the network card once again to Host bridge and add back the default gateway.

Profit!

That’s it! You are done and you can now get rid of Virtual Box if you wish to. You can now continue the steps to create your bootable DVD or USB stick from the ISO.

There are so many threads out there asking how to do this that I hope this will be able to help someone avoid the hassle I went through to find this. You would think Microsoft would make it easier for you to install their operating system…

Linear Allocator

21 May 2015

The first memory allocator we will cover is by far the simplest and serves as an introduction to the material. Today we cover the linear memory allocator (code).

How it works

The internal logic is fairly simple:

We initialize the allocator with a pre-allocated memory buffer and its size.
Allocate simply increments a value indicating the current buffer offset.
Deallocate does nothing
Reallocate first checks if the allocation being resized is the last performed allocation and if it is, we return the same pointer and update our current allocation offset.
Reset is used to free all allocated memory and return the allocator to its original initialized state.

There is very little work to do for all these functions which makes it very fast. Deallocate does nothing because in practice, it makes little sense to support it and other allocators are often better suited when freeing memory at the pointer level is required. The only sane implementation we could do is similar to how Reallocate works by checking if the memory being freed is the last allocation (last in, first out). Because we work with a pre-allocated memory buffer, Reallocate does not need to perform a copy if the last allocation is being resized and there is enough free space left regardless of whether it grows or shrinks.

Interestingly, you can almost free memory at the pointer level by using Reallocate and using a new size of 0 but in practice, the alignment used for the allocation would remain lost forever (if it was originally mis-aligned).

Critical to this allocator is that it adds no overhead per allocation and it does not modify the pre-allocated memory buffer. Not all allocators will have these properties and I will always mention this important bit. This makes it ideal for low level systems or for working with read-only memory.

What can we use it for

Despite being a very simple allocator, it has a few uses. I have used it in the past with success to clean up code dealing with a lot of pointer arithmetic. The general idea is that if you have a memory buffer representing a custom binary format with most fields having a variable size and requiring alignment, you will end up with a lot of logic to take your raw buffer and split it into the various internal bits.

uintptr_t buffer;   // user supplied

size_t bufferOffset = 0;
uint32_t* numValue1 = reinterpret_cast<uint32_t*>(buffer);  // assume buffer is properly aligned for first value

bufferOffset = AlignTo(bufferOffset + 1 * sizeof(uint32_t), alignof(uint8_t));
uint8_t* values1 = reinterpret_cast<uint8_t*>(buffer + bufferOffset);
bufferOffset = AlignTo(bufferOffset + *numValue1 * sizeof(uint8_t), alignof(uint16_t));
uint16_t* numValue2 = reinterpret_cast<uint16_t*>(buffer + bufferOffset);
bufferOffset = AlignTo(bufferOffset + 1 * sizeof(uint16_t), alignof(float));
float* values2 = reinterpret_cast<float*>(buffer + bufferOffset);

Using a linear allocator to wrap the buffer allows you to manipulate it in an elegant and efficient manner.

LinearAllocator buffer(/* user supplied */);
uint32_t* numValue1 = new(buffer) uint32_t;
uint8_t* values1 = new(buffer) uint8_t[*numValue1];
uint16_t* numValue2 = new(buffer) uint16_t;
float* values2 = new(buffer) float[*numValue2];

Note that the above two versions are not 100% equivalent due to the fact C++11 offers no way to access the required alignment of the requested type when implementing the new operator. However, with macro support, the original intent above can be supported and be just as clear while supporting alignment properly. I plan to cover this important bit in a later post.

I wrote earlier that this allocator can be used with read-only memory which is a strange property for a memory allocator. Indeed, since it mostly abstracts away pointer arithmetic when partitioning a raw memory region without modifying it, we can use it to do just that over read-only memory. In the example above, this means that we can easily use it for writing and reading our custom binary format.

What we can’t use it for

Due to the fact that we do not add per allocation overhead, we cannot properly support freeing memory at the pointer level while supporting variable alignment. When support for freeing memory is needed, other allocators are better suited.

This allocator is also generally a poor fit for very large memory buffers. Due to the fact that we need to pre-allocate it up front, we bear the full cost regardless of how much actual memory we allocate internally.

Edge cases

There are two important edge cases with this allocator and they are shared by all allocators: overflow and out of memory conditions.

We can cause arithmetic overflow in two ways in this allocator: first by supplying a large alignment value, and second by attempting to allocate a large amount of memory. This is fairly simple to test but there is one small bit we must be careful with: if the alignment provided is very large, we can overflow in such a way that our resulting pointer would end up in our buffer either at a lower memory address if the allocation size is very large as well or at a higher memory address if the allocation size is small. The proper way to deal with this is to check overflow after taking into account the alignment and again after adjusting for the new allocation size.

We eventually run out of memory if we attempt to allocate more than our pre-allocated buffer owns.

Potential optimizations

While the implementation I provide aims for safety first, in practice a linear allocator should never run out of memory nor should it overflow if the logic using it is correct. This is because by providing a pre-allocated buffer, either we assume we do not need more memory or our assumption is wrong. In the later case, it is highly likely that we do not check the return value of allocations which is of little help even if our allocator is safe. Usage scenarios for this allocator are also generally simple in logic with few unknown variables to cause havoc.

In this light, a number of improvements can be made by stripping the overflow and out of memory checks and simply keeping asserts around. I may end up providing a way to do just that with template arguments or macros in the future.

Another option is to remove the branches added by the overflow and out of memory checks and simply ensure the internal state does not change instead. There is very little logic and as such an early out branch could very well save very little in the rare case it is taken and at the same time, since it is rarely taken, we always end up performing most of the logic anyway.

Last but not least, Reallocate support is often not required and could trivially be stripped as well.

Performance

Due to its simplicity, it offers great performance. All operations are O(1) and only amount to a few instructions.

On most 32 bit platforms, the size of an instance should be 24 bytes if size_t is used. On 64 bit platforms, the size should be 48 bytes with size_t. Both versions can be made even smaller with smaller integral types such as uint16_t (which would be very appropriate since this allocator is predominantly used with small buffers) or by stripping support for Reallocate. As such, either version will comfortably fit inside a single typical cache line of 64 bytes.

Conclusion

Despite its simple internals, the linear allocator is an important building block. It serves as an ideal example for a number of sibling allocators we will see in the next few posts which involve similar internal logic and edge cases.

Next up, we will cover a variation: the virtual memory aware linear allocator.

Alternate names

The most common alternate name for this allocator is called the arena allocator. However, that name is overloaded and is also often associated with a number of other allocators which manage a fixed memory region.

Note that if you know a better name or alternate names for this allocator, feel free to contact me.

Back to table of contents

Virtual memory explained

12 May 2015

Virtual memory is present on most hardware platforms and it is surprisingly simple. However, it is often poorly understood. It works so well and seamlessly that few inquire about its true nature.

Today we will take an in-depth look at why do we have virtual memory and how it works under the hood.

Physical memory: a scarce ressource

In the earlier days of computing, there was no virtual memory. The computing needs were simple and single process operating systems were common (batch processing was the norm). As the demand grew for computers, so did the complexity of the software running on them.

Eventually the need to run multiple processes concurrently appeared and soon became mainstream. The problem with multiple processes in regards to memory is three-fold:

Physical memory must be shared somehow
We must be able to access more than 64KB of memory using 16 bit registers
We must protect the memory to prevent tampering (malicious or accidental)

Memory segments

On x86, the first and second point were addressed first with real mode (1MB range) and later, memory protection was introduced with protected mode (16MB range). Virtual memory was born but it was not in the form most commonly seen today: early x86 used a segment based virtual memory model.

Segments were complex to manage but served their original purpose. Coupled with secondary storage, the operating system was now able to share the physical memory by swapping entire segments in and out of memory. Under 16 bit processors, segments had a fixed size of 64KB but later, when 32 bit processors emerged, segments grew to a variable and maximum size of 16MB. This later development had an unfortunate side-effect: due to the variable size of segments, physical memory fragmentation could now occur.

To address this, paging was introduced. Paging, like earlier segments, divides physical memory into fixed sized blocks. They also introduce an added indirection. With paging, segments were now further divided into pages and each segment further contained a page table to resolve the mapping between segment relative addresses and real effective physical addresses. By their nature, pages do not need to be contiguous within a given segment allowing them to resolve the memory fragmentation issues.

At this point in time, memory accesses now contain two indirections: first we must construct the segment relative address using a 16 bit segment index, reading the 32 bit segment base address associated and adding a 32 bit segment offset (386 CPUs shifted from 24 bit base addresses and offsets to 32 bit at the same time it introduced paging). This yields us a memory address that we must now look up in the segment page table to ultimately find the physical memory page (often called frame) that contains what we are looking for.

Modern virtual memory

Things now look much closer to modern virtual memory. Now that we have 32 bit processors and that they are common enough, there is no longer a need for segments and paging alone can be used. The x86 hardware of the time already used 32 bit segment base addresses and 32 bit segment offsets. Memory becomes far easier to manage if we assume it is a single memory segment with paging and it allows us to drop one level of indirection.

Most of this memory address translation logic now happens inside the MMU (memory management unit) and is helped by internal caches. This cache is now more commonly called the Translation Look-aside Buffer, or TLB for short.

Earlier x86 processors only supported 4KB memory pages. To accommodate this, a virtual memory address was split into three parts:

The first 10 bits represent an index in a page directory that is used to look up a page table.
The following 10 bits represent an index in the previously found page table that is used to look up a physical page frame.
The remaining 12 least significant bits are the final offset into the physical page frame leading to the desired memory address.

A dedicated register points to a page directory in memory. Both page directory entries and page table entries are 32 bits and contain flags to indicate memory protection and other attributes (cacheable, write combine, etc.). Another dedicated register holds the current process identifier which is used to tell which TLB entries are valid or not (and avoids the need for flushing the entire TLB when context switching between processes).

As memory grew, 4KB pages had difficulty scaling. Servicing large allocation requests requires mapping large numbers of 4KB pages putting more and more pressure on the TLB. The bookkeeping overhead also grows along with the number of pages used. Eventually, larger pages (4MB) were introduced to address these issues.

And then memory sizes grew even further. Soon enough, being confined to an address space of 4GB with 32 bit pointers became too small and physical address extension was introduced and extended the addressable range to 64GB. This forced larger pages to reduce to a size of 2MB as now some bits were needed for another indirection level in the page table walk.

Virtual memory today

Today, x64 hardware most commonly supports pages of the following sizes: 4KB, 2MB, and sometimes 1GB. Typically, only 48 bits are used limiting the addressable memory to 256TB.

Another important aspect of virtual memory today is how it interacts with virtualization (when present). Because physical memory must be shared between the running operating systems, page directory entries and page table entries now point into virtual memory instead of physical memory and require further TLB look-ups to resolve the complete effective physical address. Much like process identifiers, a virtualization instance identifier has been introduced for the same purpose: avoiding full TLB flushes when context switching.

Modern hardware will generally have separate caches per page size which means that to get the most performance, which page sizes are used for what data must be carefully planned. For example, in certain embedded platforms, it is not uncommon to have 4KB pages be used for code segments, data segments, and the stack while encouraging programmers to use 2MB pages inside their programs. It is also not uncommon to have virtual pages introduced: a virtual page will be composed of a smaller number of pages. For example, you might request allocations to use 64KB or 4MB pages even though the underlying hardware only supports 4KB and 2MB pages. The distinction is mostly important for the kernel since managing larger pages implies lower bookkeeping overhead and faster servicing.

An important point bears mentioning, when pages larger than 4KB are used, the kernel must find contiguous physical memory to allocate them. This can be a problem if page sizes are mixed, it opens the door to fragmentation to rear its ugly head. When the kernel fails to find enough contiguous space but it knows enough space would otherwise remain, it has two choices:

Bail out and return stating that you have run out of memory.
Defragment the physical memory by copying memory around and remapping the pages.

Generally speaking, if mixed page sizes are used, it is generally recommended to allocate large pages as early as possible in the process’s life to remedy the above problem.

Virtual memory secondary storage

The fact that modern hardware allows virtual memory to be mapped in your process without mapping all the required pages to physical memory enables modern kernels to spill memory onto secondary storage to artificially increase the amount of memory available up to the limits of that secondary storage.

As is common knowledge, the most common form of secondary storage is the swap file on your hard drive.

However, the kernel is free to place that memory anywhere and implementations exist where the memory will be distributed in a network or some other medium (e.g: memory mapped files).

Conclusion

This concludes our introduction to virtual memory. In later blog posts we will explore the TLB into further details and the implications virtual memory has on the CPU cache. Until then, here are some additional links:

A memory allocator interface

11 May 2015

As previously discussed, the memory allocator integration into C++ STL containers is far from ideal.

Attempts to improve on it have been made but I have yet to meet anyone satisfied with the current state of things.

Memory in large applications will typically come from two majors places: new/malloc calls and container allocations. Today we will only discuss the later.

Where are we at?

There appears to be two main approaches when it comes to memory allocator interfaces:

Templating the container with the allocator used (C++ STL, Unreal 3)
Implementing an interface and perform a virtual function call (Bitsquid)

The first, as previously discussed, has a number of downsides but on the up side, it is the fastest since all allocation calls are likely to either be inline or at least be a static branch.

The second, while much more flexible, introduces indirection and as previously discussed, is slower and generally less performant. Not only do we introduce an extra cache miss (and likely TLB miss) but we introduce an indirect branch which the CPU will not be able to prefetch.

Can we do better?

The compromise

It seems that the cost of added flexibility is to use an indirect branch. This is unavoidable. However, we can remove the extra cache and TLB miss by defining a partial inline virtual function dispatch table in the allocator interface.

The idea is simple:

There are typically few allocator instances (on the order of <100 instances isn’t unusual)
Allocator instances are often small in memory footprint and often will fit within one or two cache lines (even when they manage a lot of memory, the footprint will generally lie somewhere else in memory and not inline with the allocator)
Most container allocations can be implemented with a single function: realloc

We can thus conclude that it is a viable alternative to add a function pointer to a realloc function inline within the allocator and call this for all our needs.

Here is the code (which is also on github):

class Allocator
{
protected:
    typedef void* (*ReallocateFun)(Allocator*, void*, size_t, size_t, size_t);

    inline          Allocator(ReallocateFun reallocateFun);

public:
    virtual         ~Allocator() {}

    virtual void*   Allocate(size_t size, size_t alignment) = 0;
    virtual void    Deallocate(void* ptr, size_t size) = 0;

    inline void*    Reallocate(void* oldPtr, size_t oldSize, size_t newSize, size_t alignment);

    virtual bool    IsOwnerOf(void* ptr) const = 0;

protected:
    ReallocateFun   m_reallocateFun;
};

void* Allocator::Reallocate(void* oldPtr, size_t oldSize, size_t newSize, size_t alignment)
{
    assert(m_reallocateFun);
    return (*m_reallocateFun)(this, oldPtr, oldSize, newSize, alignment);
}

A number of things stand out and are important:

This is not a proper interface but instead an abstract base class.
We provide some virtual functions for common things and more will come later (debugging features, etc.)
Reallocate is inline and NOT virtual
Deallocate/Reallocate follow the latest C++ standard and include the size used when the original allocation was made to allow further optimizations within allocator implementations.
Alignment must be provided which is important for AAA video games, and more generally with SIMD code.
The first argument to the ReallocateFun is an instance of the base class itself. This is important because it points to a static free standing function as such when we call Reallocate, the implicit this present as first argument will simply be forwarded as is with no extra register shuffling (at least on x64) even if it ends up not inlined and allows the implementation to call a member function without shuffling registers as well.

Usage is very simple:

An allocator simply derives from this base class, implement the necessary function and initializes the base class with a function pointer to a suitable reallocate function.
Containers simply call Reallocate(nullptr, 0, size, alignment) to allocate, Reallocate(ptr, size, 0, alignment) to deallocate and simply supply all arguments to reallocate.

It is often the case that containers will simply reallocate (e.g: vector<..> with POD) and as such this interface is ideally suited for this. With this implementation, when a container performs an allocation or deallocation, the cache miss to access the reallocate function pointer will load relevant data in its cache line leading to less wasted cycles compared to the virtual function dispatch approach while maintaining all the flexibility needed by the indirection and the interface.

In the coming blog posts, I will introduce a number of memory allocators and containers that will use this new interface.

Caches everywhere

05 May 2015

The modern computer is littered with caches to help improve performance. Each cache has its own unique role. A solid understanding of these caches is critical in writing high performance software in any programming language. I will describe the most important ones present on modern hardware but it is not meant to be an exhaustive list.

The modern computer is heavily based on the Von Neumann architecture. This basically boils down to the fact that we need to get data into a memory of sorts before our CPU can process it.

This data will generally come from one of these sources: a chip (such as ROM), a hard drive or a network and this data must make it somehow all the way to the processor. Note that other external devices can DMA data into memory as well but the above three are the most common and relevant to today’s discussion.

I will not dive into the implications for all the networking stack and the caches involved but I will mention that there are a few and understanding them can yield good gains.

The kernel page cache

Hard drives are notoriously slow compared to memory. To speed things up, a number of caches lay in between your application and the file on disk you are accessing. From the disk up to your application, the caches are in order: disk embedded cache, disk controller cache, kernel page cache, and last but not least, whatever intermediate buffers you might have in your user space application.

For the vasty majority of applications dealing with large files, the single most important optimization you can perform at this level is to use MMAP to access your files. You can find an excellent write up about it here. I will not repeat what he says but add that this is a massive win not only because it allows avoiding needless copying of memory but also by mentioning that you can prefetch easily, something not so easily achieved with fread and the likes.

If you are writing an application that does a lot of IO or where it needs to happen as fast as possible (such as console/mobile games), you should ideally be using MMAP where it is supported (android supports it, iOS supports it but with 700MB virtual memory limit on 32bit processors, and both Windows and Linux obviously support it). I am not 100% certain, but in all likely hood, the newest game consoles (XboxOne and PlayStation 4) should support it as well. However, note that not all these platforms might have a kernel page cache but that it is a safe bet that going forward, newer devices are likely to support it.

The CPU cache

The next level of caches lays much closer to the CPU and take the form of the popular L1, L2, and L3 caches. Not all processors will have these and when they do, they might only have one or two levels. Higher levels are generally inclusive in regards to the lower levels but not always. For example, while the L3 will generally be inclusive of the L2, the L2 might not be inclusive of the L1. This is so because some instructions such as prefetching will prefetch in the L2 but not the L1 and thus potentially cause eviction of cache lines that are in the L1.

The CPU cache is always moving data in and out of memory in units called a cache line (they will always be aligned in memory to a multiple of the cache line size). Cache line sizes vary depending on the hardware. Popular values are: 32 bytes on some older mobile processors, 64 bytes on most modern mobile, desktop and game console processors, and 128 bytes on some PowerPC processors (notably the Xbox360 and the PlayStation 3).

This cache level will contain code that is executing, data being processed, and translation look-aside buffer entries (for both code and data, see next section). They might be dedicated to either code or data (e.g: L1) or be inclusive of both (e.g: L2/L3).

The CPU cache is usually N-way set associative. You can find a good write up about this here. Note that depending on the cache level, where the cache line index and tag comes from might differ. For example, the code L1 might use the physical memory address to calculate both the index and the tag while the data L1 might use the virtual memory address for both the index and the tag. The L2 is also free to choose whatever arrangement it pleases and note that in theory, they could be mixed (tag from virtual, index from physical).

When the CPU requests for a memory address, it is said to hit the cache if the desired cache line is inside a particular cache level and a miss if it is not and we must look either in a higher cache level or worse, main memory.

This level of caching is the primary reason why packing things as tight as possible in memory is important for high performance: fetching a cache line from main memory is slow and when you do, you want most of that cache line to contain useful information to reduce waste. The larger the cache line, the more waste will generally be present. Things can be pushed further by aligning your data along cache lines when you know a cache miss will happen for a particular data element and to group relevant members next to it. This should not be done carelessly, as it can easily degrade performance if you are not careful.

This level of caching is also an important reason while calling a virtual function is often painted as slow: not only must you incur a potential cache miss to read the pointer to the virtual function table but you must also incur a potential cache miss for the pointer to the actual function to call afterwards. It is also generally the case that these two memory accesses will not be on the same cache line as they will generally live in different regions in memory.

This level of caching is also partly responsible for explaining why reading and writing unaligned values is slower (when the CPU supports it). Not only must it split or reconstruct the value from potentially two different cache lines but each access is potentially a separate cache miss.

Generally speaking, all cache levels discussed up to here are shared between processes that are currently executed by the operating system.

The translation look-aside buffer

The next level of caching is the translation look-aside buffer, or TLB for short.

The TLB is responsible for caching results of the translation of virtual memory addresses into physical memory addresses. Translation happens with a granularity of a fixed page size and modern processors generally support two or more page sizes with the most popular on x64 hardware being 4KB and 2MB. Modern CPUs will often support 1GB pages but using anything but 4KB pages in a user space application is sometimes not trivial. However, on game console hardware, it is common for the kernel to expose this and using this important tool is important for high performance.

The TLB will generally have separate caches for the different page sizes and it often has several levels of caching as well (L1 and L2, note that these are separate from the CPU caches mentioned above). When the CPU requests the translation of a virtual memory address, since it does not yet know the page size used, it will look in all TLB L1 caches for all page sizes and attempt to find a match. If it fails, it will look in all L2 caches. These operations are often done in parallel at every step. If it fails to find a cached result, a table walk will generally follow or a callback into the kernel happens. This step is potentially very expensive! On x64 with 4KB pages, typically four memory accesses will need to happen to find which physical memory frame (frame is generally the word used to refer to the unit of memory management the hardware and kernel use) contains the data pointed to and a fifth memory access to finally load that data into the CPU cache. Using 2MB pages will remove one memory access reducing the total from five to four. Note that each time a TLB entry is accessed in memory, it will be cached in the CPU cache like any other code or data.

This sounds very scary but in practice, things are not as dire as they may seem. Since all the memory accesses are cached, a TLB miss does not generally result in five cache misses. Pages cover a large range of memory addresses and typically the top levels of the address used will cover a large range of virtual memory and as such, the least virtual memory touched, the less data the TLB needs to manage. However, a TLB entry miss at a level N will generally guaranty that all lower TLB entry accesses will result in cache misses as well. In essence, not all TLB misses are equal.

As must be obvious now, every time I mentioned the potential for CPU cache misses in the previous section, is now potentially even worse if it results in a TLB miss as well. For example, as previously mentioned, virtual tables require an extra memory access. This extra memory access, by virtue of being in a separate memory region (the virtual table itself is read only and compiled by the linker while the pointers to said virtual table could be on the heap, stack, or part of the data segment), will typically require a separate cache line for it and separate TLB entries. It is clear that the indirection has a very real cost, not only in terms of the branch the CPU must take that it cannot predict but also in the extra pressure on the CPU and TLB caches.

Note that when a hypervisor is used with multiple operating systems running concurrently, since the physical memory is shared between all of these, generally speaking when looking up a TLB entry, an additional virtual memory address translation might need to take place to find the true location. Depending on the security restrictions of the hypervisor, it might elect to share the TLB entries between guests and the hypervisor or not.

The micro TLB

Yet another popular cache that is not present (as far as I know) in x64 hardware but common on PowerPC and ARM processors is a higher level cache for the TLB. ARM calls this the micro TLB while PowerPC calls it ERAT.

This cache level, like the previous one, caches the result of translation of virtual addresses into physical addresses but it uses a different granularity from the TLB. For example, while the ARM processors will generally support pages of 4KB, 64KB, 1MB, and sometimes 16MB; the micro TLB will generally have a granularity of 4KB or 1MB. What this means is that the micro TLB will miss more often but will often hit in the TLB afterwards if a larger page is used.

This cache level will generally be split for code and data memory accesses but will generally contain mixed page sizes and it is often fully associative due to its reduced size.

The CPU, TLB, and micro TLB caches are not only shared by currently running processes of the current operating system but when running in a virtualized environment, they are also shared between all the other operating systems. When the hardware does not support the sharing by means of registers holding a process identifier and virtual environment identifier, generally these caches must be flushed or cleared when a switch happens.

The CPU register

The last important cache level in a computer I will discuss today is the CPU register. The register is the basic unit modern processors use to manipulate data. As has been outlined so far, getting data here was not easy and the journey was long. It is no surprise that at this level, everything is now very fast and as such packing information in a register can yield good performance gains.

Values are loaded in and out of registers directly into the L1 assisted by the TLB. Register sizes keep growing over the years: 16bit is a thing of the past, 32bit is still very common on mobile processors, 64bit is now the norm on newer mobile and desktop processors, and processors with multimedia or vector processing capability will often have 128bit or even 256bit registers. In this later case, this means that we only need two 256bit registers to hold an entire cache line (generally 64 bytes on these processors).

Conclusion

This last points hammers in the overall high performance mantra: don’t touch what you don’t need and do the most work you can with as little memory as possible.

This means loading as little as possible from a hard drive or the network when such accesses are time sensitive or look into compression: it is not unusual that simple compression or packing algorithms will improve throughput significantly.

Use as little virtual memory as you can and ideally use large pages if you can to reduce TLB pressure. Data used together should ideally resides close by in memory. Note that virtual memory exists primarily to help reduce physical memory fragmentation but the larger the pages you use, the less it helps you.

Pack as much relevant data as possible together to help ensure that it ends up on the same cache line or better yet, align it explicitly instead of leaving it to chance.

Pack as much relevant data as possible in register wide values and manipulate them as an aggregate to avoid individual memory accesses (bit packing).

Ultimately, each cache level is like an instrument in an orchestra: they must play in concert to sound good. Each has their part and purpose. You can tune individually all you like but if the overall order is not present, it will not sound good. It is thus not about the mastery of any one topic but in understanding how to make them work well together.

This post is merely an overview of the various caches and their individual impacts. A lot of information is incomplete and missing to keep this post concise (I tried..). I hope to revisit each of these topics in separate posts in the future when time will allow.

Older Newer

Nicholas Frechette's Blog Raw bits

How to legally generate the Windows ISO from a Mac or Linux PC

Step 1

Step 2

Step 3

Step 4

Step 5

Step 6

Complications

Profit!

Linear Allocator

How it works

What can we use it for

What we can’t use it for

Edge cases

Potential optimizations

Performance

Conclusion

Alternate names

Virtual memory explained

Physical memory: a scarce ressource

Memory segments

Modern virtual memory

Virtual memory today

Virtual memory secondary storage

Conclusion

A memory allocator interface

Where are we at?

The compromise

Caches everywhere

The kernel page cache

The CPU cache

The translation look-aside buffer

The micro TLB

The CPU register

Conclusion