
Low-level Hacks


📝 NOTE
The code examples used in this post can be found in this GitHub repository.

Low-level programming is often seen as a dark art - bit manipulation, memory tricks, and performance hacks that seem to defy the rules of “proper” programming. But these techniques, when used carefully and deliberately, can unlock significant performance gains in systems programming.

In this post, we’ll explore some fascinating low-level programming techniques that I’ve encountered in my journey through systems programming. We’ll look at two of them - type punning and struct layout reordering - that can make your code run faster.

⚠️ DISCLAIMER
Don’t do these tricks in your production code unless you know what you’re doing.

Type Punning

What is type punning?

Type punning is a technique in low-level programming that allows you to reinterpret the bits of a value as a different type. For example, you might view a sequence of bytes as an integer, or interpret a floating point number as its raw binary representation.

From a low-level perspective, it’s essentially using reinterpret_cast in C++ or transmute in Rust to reinterpret the bits of one type as another type. While the concept is straightforward, this technique can provide significant performance benefits.
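For example, here’s a minimal sketch (not tied to any particular codebase) of viewing an f32 as its raw bits in Rust, using both the unsafe transmute route and the safe to_bits helper the standard library provides:

fn main() {
    let x: f32 = 1.5;

    // Reinterpret the 4 bytes of the f32 as a u32. Sound here because
    // both types have the same size and every bit pattern is a valid u32.
    let bits_unsafe: u32 = unsafe { std::mem::transmute(x) };

    // The standard library exposes the same reinterpretation safely.
    let bits_safe: u32 = x.to_bits();

    assert_eq!(bits_unsafe, bits_safe);
    println!("{x} = {bits_safe:#010x}"); // prints: 1.5 = 0x3fc00000
}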

Fixed-sized key

The first time I encountered type punning was while reading the Masstree codebase. Masstree is a high-performance key-value store that combines the benefits of B-trees and tries. Though it was published in 2012, it remains state-of-the-art among in-memory key-value stores. The implementation uses type punning extensively to optimize key comparisons and memory layout.

According to the Masstree paper, this simple type punning optimization improves performance by 13% - 19% compared to byte-by-byte key comparisons. The elegance of this technique lies in its simplicity - by reinterpreting the bytes as integers, we can compare multiple bytes at once using native CPU instructions.

Here’s a simple example of how it works in Rust:

#![allow(incomplete_features)]
#![feature(generic_const_exprs)]

use std::marker::PhantomData;

/// Supplies the compile-time key size used by the arrays below.
pub trait FixSizedKeyParams {
    const KEY_SIZE: usize;
}

#[repr(C)]
union FixSizedKeyInner<T: FixSizedKeyParams>
where
    [(); T::KEY_SIZE]: Sized,
    [(); T::KEY_SIZE / 8]: Sized,
{
    i_: [u64; T::KEY_SIZE / 8], // the key viewed as 64-bit words
    c_: [u8; T::KEY_SIZE],      // the key viewed as raw bytes
}

pub struct FixSizedKey<T: FixSizedKeyParams>
where
    [(); T::KEY_SIZE]: Sized,
    [(); T::KEY_SIZE / 8]: Sized,
{
    key_u_: FixSizedKeyInner<T>,
    _phantom: PhantomData<T>,
}

(Note: The where clauses are necessary here due to Rust’s current limitations around const generics. It’s really gross, but it works.)

The structure is pretty straightforward. We have a byte array (c_) that stores the fixed-size key, and at the same time a u64 array (i_) overlays the same bytes as its punned representation. Because both fields of the union share the same memory, we can access the key either as individual bytes or as 64-bit integers, without any manual bit reinterpretation.

To preserve lexicographical ordering when comparing keys as integers, we need to account for endianness. On big-endian systems, the bytes are already in the correct order for lexicographical comparison; on little-endian systems (such as x86-64), we must byte-swap each word to maintain the same ordering semantics.
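Here’s a minimal sketch of the comparison (simplified, not the actual Masstree code): load the key 8 bytes at a time, normalize each word to big-endian, and compare the words as integers.

use std::cmp::Ordering;

/// Compare two 16-byte keys 8 bytes at a time. Interpreting each
/// chunk as a big-endian u64 preserves lexicographical byte order,
/// so the integer comparison matches a byte-by-byte memcmp.
fn key_cmp(a: &[u8; 16], b: &[u8; 16]) -> Ordering {
    for i in (0..16).step_by(8) {
        // from_be_bytes swaps on little-endian targets and is a
        // no-op on big-endian ones.
        let wa = u64::from_be_bytes(a[i..i + 8].try_into().unwrap());
        let wb = u64::from_be_bytes(b[i..i + 8].try_into().unwrap());
        match wa.cmp(&wb) {
            Ordering::Equal => continue,
            other => return other,
        }
    }
    Ordering::Equal
}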

The performance gain is significant, as the benchmark below shows:

(Benchmark: 100,000 random keys are initialized and sorted; both variants sort the same Vec.)

with_int_cmp/vec sort   time:   [7.6443 ms 7.6465 ms 7.6489 ms]

without_int_cmp/vec sort
                        time:   [11.193 ms 11.196 ms 11.200 ms]

Struct layout reordering

(Note: This trick is primarily effective on x86-64 architectures.)

Anyone who has worked with C/C++ knows that struct layout is critical for performance. Aligning structures to 8-byte boundaries is a common practice to ensure efficient memory access.

For those familiar with high-performance concurrent programming, you may have encountered False Sharing - a performance issue where multiple CPU cores invalidate each other’s cache lines even when accessing different variables. To prevent this, we need to align our data structures to cache line boundaries (typically 64 bytes on modern CPUs) rather than just 8-byte boundaries.
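As a quick, generic illustration (a common pattern rather than anything from the Masstree code), padding each hot counter out to its own cache line prevents two threads from fighting over the same line:

use std::sync::atomic::{AtomicU64, Ordering};

// Each counter is aligned to 64 bytes, so it occupies its own cache
// line and updates to one never invalidate the line holding the other.
#[repr(align(64))]
struct PaddedCounter(AtomicU64);

struct Stats {
    reads: PaddedCounter,  // bumped by reader threads
    writes: PaddedCounter, // bumped by writer threads
}

fn record_read(stats: &Stats) {
    stats.reads.0.fetch_add(1, Ordering::Relaxed);
}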

However, there’s even more we can do to optimize memory layout and access patterns. Let’s explore a more advanced technique.

Reducing memory access

Imagine that we have a 4-way tree structure, where each node has four children and three 16-byte keys in between.

#[repr(C)]
struct Key {
    _c: [u8; 16],
}

// Placeholder for whatever payload type the tree stores.
struct Value;

#[repr(C, align(64))]
struct Node {
    keys: [Key; 3],           // 48 bytes
    values: [*mut Value; 3],  // 24 bytes
    children: [*mut Node; 4], // 32 bytes
}

When traversing the tree, we need to compare the target key with the keys stored in each node. Most of the time, we’ll either find a match and access the corresponding value, or determine which child pointer to follow based on the comparison.

A key insight is that the comparison often only needs the first few bytes of the keys - we rarely need to compare the full 16 bytes. By rearranging the struct layout to place the first 8 bytes of all keys at the top, followed by the child pointers, and then the remaining 8 bytes of the keys, we can optimize our memory access pattern.

This layout means that for the common case - where we only need to check the first 8 bytes and then follow a child pointer - we can complete the node visit by fetching just one cache line instead of two. The optimized layout looks like this:

#[repr(C, align(64))]
pub struct FourTreeNode {
    // --- first cache line (64 bytes) ---
    keys_0: [u64; 3],           // first 8 bytes of each key
    p_children: [*const u8; 4], // child pointers
    _padding_0: u64,
    // --- second cache line (64 bytes) ---
    keys_1: [u64; 3],           // remaining 8 bytes of each key
    p_values: [*const u8; 4],   // value pointers
    _padding_1: u64,
}
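To see how this pays off, here’s a hypothetical lookup helper (not from the repository) building on the struct above. It assumes the three keys are stored in sorted order and that the incoming key has already been split into two big-endian u64 words:

impl FourTreeNode {
    /// Pick the child slot to descend into. In the common case the
    /// 8-byte prefixes differ, so only the first cache line
    /// (keys_0 + p_children) is ever touched.
    fn child_index(&self, key_hi: u64, key_lo: u64) -> usize {
        for i in 0..3 {
            if key_hi < self.keys_0[i] {
                return i;
            }
            if key_hi == self.keys_0[i] && key_lo < self.keys_1[i] {
                // Prefix tie: only now do we read keys_1, which lives
                // on the second cache line.
                return i;
            }
        }
        3 // greater than (or equal to) all three keys: rightmost child
    }
}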

The performance benefit here is modest, but it could be improved further with manual prefetching (see the sketch after the benchmark results below):

(Benchmark: 100,000 key-value pairs are inserted, then every key is looked up; the queried keys are the same keys that were inserted.)

four_tree_naive/lookup  time:   [288.41 ns 289.37 ns 290.44 ns]

four_tree_reordered/lookup
                        time:   [275.80 ns 277.57 ns 279.80 ns]