KawaiiAvii



In an OLTP database processing 1 million transactions per second, every microsecond matters. Each transaction allocates dozens of temporary objects: query ASTs, execution plans, intermediate results, lock metadata. Traditional general-purpose allocators, like C's malloc, spend precious cycles walking free lists and coalescing blocks. But here's the insight: transaction lifetimes are perfectly scoped - once a transaction commits or aborts, all its allocations become garbage simultaneously. This is exactly the workload pattern arena allocators were designed to dominate.

In my efforts to understand how allocators work in Zig, and how they help programmers build a better mental model for implementing memory management strategies, I've decided to write allocators from scratch. This is the first post in a series on implementing custom allocators, viewed through the lens of OLTP systems.

Naturally, the first question one might ask is why arena allocators would be needed for OLTP systems, or how they could be useful. It should be noted early that OLTP systems handle many short, frequent transactions. These can take the form of database inserts, queries, updates, and so on. Each transaction might need to allocate and free memory for query parsing, network buffering, caching, and temporary results. High latency on any of these fronts can hurt a business badly; many things can go wrong because of transaction processing latency, or, to put it another way, "engineering process laziness".
The following struct is a simple but easily understandable way to see how arena allocators are designed with performance in mind.

const ArenaAllocator = struct {
    buffer: []u8,        // Pre-allocated to avoid syscalls
    offset: usize,       // Fast: just an integer add per allocation
    used: usize,         // Bytes handed out so far
    parent_allocator: std.mem.Allocator,
    
    // offset only moves forward until reset
    // This makes allocation O(1) with no branching
};

The above gives a solid understanding of the allocation path:

  1. The "happy path" is branch-free
  2. No locks needed (this is only regarding the state of the allocator)
  3. No syscalls
  4. Relatively easy to predict latency
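
To ground the allocation-path list above, here is a minimal sketch of how such an arena could be initialized. It matches the struct fields and the `init(size, parent)` call used in the tests later, but treat it as an illustration rather than the final implementation:

```zig
const std = @import("std");

const ArenaAllocator = struct {
    const Self = @This();

    buffer: []u8,
    offset: usize,
    used: usize,
    parent_allocator: std.mem.Allocator,

    // One allocation from the parent up front; every allocation
    // afterwards just bumps `offset` without touching the parent.
    pub fn init(size: usize, parent_allocator: std.mem.Allocator) !Self {
        return Self{
            .buffer = try parent_allocator.alloc(u8, size),
            .offset = 0,
            .used = 0,
            .parent_allocator = parent_allocator,
        };
    }

    // Single call to free the entire backing buffer.
    pub fn deinit(self: *Self) void {
        self.parent_allocator.free(self.buffer);
    }
};
```

Because the buffer is grabbed once at startup, the hot path never pays for a syscall or a parent-allocator lock.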

The Bump-pointer

For allocation, arenas use what is called bump-pointer allocation: a technique where an allocation is handled by simply incrementing a pointer forward by the size of the requested data, provided there is enough room left in the buffer held in the arena's internal state. This is extremely quick and effective, since lookups reduce to simple pointer arithmetic. Instead of freeing each object individually, an arena frees everything in its state all at once, through either a reset or a destroy at the end of a transaction (the results can be written to the database before the reset or destruction). Here is a simple example of how the bump pointer advances through the arena's buffer. Also, to ensure data sits at memory addresses the CPU can access efficiently, we calculate alignment by rounding up to the nearest multiple of the required boundary. This matters because the mask determines where the bump pointer lands, and the allocator relies on that for consistent, contiguous data.

        // determine the alignment of T, then subtract 1 to get the mask
        const alignment_mask = @alignOf(T) - 1;
        // this determines the aligned starting point of the bump-pointer for the data
        const alignment_offset = (self.offset + alignment_mask) & ~@as(usize, alignment_mask);
        // this determines the placement of the bump-pointer after the allocation
        const allocation_end = alignment_offset + size_in_bytes;

        // bounds checking
        if (allocation_end > self.buffer.len) return error.OutOfMemory;
        // reassign the offset to the end of the allocation placement.
        // the bump-pointer is updated.
        self.offset = allocation_end;
        self.used += size_in_bytes;

        // u8_ptr points to the start of the first byte
        // of the data that was used for allocation
        // using the alignment_offset (starting point)
        const u8_ptr = &self.buffer[alignment_offset];

The point here is that transactions follow a short, fairly simple life-cycle: Begin -> Allocate many objects -> Commit/Abort -> Cleanup. This would be considered bulk allocation, and it is ideal for an arena allocator that uses its buffer to aggregate the data. The crucial part is that all of these allocated objects share the same lifetime: the duration of the transaction. Whether the transaction commits or aborts, all temporary memory must be freed simultaneously upon completion, usually with defer arena.deinit(). This creates a memory pattern of bulk allocation -> bulk deallocation.
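
Putting the earlier fragments together, a complete generic `alloc` might look like the following sketch. The stripped-down `Arena` wrapper and the caller-supplied buffer are mine, added so the snippet stands alone; the body is just the fragments shown above assembled in order:

```zig
const std = @import("std");

const Arena = struct {
    buffer: []u8,
    offset: usize = 0,
    used: usize = 0,

    // Generic bump-pointer allocation: align, bounds-check, bump, cast.
    pub fn alloc(self: *Arena, comptime T: type, item_count: usize) ![]T {
        const size_in_bytes = item_count * @sizeOf(T);

        // round the offset up to T's alignment boundary
        const alignment_mask = @alignOf(T) - 1;
        const alignment_offset = (self.offset + alignment_mask) & ~@as(usize, alignment_mask);
        const allocation_end = alignment_offset + size_in_bytes;

        // bounds check against the pre-allocated buffer
        if (allocation_end > self.buffer.len) return error.OutOfMemory;

        // bump the pointer past this allocation
        self.offset = allocation_end;
        self.used += size_in_bytes;

        // reinterpret the raw bytes as a []T slice
        const u8_ptr: [*]u8 = self.buffer.ptr + alignment_offset;
        const type_ptr: [*]T = @ptrCast(@alignCast(u8_ptr));
        return type_ptr[0..item_count];
    }
};
```

Note that the alignment math is relative to the buffer's start, so the backing buffer itself should be allocated with sufficient alignment for the types you plan to store.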

The malloc bottleneck in OLTP

General-purpose heap allocators (malloc/free) are typically built for heterogeneous, long-lived allocations. Their time-complexity overhead can violate OLTP's low-latency promises. In a high-concurrency OLTP server, many threads process transactions simultaneously. More often than not, malloc uses global or per-bin locks to manage its internal free lists, which can lead to lock contention: threads block while waiting for memory locks, and latency climbs. With an arena, cleanup becomes almost trivial: the parent allocator frees the arena's entire buffer in one call.

pub fn deinit(self: *Self) void {
    // Single call to free the entire large block (buffer)
    self.parent_allocator.free(self.buffer);
}

This can also be done without touching the parent allocator at all, by calling arena.reset().

    pub fn reset(self: *Self) void {
        self.offset = 0;
        self.used = 0;
    }

Writing over used memory is fine as long as it doesn't hold sensitive information. If it does, it's safer to zero out the used portion of the buffer first (for example with Zig's @memset builtin) before reusing it.
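
As a hedged sketch of that idea, a scrubbing variant of reset could zero only the bytes that were actually handed out before rewinding the pointer. The `resetSecure` name and the minimal `Arena` wrapper here are illustrative, not part of the allocator shown above:

```zig
const std = @import("std");

const Arena = struct {
    buffer: []u8,
    offset: usize = 0,
    used: usize = 0,

    // Zero only the bytes that were handed out, then rewind
    // the bump pointer so the buffer can be reused safely.
    pub fn resetSecure(self: *Arena) void {
        @memset(self.buffer[0..self.offset], 0);
        self.offset = 0;
        self.used = 0;
    }
};
```

Zeroing only `buffer[0..offset]` keeps the scrub proportional to what was actually used, rather than paying to wipe the whole buffer every transaction.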

When a transaction completes, calling arena.reset() doesn't touch the memory block itself; it just moves the bump-pointer back to the beginning of the buffer (buffer[0]). This is great because it makes memory instantly available for the next transaction. It's ideal for something like a business account performing repeated transfers, where consecutive transactions benefit from cache locality. I try to demonstrate that in my test cases below.

test "2 allocations in a row" {
    var arena = try ArenaAllocator.init(1024, std.testing.allocator);
    defer arena.deinit();
    const t = try arena.alloc(u32, 8);
    const b = try arena.alloc(u32, 10);
    t[0] = 100;
    b[0] = 100;
    try std.testing.expectEqual(@as(u32, 100), t[0]);
    try std.testing.expectEqual(@as(u32, 100), b[0]);
}

Perfect cache locality

test "many small allocations" {
    var arena = try ArenaAllocator.init(1024, std.testing.allocator);
    defer arena.deinit();

    for (0..100) |i| {
        // two 1-byte allocations of type u8 per iteration
        const data = try arena.alloc(u8, 1);
        const data2 = try arena.alloc(u8, 1);
        data[0] = @intCast(i % 256);
        data2[0] = @intCast(i % 256);
    }
}

Each allocation is O(1), keeping latency low and consistent. The second of my two tests above shows that arenas shine in subtle ways others might not even think about. Imagine a server thread processing a request to transfer funds between two accounts. That single transaction requires creating several temporary, short-lived data structures.

| Data Structure      | Purpose                                       | Size                     | Lifetime                      |
| ------------------- | --------------------------------------------- | ------------------------ | ----------------------------- |
| Transaction State   | Holds the ID, amount, and status.             | Small (e.g., 32 bytes)   | Ends with transaction.        |
| Account A Lock Node | Used to temporarily lock Account A.           | Small (e.g., 16 bytes)   | Ends when lock is released.   |
| Account B Lock Node | Used to temporarily lock Account B.           | Small (e.g., 16 bytes)   | Ends when lock is released.   |
| Audit Log Entry     | Details of the change before writing to disk. | Medium (e.g., 128 bytes) | Ends with transaction commit. |
| Result Metadata     | Status codes and message for the client.      | Small (e.g., 8 bytes)    | Ends when response is sent.   |

All of these objects are logically interconnected because they are all part of the same atomic transaction.
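
To make the table concrete, here is a sketch of the transfer transaction allocating all five objects from one arena. The struct layouts are invented for illustration, and I use the standard library's std.heap.ArenaAllocator here so the example stands alone:

```zig
const std = @import("std");

// Illustrative shapes matching the table; real systems would differ.
const TransactionState = struct { id: u64, amount: i64, status: u8 };
const LockNode = struct { account_id: u64, held: bool };
const AuditLogEntry = struct { before: i64, after: i64, note: [64]u8 };
const ResultMetadata = struct { code: u16, len: u16 };

test "transfer transaction allocates into one arena" {
    // all five objects share the transaction's lifetime,
    // so they all come from (and die with) one arena
    var arena = std.heap.ArenaAllocator.init(std.testing.allocator);
    defer arena.deinit(); // bulk deallocation at transaction end
    const a = arena.allocator();

    const state = try a.create(TransactionState);
    const lock_a = try a.create(LockNode);
    const lock_b = try a.create(LockNode);
    const audit = try a.create(AuditLogEntry);
    const result = try a.create(ResultMetadata);

    state.* = .{ .id = 1, .amount = 500, .status = 0 };
    lock_a.* = .{ .account_id = 10, .held = true };
    lock_b.* = .{ .account_id = 20, .held = true };
    audit.* = .{ .before = 1000, .after = 500, .note = undefined };
    result.* = .{ .code = 200, .len = 0 };

    try std.testing.expectEqual(@as(u16, 200), result.code);
}
```

Nothing in the transaction body calls free; the single deinit at the end is the entire cleanup path.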

## Arena vs Malloc for Cache Locality

  1. Arena Allocator (Bump-pointer)

When the arena allocator handles these five objects, the bump pointer lays them out sequentially in memory:

 [Trans. State][Lock A][Lock B][Audit Log][Result] → Contiguous Block

The mental model here is that the transaction starts and allocates the transaction state. When the CPU requests this data (backed by the page_allocator, the parent allocator), it loads a 64-byte cache line into its L1 cache. The other small objects (Lock A, Lock B, etc.) immediately follow the transaction state, so they are already resident in the same cache line. As the transaction proceeds (acquiring locks, checking state, building the audit log) the CPU accesses these subsequent objects with near-zero latency; they are instantly available as cache hits. This tight packing in the arena's linear buffer reduces time spent waiting on memory fetches, which is absolutely needed and expected in an OLTP server. In the following Zig I try to demonstrate this using a many-item pointer that treats the arena's allocation offset as the start of the transaction state, running through to its result metadata.
```zig
// u8_ptr points to the first byte of the allocated data,
// using alignment_offset as the starting point
// (the transaction state's starting point)
const u8_ptr = &self.buffer[alignment_offset];

// cast to a many-item pointer, not just a single u8 pointer,
// so the result can become the return type []T
// you can index into this: type_ptr[0], type_ptr[1], type_ptr[2], etc.
// (the data structures used as part of the transaction)
const type_ptr: [*]T = @ptrCast(@alignCast(u8_ptr));

// collect item_count items starting at type_ptr's address
// and return that slice (transaction state ... result metadata)
const result = type_ptr[0..item_count];
return result;
```
  2. General-Purpose (malloc)

Typically, a general-purpose heap allocator aims for efficiency by finding the smallest available hole in memory that can fit the requested size. Although C has allocators with more sophisticated strategies, such as glibc malloc, jemalloc, mimalloc, and tcmalloc, they still typically revolve around a "find the smallest hole" search.
 [Lock A]…Gap…[Audit Log]…Gap…[Lock B]…Gap…[Trans. State]

The problem that arises is scattered memory. Each of the five small objects could be allocated far from the others on the heap, depending on what other threads are doing. Whenever the transaction needs to jump from checking the Transaction State to the Account A Lock Node, the memory address could very well land in a completely different 64-byte block; there is no contiguous sequence between the objects. The CPU is therefore forced to fetch a new cache line for virtually every step, leading to constant cache misses (~30-50 ns each). These CPU stalls cause latency spikes, which are catastrophic for P99 latency in OLTP.

## Closing thoughts: what I learned

The choice not to use a bump-pointer arena allocator in OLTP can be a form of the "engineering process laziness" I mentioned earlier: sacrificing predictable, low latency for the sake of using a general-purpose tool. Think of the best mechanic you know: they take their time and use specialized precision tools. Other mechanics might finish your car faster, but the 10x mechanic's care is unmatched. The moral? Choose the right tool, even if it seems harder initially; it will simplify the job going forward. When architecting mission-critical OLTP services, where every microsecond counts, an O(1) arena allocator's speed and cache locality are that carefully crafted, time-tested tool. By using Zig to implement this custom memory strategy, you gain two distinct advantages:

  1. Elimination of Lock Contention: Thread-local arenas bypass the global locks that choke malloc in concurrent servers, ensuring maximum parallelism.
  2. Deterministic Performance: By using bulk allocation → bulk deallocation, complex, slow heap allocations are replaced by a single, fast pointer reset (self.offset = 0), guaranteeing consistent and predictable cleanup after every transaction.

If you made it this far, thank you for reading my first blog piece on how Zig and custom allocators can revolutionize OLTP servers. I plan to write plenty more. If I made any mistakes or over-guaranteed anything in my writing, please let me know. Thank you again, and shout-out to Tiger Beetle 🪲 :3.