In an OLTP database processing 1 million transactions per second, every microsecond matters. Each transaction allocates dozens of temporary objects: query ASTs, execution plans, intermediate results, lock metadata. Traditional general-purpose allocators like C's malloc spend precious cycles walking free lists and coalescing blocks. But here's the insight: transaction lifetimes are perfectly scoped. Once a transaction commits or aborts, all its allocations become garbage simultaneously. This is exactly the workload pattern arena allocators were designed to dominate.
In my efforts to understand how allocators work in Zig, and how the language helps programmers build a better mental model for implementing memory management strategies, I've decided to write allocators from scratch. This is the first post in a series on implementing custom allocators, viewed through the lens of OLTP systems.
Naturally, the first question one might ask is: what does an arena allocator actually look like? The struct below is a simple but easily understandable way to see how arenas are designed with performance in mind.
```zig
const ArenaAllocator = struct {
    buffer: []u8, // Pre-allocated to avoid syscalls
    offset: usize, // Fast: just an integer add per allocation
    used: usize,
    parent_allocator: std.mem.Allocator,
    // offset only moves forward until reset
    // This makes allocation O(1) with no branching
};
```
The above gives a solid understanding of the allocation path:
- The happy path is branch-free
- No locks needed (this applies only to the allocator's own state)
- No syscalls
- Predictable latency
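The tests later in this post construct the arena with `ArenaAllocator.init(1024, std.testing.allocator)`, but that constructor isn't shown above. Here is a minimal sketch of what it might look like, repeating the fields from the struct so the snippet is self-contained (the `init` signature is an assumption inferred from the tests, not the definitive implementation):

```zig
const std = @import("std");

const ArenaAllocator = struct {
    const Self = @This();

    buffer: []u8,
    offset: usize,
    used: usize,
    parent_allocator: std.mem.Allocator,

    /// Grab one large block from the parent up front;
    /// every later allocation is just pointer arithmetic.
    pub fn init(size: usize, parent_allocator: std.mem.Allocator) !Self {
        return Self{
            .buffer = try parent_allocator.alloc(u8, size),
            .offset = 0,
            .used = 0,
            .parent_allocator = parent_allocator,
        };
    }

    /// Return the single block to the parent allocator.
    pub fn deinit(self: *Self) void {
        self.parent_allocator.free(self.buffer);
    }
};
```

The single parent allocation up front is what keeps syscalls off the per-transaction hot path.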
## The Bump-Pointer
Arenas use what is called bump-pointer allocation: a technique where each allocation is handled by simply incrementing a pointer forward by the size of the requested data, provided there is enough room left in the buffer held in the arena's internal state. This is extremely quick and effective; the whole operation is pointer arithmetic. Instead of freeing each object individually, an arena frees everything in its state all at once, through either a reset or a destroy at the end of a transaction (the results can be written out to the database before the reset or destruction).

Below is a simple example of how the bump pointer advances through the arena's buffer. To ensure data sits at memory addresses the CPU can access efficiently, we compute the alignment by rounding the offset up to the nearest multiple of the required boundary. This matters because the mask determines where the bump pointer lands, which the allocator relies on for consistent, contiguous data.
```zig
// determine the alignment of T then subtract 1 to get the mask
const alignment_mask = @alignOf(T) - 1;
// round the current offset up to the next aligned boundary:
// this is where the new allocation will start
const alignment_offset = (self.offset + alignment_mask) & ~@as(usize, alignment_mask);
// where the bump pointer will sit after this allocation
const allocation_end = alignment_offset + size_in_bytes;
// bounds checking
if (allocation_end > self.buffer.len) return error.OutOfMemory;
// move the offset to the end of the allocation:
// the bump pointer is updated
self.offset = allocation_end;
self.used += size_in_bytes;
// u8_ptr points to the first byte of the newly
// allocated data, at alignment_offset (the starting point)
const u8_ptr = &self.buffer[alignment_offset];
```
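To see the mask arithmetic concretely: suppose the offset is currently 5 and we allocate a `u32` (4-byte alignment). This small standalone test, independent of the arena struct, walks through the rounding:

```zig
const std = @import("std");

test "alignment mask rounds up to the next boundary" {
    // @alignOf(u32) is 4, so the mask is 0b011.
    const alignment_mask: usize = @alignOf(u32) - 1;
    const offset: usize = 5;
    // (5 + 3) & ~3 == 8: the next multiple of 4 at or after 5.
    const aligned = (offset + alignment_mask) & ~alignment_mask;
    try std.testing.expectEqual(@as(usize, 8), aligned);
    // An already-aligned offset stays put: (8 + 3) & ~3 == 8.
    try std.testing.expectEqual(@as(usize, 8), (aligned + alignment_mask) & ~alignment_mask);
}
```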
The point here is that transactions follow a short, predictable life-cycle: Begin -> Allocate many objects -> Commit/Abort -> Cleanup. This is exactly the scoped-lifetime pattern arenas were built for.
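That life-cycle maps directly onto the arena API. Here is a hypothetical per-transaction flow, assuming the `ArenaAllocator`, `alloc`, `reset`, and `deinit` sketched in this post (the object sizes are illustrative):

```zig
test "transaction life-cycle: begin, allocate, commit, cleanup" {
    // Begin: one arena per transaction (or per worker thread).
    var arena = try ArenaAllocator.init(4096, std.testing.allocator);
    defer arena.deinit();

    // Allocate many short-lived objects.
    const state = try arena.alloc(u8, 32); // transaction state
    const lock_a = try arena.alloc(u8, 16); // lock node
    const audit = try arena.alloc(u8, 128); // audit log entry
    state[0] = 1;
    lock_a[0] = 2;
    audit[0] = 3;

    // Commit/Abort -> Cleanup: everything becomes garbage at once.
    try arena.reset();
    try std.testing.expectEqual(@as(usize, 0), arena.used);
}
```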
## The malloc bottleneck in OLTP
General-purpose heap allocators (malloc/free) are built for heterogeneous, long-lived allocations, and the overhead of their bookkeeping can violate OLTP's low-latency promises. In a high-concurrency OLTP server, many threads process transactions simultaneously, and malloc implementations more often than not use global or per-bin locks to manage their internal free lists, which leads to lock contention. Latency climbs as threads block, waiting for those locks to be released. With an arena this cleanup becomes almost trivial: a single call to the parent allocator frees the arena's entire buffer.
```zig
pub fn deinit(self: *Self) void {
    // Single call to free the entire large block (buffer)
    self.parent_allocator.free(self.buffer);
}
```
This can also be done without touching the parent allocator at all, by calling arena.reset():
```zig
pub fn reset(self: *Self) !void {
    self.offset = 0;
    self.used = 0;
}
```
Writing over previously used memory is fine as long as it doesn't hold sensitive information. If it does, the used portion of the buffer should be zeroed before reuse, for example with @memset(self.buffer[0..self.used], 0), before rewinding the offset.
When a transaction completes, calling arena.reset() doesn't touch the memory block itself; it just rewinds the offset to zero, leaving the buffer ready for the next transaction.
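A quick test makes that concrete: after a reset, the next allocation lands at the exact same address as the first one, showing that the underlying buffer was never released (this assumes the ArenaAllocator sketched in this post):

```zig
test "reset rewinds the bump pointer without freeing" {
    var arena = try ArenaAllocator.init(1024, std.testing.allocator);
    defer arena.deinit();

    const first = try arena.alloc(u8, 16);
    const first_addr = @intFromPtr(first.ptr);

    try arena.reset();

    // The next allocation reuses the exact same bytes.
    const second = try arena.alloc(u8, 16);
    try std.testing.expectEqual(first_addr, @intFromPtr(second.ptr));
}
```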
```zig
test "2 allocations in a row" {
    var arena = try ArenaAllocator.init(1024, std.testing.allocator);
    defer arena.deinit();
    const t = try arena.alloc(u32, 8);
    const b = try arena.alloc(u32, 10);
    t[0] = 100;
    b[0] = 100;
    try std.testing.expectEqual(@as(u32, 100), t[0]);
    try std.testing.expectEqual(@as(u32, 100), b[0]);
}
```
## Perfect cache locality
```zig
test "many small allocations" {
    var arena = try ArenaAllocator.init(1024, std.testing.allocator);
    defer arena.deinit();
    for (0..100) |i| {
        // two 1-byte allocations per iteration, 100 iterations
        const data = try arena.alloc(u8, 1);
        const data2 = try arena.alloc(u8, 1);
        data[0] = @intCast(i % 256);
        data2[0] = @intCast(i % 256);
    }
}
```
Each allocation is O(1), which keeps latency low and consistent. The second of my two tests above shows that arenas shine in subtle ways others might not even think about. Imagine a server thread processing a request to transfer money between two accounts. That single transaction requires creating several temporary, short-lived data structures.
| Data Structure | Purpose | Size | Lifetime |
|---|---|---|---|
| Transaction State | Holds the ID, amount, and status. | Small (e.g., 32 bytes) | Ends with transaction. |
| Account A Lock Node | Used to temporarily lock Account A. | Small (e.g., 16 bytes) | Ends when lock is released. |
| Account B Lock Node | Used to temporarily lock Account B. | Small (e.g., 16 bytes) | Ends when lock is released. |
| Audit Log Entry | Details of the change before writing to disk. | Medium (e.g., 128 bytes) | Ends with transaction commit. |
| Result Metadata | Status codes and message for the client. | Small (e.g., 8 bytes) | Ends when response is sent. |

All of these objects are logically interconnected because they are all part of the same atomic transaction.

## Arena vs Malloc for Cache Locality
1. Arena Allocator (Bump-pointer)

When the arena allocator handles these five objects, the bump pointer lays them out sequentially in memory:

[Trans. State][Lock A][Lock B][Audit Log][Result] → Contiguous Block

The mental model: the transaction starts and allocates the transaction state. When the CPU requests this data (backed by the page_allocator, the parent allocator), it loads a 64-byte cache line into the CPU's L1 cache. The other small objects (Lock A, Lock B, etc.) immediately follow the transaction state, so they are often already resident in the same 64-byte cache line. As the transaction proceeds (acquiring locks, checking state, building the audit log) the CPU accesses these subsequent objects with near-zero latency, since they are instantly available through cache hits. This tight packing in the arena's linear buffer reduces time spent waiting for memory fetches, which is exactly what an OLTP server needs and expects. In the following Zig I demonstrate building a many-item pointer that uses the arena's offset as the start of the transaction state, through to its result metadata:

```zig
// u8_ptr points to the start of the first byte
// of the data that was used for allocation,
// using alignment_offset (the starting point)
// -> (transaction state starting point)
const u8_ptr = &self.buffer[alignment_offset];
// the pointer must be a many-item pointer, not just a u8 pointer,
// so it can be cast toward the return type []T;
// you can index into this: type_ptr[0], type_ptr[1], type_ptr[2], etc.
// -> (data structures used as part of the transaction)
const type_ptr: [*]T = @ptrCast(@alignCast(u8_ptr));
// collect item_count items of the underlying data
// starting at type_ptr (u8_ptr) and return that slice
// -> (transaction state … result metadata)
const result = type_ptr[0..item_count];
return result;
```

2. General-Purpose (malloc)

A general-purpose heap allocator typically aims for efficiency by finding the smallest available hole in memory that can fit the requested size. Although C allocators such as glibc malloc, jemalloc, mimalloc, and tcmalloc employ more sophisticated strategies, they still typically rely on some form of "find a fitting hole" search over scattered free blocks:

[Lock A]…Gap…[Audit Log]…Gap…[Lock B]…Gap…[Trans. State]
The problem that arises is scattered memory. Each of the five small objects could be allocated far from the others on the heap, depending on what other threads are doing. Whenever the transaction jumps from checking the Transaction State to the Account A Lock Node, the target address is very likely in a completely different 64-byte block; there is no contiguous sequence between the objects. The CPU is therefore forced to fetch a new cache line for virtually every step, leading to constant cache misses (~30-50 ns each). These CPU stalls cause latency spikes, which is catastrophic for P99 latency in OLTP.

## Closing thoughts: what I learned

Choosing not to use a bump-pointer arena allocator in OLTP can be a form of engineering laziness: sacrificing predictable, low latency for the convenience of a general-purpose tool. Think of the best mechanic you know: they take their time and use specialized precision tools. Other mechanics might finish your car faster, but the 10x mechanic's care is unmatched. The moral? Choose the right tool, even if it seems harder initially; it will simplify the job going forward. When architecting mission-critical OLTP services, where every microsecond counts, an O(1) arena allocator's speed and cache locality are that carefully crafted, time-tested tool. By using Zig to implement this custom memory strategy, you gain two distinct advantages:
- Elimination of Lock Contention: Thread-local arenas bypass the global locks that choke malloc in concurrent servers, ensuring maximum parallelism.
- Deterministic Performance: By using bulk allocation → bulk deallocation, complex, slow heap bookkeeping is replaced by a single, fast pointer reset (self.offset = 0), guaranteeing consistent and predictable cleanup after every transaction.

If you made it this far, thank you for reading my first blog piece on how Zig and custom allocators can revolutionize OLTP servers. I plan to write plenty more. If I made any mistakes or over-guaranteed anything in my writing, please let me know. Thank you again, and shout-out to Tiger Beetle 🪲 :3.