An Optimization That Is Impossible In Rust
Based on The PrimeTime's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.
Umbra-style short string optimization keeps short strings inline on the stack, avoiding heap allocation and pointer chasing for the common case.
Briefing
A widely repeated claim that “short string optimization” is impossible in Rust gets challenged through a full implementation of an Umbra-style string type—showing that the optimization can be made to work, but only by leaning hard on Rust’s memory-layout control and carefully contained unsafe code.
The core idea behind Umbra-style strings is to avoid heap allocation for most real-world strings. Instead of storing every string as a pointer/length/capacity triple that points to a heap buffer, a short string can be stored directly inside the string object itself. The implementation repurposes unused space in the usual string representation: a short string sets a bit in the capacity field (conceptually), then keeps the remaining “capacity” bits plus the length and content inline. That eliminates buffer allocation and pointer dereferencing on access—exactly the kind of micro-optimization that matters in database workloads where string comparisons and ordering happen constantly.
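The inline layout described above can be sketched as follows. This is an illustrative reconstruction, not the video's exact code: the field names, the 12-byte inline capacity, and the `new_inline` helper are assumptions, and only the short-string path is implemented.

```rust
// Minimal sketch of an Umbra-style 16-byte string: a 4-byte length, a 4-byte
// prefix, and 8 more bytes that hold either the rest of a short string or
// (in a full implementation) a pointer to heap data.
#[repr(C)]
struct UmbraString {
    len: u32,
    prefix: [u8; 4],    // first 4 bytes, always stored inline
    trailing: Trailing, // bytes 4..12, or a heap pointer for long strings
}

#[repr(C)]
union Trailing {
    inline: [u8; 8],
    ptr: *const u8, // unused in this sketch; shown only for the layout
}

impl UmbraString {
    const INLINE_CAP: usize = 12;

    /// Build an inline string; this sketch panics beyond 12 bytes.
    fn new_inline(s: &str) -> Self {
        let bytes = s.as_bytes();
        assert!(bytes.len() <= Self::INLINE_CAP, "sketch only handles short strings");
        let mut prefix = [0u8; 4];
        let mut inline = [0u8; 8];
        let head = bytes.len().min(4);
        prefix[..head].copy_from_slice(&bytes[..head]);
        if bytes.len() > 4 {
            inline[..bytes.len() - 4].copy_from_slice(&bytes[4..]);
        }
        UmbraString { len: bytes.len() as u32, prefix, trailing: Trailing { inline } }
    }

    fn as_str(&self) -> &str {
        let len = self.len as usize;
        assert!(len <= Self::INLINE_CAP, "sketch only handles short strings");
        unsafe {
            // Under repr(C), prefix (offset 4) and trailing (offset 8) are
            // contiguous, so the content is one 12-byte region at offset 4.
            let base = (self as *const UmbraString as *const u8).add(4);
            std::str::from_utf8_unchecked(std::slice::from_raw_parts(base, len))
        }
    }
}

fn main() {
    // The whole string fits in 16 bytes with no heap allocation.
    assert_eq!(std::mem::size_of::<UmbraString>(), 16);
    let s = UmbraString::new_inline("hello, umbra"); // exactly 12 bytes
    assert_eq!(s.as_str(), "hello, umbra");
}
```

A full implementation would set a discriminating bit (e.g., in the length or capacity bits) to distinguish the inline case from the pointer case; that tagging is elided here.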
The transcript then walks through why Rust’s standard string layout doesn’t automatically provide this behavior. Rust’s built-in String is a 24-byte structure on the stack (pointer, length, capacity), and the language’s safety model makes it nontrivial to create custom layouts that mix inline storage with heap-backed storage while still supporting correct ownership, cloning, and deallocation. The discussion also clarifies that Rust’s “fat pointers” for slices and trait objects carry extra metadata like length, which complicates building a compact 16-byte string representation that can switch between inline bytes and heap bytes.
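These layout facts can be checked directly. The sizes below assume a typical 64-bit target; this snippet is illustrative rather than taken from the video:

```rust
fn main() {
    // String is pointer + length + capacity: three usizes on the stack.
    assert_eq!(std::mem::size_of::<String>(), 24);
    // A &str slice is a "fat pointer": data pointer plus length.
    assert_eq!(std::mem::size_of::<&str>(), 16);
    // A reference to a sized type is a thin pointer.
    assert_eq!(std::mem::size_of::<&u8>(), 8);
    // Box<[u8]> is also fat -- the length rides along with the pointer,
    // which is why a compact 16-byte string can't simply embed one
    // alongside its own length field.
    assert_eq!(std::mem::size_of::<Box<[u8]>>(), 16);
}
```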
To make the Umbra-style layout fit, the implementation uses a union-like representation to store either inline content or a pointer to heap data, while keeping a small prefix inline for fast comparisons. For long strings, only a fixed-size prefix (described as four bytes) is checked first; if prefixes differ, ordering and equality can be decided without touching the rest of the string. When prefixes match, the code falls back to comparing the remaining bytes, sometimes still avoiding pointer dereferences when both strings are inline.
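The prefix-first comparison can be sketched like this. The helper names are assumptions, and for simplicity the sketch operates on plain `&str` values rather than the custom layout; the point is the control flow: a zero-padded 4-byte prefix decides ordering whenever prefixes differ, and only a tie falls back to the full bytes.

```rust
use std::cmp::Ordering;

/// Zero-padded first four bytes of a string (hypothetical helper).
fn prefix4(s: &str) -> [u8; 4] {
    let mut p = [0u8; 4];
    let n = s.len().min(4);
    p[..n].copy_from_slice(&s.as_bytes()[..n]);
    p
}

/// Compare via the inline prefix first; touch the full bytes only on a tie.
fn umbra_cmp(a: &str, b: &str) -> Ordering {
    match prefix4(a).cmp(&prefix4(b)) {
        // Differing prefixes settle ordering with one small comparison,
        // without dereferencing into either string's full buffer.
        Ordering::Equal => a.as_bytes().cmp(b.as_bytes()),
        decided => decided,
    }
}

fn main() {
    assert_eq!(umbra_cmp("apple", "apricot"), Ordering::Less);
    assert_eq!(umbra_cmp("same-prefix-a", "same-prefix-b"), Ordering::Less);
    assert_eq!(umbra_cmp("abc", "abc"), Ordering::Equal);
}
```

Zero padding keeps the prefix comparison consistent with full byte-wise ordering: a string that ends inside the prefix compares as smaller, matching the usual shorter-string-first rule.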
The hardest engineering piece is shared ownership of heap-backed bytes across clones and threads. Rather than using a straightforward Arc<[u8]>, the approach builds a custom reference-counted “shared bytes” DST (dynamically sized type) so the string object can remain compact while the heap allocation carries the atomic reference count plus the byte array. Because Rust doesn’t let you freely construct DST values without manual allocation, the code defines a layout, allocates memory for the reference count and byte array together, then copies the input bytes into place. Drop and clone are implemented with explicit reference-count decrement/increment, and deallocation happens only when the last reference is dropped—that is, when the decrement sees a count of one.
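A minimal sketch of that shared-bytes scheme, assuming a header-plus-bytes layout in one allocation (the struct and method names here are illustrative, not the video's code). The string side keeps only a thin pointer; the refcount and length live in the heap block itself:

```rust
use std::alloc::{alloc, dealloc, Layout};
use std::sync::atomic::{fence, AtomicUsize, Ordering};

// Header stored at the start of the allocation; the bytes follow
// immediately after it in the same block.
#[repr(C)]
struct SharedHeader {
    count: AtomicUsize,
    len: usize,
}

// Thin-pointer handle: 8 bytes, no fat-pointer metadata.
struct SharedBytes {
    ptr: *mut SharedHeader,
}

impl SharedBytes {
    // Layout of header followed by `len` bytes, padded to alignment.
    fn layout(len: usize) -> Layout {
        let (layout, _offset) = Layout::new::<SharedHeader>()
            .extend(Layout::array::<u8>(len).unwrap())
            .unwrap();
        layout.pad_to_align()
    }

    fn new(data: &[u8]) -> Self {
        let layout = Self::layout(data.len());
        unsafe {
            let ptr = alloc(layout) as *mut SharedHeader;
            assert!(!ptr.is_null(), "allocation failed");
            ptr.write(SharedHeader { count: AtomicUsize::new(1), len: data.len() });
            // Copy the input bytes just past the header.
            let bytes = (ptr as *mut u8).add(std::mem::size_of::<SharedHeader>());
            std::ptr::copy_nonoverlapping(data.as_ptr(), bytes, data.len());
            SharedBytes { ptr }
        }
    }

    fn as_slice(&self) -> &[u8] {
        unsafe {
            let len = (*self.ptr).len;
            let bytes = (self.ptr as *const u8).add(std::mem::size_of::<SharedHeader>());
            std::slice::from_raw_parts(bytes, len)
        }
    }
}

impl Clone for SharedBytes {
    fn clone(&self) -> Self {
        // Another owner: bump the atomic count and reuse the same block.
        unsafe { (*self.ptr).count.fetch_add(1, Ordering::Relaxed) };
        SharedBytes { ptr: self.ptr }
    }
}

impl Drop for SharedBytes {
    fn drop(&mut self) {
        unsafe {
            // Free only when the decrement sees a count of one (last owner).
            if (*self.ptr).count.fetch_sub(1, Ordering::Release) == 1 {
                fence(Ordering::Acquire);
                let len = (*self.ptr).len;
                dealloc(self.ptr as *mut u8, Self::layout(len));
            }
        }
    }
}

fn main() {
    let a = SharedBytes::new(b"hello heap");
    let b = a.clone();
    assert_eq!(a.as_slice(), b"hello heap");
    drop(a); // b still owns the allocation
    assert_eq!(b.as_slice(), b"hello heap");
}
```

The Release decrement plus Acquire fence mirrors the ordering discipline standard Arc uses, ensuring all writes by other owners are visible before the block is freed.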
The result is a practical argument: the optimization isn’t “impossible” in Rust, but it’s not free. Achieving it requires deep knowledge of Rust’s type layout rules, DST mechanics, and careful unsafe code to bridge between thin and fat pointers, allocate custom DSTs, and manage deallocation correctly. The transcript closes by emphasizing that Rust’s safety is real, yet some performance-oriented, layout-heavy tricks can look “bonkers” to most developers—while still being feasible for those willing to master the underlying model.
Cornell Notes
Umbra-style strings use short string optimization to keep most strings inline on the stack, avoiding heap allocation and pointer chasing. The transcript shows that Rust can implement this despite an initial “impossible” claim, by building a custom string layout that stores short content directly and long content via a shared, reference-counted heap buffer. Fast comparisons come from an inline prefix check (e.g., four bytes) so equality/order can often be decided without reading the full string. The implementation’s complexity comes from Rust’s fat pointers for slices/DSTs and the need to manually allocate and manage a custom dynamically sized, atomically reference-counted buffer using unsafe code. The payoff is a compact string type with performance-oriented behavior suitable for database-style workloads.
- Why does short string optimization matter for database workloads?
- What makes Rust’s standard String layout unsuitable for this optimization out of the box?
- How does the implementation keep comparisons fast without reading the whole string?
- Why do fat pointers and DSTs complicate building a compact shared string buffer?
- What is the role of unsafe code in the shared heap allocation approach?
- How do clone and drop work for the heap-backed case?
Review Questions
- What specific mechanism allows Umbra-style strings to avoid heap allocation for short strings, and how does that affect pointer dereferencing during access?
- How does prefix-based comparison reduce the amount of work needed for equality and ordering, and what happens when prefixes match?
- Why does implementing a compact shared heap buffer require dealing with DSTs and fat pointers, and where does unsafe code enter the process?
Key Points
1. Umbra-style short string optimization keeps short strings inline on the stack, avoiding heap allocation and pointer chasing for the common case.
2. Fast string comparisons can be achieved by storing a small fixed-size prefix inline and deciding equality/order early when prefixes differ.
3. Rust’s built-in String layout doesn’t provide inline storage, so a custom representation must mix inline bytes with heap-backed bytes.
4. Shared ownership for heap-backed bytes is implemented with atomic reference counting, but doing it compactly requires custom DST machinery rather than a direct Arc<[u8]> approach.
5. Rust fat pointers for slices/DSTs add metadata like length, which can bloat the string object unless the design carefully separates thin and fat pointer representations.
6. Custom DST allocation typically needs manual layout definition and unsafe pointer casting, because Rust doesn’t safely construct these DST values directly.
7. Correctness hinges on implementing drop/clone to manage atomic reference counts and deallocate only when the last reference is released.