CloudFlare - Trie Hard - Big Savings On Cloud
Based on ThePrimeTime's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.
Briefing
Cloudflare’s Pingora Origin spent a measurable slice of CPU time—about 1.7%—on a seemingly mundane task: clearing “internal” HTTP headers before requests leave Cloudflare’s infrastructure. That work happens on every request in the hottest path, with Pingora Origin processing roughly 35 million requests per second globally, so even tiny inefficiencies translate into massive compute costs at Cloudflare scale.
Engineers first quantified the bottleneck with Rust benchmarking (Criterion) and eBPF-based profiling, using large sets of synthesized requests with controlled mixes of internal and non-internal headers. The initial implementation iterated over the list of internal header names and called remove on the request's header map for each one. Because the average request contains far fewer headers than that list has entries, most of those remove calls probed for names that were never present, wasting reads. Flipping the direction of the operation, walking the request's own header keys and checking each against the internal-header set, cut the clearing function's runtime by about 2.39× in local tests. Even so, the team found the remaining cost was dominated by how the internal-header set itself was represented.
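The flipped approach can be sketched roughly as below. The header representation, function name, and internal-header names here are all hypothetical stand-ins; Pingora's actual types and Cloudflare's real internal-header list differ.

```rust
use std::collections::HashSet;

/// Remove internal headers by walking the request's own (few) headers and
/// checking membership in the internal-name set, instead of calling remove
/// once per internal-header name on a long internal list.
fn clear_internal_headers(headers: &mut Vec<(String, String)>, internal: &HashSet<&str>) {
    headers.retain(|(name, _)| !internal.contains(name.as_str()));
}

fn main() {
    // Hypothetical internal-header names for illustration only.
    let internal: HashSet<&str> = ["cf-int-request-id", "cf-int-colo"].into_iter().collect();
    let mut headers = vec![
        ("host".to_string(), "example.com".to_string()),
        ("cf-int-request-id".to_string(), "abc123".to_string()),
        ("accept".to_string(), "*/*".to_string()),
    ];
    clear_internal_headers(&mut headers, &internal);
    assert_eq!(headers.len(), 2); // only the internal header was removed
    println!("remaining: {:?}", headers);
}
```

With a typical request carrying far fewer headers than the internal list has entries, this does one set lookup per actual header rather than one map probe per internal name.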
The next leap came from treating the problem as a data-structure choice rather than another micro-optimization. Hash maps offer constant-time lookups in theory, but the real cost includes hashing the string key, which scales with key length, plus probing overhead. Alternatives were evaluated, including sorted sets and regex-based matching; the regex approach ran roughly twice as slow as the hash map in their benchmarks. The search for something between a full state machine and a general-purpose map led to a purpose-built retrieval tree (a trie) optimized for Cloudflare's header patterns.
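Why a trie fits this workload: a lookup reads the candidate name byte by byte and bails out on the first byte with no matching edge, so most non-internal header names are rejected after a byte or two, with no full-string hash ever computed. A minimal pointer-chasing sketch (not trie-hard's actual layout, which the next section describes; the inserted names are hypothetical):

```rust
use std::collections::HashMap;

// Illustrative trie; trie-hard's real representation is bit-packed and
// contiguous, but the matching logic is the same idea.
#[derive(Default)]
struct TrieNode {
    children: HashMap<u8, TrieNode>,
    terminal: bool,
}

#[derive(Default)]
struct Trie {
    root: TrieNode,
}

impl Trie {
    fn insert(&mut self, key: &str) {
        let mut node = &mut self.root;
        for &b in key.as_bytes() {
            node = node.children.entry(b).or_default();
        }
        node.terminal = true;
    }

    // Walk the key byte by byte; return false at the first byte with no edge.
    fn contains(&self, key: &str) -> bool {
        let mut node = &self.root;
        for &b in key.as_bytes() {
            match node.children.get(&b) {
                Some(next) => node = next,
                None => return false,
            }
        }
        node.terminal
    }
}

fn main() {
    let mut trie = Trie::default();
    // Hypothetical internal-header names.
    trie.insert("cf-int-request-id");
    trie.insert("cf-int-colo");
    assert!(trie.contains("cf-int-colo"));
    assert!(!trie.contains("host"));   // rejected at the very first byte
    assert!(!trie.contains("cf-int")); // prefix of a key, not itself a key
}
```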
The result is trie-hard, an open-source Rust crate announced alongside Pingora’s ongoing development. Instead of relying on existing trie implementations (which were not optimized for tens of millions of requests per second), trie-hard stores node relationships in compact bit-packed unsigned integers and keeps the whole structure in contiguous memory to improve cache behavior. In benchmarks, trie-hard reduced the average runtime of internal-header clearing to under a microsecond, and the team estimated the function’s CPU utilization at about 0.43%, down from 1.71%. That is a 1.28-percentage-point reduction in Pingora Origin’s CPU utilization, surpassing their target and translating into substantial savings at Cloudflare’s request volumes.
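The bit-packing idea can be illustrated as follows. This is a loose sketch based on the article's description, not trie-hard's actual layout: restricting the alphabet (here, 'a'..'z' plus '-') lets each node describe all of its child edges in a single unsigned integer, and children live consecutively in one array, so an edge traversal is a mask test plus a popcount instead of a pointer chase.

```rust
// Map a header-name byte to a bit position, assuming a 27-symbol alphabet.
fn symbol_bit(b: u8) -> Option<u32> {
    match b {
        b'a'..=b'z' => Some((b - b'a') as u32),
        b'-' => Some(26),
        _ => None,
    }
}

/// One node: a bitmask of present child edges plus the index of its first
/// child in one contiguous node array (good for cache locality).
struct Node {
    mask: u32,
    first_child: u32,
}

/// Index of the child reached over byte `b`, if that edge exists.
fn child_index(node: &Node, b: u8) -> Option<u32> {
    let bit = symbol_bit(b)?;
    if node.mask & (1 << bit) == 0 {
        return None; // no edge labeled with this byte
    }
    // The rank of this bit among the set bits below it (one popcount)
    // gives the child's offset within the contiguous child run.
    let rank = (node.mask & ((1 << bit) - 1)).count_ones();
    Some(node.first_child + rank)
}

fn main() {
    // Node with edges for 'a' (bit 0) and 'c' (bit 2), children at slots 10 and 11.
    let node = Node { mask: 0b101, first_child: 10 };
    assert_eq!(child_index(&node, b'a'), Some(10));
    assert_eq!(child_index(&node, b'c'), Some(11));
    assert_eq!(child_index(&node, b'b'), None);
}
```

Because nodes are plain integers in one flat array, the whole trie stays in a few cache lines and traversal never follows heap pointers.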
Crucially, the team didn’t stop at local benchmarking. The trie-hard crate has been running in production since July 2024, and performance was validated with statistical sampling of production stack traces (a method similar to what Chrome uses for profiling). The sampled production CPU footprint closely matched the benchmark predictions, and in some comparisons came in even lower. The broader takeaway is less about any single trick and more about disciplined measurement: find the hottest code path, quantify the cost, then redesign the underlying data structures to remove it, because at Cloudflare scale, “small” percentages become real operational wins.
Cornell Notes
Cloudflare’s Pingora Origin spends about 1.7% of its CPU clearing internal HTTP headers on every outgoing request, at a global rate of roughly 35 million requests per second. After profiling and benchmarking, engineers improved the logic by flipping the lookup direction: instead of iterating internal headers and removing them from the request, they intersect the request’s header keys with the internal set, yielding about a 2.39× speedup. Further gains required changing the data structure used to recognize internal header names. Hash maps and regex approaches underperformed in their tests, so the team built trie-hard, a compact, cache-friendly trie optimized for header matching. Deployed in production since July 2024, trie-hard reduced the function’s CPU utilization to about 0.43%, with production stack-trace sampling confirming benchmark-aligned results.
- Why did clearing internal headers become a high-impact optimization target for Pingora Origin?
- What was the first major algorithmic improvement, and what did it change?
- Why didn’t a hashmap-based internal-header set remain the best option after flipping the lookup direction?
- How did regex/state-machine approaches compare to the hashmap in their benchmarks?
- What is trie-hard, and what design choices made it fast enough for production traffic?
- How was the production impact validated beyond local benchmarks?
Review Questions
- What specific bottleneck percentage did internal-header clearing consume in Pingora Origin, and why does that matter at Cloudflare’s request rates?
- Describe the difference between the original header-removal strategy and the flipped lookup approach. How did the change reduce work?
- What properties of trie-hard (data layout and matching strategy) helped it outperform hashmaps and regex in this use case?
Key Points
1. Internal-header clearing in Pingora Origin consumed over 1.7% of CPU time, making it a high-leverage optimization target because it runs on every outgoing request.
2. Flipping the lookup direction, intersecting request header keys with the internal-header set, cut clearing runtime substantially (about 2.39× in benchmarks).
3. Hash maps can still be costly in hot paths because hashing a string key scales with key length, even when lookup is “O(1)” in theory.
4. Regex/state-machine matching was tested but ran about twice as long as the hashmap approach for this task in their benchmarks.
5. trie-hard replaces general-purpose tries with a compact, cache-friendly retrieval-tree implementation optimized for header-name matching.
6. trie-hard reduced the clearing function’s estimated CPU utilization from 1.71% to about 0.43%, surpassing the team’s target reduction.
7. Production validation used statistical stack-trace sampling, and results aligned closely with benchmark predictions.