CloudFlare - Trie Hard - Big Savings On Cloud
Based on ThePrimeTime's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.
Briefing
Cloudflare’s Pingora Origin spent a measurable slice of CPU time—about 1.7%—on a seemingly mundane task: clearing “internal” HTTP headers before requests leave Cloudflare’s infrastructure. That work happens on every request in the hottest path, with Pingora Origin processing roughly 35 million requests per second globally, so even tiny inefficiencies translate into massive compute costs at Cloudflare scale.
Engineers first quantified the bottleneck with Rust benchmarking (Criterion) and eBPF-based profiling, using large sets of synthesized requests with controlled mixes of internal and non-internal headers. The initial implementation iterated over the list of internal header names and called remove on the request's header map for each one. Because the average request contains far fewer headers than that list has entries, most of those remove calls probed for names that were never present, wasting reads. Flipping the direction of the operation, walking the request's own header keys and checking each against the internal-header set, cut the clearing function's runtime by about 2.39× in local tests. Even so, the team found the remaining cost was dominated by how the internal-header set itself was represented.
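The flipped approach can be sketched roughly as below. The header representation, function name, and internal-header names here are all hypothetical stand-ins; Pingora's actual types and Cloudflare's real internal-header list differ.

```rust
use std::collections::HashSet;

/// Remove internal headers by walking the request's own (few) headers and
/// checking membership in the internal-name set, instead of calling remove
/// once per internal-header name on a long internal list.
fn clear_internal_headers(headers: &mut Vec<(String, String)>, internal: &HashSet<&str>) {
    headers.retain(|(name, _)| !internal.contains(name.as_str()));
}

fn main() {
    // Hypothetical internal-header names for illustration only.
    let internal: HashSet<&str> = ["cf-int-request-id", "cf-int-colo"].into_iter().collect();
    let mut headers = vec![
        ("host".to_string(), "example.com".to_string()),
        ("cf-int-request-id".to_string(), "abc123".to_string()),
        ("accept".to_string(), "*/*".to_string()),
    ];
    clear_internal_headers(&mut headers, &internal);
    assert_eq!(headers.len(), 2); // only the internal header was removed
    println!("remaining: {:?}", headers);
}
```

With a typical request carrying far fewer headers than the internal list has entries, this does one set lookup per actual header rather than one map probe per internal name.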
The next leap came from treating the problem as a data-structure choice rather than another micro-optimization. Hash maps offer constant-time lookups in theory, but the real cost includes hashing the string key, which scales with key length, plus probing overhead. Alternatives were evaluated, including sorted sets and regex-based matching; the regex approach ran roughly twice as slow as the hash map in their benchmarks. The search for something between a full state machine and a general-purpose map led to a purpose-built retrieval tree (a trie) optimized for Cloudflare's header patterns.
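Why a trie fits this workload: a lookup reads the candidate name byte by byte and bails out on the first byte with no matching edge, so most non-internal header names are rejected after a byte or two, with no full-string hash ever computed. A minimal pointer-chasing sketch (not trie-hard's actual layout, which the next section describes; the inserted names are hypothetical):

```rust
use std::collections::HashMap;

// Illustrative trie; trie-hard's real representation is bit-packed and
// contiguous, but the matching logic is the same idea.
#[derive(Default)]
struct TrieNode {
    children: HashMap<u8, TrieNode>,
    terminal: bool,
}

#[derive(Default)]
struct Trie {
    root: TrieNode,
}

impl Trie {
    fn insert(&mut self, key: &str) {
        let mut node = &mut self.root;
        for &b in key.as_bytes() {
            node = node.children.entry(b).or_default();
        }
        node.terminal = true;
    }

    // Walk the key byte by byte; return false at the first byte with no edge.
    fn contains(&self, key: &str) -> bool {
        let mut node = &self.root;
        for &b in key.as_bytes() {
            match node.children.get(&b) {
                Some(next) => node = next,
                None => return false,
            }
        }
        node.terminal
    }
}

fn main() {
    let mut trie = Trie::default();
    // Hypothetical internal-header names.
    trie.insert("cf-int-request-id");
    trie.insert("cf-int-colo");
    assert!(trie.contains("cf-int-colo"));
    assert!(!trie.contains("host"));   // rejected at the very first byte
    assert!(!trie.contains("cf-int")); // prefix of a key, not itself a key
}
```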
The result is trie-hard, an open-source Rust crate announced alongside Pingora’s ongoing development. Instead of relying on existing trie implementations (which were not optimized for tens of millions of requests per second), trie-hard stores node relationships in compact bit-packed unsigned integers and keeps the whole structure in contiguous memory to improve cache behavior. In benchmarks, trie-hard reduced the average runtime of internal-header clearing to under a microsecond, and the team estimated the function’s CPU utilization at about 0.43%, down from 1.71%. That is a 1.28-percentage-point reduction in Pingora Origin’s CPU utilization, surpassing their target and translating into substantial savings at Cloudflare’s request volumes.
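The bit-packing idea can be illustrated as follows. This is a loose sketch based on the article's description, not trie-hard's actual layout: restricting the alphabet (here, 'a'..'z' plus '-') lets each node describe all of its child edges in a single unsigned integer, and children live consecutively in one array, so an edge traversal is a mask test plus a popcount instead of a pointer chase.

```rust
// Map a header-name byte to a bit position, assuming a 27-symbol alphabet.
fn symbol_bit(b: u8) -> Option<u32> {
    match b {
        b'a'..=b'z' => Some((b - b'a') as u32),
        b'-' => Some(26),
        _ => None,
    }
}

/// One node: a bitmask of present child edges plus the index of its first
/// child in one contiguous node array (good for cache locality).
struct Node {
    mask: u32,
    first_child: u32,
}

/// Index of the child reached over byte `b`, if that edge exists.
fn child_index(node: &Node, b: u8) -> Option<u32> {
    let bit = symbol_bit(b)?;
    if node.mask & (1 << bit) == 0 {
        return None; // no edge labeled with this byte
    }
    // The rank of this bit among the set bits below it (one popcount)
    // gives the child's offset within the contiguous child run.
    let rank = (node.mask & ((1 << bit) - 1)).count_ones();
    Some(node.first_child + rank)
}

fn main() {
    // Node with edges for 'a' (bit 0) and 'c' (bit 2), children at slots 10 and 11.
    let node = Node { mask: 0b101, first_child: 10 };
    assert_eq!(child_index(&node, b'a'), Some(10));
    assert_eq!(child_index(&node, b'c'), Some(11));
    assert_eq!(child_index(&node, b'b'), None);
}
```

Because nodes are plain integers in one flat array, the whole trie stays in a few cache lines and traversal never follows heap pointers.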
Crucially, the team didn’t stop at local benchmarking. The trie-hard crate has been running in production since July 2024, and performance was validated with statistical sampling of production stack traces (a method similar to what Chrome uses for profiling). The sampled production CPU footprint closely matched the benchmark predictions, and in some comparisons came in even lower. The broader takeaway is less about any single trick and more about disciplined measurement: find the hottest code path, quantify the cost, then redesign the underlying data structures to remove it, because at Cloudflare scale, “small” percentages become real operational wins.
Cornell Notes
Cloudflare’s Pingora Origin spends about 1.7% of its CPU clearing internal HTTP headers on every outgoing request, at a global rate of roughly 35 million requests per second. After profiling and benchmarking, engineers improved the logic by flipping the lookup direction: instead of iterating internal headers and removing them from the request, they intersect the request’s header keys with the internal set, yielding about a 2.39× speedup. Further gains required changing the data structure used to recognize internal header names. Hash maps and regex approaches underperformed in their tests, so the team built trie-hard, a compact, cache-friendly trie optimized for header matching. Deployed in production since July 2024, trie-hard reduced the function’s CPU utilization to about 0.43%, with production stack-trace sampling confirming benchmark-aligned results.
- Why did clearing internal headers become a high-impact optimization target for Pingora Origin?
- What was the first major algorithmic improvement, and what did it change?
- Why didn’t a hashmap-based internal-header set remain the best option after flipping the lookup direction?
- How did regex/state-machine approaches compare to the hashmap in their benchmarks?
- What is trie-hard, and what design choices made it fast enough for production traffic?
- How was the production impact validated beyond local benchmarks?
Review Questions
- What specific bottleneck percentage did internal-header clearing consume in Pingora Origin, and why does that matter at Cloudflare’s request rates?
- Describe the difference between the original header-removal strategy and the flipped lookup approach. How did the change reduce work?
- What properties of trie-hard (data layout and matching strategy) helped it outperform hashmaps and regex in this use case?
Key Points
1. Internal-header clearing in Pingora Origin consumed over 1.7% of CPU time, making it a high-leverage optimization target because it runs on every outgoing request.
2. Flipping the lookup direction, intersecting request header keys with the internal-header set, cut clearing runtime substantially (about 2.39× in benchmarks).
3. Hash maps can still be costly in hot paths because hashing a string key scales with key length, even when lookup is “O(1)” in theory.
4. Regex/state-machine matching was tested but ran about twice as long as the hashmap approach for this task in their benchmarks.
5. trie-hard replaces general-purpose tries with a compact, cache-friendly retrieval-tree implementation optimized for header-name matching.
6. trie-hard reduced the clearing function’s estimated CPU utilization from 1.71% to about 0.43%, surpassing the team’s target reduction.
7. Production validation used statistical stack-trace sampling, and results aligned closely with benchmark predictions.