Benchmarking JavaScript Is A Mess
Based on The PrimeTime's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Micro-benchmarks in JavaScript frequently measure JIT tiering, warmup, and caching effects rather than the true cost of the code path.
Briefing
JavaScript benchmarking is unreliable because modern engines constantly change how code runs—especially across warmup, optimization tiers, caching, and even timing precision—so micro-benchmarks often measure engine behavior instead of real application performance. The practical takeaway is blunt: if the goal is speed, measuring tiny snippets in isolation is usually “effectively worthless,” while production-style measurement of end-to-end latency (including percentiles) is what actually predicts user impact.
A major reason micro-benchmarks mislead is the JIT compilation pipeline. JavaScript engines like V8 don't simply interpret code once; they compile in multiple tiers, deciding when functions become "hot" enough to optimize. Those tier changes can produce dramatic swings, often 10x or more, after full optimization. That means the same snippet can look fast or slow depending on whether it has already been warmed up, whether the function's shape triggers compilation, and whether the benchmark accidentally benefits from caching. Benchmark harnesses also tend to "fight" the engine by trying to defeat caching and optimization, which distorts results in another way: the snippet then runs in an artificial mode that real workloads, where caching and tiered optimization are in effect, would never see, while later trials can still look faster simply because of warmup rather than anything about the code under test.
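To make the warmup effect concrete, here is a minimal, hypothetical sketch (not from the video) that times the same function cold and then again after running it many times. It assumes an environment where `performance.now()` is available globally, such as a browser or a recent Node.js release; the exact numbers will vary by engine and run, and that variability is the point.

```js
// Minimal sketch: time the same function cold, then after it has likely been
// marked "hot" and optimized by the JIT. Results are engine- and run-dependent.
function sumOfSquares(n) {
  let total = 0;
  for (let i = 0; i < n; i++) total += i * i;
  return total;
}

function timeOnce(label, fn) {
  const start = performance.now();
  fn();
  console.log(`${label}: ${(performance.now() - start).toFixed(3)} ms`);
}

timeOnce("cold (interpreter / baseline tier)", () => sumOfSquares(1_000_000));

// Run the function enough times that the engine may promote it to an optimized tier.
for (let i = 0; i < 10_000; i++) sumOfSquares(1_000);

timeOnce("warm (possibly fully optimized)", () => sumOfSquares(1_000_000));
```

The gap between the two readings reflects tiering and warmup, not a change in the code, which is exactly why a single cold or single warm measurement can mislead.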
Timing accuracy adds another layer of trouble. JavaScript engines intentionally reduce precision to mitigate timing attacks and fingerprinting, so microsecond-level measurements can be artificially quantized or otherwise made less trustworthy. Even when developers switch from millisecond to microsecond timers, the underlying precision limits and browser differences remain. The result is a mismatch between what developers think they’re measuring (fine-grained operation cost) and what the runtime is willing to reveal.
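As an illustration of that quantization, the following hypothetical sketch records the distinct deltas that `performance.now()` actually reports between consecutive calls. In browsers the granularity is deliberately coarsened (the exact step varies by browser and by settings such as cross-origin isolation), so any cost smaller than one step simply cannot be observed this way.

```js
// Minimal sketch: observe the granularity of performance.now().
// If the timer is coarsened, consecutive readings move in discrete steps,
// and anything cheaper than one step is invisible to this kind of timing.
const deltas = new Set();
let prev = performance.now();
for (let i = 0; i < 100_000; i++) {
  const now = performance.now();
  if (now !== prev) {
    deltas.add(Number((now - prev).toFixed(6)));
    prev = now;
  }
}
console.log("distinct timer steps observed (ms):", [...deltas].slice(0, 10));
```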
Cross-engine variation compounds the problem. Server-side JavaScript is not one environment: V8, JSC (Safari), and SpiderMonkey (Firefox) can optimize equivalent code differently, including support for features like tail call optimization (TCO). The transcript highlights a factorial-style recursion example where some runtimes hit "maximum call stack size exceeded" and others do not, because some engines implement TCO while others don't. And even when the engine is the same, the surrounding runtime matters: timers like setTimeout are implemented outside the JavaScript engine, so benchmarking "JavaScript" can accidentally benchmark the runtime's event loop and native integration instead.
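The recursion example can be reproduced with a tail-recursive factorial like the sketch below (a reconstruction, not the transcript's exact code). Because the recursive call is in tail position, an engine with proper tail calls can run it in constant stack space, while engines without TCO throw once the stack limit is reached; the numeric result overflows to Infinity, but the stack behavior is what matters here.

```js
"use strict";
// Tail-recursive factorial. JavaScriptCore (Safari, Bun) implements ES2015 proper
// tail calls, so deep recursion like this can complete; V8 (Chrome, Node, Deno) and
// SpiderMonkey (Firefox) do not, so they throw "Maximum call stack size exceeded".
function factorial(n, acc = 1) {
  if (n <= 1) return acc;
  return factorial(n - 1, acc * n); // tail position: eligible for TCO where supported
}

try {
  console.log("result:", factorial(100_000)); // value overflows to Infinity; the depth is the point
} catch (err) {
  console.log("failed:", err.message);
}
```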
Garbage collection (GC) and optimization/deoptimization also make micro-benchmarks lie. Creating objects in tight loops may avoid GC during the benchmark, yet in production those allocations would trigger GC pauses and reshape performance. Meanwhile, JIT compilers can optimize away work that appears unnecessary, and flame graphs can help reveal where time truly goes—but only when profiling the right workload.
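Both effects are easy to trip over in practice. The hedged sketch below (an illustration, not production guidance) shows an allocation loop short enough that GC cost may never appear in the timing, and a computation whose result is discarded, which an optimizing JIT is free to treat as dead code.

```js
// Minimal sketch of two ways a micro-benchmark can lie.
// 1) Allocation: a short run may finish before the GC has to do significant work,
//    hiding pauses a long-running server would eventually pay for.
// 2) Dead code: if a value is never used, the optimizer may skip some or all of
//    the work, so the timing measures very little.
function allocate(count) {
  const out = [];
  for (let i = 0; i < count; i++) out.push({ id: i, payload: "x".repeat(16) });
  return out;
}

let start = performance.now();
allocate(100_000); // result dropped; GC pressure may not surface in a run this short
console.log("allocation run:", (performance.now() - start).toFixed(2), "ms");

start = performance.now();
for (let i = 0; i < 1_000_000; i++) {
  Math.sqrt(i); // value unused: a JIT may eliminate it and time an empty loop instead
}
console.log("dead-code run:", (performance.now() - start).toFixed(2), "ms");
```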
The recommended approach is to benchmark the whole system in production (or production-like conditions) using real API response times and percentiles such as the 99th, then investigate regressions with profiling tools like flame graphs. For teams that must benchmark across engines or control optimization levels, the transcript points to lower-level tooling such as d8 (V8) and engine-specific flags, plus cross-engine runners like ESVU/ESHost, but emphasizes that this is heavy lifting. For most developers, the message is to stop chasing microsecond wins in isolated snippets and instead measure what users feel—then optimize the hot paths that show up under real load.
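Percentile reporting itself is straightforward once real request durations are being collected. The sketch below is a minimal, assumed example of deriving p50 and p99 from an array of latencies; in practice these numbers usually come from a metrics pipeline rather than an in-process array, and the sample data here is invented.

```js
// Minimal sketch: compute percentile latencies from collected request durations.
// Nearest-rank method; metrics systems may interpolate differently.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Stand-in data: per-request durations in milliseconds from an imagined API.
const latenciesMs = [12, 14, 13, 220, 15, 11, 16, 18, 13, 950, 14, 12];
console.log("p50:", percentile(latenciesMs, 50), "ms");
console.log("p99:", percentile(latenciesMs, 99), "ms"); // the tail is what users feel
```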
Cornell Notes
JavaScript micro-benchmarks often fail because engines change execution dynamically: JIT tiering, warmup effects, caching, and optimization can swing performance by 10x or more. Engines also reduce timing precision to limit timing attacks, so microsecond measurements may be quantized or otherwise unreliable. Different JavaScript engines (V8, JSC, SpiderMonkey) can optimize the same code differently, including support gaps like tail call optimization, which can change correctness and performance. GC behavior and JIT optimization/deoptimization can further distort micro-benchmark results. The practical solution is to benchmark end-to-end in production using latency percentiles (e.g., 99th) and profile real workloads to find true hot spots.
- Why can a small JavaScript snippet benchmark 10x faster after "warmup" or optimization?
- How do caching and trial ordering make micro-benchmarks misleading?
- What role does reduced timing precision play in JavaScript benchmarking?
- Why do benchmarks differ across Node, Deno, Bun, and browsers?
- How can garbage collection and JIT optimization/deoptimization invalidate micro-benchmarks?
- What's the recommended benchmarking strategy when the goal is real performance?
Review Questions
- What specific engine behaviors (tiering, warmup, caching, timing precision) can cause micro-benchmarks to report exaggerated or misleading speedups?
- How would you design a benchmarking plan to detect a performance regression using percentile latency rather than microsecond timings?
- Give an example of how cross-engine differences (like tail call optimization support) could change both correctness and performance of the same JavaScript code.
Key Points
1. Micro-benchmarks in JavaScript frequently measure JIT tiering, warmup, and caching effects rather than the true cost of the code path.
2. JIT compilation in engines like V8 can change performance dramatically after optimization, sometimes by 10x or more, corrupting naive timing comparisons.
3. JavaScript engines intentionally reduce timing precision to mitigate timing attacks and fingerprinting, making microsecond-level benchmarks unreliable.
4. Equivalent JavaScript can perform differently across engines (V8, JSC, SpiderMonkey), including feature differences like tail call optimization support.
5. Garbage collection behavior and JIT optimization/deoptimization can make micro-benchmarks look better than production by avoiding GC pauses or skipping work.
6. For meaningful results, benchmark end-to-end in production using latency percentiles (especially the 99th) and profile real workloads to find hot spots.
7. Cross-engine benchmarking is possible but requires substantial tooling and pipeline work; most teams should focus on production measurements first.