Benchmarking JavaScript Is A Mess
Based on The PrimeTime's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Micro-benchmarks in JavaScript frequently measure JIT tiering, warmup, and caching effects rather than the true cost of the code path.
Briefing
JavaScript benchmarking is unreliable because modern engines constantly change how code runs—especially across warmup, optimization tiers, caching, and even timing precision—so micro-benchmarks often measure engine behavior instead of real application performance. The practical takeaway is blunt: if the goal is speed, measuring tiny snippets in isolation is usually “effectively worthless,” while production-style measurement of end-to-end latency (including percentiles) is what actually predicts user impact.
A major reason micro-benchmarks mislead is the JIT compilation pipeline. JavaScript engines like V8 don't simply interpret code once; they compile in multiple tiers, deciding when functions become "hot" enough to optimize. Those tier changes can produce dramatic swings, often 10x or more, after full optimization. That means the same snippet can look fast or slow depending on whether it has already been warmed up, whether the function's shape triggers compilation, and whether the benchmark accidentally benefits from caching. Benchmark harnesses also tend to "fight" the engine by trying to defeat caching and optimization, which distorts results in another way: the snippet then runs in an artificial mode that real workloads, where caching and tiered optimization are in effect, would never see, while later trials can still look faster simply because of warmup rather than anything about the code under test.
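To make the warmup effect concrete, here is a minimal, hypothetical sketch (not from the video) that times the same function cold and then again after running it many times. It assumes an environment where `performance.now()` is available globally, such as a browser or a recent Node.js release; the exact numbers will vary by engine and run, and that variability is the point.

```js
// Minimal sketch: time the same function cold, then after it has likely been
// marked "hot" and optimized by the JIT. Results are engine- and run-dependent.
function sumOfSquares(n) {
  let total = 0;
  for (let i = 0; i < n; i++) total += i * i;
  return total;
}

function timeOnce(label, fn) {
  const start = performance.now();
  fn();
  console.log(`${label}: ${(performance.now() - start).toFixed(3)} ms`);
}

timeOnce("cold (interpreter / baseline tier)", () => sumOfSquares(1_000_000));

// Run the function enough times that the engine may promote it to an optimized tier.
for (let i = 0; i < 10_000; i++) sumOfSquares(1_000);

timeOnce("warm (possibly fully optimized)", () => sumOfSquares(1_000_000));
```

The gap between the two readings reflects tiering and warmup, not a change in the code, which is exactly why a single cold or single warm measurement can mislead.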
Timing accuracy adds another layer of trouble. JavaScript engines intentionally reduce precision to mitigate timing attacks and fingerprinting, so microsecond-level measurements can be artificially quantized or otherwise made less trustworthy. Even when developers switch from millisecond to microsecond timers, the underlying precision limits and browser differences remain. The result is a mismatch between what developers think they’re measuring (fine-grained operation cost) and what the runtime is willing to reveal.
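As an illustration of that quantization, the following hypothetical sketch records the distinct deltas that `performance.now()` actually reports between consecutive calls. In browsers the granularity is deliberately coarsened (the exact step varies by browser and by settings such as cross-origin isolation), so any cost smaller than one step simply cannot be observed this way.

```js
// Minimal sketch: observe the granularity of performance.now().
// If the timer is coarsened, consecutive readings move in discrete steps,
// and anything cheaper than one step is invisible to this kind of timing.
const deltas = new Set();
let prev = performance.now();
for (let i = 0; i < 100_000; i++) {
  const now = performance.now();
  if (now !== prev) {
    deltas.add(Number((now - prev).toFixed(6)));
    prev = now;
  }
}
console.log("distinct timer steps observed (ms):", [...deltas].slice(0, 10));
```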
Cross-engine variation compounds the problem. Server-side JavaScript is not one environment: V8, JSC (Safari), and SpiderMonkey (Firefox) can optimize equivalent code differently, including support for features like tail call optimization (TCO). The transcript highlights a factorial-style recursion example where some runtimes hit "maximum call stack size exceeded" and others do not, because some engines implement TCO while others don't. And even when the engine is the same, the surrounding runtime matters: timers like setTimeout are implemented outside the JavaScript engine, so benchmarking "JavaScript" can accidentally benchmark the runtime's event loop and native integration instead.
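The recursion example can be reproduced with a tail-recursive factorial like the sketch below (a reconstruction, not the transcript's exact code). Because the recursive call is in tail position, an engine with proper tail calls can run it in constant stack space, while engines without TCO throw once the stack limit is reached; the numeric result overflows to Infinity, but the stack behavior is what matters here.

```js
"use strict";
// Tail-recursive factorial. JavaScriptCore (Safari, Bun) implements ES2015 proper
// tail calls, so deep recursion like this can complete; V8 (Chrome, Node, Deno) and
// SpiderMonkey (Firefox) do not, so they throw "Maximum call stack size exceeded".
function factorial(n, acc = 1) {
  if (n <= 1) return acc;
  return factorial(n - 1, acc * n); // tail position: eligible for TCO where supported
}

try {
  console.log("result:", factorial(100_000)); // value overflows to Infinity; the depth is the point
} catch (err) {
  console.log("failed:", err.message);
}
```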
Garbage collection (GC) and optimization/deoptimization also make micro-benchmarks lie. Creating objects in tight loops may avoid GC during the benchmark, yet in production those allocations would trigger GC pauses and reshape performance. Meanwhile, JIT compilers can optimize away work that appears unnecessary, and flame graphs can help reveal where time truly goes—but only when profiling the right workload.
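Both effects are easy to trip over in practice. The hedged sketch below (an illustration, not production guidance) shows an allocation loop short enough that GC cost may never appear in the timing, and a computation whose result is discarded, which an optimizing JIT is free to treat as dead code.

```js
// Minimal sketch of two ways a micro-benchmark can lie.
// 1) Allocation: a short run may finish before the GC has to do significant work,
//    hiding pauses a long-running server would eventually pay for.
// 2) Dead code: if a value is never used, the optimizer may skip some or all of
//    the work, so the timing measures very little.
function allocate(count) {
  const out = [];
  for (let i = 0; i < count; i++) out.push({ id: i, payload: "x".repeat(16) });
  return out;
}

let start = performance.now();
allocate(100_000); // result dropped; GC pressure may not surface in a run this short
console.log("allocation run:", (performance.now() - start).toFixed(2), "ms");

start = performance.now();
for (let i = 0; i < 1_000_000; i++) {
  Math.sqrt(i); // value unused: a JIT may eliminate it and time an empty loop instead
}
console.log("dead-code run:", (performance.now() - start).toFixed(2), "ms");
```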
The recommended approach is to benchmark the whole system in production (or production-like conditions) using real API response times and percentiles such as the 99th, then investigate regressions with profiling tools like flame graphs. For teams that must benchmark across engines or control optimization levels, the transcript points to lower-level tooling such as d8 (V8) and engine-specific flags, plus cross-engine runners like ESVU/ESHost, but emphasizes that this is heavy lifting. For most developers, the message is to stop chasing microsecond wins in isolated snippets and instead measure what users feel—then optimize the hot paths that show up under real load.
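Percentile reporting itself is straightforward once real request durations are being collected. The sketch below is a minimal, assumed example of deriving p50 and p99 from an array of latencies; in practice these numbers usually come from a metrics pipeline rather than an in-process array, and the sample data here is invented.

```js
// Minimal sketch: compute percentile latencies from collected request durations.
// Nearest-rank method; metrics systems may interpolate differently.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Stand-in data: per-request durations in milliseconds from an imagined API.
const latenciesMs = [12, 14, 13, 220, 15, 11, 16, 18, 13, 950, 14, 12];
console.log("p50:", percentile(latenciesMs, 50), "ms");
console.log("p99:", percentile(latenciesMs, 99), "ms"); // the tail is what users feel
```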
Cornell Notes
JavaScript micro-benchmarks often fail because engines change execution dynamically: JIT tiering, warmup effects, caching, and optimization can swing performance by 10x or more. Engines also reduce timing precision to limit timing attacks, so microsecond measurements may be quantized or otherwise unreliable. Different JavaScript engines (V8, JSC, SpiderMonkey) can optimize the same code differently, including support gaps like tail call optimization, which can change correctness and performance. GC behavior and JIT optimization/deoptimization can further distort micro-benchmark results. The practical solution is to benchmark end-to-end in production using latency percentiles (e.g., 99th) and profile real workloads to find true hot spots.
- Why can a small JavaScript snippet benchmark 10x faster after "warmup" or optimization?
- How do caching and trial ordering make micro-benchmarks misleading?
- What role does reduced timing precision play in JavaScript benchmarking?
- Why do benchmarks differ across Node, Deno, Bun, and browsers?
- How can garbage collection and JIT optimization/deoptimization invalidate micro-benchmarks?
- What's the recommended benchmarking strategy when the goal is real performance?
Review Questions
- What specific engine behaviors (tiering, warmup, caching, timing precision) can cause micro-benchmarks to report exaggerated or misleading speedups?
- How would you design a benchmarking plan to detect a performance regression using percentile latency rather than microsecond timings?
- Give an example of how cross-engine differences (like tail call optimization support) could change both correctness and performance of the same JavaScript code.
Key Points
1. Micro-benchmarks in JavaScript frequently measure JIT tiering, warmup, and caching effects rather than the true cost of the code path.
2. JIT compilation in engines like V8 can change performance dramatically after optimization, sometimes by 10x or more, corrupting naive timing comparisons.
3. JavaScript engines intentionally reduce timing precision to mitigate timing attacks and fingerprinting, making microsecond-level benchmarks unreliable.
4. Equivalent JavaScript can perform differently across engines (V8, JSC, SpiderMonkey), including feature differences like tail call optimization support.
5. Garbage collection behavior and JIT optimization/deoptimization can make micro-benchmarks look better than production by avoiding GC pauses or skipping work.
6. For meaningful results, benchmark end-to-end in production using latency percentiles (especially the 99th) and profile real workloads to find hot spots.
7. Cross-engine benchmarking is possible but requires substantial tooling and pipeline work; most teams should focus on production measurements first.