🚨🚨 Full Casey Muratori: Language Perf and Picking A Lang Stream 🚨🚨

The PrimeTime · 5 min read

Based on The PrimeTime's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Choosing a new programming language for serious work is expensive because real proficiency takes time and early adoption often brings tooling “jank.”

Briefing

A deep dive into language-performance “benchmarks” turns into a broader warning: flashy charts can be wildly misleading when implementations aren’t validated and workloads are subtly distorted. The discussion centers on a popular Levenshtein-distance benchmark where Fortran appears to crush C and other languages—sometimes by double-digit margins—yet careful inspection reveals multiple ways the benchmark can produce false conclusions.

The conversation starts with the practical problem of choosing a “language of the year” for building asynchronous web-service pipelines—aggregating Twitch chat, recording and decoding audio, calling Whisper, and then combining results into a final voice output. The participants weigh Go’s productivity and goroutines against its “foot guns,” Rust’s power but low personal enjoyment, and newer options like Zig, Odin, Elixir, and especially J for metaprogramming. A recurring theme is that switching languages is costly: real proficiency takes time, tooling and language behavior can change midstream, and early adoption often means dealing with compiler/debugger “jank.” There’s also skepticism about whether emerging languages have mature standard libraries for web-service work comparable to Go.

That language debate quickly gets overshadowed by a performance rant. The benchmark in question uses Levenshtein distance plus another “bouncing ball” style test, and the chart’s headline claim—Fortran outperforming C, Zig, Odin, and even Rust—sparks outrage because it invites naive inferences like “rewrite your code in Fortran for a big speedup.” The critique isn’t that Fortran is slow or that other languages are fast; it’s that the benchmark methodology is broken enough to invalidate the ranking.

The key technical finding comes from reproducing and inspecting the benchmark’s behavior. Assembly-level comparisons between C (compiled with GCC) and Fortran (compiled with gfortran) show the inner loop logic is essentially the same: GCC inlines a branch-free min operation using conditional moves, and the Fortran code generation looks similar. That makes the large performance gap suspicious. When the benchmark is run with additional verification, the Fortran implementation reports correct-looking minimum distances for the “closest pair” but fails when checking broader correctness metrics like maximum distance.

The real culprit is uncovered by printing what the Fortran program actually uses from the command-line arguments: it turns out the Fortran version is clipping or truncating input strings (the discussion calls out “Klippinstein,” i.e., a clipped variant). That changes the workload dramatically—especially in an N-squared Levenshtein computation, where longer strings dominate the cost—so the benchmark ends up measuring a different problem. The participants also note other confounds, including differences in memory allocation (the Fortran version uses allocatable heap buffers while the C version uses stack storage) and the benchmark’s inclusion of command-line parsing overhead, which is not representative of typical real workloads.
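
For orientation, below is a minimal sketch of the textbook two-row Levenshtein computation in C. It is a generic illustration rather than the benchmark's actual source, but it makes the cost structure visible: the nested loops run once per character pair, so the work grows with the product of the two string lengths, and silently truncating inputs shrinks that work quadratically.

```c
#include <stddef.h>
#include <string.h>

/* Textbook two-row Levenshtein distance: O(m*n) time, O(n) extra space.
   Illustrative sketch only -- not the benchmark's implementation.
   Assumes the second string is shorter than 1024 characters. */
static size_t levenshtein(const char *a, const char *b)
{
    size_t m = strlen(a), n = strlen(b);
    size_t prev[1024], curr[1024];

    for (size_t j = 0; j <= n; j++)
        prev[j] = j;

    for (size_t i = 1; i <= m; i++) {        /* outer loop: m iterations */
        curr[0] = i;
        for (size_t j = 1; j <= n; j++) {    /* inner loop: n iterations */
            size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            size_t del = prev[j] + 1;
            size_t ins = curr[j - 1] + 1;
            size_t sub = prev[j - 1] + cost;
            size_t min = del < ins ? del : ins;
            curr[j] = min < sub ? min : sub;
        }
        memcpy(prev, curr, (n + 1) * sizeof prev[0]);
    }
    return prev[n];
}
```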

The conclusion is blunt: microbenchmarks posted as rankings without rigorous validation are “bad information.” The participants argue that benchmark authors should validate correctness across implementations, spot-check generated code, and ensure all languages are solving the same problem under comparable conditions. Otherwise, language communities and maintainers get unfairly dragged into defensive debates based on charts that measure bugs, shortcuts, or mismatched workloads—not language performance in the real world.

Cornell Notes

The discussion warns that published language-performance charts can be misleading when benchmarks aren’t validated. A Levenshtein-distance benchmark shows Fortran far ahead of C and other languages, but assembly inspection suggests the inner-loop logic is similar, so the size of the gap doesn’t add up. When the Fortran results are checked more thoroughly, the implementation appears to compute a different workload by clipping/truncating input strings, which changes the cost structure of an N-squared edit-distance computation. The takeaway is that benchmark rankings require correctness checks, workload equivalence, and spot-checking of generated code; otherwise they risk spreading false conclusions about language speed.

Why does the Levenshtein benchmark’s “Fortran is faster” headline fail basic credibility checks?

The inner-loop code generated for C and Fortran looks nearly the same at the assembly level: both inline the min logic and avoid branching using conditional moves. If the hot loop is effectively equivalent, a large performance gap (often framed as double-digit or more) is suspicious. That mismatch between similar codegen and big timing differences is the first red flag that the benchmark may be measuring something other than true Levenshtein distance on identical inputs.
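
As an illustration of the codegen pattern being described (not the benchmark's actual code), a ternary-style minimum like the one below is the kind of construct that optimizing compilers such as GCC and gfortran typically lower to compare-plus-conditional-move sequences at -O2, which is why the hot loops in both languages can end up with essentially the same instructions.

```c
/* Branch-free three-way minimum. With optimization enabled, compilers
   commonly turn these ternaries into cmp/cmov sequences rather than
   branches -- the conditional-move pattern discussed for the hot loop. */
static inline unsigned int min3(unsigned int a, unsigned int b, unsigned int c)
{
    unsigned int m = a < b ? a : b;
    return m < c ? m : c;
}
```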

How did deeper validation reveal that the Fortran implementation wasn’t doing the same work as the C version?

The participants ran the benchmark and then added correctness checks beyond the benchmark’s reported “minimum distance” for the closest pair. When they computed additional metrics (like maximum distance) or otherwise verified more of the expected results, the Fortran version didn’t match. That suggested the program was either skipping parts of the search space or transforming inputs in a way that preserved the benchmark’s narrow output while breaking broader correctness.
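
A rough sketch of that kind of extra check is shown below. It assumes the levenshtein() helper from the earlier sketch is in scope and that a word list has already been loaded (both assumptions, not details from the stream): alongside the minimum, it tracks the maximum and a running checksum of all pairwise distances so two implementations can be compared on more than a single number.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical verification pass: in addition to the benchmark's single
   "minimum distance" output, accumulate a maximum and a checksum over all
   pairs so different language implementations can be cross-checked. */
void verify_all_pairs(const char **words, size_t count)
{
    size_t min_d = (size_t)-1, max_d = 0;
    unsigned long long sum = 0;

    for (size_t i = 0; i < count; i++) {
        for (size_t j = i + 1; j < count; j++) {
            size_t d = levenshtein(words[i], words[j]);
            if (d < min_d) min_d = d;
            if (d > max_d) max_d = d;
            sum += d;
        }
    }
    printf("min=%zu max=%zu checksum=%llu\n", min_d, max_d, sum);
}
```

If two implementations agree on the minimum but disagree on the maximum or the checksum, they are not solving the same problem, which is exactly the mismatch described here.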

What specific workload distortion was uncovered in the Fortran run?

Printing the effective command-line arguments used by the Fortran program showed it was applying a clipping/truncation step—described as a “Klippinstein” variant. In an N-squared Levenshtein computation, truncating longer strings drastically reduces the amount of edit-distance work, so the benchmark becomes a test of a modified problem rather than the intended full-distance computation. As a rough illustration, clipping a pair of 500-character strings down to 100 characters shrinks the DP table from 250,000 cells to 10,000 cells for that pair, a 25x reduction in inner-loop work before any language difference enters the picture.

Why do memory allocation differences matter in this kind of benchmark?

The discussion notes that the Fortran code uses allocatable buffers (dynamic allocation) for intermediate string storage, while the C version uses stack allocation. Even if the inner loop is similar, allocation strategy can affect runtime and makes cross-language comparisons less apples-to-apples unless the benchmark controls for these differences.
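
As a hedged sketch of the contrast (again illustrative, not the actual benchmark sources), the same scratch buffers can live on the stack or come from the allocator; when the distance routine runs once per pair over a large word list, per-call heap allocation is a real confound.

```c
#include <stdlib.h>

/* Stack-style buffers (roughly the pattern attributed to the C version):
   no allocator involvement per call. Assumes n is small enough for the
   stack; error handling omitted for brevity. */
void distance_with_stack_buffers(size_t n)
{
    size_t prev[n + 1], curr[n + 1];   /* C99 variable-length arrays */
    (void)prev; (void)curr;            /* ... DP loop would go here ... */
}

/* Heap-style buffers (analogous to Fortran allocatable arrays): each call
   pays malloc/free overhead, which adds up across millions of pairs. */
void distance_with_heap_buffers(size_t n)
{
    size_t *prev = malloc((n + 1) * sizeof *prev);
    size_t *curr = malloc((n + 1) * sizeof *curr);
    /* ... DP loop would go here ... */
    free(prev);
    free(curr);
}
```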

What methodological mistakes make microbenchmarks especially dangerous when shared as rankings?

The participants argue that microbenchmarks posted without validation can spread misinformation. Common issues include: not spot-checking generated code, not verifying that all implementations compute the same problem, and including confounds like command-line parsing overhead. When the workload or correctness differs, the chart becomes a ranking of bugs or shortcuts, not language performance.

How does this connect to the earlier language-choice conversation?

The language-selection debate emphasized that switching languages is costly and that tooling and methodology can mislead. The benchmark rant reinforces that point: choosing a language based on unreliable performance charts is another form of bad decision-making. If the evidence is flawed, the “investment” in adopting a new language may be wasted or misdirected.

Review Questions

  1. What evidence suggests that the C and Fortran implementations are doing similar inner-loop work, and why does that make the timing gap suspicious?
  2. How can truncating/clipping input strings change the computational cost of an N-squared Levenshtein benchmark?
  3. List three validation steps a responsible benchmark author should perform before publishing a cross-language performance ranking.

Key Points

  1. Choosing a new programming language for serious work is expensive because real proficiency takes time and early adoption often brings tooling “jank.”
  2. Go’s main appeal for asynchronous services is productivity and goroutines, even if the language has “foot guns.”
  3. Benchmark charts are unreliable when implementations aren’t validated for correctness across the full expected output, not just a narrow metric.
  4. Assembly-level inspection can reveal when a benchmark’s headline results don’t match the apparent hot-loop code generation.
  5. A major Levenshtein benchmark distortion came from Fortran clipping/truncating inputs, effectively changing the workload being measured.
  6. For cross-language performance comparisons, memory allocation strategy and other confounds (like command-line parsing) must be controlled or explicitly accounted for.
  7. Benchmark authors should spot-check codegen and verify that every language implementation computes the same problem under comparable conditions before publishing rankings.

Highlights

  • The inner-loop assembly for C and Fortran looked nearly identical, yet the benchmark claimed a large speed advantage—an immediate sign the benchmark likely measured something else.
  • Fortran’s results held up for the benchmark’s narrow output but failed broader correctness checks, pointing to a workload mismatch rather than true language speed.
  • Printing the effective inputs revealed the Fortran version clipped/truncated strings (“Klippinstein”), reducing the edit-distance work and invalidating the comparison.
  • The core warning: microbenchmarks posted as rankings without validation can spread false conclusions and unfairly damage language reputations.

Topics

  • Language Choice
  • Metaprogramming
  • Go vs Rust
  • Benchmark Methodology
  • Levenshtein Distance
