It's Really Just That Bad
Based on The PrimeTime's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Don’t trust language benchmark rankings without verifying that implementations compute the same problem on the same inputs.
Briefing
A widely shared “language benchmark” ranking that put Fortran far ahead of C (and other languages) collapses under basic verification: the Fortran implementation is effectively not computing the full Levenshtein distance workload the benchmark claims to measure. The mismatch isn’t subtle. When the benchmark is stress-tested with additional checks, the Fortran program reports correct-looking minimum distances while failing to produce correct maximum distances, and the root cause turns out to be input “clipping” that shortens the strings being compared. Because each pairwise comparison costs time quadratic in string length, and the benchmark runs O(n²) such comparisons, that clipping dramatically reduces the total work, letting Fortran appear faster without actually doing the same computation.
The discussion starts with the Levenshtein distance benchmark itself. Levenshtein distance compares two strings by counting the minimum number of edits (insertions, deletions, substitutions) needed to transform one into the other. The benchmark’s outer structure forces a huge amount of work: it loops over all pairs of strings, yielding n² comparisons, and each comparison is itself quadratic in string length because the dynamic-programming table walks every character of one string against every character of the other. The hot path therefore scales like n² pairs times m² character steps, an “n to the fourth”-style workload when string count and length grow together. That is why performance differences are meaningful only if every language implementation runs the same algorithm on the same inputs.
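To make that cost structure concrete, here is a minimal C sketch of the same shape, not the benchmark’s actual code: a standard two-row Levenshtein implementation plus an all-pairs driver. With n strings of typical length m, the driver makes on the order of n² calls and each call does O(m²) character work, which is exactly why trimming string length has such an outsized effect on timings.

```c
#include <stddef.h>
#include <string.h>

/* Classic two-row dynamic-programming Levenshtein distance.
   Cost per call is O(len(a) * len(b)) character comparisons. */
static size_t levenshtein(const char *a, const char *b) {
    size_t la = strlen(a), lb = strlen(b);
    size_t prev[1024], curr[1024];   /* assumes strings shorter than 1024 chars */

    for (size_t j = 0; j <= lb; j++) prev[j] = j;
    for (size_t i = 1; i <= la; i++) {
        curr[0] = i;
        for (size_t j = 1; j <= lb; j++) {
            size_t del = prev[j] + 1;
            size_t ins = curr[j - 1] + 1;
            size_t sub = prev[j - 1] + (a[i - 1] != b[j - 1]);
            size_t m   = del < ins ? del : ins;   /* min of three edit costs */
            curr[j]    = m < sub ? m : sub;
        }
        memcpy(prev, curr, (lb + 1) * sizeof prev[0]);
    }
    return prev[lb];
}

/* All-pairs driver: n strings give roughly n^2 distance calls, so the
   total work scales like n^2 * m^2 for strings of typical length m. */
static size_t min_pairwise_distance(const char **strs, size_t n) {
    size_t best = (size_t)-1;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            if (i != j) {
                size_t d = levenshtein(strs[i], strs[j]);
                if (d < best) best = d;
            }
    return best;
}
```

Because both loops are quadratic, doubling either the number of strings or their length roughly quadruples the work, so any shortcut on string length shows up as a large timing win.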
Yet the Fortran-vs-C speed gap looks suspicious even at the assembly level. After compiling optimized C and Fortran versions, the generated inner-loop code is largely the same: both inline the same kind of min-of-three logic and use branch-free conditional moves rather than branches. The only notable difference is minor instruction scheduling and an extra loop unroll in the Fortran build—but the unroll doesn’t explain the large timing advantage. A forced unroll pragma doesn’t change performance either, undermining the “Fortran is inherently faster” narrative.
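The inner-loop pattern the assembly comparison centers on is small enough to show in isolation. In a sketch like the one below, gcc and clang at -O2 or -O3 typically lower the ternaries to conditional-move instructions rather than branches, which matches the observation that both the C and Fortran builds end up with essentially the same branch-free min-of-three code.

```c
/* Min-of-three at the heart of the Levenshtein inner loop. Written as
   ternaries, optimizing compilers usually emit cmov instructions here,
   so neither language pays a branch-misprediction penalty. */
static inline unsigned min3(unsigned a, unsigned b, unsigned c) {
    unsigned m = a < b ? a : b;
    return m < c ? m : c;
}
```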
The real break comes from re-running the benchmark end to end with its actual inputs. The benchmark passes the strings via command-line arguments, which already adds a questionable confounder (argument-parsing time and behavior), but the more serious issue is correctness. The Fortran code matches the benchmark’s expected minimum distance under the benchmark’s own reporting, and that alone can mask errors: agreeing on the single closest pair says nothing about whether the rest of the distance matrix is computed correctly.
When the analysis expands to compute a maximum distance as well, the results diverge: the Fortran version reports a maximum that doesn’t match the C version’s. That discrepancy points to a logic issue rather than a compiler optimization. Inspecting Fortran-specific behavior reveals the program is effectively running a “Klippenstein algorithm” rather than the straightforward Levenshtein computation the benchmark implies: it clips the input strings to a smaller effective length. In an all-pairs distance workload, shortening the strings cuts the work per comparison roughly quadratically (halving the length quarters the character-level work), so the benchmark’s timing measures an altered, smaller problem, not language speed on the same algorithm.
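A cross-check along these lines is easy to sketch: compute both the minimum and the maximum pairwise distance and require every implementation to agree on both. The code below is illustrative only; it assumes a full-string levenshtein() like the sketch above rather than the benchmark’s actual sources.

```c
#include <stdio.h>
#include <stddef.h>

/* Assumed to exist elsewhere: a full-string Levenshtein distance,
   e.g. the two-row sketch shown earlier. */
size_t levenshtein(const char *a, const char *b);

/* Cross-check: two implementations that really run the same algorithm
   on the same inputs must agree on the maximum as well as the minimum. */
static void report_min_max(const char **strs, size_t n) {
    size_t min_d = (size_t)-1, max_d = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            if (i != j) {
                size_t d = levenshtein(strs[i], strs[j]);
                if (d < min_d) min_d = d;
                if (d > max_d) max_d = d;
            }
    printf("min=%zu max=%zu\n", min_d, max_d);
}
```

The maximum is the more sensitive probe here: the distance between two strings clipped to k characters can never exceed k, so truncated inputs cap the reported maximum even when the minimum happens to look plausible.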
Once the behavior is corrected to use the full strings, the ordering flips: C comes out faster than Fortran. The conclusion is blunt: the benchmark is methodologically unreliable because it never validates correctness or equivalence across implementations, and it mixes in performance-relevant implementation details (such as dynamic allocation patterns and input handling) that aren’t controlled. The broader takeaway is that microbenchmarks posted as rankings, especially without correctness checks, can spread misinformation faster than they can be audited, turning language performance debates into arguments over benchmarking bugs rather than engineering insight.
Cornell Notes
The Levenshtein distance benchmark that ranked Fortran far above C fails a basic equivalence test. Assembly inspection shows the inner-loop logic for C and Fortran is largely the same, so the huge speed gap can’t be explained by compiler codegen alone. When the benchmark is validated more thoroughly (e.g., checking maximum distances, not just the minimum), the Fortran result is inconsistent with C. The discrepancy traces to the Fortran implementation using the Klippenstein algorithm, which clips input strings—reducing the effective work in an all-pairs O(n²) workload. Once the full strings are used, the performance ordering flips, showing the original ranking measured a smaller problem, not true language speed.
- Why does the Levenshtein “all pairs” benchmark create such a heavy performance hotspot?
- What did assembly-level inspection reveal about the C vs Fortran implementations?
- How can a benchmark appear correct while still being wrong?
- What specific bug/behavior undermined the Fortran speed advantage?
- What happens to the performance ranking after using the full strings?
- Why is command-line input handling a methodological red flag in this benchmark?
Review Questions
- What evidence suggests the Fortran speedup wasn’t due to fundamentally different inner-loop codegen?
- How does checking only a benchmark’s reported minimum distance risk missing correctness problems?
- Why does clipping string lengths disproportionately affect runtime in an all-pairs Levenshtein workload?
Key Points
1. Don’t trust language benchmark rankings without verifying that implementations compute the same problem on the same inputs.
2. Assembly inspection can quickly rule out “mystery” speedups when inner-loop logic and optimizations (like inlining and conditional moves) are effectively equivalent.
3. A benchmark that reports only a single metric (e.g., minimum distance) can mask widespread correctness errors across the rest of the distance matrix.
4. Input clipping or algorithm substitution (e.g., the “Klippenstein algorithm”) can reduce effective work and produce misleading performance results.
5. Benchmark methodology matters: command-line argument parsing can introduce confounds unrelated to the algorithm under test.
6. If the performance ordering flips after correcting for equivalence (e.g., using full strings), the original ranking likely measured a smaller problem, not true language speed.
7. Microbenchmarks posted as rankings without correctness validation can spread misinformation and force maintainers into defensive, low-signal debates.