
Linus On LLMs For Coding

The PrimeTime · 5 min read

Based on The PrimeTime's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

LLMs are expected to become a standard coding aid, with some workflows already using them to generate code that developers submit as pull requests.

Briefing

Large language models are likely to become a routine part of coding—first as assistants that help generate or check code, and eventually as tools that can contribute to code review and maintenance—but their usefulness depends heavily on human understanding and strong testing, not on “autopilot” trust. The conversation centers on a practical question: will LLM-written code be submitted as a pull request? The answer lands on “yes,” and it is already happening in smaller ways, because automation has steadily moved closer to the developer workflow over decades.

A key tension runs through the discussion: optimism about LLM capability versus skepticism about reliability. One side points to the near-term value of catching “obvious stupid bugs” and flagging patterns that deviate from expected norms—similar to what compilers and linters do, but potentially at a higher level of nuance. The other side highlights a hard limit: LLMs can hallucinate, invent code paths, or produce confident-sounding mistakes. That risk becomes more serious when models are allowed to act without a human catching errors, especially in security-sensitive contexts.
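To make the “obvious stupid bugs” category concrete, here is a hypothetical sketch (the example is ours, not from the video) of the kind of mistake linters already flag and an LLM reviewer could plausibly catch too — Python’s classic mutable default argument:

```python
# Hypothetical illustration: a linter-class bug and its idiomatic fix.

def append_item_buggy(item, items=[]):  # mutable default: one list shared across calls
    items.append(item)
    return items

def append_item_fixed(item, items=None):  # standard fix: create a fresh list per call
    if items is None:
        items = []
    items.append(item)
    return items

# The buggy version silently accumulates state between calls:
print(append_item_buggy("a"))  # ['a']
print(append_item_buggy("b"))  # ['a', 'b']  <- surprising carryover

# The fixed version behaves as intended:
print(append_item_fixed("a"))  # ['a']
print(append_item_fixed("b"))  # ['b']
```

Static tools catch this pattern purely from code shape; the hope discussed here is that LLMs can extend the same idea to subtler mismatches between intent and implementation.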

The most concrete example involves security bug reporting around curl. A reported issue allegedly stems from a buffer-related error path that, in practice, does not match the repository’s actual code. The current maintainer pushes back, arguing the problematic snippet isn’t present, and describes an interaction where the person using the LLM believed the model’s output rather than verifying it against the codebase. The takeaway isn’t that LLMs can’t help; it’s that they can generate plausible but wrong details, and those errors can waste time or even misdirect security work.

Another thread argues that LLMs are best when they support—not replace—developer judgment. Manual translation between languages (cited in the discussion as an approach where a team outperformed “automagic” conversion) is used to illustrate why nuance matters: understanding edge cases and the real problem domain can outperform automation that merely “gets the syntax right.” The same logic extends to testing. The conversation repeatedly returns to the idea that good automated tests are the gatekeeper for safe adoption, yet there’s skepticism that “good enough” coverage exists broadly enough to make fully automated LLM-driven changes dependable.

The discussion also challenges how people talk about LLMs. Claims of being “10x better” after minimal use are treated as misleading, because they may only reflect improvement from a low baseline rather than mastery of what “good” looks like. There’s also a call for longer, deliberate experimentation—trying tools like Claude and other assistants over months rather than minutes.

Overall, the core insight is conditional: LLMs can be powerful accelerators for developers who understand the nuance of the system and can validate outputs with tests and review. Without that grounding, hallucinations and subtle misunderstandings can turn productivity gains into costly bugs—especially when teams scale usage faster than their verification practices.

Cornell Notes

Large language models are expected to become a normal part of coding, moving beyond autocomplete into assistance for writing, reviewing, and maintaining code. The promise is strongest for catching obvious mistakes and flagging suspicious patterns, but reliability is constrained by hallucinations and confident wrong outputs. The curl security example highlights how LLM-generated snippets can be plausible yet not present in the real codebase, wasting time and misdirecting fixes. Adoption is safest when developers retain domain understanding and enforce verification through strong automated tests and human review. Short trials and hype-driven claims are treated as unreliable; meaningful evaluation requires sustained use and careful comparison against real engineering standards.

Why does the conversation treat LLM coding as “automation,” not a revolutionary leap?

The discussion frames LLMs as the next step in a long automation arc: developers no longer write machine code or assembler, and toolchains have moved from C toward higher-level languages like Rust. From that perspective, LLMs fit as another layer that helps generate or transform code, even if today’s headlines make it feel unprecedented.

What’s the strongest argument for LLMs helping with code review and maintenance?

The most practical case is bug detection—especially “obvious stupid bugs” and additional checks for patterns that don’t match expected norms. Compilers already warn about clear issues; the hope is that LLMs can extend warnings to more subtle cases by comparing code structure and intent against typical patterns.

How does the curl example illustrate the main failure mode?

A reported security issue is tied to a code path that the maintainer says does not exist in the repository. The interaction describes someone relying on LLM output that included a snippet believed to be present, while the maintainer counters that the exact code isn’t there. The lesson is that LLMs can hallucinate plausible details, so outputs must be verified against the actual codebase.

Why does “manual translation” beat “automagic translation” in the cited discussion?

The argument is that syntax-level conversion isn’t the same as understanding the problem. A team that manually translates code can track edge cases, small nuances, and where bugs actually originate. When LLM intervention replaces that understanding, teams may get similar error rates in some narrow sense while losing the deeper ability to reason about correctness.
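As a hedged sketch of the kind of edge case meant here (the specific language pair is our assumption, not from the discussion): integer division rounds differently in C and Python, so a syntax-level port of `a / b` to `a // b` silently changes results for negative operands — exactly the nuance a team doing manual translation would track:

```python
# Hypothetical translation pitfall (assumed example, not from the video):
# C's integer division truncates toward zero; Python's // floors toward
# negative infinity. A line-by-line "automagic" port of C's `a / b` to
# Python's `a // b` changes results whenever the operands' signs differ.

def c_style_div(a, b):
    """Reproduce C's truncating integer division in Python."""
    q = a // b
    # Python floors; bump the quotient back up when the division was
    # inexact and the operands had opposite signs.
    if (a % b != 0) and ((a < 0) != (b < 0)):
        q += 1
    return q

print(-7 // 2)             # -4  (Python floor division)
print(c_style_div(-7, 2))  # -3  (what the original C code computed)
```

A converter that only “gets the syntax right” produces the first result; understanding the problem domain is what tells you the second one was intended.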

What role do tests play in making LLM-assisted coding safe?

Automated tests are presented as the critical safety net. The conversation doubts that “good enough” tests exist widely enough to guarantee that LLM-generated changes won’t break real-world behavior. A separate anecdote about a major incident emphasizes that tests can be misleading if they don’t match real conditions—mocked or generated scenarios may pass while production fails.
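A minimal sketch of that failure mode (the function names and scenario are assumptions for illustration): a test built on a mock encodes an assumption the real dependency does not honor, so the suite passes while production breaks:

```python
# Hypothetical sketch: a mocked test passes even though the real
# dependency violates the assumption baked into the mock.

from unittest.mock import Mock

def summarize(fetch_user):
    """Format a display string; assumes fetch_user() always returns a dict."""
    user = fetch_user()
    return f"{user['name']} <{user['email']}>"

# Test-time: the mock always returns a well-formed dict, so the test passes.
mock_fetch = Mock(return_value={"name": "Ada", "email": "ada@example.com"})
assert summarize(mock_fetch) == "Ada <ada@example.com>"

# Production-time: the real service can return None on timeout -- a case
# the mocked scenario never exercised, so summarize() raises at runtime.
def real_fetch_timed_out():
    return None  # behavior the mock never modeled

try:
    summarize(real_fetch_timed_out)
except TypeError:
    print("passed in tests, failed in production")
```

The point is not that mocks are bad, but that coverage which never exercises real conditions cannot certify LLM-generated changes as safe.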

Why does the discussion criticize quick, shallow evaluations of LLMs?

Claims like “I used it for five minutes and got better” are treated as uninformative. The conversation argues that people may only be improving from a low baseline, without knowing what “good” looks like. It recommends longer experimentation—using tools consistently over months—to judge whether the assistance truly improves outcomes.

Review Questions

  1. What specific verification step does the curl example imply is necessary before acting on LLM-generated code or security claims?
  2. How does the discussion connect LLM usefulness to developer domain understanding and the quality of automated tests?
  3. What reasons are given for why short LLM trials can lead to misleading conclusions about productivity gains?

Key Points

  1. LLMs are expected to become a standard coding aid, with some workflows already using them to generate code that developers submit as pull requests.

  2. The most credible near-term value is catching obvious bugs and flagging suspicious deviations from expected code patterns.

  3. Hallucinations remain a central risk: LLM outputs can include plausible but nonexistent code paths, as illustrated by the curl security dispute.

  4. Safe use depends on human domain understanding plus verification—especially strong automated tests and careful review.

  5. Automation that replaces nuance (e.g., “automagic” translation) can underperform approaches that preserve deep problem understanding.

  6. Claims of dramatic productivity gains after minimal LLM use are treated as unreliable without sustained evaluation and baseline comparison.

Highlights

The curl example underscores how LLMs can generate confident snippets that don’t exist in the real repository, turning “helpful” output into wasted security effort.
LLMs may extend the spirit of compiler warnings—catching not just obvious errors but also pattern mismatches—yet they still require human validation.
Manual, nuance-aware translation can outperform automated conversion, because correctness depends on understanding edge cases, not just producing syntactically valid code.
The safest adoption path hinges on tests that reflect real-world conditions; mocked or generated scenarios can pass while production fails.
