Karpathy vs. McKinsey: The Truth About AI Agents (Software 3.0)

5 min read

Based on AI News & Strategy Daily | Nate B Jones's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Karpathy frames LLMs as probabilistic “stochastic simulations of people,” which undermines assumptions of deterministic execution.

Briefing

AI agents are colliding with enterprise reality, and the sharpest contrast this week came from Andrej Karpathy’s “Software 3.0” framing versus McKinsey’s “agentic mesh” pitch. Karpathy’s core claim is that large language models should be treated less like deterministic programs and more like “stochastic simulations of people”: useful, human-feeling, and powerful, but inherently jagged in execution. That framing matters because it forces a design shift: software must assume humans will validate outputs, and systems must be built to keep that human-in-the-loop workflow sustainable.

Karpathy’s talk to entrepreneurs at Y Combinator Startup School argues that the next coding language is effectively English, not because code will disappear, but because LLMs behave like probabilistic “people spirits” rather than reliable machines. He compares LLMs to utilities—metered by tokens like electricity—and to operating systems, where different model ecosystems create user preference “wars” similar to Windows versus Mac. The practical takeaway is about how to build for agents that can’t yet be trusted to execute high-stakes tasks end-to-end. Instead of betting on full autonomy, teams should design validation loops first, making checking as easy as possible. He also recommends constraining generation—“putting the LLM on a short leash”—so evaluators aren’t overwhelmed by hundreds of candidate outputs when humans can only verify a small subset.
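
To make that concrete, here is a minimal Python sketch of the validation-first, short-leash pattern. The generate_candidates and human_review functions are hypothetical placeholders for a real model call and a real review step, not any specific API; the one constraint carried over from the talk is that generation is capped at what a human can check.

    from dataclasses import dataclass

    @dataclass
    class Candidate:
        text: str
        approved: bool = False

    def generate_candidates(prompt: str, n: int) -> list[Candidate]:
        # Placeholder for a real LLM call that returns n draft outputs.
        return [Candidate(text=f"draft {i} for: {prompt}") for i in range(n)]

    def human_review(candidate: Candidate) -> bool:
        # Placeholder for the human-in-the-loop check (a review UI, an
        # approval queue); the human, not the model, is the arbiter.
        return bool(candidate.text)

    def run_on_short_leash(prompt: str, eval_budget: int) -> list[Candidate]:
        # The "short leash": never generate more than a human can validate.
        candidates = generate_candidates(prompt, n=eval_budget)
        for c in candidates:
            c.approved = human_review(c)
        return [c for c in candidates if c.approved]

    approved = run_on_short_leash("draft an onboarding email", eval_budget=10)

Typing outputs as candidates rather than results is the point: nothing downstream consumes them until a review flag is set.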

The transcript also flags where Karpathy’s view may be incomplete. English may become a dominant interface, but complex systems still require strong technical engineering, especially as traditional software interacts with agentic, AI-augmented components. The talk acknowledges the limits of “vibe coding” (popularized by Karpathy’s earlier work): it works well in local environments, but breaks down across deployment pipelines, CI/CD, and integrations. Karpathy’s broader vision is an “augmented Iron Man suit” for human reach and control—agents expand what people can do, but data systems, control systems, and validation mechanisms must be engineered to match how probabilistic models behave.

That builder-first emphasis is contrasted with McKinsey’s CEO-facing messaging. While the boardroom-friendly themes (workflow thinking, not just task automation) are directionally correct, the transcript criticizes “agentic mesh” as vague “word salad” lacking empirical grounding. The concern is practical: McKinsey’s implied promise that agents can be plugged in like USB ports, swapping models such as “Mistral Small” or “GPT-3.5 Turbo” with minimal rework, clashes with how enterprise systems actually ship. The transcript points to a key mismatch between simplified narratives and engineering constraints, including the underwhelming performance of edge computing for models, an approach that has not delivered the expected sustained gains.

The closing message is a call for cultural change: organizations need to tell the truth about AI system complexity and adopt a crawl-walk-run rollout strategy rather than starting with full automation of core business lines. The payoff can be real—agentic systems can unlock major value—but only if teams design for human validation, constrain generation, and build systems that don’t pretend integration is plug-and-play.

Cornell Notes

Karpathy’s “Software 3.0” reframes LLMs as “stochastic simulations of people,” not deterministic programs. That shift implies software must be designed around probabilistic behavior: humans should validate outputs in a sustainable loop, and generation should be constrained so evaluators aren’t swamped. He argues the next “coding language” is English as an interface, while also admitting that vibe coding struggles in real deployment pipelines (CI/CD, integrations). The transcript contrasts this builder-focused, implementable view with McKinsey’s “agentic mesh,” criticized as vague and insufficiently grounded for engineering teams. The stakes: enterprise AI projects fail when board-level promises ignore integration complexity and the limits of autonomy.

Why does treating LLMs as “stochastic simulations of people” change how software should be built?

Because probabilistic outputs are “jagged” in reliability. If execution can’t be trusted end-to-end, systems must be engineered for human validation. That means designing validation loops into the workflow from the start and treating AI outputs as candidates that require checking, not as deterministic results that can be safely executed without review.

What does “putting the LLM on a short leash” mean in practice?

It’s a constraint on how much the model generates relative to how many items humans can realistically evaluate. The example given is generating hundreds of ad variants while humans can only validate 10—an imbalance that wastes compute and effort. Constraining generation keeps the system aligned with the human evaluation budget.
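
A back-of-the-envelope version of that imbalance, with the 10-item validation capacity taken from the example and an assumed, purely illustrative cost per variant:

    generated = 300          # "hundreds" of ad variants, per the example
    eval_budget = 10         # variants a human can realistically validate
    cost_per_variant = 0.02  # assumed inference cost in dollars (illustrative)

    unreviewed = generated - eval_budget
    print(f"{unreviewed} variants nobody will check; "
          f"~${unreviewed * cost_per_variant:.2f} spent generating them")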

How does Karpathy connect LLMs to utilities and operating systems?

He uses analogies to explain how LLMs fit into product ecosystems. As utilities, usage can be metered (e.g., dollars per token), similar to electricity. As operating systems, different model ecosystems create user preference “wars” (compared to Windows vs. Mac), implying that integration and workflow design matter as much as raw model capability.
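
The utility analogy maps directly onto how hosted models are billed. A small sketch with placeholder per-million-token prices (not any vendor’s actual rates):

    def metered_cost(input_tokens: int, output_tokens: int,
                     usd_per_m_input: float = 3.00,
                     usd_per_m_output: float = 15.00) -> float:
        # Per-million-token pricing is the usual billing unit, analogous
        # to a utility metering kilowatt-hours.
        return (input_tokens / 1e6) * usd_per_m_input \
             + (output_tokens / 1e6) * usd_per_m_output

    print(f"${metered_cost(120_000, 8_000):.4f}")  # e.g., one long-document task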

Where does the transcript say Karpathy’s “English as the next coding language” view may be limited?

It argues that complex systems still require strong technical engineers who understand system construction. English may become the dominant interface, but the transition won’t be “English driving code all the way through,” because agentic augmentation increases system complexity and introduces integration challenges across the stack.

What specific engineering concern is raised about McKinsey’s “agentic mesh”?

The transcript claims “agentic mesh” lacks empirical grounding and doesn’t translate into buildable guidance. It criticizes the implied promise that agents can be plugged in like USB ports—swapping models with minimal modification—contradicting how enterprise systems require integration work, model-specific behavior handling, and data/control alignment.
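
A sketch of why the swap is not USB-like: each model family tends to need its own prompt format, stop handling, and tool-calling quirks, so changing models means writing and re-validating an adapter plus everything downstream that assumed the old model’s behavior. Class names and templates below are illustrative, not vendor specifications.

    import json
    from abc import ABC, abstractmethod

    class ModelAdapter(ABC):
        @abstractmethod
        def format_prompt(self, system: str, user: str) -> str: ...

    class MistralSmallAdapter(ModelAdapter):
        def format_prompt(self, system: str, user: str) -> str:
            # Instruction-template style (illustrative).
            return f"[INST] {system}\n{user} [/INST]"

    class Gpt35TurboAdapter(ModelAdapter):
        def format_prompt(self, system: str, user: str) -> str:
            # Chat-style models expect role-structured messages instead.
            return json.dumps([{"role": "system", "content": system},
                               {"role": "user", "content": user}])

    # The adapter is the visible cost; the hidden cost is re-testing every
    # workflow that depended on the previous model's behavior.
    adapter: ModelAdapter = MistralSmallAdapter()
    print(adapter.format_prompt("You are a claims-triage agent.", "Classify this claim."))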

Why does edge computing for models come up, and what does the transcript imply about it?

It’s used to challenge the idea that small models running at the edge can reliably handle agentic tasks. The transcript says edge computing for models hasn’t worked as expected, noting that larger models have shown sustained intelligence gains that smaller edge models can’t match. It also mentions Apple’s bet on edge computing as an example that hasn’t paid off yet.

Review Questions

  1. How do human-in-the-loop validation and constrained generation work together to make agentic systems reliable?
  2. What are the practical reasons vibe coding struggles beyond local environments, according to the transcript?
  3. Why does the transcript argue that “plug-and-play” agent architectures are unrealistic in enterprise settings?

Key Points

  1. Karpathy frames LLMs as probabilistic “stochastic simulations of people,” which undermines assumptions of deterministic execution.
  2. Agentic software should be designed around human validation loops, with checking treated as a first-class engineering problem.
  3. Constraining generation prevents evaluator overload; generating far more candidates than humans can review wastes compute.
  4. Even if English becomes a dominant interface, building and maintaining complex systems still requires strong technical engineering and careful integration.
  5. The transcript criticizes McKinsey’s “agentic mesh” as lacking empirical grounding and not translating into buildable guidance for tech teams.
  6. “Plug-and-play” agent promises (USB-like model swapping) conflict with real enterprise integration constraints and model-dependent behavior.
  7. A crawl-walk-run rollout approach is recommended over starting with full automation of core business functions.

Highlights

Karpathy’s “people spirits” framing treats LLM outputs as probabilistic candidates, not dependable program execution—so validation must be engineered in.
A key operational warning: generating hundreds of options while humans can only validate a handful turns agentic systems into wasted effort.
The transcript’s central critique of “agentic mesh” is that it sounds plug-and-play for CEOs, but doesn’t match what engineering teams need to ship reliably.
Edge computing for models is portrayed as underperforming versus expectations, weakening the case for small-model autonomy at the edge.
