
I Tracked Every AI Win & Failure in 2025. Here's What Actually Worked (9 Surprises)

6 min read

Based on AI News & Strategy Daily | Nate B Jones's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Letting LLMs use code as a tool turns natural language into computer control, enabling agentic workflows that non-code users can adopt.

Briefing

2025’s biggest unlock wasn’t a new model—it was letting large language models (LLMs) use code as a tool. That shift turned “talk to the model” into “talk to the computer,” enabling agentic workflows that can manipulate files, run tasks, and iterate with far less technical friction. The change built gradually through the year as practical layers emerged: Claude Code, the Model Context Protocol (MCP), “skills,” Codex, and tools like Cursor. By the end of the year, plain English increasingly became a control surface for software work, not just a way to generate text. The implication is clear: once an LLM can operate parts of a computer, the addressable value expands beyond developers into non-code users—because the interface becomes natural language plus execution.

A second major surprise was images finally catching up enough to matter. For much of 2025, text stayed the most reliable way to get accurate results, with code treated as just another text-like language. But when image generation reached the point where detailed infographics, maps, layouts, and slide-like visuals didn’t look “weird,” the payoff went beyond prettier decks. It enabled generative UI—interfaces that aren’t fixed to a developer’s static design or a single screen. The vision described is a continuous, evolving digital surface: experiences that adapt to context, potentially spanning wearables, combinations of phone and laptop, and even generative elements embedded alongside services like Comet. The caveat is equally important: consistency and habit still matter, so the future isn’t a total replacement of familiar interfaces. Still, solving images was framed as a foundational step toward graphical experiences that can evolve with users rather than forcing users to adapt to rigid UI.

Beyond those two headline shifts, several operational lessons stood out. One was that progress didn’t require “AI developers” in the narrow sense. Individuals who designed systems—using templates, validators, retries, and fast iteration—could outperform larger teams that treated engineering as a model-worship problem. That fed a broader reframe: technical and non-technical categories are less useful than curiosity about domain problems and willingness to learn the AI skills needed to solve them.

Another practical win came from verification loops in agentic systems. Measuring correctness across multiple dimensions and feeding results back into iteration made agents far harder to game and dramatically faster—likened to adding a jet engine to an airplane. The expectation for 2026 is more standardization: shared eval and verification primitives, especially for areas like accessibility, so teams don’t reinvent the same checks.

The “messy middle” also emerged as underbuilt but valuable. While attention often fixated on model makers owning the stack, the transcript argues that transforming messy model outputs into structured representations—routing intent, orchestrating tool calls, handling exceptions, and producing usable interfaces—creates durable leverage. Cursor is cited as a prominent example, but the broader point is that intelligence substrates can be built on top of models.

Finally, the year’s momentum shifted from hype to quality. AI slop was treated as a symptom of unmanaged, unconstrained production; better systems can generate performant, information-dense marketing and content without sacrificing usefulness. Meanwhile, labor-market signals suggested faster selection for creative problem-solving instincts, and leaders increasingly moved from cost-cutting toward quality lift—scaling customer value while keeping people focused on what only humans can do. The closing question: what exceeded expectations in 2025?

Cornell Notes

The transcript’s core claim is that 2025’s real progress came from practical unlocks, especially letting LLMs use code as a tool and finally making images reliable enough for complex visuals. Together, these changes expanded what agents can do: code access turns natural language into computer control, while stronger image generation enables generative UI that can adapt beyond fixed screens. It also highlights execution lessons—verification loops make agents faster and more reliable, and the “messy middle” (routing, orchestration, exception handling, and UI) remains a major value-creation layer. Progress didn’t depend solely on AI specialists; system design and fast iteration by domain-focused builders often outperformed model-centric approaches. The year’s direction points toward standardized evaluation, higher-quality deployments, and content that earns attention through usefulness rather than volume.

Why does “LLMs using code as a tool” matter more than incremental model improvements?

The transcript frames code access as a massive unlock because it turns an LLM from a text generator into an executor. Once an LLM can work with code, it can operate parts of the computer—manipulating files, running tasks, and supporting agentic workflows. That expands usability beyond technical users because plain English becomes a way to control software. The year’s progress is described as a gradual assembly of enabling layers: Claude Code, the Model Context Protocol (MCP), “skills,” Codex, and tools like Cursor—each making it easier for agents to translate natural language into actions.
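The "text generator to executor" shift can be sketched as a minimal tool-use loop: the model emits a structured action, and a dispatcher executes it against the file system. Everything here (`fake_llm`, the `TOOLS` registry, the JSON action format) is illustrative, not any vendor's actual agent API.

```python
# Minimal sketch of "code as a tool": a (mocked) model turns a plain-English
# request into a structured tool call, and a dispatcher executes it.
import json
import pathlib
import tempfile

def write_file(path: str, content: str) -> str:
    """Tool: write text to a file on the local machine."""
    pathlib.Path(path).write_text(content)
    return f"wrote {len(content)} bytes to {path}"

def read_file(path: str) -> str:
    """Tool: read a file back."""
    return pathlib.Path(path).read_text()

TOOLS = {"write_file": write_file, "read_file": read_file}

def fake_llm(request: str, workdir: str) -> str:
    # Stand-in for a real model call: here it always chooses write_file.
    return json.dumps({"tool": "write_file",
                       "args": {"path": f"{workdir}/notes.txt",
                                "content": request}})

def run_agent_step(request: str, workdir: str) -> str:
    # One agent step: ask the model for an action, then execute it.
    action = json.loads(fake_llm(request, workdir))
    return TOOLS[action["tool"]](**action["args"])

with tempfile.TemporaryDirectory() as d:
    print(run_agent_step("Summarize the 2025 AI unlocks.", d))
```

The key design point is that the model never touches the computer directly; it only names a tool from a fixed registry, which is what makes the pattern usable by non-code users.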

What changed about images in 2025, and why did that enable new interface possibilities?

Images improved to the point where detailed infographics, maps, layouts, and slide-like outputs looked coherent rather than “weird.” The transcript argues that this matters because humans process information visually faster than through text alone. With images solved enough for mixed text-and-image work, the path opens to generative UI: interfaces that evolve with the user instead of being locked to a developer’s fixed screen layout. Examples include redecoration-like personal applications (fashion and home), and the idea of continuous engagement surfaces across devices—wearables, phone/laptop combinations, or generative elements embedded in experiences like Comet.

How did the transcript challenge the idea that “AI developers are everything”?

It argues that what’s often missing isn’t an AI specialist but system design. Individuals can out-execute teams by treating engineering as a workflow they can design—using templates, validators, and retries, then iterating quickly and aggressively. The transcript also warns against confusing “agentic” with “good.” The broader takeaway is that everyone should be judged by curiosity about domain problems and willingness to learn the AI skills needed to solve them, including technical skills that are increasingly approachable via tools like ChatGPT (e.g., scheduled learning and coding reviews).
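The templates-validators-retries pattern the transcript credits can be sketched in a few lines: wrap an unreliable generator in a validator and a bounded retry loop. The generator here is a mock that fails validation once; in practice it would be a model call, and the field names are invented for illustration.

```python
# Sketch of "system design over model worship": a template, a validator,
# and bounded retries around a flaky (mocked) generator.
TEMPLATE = "Subject: {subject}\nBody: {body}"

def generate_draft(attempt: int) -> dict:
    # Stand-in for an LLM call; deliberately fails validation on attempt 0.
    if attempt == 0:
        return {"subject": "", "body": "Q3 results attached."}
    return {"subject": "Q3 results", "body": "Q3 results attached."}

def validate(draft: dict) -> list[str]:
    # Return a list of problems; an empty list means the draft is valid.
    errors = []
    if not draft.get("subject"):
        errors.append("missing subject")
    if not draft.get("body"):
        errors.append("missing body")
    return errors

def generate_with_retries(max_attempts: int = 3) -> str:
    for attempt in range(max_attempts):
        draft = generate_draft(attempt)
        if not validate(draft):
            return TEMPLATE.format(**draft)
    raise RuntimeError("no valid draft after retries")

print(generate_with_retries())
```

Note that none of this requires model expertise: the leverage comes from the scaffolding around the call, which is exactly the reframe the transcript describes.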

What are verification loops, and why are they portrayed as a game-changer for agents?

Verification loops are mechanisms that measure correctness across multiple dimensions and feed results back into agent iteration. The transcript compares the effect to adding a jet engine to an airplane: agents become much faster and more reliable when correctness checks are hard to game. It also predicts an ecosystem and standardization in 2026—teams will adopt shared eval/verification loops (for example, accessibility checks) rather than reinventing them each time.
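A verification loop of this shape can be sketched as scoring each draft against several independent checks and feeding the failures back as revision hints. The three checks and the mock "agent" below are illustrative stand-ins; a real system would call a model and use domain-specific checks (the transcript's accessibility example, for instance).

```python
# Sketch of a multi-dimensional verification loop: score a draft on several
# checks, feed failures back, and iterate until every dimension passes.
def check_length(text: str) -> bool: return len(text) <= 80
def check_has_link(text: str) -> bool: return "http" in text
def check_no_filler(text: str) -> bool: return "amazing" not in text.lower()

CHECKS = {"length": check_length,
          "link": check_has_link,
          "no_filler": check_no_filler}

def mock_agent(prompt: str, failed_checks: set) -> str:
    # Stand-in for a model that revises its draft based on failed checks.
    text = "This amazing tool is great."
    if "no_filler" in failed_checks:
        text = text.replace("amazing ", "")
    if "link" in failed_checks:
        text += " Details: https://example.com"
    return text

def verify_and_iterate(prompt: str, max_rounds: int = 3) -> str:
    failed: set = set()
    for _ in range(max_rounds):
        draft = mock_agent(prompt, failed)
        failed = {name for name, check in CHECKS.items() if not check(draft)}
        if not failed:
            return draft  # every dimension passed
    raise RuntimeError(f"still failing: {failed}")

print(verify_and_iterate("Write a short product blurb."))
```

Because the agent must satisfy all checks at once, gaming any single dimension doesn't help, which is the "harder to game" property the transcript highlights.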

What does “the messy middle” refer to, and why does it still hold value?

The messy middle is the layer between raw model outputs and useful user outcomes. The transcript claims it’s underbuilt relative to its value: it transforms messy inputs into structured representations, routes intent, orchestrates tool calls, handles exceptions, and produces useful, domain-specific user interfaces. Even if model makers build strong substrates, the middle layer can deliver outputs users actually want. Cursor is cited as a prominent example, with other startups expanding into this orchestration layer across technical and non-technical domains.
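The messy-middle responsibilities listed above (structuring messy output, routing intent, handling exceptions) can be sketched as a thin layer between raw model text and a handler registry. The intent names, regex, and handlers are hypothetical; a production router would be far richer.

```python
# Sketch of a "messy middle" layer: parse free-form model output into a
# structured intent, route it to a handler, and catch failures.
import re

def parse_intent(raw: str) -> dict:
    # Turn messy text into a structured representation (intent + order id).
    m = re.search(r"(refund|status)\b.*order\s+#?(\d+)", raw, re.I)
    if not m:
        raise ValueError("could not extract intent")
    return {"intent": m.group(1).lower(), "order_id": m.group(2)}

HANDLERS = {
    "refund": lambda order_id: f"refund started for order {order_id}",
    "status": lambda order_id: f"order {order_id} is in transit",
}

def route(raw_model_output: str) -> str:
    try:
        action = parse_intent(raw_model_output)
        return HANDLERS[action["intent"]](action["order_id"])
    except (ValueError, KeyError) as exc:
        # Exception handling path: fall back rather than fail silently.
        return f"escalated to a human: {exc}"

print(route("Sure! The customer wants a refund for order #4821."))
```

The durable value claimed for this layer comes from exactly these mundane parts: the model's output is never trusted as-is, and every unparseable case has a defined fallback.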

How does the transcript connect AI slop to system design and quality control?

AI slop is described as a symptom of unconstrained, unmanaged AI use—especially when companies scale content production without the right controls. The transcript argues that performance and usefulness don’t have to be sacrificed: with the right systems, AI can produce compelling ad flows, email marketing, and content that outperforms what humans can do. It suggests a practical future where AI isn’t announced; instead, the content’s usefulness, information density, and reusability become the real metrics. It also notes AI can help by grounding claims in research and increasingly supporting fact-checking and structure.

Review Questions

  1. Which two “unlock” areas does the transcript identify as most underestimated in 2025, and what concrete capabilities did they add to agentic workflows?
  2. How do verification loops change agent behavior, and what standardization trend does the transcript expect for 2026?
  3. What does the transcript mean by “the messy middle,” and how does it argue that value persists even as model makers compete?

Key Points

  1. Letting LLMs use code as a tool turns natural language into computer control, enabling agentic workflows that non-code users can adopt.
  2. Image generation reaching reliable, detailed outputs unlocked generative UI—interfaces that can evolve with users and context rather than staying fixed to developer layouts.
  3. Progress in 2025 often came from system design (templates, validators, retries, fast iteration) rather than from relying on specialized “AI developers” alone.
  4. Verification loops—multi-dimensional correctness checks feeding agent iteration—made agents faster, less gameable, and more dependable.
  5. The “messy middle” (intent routing, orchestration, exception handling, and domain-specific UI) remains a major value-creation layer that’s still underbuilt.
  6. AI slop is framed as a symptom of unconstrained production; useful, high-density content can be achieved with managed systems and stronger grounding/fact-checking.
  7. Market momentum shifted toward quality lift and customer value, with leaders treating people’s expertise and attention as a precious asset rather than a cost to cut.

Highlights

The most important 2025 unlock was not a new model—it was LLMs using code as a tool, making “plain English” a way to manipulate the computer.
Once images became reliably detailed (infographics, maps, layouts), the path opened to generative UI and continuous, context-aware digital interfaces.
Verification loops function like an engine upgrade for agents: correctness checks across dimensions make iteration both faster and harder to game.
The “messy middle” is where durable value lives—turning messy model outputs into structured actions, routed intent, and usable interfaces.
AI slop is treated as unmanaged production; better systems can generate performant, information-dense marketing and content without sacrificing usefulness.

Topics

  • LLM Code Tools
  • Generative UI
  • Agent Verification Loops
  • Messy Middle Orchestration
  • AI Content Quality
