I Tracked Every AI Win & Failure in 2025. Here's What Actually Worked (9 Surprises)
Based on AI News & Strategy Daily | Nate B Jones's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
Briefing
2025’s biggest unlock wasn’t a new model; it was letting large language models (LLMs) use code as a tool. That shift turned “talk to the model” into “talk to the computer,” enabling agentic workflows that can manipulate files, run tasks, and iterate with far less technical friction. The change built gradually through the year as practical layers emerged: Claude Code, the Model Context Protocol (MCP), “skills,” Codex, and tools like Cursor. By the end of the year, plain English increasingly became a control surface for software work, not just a way to generate text. The implication is clear: once an LLM can operate parts of a computer, the addressable value expands beyond developers to non-code users, because the interface becomes natural language plus execution.
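To make “code as a tool” concrete, here is a minimal sketch of that loop, assuming a generic chat-style model API: the model proposes a shell command, a harness executes it, and the observation would feed the next turn. `call_model` is a hypothetical stand-in, hard-coded so the sketch runs end to end; nothing here is claimed to be how Claude Code or Codex is actually implemented.

```python
# Minimal "code as a tool" loop: the model proposes a command, the
# harness executes it, and the output becomes the next observation.
import subprocess

def call_model(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM API call; hard-coded so the
    # sketch runs end to end.
    return "ls -1"

def run_tool(command: str) -> str:
    # Execute the model-proposed command and capture its output.
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=30
    )
    return result.stdout or result.stderr

def agent_step(task: str) -> str:
    command = call_model(f"Task: {task}\nReply with one shell command.")
    observation = run_tool(command)
    # A real agent would feed this observation into the next prompt and
    # iterate until the task is done.
    return observation

print(agent_step("List the files in the current directory"))
```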
A second major surprise was images finally catching up enough to matter. For much of 2025, text stayed the most reliable way to get accurate results, with code treated as just another text-like language. But when image generation reached the point where detailed infographics, maps, layouts, and slide-like visuals no longer looked “weird,” the payoff went beyond prettier decks. It enabled generative UI: interfaces that aren’t fixed to a developer’s static design or a single screen. The vision described is a continuous, evolving digital surface, with experiences that adapt to context, potentially spanning wearables, phone-plus-laptop combinations, and even generative elements embedded alongside services like Comet. The caveat is equally important: consistency and habit still matter, so the future isn’t a total replacement of familiar interfaces. Still, solving images was framed as a foundational step toward graphical experiences that evolve with users rather than forcing users to adapt to rigid UI.
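The transcript describes generative UI as a vision rather than an implementation, but a speculative sketch clarifies the idea: instead of shipping a fixed layout, the model emits a declarative spec that a renderer interprets at runtime. The spec schema and `call_model` below are invented for illustration.

```python
# Speculative sketch of generative UI: the model returns a declarative
# layout spec, and a renderer interprets it at runtime.
import json

def call_model(context: str) -> str:
    # Hypothetical stand-in for a model call that emits a layout spec.
    return json.dumps({
        "title": "Trip Overview",
        "components": [
            {"type": "text", "value": f"Context: {context}"},
            {"type": "button", "label": "Rebook flight"},
        ],
    })

def render(spec: dict) -> None:
    # Toy text renderer; a real one would target web or native widgets.
    print(f"== {spec['title']} ==")
    for comp in spec["components"]:
        if comp["type"] == "text":
            print(comp["value"])
        elif comp["type"] == "button":
            print(f"[ {comp['label']} ]")

render(json.loads(call_model("flight delayed, user on phone")))
```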
Beyond those two headline shifts, several operational lessons stood out. One was that progress didn’t require “AI developers” in the narrow sense. Individuals who designed systems—using templates, validators, retries, and fast iteration—could outperform larger teams that treated engineering as a model-worship problem. That fed a broader reframe: technical and non-technical categories are less useful than curiosity about domain problems and willingness to learn the AI skills needed to solve them.
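A minimal sketch of that template-validator-retry pattern, assuming a generic LLM API (`call_model` is a hypothetical stand-in), might look like this:

```python
# Template + validator + retry: render a prompt, validate the model's
# output, and feed failures back into the next attempt.
import json

TEMPLATE = (
    "Extract the person's name and age from the text below.\n"
    'Respond with JSON like {{"name": str, "age": int}}.\n'
    "Text: {text}"
)

def call_model(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM API call.
    return '{"name": "Ada", "age": 36}'

def validate(raw: str) -> dict:
    # Reject anything that is not well-formed JSON with the right fields.
    data = json.loads(raw)
    assert isinstance(data["name"], str) and isinstance(data["age"], int)
    return data

def extract(text: str, max_retries: int = 3) -> dict:
    prompt = TEMPLATE.format(text=text)
    for _ in range(max_retries):
        raw = call_model(prompt)
        try:
            return validate(raw)
        except (ValueError, KeyError, AssertionError) as err:
            # Feed the failure back so the next attempt can self-correct.
            prompt = f"{prompt}\nPrevious attempt failed: {err}. Try again."
    raise RuntimeError("no valid output after retries")

print(extract("Ada, 36, is a systems designer."))
```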
Another practical win came from verification loops in agentic systems. Measuring correctness across multiple dimensions and feeding results back into iteration made agents far harder to game and dramatically faster—likened to adding a jet engine to an airplane. The expectation for 2026 is more standardization: shared eval and verification primitives, especially for areas like accessibility, so teams don’t reinvent the same checks.
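The transcript doesn't spell out the mechanics, but a hedged sketch of a verification loop could look like this: each dimension is an independent check, and every failing check becomes explicit feedback for the next iteration. The toy checks stand in for real verifiers such as tests, linters, or accessibility audits.

```python
# Verification loop: score a candidate on several dimensions and
# iterate until every check passes.

def check_length(text: str) -> bool:
    return len(text) <= 280

def check_has_citation(text: str) -> bool:
    return "[source]" in text

CHECKS = {"length": check_length, "citation": check_has_citation}

def generate(feedback: list[str]) -> str:
    # Hypothetical stand-in for a model call; a real agent would
    # condition on the accumulated feedback.
    return "Short summary of the finding. [source]"

def verified_generate(max_iters: int = 5) -> str:
    feedback: list[str] = []
    for _ in range(max_iters):
        candidate = generate(feedback)
        failures = [name for name, check in CHECKS.items() if not check(candidate)]
        if not failures:
            return candidate
        # Each failing dimension becomes explicit feedback for the next pass.
        feedback.append(f"failed checks: {failures}")
    raise RuntimeError("could not satisfy all verification dimensions")

print(verified_generate())
```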
The “messy middle” also emerged as underbuilt but valuable. While attention often fixated on model makers owning the stack, the transcript argues that transforming messy model outputs into structured representations—routing intent, orchestrating tool calls, handling exceptions, and producing usable interfaces—creates durable leverage. Cursor is cited as a prominent example, but the broader point is that intelligence substrates can be built on top of models.
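As a rough illustration of one layer of that middle, here is a toy router under stated assumptions: a keyword classifier stands in for a model-based one, and the handler names are invented. The point is the shape (classify, route, and treat exceptions as first-class outcomes), not the specifics.

```python
# One "messy middle" layer: classify intent, route to a handler, and
# handle exceptions as part of the product, not an afterthought.

def classify_intent(message: str) -> str:
    # Toy keyword router; a production system might use a model here.
    text = message.lower()
    if "refund" in text:
        return "refund"
    if "status" in text or "where is my order" in text:
        return "order_status"
    return "general"

HANDLERS = {
    "refund": lambda msg: {"action": "open_refund_ticket"},
    "order_status": lambda msg: {"action": "lookup_order"},
    "general": lambda msg: {"action": "answer_directly"},
}

def route(message: str) -> dict:
    intent = classify_intent(message)
    try:
        return HANDLERS[intent](message)
    except Exception as err:
        # Unexpected failures escalate instead of silently breaking.
        return {"action": "escalate_to_human", "error": str(err)}

print(route("Where is my order?"))  # {'action': 'lookup_order'}
```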
Finally, the year’s momentum shifted from hype to quality. AI slop was treated as a symptom of unmanaged, unconstrained production; better systems can generate performant, information-dense marketing and content without sacrificing usefulness. Meanwhile, labor-market signals suggested faster selection for creative problem-solving instincts, and leaders increasingly moved from cost-cutting toward quality lift—scaling customer value while keeping people focused on what only humans can do. The closing question: what exceeded expectations in 2025?
Cornell Notes
The transcript’s core claim is that 2025’s real progress came from practical unlocks, especially letting LLMs use code as a tool and finally making images reliable enough for complex visuals. Together, these changes expanded what agents can do: code access turns natural language into computer control, while stronger image generation enables generative UI that adapts beyond fixed screens. The transcript also highlights execution lessons: verification loops make agents faster and more reliable, and the “messy middle” (routing, orchestration, exception handling, and UI) remains a major value-creation layer. Progress didn’t depend solely on AI specialists; system design and fast iteration by domain-focused builders often outperformed model-centric approaches. The year’s direction points toward standardized evaluation, higher-quality deployments, and content that earns attention through usefulness rather than volume.
- Why does “LLMs using code as a tool” matter more than incremental model improvements?
- What changed about images in 2025, and why did that enable new interface possibilities?
- How did the transcript challenge the idea that “AI developers are everything”?
- What are verification loops, and why are they portrayed as a game-changer for agents?
- What does “the messy middle” refer to, and why does it still hold value?
- How does the transcript connect AI slop to system design and quality control?
Review Questions
- Which two “unlock” areas does the transcript identify as most underestimated in 2025, and what concrete capabilities did they add to agentic workflows?
- How do verification loops change agent behavior, and what standardization trend does the transcript expect for 2026?
- What does the transcript mean by “the messy middle,” and how does it argue that value persists even as model makers compete?
Key Points
1. Letting LLMs use code as a tool turns natural language into computer control, enabling agentic workflows that non-code users can adopt.
2. Image generation reaching reliable, detailed outputs unlocked generative UI: interfaces that can evolve with users and context rather than staying fixed to developer layouts.
3. Progress in 2025 often came from system design (templates, validators, retries, fast iteration) rather than from relying on specialized “AI developers” alone.
4. Verification loops (multi-dimensional correctness checks feeding agent iteration) made agents faster, less gameable, and more dependable.
5. The “messy middle” (intent routing, orchestration, exception handling, and domain-specific UI) remains a major value-creation layer that’s still underbuilt.
6. AI slop is framed as a symptom of unconstrained production; useful, high-density content can be achieved with managed systems and stronger grounding and fact-checking.
7. Market momentum shifted toward quality lift and customer value, with leaders treating people’s expertise and attention as a precious asset rather than a cost to cut.