
The 4 Big Changes in LLMs

Sam Witteveen · 6 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Assume model capability will keep improving and design LLM apps so prompts, tool logic, and chains can be updated quickly as new models land.

Briefing

LLMs are improving on multiple fronts at once—smarter reasoning, faster token generation, cheaper inference, and ever-larger context—and product teams that plan for those shifts now will be able to ship better, faster, and more profitable systems. The biggest strategic mistake is treating model progress as something to work around rather than something to build on: many startups design for a world where models won’t meaningfully improve, even though the trajectory suggests steady gains in capability and efficiency.

A central theme is that “models are getting smarter,” but the response should be operational, not just aspirational. Instead of locking product logic to today’s model behavior, teams should assume future models will be stronger and design so they can adjust quickly—especially as new foundation models arrive with different cost profiles. The practical tension is that products must still work on current models, because the next generation may be more expensive if it relies on heavier compute. At the same time, there are already signs that capability gains can come without cost increases, such as Anthropic’s Claude 3.5 Sonnet and newer Gemini 1.5 variants, which point to a future where smarter models arrive alongside similar or improved economics.

Three other forces are accelerating model capability. First is synthetic data: newer training pipelines increasingly use synthetic instruction data for instruction fine-tuning and alignment, letting teams generate higher-quality examples in the exact formats they want (including structured reasoning patterns). That synthetic training can also help models unlock more of what they learned during pretraining. Second is multimodality, which strengthens grounding by letting models operate across different input modes rather than text alone. Third is the rise of faster token generation—driven by better GPU/TPU serving techniques—making it feasible to run multiple model calls per user request without unacceptable latency.

Faster tokens change product design in concrete ways. With slower models, low-latency products often can’t afford repeated calls; with faster models, teams can use techniques like polling (multiple generations followed by majority vote), reflection (having the model re-check or revise its own outputs), and verification steps that validate tool outputs. Prompt and query rewriting also becomes more practical on the fly, because the extra iterations don’t crush response time.
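To make the polling idea concrete, here is a minimal sketch in Python. It is not from the video: `call_llm` is a hypothetical placeholder for whatever provider SDK you use, and majority voting works best when answers are short or constrained enough to compare exactly.

```python
from collections import Counter


def call_llm(prompt: str, temperature: float = 0.7) -> str:
    """Placeholder for your provider's completion call (OpenAI, Anthropic, Gemini, ...)."""
    raise NotImplementedError


def poll_majority(prompt: str, n: int = 5) -> str:
    """Run the same prompt n times and return the most common answer.

    Only practical when token generation is fast and cheap enough that
    n calls still fit inside the product's latency budget.
    """
    answers = [call_llm(prompt).strip() for _ in range(n)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner
```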

Cost trends reinforce the shift. Token prices are falling quickly—a rough consensus among Bay Area practitioners put the drop at roughly one-seventh to one-eighth of early-year costs by year-end—making it possible to get “expensive-model” quality from cheaper, faster options. Examples mentioned include GPT-4o’s price reductions and the growing competitiveness of models like Haiku, Flash, and Claude 3.5 Sonnet.

Finally, context windows are moving toward “infinite” scale. RAG isn’t dead, but it’s changing: systems increasingly ingest 30,000–50,000 tokens at a time, which alters embedding and chunking strategies. In-context learning can also reduce the need for fine-tuning when long contexts plus context caching allow dozens of examples to be included in a prompt prefix. Teams should also dynamically select which in-context examples to include based on the user query, an approach supported by research such as DeepMind’s “many-shot in-context learning” work. The overall takeaway: build LLM apps with abstraction and modularity—so prompts, chunking/embeddings, and in-context example sets can be swapped quickly—while planning for how these shifts will reshape both quality and margins over the next six months.

Cornell Notes

LLM progress is accelerating in ways that directly affect product strategy: models are getting smarter, tokens are generating faster, token prices are dropping, and context windows are expanding dramatically. Teams that design for “today’s model only” risk being outpaced by competitors who can iterate quickly as new models arrive. Faster and cheaper inference makes multi-call techniques—polling, reflection, verification, and on-the-fly prompt/query rewriting—practical for improving quality without unacceptable latency. Meanwhile, longer context and context caching enable more in-context learning (sometimes replacing fine-tuning) and require rethinking RAG chunking/embedding pipelines. The business implication is that modular app design and data pipeline flexibility become competitive advantages, not optional engineering polish.

Why does “models are getting smarter” change startup strategy, not just model selection?

The key strategic shift is betting on continued model improvement rather than building around the assumption that capability will stay flat. Many startups build “horizontal” layers on top of a fixed model, but the trajectory implies that future foundation models will be meaningfully better. That means product logic should be modular and prompt/tooling should be easy to update, so the app can benefit from stronger models as they arrive. At the same time, the app must still work on current models because newer models may initially cost more if they use heavier compute.
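One way to keep that flexibility is to treat the model name, prompt template, and tool set as swappable configuration rather than hard-coded logic. The sketch below is illustrative (the class and field names are assumptions, not from the video):

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LLMAppConfig:
    """Bundle everything that tends to change when a new model lands."""
    model_name: str
    prompt_template: str                     # easy to re-tune per model
    tools: dict[str, Callable] = field(default_factory=dict)
    max_context_tokens: int = 8_000          # raise when longer-context models arrive


def build_prompt(cfg: LLMAppConfig, user_query: str) -> str:
    return cfg.prompt_template.format(query=user_query)


# Swapping models is then a config change, not a rewrite:
current = LLMAppConfig(model_name="model-a",
                       prompt_template="Answer concisely:\n{query}")
upgraded = LLMAppConfig(model_name="model-b",
                        prompt_template="Think step by step, then answer:\n{query}",
                        max_context_tokens=128_000)
```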

How does synthetic data make models stronger in practice?

Synthetic data boosts training quality by letting teams generate instruction fine-tuning and alignment examples in the exact formats they want. Instead of relying only on naturally collected data, synthetic pipelines can produce structured training sets—such as data formatted for chain-of-thought-style reasoning or other tightly controlled formats. Fine-tuning on that synthetic instruction data can also help the model surface and apply knowledge it already acquired during pretraining.
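As a rough illustration of the idea (a sketch, not the author’s pipeline), a synthetic-data generator might use a strong “teacher” model to produce examples in a fixed JSON schema and keep only the ones that match it. `call_llm` is again a hypothetical placeholder for a provider call:

```python
import json


def call_llm(prompt: str) -> str:
    """Placeholder for a strong 'teacher' model used to generate training data."""
    raise NotImplementedError


GEN_TEMPLATE = (
    "Write one instruction-following training example about {topic} as JSON with keys "
    '"instruction", "reasoning" (step-by-step), and "answer".'
)


def generate_examples(topics: list[str]) -> list[dict]:
    examples = []
    for topic in topics:
        raw = call_llm(GEN_TEMPLATE.format(topic=topic))
        try:
            ex = json.loads(raw)
        except json.JSONDecodeError:
            continue  # drop malformed generations
        if {"instruction", "reasoning", "answer"} <= ex.keys():
            examples.append(ex)  # keep only examples in the exact format we want
    return examples
```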

What does faster token generation enable that slower models couldn’t support?

Faster tokens make multi-call workflows feasible. Rather than one generation per request, products can poll the model multiple times and take a majority vote, use reflection/reflexion to re-check outputs (including reflecting on tool results), and add verification steps that validate intermediate outputs. Prompt and query rewriting also becomes practical during a single user interaction, since extra iterations won’t blow up latency as they would with slower models.
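A sketch of the reflection and verification patterns, again with a hypothetical `call_llm` placeholder: one pass drafts an answer, a second pass critiques it, and a cheap programmatic check validates a tool’s output before it is trusted.

```python
import json


def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for a provider completion call


def answer_with_reflection(question: str) -> str:
    """Draft, critique, and (if needed) revise — two to three fast calls per request."""
    draft = call_llm(f"Answer the question:\n{question}")
    critique = call_llm(
        f"Question: {question}\nDraft answer: {draft}\n"
        "List any factual or logical errors in the draft, or reply 'OK'."
    )
    if critique.strip() == "OK":
        return draft
    return call_llm(
        f"Question: {question}\nDraft: {draft}\nIssues found: {critique}\n"
        "Write a corrected answer."
    )


def tool_output_is_valid(tool_output: str, expected_keys: set[str]) -> bool:
    """Trivial verification step: check a tool's JSON output has the fields we need."""
    try:
        data = json.loads(tool_output)
    except json.JSONDecodeError:
        return False
    return expected_keys <= data.keys()
```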

Why are falling token prices and cheaper “small/fast” models a big deal for product economics?

Lower token costs change what quality tier a product can afford. The transcript cites a rough consensus that token prices could drop to about one-seventh or one-eighth of early-year levels by year-end, aligning with examples like GPT-4o’s reduced token costs. As a result, capabilities that previously required expensive models can increasingly be achieved with faster, cheaper options such as Haiku, Flash, and Claude 3.5 Sonnet—improving margins and enabling more experimentation.
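To see what a roughly 7–8x price drop means for a multi-call workflow, a back-of-the-envelope calculation (the per-million-token prices below are purely illustrative, not quotes from any provider):

```python
def cost_per_request(input_toks: int, output_toks: int,
                     in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost of one request given per-million-token prices."""
    return input_toks / 1e6 * in_price_per_m + output_toks / 1e6 * out_price_per_m


# Illustrative only: a workflow making 5 calls of ~2k input / 500 output tokens each.
calls = 5
old = calls * cost_per_request(2_000, 500, in_price_per_m=30.0, out_price_per_m=60.0)
new = calls * cost_per_request(2_000, 500, in_price_per_m=4.0, out_price_per_m=8.0)
print(f"old: ${old:.3f}  new: ${new:.3f}  ratio: {old / new:.1f}x")
# old: $0.450  new: $0.060  ratio: 7.5x
```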

How should RAG change when context windows expand to tens of thousands of tokens?

RAG remains important, but the architecture shifts. Instead of forcing everything through small context limits, systems can ingest 30,000–50,000 tokens at a time, which affects chunking and embedding strategies. Teams may store raw data and quickly generate multiple chunking/embedding variants to test different RAG configurations. Longer context also makes in-context learning more viable, sometimes reducing the need for fine-tuning when enough examples can fit into the prompt prefix.
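A minimal sketch of that pattern: keep the raw documents and regenerate several chunking variants on demand so different chunk sizes and overlaps can be tested as context windows grow. The configuration names and the character-based splitter are assumptions for illustration; a real pipeline would chunk by tokens and then embed each variant.

```python
def chunk(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into overlapping character-based chunks (token-based in practice)."""
    step = max(chunk_size - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def build_variants(raw_docs: list[str]) -> dict[str, list[str]]:
    """Regenerate several chunking configurations from the same raw data."""
    configs = {"small": (500, 50), "medium": (2_000, 200), "long-context": (8_000, 400)}
    return {
        name: [c for doc in raw_docs for c in chunk(doc, size, overlap)]
        for name, (size, overlap) in configs.items()
    }
```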

When can in-context learning replace fine-tuning, and what role does context caching play?

If a long context window can hold many high-quality in-context examples, the model can learn the task from those examples during inference. Context caching then reduces the cost and latency of repeatedly sending the same long prefix, making it practical to include dozens of examples for each request. The transcript also highlights dynamic selection of examples based on the user query, supported by research on many-shot in-context learning (e.g., DeepMind’s work), which can improve benchmark results and translate to real-world tasks, including agent setups.
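A sketch of query-dependent example selection over a reusable prefix: the static instructions plus the fixed many-shot pool form a prefix that is identical across requests (and therefore a natural candidate for provider-side context caching), while a few extra examples are picked per query by a similarity score. The word-overlap similarity here is deliberately naive; a real system would use embedding cosine similarity.

```python
def similarity(a: str, b: str) -> float:
    """Naive word-overlap similarity; swap in embedding cosine similarity in practice."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)


def build_prompt(static_prefix: str, example_pool: list[dict],
                 query: str, k: int = 3) -> str:
    """static_prefix holds the instructions and the fixed many-shot examples;
    because it never changes between requests, it is the part worth caching."""
    chosen = sorted(example_pool,
                    key=lambda ex: similarity(ex["input"], query),
                    reverse=True)[:k]
    dynamic = "\n\n".join(f"Input: {ex['input']}\nOutput: {ex['output']}" for ex in chosen)
    return f"{static_prefix}\n\n{dynamic}\n\nInput: {query}\nOutput:"
```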

Review Questions

  1. What design choices would make an LLM app resilient to rapid model improvements over the next year?
  2. How do polling, reflection/reflexion, and verification differ from single-pass prompting, and why do faster tokens make them practical?
  3. In what ways do long context windows and context caching reduce the need for fine-tuning, and what new engineering work do they create for RAG systems?

Key Points

  1. Assume model capability will keep improving and design LLM apps so prompts, tool logic, and chains can be updated quickly as new models land.
  2. Keep products compatible with current models because next-generation models may start more expensive even when they’re smarter.
  3. Use synthetic data strategically for instruction fine-tuning and alignment to generate high-quality, format-controlled training examples.
  4. Exploit faster token generation to add multi-call quality controls like polling (majority vote), reflection/reflexion, and verification steps.
  5. Treat falling token prices as a margin and feature-enablement lever, not just a cost reduction.
  6. Plan for “infinite” or very large context windows by rethinking RAG chunking/embedding and by using in-context learning with context caching.
  7. Build modular data pipelines that can rapidly regenerate chunking/embedding variants and dynamically select in-context examples per query.

Highlights

The most important product shift is betting on ongoing model improvement: apps should be modular enough to benefit from stronger future models rather than hard-coding behavior for today’s ones.
Faster tokens make multi-call strategies practical—polling, reflection/reflexion, and verification can raise quality without the latency penalty that older models imposed.
Synthetic data is becoming a primary lever for instruction fine-tuning and alignment because it can generate high-quality training examples in the exact formats needed.
Long context plus context caching can make in-context learning competitive with fine-tuning, while RAG evolves toward ingesting tens of thousands of tokens per request.
