The 4 Big Changes in LLMs
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Assume model capability will keep improving and design LLM apps so prompts, tool logic, and chains can be updated quickly as new models land.
Briefing
LLMs are improving on multiple fronts at once—smarter reasoning, faster token generation, cheaper inference, and ever-larger context—and product teams that plan for those shifts now will be able to ship better, faster, and more profitable systems. The biggest strategic mistake is treating model progress as something to work around rather than something to build on: many startups design for a world where models won’t meaningfully improve, even though the trajectory suggests steady gains in capability and efficiency.
A central theme is that “models are getting smarter,” but the response should be operational, not just aspirational. Instead of locking product logic to today’s model behavior, teams should assume future models will be stronger and design so they can adjust quickly—especially as new foundation models arrive with different cost profiles. The practical tension is that products must still work on current models, because the next generation may be more expensive if it relies on heavier compute. At the same time, there are already signs that capability gains can come without cost increases, such as Anthropic’s Claude 3.5 Sonnet and newer Gemini 1.5 variants, which point to a future where smarter models arrive alongside similar or improved economics.
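To make the “design for change” point concrete, here is a minimal sketch of what a model-agnostic layer could look like. Everything in it (the `ModelConfig` dataclass, the `LLMApp` class, the `complete` callable) is hypothetical scaffolding introduced for illustration, not an API from the video; the point is only that model choice, prompts, and cost assumptions live in configuration rather than product code.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ModelConfig:
    name: str                   # e.g. "claude-3-5-sonnet" or "gemini-1.5-flash"
    max_output_tokens: int
    cost_per_1k_tokens: float

# Stand-in signature for whatever provider SDK is used: (prompt, config) -> text.
CompleteFn = Callable[[str, ModelConfig], str]

class LLMApp:
    """Keeps prompts and model choice behind one seam so they can be swapped."""

    def __init__(self, complete: CompleteFn, config: ModelConfig, prompt_template: str):
        self.complete = complete
        self.config = config
        self.prompt_template = prompt_template

    def run(self, **fields) -> str:
        # Prompts live in templates, not hard-coded strings, so they can be
        # re-tuned when a newer model behaves differently.
        prompt = self.prompt_template.format(**fields)
        return self.complete(prompt, self.config)

# Swapping in a new model is then a config change, not a product rewrite:
# app = LLMApp(my_provider_call, ModelConfig("gemini-1.5-flash", 1024, 0.0001), SUMMARY_PROMPT)
```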
Three other forces are accelerating model capability. First is synthetic data: newer training pipelines increasingly use synthetic instruction data for instruction fine-tuning and alignment, letting teams generate higher-quality examples in the exact formats they want (including structured reasoning patterns). That synthetic training can also help models unlock more of what they learned during pretraining. Second is multimodality, which strengthens grounding by letting models operate across different input modes rather than text alone. Third is the rise of faster token generation—driven by better GPU/TPU serving techniques—making it feasible to run multiple model calls per user request without unacceptable latency.
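As a rough illustration of the synthetic-data idea, the sketch below has a “teacher” model write instruction/response pairs in a fixed JSON format and drops malformed generations before they reach a fine-tuning set. The `call_model` helper, seed topics, and prompt template are assumptions made for this example, not the pipeline described in the video.

```python
import json

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; replace with your provider SDK."""
    raise NotImplementedError

SEED_TOPICS = ["unit conversion", "SQL debugging", "email summarization"]

GEN_TEMPLATE = (
    "Write one instruction and an ideal answer about {topic}. "
    "Respond as JSON with keys 'instruction' and 'response', and make the "
    "response show its reasoning step by step before the final answer."
)

def generate_examples() -> list[dict]:
    examples = []
    for topic in SEED_TOPICS:
        raw = call_model(GEN_TEMPLATE.format(topic=topic))
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # drop malformed generations; quality filtering is the point
        if isinstance(record, dict) and {"instruction", "response"} <= record.keys():
            examples.append(record)
    return examples
```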
Faster tokens change product design in concrete ways. With slower models, low-latency products often can’t afford repeated calls; with faster models, teams can use techniques like polling (multiple generations followed by majority vote), reflection (having the model re-check or revise its own outputs), and verification steps that validate tool outputs. Prompt and query rewriting also becomes more practical on the fly, because the extra iterations don’t crush response time.
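Here is a minimal sketch of two of those patterns, polling with a majority vote and a single reflection pass. `call_model` is again a hypothetical stand-in for whatever provider SDK you use; a real implementation would sample with nonzero temperature and normalize answers before voting.

```python
from collections import Counter

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; replace with your provider SDK."""
    raise NotImplementedError

def poll(prompt: str, n: int = 5) -> str:
    """Sample the same prompt n times and return the most common answer."""
    answers = [call_model(prompt).strip() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def reflect(prompt: str) -> str:
    """Draft an answer, then ask the model to check and revise its own output."""
    draft = call_model(prompt)
    critique_prompt = (
        f"Question:\n{prompt}\n\nDraft answer:\n{draft}\n\n"
        "Check the draft for errors and return a corrected final answer."
    )
    return call_model(critique_prompt)
```

Both patterns multiply calls per request, which is why they only become practical once per-token latency drops.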
Cost trends reinforce the shift. Token prices are falling quickly; the Bay Area consensus cited in the video puts year-end prices at roughly one-seventh to one-eighth of early-year levels, which makes it possible to get “expensive-model” quality from cheaper, faster options. Examples mentioned include GPT-4o’s price reductions and the growing competitiveness of models like Haiku, Flash, and Claude 3.5 Sonnet.
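A back-of-the-envelope calculation shows why this matters for multi-call designs. The token counts and prices below are illustrative placeholders (only the roughly one-seventh drop comes from the video): at the lower price, a five-call request can cost less than a single call did at the old price.

```python
TOKENS_PER_CALL = 2_000                    # prompt + completion, assumed
CALLS_PER_REQUEST = 5                      # e.g. polling with 5 samples
OLD_PRICE_PER_1K = 0.010                   # hypothetical early-year price per 1K tokens
NEW_PRICE_PER_1K = OLD_PRICE_PER_1K / 7    # the roughly one-seventh drop cited in the talk

def cost_per_request(price_per_1k: float, calls: int) -> float:
    return TOKENS_PER_CALL * calls * price_per_1k / 1_000

print(cost_per_request(OLD_PRICE_PER_1K, 1))                  # ~0.020 per single-call request
print(cost_per_request(OLD_PRICE_PER_1K, CALLS_PER_REQUEST))  # ~0.100 for 5 calls at old prices
print(cost_per_request(NEW_PRICE_PER_1K, CALLS_PER_REQUEST))  # ~0.014: 5-call quality, cheaper than one old call
```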
Finally, context windows are moving toward “infinite” scale. RAG isn’t dead, but it’s changing: systems increasingly ingest 30,000–50,000 tokens at a time, which alters embedding and chunking strategies. In-context learning can also reduce the need for fine-tuning when long contexts plus context caching allow dozens of examples to be included in a prompt prefix. Teams should also dynamically select which in-context examples to include based on the user query, an approach supported by research such as DeepMind’s “many-shot in-context learning” work. The overall takeaway: build LLM apps with abstraction and modularity—so prompts, chunking/embeddings, and in-context example sets can be swapped quickly—while planning for how these shifts will reshape both quality and margins over the next six months.
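To illustrate dynamic example selection, the sketch below ranks a pool of worked examples by embedding similarity to the query and builds a many-shot prompt from the top matches. The `embed` helper, the example pool format, and the default of 20 shots are assumptions for this sketch, not details from the video.

```python
import math

def embed(text: str) -> list[float]:
    """Hypothetical stand-in for a real embedding call; replace with your provider SDK."""
    raise NotImplementedError

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_examples(query: str, pool: list[dict], k: int = 20) -> list[dict]:
    """Return the k pool examples whose inputs are most similar to the query."""
    q = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, embed(ex["input"])), reverse=True)
    return ranked[:k]

def build_prompt(query: str, pool: list[dict]) -> str:
    shots = select_examples(query, pool)
    shot_text = "\n\n".join(
        f"Input: {ex['input']}\nOutput: {ex['output']}" for ex in shots
    )
    # The instruction-plus-examples prefix comes first; if the same selection
    # recurs across queries (e.g. per topic), it is a natural candidate for
    # context caching.
    prefix = f"Follow the examples.\n\n{shot_text}"
    return f"{prefix}\n\nInput: {query}\nOutput:"
```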
Cornell Notes
LLM progress is accelerating in ways that directly affect product strategy: models are getting smarter, tokens are generating faster, token prices are dropping, and context windows are expanding dramatically. Teams that design for “today’s model only” risk being outpaced by competitors who can iterate quickly as new models arrive. Faster and cheaper inference makes multi-call techniques—polling, reflection, verification, and on-the-fly prompt/query rewriting—practical for improving quality without unacceptable latency. Meanwhile, longer context and context caching enable more in-context learning (sometimes replacing fine-tuning) and require rethinking RAG chunking/embedding pipelines. The business implication is that modular app design and data pipeline flexibility become competitive advantages, not optional engineering polish.
- Why does “models are getting smarter” change startup strategy, not just model selection?
- How does synthetic data make models stronger in practice?
- What does faster token generation enable that slower models couldn’t support?
- Why are falling token prices and cheaper “small/fast” models a big deal for product economics?
- How should RAG change when context windows expand to tens of thousands of tokens?
- When can in-context learning replace fine-tuning, and what role does context caching play?
Review Questions
- What design choices would make an LLM app resilient to rapid model improvements over the next year?
- How do polling, reflection/reflexion, and verification differ from single-pass prompting, and why do faster tokens make them practical?
- In what ways do long context windows and context caching reduce the need for fine-tuning, and what new engineering work do they create for RAG systems?
Key Points
1. Assume model capability will keep improving and design LLM apps so prompts, tool logic, and chains can be updated quickly as new models land.
2. Keep products compatible with current models because next-generation models may start more expensive even when they’re smarter.
3. Use synthetic data strategically for instruction fine-tuning and alignment to generate high-quality, format-controlled training examples.
4. Exploit faster token generation to add multi-call quality controls like polling (majority vote), reflection/reflexion, and verification steps.
5. Treat falling token prices as a margin and feature-enablement lever, not just a cost reduction.
6. Plan for “infinite” or very large context windows by rethinking RAG chunking/embedding and by using in-context learning with context caching.
7. Build modular data pipelines that can rapidly regenerate chunking/embedding variants and dynamically select in-context examples per query.