Did AI Just Get Commoditized? Gemini 2.5, New DeepSeek V3, & Microsoft vs OpenAI

AI Explained · 6 min read

Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Gemini 2.5 Pro is portrayed as a top performer across knowledge, science, and math, but the broader implication is convergence rather than a single unbeatable breakthrough.

Briefing

Gemini 2.5 Pro and DeepSeek V3 arrive with a clear message for the AI market: top-tier language-model performance is converging across companies, which makes any “secret sauce” matter less than the compute and engineering effort behind each system. The practical takeaway is that model leadership is becoming harder to sustain even as models keep improving, because multiple labs are moving toward similar capability levels on the same kinds of tasks.

Google’s Gemini 2.5 Pro is positioned as its “most intelligent” model, and the transcript leans on a mix of benchmark categories to argue that it’s not just another incremental release. In knowledge-heavy testing on Humanity’s Last Exam, a benchmark built from obscure trivia, Latin translations, and specialized science, Gemini 2.5 Pro is portrayed as leading for “knowledge without searching the web.” The comparison also notes that OpenAI is represented by o3-mini rather than the full o3, which is expected to score higher once released, and that reporting differences (like majority voting and tool use) can make direct comparisons tricky.

The same convergence theme shows up in science and math. Gemini 2.5 Pro is described as roughly level with OpenAI’s o3-mini and close to other frontier models when extended reasoning is allowed. The transcript also flags a key structural issue in benchmarking: different labs use different evaluation methods, including whether they apply majority voting or allow extra compute, which can widen or narrow gaps depending on the rules.
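
To make the majority-voting point concrete, here is a minimal sketch of the technique (often called self-consistency): sample the model several times and report the most common answer. The `ask_model` function below is a hypothetical stand-in for any sampling API, not a real library call.

```python
import random
from collections import Counter

def ask_model(question: str) -> str:
    """Hypothetical stand-in for one sampled model answer (temperature > 0)."""
    # Pretend the model answers correctly 60% of the time.
    return "42" if random.random() < 0.6 else "41"

def majority_vote(question: str, n_samples: int = 32) -> str:
    """Sample the model n_samples times and return the most common answer.

    This is why evaluation rules matter: a maj@32 score spends 32x the
    inference compute of a single-shot (pass@1) score and usually lands
    higher, so mixing the two settings makes models hard to compare.
    """
    answers = [ask_model(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(majority_vote("What is 6 x 7?"))  # almost always "42" at n_samples=32
```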

Where Gemini 2.5 Pro stands out most sharply is multimodal and long-context capability. It’s described as state-of-the-art at reading tables and charts on the MMMU benchmark, and the transcript claims it’s the first model to get “within touching distance” of human performance there, with the caveat that the human baseline allows browsing and unlimited time. On long context, Gemini 2.5 Pro is said to handle up to a million tokens, far beyond the other models in the cited chart.
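
For a sense of scale, here is a back-of-envelope estimate of what a million-token window holds, assuming the common rule of thumb of roughly four characters per token (the exact ratio depends on the tokenizer and the text):

```python
# Back-of-envelope: what fits in a 1,000,000-token context window?
# Assumptions (rules of thumb, not tokenizer measurements):
CHARS_PER_TOKEN = 4    # typical for English prose
CHARS_PER_WORD = 5     # average word length plus the trailing space
WORDS_PER_PAGE = 500   # dense single-spaced page

context_tokens = 1_000_000
approx_words = context_tokens * CHARS_PER_TOKEN / CHARS_PER_WORD
approx_pages = approx_words / WORDS_PER_PAGE

print(f"~{approx_words:,.0f} words, ~{approx_pages:,.0f} pages")
# -> ~800,000 words, ~1,600 pages: several long novels in one prompt
```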

DeepSeek V3 reinforces the same story from a different angle. The transcript distinguishes DeepSeek V3 as a new base model (not the separate reasoning model R2), and frames it as comparable to OpenAI’s GPT-4.5 as a base layer for future reasoning systems. In the benchmark comparisons, DeepSeek V3 is portrayed as notably stronger in mathematics and competitive in coding, while being closer to OpenAI on science and general knowledge. The broader claim: if base models are converging, then reasoning models also lack a clear moat.

That leads into the “commoditization” argument tied to Microsoft’s leadership. The transcript references the Microsoft CEO’s claim that models are being commoditized: performance is increasingly bought like a commodity through compute and scaling, with labs selling an experience rather than a unique path to AGI. A Microsoft internal unit (Microsoft AI) is described as having reverse-engineered or replicated “reasoning-like” behavior (the “think before answering” pattern associated with systems such as Gemini and DeepSeek’s R1), with Microsoft claiming near-parity on benchmarks with leading OpenAI and Anthropic models.
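
The surface mechanics of “think before answering” are simple to imitate, which is part of why the transcript treats the pattern as reproducible. A minimal sketch of the idea follows; the `complete` function is a hypothetical stand-in for any chat-completion API, and real reasoning models train this behavior in rather than merely prompting for it.

```python
import re

SYSTEM_PROMPT = (
    "First reason step by step inside <think>...</think> tags, "
    "then give only the final answer after the closing tag."
)

def complete(system: str, user: str) -> str:
    """Hypothetical stand-in for a chat-completion call; returns a canned reply."""
    return "<think>6 x 7: six sevens are 42.</think> 42"

def answer(question: str) -> str:
    raw = complete(SYSTEM_PROMPT, question)
    # Strip the reasoning trace and surface only the final answer,
    # mirroring how reasoning UIs separate "thinking" from the reply.
    return re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()

print(answer("What is 6 x 7?"))  # -> "42"
```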

Finally, the transcript adds a reality check on the hype cycle: even as predictions grow that AI will write nearly all code, it points to ongoing hiring at the labs and a playful example of a model getting stuck in a simple game scenario. The overall conclusion: Gemini 2.5 Pro looks impressive, but it reads less like a singular breakthrough and more like evidence that frontier capabilities are converging across model families, making differentiation increasingly about deployment, cost, and product execution rather than undisclosed intelligence secrets.

Cornell Notes

Gemini 2.5 Pro and DeepSeek V3 are presented as proof that frontier language-model performance is converging across major labs. Gemini 2.5 Pro is highlighted for knowledge-heavy tasks, strong science/math performance, state-of-the-art table/chart understanding (MMMU), and very long context (up to a million tokens). DeepSeek V3 is framed as a new base model that is competitive with OpenAI’s GPT-4.5 base layer, especially in mathematics and coding. The transcript ties these benchmark patterns to a broader “commoditization” thesis: if multiple teams can reach similar capability levels, the differentiator shifts toward compute scale, engineering, and productization rather than a single undisclosed route to AGI.

What does Gemini 2.5 Pro’s benchmark performance suggest about “secret sauce” versus scaling?

The transcript uses multiple benchmark categories to argue that Gemini 2.5 Pro is strong across knowledge, science, and math, while also showing standout capability in multimodal table/chart understanding (MMMU) and very long context (up to a million tokens). At the same time, it repeatedly notes that other top models converge on similar levels when compute and evaluation rules are aligned, and that benchmark comparisons can be distorted by factors like majority voting and tool use. The implication is that leadership is harder to keep because multiple labs are reaching comparable capability bands, not because one lab has a uniquely uncopyable intelligence method.

Why does the transcript emphasize differences in benchmark methodology (e.g., majority voting, tools, compute)?

Direct comparisons are described as increasingly difficult because labs don’t always report the same conditions. Some systems use majority voting, which can improve scores by spending extra compute; others don’t. Some benchmarks allow tools (web access or external assistance), while others test “without searching the web.” The transcript also notes that some reported results may reflect smaller variants like o3-mini rather than OpenAI’s full o3. Those differences can make gaps look larger or smaller than they would under a standardized evaluation.

What are Gemini 2.5 Pro’s most distinctive strengths in the transcript?

Two stand out. First, MMMU table/chart understanding: Gemini 2.5 Pro is described as state-of-the-art and the first model to get close to human performance, even though humans are allowed to browse and take their time. Second, long context: Gemini 2.5 Pro is said to handle up to a million tokens, while the other models in the cited chart handle far less, topping out at roughly a quarter of that.

How does DeepSeek V3 fit into the “base model” and “reasoning model” distinction?

DeepSeek V3 is described as a new base model, not the separate reasoning model R2. The transcript compares this structure to OpenAI’s lineup: a base model (like GPT-4.5) supports later reasoning systems. It also frames DeepSeek V3 as the likely base for the upcoming R2 reasoning model, and compares DeepSeek V3’s benchmark behavior against OpenAI’s GPT-4.5 base layer.

What evidence is used to support the claim that reasoning models lack a clear moat?

The transcript argues that if base models are on par, then reasoning systems built on top of them also converge. It cites Microsoft’s internal claims that its Microsoft AI unit has replicated “think before answering” reasoning behavior and that its models perform nearly as well as leading OpenAI and Anthropic models on benchmarks. Combined with the Gemini and DeepSeek reasoning trend, this is used to suggest that the reasoning advantage is becoming reproducible rather than uniquely protected.

How does the transcript challenge the hype around AI writing most code?

It references an Anthropic CEO prediction that AI could write 90% of code within 3–6 months and essentially all code within 12 months. The transcript then points out a mismatch: Anthropic is still advertising software engineering roles with substantial salaries. It also uses a playful example of a model getting stuck in a simple Pokémon game scenario, implying that real-world competence still has gaps even when benchmarks look strong.

Review Questions

  1. In which two benchmark areas does the transcript say Gemini 2.5 Pro most clearly separates from competitors, and what specific capabilities do they test?
  2. How do majority voting, tool use, and model-size differences (like using o3-mini vs the full o3) complicate benchmark comparisons?
  3. What does the transcript mean by “base model” versus “reasoning model,” and how does that framing connect Gemini/DeepSeek/Microsoft to the commoditization thesis?

Key Points

  1. Gemini 2.5 Pro is portrayed as a top performer across knowledge, science, and math, but the broader implication is convergence rather than a single unbeatable breakthrough.

  2. Benchmark comparisons are increasingly unreliable without matching evaluation rules such as majority voting, tool access, and compute budgets.

  3. Gemini 2.5 Pro’s strongest differentiators in the transcript are MMMU table/chart understanding and very long context length (up to a million tokens).

  4. DeepSeek V3 is framed as a new base model that is competitive with OpenAI’s GPT-4.5 base layer, especially in mathematics and coding.

  5. The “commoditization” thesis centers on the idea that scaling compute and engineering effort can reproduce much of the capability, reducing the value of undisclosed intelligence secrets.

  6. Microsoft’s internal claims about replicating “think before answering” reasoning are used as supporting evidence that reasoning advantages may be narrowing across labs.

  7. Even with rising confidence about AI automation (like coding), the transcript highlights real-world inconsistencies such as ongoing hiring and observed failures in simple tasks.

Highlights

Gemini 2.5 Pro is described as state-of-the-art on MMMU table/chart understanding, with results said to be near human performance despite humans being allowed to browse and take their time.
The transcript claims Gemini 2.5 Pro can handle up to a million tokens, far exceeding other models shown in the cited long-context chart.
DeepSeek V3 is treated as a new base model likely feeding into R2, with benchmark competitiveness—especially in math—used to argue that reasoning moats are shrinking.
Microsoft’s internal unit is described as claiming near-parity on benchmarks by replicating reasoning behavior, reinforcing the commoditization narrative.
