A 100T Transformer Model Coming? Plus ByteDance Saga and the Mixtral Price Drop

AI Explained · 6 min read

Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

OpenAI employees denied GPT-4.5 rumors, describing the claims as consistent hallucinations and arguing a real release would not be silent or hidden in API strings.

Briefing

Rumors of a "GPT-4.5" release were met with unusually direct denials from multiple OpenAI employees, one calling the claims a weirdly consistent hallucination and another warning that a real "4.5" would not plausibly appear silently, least of all with a telltale API string. The exchange did more than cool the hype; it shifted attention back to concrete, market-moving developments: a claimed transformer-focused chip breakthrough, a sharp price collapse for Mixtral, and ByteDance's alleged use of OpenAI technology to accelerate a competing model.

On the hardware side, a new company, Etched, claims it has built the world's first "Transformer supercomputer", designed from the ground up to run transformer workloads. The pitch is that the company "burned" the Transformer architecture into a custom chip, code-named "Sohu", so that every transistor is optimized for the computations that dominate large language model inference, above all matrix multiplication. The company's marketing centers on throughput gains, claiming massive tokens-per-second advantages over the Nvidia H100, and on enabling real-time interaction, such as voice agents that can ingest thousands of words in milliseconds and respond without the pauses users typically experience.
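
To see why matrix multiplication is the natural target, here is a minimal back-of-envelope sketch of per-token FLOPs in one decoder layer; the model dimensions are hypothetical placeholders, not figures from the video:

```python
# Back-of-envelope: share of per-token FLOPs spent in matrix multiplies
# for one decoder layer. Dimensions are illustrative placeholders.

d_model = 4096      # hidden size (hypothetical)
d_ff = 4 * d_model  # MLP inner size, the common 4x convention
n_ctx = 2048        # tokens already in context (hypothetical)

# Matmul FLOPs per token: QKV + output projections (~8*d^2)
# plus the two MLP projections (~2 * 2*d*d_ff = 16*d^2).
matmul_flops = 8 * d_model**2 + 2 * 2 * d_model * d_ff

# Attention score/value mixing scales with context length (~4*n_ctx*d).
attn_mix_flops = 4 * n_ctx * d_model

# Non-matmul work (layernorms, softmax, activations) is roughly O(d) or
# O(n_ctx) per token, orders of magnitude below the d^2 terms.
other_flops = 20 * d_model + 5 * n_ctx  # loose upper-bound guess

total = matmul_flops + attn_mix_flops + other_flops
print(f"matmul share of layer FLOPs: {matmul_flops / total:.1%}")
# With these dimensions, matmuls are ~90%+ of the work, which is the
# case for specializing silicon around them.
```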

That strategy echoes a broader bet described in an earlier article: two Harvard dropouts raised millions to design an AI accelerator dedicated to large language model inference, arguing that general-purpose hardware may not deliver the biggest gains. They acknowledge the key risk of specializing too narrowly as workloads evolve, but contend that if the bet pays off, the payoff could be dramatic: claims of up to 140× throughput per dollar and scalability toward extremely large models, up to 100 trillion parameters. The chip is reportedly slated for availability in 2024, with a funding round planned for early next year, though investors remain skeptical given how hard it is to break into semiconductors.
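
For a sense of what 100 trillion parameters implies, a quick back-of-envelope on weight storage; the precision options are assumptions for illustration, not details from the article:

```python
# Memory needed just to hold 100 trillion parameters, by precision.
# Pure arithmetic; the precision choices are illustrative assumptions.

params = 100e12  # the 100T figure from the article's claim

for label, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    terabytes = params * bytes_per_param / 1e12
    print(f"{label}: {terabytes:,.0f} TB of weights")
# fp16 alone is 200 TB -- roughly 2,500 H100s (80 GB each) just to hold
# the weights, which is why 100T-scale serving implies new hardware.
```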

Meanwhile, "Mixtral" is delivering an immediate, user-visible story: steep, rapid price cuts. The transcript traces a sequence of reductions after the model's announcement: first $2 per 1 million tokens, then a 70% cut to 60 cents per 1 million tokens, then 30 cents, and later 27 cents per 1 million tokens, attributed to providers competing aggressively and in some cases offering Mixtral for free. The implication is that performance improvements are arriving alongside faster cost declines, raising the question of where "intelligence per dollar" could land by the end of 2024.
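
Plugging the quoted prices into a concrete (hypothetical) workload makes the shift tangible:

```python
# Cost of the same hypothetical workload at each quoted price point.
# Prices are the ones cited above; the workload size is made up.

tokens = 500e6  # hypothetical: 500M tokens per month of inference

for label, usd_per_1m in [("launch", 2.00), ("70% cut", 0.60),
                          ("next cut", 0.30), ("latest", 0.27)]:
    cost = tokens / 1e6 * usd_per_1m
    print(f"{label:>8}: ${cost:,.0f}/month")
# $1,000 -> $135 in a matter of days, a ~7x drop before even counting
# the providers offering Mixtral for free.
```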

Sébastien Bubeck, lead author of the "Sparks of AGI" paper and a driving force behind Microsoft's Phi series of small models, ties the market shift to a research goal: enabling reasoning at minimal scale. He frames the mission as identifying the smallest ingredients needed for GPT-4-like intelligence, not merely fitting large models onto phones, even though on-device models are trending upward.

Finally, the ByteDance saga adds legal and competitive pressure. The transcript alleges ByteDance secretly relied on OpenAI's API (including for "Project Seed") to develop foundational LLM code, despite OpenAI terms that forbid using model outputs to develop competing AI products. OpenAI reportedly banned ByteDance from ChatGPT, and ByteDance leadership is quoted as suggesting a "super strong" model may arrive soon, leaving open whether its weights will be released as the timeline tightens. The overall picture: hype about GPT-4.5 cools, while hardware specialization, inference pricing, and competitive tactics accelerate the real race toward more capable, and cheaper, systems.

Cornell Notes

OpenAI employees pushed back hard on "GPT-4.5" rumors, describing the chatter as a consistent hallucination pattern and warning that a real 4.5 would not appear silently, least of all with recognizable API strings. Attention then moved to tangible shifts: Etched claims it has "burned" the Transformer architecture into a custom chip (Sohu) to push inference throughput far beyond the Nvidia H100, potentially enabling real-time voice and other low-latency uses. In parallel, Mixtral's pricing dropped sharply in a matter of days, with multiple providers cutting rates from $2 per 1M tokens down to around 27 cents per 1M tokens, suggesting "intelligence per dollar" could keep improving quickly. Sébastien Bubeck emphasizes a research target of reasoning at smaller scales, while ByteDance faces allegations of using OpenAI API technology to build a competing model in violation of terms.

Why did "GPT-4.5" rumors lose credibility in this discussion?

Multiple OpenAI employees denied the existence of GPT-4.5. One described the "GPT-4.5 turbo" claims as a "weird and oddly consistent hallucination," while another said any real 4.5 wouldn't be released silently. A third added that if a 4.5 existed, it would likely show up in the API string as "4.5," making a silent rollout implausible. The net effect was a rapid collapse of the hype bubble.
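
For context on the "API string" argument: model identifiers are visible to any API user, which is what makes a silent rollout implausible. A minimal sketch using the OpenAI Python SDK, with the "4.5" filter purely for illustration:

```python
# List the model identifiers visible to an API account -- the "API
# strings" where a hypothetical "4.5" model would have to show up.
# Requires the openai package and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

for model in client.models.list():
    if "4.5" in model.id:
        print(model.id)
# No output at the time of the rumors: no "4.5" identifier existed.
```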

What is the core claim behind the “etched Transformer” chip, and what problem is it trying to solve?

The transcript describes a company claiming it built a Transformer-focused supercomputer by burning the Transformer architecture directly into silicon via a custom chip code-named "Sohu." The goal is to avoid relying on general-purpose GPUs that run transformer workloads through software optimization. Instead, the hardware is designed so transistors can be optimized for transformer computations, especially matrix multiplication, aiming for much higher tokens-per-second inference and lower latency for interactive applications like voice agents.
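
A rough sketch of how throughput translates into the latency users feel; both throughput figures below are assumed for illustration, not vendor numbers:

```python
# How prefill throughput maps to perceived latency for a voice agent
# ingesting a long prompt. All throughput figures are hypothetical.

prompt_words = 3000
tokens = prompt_words * 1.33  # rough words-to-tokens ratio for English

for label, tok_per_s in [("general-purpose GPU (assumed)", 5_000),
                         ("specialized chip (assumed)", 100_000)]:
    ms = tokens / tok_per_s * 1000
    print(f"{label}: {ms:,.0f} ms to ingest the prompt")
# At ~100k tokens/s, thousands of words land in tens of milliseconds,
# under the threshold where a pause in conversation feels noticeable.
```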

How does the Mixtral price drop change the practical story for users and developers?

The transcript traces a steep sequence of reductions after Mixtral's announcement: starting around $2 per 1 million tokens, then dropping 70% to 60 cents per 1 million tokens, then to 30 cents, and later falling again to 27 cents per 1 million tokens. It also mentions at least one provider offering Mixtral for free for both input and output. Together, these cuts suggest that improved model access is coming alongside faster cost declines, which can materially change experimentation budgets and deployment economics.

What does Sebastian Bubeck’s reasoning focus imply about model scaling?

Bubeck frames the mission as discovering the minimal ingredients needed to achieve GPT-4-like reasoning, not just scaling parameters or fitting models onto phones. He suggests there may be room to enable reasoning at relatively small sizes, citing how much performance has already been extracted at 1B and 3B parameters, and he treats "what capabilities are possible" as an open question. Even with claims that up to 10B parameters can fit on-device, he emphasizes the scientific quest over the device constraint.
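
The on-device side of that claim is simple arithmetic; a minimal sketch, assuming common quantization levels (the precisions are an assumption for illustration, not Bubeck's):

```python
# RAM needed to hold model weights on-device, by size and precision.
# Pure arithmetic; quantization levels are illustrative assumptions.

for params_billion in (1, 3, 10):
    for label, bytes_per_param in (("int8", 1), ("int4", 0.5)):
        gigabytes = params_billion * bytes_per_param
        print(f"{params_billion}B @ {label}: {gigabytes:.1f} GB")
# A 10B model quantized to int4 is ~5 GB of weights, which is why the
# "up to 10B on-device" claim is plausible on a flagship phone.
```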

What allegation is made about ByteDance and OpenAI technology?

The transcript alleges ByteDance used OpenAI API technology during the training and evaluation phases of "Project Seed," its effort to develop a competing foundational LLM. Relying on OpenAI outputs to build competing models, the transcript says, violates OpenAI's terms of service, which prohibit using model outputs to develop competing AI products and services. It also notes that OpenAI banned ByteDance from ChatGPT after the issue surfaced.

What does “open weights/open source” mean in the ByteDance discussion, and why does it matter?

A ByteDance head of research, Quangang (as named in the transcript), is quoted as being uncertain about GPT-5 timing but expecting a super-strong model soon. Asked about open sourcing and open model weights, the response leans toward openness rather than waiting to catch up. The transcript treats this as a competitive signal: if open weights arrive quickly, they could accelerate adoption and lower barriers for developers compared with closed releases.

Review Questions

  1. What specific evidence did OpenAI employees cite (or imply) to argue that GPT-4.5 was not real?
  2. How does “etching” the Transformer architecture into silicon differ from running transformers on GPUs optimized through software?
  3. What does the Mixtral pricing timeline suggest about competition among inference providers and the likely direction of “intelligence per dollar”?

Key Points

  1. OpenAI employees denied GPT-4.5 rumors, describing the claims as consistent hallucinations and arguing a real release would not be silent or hidden in API strings.

  2. A company, Etched, claims it built a Transformer-dedicated chip ("Sohu") by burning the Transformer architecture into silicon to optimize inference computations like matrix multiplication.

  3. Etched's pitch centers on higher tokens-per-second throughput and lower latency for real-time applications such as voice agents.

  4. Mixtral's inference cost fell rapidly after launch, with multiple providers cutting prices from about $2 per 1M tokens down to roughly 27 cents per 1M tokens.

  5. Sébastien Bubeck emphasizes enabling reasoning at smaller model scales, treating "minimal ingredients for GPT-4-like intelligence" as the key research question.

  6. ByteDance is alleged to have used OpenAI API technology (including for "Project Seed") to develop a competing LLM, which the transcript says violates OpenAI terms.

  7. ByteDance leadership is quoted as expecting a super-strong model soon and signaling interest in open weights, intensifying competitive pressure.

Highlights

Multiple OpenAI employees dismissed GPT-4.5 as a hallucination pattern, with one arguing that any real 4.5 would show up clearly in API identifiers rather than appearing silently.
Etched's "Transformer supercomputer" concept aims to replace software-optimized GPU inference with hardware optimized at the transistor level for transformer computations, targeting major tokens-per-second gains.
Mixtral pricing dropped in rapid steps—down to around 27 cents per 1M tokens—suggesting aggressive competition among providers for inference demand.
Sébastien Bubeck's focus is on reasoning at minimal scale, not just fitting ever-larger models onto devices.
ByteDance is accused of using OpenAI API outputs to build “Project Seed,” prompting an OpenAI ban from ChatGPT.

Topics

  • GPT-4.5 Denials
  • Etched Transformer Chip
  • Mixtral Pricing
  • Sébastien Bubeck Reasoning
  • ByteDance OpenAI Terms
