A 100T Transformer Model Coming? Plus ByteDance Saga and the Mixtral Price Drop
Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Rumors of a “GPT 4.5” release were met with unusually direct denials from multiple OpenAI employees, with one pointing to the pattern of a consistent hallucination and another warning that any “4.5” would not plausibly appear silently—especially not with a telltale API string. The exchange didn’t just cool hype; it also shifted attention back to concrete, market-moving developments: a claimed transformer-focused chip breakthrough, a sharp price collapse for Mixtral, and ByteDance’s alleged use of OpenAI technology to accelerate a competing model.
On the hardware side, a new company, Etched, claims it has built the world's first "Transformer supercomputer," designed from the ground up to run transformer workloads. The pitch is that the company has "burned" the Transformer architecture into a custom chip, Sohu, so that every transistor can be optimized for the computations that dominate large language model inference, especially matrix multiplication. The company's marketing centers on throughput gains, claiming the chip massively outperforms Nvidia's H100 on tokens per second, and on enabling real-time interaction, such as voice agents that can ingest thousands of words in milliseconds and respond without the pauses users typically experience.
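The video doesn't show the underlying arithmetic, but the standard back-of-envelope reasoning for why matrix multiplication and memory traffic dominate inference throughput can be sketched as follows. All hardware numbers here are illustrative assumptions, not figures from the video:

```python
# Rough roofline-style estimate of single-stream decode throughput.
# All numbers are illustrative assumptions, not claims from the video.

def decode_tokens_per_sec(params_b: float, flops_tps: float, bw_tbs: float,
                          bytes_per_param: float = 2.0) -> float:
    """Estimate tokens/sec for autoregressive decoding of a dense model.

    params_b  : model size in billions of parameters
    flops_tps : sustained compute in teraFLOP/s
    bw_tbs    : memory bandwidth in TB/s

    Decoding one token takes roughly 2 FLOPs per parameter (one
    multiply-add), and, unbatched, every parameter must be read from
    memory once per token.
    """
    params = params_b * 1e9
    compute_bound = flops_tps * 1e12 / (2 * params)
    memory_bound = bw_tbs * 1e12 / (bytes_per_param * params)
    # Throughput is capped by whichever resource saturates first.
    return min(compute_bound, memory_bound)

# Hypothetical GPU-like accelerator: 500 TFLOP/s, 3 TB/s HBM, fp16 weights,
# running a 70B-parameter dense model at batch size 1.
rate = decode_tokens_per_sec(params_b=70, flops_tps=500, bw_tbs=3)
print(f"~{rate:.0f} tokens/sec")
```

At batch size 1 the memory-bandwidth term binds long before the compute term does, which is why inference-focused chips pitch their advantage in sustained tokens per second rather than raw FLOPs.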
That strategy echoes a broader bet described in an earlier article: two Harvard dropouts raised millions to design an AI accelerator dedicated to large language model inference, arguing that general-purpose hardware may not deliver the biggest gains. They acknowledge a key risk—specializing too narrowly as workloads evolve—but contend that if the bet pays off, the payoff could be dramatic: claims of up to 140× throughput per dollar and scalability toward extremely large models (up to 100 trillion parameters). The chip is reportedly slated for availability in 2024, with a funding round planned for early next year, though investors remain skeptical given the difficulty of breaking into the semiconductor market.
Meanwhile, Mixtral is delivering an immediate, user-visible story: steep, rapid price cuts. The transcript traces a sequence of reductions after the model's announcement—first to $2 per 1 million tokens, then down to 30 cents per 1 million tokens, and later to 27 cents per 1 million tokens—attributed to providers competing aggressively and, in some cases, even offering Mixtral for free. The implication is that performance improvements are arriving alongside rapidly declining costs, raising the question of where "intelligence per dollar" could land by the end of 2024.
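To make the scale of the drop concrete, here is a quick cost calculation at each reported price point. The per-million-token rates come from the transcript; the workload size is an arbitrary example:

```python
# Cost of a fixed workload at each reported Mixtral price point.
# Prices are dollars per 1 million tokens, as quoted in the transcript.
PRICES_PER_1M = {"launch": 2.00, "first cut": 0.30, "later cut": 0.27}

def workload_cost(tokens: int, price_per_1m: float) -> float:
    """Dollar cost of processing `tokens` tokens at a given rate."""
    return tokens / 1_000_000 * price_per_1m

TOKENS = 50_000_000  # example workload: 50M tokens
for label, price in PRICES_PER_1M.items():
    print(f"{label:>9}: ${workload_cost(TOKENS, price):,.2f}")

# Cumulative discount from launch price to the latest price:
drop = 1 - PRICES_PER_1M["later cut"] / PRICES_PER_1M["launch"]
print(f"total reduction: {drop:.0%}")  # roughly 86%
```

In other words, a workload that cost $100 at launch pricing costs $13.50 at the 27-cent rate, a reduction of roughly 86% within days of release.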
Sébastien Bubeck—author of "Sparks of AGI" and associated with the Phi series of models—ties the market shift to a research goal: enabling reasoning at minimal scale. He frames the mission as identifying the smallest ingredients needed for GPT-4-like intelligence, not merely fitting large models onto phones, even though on-device models are trending upward.
Finally, the ByteDance saga adds legal and competitive pressure. The transcript alleges ByteDance secretly relied on OpenAI's API (including for "Project Seed") to develop foundational LLM code, despite OpenAI's terms forbidding the use of model outputs to develop competing AI products. OpenAI reportedly suspended ByteDance's ChatGPT access, and ByteDance leadership is quoted as suggesting a "super strong" model may arrive soon, with questions about open weights and open sourcing still unresolved as the timeline tightens. The overall picture: hype about GPT 4.5 cools, while hardware specialization, inference pricing, and competitive tactics accelerate the real race toward more capable—and cheaper—systems.
Cornell Notes
OpenAI employees pushed back hard on "GPT 4.5" rumors, describing the chatter as a consistent hallucination pattern and warning that a real 4.5 would not appear silently—especially not with recognizable API strings. Attention then moved to tangible shifts: Etched claims it has "etched" the Transformer architecture into a custom chip (Sohu) to boost inference throughput far beyond Nvidia's H100, potentially enabling real-time voice and other low-latency uses. In parallel, Mixtral's pricing dropped sharply in a matter of days, with multiple providers cutting rates from $2 per 1M tokens down to around 27 cents per 1M tokens, suggesting "intelligence per dollar" could keep improving quickly. Sébastien Bubeck emphasizes a research target of reasoning at smaller scales, while ByteDance faces allegations of using OpenAI's API to build a competing model in violation of OpenAI's terms.
- Why did "GPT 4.5" rumors lose credibility in this discussion?
- What is the core claim behind the "etched Transformer" chip, and what problem is it trying to solve?
- How does the Mixtral price drop change the practical story for users and developers?
- What does Sébastien Bubeck's reasoning focus imply about model scaling?
- What allegation is made about ByteDance and OpenAI technology?
- What does "open weights/open source" mean in the ByteDance discussion, and why does it matter?
Review Questions
- What specific evidence did OpenAI employees cite (or imply) to argue that GPT 4.5 was not real?
- How does “etching” the Transformer architecture into silicon differ from running transformers on GPUs optimized through software?
- What does the Mixtral pricing timeline suggest about competition among inference providers and the likely direction of “intelligence per dollar”?
Key Points
1. OpenAI employees denied GPT 4.5 rumors, describing the claims as consistent hallucinations and arguing a real release would not be silent or hidden in API strings.
2. Etched claims it built a Transformer-dedicated chip (Sohu) by etching the Transformer architecture into silicon to optimize inference computations like matrix multiplication.
3. The etched-chip pitch centers on higher tokens-per-second throughput and lower latency for real-time applications such as voice agents.
4. Mixtral's inference cost fell rapidly after launch, with multiple providers cutting prices from about $2 per 1M tokens down to roughly 27 cents per 1M tokens.
5. Sébastien Bubeck emphasizes enabling reasoning at smaller model scales, treating "minimal ingredients for GPT-4-like intelligence" as the key research question.
6. ByteDance is alleged to have used OpenAI's API (including for "Project Seed") to develop a competing LLM, which the transcript says violates OpenAI's terms.
7. ByteDance leadership is quoted as expecting a super-strong model soon and signaling interest in open weights, intensifying competitive pressure.