OpenAI’s open source models are finally here
Based on Theo - t3.gg's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
OpenAI’s newly released open-weight models, a “120B” and a “20B” variant, are built to run locally, and early testing suggests the smaller 20B model can deliver performance close to OpenAI’s o3 Mini while fitting on consumer hardware. That combination of local execution and competitive capability matters because it shifts AI use from “pay-per-request through a safety-and-routing stack” toward “run it yourself,” which is attractive for privacy, cost control, and developers who want predictable infrastructure.
The practical headline is hardware feasibility. The 20B model is roughly an 11GB download, making it plausible to run on devices as modest as a smartphone, while the 120B model is around 60GB and is positioned for “basic gaming hardware.” In hands-on tests, the 20B model runs on an M2 Max MacBook Pro and even continues generating with airplane mode enabled, reflecting a key advantage of open weights: no network calls are required once the model is installed. The 120B model, by contrast, can overwhelm memory on smaller machines; Ollama’s usage quickly spikes past 30GB of RAM, and the model becomes slow or impractical on a laptop, though it runs more smoothly on a higher-end desktop GPU setup.
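To make the “no network calls” point concrete, here is a minimal sketch of local inference through Ollama’s REST API. It assumes Ollama is running locally and that the `gpt-oss:20b` tag matches your install; verify the tag against your Ollama release.

```python
import requests

# Assumes the model was pulled beforehand, e.g. `ollama pull gpt-oss:20b`
# (the tag name is an assumption; check `ollama list` on your machine).
OLLAMA_URL = "http://localhost:11434/api/generate"

def generate_local(prompt: str, model: str = "gpt-oss:20b") -> str:
    """Send one prompt to the local Ollama server; no external network needed."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    print(generate_local("Summarize mixture-of-experts routing in two sentences."))
```

Because the request only ever touches `localhost`, this keeps working with airplane mode on, which is exactly the behavior observed in the hands-on test.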
Under the hood, both models are mixture-of-experts (MoE). Instead of activating all parameters for every token, a learned router selects a small subset of “experts” relevant to that token. The transcript frames this as why a model with 117B total parameters can serve each token with only about 5.1B active parameters. A notable implementation detail is that the active-parameter count per token stays relatively close across the two model sizes, which may help explain why the 20B variant performs surprisingly well.
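A minimal sketch of the routing idea, using a generic top-k MoE layer in NumPy with made-up sizes; this is not the actual GPT OSS router, just the general mechanism it is based on:

```python
import numpy as np

# Illustrative top-k MoE routing: a router scores all experts per token,
# but only the top-k experts run, so active parameters per token stay
# far below the total parameter count.
rng = np.random.default_rng(0)

NUM_EXPERTS, TOP_K, D = 32, 4, 64  # sizes invented for illustration
router_w = rng.standard_normal((D, NUM_EXPERTS))
experts = [rng.standard_normal((D, D)) for _ in range(NUM_EXPERTS)]

def moe_layer(token: np.ndarray) -> np.ndarray:
    logits = token @ router_w                 # score every expert
    top = np.argsort(logits)[-TOP_K:]         # keep only the top-k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                  # softmax over the selected experts
    # Only TOP_K of NUM_EXPERTS expert matrices are touched for this token.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, top))

out = moe_layer(rng.standard_normal(D))
print(out.shape)  # (64,)
```

With these toy numbers, each token pays for 4 of 32 experts, the same reason a 117B-parameter model can run each token through only ~5.1B active parameters.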
Performance and reliability vary sharply depending on where the model is served and how tool-calling is implemented. Token throughput is reported as very fast on some providers: Cerebras is cited at roughly 2,300 tokens per second for the 120B model, with Groq around 810 TPS. But tool calling can be brittle: the transcript attributes many failures to provider-specific “Harmony” tool-call bindings that translate other tool formats into the exact bracketed syntax the model expects. In one front-end coding test (generating a Next.js “image generation studio” mock), multiple attempts produced errors until switching providers; even then, the output was criticized as weak on CSS. In SnitchBench-style evaluations, the 20B model is described as more likely to omit required JSON fields or leak unexpected syntax into tool arguments.
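A small, hypothetical defensive check illustrating those two failure modes; the function name and field names are invented for illustration, not part of any provider’s API:

```python
import json

# Hypothetical guard: before executing a tool call, verify the model's
# arguments parse as JSON and contain every required field, rather than
# trusting the provider's Harmony binding layer to have translated cleanly.
def validate_tool_call(raw_args: str, required: set[str]) -> dict:
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError as e:
        # e.g. bracketed Harmony syntax leaking into the argument string
        raise ValueError(f"arguments are not valid JSON: {e}") from e
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    missing = required - args.keys()
    if missing:
        # e.g. the 20B model omitting fields in SnitchBench-style runs
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return args

try:
    validate_tool_call('{"subject": "hi"}', {"subject", "recipient"})
except ValueError as err:
    print(err)  # missing required fields: ['recipient']
```

Catching these errors before dispatching the tool is what separates a retryable hiccup from a silently wrong action in a real developer workflow.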
Benchmarks paint a mixed but promising picture. OpenAI claims GPT OSS 120B with tools lands near the o3 and o4-mini ranges, while o3 Mini without tools scores lower than the 20B model without tooling, an argument for the 20B model’s efficiency. Health-focused tests are highlighted as especially relevant for privacy-sensitive workloads like medical data, where local inference could reduce exposure to third-party APIs. Cost comparisons are also central: the transcript claims these models are far cheaper than some competing hosted options, and that Artificial Analysis found GPT OSS 120B to be the most intelligent American open-weight model, with strong efficiency on an “intelligence vs cost” view.
Finally, the release is framed as a safety-and-systems milestone. OpenAI’s open-weight approach removes the historical “safety layer in front of the model” that had been present for proprietary deployments, so providers must implement their own handling. The transcript argues that this is why the release took longer: the open weights can’t be “taken back,” and the models are Apache 2.0 licensed, increasing the need for careful training and standardized tool formats across the ecosystem.
Cornell Notes
OpenAI released two open-weight text-only models, GPT OSS 20B and GPT OSS 120B, designed to run locally via providers and tools like Ollama and LM Studio. The 20B model is small enough to run on consumer hardware (even described as feasible on a smartphone) and is reported to perform close to o3 Mini in some evaluations, especially when considering instruction-following and tool use. Both models use mixture-of-experts routing, activating only a fraction of parameters per token, which helps explain why the smaller model can be surprisingly capable. Tool-calling reliability depends heavily on the provider’s Harmony/tool-call binding layer, with some providers producing fewer errors. The release matters for privacy, cost, and developer control because it reduces reliance on a proprietary safety/routing layer in front of the model.
Why does the 20B model feel unusually “usable” compared with other open-weight releases?
What’s the biggest practical difference between running GPT OSS 20B and GPT OSS 120B on a laptop?
How do tool calls work here, and why do failures differ by provider?
What do benchmarks claim about intelligence and instruction-following?
Why is the release framed as a safety-system shift, not just a model upgrade?
Review Questions
- What hardware constraints most strongly affect local use of GPT OSS 120B, and how do those constraints show up in the transcript’s RAM observations?
- How does mixture-of-experts routing change the compute required per token, and why might that help the 20B model compete with larger systems?
- In tool-calling tests, what kinds of errors appear when provider-specific Harmony bindings are weak, and why do those errors matter for real developer workflows?
Key Points
- 1
GPT OSS 20B is small enough (~11GB) to run locally on consumer hardware, while GPT OSS 120B (~60GB) is far more demanding and can overwhelm laptop memory.
- 2
Both models use mixture-of-experts routing, activating only a subset of parameters per token (about 5.1B active parameters per request is cited), which helps performance scale down to 20B.
- 3
Tool-calling reliability depends heavily on the provider’s Harmony/tool-call binding layer; Grok and Fireworks are described as more reliable than some alternatives.
- 4
Open-weight release changes the safety architecture: the centralized “safety layer in front of the model” is no longer guaranteed, so providers must implement their own protections.
- 5
OpenAI claims GPT OSS 120B with tools is near 03/04 Mini ranges, and the 20B model can match 03 Mini in some evaluations—making local inference more competitive.
- 6
Cost and throughput are positioned as major advantages, with the transcript citing very low per-token pricing and high tokens-per-second on certain providers.
- 7
Task performance is uneven: the models are described as strong at many general tasks and science, but weaker at specific front-end/CSS expectations and prone to table-heavy outputs.