
OpenAI’s open source models are finally here

Theo - t3.gg · 6 min read

Based on Theo - t3.gg's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.

TL;DR

GPT OSS 20B is small enough (~11GB) to run locally on consumer hardware, while GPT OSS 120B (~60GB) is far more demanding and can overwhelm laptop memory.

Briefing

OpenAI’s newly released open-weight models, a “120B” and a “20B” variant, are built to run locally, and early testing suggests the smaller 20B model can deliver performance close to OpenAI’s o3-mini while fitting on consumer hardware. That combination of local execution and competitive capability matters because it shifts AI use from “pay-per-request through a safety-and-routing stack” toward “run it yourself,” which is attractive for privacy, cost control, and developers who want predictable infrastructure.

The practical headline is hardware feasibility. The 20B model is roughly an 11GB download, making it plausible to run on devices as modest as a smartphone, while the 120B model is around 60GB and is positioned for “basic gaming hardware.” In hands-on tests, the 20B model runs on an M2 Max MacBook Pro and even continues generating with airplane mode enabled, reflecting a key advantage of open weights: no network calls are required once the model is installed. The 120B model, by contrast, can overwhelm memory on smaller machines (Ollama usage spikes past 30GB of RAM quickly) and becomes slow or impractical on a laptop, though it runs more smoothly on a higher-end desktop GPU setup.
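The “airplane mode” observation follows from how local serving works: once the weights are downloaded, Ollama serves them over a loopback HTTP API, so no request ever leaves the machine. A minimal sketch of the request body for Ollama’s local `/api/generate` endpoint (the `gpt-oss:20b` model tag is assumed here to match the Ollama library):

```python
import json

def build_ollama_request(prompt: str, model: str = "gpt-oss:20b") -> str:
    """Build the JSON body for Ollama's local /api/generate endpoint.

    Ollama listens on http://localhost:11434 by default, so once the
    weights are downloaded, inference needs no outbound network calls.
    """
    payload = {
        "model": model,   # model tag as it appears in the Ollama library
        "prompt": prompt,
        "stream": False,  # one complete response instead of streamed chunks
    }
    return json.dumps(payload)

body = build_ollama_request("Explain mixture-of-experts in one sentence.")
```

POSTing that body to `http://localhost:11434/api/generate` works identically with or without a network connection, which is exactly what the offline test demonstrates.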

Under the hood, both models are mixture-of-experts (MoE). Instead of activating all parameters for every token, the router selects a subset of “experts” relevant to the request. The transcript frames this as why a model with 117B parameters can route a given request through about 5.1B active parameters. A notable implementation detail is that the “active parameters per token” are said to stay relatively close across the two model sizes, which may help explain why the 20B variant performs surprisingly well.
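The routing idea can be sketched in a few lines; this toy router (expert count and scores invented for illustration, not OpenAI’s implementation) selects the top-k experts for a token, so only those experts’ parameters do any work:

```python
def route(token_scores: dict, k: int = 2) -> list:
    """Toy MoE router: pick the k highest-scoring experts for one token.

    Only the selected experts' parameters run, which is how a model with
    117B total parameters can activate only ~5.1B of them per token.
    """
    ranked = sorted(token_scores, key=token_scores.get, reverse=True)
    return ranked[:k]

# Eight hypothetical experts; the router activates just two for this token.
scores = {0: 0.10, 1: 0.70, 2: 0.05, 3: 0.90, 4: 0.02, 5: 0.30, 6: 0.10, 7: 0.40}
active = route(scores, k=2)  # → [3, 1]
```

Real routers produce these scores with a small learned network per token, but the selection step is the essence: compute cost tracks the active subset, not the total parameter count.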

Performance and reliability vary sharply depending on where the model is served and how tool calling is implemented. Token throughput is reported as very fast on some providers: Cerebras is cited at roughly 2,300 tokens per second for the 120B model, with Groq around 810 TPS. But tool calling can be brittle: the transcript attributes many failures to provider-specific “Harmony” tool-call bindings that translate other tool formats into the exact bracketed syntax the model expects. In one front-end coding test (generating a Next.js “image generation studio” mock), multiple attempts produced errors until switching providers; even then, the output was criticized for weak CSS. In SnitchBench-style evaluations, the 20B model is described as more likely to omit required JSON fields or leak unexpected syntax into tool arguments.
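The failure modes listed (missing required JSON fields, leaked syntax, unterminated strings) are the kind of thing a thin validation layer on top of a provider’s binding can surface before a tool ever runs; a minimal sketch, with field names and delimiter tokens invented for illustration:

```python
import json

def validate_tool_call(raw: str, required: set) -> list:
    """Return a list of problems with a model-emitted tool-call argument blob."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError:
        return ["arguments are not valid JSON (e.g. an unterminated string)"]
    problems = [f"missing required field: {f}" for f in sorted(required - args.keys())]
    for value in args.values():
        # Crude check for special tokens bleeding into argument values.
        if isinstance(value, str) and ("<|" in value or "|>" in value):
            problems.append("leaked bracket syntax in an argument value")
    return problems

# A payload exhibiting two of the failure modes described above.
bad = '{"prompt": "a cat <|end|>", "size": "512x512"}'
problems = validate_tool_call(bad, required={"prompt", "size", "style"})
```

In practice a provider would retry or repair on validation failure rather than forwarding broken arguments to the tool, which is roughly what separates the more reliable bindings from the brittle ones.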

Benchmarks paint a mixed but promising picture. OpenAI claims GPT OSS 120B with tools lands near the o3 and o4-mini ranges, while o3-mini without tools scores lower than the 20B model without tooling, an argument for the 20B model’s efficiency. Health-focused tests are highlighted as especially relevant for privacy-sensitive workloads like medical data, where local inference could reduce exposure to third-party APIs. Cost comparisons are also central: the transcript claims these models are far cheaper than some competing hosted options, and that Artificial Analysis found GPT OSS 120B to be the most intelligent American open-weight model, with strong efficiency on an “intelligence vs. cost” view.

Finally, the release is framed as a safety-and-systems milestone. OpenAI’s open-weight approach removes the historical “safety layer in front of the model” that had been present for proprietary deployments, so providers must implement their own handling. The transcript argues that this is why the release took longer: the open weights can’t be “taken back,” and the models are Apache 2 licensed, increasing the need for careful training and standardized tool formats across the ecosystem.

Cornell Notes

OpenAI released two open-weight text-only models, GPT OSS 20B and GPT OSS 120B, designed to run locally via providers and tools like Ollama and LM Studio. The 20B model is small enough to run on consumer hardware (even described as feasible on a smartphone) and is reported to perform close to o3-mini in some evaluations, especially when considering instruction-following and tool use. Both models use mixture-of-experts routing, activating only a fraction of parameters per token, which helps explain why the smaller model can be surprisingly capable. Tool-calling reliability depends heavily on the provider’s Harmony/tool-call binding layer, with some providers producing fewer errors. The release matters for privacy, cost, and developer control because it reduces reliance on a proprietary safety/routing layer in front of the model.

Why does the 20B model feel unusually “usable” compared with other open-weight releases?

It’s small enough to run on everyday hardware and it’s built as a mixture-of-experts model. The transcript cites the 20B model as ~11GB downloaded, with local tests running on an M2 Max MacBook Pro and continuing generation offline (airplane mode). MoE routing means a request doesn’t activate all parameters; the transcript says a 117B-parameter model can route a given request through about 5.1B active parameters, reducing compute for each token and helping the 20B variant stay competitive.

What’s the biggest practical difference between running GPT OSS 20B and GPT OSS 120B on a laptop?

Memory pressure and speed. In the transcript, the 120B model can lock up a computer and quickly fill its RAM (Ollama reported using over 30GB almost immediately). The 20B model runs “totally fine” in local tests, while the 120B model is described as slow or impractical on a laptop even when it eventually runs (a single paragraph taking minutes). The 120B model is portrayed as more realistic on a desktop with substantial RAM and a capable GPU.
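The reported sizes line up with simple weight-footprint arithmetic: memory for the weights alone is roughly parameters times bits per parameter. A quick back-of-envelope check (4-bit quantization is assumed here; the transcript does not specify the precision):

```python
def weight_footprint_gb(params_billion: float, bits_per_param: float) -> float:
    """Rough memory needed just to hold the weights (ignores KV cache, etc.)."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

# The ~11GB and ~60GB figures are consistent with ~4 bits per parameter:
print(weight_footprint_gb(117, 4))  # 58.5 GB for the 120B model
print(weight_footprint_gb(20, 4))   # 10.0 GB for the 20B model
```

This is why a 60GB model “spikes past 30GB of RAM” on a laptop before it even finishes loading, while the ~11GB model fits comfortably alongside the OS and other apps.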

How do tool calls work here, and why do failures differ by provider?

The model expects a specific bracketed “Harmony” tool-call format (e.g., bracket-bar syntax like “Start user message … end”). Providers must translate their own tool-call representations into that exact format. The transcript says reliability varies because each provider implements the binding layer differently. In tests, Groq and Fireworks are described as the most reliable so far for tool calling, while other providers produce errors like unterminated strings, missing required JSON fields, or leaked bracket syntax in tool arguments.
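Conceptually, a binding layer is just a serializer from a provider’s internal tool-call representation into the exact token syntax the model was trained on; a sketch (the delimiters below are placeholders for illustration, not the actual Harmony spec):

```python
import json

def to_harmony_style(name: str, arguments: dict) -> str:
    """Serialize a generic tool call into a bracketed, Harmony-style string.

    json.dumps guarantees terminated strings and proper escaping, which is
    exactly what the weaker bindings described above fail to provide.
    """
    return f"<|call|>{name}<|args|>{json.dumps(arguments)}<|end|>"

call = to_harmony_style("generate_image", {"prompt": "studio mock", "size": "1024x1024"})
```

The fragility comes from the requirement that this output match the trained-on format byte for byte: a wrong delimiter or an unescaped quote in the arguments surfaces downstream as a “model” tool-call failure even though the model itself did its part.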

What do benchmarks claim about intelligence and instruction-following?

OpenAI claims GPT OSS 120B with tools performs comparably to o3 and o4-mini, while o3-mini without tools scores lower than the 20B model without tooling, suggesting the 20B model’s capability per size is strong. The transcript also emphasizes instruction-following: OpenAI models are said to follow requested syntax more reliably than some alternatives, with fewer “autocomplete mode” deviations. However, some task-specific benchmarks (like skate-trick naming) still show the OSS models lagging behind top proprietary models.

Why is the release framed as a safety-system shift, not just a model upgrade?

Historically, proprietary deployments had a safety layer “in front of” the model that could filter unsafe requests (e.g., blocking illegal drug manufacturing). With open weights, that intermediary layer is no longer automatically present for everyone, and once weights are released they can’t be “taken back.” The transcript argues OpenAI delayed the release to ensure safety despite the loss of that centralized front-end control, and that providers now need their own handling standards.

Review Questions

  1. What hardware constraints most strongly affect local use of GPT OSS 120B, and how do those constraints show up in the transcript’s RAM observations?
  2. How does mixture-of-experts routing change the compute required per token, and why might that help the 20B model compete with larger systems?
  3. In tool-calling tests, what kinds of errors appear when provider-specific Harmony bindings are weak, and why do those errors matter for real developer workflows?

Key Points

  1. GPT OSS 20B is small enough (~11GB) to run locally on consumer hardware, while GPT OSS 120B (~60GB) is far more demanding and can overwhelm laptop memory.

  2. Both models use mixture-of-experts routing, activating only a subset of parameters per token (about 5.1B active parameters per request is cited), which helps performance scale down to 20B.

  3. Tool-calling reliability depends heavily on the provider’s Harmony/tool-call binding layer; Groq and Fireworks are described as more reliable than some alternatives.

  4. The open-weight release changes the safety architecture: the centralized “safety layer in front of the model” is no longer guaranteed, so providers must implement their own protections.

  5. OpenAI claims GPT OSS 120B with tools is near the o3/o4-mini range, and the 20B model can match o3-mini in some evaluations, making local inference more competitive.

  6. Cost and throughput are positioned as major advantages, with the transcript citing very low per-token pricing and high tokens-per-second on certain providers.

  7. Task performance is uneven: the models are described as strong at many general tasks and science, but weaker at specific front-end/CSS expectations and prone to table-heavy outputs.

Highlights

The 20B model is described as feasible on a smartphone-class setup, while the 120B model is framed as workable on basic gaming hardware—local AI becomes a realistic option for more people.
Mixture-of-experts routing is central: even with huge total parameter counts, only a fraction is activated per request, helping the smaller model stay competitive.
Tool calling isn’t just “model quality”—provider-specific Harmony bindings can make the difference between clean tool arguments and frequent validation failures.
Open-weight licensing (Apache 2) and the removal of a centralized safety layer explain why the release required extra safety work and why ecosystem standards matter.
Benchmark claims emphasize instruction-following and cost-to-performance: GPT OSS 120B is presented as a strong efficiency winner among American open-weight models.

Mentioned

  • MoE
  • TPS
  • RAM
  • CPU
  • GPU
  • H100
  • CLI
  • JSON
  • MoE router
  • FBA
  • Apache 2