OpenAI's New OPEN Models - GPT-OSS 120B & 20B
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
OpenAI released two open-weights models (120B and 20B) under an Apache 2.0 license, enabling broad reuse of the weights.
Briefing
OpenAI has released two open-weights language models under an Apache 2.0 license: a 120B-parameter model and a 20B-parameter model. The headline impact is practical: the weights are permissively licensed, making it easier for developers to run, fine-tune, and deploy them without the user-count or usage restrictions that often come with “open” releases. The models are also positioned for agentic workflows—tool use, web search, Python execution, and instruction following—backed by post-training aimed at those tasks.
The naming is where the release gets contentious. “OPEN Models” and “GPT-OSS” suggest open source, but these are open-weight models rather than open source in the strict sense. A true open-source release would typically include training code, intermediate checkpoints, and data access for reproducibility; here, the emphasis is on downloadable weights under Apache 2.0. The transcript notes that even OpenAI’s own reading of “OSS” appears to land on “open source series,” while the models themselves are best treated as open-weight artifacts.
On capability, both models are described as trained in a way “similar” to OpenAI’s o3 and o4 model families, using supervised fine-tuning plus reinforcement-style alignment. A key feature is an adjustable “reasoning effort” setting with three levels (low, medium, high) chosen via the system prompt. That knob is framed as a latency-versus-performance trade-off, and the transcript stresses that real-world testing will determine how much accuracy improves when the model is allowed more time to think.
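As an illustrative sketch of that knob, the request below places the effort level in the system prompt of an OpenAI-compatible chat payload. The `Reasoning: high` wording and the `gpt-oss-20b` model id are assumptions drawn from the release coverage; check the model card for the exact syntax your server expects.

```python
# Hedged sketch: build (but do not send) an OpenAI-compatible chat payload
# that sets the reasoning-effort level in the system prompt.
# The "Reasoning: <level>" convention and model id are assumptions.

def build_request(question: str, effort: str = "medium") -> dict:
    """Return a chat-completions payload with a reasoning-effort hint."""
    if effort not in {"low", "medium", "high"}:
        raise ValueError(f"unknown reasoning effort: {effort}")
    return {
        "model": "gpt-oss-20b",  # assumed model id
        "messages": [
            {"role": "system", "content": f"Reasoning: {effort}"},
            {"role": "user", "content": question},
        ],
    }

payload = build_request("What is 17 * 23?", effort="high")
print(payload["messages"][0]["content"])  # → Reasoning: high
```

Higher effort should buy longer internal reasoning chains at the cost of latency, which is exactly the trade-off the transcript says needs real-world testing.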
Architecturally, the models appear to be mixture-of-experts. The 120B model runs with only 5B active parameters, while the 20B model uses 3.6B active parameters—an efficiency profile that the transcript compares to other MoE efforts in the ecosystem. Context length support is also a selling point: both models reportedly handle up to 120K context, likely built on rotary positional embeddings. The models are described as English-only, which limits multilingual use cases for now.
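The efficiency claim is easy to check with quick arithmetic on the figures quoted above: only a small fraction of each model's total parameters is active per token.

```python
# Active-parameter fractions for the quoted MoE configurations.
models = {
    "gpt-oss-120b": {"total_b": 120, "active_b": 5.0},
    "gpt-oss-20b": {"total_b": 20, "active_b": 3.6},
}

for name, p in models.items():
    frac = p["active_b"] / p["total_b"]
    print(f"{name}: {frac:.1%} of parameters active per token")
# gpt-oss-120b: 4.2% of parameters active per token
# gpt-oss-20b: 18.0% of parameters active per token
```

At roughly 4% active parameters, the 120B model's per-token compute is closer to a small dense model than its headline size suggests.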
Benchmarks are presented as strong but not dominant. The transcript highlights that tool-using performance is substantially higher than performance without tools, aligning with the agent-first positioning. Function-calling results are said to surpass o4-mini and come close to o3, and some reasoning-heavy tasks (such as competition math and GPQA) improve when longer reasoning chains are enabled. Still, there is skepticism about benchmark saturation: many evaluation sets have already been “maxed out” by prior models, raising the possibility of overfitting to common leaderboards.
For deployment, the models can be accessed through OpenRouter and served via OpenAI-compatible endpoints, including a newer “Harmony” path tailored to the Responses API format. The transcript also flags an important limitation: the knowledge cutoff is June 2024 (202406), so answers about current events can be stale. Local use is pitched as feasible with the right setup: Ollama support is emphasized, along with a quantization approach that depends on Triton for efficient loading. A practical test shows the 20B model correctly answering general questions but misidentifying the current U.S. president because of the cutoff.
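One practical response to that cutoff is to route recency-sensitive questions to a web-search tool instead of trusting the weights. The helper below is a hypothetical sketch (the function name and the end-of-June cutoff date are assumptions based on the reported 202406 figure), not part of any official API.

```python
from datetime import date

# Assumed cutoff derived from the reported "202406" knowledge cutoff.
KNOWLEDGE_CUTOFF = date(2024, 6, 30)

def needs_live_lookup(event_date: date, cutoff: date = KNOWLEDGE_CUTOFF) -> bool:
    """True if an event postdates the cutoff and may be stale in-weights,
    so the question should go to a web-search tool instead."""
    return event_date > cutoff

# The November 2024 U.S. election postdates the cutoff, which explains
# the misidentified-president example above.
print(needs_live_lookup(date(2024, 11, 5)))  # → True
print(needs_live_lookup(date(2023, 1, 1)))   # → False
```

In an agentic setup, this kind of check would gate whether the model answers directly or first calls its web-search tool.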
Overall, the release is framed as a meaningful step for OpenAI back into open-weights territory after GPT-2, putting pressure on other frontier labs—especially in the West—to publish more models. With GPT-5 reportedly imminent, the transcript suggests attention may quickly shift back to proprietary systems, but the open-weights availability could still unlock a wave of local and agentic experimentation.
Cornell Notes
OpenAI’s new open-weights models—GPT-OSS 120B and GPT-OSS 20B—arrive with an Apache 2.0 license, enabling broad use of the weights without restrictive user-count conditions. Both models target agentic workflows (tool use, web search, Python execution, instruction following) and support adjustable “reasoning effort” levels (low/medium/high) to trade latency for performance. The models appear to be mixture-of-experts, with the 120B using 5B active parameters and the 20B using 3.6B active parameters, helping them run more efficiently than their headline sizes suggest. They support up to 120K context but are described as English-only, and their knowledge cutoff is June 2024 (202406), which can cause outdated factual answers. Local deployment is feasible via Ollama with the right quantization setup, while OpenAI-compatible serving is available through Responses-style tooling.
What makes the release practically “open,” and what’s the main caveat about the “OSS” label?
How do the models let users control reasoning, and why does that matter?
What architectural efficiency detail stands out in the model sizes?
What deployment paths are mentioned, and what technical requirement affects local loading?
What limitation shows up in factual QA, and what causes it?
How do benchmark results relate to the agentic positioning?
Review Questions
- Which specific mechanism lets a user adjust reasoning time, and how is it expected to affect latency and accuracy?
- Why might a model answer about current events incorrectly even if it performs well on reasoning benchmarks?
- What does “open-weights” imply compared with “open-source,” based on what’s missing from the release?
Key Points
1. OpenAI released two open-weights models (120B and 20B) under an Apache 2.0 license, enabling broad reuse of the weights.
2. The “GPT-OSS” naming is disputed because the release is open-weights rather than fully open-source (no training code, checkpoints, or data for reproducibility).
3. Both models target agentic use cases, with post-training for instruction following, tool use, web search, and Python code execution.
4. Reasoning effort can be set to low/medium/high via the system prompt, creating a latency-versus-performance trade-off that requires testing.
5. The models support up to 120K context but are described as English-only, limiting multilingual applications.
6. A June 2024 (202406) knowledge cutoff can cause stale factual answers, even when reasoning and tool use look strong.
7. Local deployment is feasible via Ollama, but efficient quantized loading depends on Triton; without it, the weights load in 16-bit and can exceed hardware limits.
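For local use against the reported 120K-token window, a deployment also needs to keep prompts within budget. The sketch below trims the oldest conversation turns to fit; token counts are crudely approximated by whitespace splitting (a real setup would use the model's own tokenizer), and the limit is an assumption taken from the figure quoted above.

```python
# Illustrative sketch: drop oldest turns so the prompt fits the reported
# ~120K-token context window. Whitespace-split "tokens" are a rough stand-in
# for the model's real tokenizer.

CONTEXT_LIMIT = 120_000  # assumed from the reported 120K context figure

def approx_tokens(text: str) -> int:
    return len(text.split())

def trim_history(turns: list[str], limit: int = CONTEXT_LIMIT) -> list[str]:
    """Keep the newest turns whose approximate token total fits the limit."""
    kept, total = [], 0
    for turn in reversed(turns):  # walk newest-first
        t = approx_tokens(turn)
        if total + t > limit:
            break
        kept.append(turn)
        total += t
    return list(reversed(kept))  # restore chronological order

history = ["old " * 10, "recent question"]
print(trim_history(history, limit=5))  # → ['recent question']
```

Trimming from the oldest end preserves the most recent context, which usually matters most for agentic, multi-turn tool use.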