
OpenAI's New OPEN Models - GPT-OSS 120B & 20B

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

OpenAI released two open-weights models (120B and 20B) under an Apache 2.0 license, enabling broad reuse of the weights.

Briefing

OpenAI has released two open-weights language models under an Apache 2.0 license: a 120B-parameter model and a 20B-parameter model. The headline impact is practical: the weights are permissively licensed, making it easier for developers to run, fine-tune, and deploy them without the user-count or usage restrictions that often come with “open” releases. The models are also positioned for agentic workflows—tool use, web search, Python execution, and instruction following—backed by post-training aimed at those tasks.

The naming is where the release gets contentious. “OPEN Models” and “GPT-OSS” suggest open-source, but the models are open-weight rather than fully open-source in the strict sense. True open-source releases typically include training code, checkpoints, and data access for reproducibility; here, the emphasis is on downloadable weights under Apache 2.0. The transcript notes that even internal interpretation of “OSS” appears to land on “open source series,” while the models themselves are treated as open-weight artifacts.

On capability, both models are described as trained in a way “similar” to OpenAI’s o3 and o4 families, using supervised fine-tuning plus reinforcement-style alignment. A key feature is adjustable “reasoning effort” with three levels (low, medium, and high) set via the system prompt. That knob is framed as a latency-versus-performance trade-off, and the transcript stresses that real-world testing will determine how much accuracy improves when the model is allowed more time to think.
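
To make that knob concrete, here is a minimal sketch against an OpenAI-compatible endpoint. The base URL, the model id, and the exact “Reasoning: high” system-prompt phrasing are illustrative assumptions, not details confirmed in the video:

```python
# Minimal sketch of the reasoning-effort knob, assuming an
# OpenAI-compatible endpoint serving gpt-oss. The base URL, model id,
# and system-prompt phrasing below are placeholders for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

def ask(question: str, effort: str = "medium") -> str:
    """Ask the model, steering reasoning depth via the system prompt."""
    response = client.chat.completions.create(
        model="gpt-oss-20b",  # placeholder model id
        messages=[
            # The transcript describes effort as set in the system prompt;
            # low favors latency, high favors deliberate reasoning.
            {"role": "system", "content": f"Reasoning: {effort}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

# Higher effort trades latency for (potentially) better answers.
print(ask("Prove that the square root of 2 is irrational.", effort="high"))
```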

Architecturally, the models appear to be mixture-of-experts. The 120B model runs with only about 5B active parameters, while the 20B model uses 3.6B active parameters, an efficiency profile that the transcript compares to other MoE efforts in the ecosystem. Context length support is also a selling point: both models reportedly handle up to 128K context, likely built on rotary positional embeddings. The models are described as English-only, which limits multilingual use cases for now.
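
A quick back-of-the-envelope estimate, using only the headline numbers above, shows why the active/total distinction and (later) quantization both matter: memory footprint is driven by total parameters, while per-token compute tracks active parameters.

```python
# Rough weight-memory estimate. Memory is driven by TOTAL parameters
# (all experts must be resident); per-token compute is driven by
# ACTIVE parameters. Counts are the headline figures quoted above.
def weight_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (ignores activations and KV cache)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for name, total_b in [("gpt-oss-120b", 120), ("gpt-oss-20b", 20)]:
    print(f"{name}: ~{weight_gb(total_b, 16):.0f} GB at 16-bit, "
          f"~{weight_gb(total_b, 4):.0f} GB at 4-bit")
# gpt-oss-120b: ~240 GB at 16-bit, ~60 GB at 4-bit
# gpt-oss-20b: ~40 GB at 16-bit, ~10 GB at 4-bit
```

This also previews the Triton point below: loaded in 16-bit, even the 20B model outgrows most consumer hardware.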

Benchmarks are presented as strong but not dominant. The transcript highlights that tool-using performance is substantially higher than “without tools,” aligning with the agent-first positioning. Function-calling results are said to surpass o4-mini and come close to o3, and some reasoning-heavy tasks (like competition math and GPQA) improve when longer reasoning chains are enabled. Still, there is skepticism about benchmark saturation: many evaluation sets have been “maxed out” by prior models, raising the possibility of overfitting to common leaderboards.

For deployment, the models can be accessed through OpenRouter and served via OpenAI-compatible endpoints, including the newer “Harmony” response format and SDK tailored to the Responses API. The transcript also flags an important limitation: the knowledge cutoff is June 2024, meaning answers about current events can be stale. Local use is pitched as feasible with the right setup: Ollama support is emphasized, along with a quantization approach that depends on Triton for efficient loading. A practical test shows the 20B model correctly answering general questions but misidentifying the current U.S. president because of the cutoff.
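
For the hosted path, here is a minimal sketch of calling the model through OpenRouter’s OpenAI-compatible API; the model slug is an assumption for illustration and should be checked against OpenRouter’s catalog:

```python
# Hosted access through OpenRouter's OpenAI-compatible API.
# The model slug "openai/gpt-oss-120b" is assumed for illustration;
# confirm the exact id in OpenRouter's model catalog.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Who is the current US president?"}],
)
# Without web search, answers about post-June-2024 events can be stale,
# exactly the failure the transcript's president test demonstrates.
print(response.choices[0].message.content)
```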

Overall, the release is framed as a meaningful step for OpenAI back into open-weights territory after GPT-2, putting pressure on other frontier labs—especially in the West—to publish more models. With GPT-5 reportedly imminent, the transcript suggests attention may quickly shift back to proprietary systems, but the open-weights availability could still unlock a wave of local and agentic experimentation.

Cornell Notes

OpenAI’s new open-weights models, GPT-OSS 120B and GPT-OSS 20B, arrive with an Apache 2.0 license, enabling broad use of the weights without restrictive user-count conditions. Both models target agentic workflows (tool use, web search, Python execution, instruction following) and support adjustable “reasoning effort” levels (low/medium/high) to trade latency for performance. The models appear to be mixture-of-experts, with the 120B using 5B active parameters and the 20B using 3.6B active parameters, helping them run more efficiently than their headline sizes suggest. They support up to 128K context but are described as English-only, and their knowledge cutoff is June 2024, which can cause outdated factual answers. Local deployment is feasible via Ollama with the right quantization setup, while OpenAI-compatible serving is available through Responses-style tooling.

What makes the release practically “open,” and what’s the main caveat about the “OSS” label?

The weights are released under an Apache 2.0 license, which is permissive and avoids the common “open” caveats that restrict usage based on user counts or other conditions. The caveat is that the models are open-weights rather than fully open-source: the transcript argues that true open-source would include training code, checkpoints, and data access for reproducibility, not just downloadable weights.

How do the models let users control reasoning, and why does that matter?

Both models support three reasoning-effort levels—low, medium, and high—intended as a latency-versus-performance trade-off. The setting is controlled through the system prompt, and the transcript emphasizes that users will need to test how much accuracy improves when the model is allowed more “time to think,” especially for reasoning-heavy tasks.

What architectural efficiency detail stands out in the model sizes?

Despite the headline parameter counts, both models use mixture-of-experts with far fewer active parameters at inference. The 120B model reportedly uses 5B active parameters, while the 20B model uses 3.6B active parameters. This makes the models potentially more runnable than their full parameter counts suggest, and it invites comparisons to other MoE approaches in the ecosystem.

What deployment paths are mentioned, and what technical requirement affects local loading?

Access is available via OpenRouter and through OpenAI-compatible serving patterns (including Responses-style formats). For local use, Ollama is highlighted as the easiest starting point. The transcript also notes that efficient local loading depends on Triton because the models use a quantization approach; without Triton, the model may load in 16-bit and become too large for typical hardware.
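
For the local path, a sketch using the ollama Python client follows; the “gpt-oss:20b” tag is an assumption, so confirm it (and pull the model) before running:

```python
# Local sketch via the ollama Python client (pip install ollama).
# Assumes the model was pulled first, e.g. `ollama pull gpt-oss:20b`;
# the tag is illustrative and should be confirmed in the Ollama library.
import ollama

reply = ollama.chat(
    model="gpt-oss:20b",
    messages=[
        {"role": "system", "content": "Reasoning: low"},  # favor latency locally
        {"role": "user", "content": "Explain mixture-of-experts in two sentences."},
    ],
)
print(reply["message"]["content"])
```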

What limitation shows up in factual QA, and what causes it?

In a test about U.S. presidents, the model incorrectly identifies the current president as Joe Biden. The transcript attributes this to the models’ June 2024 knowledge cutoff, meaning they may not reflect events after that date.

How do benchmark results relate to the agentic positioning?

Tool-using performance is described as substantially higher than performance without tools, supporting the idea that the models are tuned for agentic workflows. Function-calling benchmarks are reported as strong, surpassing o4-mini and approaching o3, while reasoning-heavy tasks improve when longer reasoning chains are enabled.
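
The function-calling pattern those benchmarks measure looks like standard OpenAI-style tool calling, sketched below against any compatible endpoint; the endpoint, model id, and get_weather tool are hypothetical placeholders:

```python
# Sketch of OpenAI-style function calling against a server hosting
# gpt-oss. Endpoint, model id, and the get_weather tool are all
# hypothetical placeholders for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-oss-20b",  # placeholder id
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)

# A tool-tuned model should emit a structured call instead of guessing.
call = response.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)
```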

Review Questions

  1. Which specific mechanism lets a user adjust reasoning time, and how is it expected to affect latency and accuracy?
  2. Why might a model answer about current events incorrectly even if it performs well on reasoning benchmarks?
  3. What does “open-weights” imply compared with “open-source,” based on what’s missing from the release?

Key Points

  1. OpenAI released two open-weights models (120B and 20B) under an Apache 2.0 license, enabling broad reuse of the weights.
  2. The “GPT-OSS” naming is disputed because the release is open-weights rather than fully open-source (no training code, checkpoints, or data for reproducibility).
  3. Both models target agentic use cases with post-training for instruction following, tool use, web search, and Python code execution.
  4. Reasoning effort can be set to low/medium/high via the system prompt, creating a latency-versus-performance trade-off that requires testing.
  5. The models support up to 128K context but are described as English-only, limiting multilingual applications.
  6. A June 2024 knowledge cutoff can cause stale factual answers, even when reasoning and tool use look strong.
  7. Local deployment is feasible via Ollama, but efficient quantized loading depends on Triton to avoid loading in 16-bit and exceeding hardware limits.

Highlights

Apache 2.0 licensing makes the weights broadly usable without the common “open” restrictions tied to user counts.
The models include a user-controllable “reasoning effort” setting (low/medium/high) intended to trade speed for better performance.
Mixture-of-experts design is emphasized: the 120B model uses 5B active parameters and the 20B uses 3.6B active parameters.
Despite strong tool and function-calling results, the June 2024 knowledge cutoff can still break up-to-date factual questions.
Local running is practical with Ollama and quantization, but Triton is key to keeping memory requirements manageable.

Topics

  • Open-Weights Models
  • Apache 2.0 Licensing
  • Agentic Tool Use
  • Reasoning Effort
  • 128K Context

Mentioned

  • Sam Witteveen
  • LLM
  • MoE
  • RL
  • API
  • GPU
  • o4
  • o3
  • o4-mini
  • o3-mini
  • Ollama
  • LM Studio
  • Triton
  • 4-bit
  • 16-bit
  • RAG
  • SDK
  • JSON
  • RAM