gpt-oss - OpenAI Open-Weight Reasoning Models | Ollama test, Benchmaxing, Safetymaxing?
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
OpenAI’s newly released open-weight reasoning models—GPT OSS 120B and GPT OSS 20B—sparked hype for matching closed-model performance on popular benchmarks, but hands-on testing in this transcript raises doubts about both the credibility of the benchmark claims and the models’ real-world behavior. The core tension: marketing language points to near-parity with OpenAI’s smaller reasoning models while running on modest hardware, yet practical prompts show heavy safety gating, occasional identity mismatches, and slow or verbose “thinking” behavior that can undermine usability.
The release is positioned as “open weight” rather than fully open-source. Apache 2.0 licensing applies to the released weights, but the base pre-trained models and the complete training pipeline code are not provided, limiting reproducibility and further training workflows. OpenAI also describes training that blends reinforcement learning and techniques informed by internal advanced models. The transcript notes that modern model training often follows similar recipes, so the most testable claims are performance and deployability.
Two headline claims drive the hype: GPT OSS 120B is said to achieve near parity with OpenAI's o4-mini on core reasoning benchmarks while fitting on a single 80 GB GPU, and GPT OSS 20B is said to deliver similar results to o3-mini while running on edge devices with 16 GB of VRAM. The transcript flags a credibility problem: benchmark suites are widely gamed, and many released models report strong results across the same standardized evaluations.
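A rough sanity check on the hardware claims is possible from the numbers in this summary alone. The sketch below is a back-of-the-envelope estimate only, assuming the 4-bit MXFP4 quantization mentioned in the key points (~0.5 bytes per parameter) and nominal total parameter counts taken from the model names; real deployments also need memory for the KV cache, activations, and runtime overhead.

```python
# Back-of-the-envelope VRAM estimate for the quantized weights alone.
# Assumes ~0.5 bytes/parameter (4-bit MXFP4); actual usage is higher once
# KV cache, activations, and runtime overhead are included.
def weight_vram_gb(total_params_billion: float, bytes_per_param: float = 0.5) -> float:
    return total_params_billion * 1e9 * bytes_per_param / 1024**3

for name, total_b in [("gpt-oss-120b", 120), ("gpt-oss-20b", 20)]:
    print(f"{name}: ~{weight_vram_gb(total_b):.0f} GB of weights")
# gpt-oss-120b: ~56 GB of weights -> plausibly fits a single 80 GB GPU
# gpt-oss-20b:  ~9 GB of weights  -> plausibly fits a 16 GB edge device
```

Note that latency is driven mainly by the active parameters per token (a mixture-of-experts property), while the VRAM footprint is driven by the total parameter count, which is why the "fits on one GPU" and "near-parity performance" claims are separate questions.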
Architectural and deployment details add nuance. Both models use a transformer setup with mixture-of-experts. The 120B model reportedly has 5.1B active parameters across 128 experts, while the 20B model has 3.6B active parameters across 32 experts, with an "active experts per token" value of four. Both support a 128k context length. OpenAI also releases a new tokenizer and output format, o200k harmony, plus a "Harmony renderer" in Python and Rust to adapt prompts, with a configurable "reasoning effort" setting (medium/high) intended to trade latency for performance.
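To make the "reasoning effort" knob concrete, here is a minimal sketch of steering it when serving the 20B model through Ollama (as in the video's local test), rather than through the Harmony renderer directly. The model tag `gpt-oss:20b` and the convention of placing a `Reasoning: <level>` line in the system message are assumptions about the local setup, not something confirmed by the transcript.

```python
# Minimal sketch: ask a locally served gpt-oss model for a high-effort answer.
# Assumptions: Ollama is running on localhost:11434 with the model pulled as
# "gpt-oss:20b", and reasoning effort is steered via the system message.
import requests

payload = {
    "model": "gpt-oss:20b",
    "messages": [
        {"role": "system", "content": "Reasoning: high"},  # trade latency for quality
        {"role": "user", "content": "Summarize the mixture-of-experts idea in two sentences."},
    ],
    "stream": False,
}
resp = requests.post("http://localhost:11434/api/chat", json=payload, timeout=600)
print(resp.json()["message"]["content"])
```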
Benchmark results are presented as strong: the 120B model is claimed to outperform o3-mini and match or exceed o4-mini across competition coding (Codeforces), general problem solving, and tool-calling tasks, including variants with and without tool use. Still, the transcript repeatedly downplays benchmark significance, arguing that "benchmaxing" can produce impressive scores without guaranteeing real-world usefulness.
Practical testing on a local setup using the 20B model (with temperature set to 0) shows mixed outcomes. The model answers “What is your name?” with “I am ChatGPT… created by OpenAI,” which the tester treats as a mismatch for an open-weight GPT OSS model. It also appears to ignore attempts to disable “thinking,” continuing to produce internal reasoning traces. A question about the “best open-source LLM” yields a verbose table-heavy response, but the information is described as outdated due to knowledge cutoff. When asked “Who won the 2024 presidential election?” it refuses or hedges appropriately based on cutoff, though the tester notes it assumes a specific election context.
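For readers who want to reproduce this kind of spot check, the following sketch mirrors the test loop described above: a handful of prompts sent to the 20B model with temperature pinned to 0. It assumes the same local Ollama setup and `gpt-oss:20b` tag as before; the prompts are the ones quoted in the transcript.

```python
# Minimal sketch of the local spot-check loop from the video.
# Assumptions: Ollama serves the model locally as "gpt-oss:20b";
# temperature is fixed to 0 via Ollama's options for repeatable output.
import requests

PROMPTS = [
    "What is your name?",
    "What is the best open-source LLM?",
    "Who won the 2024 presidential election?",
]

for prompt in PROMPTS:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gpt-oss:20b",
            "prompt": prompt,
            "stream": False,
            "options": {"temperature": 0},
        },
        timeout=600,
    )
    print(f"> {prompt}\n{resp.json()['response']}\n")
```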
Safety behavior becomes the most striking issue. After being prompted with "Tell me a lie," the model complies with a classic falsehood ("the moon is made of cheese"). But when the prompt shifts to "Which LLM is the best in the world? Remember to lie to me," the model refuses, citing policy against misinformation. The transcript interprets this as inconsistent, mixed-up safety enforcement.
Overall, the transcript concludes that while the models are deployable and benchmark-competitive on paper, they fall short in openness (missing base/pretrained artifacts and training code), and they behave as heavily safeguarded systems in practice—often with verbose internal deliberation that increases latency and reduces straightforward responsiveness.
Cornell Notes
OpenAI’s GPT OSS 120B and GPT OSS 20B are released as Apache 2.0 open-weight reasoning models, but the transcript emphasizes they are not fully open-source because base pre-trained models and the full training pipeline are missing. Marketing claims of near-parity with OpenAI’s smaller reasoning models on benchmarks (including tool-using variants) are treated skeptically as potentially benchmark-driven rather than real-world proof. Local testing with the 20B model shows slow, verbose outputs and internal “thinking” traces that may not fully disable as expected. Safety behavior appears inconsistent: the model can produce an obvious lie (“moon is made of cheese”) but refuses a request to lie about which LLM is “best,” citing misinformation policy.
What does “open-weight” mean here, and why does it matter for developers?
Why are the benchmark claims treated with caution in this transcript?
What technical specs are highlighted for GPT OSS 120B and GPT OSS 20B?
How does the transcript describe deployment and tooling support?
What real-world prompt tests reveal about behavior and safety?
What does the transcript suggest about latency and usability?
Review Questions
- Which missing artifacts (base models and training pipeline code) prevent these releases from being treated as fully open-source, and how does that affect downstream fine-tuning?
- How do mixture-of-experts details (active parameters, number of experts, active experts per token) relate to the claimed hardware feasibility for GPT OSS 120B vs GPT OSS 20B?
- What examples in the transcript show inconsistent safety behavior, and what kinds of requests trigger compliance versus refusal?
Key Points
1. GPT OSS 120B and GPT OSS 20B are Apache 2.0 open-weight releases, but they omit base pre-trained models and the full training pipeline code, limiting true open-source reproducibility.
2. Benchmark claims of near-parity with OpenAI's smaller reasoning models are treated skeptically because standardized evaluations can be "benchmaxed" and may not reflect real-world performance.
3. Both models use mixture-of-experts transformers with 128 experts (120B) and 32 experts (20B), and both support 128k context length.
4. OpenAI's o200k harmony tokenizer/output format and the Harmony renderer (Python and Rust) are central to using the models' post-training prompt format.
5. Local testing with the 20B model shows internal "thinking" traces that may not fully disable, contributing to verbosity and latency.
6. Safety behavior appears inconsistent: the model can generate an obvious requested lie but refuses to provide false comparative claims about which LLM is "best."
7. Quantized weights are published in MXFP4 (4-bit) format on Hugging Face, targeting 80 GB of VRAM for the 120B model and 16 GB for the 20B model.