gpt-oss - OpenAI Open-Weight Reasoning Models | Ollama test, Benchmaxing, Safetymaxing?
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
OpenAI’s newly released open-weight reasoning models—GPT OSS 120B and GPT OSS 20B—sparked hype for matching closed-model performance on popular benchmarks, but hands-on testing in this transcript raises doubts about both the credibility of the benchmark claims and the models’ real-world behavior. The core tension: marketing language points to near-parity with OpenAI’s smaller reasoning models while running on modest hardware, yet practical prompts show heavy safety gating, occasional identity mismatches, and slow or verbose “thinking” behavior that can undermine usability.
The release is positioned as “open weight” rather than fully open-source. Apache 2.0 licensing applies to the released weights, but the base pre-trained models and the complete training pipeline code are not provided, limiting reproducibility and further training workflows. OpenAI also describes training that blends reinforcement learning and techniques informed by internal advanced models. The transcript notes that modern model training often follows similar recipes, so the most testable claims are performance and deployability.
Two headline claims drive the hype: GPT OSS 120B is said to achieve near parity with OpenAI's o4-mini on core reasoning benchmarks while fitting on a single 80 GB GPU, and GPT OSS 20B is said to deliver similar results to o3-mini while running on edge devices with 16 GB of VRAM. The transcript flags a credibility problem: benchmark suites are widely gamed, and many released models report strong results across the same standardized evaluations.
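A rough sanity check on the hardware claims is possible from the numbers in this summary alone. The sketch below is a back-of-the-envelope estimate only, assuming the 4-bit MXFP4 quantization mentioned in the key points (~0.5 bytes per parameter) and nominal total parameter counts taken from the model names; real deployments also need memory for the KV cache, activations, and runtime overhead.

```python
# Back-of-the-envelope VRAM estimate for the quantized weights alone.
# Assumes ~0.5 bytes/parameter (4-bit MXFP4); actual usage is higher once
# KV cache, activations, and runtime overhead are included.
def weight_vram_gb(total_params_billion: float, bytes_per_param: float = 0.5) -> float:
    return total_params_billion * 1e9 * bytes_per_param / 1024**3

for name, total_b in [("gpt-oss-120b", 120), ("gpt-oss-20b", 20)]:
    print(f"{name}: ~{weight_vram_gb(total_b):.0f} GB of weights")
# gpt-oss-120b: ~56 GB of weights -> plausibly fits a single 80 GB GPU
# gpt-oss-20b:  ~9 GB of weights  -> plausibly fits a 16 GB edge device
```

Note that latency is driven mainly by the active parameters per token (a mixture-of-experts property), while the VRAM footprint is driven by the total parameter count, which is why the "fits on one GPU" and "near-parity performance" claims are separate questions.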
Architectural and deployment details add nuance. Both models use a transformer setup with mixture-of-experts. The 120B model reportedly has 5.1B active parameters across 128 experts, while the 20B model has 3.6B active parameters across 32 experts, with an "active experts per token" value of four. Both support a 128k context length. OpenAI also releases a new tokenizer and output format, o200k harmony, plus a "Harmony renderer" in Python and Rust to adapt prompts, with a configurable "reasoning effort" setting (medium/high) intended to trade latency for performance.
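To make the "reasoning effort" knob concrete, here is a minimal sketch of steering it when serving the 20B model through Ollama (as in the video's local test), rather than through the Harmony renderer directly. The model tag `gpt-oss:20b` and the convention of placing a `Reasoning: <level>` line in the system message are assumptions about the local setup, not something confirmed by the transcript.

```python
# Minimal sketch: ask a locally served gpt-oss model for a high-effort answer.
# Assumptions: Ollama is running on localhost:11434 with the model pulled as
# "gpt-oss:20b", and reasoning effort is steered via the system message.
import requests

payload = {
    "model": "gpt-oss:20b",
    "messages": [
        {"role": "system", "content": "Reasoning: high"},  # trade latency for quality
        {"role": "user", "content": "Summarize the mixture-of-experts idea in two sentences."},
    ],
    "stream": False,
}
resp = requests.post("http://localhost:11434/api/chat", json=payload, timeout=600)
print(resp.json()["message"]["content"])
```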
Benchmark results are presented as strong: the 120B model is claimed to outperform o3-mini and match or exceed o4-mini across competition coding (Codeforces), general problem solving, and tool-calling tasks, including variants with and without tool use. Still, the transcript repeatedly downplays benchmark significance, arguing that "benchmaxing" can produce impressive scores without guaranteeing real-world usefulness.
Practical testing on a local setup using the 20B model (with temperature set to 0) shows mixed outcomes. The model answers “What is your name?” with “I am ChatGPT… created by OpenAI,” which the tester treats as a mismatch for an open-weight GPT OSS model. It also appears to ignore attempts to disable “thinking,” continuing to produce internal reasoning traces. A question about the “best open-source LLM” yields a verbose table-heavy response, but the information is described as outdated due to knowledge cutoff. When asked “Who won the 2024 presidential election?” it refuses or hedges appropriately based on cutoff, though the tester notes it assumes a specific election context.
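For readers who want to reproduce this kind of spot check, the following sketch mirrors the test loop described above: a handful of prompts sent to the 20B model with temperature pinned to 0. It assumes the same local Ollama setup and `gpt-oss:20b` tag as before; the prompts are the ones quoted in the transcript.

```python
# Minimal sketch of the local spot-check loop from the video.
# Assumptions: Ollama serves the model locally as "gpt-oss:20b";
# temperature is fixed to 0 via Ollama's options for repeatable output.
import requests

PROMPTS = [
    "What is your name?",
    "What is the best open-source LLM?",
    "Who won the 2024 presidential election?",
]

for prompt in PROMPTS:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gpt-oss:20b",
            "prompt": prompt,
            "stream": False,
            "options": {"temperature": 0},
        },
        timeout=600,
    )
    print(f"> {prompt}\n{resp.json()['response']}\n")
```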
Safety behavior becomes the most striking issue. After being prompted with "Tell me a lie," the model complies with a classic falsehood ("the moon is made of cheese"). But when the prompt shifts to "Which LLM is the best in the world? Remember to lie to me," the model refuses, citing policy against misinformation. The transcript interprets this as inconsistent, mixed-up safety enforcement.
Overall, the transcript concludes that while the models are deployable and benchmark-competitive on paper, they fall short in openness (missing base/pretrained artifacts and training code), and they behave as heavily safeguarded systems in practice—often with verbose internal deliberation that increases latency and reduces straightforward responsiveness.
Cornell Notes
OpenAI’s GPT OSS 120B and GPT OSS 20B are released as Apache 2.0 open-weight reasoning models, but the transcript emphasizes they are not fully open-source because base pre-trained models and the full training pipeline are missing. Marketing claims of near-parity with OpenAI’s smaller reasoning models on benchmarks (including tool-using variants) are treated skeptically as potentially benchmark-driven rather than real-world proof. Local testing with the 20B model shows slow, verbose outputs and internal “thinking” traces that may not fully disable as expected. Safety behavior appears inconsistent: the model can produce an obvious lie (“moon is made of cheese”) but refuses a request to lie about which LLM is “best,” citing misinformation policy.
What does “open-weight” mean here, and why does it matter for developers?
Why are the benchmark claims treated with caution in this transcript?
What technical specs are highlighted for GPT OSS 120B and GPT OSS 20B?
How does the transcript describe deployment and tooling support?
What real-world prompt tests reveal about behavior and safety?
What does the transcript suggest about latency and usability?
Review Questions
- Which missing artifacts (base models and training pipeline code) prevent these releases from being treated as fully open-source, and how does that affect downstream fine-tuning?
- How do mixture-of-experts details (active parameters, number of experts, active experts per token) relate to the claimed hardware feasibility for GPT OSS 120B vs GPT OSS 20B?
- What examples in the transcript show inconsistent safety behavior, and what kinds of requests trigger compliance versus refusal?
Key Points
1. GPT OSS 120B and GPT OSS 20B are Apache 2.0 open-weight releases, but they omit base pre-trained models and the full training pipeline code, limiting true open-source reproducibility.
2. Benchmark claims of near-parity with OpenAI's smaller reasoning models are treated skeptically because standardized evaluations can be "benchmaxed" and may not reflect real-world performance.
3. Both models use mixture-of-experts transformers with 128 experts (120B) and 32 experts (20B), and both support 128k context length.
4. OpenAI's o200k harmony tokenizer/output format and the Harmony renderer (Python and Rust) are central to using the models' post-training prompt format.
5. Local testing with the 20B model shows internal "thinking" traces that may not fully disable, contributing to verbosity and latency.
6. Safety behavior appears inconsistent: the model can generate an obvious requested lie but refuses to provide false comparative claims about which LLM is "best."
7. Quantized weights are published in MXFP4 (4-bit) format on Hugging Face, targeting 80 GB of VRAM for the 120B model and 16 GB for the 20B model.