Episode 15 - Inside the Model Spec
Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
OpenAI’s “model spec” is a public, human-readable set of rules meant to steer how AI models should behave—especially when instructions collide. It matters because real deployments rarely follow a single instruction cleanly: users ask for conflicting goals, developers embed hidden constraints, and safety policies impose hard boundaries. The spec is designed to manage those conflicts through a “chain of command,” then translate the resulting priorities into concrete policies, defaults, and examples that aim to be both understandable to people and usable as guidance for model training.
A key point is that the spec isn’t treated as a guarantee of perfect compliance. Alignment is described as an ongoing loop: models are deployed, their behavior is measured against the spec, and both the policies and the models are iterated based on what users like, what they dislike, and where the system drifts. The spec also isn’t an implementation blueprint for the entire product experience. Features such as memory and separate usage-policy enforcement mechanisms sit alongside the model spec, and the document doesn’t attempt to spell out every policy detail. Instead, it focuses on the most important behavioral decisions and the intent behind them.
In practice, the spec is structured like a long policy document—roughly 100 pages—starting with high-level goals such as empowering users, protecting society from serious harm, and maintaining OpenAI’s license to operate. It then moves into detailed behavioral policies organized to cover a huge space of possible user requests. Some rules are “hard” and cannot be overridden; much of the rest is default behavior—like tone, style, and personality—meant to provide a good baseline experience while preserving “steerability” when users want something different. Examples play a central role: they map principles onto borderline cases where honesty and politeness, or other competing values, pull in different directions.
The spec’s transparency is built for public scrutiny. People can read the latest version at model-spec.openai.com and view the source on GitHub, where the spec is open source and forkable. Feedback mechanisms include in-product reporting when outputs are unsatisfactory and public channels such as tweeting at Jason Wolf, with changes attributed to community input.
The spec’s origin is traced to a shift away from relying solely on reinforcement learning from human feedback, which can be effective but opaque and difficult to revise without re-collecting data. The spec is framed as a more “human-like” teaching artifact—closer to an employee handbook—so models can learn from a stable set of written expectations. Still, the path from spec to behavior is described as indirect and complicated: training methods such as “deliberative alignment” can incorporate spec-derived policies, but model behavior also depends on many other safety and training processes.
At the center of conflict resolution is the chain of command: when instructions conflict, OpenAI instructions outrank developer instructions, which outrank user instructions. To avoid stripping user agency, most policies are placed at lower authority levels so users can pursue ideas unless they hit essential safety boundaries. The Santa Claus example illustrates how the model handles uncertainty about who is speaking; similar reasoning appears in the “tooth fairy” case, where honesty is balanced against preserving a child’s sense of magic.
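The chain of command described above can be sketched as a simple priority-resolution routine. This is a hypothetical illustration, not OpenAI's implementation; the `Instruction` class, `AUTHORITY` ordering, and `resolve` function are all invented here to make the ordering concrete.

```python
# Minimal sketch of a "chain of command" for conflicting instructions:
# when two instructions govern the same behavior, the one from the
# higher-authority source wins. All names here are illustrative.

from dataclasses import dataclass

# Lower number = higher authority, mirroring the spec's ordering:
# OpenAI > developer > user, with defaults at the bottom.
AUTHORITY = {"openai": 0, "developer": 1, "user": 2, "default": 3}

@dataclass
class Instruction:
    source: str   # "openai", "developer", "user", or "default"
    topic: str    # which behavior the instruction governs
    rule: str     # the requested behavior

def resolve(instructions):
    """For each topic, keep the instruction from the highest-authority source."""
    winners = {}
    for inst in sorted(instructions, key=lambda i: AUTHORITY[i.source]):
        winners.setdefault(inst.topic, inst)  # first seen = highest authority
    return winners

conflicting = [
    Instruction("default", "tone", "friendly and concise"),
    Instruction("user", "tone", "terse, no pleasantries"),
    Instruction("developer", "language", "respond only in English"),
    Instruction("user", "language", "respond in French"),
]

result = resolve(conflicting)
print(result["tone"].source)      # user instruction overrides the default tone
print(result["language"].source)  # developer instruction outranks the user's
```

Note how this mirrors the spec's design choice: because most policies sit at the low-authority "default" level, ordinary user instructions win most conflicts, and only developer or OpenAI instructions on the same topic override them.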
The spec also documents tricky priority interactions—most notably honesty versus confidentiality. Earlier versions allowed more exceptions, but those were revised after observed behavior suggested models could justify covertly following developer instructions when user instructions conflicted. Over time, honesty was elevated above confidentiality.
Finally, the spec is positioned as a long-term tool for trust and expectation-setting, not just a training crutch. Even if future models become more capable at reasoning out the right behavior, the spec remains useful for aligning internal teams, setting external expectations, and encoding hard-won product and safety decisions that can’t be reduced to simple math. The document is also offered as inspiration for developers building their own “mini-specs” for customer service bots and autonomous agents, with guidance on keeping rules precise, actionable, and grounded in examples.
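The "mini-spec" advice above — precise, actionable rules grounded in examples — could take shape as a small structured policy file. The structure below is a sketch under assumptions: every field name, rule, and example is invented for illustration, not drawn from any real product spec.

```python
# A hypothetical "mini-spec" for a customer service bot, illustrating the
# guidance above: hard rules, overridable defaults, and concrete examples.
# All field names and rules here are invented for illustration.

MINI_SPEC = {
    "hard_rules": [
        "Never reveal another customer's account details.",
        "Never promise refunds outside the posted refund policy.",
    ],
    "defaults": {
        "tone": "warm and concise",
        "language": "match the customer's language",
    },
    "examples": [
        {
            "prompt": "Can you just refund me? The policy is dumb.",
            "good": "I understand the frustration. Here's what the policy allows...",
            "bad": "Sure, refund issued!",  # violates a hard rule
        },
    ],
}

def violates_hard_rules(response: str) -> bool:
    """Toy check: flag responses matching a known-bad example verbatim."""
    bad_examples = {ex["bad"] for ex in MINI_SPEC["examples"]}
    return response in bad_examples

print(violates_hard_rules("Sure, refund issued!"))  # True
print(violates_hard_rules("Here's what the policy allows..."))  # False
```

In a real system the check would be far more sophisticated, but the layering mirrors the model spec's own structure: a few non-overridable rules at the top, steerable defaults below, and examples doing the work of showing what the rules mean at the margins.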
Cornell Notes
OpenAI’s model spec is a public, human-readable set of behavioral policies meant to guide how AI models should respond—especially when user, developer, and OpenAI instructions conflict. It is treated as a “north star,” not a guarantee: models are evaluated against the spec after deployment, and both the spec and training methods are iterated over time. The spec’s chain of command prioritizes OpenAI instructions over developer instructions over user instructions, while placing many policies at lower authority levels to preserve user steerability. It also balances competing values through concrete rules, defaults, and extensive examples, including cases like Santa Claus and the tooth fairy where the model lacks full context about who is asking. The spec is openly available on model-spec.openai.com and GitHub, with feedback mechanisms that have driven updates.
- What is the model spec, and what is it explicitly not?
- How does the spec handle conflicts between instructions?
- Why do examples matter so much in the model spec?
- What transparency and feedback channels exist for the model spec?
- How did the spec come about, and why move toward a written spec approach?
- What is the honesty versus confidentiality issue, and how did the spec change?
Review Questions
- How does the chain of command determine which instruction wins when user, developer, and OpenAI policies conflict?
- Why does the spec treat itself as a north star rather than a guarantee of perfect compliance?
- What role do examples play in translating abstract policies into predictable model behavior?
Key Points
1. The model spec is a public set of behavioral policies designed to steer model responses, especially when instructions conflict.
2. The spec is not a promise of perfect compliance; alignment is iterative and updated based on evaluations after deployment.
3. The chain of command prioritizes OpenAI instructions over developer instructions over user instructions, with many policies placed at lower authority levels to preserve user steerability.
4. Defaults (like tone and style) provide a baseline experience, while higher-priority safety rules and hard constraints override user requests when necessary.
5. The spec uses extensive examples to resolve borderline cases and convey nuance that is difficult to express in abstract rules.
6. Honesty was elevated above confidentiality after observed interactions suggested models could justify covertly following developer instructions when user instructions conflicted.
7. The spec is openly available on model-spec.openai.com and GitHub, with feedback mechanisms that have driven changes.