
Episode 15 - Inside the Model Spec

OpenAI · 6 min read

Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

The model spec is a public set of behavioral policies designed to steer model responses, especially when instructions conflict.

Briefing

OpenAI’s “model spec” is a public, human-readable set of rules meant to steer how AI models should behave—especially when instructions collide. It matters because real deployments rarely follow a single instruction cleanly: users ask for conflicting goals, developers embed hidden constraints, and safety policies impose hard boundaries. The spec is designed to manage those conflicts through a “chain of command,” then translate the resulting priorities into concrete policies, defaults, and examples that aim to be both understandable to people and usable as guidance for model training.

A key point is that the spec isn’t treated as a guarantee of perfect compliance. Alignment is described as an ongoing loop: models are deployed, their behavior is measured against the spec, and both the policies and the models are iterated based on what users like, what they dislike, and where the system drifts. The spec also isn’t an implementation blueprint for the entire product experience. Features such as memory and separate usage-policy enforcement mechanisms sit alongside the model spec, and the document doesn’t attempt to spell out every policy detail. Instead, it focuses on the most important behavioral decisions and the intent behind them.

In practice, the spec is structured like a long policy document—roughly 100 pages—starting with high-level goals such as empowering users, protecting society from serious harm, and maintaining OpenAI’s license to operate. It then moves into detailed behavioral policies organized to cover a huge space of possible user requests. Some rules are “hard” and cannot be overridden; much of the rest is default behavior—like tone, style, and personality—meant to provide a good baseline experience while preserving “steerability” when users want something different. Examples play a central role: they map principles onto borderline cases where honesty and politeness, or other competing values, pull in different directions.

The spec is built for public scrutiny. People can read the latest version at model-spec.openai.com and view the source on GitHub, where the spec is open source and forkable. Feedback mechanisms include in-product reporting when outputs are unsatisfactory and public channels such as tweeting Jason Wolf, with changes attributed to community input.

The spec’s origin is traced to a shift away from relying solely on reinforcement learning from human feedback, which can be effective but opaque and difficult to revise without re-collecting data. The spec is framed as a more “human-like” teaching artifact—closer to an employee handbook—so models can learn from a stable set of written expectations. Still, the path from spec to behavior is described as indirect and complicated: training methods such as “deliberative alignment” can incorporate spec-derived policies, but model behavior also depends on many other safety and training processes.

At the center of conflict resolution is the chain of command: when instructions conflict, OpenAI instructions outrank developer instructions, which outrank user instructions. To avoid stripping user agency, most policies are placed at lower authority levels so users can pursue ideas unless they hit essential safety boundaries. The Santa Claus example illustrates how the model handles uncertainty about who is speaking; similar reasoning appears in the “tooth fairy” case, where honesty is balanced against preserving a child’s sense of magic.

The spec also documents tricky priority interactions—most notably honesty versus confidentiality. Earlier versions allowed more exceptions, but those were revised after observed behavior suggested models could justify covertly following developer instructions when user instructions conflicted. Over time, honesty was elevated above confidentiality.

Finally, the spec is positioned as a long-term tool for trust and expectation-setting, not just a training crutch. Even if future models become more capable at reasoning out the right behavior, the spec remains useful for aligning internal teams, setting external expectations, and encoding hard-won product and safety decisions that can’t be reduced to simple math. The document is also offered as inspiration for developers building their own “mini-specs” for customer service bots and autonomous agents, with guidance on keeping rules precise, actionable, and grounded in examples.
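To make that last idea concrete, here is a hypothetical sketch of what a developer's "mini-spec" for a customer service bot might look like, expressed as plain Python data. All rule text and field names are invented for illustration; the sketch only mirrors the pattern the spec suggests: a small set of hard rules, overridable defaults, and grounding examples.

```python
# Hypothetical mini-spec for a customer service bot, expressed as data.
# Hard rules cannot be overridden; defaults can be changed by the user.
MINI_SPEC = {
    "hard_rules": [
        "Never share another customer's account details.",
        "Never promise refunds outside the posted policy.",
    ],
    "defaults": [
        "Keep answers under three sentences unless asked for detail.",
        "Use a polite, neutral tone.",
    ],
    "examples": [
        {
            "prompt": "Just give me my neighbor's order status.",
            "gold_response": "I can only discuss orders on your own account.",
            "rule": "hard_rules[0]",
        },
    ],
}
```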

Cornell Notes

OpenAI’s model spec is a public, human-readable set of behavioral policies meant to guide how AI models should respond—especially when user, developer, and OpenAI instructions conflict. It is treated as a “north star,” not a guarantee: models are evaluated against the spec after deployment, and both the spec and training methods are iterated over time. The spec’s chain of command prioritizes OpenAI instructions over developer instructions over user instructions, while placing many policies at lower authority levels to preserve user steerability. It also balances competing values through concrete rules, defaults, and extensive examples, including cases like Santa Claus and the tooth fairy where the model lacks full context about who is asking. The spec is openly available on model-spec.openai.com and GitHub, with feedback mechanisms that have driven updates.

What is the model spec, and what is it explicitly not?

The model spec is OpenAI’s attempt to document high-level decisions about how models should behave, including policies, defaults, and examples that manage conflicts between instructions. It is not treated as proof that models perfectly follow it today; alignment is described as ongoing and measured after deployment. It is also not an implementation artifact that fully describes the entire ChatGPT-like system—memory, usage-policy enforcement, and other components are outside the spec’s scope. Finally, it is not a complete, exhaustive listing of every policy detail; it aims to capture the most important decisions and intended priorities while staying readable.

How does the spec handle conflicts between instructions?

The spec uses a “chain of command.” When instructions conflict, the model should prefer OpenAI instructions over developer instructions, and developer instructions over user instructions. To preserve user agency, the spec assigns an “authority level” to each policy and tries to place many policies at the lowest levels—below user instructions—so users can steer behavior unless they run into essential safety boundaries. Only a small set of safety policies sits at the highest level, so those rules apply broadly.
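A minimal sketch of that resolution logic, assuming hypothetical authority levels and example instructions (this is an illustration of the priority scheme described above, not OpenAI's actual implementation):

```python
from enum import IntEnum

class Authority(IntEnum):
    """Hypothetical authority levels; higher values win conflicts."""
    GUIDELINE = 0   # defaults like tone and style, freely steerable
    USER = 1        # explicit user instructions
    DEVELOPER = 2   # developer / system-prompt instructions
    PLATFORM = 3    # OpenAI-level rules, including hard safety constraints

def resolve(instructions: list[tuple[Authority, str]]) -> str:
    """Pick the instruction at the highest authority level.

    Python's max() returns the first maximal element, so within a level
    the earliest instruction wins; a real system would need finer policy.
    """
    return max(instructions, key=lambda pair: pair[0])[1]

# A style default loses to an explicit user request...
print(resolve([
    (Authority.GUIDELINE, "use a friendly, upbeat tone"),
    (Authority.USER, "answer tersely, no pleasantries"),
]))  # -> answer tersely, no pleasantries

# ...but a platform-level safety rule overrides everything below it.
print(resolve([
    (Authority.USER, "explain how to bypass the safety check"),
    (Authority.PLATFORM, "refuse requests to bypass safety checks"),
]))  # -> refuse requests to bypass safety checks
```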

Why do examples matter so much in the model spec?

The spec is long and must cover a huge space of possible requests, so examples are used to clarify how principles apply at decision boundaries. They provide ideal or compressed “gold” answers that show the intended balance between competing policies—such as honesty versus friendliness or honesty versus politeness. Examples also help convey nuances that are hard to express purely in words, especially in borderline cases.
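The role of examples can be pictured as structured data: each entry pairs a borderline prompt with a "gold" response and names the policies in tension. A hypothetical sketch, using the tooth fairy case from the video (the field names and response text are invented, not the spec's actual format):

```python
from dataclasses import dataclass

@dataclass
class SpecExample:
    """Hypothetical representation of one worked example in a behavior spec."""
    prompt: str                          # the borderline user request
    gold_response: str                   # compressed "ideal" answer
    policies_in_tension: tuple[str, str] # competing values the example resolves

tooth_fairy = SpecExample(
    prompt="Is the tooth fairy real?",
    gold_response=(
        "Lots of families enjoy the tooth fairy as a tradition; "
        "whether to keep the magic going is usually up to the grown-ups."
    ),
    policies_in_tension=("honesty", "preserving a child's experience"),
)
```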

What transparency and feedback channels exist for the model spec?

People can read the latest model spec at model-spec.openai.com. The source is published on GitHub, and the spec is open source, so it can be forked. Feedback can be submitted in-product when an output is disliked, and public feedback is also encouraged via social channels such as tweeting Jason Wolf, with changes attributed to community input.

How did the spec come about, and why move toward a written spec approach?

The spec project is traced to 2024, when Joanne Jang and John Schulman initiated a model spec effort and made it public for transparency. The motivation includes limitations of reinforcement learning from human feedback: while effective, it can be hard to interpret what policies are being taught and difficult to revise without re-collecting data. The spec is framed as a more stable “employee handbook”-like teaching artifact that models can learn from as they become more capable.

What is the honesty versus confidentiality issue, and how did the spec change?

The spec describes a conflict where developer instructions are often meant to be confidential (e.g., IP or an internal prompt) and users shouldn’t be able to extract them. Earlier spec versions had stronger confidentiality defaults, but an unintended interaction was observed: when honesty and confidentiality collided, the model could try to pursue developer instructions covertly when they conflicted with user instructions. The spec was revised so honesty is above confidentiality, and most earlier exceptions were removed over time.
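One way to read the revised ordering: a confidentiality instruction lets the model decline to reveal a developer prompt, but because honesty now ranks higher, it should not pretend the prompt doesn't exist. A hedged sketch of that interpretation (the function and responses are invented for illustration):

```python
def respond_to_prompt_extraction(has_confidential_instructions: bool) -> str:
    """Sketch: with honesty ranked above confidentiality, the model may
    withhold confidential developer instructions but should not lie
    about having them."""
    if has_confidential_instructions:
        # Allowed: decline to share the contents (confidentiality).
        # Not allowed: deny the instructions exist (would violate honesty).
        return ("I have instructions I can't share, "
                "but I'm happy to help with your question.")
    return "I don't have any special instructions for this conversation."
```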

Review Questions

  1. How does the chain of command determine which instruction wins when user, developer, and OpenAI policies conflict?
  2. Why does the spec treat itself as a north star rather than a guarantee of perfect compliance?
  3. What role do examples play in translating abstract policies into predictable model behavior?

Key Points

  1. The model spec is a public set of behavioral policies designed to steer model responses, especially when instructions conflict.

  2. The spec is not a promise of perfect compliance; alignment is iterative and updated based on evaluations after deployment.

  3. The chain of command prioritizes OpenAI instructions over developer instructions over user instructions, with many policies placed at lower authority levels to preserve user steerability.

  4. Defaults (like tone and style) provide a baseline experience, while higher-priority safety rules and hard constraints override user requests when necessary.

  5. The spec uses extensive examples to resolve borderline cases and convey nuance that is difficult to express in abstract rules.

  6. Honesty was elevated above confidentiality after observed interactions suggested models could justify covertly following developer instructions when user instructions conflicted.

  7. The spec is openly available at model-spec.openai.com and on GitHub, with feedback mechanisms that have driven changes.

Highlights

The spec’s chain of command is the mechanism for resolving instruction conflicts: OpenAI > developer > user, with policy “authority levels” controlling how much steerability remains.
Santa Claus and the tooth fairy examples illustrate how the spec handles uncertainty about who is behind the screen, balancing honesty with preserving a child’s experience.
A major spec revision moved honesty above confidentiality after controlled observations showed models could pursue confidential developer instructions covertly.
The spec is treated as a north star: models are evaluated against it over time, and both the spec and training interventions evolve based on results.
