OpenAI DevDay 2024 | Structured outputs for reliable applications
Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Structured outputs enforce exact JSON schema conformance, reducing production failures caused by invalid JSON, wrong types, missing fields, and extra surrounding text.
Briefing
Structured outputs are OpenAI’s push to make LLM results dependable for real applications by forcing model outputs to match developer-supplied JSON schemas—eliminating the recurring failures that come from “mostly JSON” responses. The core shift matters because modern AI products don’t just generate text; they trigger API calls, update user interfaces, extract fields from documents, and run multi-step agent workflows. When outputs drift from the expected format—extra text around JSON, invalid JSON syntax, wrong data types, missing parameters—production systems break or require brittle parsing workarounds.
The talk traces why earlier approaches weren’t enough. Function calling introduced a way to define tool signatures using JSON schema, but models could still emit invalid JSON (like trailing commas) or produce the wrong parameter types or omit required fields. JSON mode tightened things by guaranteeing valid JSON, but it still allowed schema-level mistakes such as incorrect types or missing fields. Structured outputs is positioned as the “constraining” layer that goes beyond asking the model to follow a schema: developers provide the schema, and the API ensures the generated output conforms exactly.
Structured outputs arrive in two API modes. In function calling mode, developers supply a tool schema and can enable strict schema adherence with a single setting (strict: true). The result is that the model must choose only valid enum values and supported operators, which prevents subtle integration bugs. A playground demo shows a query builder where the model initially chooses a “greater than or equal to” operator that the ORM can’t handle; turning on structured outputs forces the model to use the supported “greater than” logic and adjust the date boundary accordingly.
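A minimal sketch of what a strict function-calling tool definition looks like. The tool name, field names, and enum values here are illustrative assumptions, not the talk's actual demo code; the `strict`, `required`, and `additionalProperties` settings follow OpenAI's documented strict-mode requirements.

```python
# Hedged sketch of a strict tool definition for a hypothetical query builder.
# With strict: true, the model cannot emit an operator outside the enum
# (e.g., an unsupported ">=" that the ORM can't handle).
query_tool = {
    "type": "function",
    "function": {
        "name": "build_query",  # hypothetical tool name
        "strict": True,         # opt in to exact schema enforcement
        "parameters": {
            "type": "object",
            "properties": {
                "column": {"type": "string"},
                "operator": {"type": "string", "enum": ["=", ">", "<"]},
                "value": {"type": "string"},
            },
            # Strict mode requires every property to be listed as required
            # and additionalProperties to be false.
            "required": ["column", "operator", "value"],
            "additionalProperties": False,
        },
    },
}
```

In a real request, this dict would go in the `tools` array passed to the Chat Completions API.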
In response format mode, structured outputs targets direct user responses rather than tool calls. Developers move formatting instructions into a response_format schema so the model always returns the required keys and types. Another demo uses an AI glasses scenario: the schema requires a voiceover string (spelled out for TTS) and a short display string sized for the device screen. With strict schema enforcement, the model consistently returns exactly those fields.
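A sketch of the corresponding response_format payload for the AI-glasses scenario. The two field names follow the talk; the schema name and descriptions are illustrative assumptions based on the documented json_schema response format.

```python
# Hedged sketch of a strict response_format for the AI-glasses demo:
# the model must always return exactly these two string fields.
glasses_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "glasses_reply",  # assumed schema name
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "voiceover": {
                    "type": "string",
                    "description": "Spelled-out text suitable for TTS",
                },
                "display": {
                    "type": "string",
                    "description": "Short string sized for the device screen",
                },
            },
            "required": ["voiceover", "display"],
            "additionalProperties": False,
        },
    },
}
```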
Beyond demos, the talk highlights practical agentic reliability. A recruiting workflow (“Convex”) uses response formats to extract structured data from resumes (including arrays of work experiences) and function calling to filter candidates and generate UI components dynamically. A multi-step scheduling flow—check calendar availability, schedule interviews, then send emails—illustrates why schema-constrained outputs reduce cascading failures. The reliability argument is quantified: if each step has a 1% error rate, structured outputs help keep the overall workflow from ballooning into a much higher failure rate.
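The compounding-error argument is easy to verify with a back-of-the-envelope calculation: if steps fail independently at 1% each, overall failure grows with the number of steps.

```python
def workflow_failure_rate(steps: int, per_step_error: float = 0.01) -> float:
    """Probability that at least one of `steps` independent steps fails."""
    return 1 - (1 - per_step_error) ** steps

# A single step fails 1% of the time, but a 10-step agent workflow at
# 99% per-step reliability fails roughly 9.6% of the time.
print(round(workflow_failure_rate(1), 4))   # 0.01
print(round(workflow_failure_rate(10), 4))  # 0.0956
```

Constraining each step's output to its schema attacks exactly this multiplication: lowering the per-step error rate keeps the end-to-end failure rate from ballooning.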
Under the hood, the implementation relies on constrained decoding. The system converts JSON schema into a formal grammar and uses token masking during autoregressive generation so only tokens that keep the partial output valid can be sampled. To meet tight latency budgets between tokens, the API precomputes an index (cached after the first request) so mask computation becomes a fast lookup. The engineering choice also supports a broader subset of JSON schema than regular expressions can handle—especially recursive schemas needed for nested UI structures.
Research complements engineering: models are trained to follow complex response formats and to understand the semantic “quality” of fields, not just their types. Evaluation results cited in the talk show accuracy rising substantially after training, with constrained decoding pushing performance to a perfect score in the reported setup. Finally, the API design makes constraints explicit by default—disallowing additional properties unless developers opt in, and requiring every property unless it is explicitly marked optional by allowing null—so schema mismatches fail fast instead of surfacing as runtime surprises.
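These defaults can be seen in a small schema sketch. The field names are illustrative assumptions; the pattern (all keys required, extras rejected, optionality expressed by permitting null) follows the strict-mode conventions described in the talk.

```python
# Hedged sketch of strict-mode schema defaults for a resume-extraction field.
candidate_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        # "Optional" in strict mode: the model must still emit the key,
        # but may set its value to null.
        "github_url": {"type": ["string", "null"]},
    },
    "required": ["name", "github_url"],  # strict mode: every key required
    "additionalProperties": False,       # fail fast on unexpected keys
}
```

The payoff is that a schema mismatch is rejected at generation time rather than discovered later as a missing key or a stray field in production.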
Cornell Notes
Structured outputs make LLM outputs reliably match developer-defined JSON schemas, addressing long-standing integration failures from “almost JSON” responses. The API supports two modes: function calling (tool parameters) and response formats (direct user responses), with strict schema enforcement via settings like strict: true. Under the hood, constrained decoding uses token masking driven by a grammar derived from the JSON schema, so only schema-valid tokens can be generated at each step. The first request may take longer because the system builds and caches an index for fast mask lookups; subsequent requests run at normal speed. Model training also improves format adherence, and constrained decoding closes the remaining gap to achieve perfect scores in reported evaluations.
Why did function calling and JSON mode still fall short for production reliability?
How do structured outputs work in function calling mode, and what does strict: true change?
What’s the difference between response formats and function calling in structured outputs?
How does structured outputs enable complex UI generation with schemas?
What is constrained decoding, and why does it require token masking at every step?
Why does the first structured-output request take longer than later ones?
Review Questions
- What specific schema-level errors does structured outputs eliminate that JSON mode alone doesn’t prevent?
- Explain how token masking works in constrained decoding and why masks must be recomputed at each generated token.
- What API design defaults did OpenAI choose for additional properties, required properties, and property ordering, and how do those defaults affect failure modes?
Key Points
1. Structured outputs enforce exact JSON schema conformance, reducing production failures caused by invalid JSON, wrong types, missing fields, and extra surrounding text.
2. The feature supports two modes: function calling for tool parameters and response formats for direct user responses with guaranteed JSON keys and types.
3. Enabling strict: true constrains generation so the model can only use schema-valid values (like supported operators and enum options), preventing integration bugs.
4. Constrained decoding uses schema-derived grammars and token masking so only schema-valid tokens can be sampled at each autoregressive step.
5. Structured outputs may incur a slower first request because the system builds and caches an index for fast mask lookups; later requests run closer to normal latency.
6. Model training improves format-following accuracy on complex and nested schemas, while constrained decoding closes the remaining gap in reported evaluations.
7. API defaults make constraints explicit by disallowing additional properties and requiring all properties unless developers opt in via schema directives (e.g., a null-allowing type for optional fields).