
OpenAI DevDay 2024 | Structured outputs for reliable applications

OpenAI · 6 min read

Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing.

TL;DR

Structured outputs enforce exact JSON schema conformance, reducing production failures caused by invalid JSON, wrong types, missing fields, and extra surrounding text.

Briefing

Structured outputs are OpenAI’s push to make LLM results dependable for real applications by forcing model outputs to match developer-supplied JSON schemas—eliminating the recurring failures that come from “mostly JSON” responses. The core shift matters because modern AI products don’t just generate text; they trigger API calls, update user interfaces, extract fields from documents, and run multi-step agent workflows. When outputs drift from the expected format—extra text around JSON, invalid JSON syntax, wrong data types, missing parameters—production systems break or require brittle parsing workarounds.

The talk traces why earlier approaches weren’t enough. Function calling introduced a way to define tool signatures using JSON schema, but models could still emit invalid JSON (like trailing commas), produce the wrong parameter types, or omit required fields. JSON mode tightened things by guaranteeing valid JSON, but it still allowed schema-level mistakes such as incorrect types or missing fields. Structured outputs is positioned as the “constraining” layer that goes beyond asking the model to follow a schema: developers provide the schema, and the API ensures the generated output conforms exactly.

Structured outputs arrives in two API modes. In function calling mode, developers supply a tool schema and can enable strict schema adherence with a single setting (strict: true). The result is that the model must choose only valid enum values and supported operators, which prevents subtle integration bugs. A playground demo shows a query builder where the model initially chooses a “greater than or equal to” operator that the ORM can’t handle; turning on structured outputs forces the model to use the supported “greater than” logic and adjust the date boundary accordingly.
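A strict tool definition along these lines might look like the sketch below. The tool name, field names, and enum values are hypothetical stand-ins for the query-builder demo; the `"strict": True` flag and the overall tool shape follow OpenAI’s function-calling API.

```python
# Sketch of a strict function-calling tool, assuming hypothetical names.
# Restricting "operator" to an enum of ORM-supported values is what prevents
# the model from emitting an unsupported ">=" comparison.
filter_tool = {
    "type": "function",
    "function": {
        "name": "filter_candidates",  # hypothetical tool name
        "description": "Filter records by a date condition.",
        "strict": True,               # enable exact schema conformance
        "parameters": {
            "type": "object",
            "properties": {
                "field": {"type": "string", "enum": ["applied_at", "updated_at"]},
                # Only operators the ORM supports — no "gte" here.
                "operator": {"type": "string", "enum": ["gt", "lt", "eq"]},
                "value": {"type": "string"},
            },
            "required": ["field", "operator", "value"],
            "additionalProperties": False,
        },
    },
}
```

In a real request this dict would be passed as one entry of the `tools` list.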

In response format mode, structured outputs targets direct user responses rather than tool calls. Developers move formatting instructions into a response_format schema so the model always returns the required keys and types. Another demo uses an AI glasses scenario: the schema requires a voiceover string (spelled out for TTS) and a short display string sized for the device screen. With strict schema enforcement, the model consistently returns exactly those fields.
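A `response_format` for the glasses demo could be sketched as follows. The schema name and field names are illustrative; the `json_schema`/`strict` wrapper follows the shape of OpenAI’s response-format API.

```python
# Sketch of a response_format schema for the AI-glasses scenario:
# a spelled-out voiceover string for TTS plus a short display string.
glasses_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "glasses_reply",  # hypothetical schema name
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "voiceover": {
                    "type": "string",
                    "description": "Full response, numbers spelled out for TTS.",
                },
                "display": {
                    "type": "string",
                    "description": "Short text sized for the device screen.",
                },
            },
            "required": ["voiceover", "display"],
            "additionalProperties": False,
        },
    },
}
```

With strict enforcement on, every response contains exactly these two keys with these types.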

Beyond demos, the talk highlights practical agentic reliability. A recruiting workflow (“Convex”) uses response formats to extract structured data from resumes (including arrays of work experiences) and function calling to filter candidates and generate UI components dynamically. A multi-step scheduling flow—check calendar availability, schedule interviews, then send emails—illustrates why schema-constrained outputs reduce cascading failures. The reliability argument is quantified: if each step has a 1% error rate, structured outputs help keep the overall workflow from ballooning into a much higher failure rate.
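The compounding argument is simple arithmetic, sketched here for a hypothetical five-step flow:

```python
# Why per-step errors compound: with a 1% error rate per step, a multi-step
# workflow fails far more often than 1% overall.
def workflow_success(per_step_success: float, steps: int) -> float:
    """Probability that every step in the workflow succeeds."""
    return per_step_success ** steps

# A 5-step flow at 99% per-step reliability:
p = workflow_success(0.99, 5)
overall_failure = 1 - p  # ≈ 4.9% overall failure rate, not 1%
```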

Under the hood, the implementation relies on constrained decoding. The system converts JSON schema into a formal grammar and uses token masking during autoregressive generation so only tokens that keep the partial output valid can be sampled. To meet tight latency budgets between tokens, the API precomputes an index (cached after the first request) so mask computation becomes a fast lookup. The engineering choice also supports a broader subset of JSON schema than regular expressions can handle—especially recursive schemas needed for nested UI structures.

Research complements engineering: models are trained to follow complex response formats and to understand the semantic “quality” of fields, not just their types. Evaluation results cited in the talk show accuracy rising substantially after training, with constrained decoding pushing performance to a perfect score in the reported setup. Finally, the API design makes constraints explicit by default—disallowing additional properties unless developers opt in, requiring properties unless marked optional via nullable—so schema mismatches fail fast instead of surfacing as runtime surprises.
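Those defaults surface directly in how a schema must be written. In the sketch below (field names are illustrative), every property is listed as required, unknown keys are rejected, and an optional field is expressed as a union with null rather than by omission:

```python
# How the explicit-constraint defaults look in practice: all properties are
# required, extra keys fail fast, and optionality is modeled as nullability.
resume_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        # Optional field: still listed as required, but allowed to be null.
        "linkedin_url": {"type": ["string", "null"]},
    },
    "required": ["name", "linkedin_url"],  # every property appears here
    "additionalProperties": False,         # unknown keys are rejected
}
```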

Cornell Notes

Structured outputs make LLM outputs reliably match developer-defined JSON schemas, addressing long-standing integration failures from “almost JSON” responses. The API supports two modes: function calling (tool parameters) and response formats (direct user responses), with strict schema enforcement via settings like strict: true. Under the hood, constrained decoding uses token masking driven by a grammar derived from the JSON schema, so only schema-valid tokens can be generated at each step. The first request may take longer because the system builds and caches an index for fast mask lookups; subsequent requests run at normal speed. Model training also improves format adherence, and constrained decoding closes the remaining gap to achieve perfect scores in reported evaluations.

Why did function calling and JSON mode still fall short for production reliability?

Function calling could still produce invalid JSON (for example, trailing commas) and could generate schema-level mistakes like wrong parameter types or missing parameters. JSON mode improved syntax by ensuring valid JSON, but it didn’t guarantee schema correctness—models could still output the wrong types (e.g., a string where a float is expected) or hallucinate parameters. Structured outputs targets those remaining failure modes by enforcing exact schema conformance.
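The gap JSON mode leaves open is easy to demonstrate: a payload can parse cleanly while still violating the schema.

```python
import json

# JSON mode guarantees syntactic validity, not schema correctness: this
# parses without error, yet "score" arrives as a string where a float
# was expected — exactly the schema-level failure described above.
raw = '{"name": "Ada", "score": "0.92"}'
parsed = json.loads(raw)  # no syntax error — JSON mode is satisfied
wrong_type = not isinstance(parsed["score"], float)  # schema-level failure
```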

How do structured outputs work in function calling mode, and what does strict: true change?

In function calling mode, developers provide a tool schema (including parameter types and constraints like enum). The API exposes the schema to the model so tool calls are generated with parameters that fit the schema. Setting strict: true constrains generation so the model must use only valid schema constructs—preventing cases where it chooses unsupported operators or invalid enum values. A demo shows a query builder where the model initially uses “greater than or equal to,” which the ORM can’t handle; strict schema enforcement forces the model to use the supported “greater than” operator and adjust the date boundary.

What’s the difference between response formats and function calling in structured outputs?

Function calling is for cases where the model emits tool calls and parameters for external actions. Response formats are for cases where the model responds directly to users but still needs a guaranteed JSON structure. In response formats, developers move formatting requirements into a response_format schema so the model always returns exactly the required keys and types—useful for UI rendering and TTS pipelines.

How does structured outputs enable complex UI generation with schemas?

The talk describes a recruiting app where the model generates React components dynamically. A schema defines a top-level component (like a card) with children that can include other components (header, bar chart, etc.). Structured outputs supports recursive schema definitions using `$defs` and `$ref`, allowing nested component trees. Constrained decoding ensures the generated component structure remains valid even with recursion.
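A recursive component schema of this kind might be sketched as follows. The component types are illustrative; the recursion mechanism is standard JSON Schema `$defs`/`$ref`.

```python
# Sketch of a recursive UI-component schema: a component whose children
# are themselves components, enabling arbitrarily nested trees.
ui_schema = {
    "$defs": {
        "component": {
            "type": "object",
            "properties": {
                "type": {"type": "string", "enum": ["card", "header", "bar_chart"]},
                "children": {
                    "type": "array",
                    "items": {"$ref": "#/$defs/component"},  # recursion
                },
            },
            "required": ["type", "children"],
            "additionalProperties": False,
        }
    },
    "$ref": "#/$defs/component",  # the top level is itself a component
}
```

This kind of self-reference is what a regular expression cannot express, motivating the context-free-grammar approach discussed later.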

What is constrained decoding, and why does it require token masking at every step?

Constrained decoding restricts which tokens can be sampled during generation so the partial output always remains compatible with the schema. Because generation is autoregressive (one token at a time), the set of valid next tokens changes after each sampled token. That means token masks must be updated at every inference step, not just once at the beginning.
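A toy illustration of the per-step mask update, using single characters as the “vocabulary” and a constraint that the output must spell `true` or `false` (a deliberately tiny stand-in for a real grammar):

```python
# After each sampled token the set of valid continuations shrinks, so the
# mask must be recomputed at every autoregressive step, not just once.
TARGETS = ["true", "false"]

def valid_next_tokens(prefix: str) -> set[str]:
    """Tokens that keep the partial output a prefix of some valid string."""
    return {t[len(prefix)] for t in TARGETS
            if t.startswith(prefix) and len(t) > len(prefix)}

# At the start both words are live; after sampling "t" only "true" remains:
# valid_next_tokens("")  -> {"t", "f"}
# valid_next_tokens("t") -> {"r"}
```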

Why does the first structured-output request take longer than later ones?

The system converts the JSON schema into a grammar, builds a parser, and then iterates over possible parser states and vocabulary tokens to create an index (a tree) that supports fast mask lookup. Building this index is computationally expensive, so it’s done once and cached. After caching, subsequent requests can reuse the index, keeping inference latency low.
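The precompute-and-cache idea can be sketched with a toy state machine standing in for the real grammar (this is not OpenAI’s implementation, just the shape of the optimization):

```python
from functools import lru_cache

# Toy transition table: parser state -> {legal token: next state}.
GRAMMAR = {
    "start": {"{": "in_obj"},
    "in_obj": {'"': "in_key", "}": "done"},
    "in_key": {'"': "in_obj"},
}

@lru_cache(maxsize=None)  # "built once and cached" after the first request
def build_mask_index(grammar_id: str) -> dict[str, frozenset[str]]:
    """Expensive one-time pass: which tokens are legal in each state."""
    return {state: frozenset(edges) for state, edges in GRAMMAR.items()}

def allowed_tokens(state: str) -> frozenset[str]:
    """The per-token-step mask computation becomes a fast dict lookup."""
    return build_mask_index("toy")[state]
```

The point is the split: the loop over states and vocabulary happens once, while the inner decoding loop only pays for a lookup.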

Review Questions

  1. What specific schema-level errors does structured outputs eliminate that JSON mode alone doesn’t prevent?
  2. Explain how token masking works in constrained decoding and why masks must be recomputed at each generated token.
  3. What API design defaults did OpenAI choose for additional properties, required properties, and property ordering, and how do those defaults affect failure modes?

Key Points

  1. Structured outputs enforce exact JSON schema conformance, reducing production failures caused by invalid JSON, wrong types, missing fields, and extra surrounding text.
  2. The feature supports two modes: function calling for tool parameters and response formats for direct user responses with guaranteed JSON keys and types.
  3. Enabling strict: true constrains generation so the model can only use schema-valid values (like supported operators and enum options), preventing integration bugs.
  4. Constrained decoding uses schema-derived grammars and token masking so only schema-valid tokens can be sampled at each autoregressive step.
  5. Structured outputs may incur a slower first request because the system builds and caches an index for fast mask lookups; later requests run closer to normal latency.
  6. Model training improves format-following accuracy on complex and nested schemas, while constrained decoding closes the remaining gap in reported evaluations.
  7. API defaults make constraints explicit by disallowing additional properties and requiring properties unless developers opt in via schema directives (e.g., nullable for optional fields).

Highlights

Structured outputs replace “asking nicely” with hard constraints: the API forces outputs to match the developer’s JSON schema exactly.
Strict schema enforcement can fix subtle logic bugs—like choosing an unsupported ORM operator—by constraining the model to allowed schema constructs.
Under the hood, constrained decoding turns JSON schema into a grammar and uses token masking during generation, with an index cached after the first request.
Recursive schemas (needed for nested UI component trees) can’t be handled reliably with regular expressions, so the system uses a context-free grammar approach with memory via a stack.
Reliability is framed as cumulative: schema-constrained outputs reduce how per-step errors compound in multi-step agent workflows.

Topics

Mentioned

  • Atty Eleti
  • Michelle Pokrass
  • TTS
  • ORM
  • GPU
  • API
  • JSON
  • CFG