
Build Hour: Responses API

OpenAI · 6 min read

Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

The Responses API is built around an agentic loop that can perform multiple model samples and tool calls within a single request, producing a final answer only after tool outputs are incorporated.

Briefing

The Responses API is OpenAI’s new “core primitive” for building agentic, multimodal applications—designed to fix pain points from chat completions by enabling multi-step tool use, clearer typed outputs, and faster long tool-calling rollouts. Instead of treating each request as a single model sample, the Responses API runs an agentic loop inside one call: the model can think, call tools (including hosted tools and MCP servers), receive results, and then sample again until it produces a final answer. That shift matters because modern models are increasingly agentic and multimodal, and developers need a single API shape that can handle everything from simple text responses to multi-minute workflows.
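As a rough sketch of that shape, here is a minimal Responses call with a hosted tool, assuming the current openai Python SDK (the model name, prompt, and tool choice are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# One request; the API may sample the model and run hosted tools
# server-side multiple times before returning a final answer.
response = client.responses.create(
    model="gpt-5",  # illustrative model name
    input="What's the weather like for a picnic in Paris this weekend?",
    tools=[{"type": "web_search"}],  # hosted tool, executed server-side
)

print(response.output_text)  # SDK convenience accessor for the final text
```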

The migration pitch is grounded in a clear historical arc. OpenAI’s early completions API matched an era where models simply continued text. Chat completions arrived with conversational training and quickly became the default, later gaining features like tool calling and vision. But newer model families (described as agentic and highly multimodal) demand longer, more interactive execution patterns—especially when tools, code execution, and state management are involved. The Responses API is positioned as the platform layer that can keep up as model capabilities evolve.

A central design change is “items in, items out.” In chat completions, tool calls and other actions were bolted onto a message-centric interface, making it harder to reason about mixed outputs. In the Responses API, everything is an item type—messages, function calls, MCP calls, reasoning artifacts, and more—so developers can handle results with straightforward control flow (for example, iterating over items and switching by type). The API also supports rehydrating context across requests: reasoning can be preserved and passed back so reasoning models can continue where they left off. OpenAI claims this improves tool-calling performance (citing a 5% gain on the Tau-bench tool-calling benchmark) and boosts speed for long multi-turn, multi-tool rollouts (citing ~20% faster at P50), alongside lower cost due to fewer repeated “thinking” tokens.

The Responses API also modernizes multimodal and streaming behavior. For multimodal workflows, it makes it easier to pass images via base64 or external URLs and supports “context stuffing” by letting developers send files like PDFs for extraction and analysis. Streaming is redesigned to emit a finite set of strongly typed events (such as text deltas, start/finish/failure signals, and tool-call lifecycle events) rather than forcing developers to accumulate opaque “object deltas.”

To help developers move from chat completions, OpenAI provides a migration pack powered by Codex CLI that rewrites a simple chat app: it converts conversation mapping to input items, switches the model to GPT-5, updates streaming handling to the Responses event model, and adds encrypted reasoning content for chain-of-thought rehydration. A second demo (“OpenAI simulator”) shows how reasoning summaries can make UIs more responsive while waiting for GPT-5, and how MCP tool calls can drive real actions in an external task board (listing issues, creating issues) with approval gates.
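A hedged before/after sketch of the kind of rewrite the migration pack performs (model names and the prompt are illustrative; the real pack also handles streaming and encrypted reasoning content):

```python
from openai import OpenAI

client = OpenAI()

# Before: chat completions — a message-centric, single-sample request.
chat = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize our Q3 results."}],
)
print(chat.choices[0].message.content)

# After: Responses — input items in, typed items out.
resp = client.responses.create(
    model="gpt-5",
    input=[{"role": "user", "content": "Summarize our Q3 results."}],
)
print(resp.output_text)
```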

Finally, the session previews Agent Kit as a drag-and-drop workflow layer built on top of the Responses API, including a decision node that routes between a web-search agent and a conversational agent. Q&A then addressed practical concerns: using few-shot examples to reduce hallucinated JSON, prompt caching via stable context prefixes, and how MCP tool calls work under the hood (tool discovery first, then model-emitted JSON arguments, then server execution and acknowledgement).

Cornell Notes

OpenAI’s Responses API replaces chat completions as the main building block for agentic applications. It runs an “agentic loop” inside a single request, letting models think, call tools, receive tool outputs, and sample again until a final answer is ready. The API’s “items in, items out” design breaks actions (messages, function calls, MCP calls, reasoning artifacts) into typed units, making it easier to stream, persist, and route results. Responses is also built for reasoning models and multimodal workflows, including reasoning rehydration across requests and simpler image/file inputs. OpenAI claims better tool-calling performance (5% on a tool-calling eval) and faster, cheaper long multi-tool rollouts (~20% faster at P50) due to preserved planning and fewer repeated thinking tokens.

What problem with chat completions does the Responses API try to fix, and how does it do that?

Chat completions effectively samples the model once per request, which makes multi-step agent behavior awkward: the model must “think again” between steps when tools are involved. The Responses API instead uses an agentic loop so a single request can include multiple model samples. A typical flow is: the model writes code, the code interpreter tool executes server-side, the tool output is returned to the model, and then the model samples again to produce a final answer. This design is meant to support longer, tool-heavy workflows that match newer agentic and multimodal model behavior.
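A minimal sketch of that flow using the hosted code interpreter tool (the prompt and container setting are illustrative, assuming the current Responses tool schema):

```python
from openai import OpenAI

client = OpenAI()

# One request: the model writes code, the hosted code interpreter
# executes it server-side, and the model samples again with the
# tool output before producing the final answer.
response = client.responses.create(
    model="gpt-5",
    input="Compute the standard deviation of [3, 7, 7, 19] and explain it.",
    tools=[{"type": "code_interpreter", "container": {"type": "auto"}}],
)

print(response.output_text)
```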

What does “items in, items out” mean in practice, and why does it matter for developers?

In the Responses API, everything is represented as an item with a specific type—examples include message items, function call items, MCP call items, and reasoning items. This replaces the message-centric interface where tool calls were bolted on and harder to reason about. With typed items, developers can iterate over outputs and use switch/for-loop logic to handle each type differently—such as rendering UI elements, persisting results, or feeding selected items back into the next request.
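A hedged sketch of that control flow (the prompt is illustrative; item and part type names follow the documented Responses schema):

```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5",
    input="Plan a three-day trip to Kyoto.",
)

# Every output is a typed item; plain control flow handles each kind.
for item in response.output:
    if item.type == "message":
        # Message items carry content parts such as output_text.
        for part in item.content:
            if part.type == "output_text":
                print(part.text)
    elif item.type == "function_call":
        print(f"tool requested: {item.name}({item.arguments})")
    elif item.type == "reasoning":
        pass  # persist and pass back next turn for rehydration
```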

How does reasoning rehydration work across requests, and what performance impact is claimed?

Responses can preserve reasoning artifacts (like a reasoning item) and allow them to be passed back into subsequent requests so the model can continue with the same chain of thought. OpenAI describes the API as stateful by default when passing items back, and also notes options for stateless usage via encrypted reasoning content. The claimed impact is improved tool-calling performance: a cited 5% increase on the Tau-bench tool-calling benchmark when comparing Responses to chat completions.
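A hedged sketch of both modes, assuming the documented previous_response_id parameter and the reasoning.encrypted_content include option (prompts are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# Stateful: let OpenAI store the response and chain by ID, so
# reasoning items are rehydrated automatically on the next turn.
first = client.responses.create(model="gpt-5", input="Diagnose this failing test...")
followup = client.responses.create(
    model="gpt-5",
    input="Continue with the next hypothesis.",
    previous_response_id=first.id,
)

# Stateless: opt out of storage, request encrypted reasoning content,
# and pass the output items back yourself on the next request.
first = client.responses.create(
    model="gpt-5",
    input="Diagnose this failing test...",
    store=False,
    include=["reasoning.encrypted_content"],
)
followup = client.responses.create(
    model="gpt-5",
    input=first.output + [{"role": "user", "content": "Continue."}],
    store=False,
    include=["reasoning.encrypted_content"],
)
```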

What changes to streaming and event handling does Responses introduce?

Chat completions streaming emits “object deltas,” which forces developers to accumulate many incremental events to reconstruct what happened. Responses streaming emits a finite number of strongly typed events—such as text deltas for incremental tokens and lifecycle events for start/finish/failure. This makes it easier to write a small switch statement over event types and build predictable UI updates.
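A minimal sketch of consuming the typed stream (event names follow the documented response.* pattern; the prompt is illustrative):

```python
from openai import OpenAI

client = OpenAI()

stream = client.responses.create(
    model="gpt-5",
    input="Write a haiku about typed events.",
    stream=True,
)

# A small, finite set of strongly typed events instead of object deltas.
for event in stream:
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)  # incremental tokens
    elif event.type == "response.completed":
        print("\n[done]")
    elif event.type == "response.failed":
        print("\n[failed]")
```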

How does the API support multimodal workflows like images and PDFs?

Responses is designed for multimodal inputs: it supports passing images as base64 or external URLs for vision tasks. It also supports context stuffing by accepting files such as PDFs; the system extracts content and provides it to the model so it can answer questions grounded in the document (the demo example asks why a bill was high in a specific month after passing a PDF).
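A hedged sketch of both input styles (the URL and filename are placeholders; part type names follow the documented input schema):

```python
import base64
from openai import OpenAI

client = OpenAI()

# Image input by external URL (a base64 data URL works the same way).
vision = client.responses.create(
    model="gpt-5",
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "What's in this photo?"},
            {"type": "input_image", "image_url": "https://example.com/photo.jpg"},
        ],
    }],
)

# Context stuffing: send a PDF inline; extraction happens server-side.
with open("march_bill.pdf", "rb") as f:  # placeholder filename
    pdf_b64 = base64.b64encode(f.read()).decode()

billing = client.responses.create(
    model="gpt-5",
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "Why was my bill so high in March?"},
            {"type": "input_file",
             "filename": "march_bill.pdf",
             "file_data": f"data:application/pdf;base64,{pdf_b64}"},
        ],
    }],
)
print(billing.output_text)
```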

How do MCP tool calls work end-to-end in Responses?

When an MCP server is enabled, Responses first contacts the MCP server to discover which tools are available for the user. It then provides the model with a namespaced list of function definitions. The model chooses a tool and emits JSON arguments; Responses sends that JSON back to the MCP server for execution. The MCP server returns an acknowledgement or structured result, which the model can then use to produce a final summarized response. The demo also shows approval gating (requiring user confirmation before executing certain MCP calls).
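A hedged sketch of wiring up an MCP server with an approval gate (the server label, URL, and prompt are placeholders):

```python
from openai import OpenAI

client = OpenAI()

mcp_tools = [{
    "type": "mcp",
    "server_label": "taskboard",              # placeholder label
    "server_url": "https://example.com/mcp",  # placeholder server
    "require_approval": "always",             # gate every tool call
}]

# Discovery happens first: Responses fetches the server's tool list
# and exposes it to the model as namespaced function definitions.
response = client.responses.create(
    model="gpt-5",
    input="List my open issues, then create one titled 'Fix login bug'.",
    tools=mcp_tools,
)

# With approval required, each execution pauses as an approval request.
for item in response.output:
    if item.type == "mcp_approval_request":
        response = client.responses.create(
            model="gpt-5",
            previous_response_id=response.id,
            tools=mcp_tools,
            input=[{
                "type": "mcp_approval_response",
                "approval_request_id": item.id,
                "approve": True,  # a real app would confirm with the user
            }],
        )

print(response.output_text)
```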

Review Questions

  1. In what ways does the agentic loop in Responses change the number of times a model must “think” compared with chat completions during multi-tool workflows?
  2. How does the typed “items” interface simplify UI rendering and backend persistence compared with message-centric tool calling?
  3. What mechanisms does Responses provide for maintaining context and reasoning across requests (and what changes when using stateless mode)?

Key Points

  1. The Responses API is built around an agentic loop that can perform multiple model samples and tool calls within a single request, producing a final answer only after tool outputs are incorporated.
  2. The “items in, items out” model replaces message-centric output handling by representing messages, tool calls, MCP calls, and reasoning as typed items that are easier to route and persist.
  3. Reasoning rehydration across requests is a first-class feature for reasoning models, with OpenAI claiming improved tool-calling performance (5% on a tool-calling benchmark) and faster long rollouts (~20% at P50).
  4. Responses modernizes multimodal workflows with simpler image inputs (base64 or URLs) and context stuffing for files like PDFs.
  5. Streaming is redesigned to emit strongly typed events (text deltas, lifecycle signals, tool-call states), reducing the need to accumulate “object deltas.”
  6. OpenAI provides a Codex CLI-powered migration pack to convert chat completions integrations to Responses, including updates to input mapping, streaming handling, and encrypted reasoning content.
  7. MCP integration follows a discovery-then-execution pattern: Responses queries the MCP server for available tools, the model emits JSON arguments, and the MCP server executes and returns results (optionally with approval gates).

Highlights

  • The Responses API’s agentic loop lets a single request include multiple tool calls and multiple model samples, avoiding the “think again between steps” problem common in chat completions.
  • Typed “items” (including reasoning and tool-call items) make it easier to build deterministic UI and backend logic using switch/for-loop handling.
  • OpenAI claims long multi-tool rollouts are faster and cheaper with Responses because planning can be rehydrated and fewer “thinking” tokens are emitted.
  • MCP support is built in: tool discovery happens first, then the model emits JSON arguments, then the MCP server executes and returns structured results.
  • A migration pack can automatically rewrite a chat app from chat completions to Responses, including streaming and reasoning changes.

Topics

  • Responses API
  • Agentic Loop
  • Items In Items Out
  • MCP Tool Calling
  • Reasoning Rehydration
