Build Hour: Responses API
Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Responses API is OpenAI’s new “core primitive” for building agentic, multimodal applications—designed to fix pain points from chat completions by enabling multi-step tool use, clearer typed outputs, and faster long tool-calling rollouts. Instead of treating each request as a single model sample, the Responses API runs an agentic loop inside one call: the model can think, call tools (including hosted tools and MCP servers), receive results, and then sample again until it produces a final answer. That shift matters because modern models are increasingly agentic and multimodal, and developers need a single API shape that can handle everything from simple text responses to multi-minute workflows.
The migration pitch is grounded in a clear historical arc. OpenAI’s early completions API matched an era where models simply continued text. Chat completions arrived with conversational training and quickly became the default, later gaining features like tool calling and vision. But newer model families (described as agentic and highly multimodal) demand longer, more interactive execution patterns—especially when tools, code execution, and state management are involved. The Responses API is positioned as the platform layer that can keep up as model capabilities evolve.
A central design change is “items in, items out.” In chat completions, tool calls and other actions were bolted onto a message-centric interface, making it harder to reason about mixed outputs. In the Responses API, everything is an item type—messages, function calls, MCP calls, reasoning artifacts, and more—so developers can handle results with straightforward control flow (for example, iterating over items and switching by type). The API also supports rehydrating context across requests: reasoning can be preserved and passed back so reasoning models can continue where they left off. OpenAI claims this improves tool-calling performance (citing a 5% gain on the Tau-bench tool-calling benchmark) and boosts speed for long multi-turn, multi-tool rollouts (citing ~20% faster at P50), alongside lower cost due to fewer repeated “thinking” tokens.
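The “iterate over items and switch by type” pattern can be sketched as plain control flow. The item shapes below are illustrative approximations of what the Responses API returns, not exact SDK types:

```python
# Sketch of "items in, items out" handling: route each output item by type.
# The dict shapes are simplified stand-ins for the API's typed output items.

def handle_items(items):
    """Collect text, tool calls, and reasoning from a list of output items."""
    text_parts, tool_calls, reasoning = [], [], []
    for item in items:
        if item["type"] == "message":
            # A message item carries content parts; keep the text ones.
            for part in item["content"]:
                if part["type"] == "output_text":
                    text_parts.append(part["text"])
        elif item["type"] == "function_call":
            tool_calls.append((item["name"], item["arguments"]))
        elif item["type"] == "reasoning":
            reasoning.append(item)
        # Other item types (mcp_call, etc.) would get their own branches.
    return "".join(text_parts), tool_calls, reasoning

items = [
    {"type": "reasoning", "summary": []},
    {"type": "function_call", "name": "get_weather", "arguments": '{"city": "Paris"}'},
    {"type": "message", "content": [{"type": "output_text", "text": "It is sunny."}]},
]
text, calls, thoughts = handle_items(items)
```

Because every action is a typed item rather than a field hanging off a message, the same loop handles a pure-text reply, a tool-calling turn, or a mixed output without special cases.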
The Responses API also modernizes multimodal and streaming behavior. For multimodal workflows, it makes it easier to pass images via base64 or external URLs and supports “context stuffing” by letting developers send files like PDFs for extraction and analysis. Streaming is redesigned to emit a finite set of strongly typed events (such as text deltas, start/finish/failure signals, and tool-call lifecycle events) rather than forcing developers to accumulate opaque “object deltas.”
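The typed-event model means a streaming consumer is a dispatch over known event names instead of an accumulator of opaque deltas. The event-type strings below follow the pattern described in the talk and should be treated as illustrative:

```python
# Sketch of consuming strongly typed streaming events: text deltas are
# appended to a buffer, lifecycle events are tracked by name.

def consume_stream(events):
    """Accumulate text deltas and record lifecycle events by type."""
    buffer, lifecycle = [], []
    for event in events:
        etype = event["type"]
        if etype == "response.output_text.delta":
            buffer.append(event["delta"])
        elif etype in ("response.created", "response.completed", "response.failed"):
            lifecycle.append(etype)
        # Tool-call lifecycle events would get their own branches here.
    return "".join(buffer), lifecycle

events = [
    {"type": "response.created"},
    {"type": "response.output_text.delta", "delta": "Hel"},
    {"type": "response.output_text.delta", "delta": "lo"},
    {"type": "response.completed"},
]
text, lifecycle = consume_stream(events)
```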
To help developers move from chat completions, OpenAI provides a migration pack powered by Codex CLI that rewrites a simple chat app: it converts conversation mapping to input items, switches the model to GPT-5, updates streaming handling to the Responses event model, and adds encrypted reasoning content for chain-of-thought rehydration. A second demo (“OpenAI simulator”) shows how reasoning summaries can make UIs more responsive while waiting for GPT-5, and how MCP tool calls can drive real actions in an external task board (listing issues, creating issues) with approval gates.
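The core of the “conversation mapping to input items” step can be sketched as a request-body transform. This is a deliberate simplification: plain role/content messages carry over as input items, while streaming handling and reasoning options are migrated separately:

```python
# Sketch of migrating a Chat Completions request body to a Responses request
# body. Simplified: role/content messages pass through as input items.

def migrate_request(chat_request):
    """Convert a chat.completions-style payload to a responses-style payload."""
    return {
        "model": chat_request["model"],
        # Chat "messages" become "input" items; in this simplified view,
        # plain role/content messages are already valid input items.
        "input": chat_request["messages"],
        "stream": chat_request.get("stream", False),
    }

old = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Summarize this PDF."}],
    "stream": True,
}
new = migrate_request(old)
```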
Finally, the session previews Agent Kit as a drag-and-drop workflow layer built on top of the Responses API, including a decision node that routes between a web-search agent and a conversational agent. Q&A then addressed practical concerns: using few-shot examples to reduce hallucinated JSON, prompt caching via stable context prefixes, and how MCP tool calls work under the hood (tool discovery first, then model-emitted JSON arguments, then server execution and acknowledgement).
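Attaching an MCP server with an approval gate, as in the task-board demo, amounts to adding a tool entry to the request. The field names below follow the documented `"type": "mcp"` tool shape; the server label and URL are placeholders:

```python
# Sketch of an MCP tool entry with an approval gate. The server URL and label
# are hypothetical; "always" forces an approval step before each tool call.

def mcp_tool(server_label, server_url, require_approval="always"):
    return {
        "type": "mcp",
        "server_label": server_label,
        "server_url": server_url,
        "require_approval": require_approval,
    }

request = {
    "model": "gpt-5",
    "input": "List the open issues on the task board.",
    "tools": [mcp_tool("task_board", "https://example.com/mcp")],
}
```

With `require_approval` set, the model's proposed MCP call surfaces as an approval item the application must acknowledge before the server executes it.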
Cornell Notes
OpenAI’s Responses API replaces chat completions as the main building block for agentic applications. It runs an “agentic loop” inside a single request, letting models think, call tools, receive tool outputs, and sample again until a final answer is ready. The API’s “items in, items out” design breaks actions (messages, function calls, MCP calls, reasoning artifacts) into typed units, making it easier to stream, persist, and route results. Responses is also built for reasoning models and multimodal workflows, including reasoning rehydration across requests and simpler image/file inputs. OpenAI claims better tool-calling performance (5% on a tool-calling eval) and faster, cheaper long multi-tool rollouts (~20% faster at P50) due to preserved planning and fewer repeated thinking tokens.
What problem with chat completions does the Responses API try to fix, and how does it do that?
What does “items in, items out” mean in practice, and why does it matter for developers?
How does reasoning rehydration work across requests, and what performance impact is claimed?
What changes to streaming and event handling does Responses introduce?
How does the API support multimodal workflows like images and PDFs?
How do MCP tool calls work end-to-end in Responses?
Review Questions
- In what ways does the agentic loop in Responses change the number of times a model must “think” compared with chat completions during multi-tool workflows?
- How does the typed “items” interface simplify UI rendering and backend persistence compared with message-centric tool calling?
- What mechanisms does Responses provide for maintaining context and reasoning across requests (and what changes when using stateless mode)?
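On the last question, stateless mode can be sketched as follows: with `store=False` the server retains nothing, so the client requests encrypted reasoning content explicitly and replays it as input on the next turn. Field names follow the Responses API's `include` option; the payloads are illustrative request bodies, not SDK calls:

```python
# Sketch of stateless reasoning rehydration: store=False keeps the server
# stateless, so encrypted reasoning must be requested and passed back manually.

def first_turn_request(user_text):
    return {
        "model": "gpt-5",
        "input": [{"role": "user", "content": user_text}],
        "store": False,  # stateless: server keeps nothing between requests
        "include": ["reasoning.encrypted_content"],
    }

def next_turn_request(previous_output_items, user_text):
    # Rehydrate by replaying prior output items (including reasoning) as input.
    return {
        "model": "gpt-5",
        "input": previous_output_items + [{"role": "user", "content": user_text}],
        "store": False,
        "include": ["reasoning.encrypted_content"],
    }

prior = [{"type": "reasoning", "encrypted_content": "opaque-token"}]
req = next_turn_request(prior, "Continue from where you left off.")
```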
Key Points
1. Responses API is built around an agentic loop that can perform multiple model samples and tool calls within a single request, producing a final answer only after tool outputs are incorporated.
2. The “items in, items out” model replaces message-centric output handling by representing messages, tool calls, MCP calls, and reasoning as typed items that are easier to route and persist.
3. Reasoning rehydration across requests is a first-class feature for reasoning models, with OpenAI claiming improved tool-calling performance (5% on a tool-calling benchmark) and faster long rollouts (~20% at P50).
4. Responses modernizes multimodal workflows with simpler image inputs (base64 or URLs) and context stuffing for files like PDFs.
5. Streaming is redesigned to emit strongly typed events (text deltas, lifecycle signals, tool-call states), reducing the need to accumulate “object deltas.”
6. OpenAI provides a Codex CLI-powered migration pack to convert chat completions integrations to Responses, including updates to input mapping, streaming handling, and encrypted reasoning content.
7. MCP integration follows a discovery-then-execution pattern: Responses queries the MCP server for available tools, the model emits JSON arguments, and the MCP server executes and returns results (optionally with approval gates).