Open Responses - The NEW Standard API for Open Models

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Open Responses is positioned as an interoperability standard for agent-style model features, aiming to spare developers from rewriting integrations for each open model’s custom endpoint.

Briefing

OpenAI’s push for an “Open Responses” standard aims to make today’s agent-style features—tool calling, streaming, multimodal inputs, and structured agent loops—work across many open-model providers without forcing developers to learn a new API for every model. The practical problem behind the initiative is that open models increasingly ship with their own “frontier” interfaces, so code written for one model’s reasoning or tool-calling behavior often breaks when swapped to another.

The transcript places this effort in context: earlier compatibility modes let developers reuse OpenAI SDKs and chat-completions endpoints across providers. That era is fading as agentic systems demand richer, more structured APIs than plain chat. Instead, model ecosystems have been building their own endpoints—most visibly around Anthropic’s Claude Code and its tool- and agent-oriented workflow. The result is fragmentation: providers that want adoption tend to implement the most popular agent-tooling interface rather than invent a new one.

OpenAI’s proposed standard, “Open Responses,” is framed as a multi-provider specification designed to reduce that fragmentation. It’s intended to let models plug into a common agentic API surface, while still allowing model makers to extend behavior without breaking the baseline contract. The transcript highlights that community adopters are already lining up, including Hugging Face and infrastructure layers such as Vercel, OpenRouter, and local/cloud runtime tools like LM Studio, Ollama, and vLLM. The expectation is that once these building blocks support the standard, more frameworks (and model-serving platforms) will follow.

Inside the spec, much of what’s described looks like a structured mapping from OpenAI’s Responses API concepts into a broader standard: “items” that can represent messages, function calls (with names and arguments), and intermediate states such as in-progress versus completed steps. A major emphasis is on reasoning handling. Open models often expose reasoning differently across vendors, forcing developers to rewrite logic to hide, stream, or summarize reasoning tokens. Open Responses is presented as baking in support for both raw reasoning tokens and reasoning summaries, which could make it easier to build consistent developer experiences.

The standard also supports tool-related workflows beyond simple external function calls. It includes “internally hosted” tools—server-side utilities such as search or sandboxed code execution—alongside patterns that return tool calls for external execution (including MCP-style setups). It also exposes explicit tool-choice controls (must use tools, must not use tools, or choose among tools), which the transcript notes open models have not consistently supported.
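
As a rough sketch, such constraints might surface as a tool_choice parameter on a Responses-style call. The endpoint, model name, and booking_tool schema below are invented for illustration; the option names follow OpenAI’s Responses API and may differ across implementations.

```python
from openai import OpenAI

# Hypothetical Open Responses endpoint and model name, for illustration only.
client = OpenAI(base_url="https://example-provider.com/v1", api_key="YOUR_KEY")

booking_tool = {  # minimal function-tool schema (made up for this sketch)
    "type": "function",
    "name": "book_table",
    "description": "Reserve a restaurant table.",
    "parameters": {
        "type": "object",
        "properties": {
            "people": {"type": "integer"},
            "time": {"type": "string"},
        },
        "required": ["people", "time"],
    },
}

resp = client.responses.create(
    model="some-open-model",
    input="Book a table for two at 7pm.",
    tools=[booking_tool],
    tool_choice="required",  # force a tool call; "none" forbids one, "auto" lets the model decide
)
```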

Finally, the transcript demonstrates the standard in code with both Hugging Face inference endpoints and local Ollama. It shows responses-style calls with event-based streaming, tool calling, and reasoning traces, while noting that model support can be uneven—some models stream reasoning tokens more reliably than others. The closing assessment is conditional: if major open-model labs train to this schema, local deployment could become dramatically easier for developers migrating from proprietary ecosystems. At the same time, the transcript predicts competing compatibility efforts—especially around Claude/Anthropic-style interfaces—may also accelerate, with Chinese providers potentially aligning to those APIs for adoption.

In short, Open Responses is pitched as an interoperability layer for agentic model features, with the biggest payoff coming if enough open-model makers treat it as a training target rather than just an API wrapper.

Cornell Notes

OpenAI’s “Open Responses” standard is designed to unify how developers interact with open models for agent-style tasks—tool calling, streaming, multimodal inputs, and multi-step agent loops—so code doesn’t have to be rewritten for every model’s custom endpoint. The spec organizes interactions into structured “items” such as messages and function calls, and it includes explicit support for reasoning handling, including both raw reasoning tokens and reasoning summaries. It also accommodates tool execution patterns, including internally hosted tools (server-side utilities) and external tool-call workflows (including MCP-like setups), plus tool-choice constraints. Early adoption by platforms like Hugging Face and runtime layers such as Ollama and vLLM suggests the standard could become a practical bridge for running open models locally or via hosted inference.

Why does Open Responses matter more now than earlier OpenAI-compatibility modes?

Earlier compatibility modes let developers reuse OpenAI SDKs and chat-completions endpoints across different providers. As models shift toward agentic systems—where tool calling, streaming events, and multi-step reasoning matter—simple chat-completions interfaces no longer capture the needed structure. Open Responses targets that gap by standardizing the agent-style API surface so developers can keep one integration pattern across many open models.
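
A minimal sketch of that single integration pattern, assuming an OpenAI-compatible Python client; the base URL and model name are placeholders:

```python
from openai import OpenAI

# Placeholder endpoint/model: any provider implementing the standard should work.
client = OpenAI(base_url="https://example-provider.com/v1", api_key="YOUR_KEY")

response = client.responses.create(
    model="some-open-model",
    input="Summarize the Open Responses standard in one sentence.",
)

# The Responses shape returns structured output items plus a text convenience view.
print(response.output_text)
```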

What does the standard mean by “items,” and how does that help with agent loops?

The spec uses “items” to represent structured elements in an interaction. Items can include message roles (assistant/user/system), function calls with a function name and arguments, and intermediate states such as “in progress” versus “completed.” This structure supports agentic loops where the model can request tool actions, stream intermediate progress, and then return a final response in a consistent format.
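
A sketch of consuming those items with an OpenAI-style Python client. The field names follow OpenAI’s Responses API, and the weather tool is a made-up schema:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="https://example-provider.com/v1", api_key="YOUR_KEY")

weather_tool = {  # illustrative function-tool schema
    "type": "function",
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

response = client.responses.create(
    model="some-open-model",  # placeholder
    input="What's the weather in Paris?",
    tools=[weather_tool],
)

# Each output item is typed, so one loop can handle messages, tool requests, etc.
for item in response.output:
    if item.type == "function_call":
        print("tool request:", item.name, json.loads(item.arguments))
    elif item.type == "message":
        for part in item.content:
            if part.type == "output_text":
                print("assistant:", part.text)
```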

How does Open Responses aim to reduce pain around reasoning tokens?

Open models often expose reasoning differently across vendors—some stream reasoning tokens, others hide them, and others provide only summaries. Open Responses is presented as baking in reasoning support so developers can handle both raw reasoning tokens and reasoning summaries through the same API contract, avoiding per-model rewrites when switching models.
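
A sketch of that contract, assuming the provider honors a Responses-style reasoning parameter; the option names follow OpenAI’s API, and actual support varies by model:

```python
from openai import OpenAI

client = OpenAI(base_url="https://example-provider.com/v1", api_key="YOUR_KEY")

response = client.responses.create(
    model="some-reasoning-model",  # placeholder
    input="How many r's are in 'strawberry'?",
    reasoning={"effort": "medium", "summary": "auto"},  # request reasoning + summaries
)

for item in response.output:
    if item.type == "reasoning":
        # Summaries arrive as parts on the reasoning item; raw tokens depend on the model.
        for part in item.summary:
            print("reasoning summary:", part.text)

print("answer:", response.output_text)
```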

What’s the difference between internally hosted tools and external tool calls in this standard?

Internally hosted tools are server-side utilities (examples mentioned include search-like tools and sandboxed code execution) that the provider can run as part of the request flow. External tool calls return a tool invocation for execution outside the model provider (the transcript also references MCP-style patterns). Open Responses supports both, letting developers choose an execution model without changing the integration shape.
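
A sketch of the external round trip, reusing the made-up weather tool from the earlier sketch; the function_call_output shape follows OpenAI’s Responses API:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="https://example-provider.com/v1", api_key="YOUR_KEY")

weather_tool = {  # same illustrative schema as before
    "type": "function",
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def get_weather(city: str) -> str:
    return f"Sunny and 22 C in {city}"  # stand-in for a real lookup

input_items = [{"role": "user", "content": "What's the weather in Paris?"}]
first = client.responses.create(
    model="some-open-model", input=input_items, tools=[weather_tool],
)

# Execute any requested tool calls ourselves, then feed the results back.
for item in first.output:
    if item.type == "function_call":
        input_items.append(item)  # echo the model's call back into the input
        result = get_weather(**json.loads(item.arguments))
        input_items.append({
            "type": "function_call_output",
            "call_id": item.call_id,
            "output": result,
        })

final = client.responses.create(
    model="some-open-model", input=input_items, tools=[weather_tool],
)
print(final.output_text)
```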

What did the transcript’s code demos show about real-world support?

The demos used Hugging Face inference endpoints and local Ollama, both calling a client.responses-style interface with event-based streaming, tool calling, and reasoning traces. Results were described as fast on hosted inference and slower on local runs due to model loading. Reasoning streaming was described as “hit and miss” depending on model support, with some gpt-oss models streaming reasoning tokens and enabling token-level breakdowns (prompt, completion, reasoning).
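
A sketch of the local streaming pattern described above. The base URL is Ollama’s usual OpenAI-compatible address, the model tag is a placeholder, and the event names follow OpenAI’s Responses streaming events; whether a given runtime exposes a Responses endpoint at all is an assumption here:

```python
from openai import OpenAI

# Assumes a local runtime (e.g., Ollama) exposing a Responses-compatible endpoint.
local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

stream = local.responses.create(
    model="gpt-oss:20b",  # placeholder local model tag
    input="Explain tool calling in one paragraph.",
    stream=True,
)

# Streaming is event-based: each event is typed rather than a bare text chunk.
for event in stream:
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
    elif event.type == "response.reasoning_summary_text.delta":
        # Reasoning summary tokens, when the model/provider streams them
        print(event.delta, end="", flush=True)
```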

Review Questions

  1. What specific agentic capabilities (beyond plain chat) does Open Responses standardize, and why are those capabilities hard to support with older compatibility modes?
  2. How do “items” and function-call structures enable multi-step tool-using workflows?
  3. Why is consistent reasoning handling (tokens vs summaries) a major interoperability challenge for open models?

Key Points

  1. Open Responses is positioned as an interoperability standard for agent-style model features, aiming to spare developers from rewriting integrations for each open model’s custom endpoint.

  2. The standard uses structured “items” to represent messages, function calls (name + arguments), and intermediate states like in-progress versus completed steps.

  3. Reasoning support is treated as first-class, including both raw reasoning tokens and reasoning summaries to reduce vendor-specific handling differences.

  4. Tool execution is standardized across two patterns: internally hosted server-side tools and external tool-call workflows (including MCP-like approaches).

  5. Tool-choice controls (must use tools / must not use tools / choose) are included to make tool behavior more predictable for agent training and deployment.

  6. Community adoption is already underway through platforms and runtimes such as Hugging Face, Vercel, OpenRouter, LM Studio, Ollama, and vLLM, which could accelerate ecosystem convergence.

  7. Local and hosted implementations can use the same responses-style calls, but model-level support for reasoning streaming may vary, affecting developer expectations.

Highlights

Open Responses targets the agent era—tool calling, streaming events, multimodal inputs, and multi-step loops—where chat-completions compatibility no longer suffices.
The spec’s reasoning handling is designed to unify how developers consume reasoning tokens and summaries across different open-model vendors.
Support for both internally hosted tools and external tool-call patterns is framed as key for real-world agent deployments.
Code demos show the same client.responses-style workflow working on Hugging Face inference and local Ollama, with performance and reasoning-stream reliability varying by setup and model.
