I can't believe nobody's done this before...
Based on Theo (t3.gg)'s video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
OpenAI’s WebSocket-based API approach targets the repeated full-context retransmission that occurs after every tool call in agentic workflows.
Briefing
OpenAI’s shift from REST-style request/response to WebSockets for its APIs is poised to cut the biggest hidden cost in agentic AI: repeatedly sending the entire conversation and tool history back to the model after every tool call. The practical payoff is bandwidth reduction of 90%+ and speed gains of roughly 20–40% for agent runs with many tool calls—especially when a single user prompt triggers dozens or hundreds of tool invocations.
The core issue starts with how tool-using agents work. A typical flow stacks a system prompt, the user message, and then a sequence of agent messages and tool calls. When the agent decides it needs external information (like scanning a codebase with an `ls`-style tool), the model effectively pauses generation. Once the tool returns its results, the model must resume, and resuming requires the full prior context so it knows what has already happened. In practice, every completed tool call triggers another API round trip that includes the entire history up to that point, not just the new tool output. Long agent runs therefore keep re-uploading growing payloads, even when the model will only generate a handful of tokens in the next step.
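To make the pattern concrete, here is a minimal sketch of that stateless loop using the OpenAI Python SDK's chat-completions interface; the model name, the `ls` tool, and the prompts are illustrative stand-ins, not the specific setup from the video.

```python
# Minimal sketch of the stateless tool-call loop (OpenAI Python SDK,
# chat-completions style). Note that `messages` -- the ENTIRE history --
# is re-sent on every iteration; only the tool result is actually new.
import json
import os
from openai import OpenAI

client = OpenAI()

def run_ls_tool(args: dict) -> str:
    """Hypothetical local tool: list files under a path."""
    return json.dumps(os.listdir(args.get("path", ".")))

tools = [{
    "type": "function",
    "function": {
        "name": "ls",
        "description": "List files in a directory",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
        },
    },
}]

messages = [
    {"role": "system", "content": "You are a coding agent."},
    {"role": "user", "content": "Summarize this repo's layout."},
]

while True:
    # Every round trip uploads the full, growing `messages` payload.
    resp = client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=tools
    )
    msg = resp.choices[0].message
    messages.append(msg)  # keep the assistant turn in the transcript
    if not msg.tool_calls:
        break  # model produced a final answer; the run ends
    for call in msg.tool_calls:
        result = run_ls_tool(json.loads(call.function.arguments))
        # Only this small string is new, yet the next request
        # re-transmits everything accumulated above.
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": result,
        })
```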
Caching doesn’t solve the bandwidth problem. Cache keys are derived from hashes of the history, so even when compute is saved, the client still has to transmit the full context before the system can determine whether earlier portions are reusable. Compaction is different: it summarizes the history to reduce token count, but the rewritten transcript no longer matches previously cached prefixes, so it undermines caching benefits. The result is a structural inefficiency: tool-heavy agents burn network bandwidth and orchestration overhead shipping and routing massive text inputs just to produce small outputs.
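As a toy illustration of why a cache hit saves compute but not bytes on the wire, consider a prefix cache keyed by a hash of the serialized history. Everything here is a hypothetical simplification, not OpenAI's actual implementation: the key can only be computed after the client has uploaded the full context.

```python
# Toy prefix cache (hypothetical): hits skip recomputing the prefix,
# but the client already paid to transmit the whole history, because
# the key is a hash over that history.
import hashlib
import json

kv_cache: dict[str, object] = {}  # prefix hash -> cached KV state

def prefix_key(messages: list[dict]) -> str:
    # Deterministic serialization of the history up to this point.
    blob = json.dumps(messages, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def handle_request(messages: list[dict]) -> None:
    key = prefix_key(messages[:-1])  # everything before the newest turn
    if key in kv_cache:
        print("cache hit: skip re-computing the prefix")
    else:
        print("cache miss: recompute and store")
        kv_cache[key] = object()  # stand-in for real KV tensors
    # Either way, `messages` crossed the network in full.
```

This also shows why compaction fights caching: once the history is rewritten into a summary, its hash matches none of the stored prefixes.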
The bottleneck is partly architectural. Requests typically route through an orchestration/API layer that checks permissions, finds available GPU capacity, and manages caches. Because different requests may land on different backend boxes, the system can’t reliably keep state in memory across turns. Without state persistence, each subsequent request must resend the full context so the model can continue correctly.
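A tiny simulation makes the routing problem visible: with per-request load balancing, a follow-up turn has no guarantee of landing on the worker that still holds the run's state in memory. The worker names and run IDs below are made up.

```python
# Toy simulation (illustrative only): the balancer picks any available
# worker per request, so in-memory state from an earlier turn is often
# on a different box than the one handling the follow-up.
import random

class Worker:
    def __init__(self, name: str):
        self.name = name
        self.state: dict[str, list] = {}  # run_id -> in-memory context

    def handle(self, run_id: str, new_input: str) -> None:
        if run_id in self.state:
            self.state[run_id].append(new_input)
            print(f"{self.name}: resumed {run_id} from memory")
        else:
            print(f"{self.name}: no state for {run_id}; client must resend all")
            self.state[run_id] = [new_input]

workers = [Worker("gpu-box-1"), Worker("gpu-box-2"), Worker("gpu-box-3")]

for turn in range(4):
    random.choice(workers).handle("run-42", f"tool output {turn}")
```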
WebSockets change the game by enabling a persistent connection to the same server during an agent run. That persistence acts less like a “faster protocol” and more like a guarantee: follow-up tool-call turns can hit the same API server that already has the in-memory state. With that guarantee, the system can avoid rechecking routing/caching decisions and avoid resending the entire history. Instead, it can send only the new inputs (like the latest tool output) and continue generation immediately.
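OpenAI's wire format for this is not specified in the source, so the sketch below invents a minimal message schema on top of the generic `websockets` library purely to show the shape of the interaction: connect once, upload the full context once, then send only deltas for each tool result.

```python
# Shape of the persistent-connection pattern (hypothetical schema; not
# OpenAI's actual protocol). The full context crosses the wire once;
# each subsequent tool-call turn sends only the new tool output.
import asyncio
import json
import websockets  # pip install websockets

def run_tool(name: str, arguments: dict) -> str:
    return "..."  # dispatch to a local tool implementation

async def agent_run(url: str, messages: list[dict]) -> None:
    async with websockets.connect(url) as ws:
        # One-time upload of the full starting context.
        await ws.send(json.dumps({"type": "start", "messages": messages}))
        while True:
            event = json.loads(await ws.recv())
            if event["type"] == "tool_call":
                result = run_tool(event["name"], event["arguments"])
                # Delta-only follow-up: just the new tool result, not
                # the accumulated transcript.
                await ws.send(json.dumps({
                    "type": "tool_result",
                    "call_id": event["call_id"],
                    "content": result,
                }))
            elif event["type"] == "done":
                print(event["final_text"])
                break

# asyncio.run(agent_run("wss://example.invalid/v1/agent", messages=[...]))
```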
The improvement matters most for agentic workloads, not simple chat. For a chat app where users rarely follow up mid-run, reconnecting per user message is acceptable. But for tool chains with 20+ tool calls, the repeated context shipping becomes expensive enough that the WebSocket approach can deliver outsized gains.
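A back-of-the-envelope calculation shows where a "90%+" figure can plausibly come from; the payload sizes below are invented for illustration.

```python
# Bandwidth comparison under made-up but plausible sizes: a 10 KB
# starting context and ~2 KB of new content per turn, over 50 turns.
start_kb, delta_kb, turns = 10, 2, 50

# Stateless: turn i re-uploads the starting context plus all prior deltas.
full_resend = sum(start_kb + delta_kb * i for i in range(turns))

# Persistent connection: upload the context once, then deltas only.
delta_only = start_kb + delta_kb * turns

print(f"full resend: {full_resend} KB")              # 2950 KB
print(f"delta only:  {delta_only} KB")               # 110 KB
print(f"saved: {1 - delta_only / full_resend:.0%}")  # 96%
```

Under these assumptions the stateless pattern ships roughly 27x more bytes, and the gap widens with longer runs: total traffic grows quadratically with turn count instead of linearly.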
Finally, the change is positioned as part of a broader ecosystem shift: OpenAI’s Responses API has influenced an open standard for structured request/response and tool invocation patterns, with clients and providers able to interoperate consistently. The missing piece—WebSocket support in that standard—was noted as not yet implemented, but expected soon. The bottom line: this isn’t just about token speed; it’s about making agent execution cheaper, faster, and less wasteful across the stack.
Cornell Notes
OpenAI’s API move to WebSockets targets a major inefficiency in tool-using agents: after each tool call, the system often needs the full prior context, so every follow-up request can resend the entire growing history. That repeated context transfer wastes bandwidth and forces extra orchestration work, even when the next model output may be only a few tokens. WebSockets enable a persistent connection that can keep in-memory state on the same backend server during an agent run, so follow-ups can send only new inputs (like tool results) instead of the whole transcript. The expected outcome is 90%+ bandwidth reduction and about 20–40% faster agentic runs with many tool calls. This benefit is strongest for multi-tool workflows, less so for simple chat turns.
Why does an agent’s tool call often trigger sending the entire conversation history again?
How do caching and compaction differ in their effects on bandwidth and token volume?
What role does the orchestration/API layer play in why state is hard to keep across requests?
What does WebSockets add beyond “a faster protocol”?
Why does the benefit matter less for a typical chat app follow-up?
What performance numbers are associated with the WebSocket change?
Review Questions
- In a tool-using agent workflow, what specific information must the model have after each tool call, and why does that lead to resending history in stateless setups?
- Explain why caching reduces compute but may not reduce bandwidth in this architecture. What determines the cache key?
- How does a persistent WebSocket connection change backend routing assumptions, and why does that enable sending only new tool outputs?
Key Points
1. OpenAI’s WebSocket-based API approach targets the repeated full-context retransmission that occurs after every tool call in agentic workflows.
2. Tool calls pause generation, then require the model to resume with the full prior system/user/tool history, which drives large payloads across many agent steps.
3. Caching can cut compute and speed up token generation, but it typically doesn’t reduce the amount of data sent because cache keys depend on hashed history.
4. Compaction can reduce token length by summarizing history, but it can undermine cache reuse because the history no longer matches prior cached hashes.
5. The orchestration layer makes state persistence difficult in stateless request/response systems because follow-up requests may route to different backend boxes.
6. WebSockets provide a persistence guarantee that keeps subsequent tool-call turns on the same server, enabling in-memory state reuse and sending only new inputs.
7. The largest gains are expected for agent runs with many tool calls (20+), while simple chat follow-ups benefit less because context reload per user message is usually acceptable.