
I can't believe nobody's done this before...

Theo - t3.gg · 6 min read

Based on Theo - t3.gg's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their content.

TL;DR

OpenAI’s WebSocket-based API approach targets the repeated full-context retransmission that occurs after every tool call in agentic workflows.

Briefing

OpenAI’s shift from REST-style request/response to WebSockets for its APIs is poised to cut the biggest hidden cost in agentic AI: repeatedly sending the entire conversation and tool history back to the model after every tool call. The practical payoff is bandwidth reduction of 90%+ and speed gains of roughly 20–40% for agent runs with many tool calls—especially when a single user prompt triggers dozens or hundreds of tool invocations.

The core issue starts with how tool-using agents work. A typical flow stacks a system prompt, the user message, and then a sequence of agent messages and tool calls. When the agent decides it needs external information, like scanning a codebase with an `ls`-style tool, the model effectively pauses generation. Once the tool returns results, the model must resume with the full prior context so it knows what happened. In practice, every tool call completion triggers another API round trip that includes the entire history up to that point, not just the new tool output. That means long agent runs keep re-uploading growing payloads, even when the model will only generate a handful of tokens in the next step.
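A minimal sketch of that loop, with hypothetical `callModel` and `runTool` stubs standing in for the real API and tools (the message shapes are assumptions, not OpenAI's actual schema):

```ts
// Hypothetical message and model-call shapes for illustration only;
// this is not OpenAI's actual API surface.
type Message =
  | { role: "system" | "user" | "assistant"; content: string }
  | { role: "tool"; name: string; content: string };

type ModelOutput =
  | { type: "text"; content: string }
  | { type: "tool_call"; name: string; args: string };

// Stub model call: a real client would POST the whole `history` here.
async function callModel(history: Message[]): Promise<ModelOutput> {
  return { type: "text", content: `done after ${history.length} messages` };
}

// Stub tool runner standing in for something like an ls-style scan.
async function runTool(name: string, args: string): Promise<string> {
  return `stub result of ${name}(${args})`;
}

async function runAgent(system: string, user: string): Promise<string> {
  const history: Message[] = [
    { role: "system", content: system },
    { role: "user", content: user },
  ];
  while (true) {
    // Every iteration ships the ENTIRE history over the wire, even though
    // the model may emit only a few new tokens in response.
    const out = await callModel(history);
    if (out.type === "text") return out.content;

    const result = await runTool(out.name, out.args);
    history.push({ role: "assistant", content: `call ${out.name}(${out.args})` });
    history.push({ role: "tool", name: out.name, content: result });
    // The next loop turn re-uploads everything accumulated so far.
  }
}
```

Note how total bytes sent grow roughly quadratically with the number of tool calls, since each turn re-sends every earlier turn.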

Caching doesn’t solve the bandwidth problem. Cache keys are derived from hashes of the history, so even when compute is reduced, the client still transmits the full context; the system needs it to determine whether earlier portions can be reused. Compaction is different: it summarizes the history to shorten the token count, but doing so also undermines caching benefits. The result is a structural inefficiency: tool-heavy agents spend network and orchestration overhead processing and routing massive text inputs to produce small outputs.
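A small sketch of why that is, assuming cache keys are hashes over growing prefixes of the history (the hashing scheme is illustrative, not OpenAI's documented one):

```ts
import { createHash } from "node:crypto";

// Cache keys as hashes over growing prefixes of the transmitted history.
// The full payload must arrive before any key can be derived, so compute
// is saved but bandwidth is not.
function prefixCacheKeys(messages: string[]): string[] {
  return messages.map((_, i) =>
    createHash("sha256")
      .update(messages.slice(0, i + 1).join("\n"))
      .digest("hex"),
  );
}

// The server reuses work for the longest already-cached prefix and only
// computes fresh state for the messages after it.
function longestCachedPrefix(keys: string[], cache: Set<string>): number {
  let reusable = 0;
  for (const key of keys) {
    if (!cache.has(key)) break;
    reusable += 1;
  }
  return reusable; // number of messages whose cached state can be reused
}
```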

The bottleneck is partly architectural. Requests typically route through an orchestration/API layer that checks permissions, finds available GPU capacity, and manages caches. Because different requests may land on different backend boxes, the system can’t reliably keep state in memory across turns. Without state persistence, each subsequent request must resend the full context so the model can continue correctly.
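A toy illustration of the routing problem, with made-up backend names:

```ts
// Toy stand-in for a stateless orchestration layer: each request is routed
// to whichever backend has capacity, with no session pinning.
const backends = ["gpu-box-1", "gpu-box-2", "gpu-box-3"];

function route(): string {
  return backends[Math.floor(Math.random() * backends.length)];
}

// Two consecutive tool-call turns from the same agent run:
console.log("turn 1 handled by", route()); // e.g. gpu-box-2
console.log("turn 2 handled by", route()); // maybe gpu-box-1: no shared state
```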

WebSockets change the game by enabling a persistent connection to the same server during an agent run. That persistence acts less like a “faster protocol” and more like a guarantee: follow-up tool-call turns can hit the same API server that already has the in-memory state. With that guarantee, the system can avoid rechecking routing/caching decisions and avoid resending the entire history. Instead, it can send only the new inputs (like the latest tool output) and continue generation immediately.
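A client-side sketch of that pattern using the `ws` package; the endpoint URL, event names, and payload shapes are assumptions rather than a documented protocol:

```ts
import WebSocket from "ws";

// Stub tool runner for illustration.
async function runTool(name: string, args: unknown): Promise<string> {
  return `stub result of ${name}`;
}

// Hypothetical endpoint; the real service and message schema may differ.
const ws = new WebSocket("wss://api.example.com/v1/agent");

ws.on("open", () => {
  // First turn: the full context crosses the wire exactly once.
  ws.send(JSON.stringify({
    type: "start",
    system: "You are a coding agent.",
    user: "Refactor the auth module.",
  }));
});

ws.on("message", async (raw) => {
  const msg = JSON.parse(raw.toString());
  if (msg.type === "tool_call") {
    const result = await runTool(msg.name, msg.args);
    // Follow-up turn: only the new tool output is sent; the server already
    // holds the transcript in memory for this connection.
    ws.send(JSON.stringify({ type: "tool_result", id: msg.id, result }));
  }
});
```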

The improvement matters most for agentic workloads, not simple chat. For a chat app where users rarely follow up mid-run, reconnecting per user message is acceptable. But for tool chains with 20+ tool calls, the repeated context shipping becomes expensive enough that the WebSocket approach can deliver outsized gains.

Finally, the change is positioned as part of a broader ecosystem shift: OpenAI’s Responses API has influenced an open standard for structured request/response and tool invocation patterns, with clients and providers able to interoperate consistently. The missing piece—WebSocket support in that standard—was noted as not yet implemented, but expected soon. The bottom line: this isn’t just about token speed; it’s about making agent execution cheaper, faster, and less wasteful across the stack.

Cornell Notes

OpenAI’s API move to WebSockets targets a major inefficiency in tool-using agents: after each tool call, the system often needs the full prior context, so every follow-up request can resend the entire growing history. That repeated context transfer wastes bandwidth and forces extra orchestration work, even when the next model output may be only a few tokens. WebSockets enable a persistent connection that can keep in-memory state on the same backend server during an agent run, so follow-ups can send only new inputs (like tool results) instead of the whole transcript. The expected outcome is 90%+ bandwidth reduction and about 20–40% faster agentic runs with many tool calls. This benefit is strongest for multi-tool workflows, less so for simple chat turns.

Why does an agent’s tool call often trigger sending the entire conversation history again?

Tool-using agents build up state across system prompt, user message, agent messages, and tool calls. When a tool call completes, the model must resume generation with knowledge of what happened earlier (including the tool outputs). In the common request/response setup, each follow-up call is effectively stateless from the backend’s perspective, so the client resends the full history up to that point so the model can continue correctly. The transcript’s example frames this as “the model is paused/dead” during the tool call, then “wakes up” needing the full prior context to proceed.

How do caching and compaction differ in terms of bandwidth and traffic volume?

Caching reduces compute time but doesn’t necessarily reduce how much text is transmitted. Cache keys are based on hashes of the history, so the system can only reuse cached computation after receiving the full history payload to compute the hash and locate prior cached segments. Compaction, by contrast, shortens the history by summarizing it, which can reduce tokens sent—but it can break cache reuse because the history no longer matches prior cached hashes. So caching helps latency/cost of processing, while compaction changes the content/length and can forgo caching benefits.
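To make the trade-off concrete, a toy comparison under the same prefix-hash assumption used earlier:

```ts
import { createHash } from "node:crypto";

const sha = (s: string) => createHash("sha256").update(s).digest("hex");

// Verbose history vs. a compacted summary (contents are made up).
const history = ["system: ...", "user: scan the repo", "tool: 412 files ..."];
const full = history.join("\n");
const compacted = "summary: user asked for a repo scan; 412 files found";

// Compaction sends fewer bytes but no longer matches any cached prefix hash.
console.log(full.length, "->", compacted.length); // fewer bytes on the wire
console.log(sha(full) === sha(compacted));        // false: cache misses
```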

What role does the orchestration/API layer play in why state is hard to keep across requests?

Requests typically go to an API/orchestration layer that checks permissions, selects available GPU capacity, and routes work to backend servers. Because there are many backend boxes and requests may be routed to different ones, the system can’t assume the same in-memory state exists for the next turn. Without a persistent session pinned to one server, each request must resend the full context so the backend can reconstruct the agent state. The transcript emphasizes that even if GPUs exist, clients aren’t directly calling them; they’re going through orchestration that may distribute requests across different machines.

What does WebSockets add beyond “a faster protocol”?

The key value is not just throughput; it’s the ability to keep hitting the same backend server during an agent run. With a persistent WebSocket connection, the backend can maintain in-memory state across tool-call turns. That removes the need to resend the entire history and reduces repeated checks for routing, caching, and permissions. The transcript frames this as a guarantee: tool call 2 can reliably land on the same box as tool call 1, so the system can send only the new inputs.
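A server-side counterpart to the earlier client sketch, again with illustrative event names: the connection owns the transcript, so follow-up turns append only deltas.

```ts
import { WebSocketServer } from "ws";

// Illustrative event names and shapes; not a documented protocol.
const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (socket) => {
  // State lives for the lifetime of this connection, so nothing needs
  // to be re-uploaded between tool-call turns.
  const transcript: string[] = [];

  socket.on("message", (raw) => {
    const msg = JSON.parse(raw.toString());
    if (msg.type === "start") {
      // Full context arrives exactly once, at the start of the run.
      transcript.push(msg.system, msg.user);
    } else if (msg.type === "tool_result") {
      // Follow-up turns carry only the delta.
      transcript.push(`tool: ${msg.result}`);
    }
    // A real server would resume generation from `transcript` here.
  });
});
```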

Why does the benefit matter less for a typical chat app follow-up?

In a chat app, the second user message usually arrives after the first turn finishes, and it may not reuse the same persistent connection/session. Reloading context once per user message is therefore acceptable. The transcript argues the big win comes when one user message spawns many tool calls in a single agent run (e.g., 20+ tool calls), where repeated full-history retransmission becomes a dominant cost.

What performance numbers are associated with the WebSocket change?

The transcript cites bandwidth reduction of 90%+ and speed improvements of about 20–40% for agentic runs with many tool calls. It also notes that early testing suggested the gains could be even larger than what was publicly shared, while OpenAI’s own description attributes the improvement to sending only new inputs and maintaining in-memory state across interactions.

Review Questions

  1. In a tool-using agent workflow, what specific information must the model have after each tool call, and why does that lead to resending history in stateless setups?
  2. Explain why caching reduces compute but may not reduce bandwidth in this architecture. What determines the cache key?
  3. How does a persistent WebSocket connection change backend routing assumptions, and why does that enable sending only new tool outputs?

Key Points

  1. OpenAI’s WebSocket-based API approach targets the repeated full-context retransmission that occurs after every tool call in agentic workflows.
  2. Tool calls pause generation, then require the model to resume with the full prior system/user/tool history, which drives large payloads across many agent steps.
  3. Caching can cut compute and speed up token generation, but it typically doesn’t reduce the amount of data sent because cache keys depend on hashed history.
  4. Compaction can reduce token length by summarizing history, but it can undermine cache reuse because the history no longer matches prior cached hashes.
  5. The orchestration layer makes state persistence difficult in stateless request/response systems because follow-up requests may route to different backend boxes.
  6. WebSockets provide a persistence guarantee that keeps subsequent tool-call turns on the same server, enabling in-memory state reuse and sending only new inputs.
  7. The largest gains are expected for agent runs with many tool calls (20+), while simple chat follow-ups benefit less because context reload per user message is usually acceptable.

Highlights

The biggest hidden cost in agentic AI isn’t just model speed—it’s the repeated network transfer of an ever-growing tool-and-chat history after each tool call.
Caching helps latency and cost of computation, but it doesn’t automatically reduce bandwidth because the system still needs the full history to find cached segments.
WebSockets matter mainly because they can keep tool-call follow-ups pinned to the same backend server with in-memory state, avoiding repeated routing/caching checks.
The performance target is concrete: 90%+ bandwidth reduction and about 20–40% faster agentic runs with many tool calls.
