Build Hour: Prompt Caching
Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Prompt caching reuses computation when requests share an identical input prefix, skipping reprocessing of already-seen tokens and only running inference for new content.
Briefing
Prompt caching is positioned as a straightforward way to cut both latency and cost in OpenAI-powered applications by reusing computation whenever new requests share the same input prefix. The core mechanism is compute reuse: when multiple requests begin with identical content (the “prefix”), the system skips reprocessing tokens it has already handled and only runs inference for the new portion. That matters most for long prompts, high-volume workloads, and agentic systems where conversations and tool traces repeat large chunks turn after turn.
Caching eligibility starts at 1,024 tokens. Prompts shorter than that don't qualify; once a request crosses the threshold, cache hits grow in 128-token blocks. A key constraint is contiguity and order: cache hits require the exact same prefix content in the same sequence. OpenAI describes this as implicit prompt caching across modalities (text, images, audio, and more), triggered automatically on the Responses API and certain other endpoints, without code changes.
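The eligibility rules above can be sketched as a small helper. This is an illustrative model of the 1,024-token minimum and 128-token block granularity described in the session, not OpenAI's actual accounting; the function name and exact rounding are assumptions.

```python
def cacheable_prefix_tokens(prompt_tokens: int) -> int:
    """Longest prefix (in tokens) that prompt caching could reuse,
    per the 1,024-token minimum and 128-token block size.
    Illustrative sketch, not OpenAI's official accounting."""
    MIN_TOKENS, BLOCK = 1024, 128
    if prompt_tokens < MIN_TOKENS:
        return 0  # below the eligibility threshold: nothing is cached
    # Cache hits grow in 128-token increments, so round down to a block.
    return prompt_tokens - (prompt_tokens % BLOCK)

below = cacheable_prefix_tokens(1000)   # below threshold -> 0
exact = cacheable_prefix_tokens(1024)   # exactly eligible -> 1024
big   = cacheable_prefix_tokens(2200)   # rounds down to a 128 multiple
```

A 2,200-token prompt, for instance, can hit the cache on at most 2,176 tokens (the largest multiple of 128 below its length); the remaining partial block is always processed fresh.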
The financial impact is tied to cache hit rate. OpenAI introduced steep discounts for cached input tokens across model families (with larger discounts for newer families), and the session highlights that audio caching on the speech-to-speech model can reach nearly 99% discount on cached audio tokens. Latency gains depend on prompt length: a benchmark with variable-length inputs (from 1,024 up to 200,000 tokens) shows cached requests keeping time-to-first-token relatively stable, while uncached requests slow sharply as inputs grow. The practical takeaway is that the longer the input, the more caching helps time-to-first-token.
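Since the financial impact scales with cache hit rate, the savings are easy to model. The prices and discount below are placeholders, not OpenAI's published rates; the point is only the shape of the formula.

```python
def input_cost(total_in: int, cached_in: int,
               price_per_m: float, discount: float) -> float:
    """Input-token cost when `cached_in` of `total_in` tokens hit the cache.
    `discount` is the fraction knocked off cached tokens (e.g. 0.75).
    Prices here are hypothetical, not OpenAI's actual rates."""
    uncached = total_in - cached_in
    return (uncached + cached_in * (1 - discount)) * price_per_m / 1e6

# Hypothetical: 100k-token prompt, $2/M input tokens, 75% cached-token discount.
cold = input_cost(100_000, 0, 2.0, 0.75)        # no cache hits
warm = input_cost(100_000, 90_000, 2.0, 0.75)   # 90% of tokens cached
```

With these placeholder numbers, a 90% hit rate cuts the input bill by roughly two thirds, which is why the hit rate, not the discount alone, is the metric to optimize.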
Under the hood, OpenAI hashes the first 256 tokens of the prefix and uses that hash to route requests to the right backend engine. Each engine can absorb only so much traffic (roughly 15 requests per minute for a given prefix), so the router must balance cache hit rate against overall system health. When a request arrives, the engine checks how many subsequent 128-token chunks match what it has already cached; inference runs only from the first mismatched token onward. After generating output, the system updates the cache so future requests can reuse the intermediate key-value tensors (not raw tokens or media).
Several developer tactics are presented to maximize cache hits. The optional prompt cache key is emphasized as a way to improve routing locality when many requests share only part of the prefix; it hashes the prefix plus the key so related requests land on the same machine. Demos show that accidentally breaking the cache—such as inserting timestamps or dynamic whitespace early in the prompt—can drive cache hit rates to 0%, while using prompt cache keys can materially raise hit rates (and reduce input token cost) even when latency changes are small.
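These tactics reduce to one rule of thumb: stable content first, dynamic content last, and a `prompt_cache_key` to keep related traffic on the same machine. The sketch below builds a request payload in that shape; field names follow the Responses API, but the model name and key scheme are placeholders, and the exact schema may differ.

```python
import time

def build_request(system_prompt: str, tools: list,
                  user_msg: str, user_id: int) -> dict:
    """Cache-friendly request payload: identical content (system prompt,
    tool definitions) leads, dynamic content (timestamp, user turn) trails,
    and a prompt_cache_key improves routing locality.
    Field names follow the Responses API; values here are placeholders."""
    return {
        "model": "gpt-4.1",                     # placeholder model name
        "instructions": system_prompt,          # identical across requests
        "tools": tools,                         # identical across requests
        "input": [
            # Dynamic content goes LAST so it never breaks the shared prefix;
            # a timestamp up front would drive the cache hit rate to 0%.
            {"role": "user",
             "content": f"[{time.strftime('%H:%M')}] {user_msg}"},
        ],
        "prompt_cache_key": f"user-{user_id}",  # groups this user's traffic
    }

payload = build_request("You are a helpful coding agent.", [], "fix the bug", 42)
```

Moving the timestamp from the system prompt into the final user turn is exactly the kind of fix the demos show recovering a 0% hit rate.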
Context engineering is treated as a balancing act: shaping prompts dynamically can improve task quality, but it can invalidate cached prefixes. Newer compaction options (including server-side compaction and a responses/compact endpoint) help manage long-running contexts by summarizing or trimming history before it grows too large. For real-time speech-to-speech use, a retention ratio parameter is described as a way to avoid truncating every turn and instead make larger cuts less frequently.
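The retention idea can be sketched as a trimming policy: rather than dropping one turn every request (which invalidates the cached prefix every time), cut a large chunk at once so the new, shorter prefix stays stable for many turns afterward. The parameter names and thresholds below are illustrative, not the API's actual parameters.

```python
def compact_history(turns: list, max_turns: int = 50,
                    retention_ratio: float = 0.6) -> list:
    """When history exceeds `max_turns`, keep only `retention_ratio` of the
    budget in one large cut, instead of trimming a little every turn.
    Illustrative sketch; parameter names are assumptions."""
    if len(turns) <= max_turns:
        return turns                      # under budget: prefix untouched
    keep = int(max_turns * retention_ratio)
    return turns[-keep:]                  # one big cut, most recent turns kept

history = [f"turn-{i}" for i in range(60)]
trimmed = compact_history(history)        # 60 turns -> keep the last 30
```

After one cut the conversation can grow from 30 back up to 50 turns with a fully stable prefix, so roughly 20 consecutive requests get cache hits before the next invalidation.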
The session also adds operational guidance: use the Responses API with reasoning models to preserve hidden chain-of-thought tokens that would otherwise reduce cache reuse; consider flex processing for asynchronous workloads; and use allowed tools to change tool access without invalidating the cache. A customer spotlight from Warp frames prompt caching as essential for agentic loops, where system prompts, tool definitions, and earlier turns repeat, and describes how Warp's "scopes" approach (global, user, task) plus prompt cache keys more than doubled its cache hit rates. The closing message: prompt caching carries no direct intelligence trade-off when prefixes remain identical; the only real trade-off is architectural, in how prompts and context are structured so caching remains effective.
Cornell Notes
Prompt caching reuses computation when new requests share the same input prefix, skipping reprocessing of tokens already handled and running inference only on the new portion. Caching eligibility begins at 1,024 tokens, then proceeds in 128-token blocks, and cache hits require the exact same prefix content in the same order. OpenAI's system hashes the first 256 prefix tokens to route requests to backend engines, where matching 128-token chunks determine how much work can be reused. Developers can raise cache hit rates by using the optional prompt cache key (to improve routing locality), avoiding cache-breaking dynamic content early in prompts, and managing long contexts with compaction or retention strategies. For agentic workloads like Warp's coding agents, consistent system prompts, tools, and stable "scopes" make prompt caching a major lever for cutting cost and improving time-to-first-token.
What exactly must match for a prompt caching hit to occur?
How does OpenAI decide where a request can reuse cached computation?
Why does the prompt cache key matter if caching is “implicit”?
How should developers think about context engineering without destroying cache reuse?
What role do compaction and retention ratio play in long-running sessions?
What endpoint and reasoning-model details affect cache hit rates?
Review Questions
- If a request’s first 256 prefix tokens match but later content differs, how does OpenAI determine which parts are cached versus uncached?
- Why can adding dynamic content like timestamps early in a prompt destroy cache hits even if the rest of the prompt is identical?
- How does the prompt cache key change routing behavior, and what problem does it solve when many requests share only a partial prefix?
Key Points
1. Prompt caching reuses computation when requests share an identical input prefix, skipping reprocessing of already-seen tokens and only running inference for new content.
2. Caching eligibility starts at 1,024 tokens, then proceeds in 128-token blocks; cache hits require the exact same prefix content in the exact same order.
3. OpenAI hashes the first 256 prefix tokens to route requests to backend engines; matching cached 128-token chunks determines how much work can be reused.
4. The optional prompt cache key improves cache hit rates by influencing routing locality, especially when many requests share only the initial prefix but would otherwise be load-balanced across machines.
5. Long-context systems need a strategy: preserve stable prefix content, and use compaction/truncation (including retention ratio) to control prompt growth without invalidating caches every turn.
6. Endpoint choice matters for reasoning models: the Responses API preserves hidden chain-of-thought tokens that can otherwise reduce cache reuse.
7. Operational monitoring should track cached versus uncached input tokens to diagnose cache misses (e.g., prefix mismatches, time expiry, request routing, or context-window handling).
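The monitoring point above amounts to reading the usage block of each response. The Responses API reports input tokens and a cached-token breakdown; the helper below computes the hit rate from a usage payload shaped like that report (treated here as a plain dict, so the field names are the one assumption).

```python
def cache_hit_rate(usage: dict) -> float:
    """Fraction of input tokens served from cache, from a usage payload
    with `input_tokens` and `input_tokens_details.cached_tokens` fields
    (the shape reported by the Responses API)."""
    total = usage.get("input_tokens", 0)
    cached = usage.get("input_tokens_details", {}).get("cached_tokens", 0)
    return cached / total if total else 0.0

# e.g. a response reporting 2,048 of 4,096 input tokens served from cache
rate = cache_hit_rate({"input_tokens": 4096,
                       "input_tokens_details": {"cached_tokens": 2048}})
```

Logging this ratio per request makes cache regressions (a timestamp sneaking into the prefix, a routing change, cache expiry) show up immediately as a drop toward zero.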