
Build Hour: Prompt Caching

OpenAI · 6 min read

Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Prompt caching reuses computation when requests share an identical input prefix, skipping reprocessing of already-seen tokens and only running inference for new content.

Briefing

Prompt caching is positioned as a straightforward way to cut both latency and cost in OpenAI-powered applications by reusing computation whenever new requests share the same input prefix. The core mechanism is compute reuse: when multiple requests begin with identical content (the “prefix”), the system skips reprocessing tokens it has already handled and only runs inference for the new portion. That matters most for long prompts, high-volume workloads, and agentic systems where conversations and tool traces repeat large chunks turn after turn.

Caching eligibility starts at 1,024 tokens: shorter prompts don't qualify, and once a request crosses the threshold, caching proceeds in 128-token blocks. A key constraint is contiguity and order: cache hits require the exact same prefix content in the same sequence. OpenAI describes this as implicit prompt caching across modalities (text, images, audio, and more), triggered automatically on the Responses API and certain other endpoints, without code changes.
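As a rough eligibility check, here's a minimal client-side sketch, assuming tiktoken's o200k_base encoding approximates the server-side tokenizer (exact counts and block boundaries are determined server-side):

```python
# Rough client-side estimate of caching eligibility; actual token counts
# and block boundaries are decided by the server, not this code.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def cache_estimate(prompt: str) -> dict:
    n = len(enc.encode(prompt))
    eligible = n >= 1024  # prompts below the threshold never cache
    # Past the threshold, caching proceeds in 128-token blocks, so only
    # whole blocks of the prefix can ever be reused.
    cacheable = (n // 128) * 128 if eligible else 0
    return {"tokens": n, "eligible": eligible, "max_cacheable_tokens": cacheable}

print(cache_estimate("You are a helpful assistant. " * 200))
```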

The financial impact is tied to cache hit rate. OpenAI introduced steep discounts for cached input tokens across model families (with larger discounts for newer families), and the session highlights that audio caching on the speech-to-speech model can reach nearly 99% discount on cached audio tokens. Latency gains depend on prompt length: a benchmark with variable-length inputs (from 1,024 up to 200,000 tokens) shows cached requests keeping time-to-first-token relatively stable, while uncached requests slow sharply as inputs grow. The practical takeaway is that the longer the input, the more caching helps time-to-first-token.

Under the hood, OpenAI hashes the first 256 tokens of the prefix and uses that hash to route requests to the right backend engine. Engines have capacity limits (about 15 requests per minute), so traffic distribution can trade off cache hit rate versus overall system health. When a request arrives, the engine checks how many subsequent 128-token chunks match what it has already cached; inference runs only from the first mismatched token onward. After generating output, the system updates the cache so future requests can reuse the intermediate key-value tensors (not raw tokens or media).
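A toy model of that matching step might look like the following. This illustrates the described behavior, not OpenAI's implementation; in particular, the real cache stores intermediate key-value tensors, whereas this sketch compares raw token IDs:

```python
# Toy model of routing + chunk matching. Tokens are plain ints here;
# the real system caches key-value tensors, not tokens.
import hashlib

BLOCK = 128           # cache granularity in tokens
ROUTING_PREFIX = 256  # tokens hashed to pick a backend engine

def routing_hash(tokens: list[int]) -> str:
    """Hash of the first 256 tokens, used to route to an engine."""
    return hashlib.sha256(str(tokens[:ROUTING_PREFIX]).encode()).hexdigest()

def first_uncached_token(request: list[int], cached: list[int]) -> int:
    """Index where inference must start: the first token of the first
    128-token block that differs from what the engine has cached."""
    i = 0
    while i + BLOCK <= min(len(request), len(cached)):
        if request[i:i + BLOCK] != cached[i:i + BLOCK]:
            break
        i += BLOCK
    return i  # everything before i is reused; everything after is computed

cached = list(range(1024))
request = list(range(1024))
request[300] = -1  # mutate one token inside the third block
print(routing_hash(request)[:12])            # same routing hash as before
print(first_uncached_token(request, cached))  # 256: first two blocks reused
```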

Several developer tactics are presented to maximize cache hits. The optional prompt cache key is emphasized as a way to improve routing locality when many requests share only part of the prefix; it hashes the prefix plus the key so related requests land on the same machine. Demos show that accidentally breaking the cache—such as inserting timestamps or dynamic whitespace early in the prompt—can drive cache hit rates to 0%, while using prompt cache keys can materially raise hit rates (and reduce input token cost) even when latency changes are small.
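A hedged sketch of what that looks like in practice, assuming the openai Python SDK's Responses API and the prompt_cache_key request parameter described in the session:

```python
# Sketch of the prompt cache key on the Responses API; parameter names
# follow the session's description and may vary by SDK version.
from openai import OpenAI

client = OpenAI()

STATIC_SYSTEM_PROMPT = "You are a code-review assistant. ..."  # identical every call

resp = client.responses.create(
    model="gpt-4.1",
    instructions=STATIC_SYSTEM_PROMPT,  # stable prefix goes first
    input="Review this diff: ...",      # variable content goes last
    prompt_cache_key="code-review-v1",  # groups related requests on one engine
)
# Note: no timestamps or random IDs anywhere near the start of the prefix.
```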

Context engineering is treated as a balancing act: shaping prompts dynamically can improve task quality, but it can invalidate cached prefixes. Newer compaction options (including server-side compaction and a responses/compact endpoint) help manage long-running contexts by summarizing or trimming history before it grows too large. For real-time speech-to-speech use, a retention ratio parameter is described as a way to avoid truncating every turn and instead make larger cuts less frequently.
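Server-side compaction and the responses/compact endpoint handle this for you; as an illustration of the trade-off, here is a client-side analogue with hypothetical count_tokens and summarize helpers:

```python
# Client-side analogue of compaction, with hypothetical helpers -- the
# shape of the trade-off matches the session's description.
MAX_CONTEXT_TOKENS = 200_000
COMPACT_AT = 0.8  # compact when 80% full, not on every turn

def maybe_compact(history: list[dict], count_tokens, summarize) -> list[dict]:
    """Summarize older turns in one big cut so the stable prefix stays
    cacheable between compactions, instead of trimming every turn."""
    if count_tokens(history) < COMPACT_AT * MAX_CONTEXT_TOKENS:
        return history  # untouched -> prefix stays identical -> cache hits
    keep = history[-10:]                # recent turns survive verbatim
    summary = summarize(history[:-10])  # one model call or heuristic
    return [{"role": "system", "content": summary}] + keep
```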

The session also adds operational guidance: use the Responses API with reasoning models to preserve hidden chain-of-thought tokens whose loss otherwise reduces cache reuse; consider flex processing for asynchronous workloads; and use allowed tools to change tool access without invalidating the cache. A customer spotlight from Warp frames prompt caching as essential for agentic loops, where system prompts, tool definitions, and earlier turns repeat, then describes Warp's "scopes" approach (global, user, task) and its use of prompt cache keys to more than double cache hit rates. The closing message: prompt caching carries no direct intelligence trade-off when prefixes remain identical; the only real trade-off is architectural, in how prompts and context are structured so caching remains effective.
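A sketch combining two of those levers, assuming the Responses API's service_tier parameter and the allowed_tools tool_choice shape; exact field names may differ from your SDK version:

```python
# Sketch: flex processing plus allowed tools, so tool access changes
# without editing the tool list (which would invalidate the prefix).
from openai import OpenAI

client = OpenAI()

ALL_TOOLS = [  # the full tool list stays constant across requests
    {"type": "function", "name": "read_file",
     "parameters": {"type": "object", "properties": {"path": {"type": "string"}}}},
    {"type": "function", "name": "run_tests",
     "parameters": {"type": "object", "properties": {}}},
]

resp = client.responses.create(
    model="gpt-5",
    input="Summarize the failing tests.",
    service_tier="flex",  # cheaper, slower tier for async workloads
    tools=ALL_TOOLS,      # unchanged -> cached prefix stays valid
    tool_choice={         # restrict access for this call only
        "type": "allowed_tools",
        "mode": "auto",
        "tools": [{"type": "function", "name": "read_file"}],
    },
)
```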

Cornell Notes

Prompt caching reuses computation when new requests share the same input prefix, skipping full reprocessing of tokens already handled and only running inference on the new portion. Caching eligibility begins at 1,024 tokens, then proceeds in 128-token blocks, and cache hits require the exact same prefix content in the same order. OpenAI's system hashes the first 256 prefix tokens to route requests to backend engines, where matching 128-token chunks determine how much work can be reused. Developers can raise cache hit rates by using the optional prompt cache key (to improve routing locality), avoiding cache-breaking dynamic content early in prompts, and managing long contexts with compaction or retention strategies. For agentic workloads like Warp's coding agents, consistent system prompts, tools, and stable "scopes" make prompt caching a major lever for cutting cost and improving time-to-first-token.

What exactly must match for a prompt caching hit to occur?

Cache hits require a contiguous prefix that is identical in both content and order. OpenAI describes the prefix as the inputs sent before the model’s new content—such as system prompt text, images, audio, and prior chat messages. Caching begins once the request reaches 1,024 tokens; then it caches in 128-token blocks. If even early prefix content changes (for example, inserting a timestamp or adding dynamic whitespace near the start), the prefix hash no longer matches and cache hit rate can collapse to 0%.
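For illustration, the same dynamic value placed at opposite ends of the prompt (SYSTEM_INSTRUCTIONS and user_message are stand-ins):

```python
from datetime import datetime, timezone

SYSTEM_INSTRUCTIONS = "You are a support agent. ..."  # long, identical every call
user_message = "Where is my order?"
now = datetime.now(timezone.utc).isoformat()

# Cache-breaking: dynamic content lands in the first tokens of the prefix,
# so every request hashes differently and hit rate collapses to 0%.
bad_prompt = f"Current time: {now}\n{SYSTEM_INSTRUCTIONS}\n{user_message}"

# Cache-friendly: the long static block leads; anything dynamic trails it.
good_prompt = f"{SYSTEM_INSTRUCTIONS}\n{user_message}\nCurrent time: {now}"
```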

How does OpenAI decide where a request can reuse cached computation?

When a request reaches the backend, OpenAI computes a hash of the first 256 tokens of the prefix and uses it in the routing layer. That hash (optionally combined with the prompt cache key) helps select the right engine. Each engine has limited throughput (about 15 requests per minute), so traffic may be distributed for engine health, which can reduce cache hit rate. After routing, the engine compares cached 128-token chunks until the first mismatched token; everything after that point is treated as uncached and must be processed.

Why does the prompt cache key matter if caching is “implicit”?

Implicit caching happens automatically, but cache reuse depends on where requests land. OpenAI hashes the first 256 tokens to route; if many requests share only that common portion, load balancing can spread them across machines, lowering reuse. The prompt cache key adds intentional grouping by hashing the prefix plus the key, improving the chance that related requests route to the same engine. A coding customer cited in the session saw cache hit rate jump from 60% to 87% after adopting a prompt cache key.

How should developers think about context engineering without destroying cache reuse?

Context engineering reshapes what the model sees (for example, trimming, summarizing, or changing tool instructions), which can invalidate cached prefixes. The session frames them as inherently at odds: caching requires identical prefixes, while context engineering often introduces variation. The practical approach is to decide when intelligence gains from updated context are worth the cache invalidation, using compaction/truncation strategies to keep prompts within context limits while preserving as much stable prefix content as possible.

What role do compaction and retention ratio play in long-running sessions?

Compaction manages prompt growth by summarizing or trimming earlier context so the model continues with a smaller, curated representation. OpenAI mentions server-side compaction and a standalone responses/compact endpoint that returns an encrypted compaction. For real-time speech-to-speech, the session highlights a retention ratio parameter (e.g., a retention ratio of 0.7 means that when the session nears the context window, about 30% of the history is removed in one cut and 70% is retained). This reduces how often cache-breaking truncation happens compared with naive truncation every turn.
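The arithmetic is simple; a toy sketch, assuming a 0.7 ratio as in the session's example (the real parameter is a server-side setting, not this client-side function):

```python
# Toy illustration of retention-ratio truncation: one large cut instead
# of trimming every turn, so the cached prefix survives longer.
def truncate_with_retention(turns: list, retention_ratio: float = 0.7) -> list:
    """When the context window is nearly full, drop the oldest
    (1 - retention_ratio) of turns in a single large cut."""
    keep_from = int(len(turns) * (1 - retention_ratio))
    return turns[keep_from:]

print(truncate_with_retention(list(range(10))))  # drops the 3 oldest, keeps 7
```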

What endpoint and reasoning-model details affect cache hit rates?

For reasoning models, hidden chain-of-thought tokens are only carried forward when using the Responses API; Chat Completions drops them. That difference can reduce cache reuse across turns because the model's hidden reasoning tokens aren't persisted. The session claims switching to the Responses API with reasoning models can increase cache hit rate from 40% to 80%.
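A minimal sketch of chaining turns so reasoning items persist server-side, assuming the Responses API's previous_response_id parameter with stored responses:

```python
# Sketch: chained Responses API turns, so hidden reasoning tokens stay
# part of the prefix instead of being dropped between turns.
from openai import OpenAI

client = OpenAI()

first = client.responses.create(
    model="o4-mini",
    input="Plan the refactor of the auth module.",
)

# The follow-up references the stored response, preserving reasoning items.
second = client.responses.create(
    model="o4-mini",
    previous_response_id=first.id,
    input="Now apply step 1.",
)
```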

Review Questions

  1. If a request’s first 256 prefix tokens match but later content differs, how does OpenAI determine which parts are cached versus uncached?
  2. Why can adding dynamic content like timestamps early in a prompt destroy cache hits even if the rest of the prompt is identical?
  3. How does the prompt cache key change routing behavior, and what problem does it solve when many requests share only a partial prefix?

Key Points

  1. Prompt caching reuses computation when requests share an identical input prefix, skipping reprocessing of already-seen tokens and only running inference for new content.

  2. Caching eligibility starts at 1,024 tokens, then proceeds in 128-token blocks; cache hits require the exact same prefix content in the exact same order.

  3. OpenAI hashes the first 256 prefix tokens to route requests to backend engines; matching cached 128-token chunks determines how much work can be reused.

  4. The optional prompt cache key improves cache hit rates by influencing routing locality, especially when many requests share only the initial prefix but would otherwise be load-balanced across machines.

  5. Long-context systems need a strategy: preserve stable prefix content, and use compaction/truncation (including retention ratio) to control prompt growth without invalidating caches every turn.

  6. Endpoint choice matters for reasoning models: the Responses API preserves hidden chain-of-thought tokens whose loss can otherwise reduce cache reuse.

  7. Operational monitoring should track cached versus uncached input tokens to diagnose cache misses (e.g., prefix mismatches, time expiry, request routing, or context-window handling).
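For the monitoring point above, a minimal sketch that reads the cached-token counts the Responses API reports per request (LONG_PROMPT is a stand-in):

```python
# Per-request cache telemetry from the Responses API usage fields.
from openai import OpenAI

client = OpenAI()
LONG_PROMPT = "You are a helpful assistant. " * 200  # comfortably past 1,024 tokens

resp = client.responses.create(model="gpt-4.1", input=LONG_PROMPT)

cached = resp.usage.input_tokens_details.cached_tokens
total = resp.usage.input_tokens
print(f"cache hit rate: {cached / total:.0%} ({cached}/{total} input tokens)")
```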

Highlights

Caching begins only after prompts reach 1,024 tokens and proceeds in 128-token blocks, so a prompt just past the threshold already unlocks meaningful savings.
A timestamp or other dynamic content inserted near the start of the prefix can drive cache hit rate to 0%, turning a cost-saving feature into a cost multiplier.
The prompt cache key is less about changing what gets cached and more about ensuring requests land on the same backend engine to preserve reuse.
For reasoning models, using the Responses API can materially increase cache hit rates because hidden chain-of-thought tokens are handled differently than with Chat Completions.
Compaction and retention ratio help manage long-running agent contexts by reducing how frequently cache-breaking truncation happens.

Topics

  • Prompt Caching
  • Cache Hit Rate
  • Prefix Hashing
  • Context Compaction
  • Agentic Loops
