Claude Prompt Caching: Did Anthropic Create a Better Alternative to RAG?
Based on All About AI's video on YouTube. If you enjoy this content, support the original creators by watching, liking, and subscribing.
Prompt Caching reuses a matched prompt prefix across API calls to reduce both latency and cost for long, repeated context.
Briefing
Anthropic’s new Prompt Caching for Claude is designed to cut both cost and latency by reusing frequently used prompt context across API calls—an approach that could rival parts of RAG when the “retrieval” content is stable and repetitive. In public beta, developers can mark a prompt prefix for caching; when a later request shares that same prefix, the API serves the cached version instead of reprocessing the entire input. Anthropic positions the payoff as up to 90% lower cost and up to 85% lower latency for long prompts, with a cache lifetime that refreshes on use and expires after a short window (described as about five minutes).
The practical setup centers on a “cache control” block placed near the beginning of the prompt. In the example walkthrough, a large text (Harry Potter and the Chamber of Secrets) is loaded from a file, and the system is instructed to cache the book content using the cache control parameter. The caching logic is straightforward: the API checks whether the prompt prefix already exists from a recently processed request; if it does, it reuses the cached prefix, reducing processing time and cost. If it doesn’t, the full prompt is processed once and the prefix is cached for future calls. The cache cannot be manually cleared in the current beta; it refreshes on each use and expires after about five minutes of inactivity, which matters for testing and for production scheduling.
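To make that setup concrete, here is a minimal sketch using Anthropic's Python SDK as it worked during the public beta. The model name, beta header, and file path are illustrative assumptions based on the walkthrough, and the exact header or parameter names may differ in the current API.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Load the large, stable context once (the walkthrough uses the book text).
with open("chamber_of_secrets.txt") as f:
    book_text = f.read()

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # model used in the walkthrough
    max_tokens=1024,
    # Beta header required while Prompt Caching was in public beta (assumption:
    # verify the current header name against Anthropic's docs).
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
    system=[
        {"type": "text", "text": "You are a helpful literary assistant."},
        {
            "type": "text",
            "text": book_text,
            # Marks the prompt prefix up to and including this block as cacheable.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "Summarize this book."}],
)
print(response.content[0].text)
```

Any later request whose prefix (everything up to the cache_control marker) matches the cached content can then be served from the cache instead of being reprocessed.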
Pricing details reinforce why caching targets long, repeated context. Writing a prefix to the cache is billed at a premium over base input tokens, but reading it back on later requests is far cheaper, so the overall economics improve once the cached portion is reused enough times. In the transcript’s example numbers for Claude 3.5 Sonnet, base input is $3 per million tokens while a cache write is $3.75 per million tokens; reusing the cached content, however, drops the effective cost of the repeated input to roughly 10% of the base input rate, and output pricing stays the same.
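A quick back-of-the-envelope calculation shows why reuse frequency is what matters. The $0.30 cache-read rate below is an assumption derived from the transcript's "roughly 10% of base input" figure, not a quoted price.

```python
# Prices per million tokens, from the transcript's Claude 3.5 Sonnet example.
BASE_INPUT = 3.00    # normal input tokens
CACHE_WRITE = 3.75   # first request: writing the prefix to the cache
CACHE_READ = 0.30    # later requests: assumed ~10% of base input

cached_tokens = 138_000  # roughly the size of the cached book in the demo

def prefix_cost(n_requests: int, cached: bool) -> float:
    """Dollar cost of (re)sending the repeated prefix across n requests."""
    per_million = cached_tokens / 1_000_000
    if not cached:
        return n_requests * BASE_INPUT * per_million
    # One cache write on the first request, cache reads on every later one.
    return (CACHE_WRITE + (n_requests - 1) * CACHE_READ) * per_million

for n in (1, 2, 5, 10):
    print(f"{n:>2} requests: uncached ${prefix_cost(n, False):.2f}, "
          f"cached ${prefix_cost(n, True):.2f}")
```

On these assumptions, caching already pays for itself on the second request ($0.56 vs. $0.83 for the repeated 138k-token prefix), and the savings compound from there.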
The walkthrough then tests two concrete use cases. First, it summarizes the Harry Potter book and measures timing: the initial run takes about 21 seconds and caches roughly 138,000 input tokens. A second run of the same request drops to around 8 seconds with no additional cached input tokens, showing the latency benefit of reusing the cached prefix. A “needle-in-a-haystack” style query (asking about a specific sentence buried in the book) is then run after the cache window has likely expired; the cached token count appears again as ~138,000, which would be consistent with a fresh cache write after expiry, yet the runtimes are much faster (around 2–3 seconds), and the transcript notes some confusion about the exact behavior.
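A sketch of how one might reproduce those measurements follows. The usage field names cache_creation_input_tokens and cache_read_input_tokens follow the beta's response schema as described publicly, but treat them as assumptions to verify against current docs; system_blocks stands for the cached system prompt built above.

```python
import time

def timed_request(prompt: str):
    """Send a request against the cached prefix, report timing and cache usage."""
    start = time.time()
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
        system=system_blocks,  # same blocks as above, including cache_control
        messages=[{"role": "user", "content": prompt}],
    )
    usage = response.usage
    print(f"{time.time() - start:.1f}s  "
          f"cache_write={getattr(usage, 'cache_creation_input_tokens', 0)}  "
          f"cache_read={getattr(usage, 'cache_read_input_tokens', 0)}")
    return response

timed_request("Summarize this book.")  # first call: expect a large cache write
timed_request("Summarize this book.")  # repeat within ~5 min: expect a cache read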
Second, the transcript demonstrates caching a smaller but still structured prompt: a dataset of 40 example YouTube comments used to mimic a specific response style. The cached portion is about 3,600 tokens, and the measured runtime is roughly 3.2 seconds both times, suggesting limited speedup for smaller prompts but potential cost savings at larger scales (e.g., hundreds of examples). Overall, Prompt Caching is framed as a promising alternative or complement to RAG when the “context” is long and stable, especially as Anthropic’s slower models (like Claude 3.5 Opus, mentioned as upcoming) could benefit most from reduced latency through reuse.
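The same pattern applies to the comment-style use case; only the cached block changes, from a book to an example set. Here comment_examples is a hypothetical list holding the 40 example YouTube comments, and the model name and header carry over the same assumptions as above.

```python
# Hypothetical: comment_examples is a list of the 40 example YouTube comments.
examples_text = "\n\n".join(comment_examples)

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=512,
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
    system=[
        {"type": "text",
         "text": "Reply to new YouTube comments in the style of these examples:"},
        {"type": "text",
         "text": examples_text,
         # Only ~3,600 tokens here, so the latency win is small, but the
         # same block scales to hundreds of examples, where caching pays off.
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user",
               "content": "New comment: Great breakdown, will this work with Opus?"}],
)
print(response.content[0].text)
```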
Cornell Notes
Anthropic’s Prompt Caching lets developers reuse a prompt prefix across API calls, so Claude doesn’t have to reprocess the same long context every time. Developers place a cache control block near the start of the prompt; later requests that share that prefix can hit the cache and run faster and cheaper. In the transcript’s tests with Claude 3.5 Sonnet, caching a large book (~138,000 input tokens) reduced runtime from about 21 seconds to about 8 seconds on a repeated request. A “needle-in-a-haystack” query also worked quickly, and caching a 40-example YouTube comment dataset showed similar runtimes but would matter more for larger example sets. The cache is time-limited (about five minutes) and currently can’t be manually cleared, so reuse timing is key.
- How does Prompt Caching decide whether it can reuse prior work?
- What does the cache control block do in practice?
- Why can caching reduce cost even if cached tokens cost more per token?
- What timing constraints affect cache hits?
- When does caching help most: small prompts or long, repetitive context?
Review Questions
- What prompt elements should be placed inside a cache control block to maximize cache hits and performance gains?
- How do cache lifetime and the inability to manually clear the cache influence how you would schedule repeated API calls?
- Using the pricing logic described, under what reuse frequency would caching become cost-effective for long prompts?
Key Points
1. Prompt Caching reuses a matched prompt prefix across API calls to reduce both latency and cost for long, repeated context.
2. Cache control blocks are placed near the start of the prompt to mark which prefix content should be cached.
3. Cache hits occur when a later request shares the same cached prefix; otherwise the full prompt is processed and cached for future use.
4. Writing to the cache costs more per token than base input, but repeated reuse can make the effective cost of the repeated portion much lower.
5. The cache lifetime is short (about five minutes) and refreshes on use; manual cache clearing isn’t available in the beta.
6. Caching is most beneficial for long prompts with stable content such as books, system instructions, background knowledge, and frequently used tool definitions.
7. Smaller prompts may show limited latency improvement, but caching can still matter for cost when large example sets are reused many times.