Prompt caching is the single biggest cost lever in the Claude API. You mark a stable portion of your prompt as cacheable; the next call that reuses that exact prefix reads it from cache at roughly 10% of the normal input-token price. For long system prompts, large RAG context, or chat histories that pile up across a session, that translates to 70-90% off the input bill.
The mechanics matter though. Cache breakpoints are positional. The 5-minute TTL is short. Invalidation gotchas will burn you the first time. This guide covers how it actually works in May 2026, the price math, the code patterns, and when to reach for the 1-hour extended cache instead.
For where this fits in the broader cost picture, see our LLM gateway pillar. For OpenAI's equivalent, see How to reduce OpenAI API costs.
TL;DR
- Mark up to 4 stable prefixes in a prompt with cache_control: { type: "ephemeral" }. Subsequent calls within the TTL hit cache.
- 5-minute cache (default): write costs 1.25x base input price, read costs 10% of base.
- 1-hour cache (extended TTL): higher write multiplier, same low read price, useful for stable system prompts that get hit across a longer session window.
- Cache reads do not count toward your ITPM rate limit on current models, so caching is a rate-limit multiplier as well as a cost lever.
- Order matters: cacheable content goes before dynamic content. The cache is matched as a prefix, not a substring.
- Watch the response's cache_creation_input_tokens and cache_read_input_tokens fields to confirm hits.
What prompt caching actually does
A Claude request looks roughly like: system prompt, conversation history, current user message. Most of that is the same on the next call. The system prompt does not change. The conversation history is the same plus one new turn. Anthropic's API lets you mark up to 4 points in that prompt with cache_control. Each marked block becomes a cache breakpoint.
On the first call that creates the cache, you pay a slight premium (cache write). On any subsequent call within the TTL whose prompt has the same prefix up to that breakpoint, you pay roughly 10% of input price for the cached tokens (cache read). The model sees the same content; you pay much less.
A cache entry is keyed on the exact token sequence leading up to the breakpoint. Any change to that prefix, including whitespace, invalidates the cache.
The price math
For Sonnet 4.6, base input is $3 per million tokens (MTok). The 5-minute cache writes at $3.75/MTok and reads at $0.30/MTok. So a 20K-token system prompt that is hit 100 times within 5 minutes pays:
- Without cache: 100 calls x 20K x $3/MTok = $6.00
- With 5-min cache: 1 write (20K x $3.75/MTok = $0.075) + 99 reads (99 x 20K x $0.30/MTok = $0.594) = $0.669
That is roughly an 89% reduction on input tokens for the system prompt portion. Across all current model tiers, the same ratios hold:
| Model | Base Input | 5-min Write | 5-min Read |
|---|---|---|---|
| Opus 4.7 | $5/MTok | $6.25/MTok | $0.50/MTok |
| Sonnet 4.6 | $3/MTok | $3.75/MTok | $0.30/MTok |
| Haiku 4.5 | $1/MTok | $1.25/MTok | $0.10/MTok |
The write premium pays for itself almost immediately: the 25% write surcharge is far smaller than the 90% read discount, so anything that gets called more than once in a 5-minute window is a clear win.
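If you want to sanity-check that arithmetic against your own prompts, a small helper along these lines reproduces the numbers above. The 1.25x write and 0.10x read multipliers mirror the pricing table; the function and its defaults are illustrative, not part of any SDK.

```python
def caching_savings(prefix_tokens: int, calls: int, base_per_mtok: float,
                    write_mult: float = 1.25, read_mult: float = 0.10) -> tuple[float, float]:
    """Return (uncached cost, cached cost) for a prefix reused `calls` times in one TTL window."""
    mtok = prefix_tokens / 1_000_000
    uncached = calls * mtok * base_per_mtok
    cached = mtok * base_per_mtok * write_mult + (calls - 1) * mtok * base_per_mtok * read_mult
    return uncached, cached

# The Sonnet 4.6 example above: a 20K-token system prompt hit 100 times within the TTL.
uncached, cached = caching_savings(20_000, 100, base_per_mtok=3.00)
print(f"${uncached:.2f} uncached vs ${cached:.3f} cached")  # $6.00 vs $0.669
```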
Where to place cache breakpoints
You get up to 4 cache_control markers. The prompt prefix is assembled in a fixed order (tools, then system, then messages), and each marker caches everything up to its position, so think of the breakpoints as nested prefixes. The standard arrangement:
- Tool definitions (stable across the session)
- System prompt (the longest stable text)
- Conversation history up to the last user turn
- (Reserve) a fourth breakpoint if needed
Anything after the last breakpoint is treated as fresh input on every call.
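A minimal sketch of that arrangement with the Python SDK. The get_weather tool, the two-turn history, and the literal question are illustrative placeholders, not part of the examples that follow.

```python
import os
import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

SYSTEM_PROMPT = "..."  # long, stable system prompt
history = [            # prior turns, oldest first
    {"role": "user", "content": "earlier question"},
    {"role": "assistant", "content": "earlier answer"},
]

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=[
        {
            # hypothetical tool definition, stable across the session
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "input_schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
            # breakpoint 1: caches the tool definitions
            "cache_control": {"type": "ephemeral"},
        }
    ],
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            # breakpoint 2: caches tools + system prompt
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        *history[:-1],
        {
            "role": history[-1]["role"],
            "content": [
                {
                    "type": "text",
                    "text": history[-1]["content"],
                    # breakpoint 3: caches tools + system + history so far
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        # fresh input after the last breakpoint; varies freely on every call
        {"role": "user", "content": "What's the weather in Paris?"},
    ],
)
```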
Python example: system prompt caching
import os
import anthropic
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
SYSTEM_PROMPT = open("system_prompt.md").read() # imagine this is 15K tokens
def ask(user_message: str):
resp = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system=[
{
"type": "text",
"text": SYSTEM_PROMPT,
"cache_control": {"type": "ephemeral"},
}
],
messages=[{"role": "user", "content": user_message}],
)
print("cache_creation_input_tokens:", resp.usage.cache_creation_input_tokens)
print("cache_read_input_tokens:", resp.usage.cache_read_input_tokens)
print("input_tokens:", resp.usage.input_tokens)
return resp
ask("First question") # writes cache
ask("Second question") # reads cache

On the first call you should see cache_creation_input_tokens populated with the system prompt token count. On the second, cache_read_input_tokens carries that count and cache_creation_input_tokens is zero or near it.
TypeScript example: RAG context caching
The same pattern in TypeScript, applied to a long RAG context block. The context is large (say, 50K tokens of retrieved chunks) and reused for follow-up questions about the same document.
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
async function askWithContext(ragContext: string, userMessage: string) {
const resp = await client.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 1024,
system: [
{
type: "text",
text: "You are a research assistant. Answer based on the provided context.",
cache_control: { type: "ephemeral" },
},
],
messages: [
{
role: "user",
content: [
{
type: "text",
text: `<context>\n${ragContext}\n</context>`,
cache_control: { type: "ephemeral" },
},
{
type: "text",
text: userMessage,
},
],
},
],
});
console.log("cache_read:", resp.usage.cache_read_input_tokens);
console.log("cache_create:", resp.usage.cache_creation_input_tokens);
return resp;
}

Two breakpoints here: one on the system role, one on the RAG context. The user's actual question is outside both breakpoints, so it varies freely without disturbing the cache. Five follow-up questions on the same document pay for one cache write plus five cache reads on the 50K context block.
5-minute vs 1-hour cache
The default ephemeral cache TTL is roughly 5 minutes from last access. Each cache hit refreshes the timer, so an actively used cache can live longer than 5 minutes wall-clock, but a 6-minute gap will evict it.
The extended 1-hour cache is the right choice when:
- A user session is intermittent: questions arrive every 8-15 minutes, long enough for the 5-minute cache to expire between turns.
- You serve the same system prompt across many users and want shared cache stability.
- You run nightly evals that reuse the same context across many test cases over a longer window.
The write premium is higher for the 1-hour cache. Read price stays in the same ballpark (~10% of base input). The break-even is roughly: if the same prefix gets reused more than 5-7 times over an hour, with gaps long enough that the 5-minute cache would keep expiring, the 1-hour TTL pays off.
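For a rough feel of that trade-off, here is a back-of-envelope sketch. The 1-hour write multiplier is not quoted in the pricing table above, so the 2x figure below is an assumption; swap in the current rate before relying on it.

```python
# 20K-token prefix on Sonnet 4.6, one question every ~10 minutes for an hour
# (6 calls, every gap longer than 5 minutes).
PREFIX_MTOK = 0.02          # 20K tokens
BASE = 3.00                 # $/MTok, Sonnet 4.6 base input
ONE_HOUR_WRITE_MULT = 2.0   # assumed multiplier; not in the table above

# 5-minute cache: every call misses (the gap evicted it), so every call re-writes.
five_min = 6 * PREFIX_MTOK * BASE * 1.25

# 1-hour cache: one write, then five cheap reads.
one_hour = PREFIX_MTOK * BASE * ONE_HOUR_WRITE_MULT + 5 * PREFIX_MTOK * BASE * 0.10

print(f"5-min cache: ${five_min:.3f} vs 1-hour cache: ${one_hour:.3f}")  # $0.450 vs $0.150
```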
Specify it with the TTL field:
system=[
{
"type": "text",
"text": SYSTEM_PROMPT,
"cache_control": {"type": "ephemeral", "ttl": "1h"},
}
]

Cache invalidation gotchas
This is where most teams get burned the first time:
- Whitespace changes break the cache. A trailing newline difference between two clients invalidates the cache.
- Tool definitions count as part of the prefix. Reordering tools, renaming a parameter, even tweaking a description string busts the cache for everything after it.
- Image content can be cached, but only when bytes are identical. Re-encoded JPEGs from different clients differ at the byte level.
- Beta header changes. If you toggle a beta flag (extended thinking, computer use, etc.) on or off, you may invalidate the cache.
- The system field type matters. Passing system as a string vs an array of typed blocks is a different shape and produces different cache keys.
- Cache is per organization, not per API key. Multiple keys in the same org share cache. This is usually what you want.
A practical guardrail: log cache_read_input_tokens and cache_creation_input_tokens as separate metrics. If creation suddenly spikes while you expected reads, something just changed in your prefix.
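One way to wire that guardrail up, assuming the client from the Python example above; the wrapper name, the expect_hit flag, and the logger are ours, not part of the SDK.

```python
import logging

log = logging.getLogger("claude.cache")

def create_with_cache_metrics(expect_hit: bool = False, **kwargs):
    """Call messages.create, emit both cache counters, and warn on unexpected writes."""
    resp = client.messages.create(**kwargs)
    created = resp.usage.cache_creation_input_tokens or 0
    read = resp.usage.cache_read_input_tokens or 0
    log.info("cache_creation_input_tokens=%d cache_read_input_tokens=%d", created, read)
    if expect_hit and created > 0 and read == 0:
        # A write where a read was expected means the prefix just changed:
        # whitespace drift, reordered tools, a toggled beta flag, etc.
        log.warning("cache miss on a prefix that should have been warm (%d tokens re-written)", created)
    return resp
```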
Caching across an agent loop
For agents that loop (tool call, observe, tool call, observe), each iteration appends one more turn. The cache pattern that wins: put a cache_control on the second-to-last turn, so each new iteration writes a small delta and reads the much larger prefix.
def chat_step(history, user_input):
# cache the prior history (stable), append fresh turn outside cache
messages = []
for i, turn in enumerate(history):
block = {"role": turn["role"], "content": turn["content"]}
# mark the last turn of history as the cache breakpoint
if i == len(history) - 1:
block["content"] = [
{
"type": "text",
"text": turn["content"],
"cache_control": {"type": "ephemeral"},
}
]
messages.append(block)
messages.append({"role": "user", "content": user_input})
return client.messages.create(
model="claude-sonnet-4-6",
max_tokens=2048,
messages=messages,
)

Each iteration recreates the cache one step further along the conversation. A 20-step agent loop on a 1M-context Claude run can pay 90% off on the bulky prefix.
Caching plus the 1M context window
Opus and Sonnet 4.6+ support a flat-rate 1M context window. Without caching, a 500K-token prompt every turn would shred your budget. With caching, the 500K is paid once at the write rate then read cheaply for the rest of the session. Caching is what makes the 1M context window financially viable for anything beyond a one-shot question.
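To put rough numbers on that, using the Sonnet 4.6 rates from the pricing table and the flat-rate framing above: a 500K-token prefix reused across a 10-turn session costs roughly a fifth of the uncached figure.

```python
# Back-of-envelope: 500K-token prefix on Sonnet 4.6 over a 10-turn session,
# using the rates quoted earlier (illustrative only; check current pricing).
PREFIX_MTOK = 0.5   # 500K tokens
BASE = 3.00         # $/MTok base input

uncached = 10 * PREFIX_MTOK * BASE                                   # $15.00 of prefix input
cached = PREFIX_MTOK * BASE * 1.25 + 9 * PREFIX_MTOK * BASE * 0.10   # $1.875 write + $1.35 reads
print(f"${uncached:.2f} uncached vs ${cached:.3f} cached")           # $15.00 vs $3.225
```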
A gateway makes caching observable
Anthropic's API returns cache metrics per call. A gateway that captures every call rolls those up into per-feature dashboards: cache hit rate, dollars saved by caching, drift events (sudden creation spikes signaling a prompt change). See LLM Observability for the broader telemetry picture.
FAQ
How much does Claude prompt caching cost? Reads cost roughly 10% of base input price. Writes cost 1.25x base input for the 5-minute cache. Net savings are 70-90% on cached portions when you have repeat traffic.
How long does the cache last? The default ephemeral cache is roughly 5 minutes from last access. The 1-hour cache lasts roughly 60 minutes from last access. Both refresh on hit.
How many cache breakpoints can I have?
Up to 4 cache_control markers per request.
Why is my cache_read always zero? Most common causes: whitespace difference between calls, tool definitions reordered, system content type mismatch (string vs typed array), or the 5-minute TTL elapsed.
Do cached tokens count toward rate limits?
Not for current models. cache_read_input_tokens are excluded from ITPM. This is a free rate-limit multiplier on top of the cost savings. See Anthropic API rate limits.
Can I cache tool use definitions? Yes. Tool definitions are part of the prompt prefix; place a breakpoint after them to cache.
Should I use 5-min or 1-hour caching? Start with 5-min. Switch to 1-hour for system prompts shared across many users or sessions that span longer windows. The 1-hour write premium is only worth it if you will get more than 5-7 reads.