Prompt caching is the single highest-ROI optimization available for production LLM applications. If you run 10,000 requests per day with an 8K-token cached system prompt on Anthropic Claude, you save roughly $576/month — with a few lines of code change. OpenAI’s automatic caching requires zero code changes and gives you a 50% discount on repeated input tokens. Anthropic’s explicit caching offers up to 90% savings. This guide covers both, plus Gemini, with production code examples, real cost numbers, and the anti-patterns that silently destroy your cache hit rate.
How Prompt Caching Works: KV Cache, Prefix Matching, and Why Order Matters
Prompt caching works by storing the key-value (KV) computation for a prefix of your prompt in GPU memory, then reusing those stored activations for subsequent requests that share the same prefix. When your request arrives, the provider checks whether the incoming prompt’s beginning matches a cached prefix. If it does — a cache hit — the model skips recomputing that prefix and starts generating immediately. Hugging Face technical analysis measured roughly a 5.21x speedup on T4 GPUs from KV cache reuse alone. The cost reduction follows the same logic: you pay a lower rate for cached input tokens because the provider doesn’t need to run full inference on that portion of the prompt.
Why order matters critically: Prefix matching is exact and sequential. If your prompt reads system → context → user query, the cache key covers everything from the start up to your designated breakpoint. Change anything before the breakpoint — even a single character — and the entire cached prefix is invalidated. This means timestamps, session IDs, or user-specific data embedded early in your prompt will kill your cache hit rate entirely. The universal rule: place static content first, dynamic content last. Tools definitions → system instructions → document context → few-shot examples → current conversation history → user query. This ordering directly determines your API bill.
Minimum token requirements vary by provider: Anthropic requires at least 1,024 tokens in the cached prefix; OpenAI caches in 128-token increments with a 1,024-token minimum. Short prompts below these thresholds simply don’t qualify for caching and should be excluded from your optimization planning.
Provider Comparison: OpenAI vs Anthropic vs Gemini
Prompt caching is now supported by all three major LLM providers — OpenAI, Anthropic, and Google Gemini — but they implement it in fundamentally different ways with meaningfully different economics. OpenAI’s caching is fully automatic: you write no special code, the API detects repeated prefixes, and you see a 50% discount on cached tokens with no TTL configuration available. Anthropic gives you the highest savings rate at 90% but requires explicit cache_control markers (simplified significantly by the February 2026 automatic caching update). Gemini sits between the two, offering implicit automatic caching for Gemini 2.5 models and named cache objects for explicit control with configurable TTL. Choosing between providers comes down to your optimization priorities: zero-friction savings (OpenAI), maximum cost reduction with fine-grained control (Anthropic), or configurable persistence for document-heavy workloads (Gemini). Most teams using Anthropic as their primary provider see the February 2026 changes as a reason to migrate previously-uncached workflows — the implementation barrier dropped significantly.
| Feature | OpenAI | Anthropic | Gemini |
|---|---|---|---|
| Caching type | Automatic | Automatic + Explicit | Implicit + Explicit |
| Cost savings | 50% on input | 90% on input | ~90% on input |
| TTL | 5–10 min | 5 min or 1 hour | Configurable |
| Minimum tokens | 1,024 (128-token increments) | 1,024 | Varies |
| Code changes required | None | Minimal (cache_control) | Named cache objects |
| Control granularity | None (auto) | Up to 4 breakpoints | Named cache objects |
| 2026 update | GPT-5.1: 24h retention | Feb 2026: auto caching | Gemini 2.5 implicit caching |
OpenAI Prompt Caching: Automatic, Zero-Config
OpenAI prompt caching is automatic and requires zero code changes — the API detects repeated input prefixes and applies a 50% discount on cached input tokens automatically. You don’t set any flags; you just observe the discount in your usage dashboard and billing. The GPT-5.1 series introduced 24-hour cache retention, making it viable for system prompts used across long workdays or batch pipelines that span multiple processing windows. Cache hits appear in the usage object of the API response as cached_tokens, so you can monitor performance without any instrumentation changes.
OpenAI caches in 128-token increments, meaning your cached prefix must be at least 1,024 tokens and matches extend in 128-token steps. A 1,100-token prefix gets cached at 1,024 tokens, with the remaining 76 tokens billed at full price. This granularity matters for borderline cases but rarely affects the economics of real system prompts, which typically run 2,000–10,000 tokens.
from openai import OpenAI
client = OpenAI()
# No special configuration needed — caching is automatic
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
# Static system prompt (1024+ tokens for caching eligibility)
"content": "You are an expert software engineer specializing in Python..."
# ... (long static content)
},
{
"role": "user",
"content": user_query # Dynamic — place last
}
]
)
# Check cache hit in response
cached_tokens = response.usage.prompt_tokens_details.cached_tokens
print(f"Cached tokens: {cached_tokens}")
The tradeoff vs Anthropic: OpenAI’s automatic approach is the right choice for teams that want savings with zero engineering overhead. You get 50% off repeated input tokens with no prompt restructuring. The downside is loss of control — you can’t force specific breakpoints, can’t choose TTL, and can’t target multiple cache boundaries within a single prompt. For high-volume applications where every dollar matters, Anthropic’s 90% savings on cache reads typically justifies the additional implementation work.
Anthropic Prompt Caching: 90% Savings with Explicit Breakpoints
Anthropic prompt caching delivers up to 90% cost reduction on cached input tokens, the highest discount available from any major provider in 2026. Cache reads for Claude Sonnet 4.5 cost $0.30/1M tokens versus $3.00/1M for standard input — exactly a 10x reduction. The February 2026 automatic caching update simplified implementation significantly: a single top-level cache_control marker now causes the API to auto-place the breakpoint on the last cacheable block, eliminating the need to annotate every section individually. For most use cases, this single-marker approach is sufficient.
For fine-grained control, Anthropic supports up to 4 explicit cache breakpoints per prompt. Automatic caching consumes 1 of those 4 slots — adding automatic caching plus 4 explicit breakpoints triggers a 400 error. The cache invalidation hierarchy is tools → system → messages: changing anything earlier in this chain invalidates caches for everything that follows. Place your least-changing content at the top (tool definitions), most-changing content at the bottom (current user message).
5-minute vs 1-hour TTL: Choose based on request cadence, not preference. If requests arrive more than every 5 minutes on average, 1-hour TTL pays for itself immediately — you pay 2x base input price on writes instead of 1.25x, but cache reads stay at 0.1x for both. The 1-hour write premium recovers after just 2 cache hits. If your traffic is bursty with long idle gaps, 5-minute TTL may be more economical. One team learned this the hard way: a library update silently changed their TTL from 1-hour to 5-minutes, causing a $13.86/day bill increase before anyone noticed.
import anthropic
client = anthropic.Anthropic()
# February 2026 approach: single cache_control at top level (auto places breakpoint)
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=1024,
system=[
{
"type": "text",
"text": "You are an expert software engineer...",
# This triggers automatic cache placement on the last cacheable block
"cache_control": {"type": "ephemeral"}
}
],
messages=[
{"role": "user", "content": user_query}
]
)
# Monitor cache usage
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
Multi-turn conversation caching: In multi-turn chat, Anthropic’s automatic caching advances the cache breakpoint forward as the conversation grows — without requiring you to update cache_control markers manually. The 20-block lookback window limits how far back the provider searches for matching prefixes. Keep your conversation history compaction logic in sync with this window to avoid unnecessary cache misses in very long conversations.
Explicit Multi-Breakpoint Example
# For fine-grained control: multiple explicit breakpoints
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=1024,
system=[
{
"type": "text",
"text": "You are an expert software engineer...",
"cache_control": {"type": "ephemeral"} # Breakpoint 1: system prompt
}
],
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": large_document_context, # Your reference docs
"cache_control": {"type": "ephemeral"} # Breakpoint 2: context
},
{
"type": "text",
"text": user_query # Dynamic — no cache_control
}
]
}
]
)
Gemini Prompt Caching: Implicit Caching and Named Cache Objects
Gemini prompt caching operates through two mechanisms: implicit caching (where the API automatically detects and reuses repeated content) and explicit named cache objects for precise control. Gemini 2.5 expanded implicit caching capabilities, making it the most hands-off option for teams already using Google’s infrastructure. Named cache objects persist across requests with configurable TTL, behaving more like a traditional database cache than the prefix-matching approach used by OpenAI and Anthropic. Savings are approximately 90% on cached content, comparable to Anthropic’s rates.
The named cache approach works well for RAG pipelines that repeatedly query the same knowledge base — you cache the document corpus once, assign it a cache ID, and reference that ID in subsequent requests rather than retransmitting the full content. This makes Gemini caching particularly well-suited for document Q&A applications where the reference material doesn’t change between queries.
import google.generativeai as genai
genai.configure(api_key="YOUR_API_KEY")
# Create a named cache for long-lived content
cache = genai.caching.CachedContent.create(
model="gemini-2.5-flash",
contents=[
{
"role": "user",
"parts": [{"text": large_document_context}]
}
],
ttl="3600s" # 1-hour TTL
)
# Reference the cache in subsequent requests
model = genai.GenerativeModel.from_cached_content(cache)
response = model.generate_content(user_query)
# Clean up when done
cache.delete()
Production Cost Calculator: Real Dollar Amounts
Prompt caching economics depend on three variables: prompt length (in tokens), daily request volume, and cache hit rate. The formula is simple — compare (cache write cost + cache read cost × hit rate) against (full input cost × all requests). In practice, applications with 2,000-token system prompts running 100 requests/day save around $12/month on Anthropic; growth-stage applications with 8,000-token prefixes at 10,000 requests/day save over $6,000/month. At enterprise scale — 100,000 requests/day with a 10,000-token cached prefix — annual savings exceed $1M on Anthropic. OpenAI’s 50% discount produces roughly half these savings for the same workload. The numbers below use Anthropic Claude Sonnet 4.5 pricing ($3.00/1M standard input, $0.30/1M cache read, $3.75/1M cache write) with a representative 85-90% cache hit rate, which healthy production systems consistently achieve.
These calculations use Anthropic Claude Sonnet 4.5 pricing ($3.00/1M input, $0.30/1M cache read, $3.75/1M cache write) unless noted.
Hobby: 2K System Prompt, 100 Requests/Day
Without caching: 2,000 tokens × 100 requests = 200K tokens/day × $3.00/1M = $0.60/day ($18/month)
With caching: 1 cache write ($0.0075) + 99 cache reads (2,000 × 99 × $0.30/1M = $0.059) + user query tokens ≈ $0.066/day
Monthly savings: ~$12.60/month (70% reduction)
Growth: 8K Cached Prefix, 10K Requests/Day
Without caching: 8,000 × 10,000 = 80M tokens/day × $3.00/1M = $240/day ($7,200/month)
With caching (90% hit rate): ~$19.20/day in cache reads vs $240/day baseline
Monthly savings: ~$6,240/month
Enterprise: 10K Cached Prefix, 100K Requests/Day
Without caching: 10,000 × 100,000 = 1B tokens/day × $3.00/1M = $3,000/day
With caching (85% hit rate): ~$76.50/day in cache reads
Monthly savings: ~$87,600/month ($1.05M/year)
These numbers explain why prompt caching is treated as a P0 optimization by any team running LLMs at scale.
Anti-Patterns That Kill Your Cache Hit Rate
Cache anti-patterns are the silent killers of LLM API budgets. A well-designed prompt structure can achieve 80-90% cache hit rates in production; the same application with anti-patterns typically sees 10-30% — meaning you’re paying near-full price and getting none of the latency benefits. Below are the most common patterns to avoid, each with a concrete fix.
Anti-pattern 1: Timestamps or session IDs in the system prompt
# WRONG — kills cache every request
system = f"You are an AI assistant. Current time: {datetime.now()}. Session: {session_id}"
# RIGHT — put dynamic data elsewhere
system = "You are an AI assistant."
# Inject time/session into the user message if needed
Anti-pattern 2: User content before static content
# WRONG — user name appears before the cacheable instructions
messages = [
{"role": "user", "content": f"Hi, I'm {user_name}. {user_query}"}
]
# RIGHT — static instructions in system, user identity in messages
system = "You are an expert assistant with access to the following knowledge base: [static docs]"
messages = [{"role": "user", "content": user_query}]
Anti-pattern 3: Rotating few-shot examples
# WRONG — shuffled examples invalidate cache every time
import random
examples = random.sample(all_examples, 5)
# RIGHT — fixed ordered examples in system prompt, random examples in user messages
fixed_examples = all_examples[:5] # Static, always the same
Anti-pattern 4: Dynamic tool definitions
# WRONG — enabling different tools per user breaks prefix matching
tools = get_user_tools(user_id) # Different per user
# RIGHT — use a fixed superset of tools, filter in application logic
tools = ALL_TOOLS # Identical for every request
Anti-pattern 5: Prompts below minimum threshold
Short prompts (< 1,024 tokens) don’t qualify for caching on any major provider. If your system prompt is 800 tokens, add structured documentation, examples, or reasoning guidelines to push above the threshold — the cost of additional tokens is trivial compared to the caching savings you unlock.
Monitoring Cache Hit Rates in Production
Production systems should target 70-90% cache hit rates. Rates below 50% indicate a structural problem with your prompt ordering — revisit the anti-patterns section. Each provider exposes cache metrics differently, but all include the data in API responses.
Anthropic monitoring:
def track_cache_metrics(response, metrics_client):
usage = response.usage
total_input = (usage.input_tokens +
usage.cache_creation_input_tokens +
usage.cache_read_input_tokens)
hit_rate = usage.cache_read_input_tokens / total_input if total_input > 0 else 0
metrics_client.gauge("llm.cache_hit_rate", hit_rate)
metrics_client.increment("llm.cache_reads", usage.cache_read_input_tokens)
metrics_client.increment("llm.cache_writes", usage.cache_creation_input_tokens)
if hit_rate < 0.5:
alert("Cache hit rate below 50% — check prompt structure")
return hit_rate
OpenAI monitoring:
def track_openai_cache(response):
details = response.usage.prompt_tokens_details
cached = details.cached_tokens if details else 0
total = response.usage.prompt_tokens
hit_rate = cached / total if total > 0 else 0
return hit_rate
Key metrics to track in production:
- Cache hit rate (target: 70%+; alert threshold: 50%)
- Cache creation cost (should be small relative to cache read savings)
- Time-to-first-token (cache hits typically reduce TTFT by 40-80%)
- Daily cache savings (compare cached read cost vs estimated uncached cost)
Set up weekly cost attribution reports that separate cached vs uncached spending. This makes optimization work visible to stakeholders and helps justify the engineering investment in prompt structure.
Prompt Caching with RAG, Multi-Turn Chat, and Agentic Systems
Prompt caching interacts differently with each major LLM application pattern, and the configuration choices that maximize savings in a simple chatbot may perform poorly in an agentic system. RAG pipelines benefit most from caching the retrieval instructions and knowledge base preamble while letting retrieved documents flow through a second breakpoint. Multi-turn chat applications benefit from Anthropic’s automatic cache advancement, which moves the cache boundary forward as conversation history grows — no manual re-marking needed. Agentic systems using tool-calling loops (AutoGen, LangGraph, CrewAI) require careful static/dynamic separation: cache the tool definitions and agent persona, let tool call results remain uncached. In all three patterns, the 31% semantic similarity rate observed across production LLM queries (Burnwise 2026 analysis) means that even applications with moderate request volumes see real cache hits — not just the high-frequency request patterns typically highlighted in provider documentation. Gemini’s named cache objects are uniquely well-suited for document corpora shared across many different query types, making it the preferred choice for multi-tenant RAG deployments where the same document set serves many users.
RAG Pipelines
RAG applications are the ideal use case for prompt caching. The retrieved documents change per query, but your system prompt, retrieval instructions, and output format guidelines are static. Structure your RAG prompt as:
- System instructions (static, cached)
- Retrieved documents (semi-static per document set, explicitly cached with breakpoint)
- User question (dynamic, not cached)
For Gemini, use named cache objects for your document corpus — create the cache once per document set and reference it by ID across all queries against that corpus.
Multi-Turn Conversations
Anthropic’s automatic cache advancement handles multi-turn chat without manual cache_control updates per message. The breakpoint moves forward automatically as conversation history grows. Watch for the 20-block lookback window — conversations longer than ~20 exchanges may see the oldest context fall outside the cacheable window. Implement a summarization or context compaction step before hitting this limit.
# Multi-turn with Anthropic — cache_control only on the system,
# automatic caching handles the rest
messages = conversation_history # Growing list of messages
response = client.messages.create(
model="claude-sonnet-4-5",
system=[{"type": "text", "text": system_prompt,
"cache_control": {"type": "ephemeral"}}],
messages=messages,
max_tokens=1024
)
# Append response to conversation_history for next turn
Agentic Systems
Agentic systems (AutoGen, LangGraph, CrewAI) make many tool calls in a loop, often with overlapping system prompts and tool definitions. Cache your tool registry and agent persona at the top of the prompt, and let the dynamic tool call results flow through the uncached portion. The consistency requirement is strict — if your tool definitions change between agent steps (e.g., tools are conditionally available), you’ll get cache misses. Prefer a static superset of tools and handle conditional availability in application logic.
When Prompt Caching Genuinely Doesn’t Help
Prompt caching is not a universal win. Avoid optimizing for it in these scenarios:
- Short prompts (< 1,024 tokens): You don’t meet the minimum threshold. Engineering time is better spent elsewhere.
- Highly unique contexts: If every request has a completely different long context (e.g., analyzing a unique document per user), you write a cache but never read it — you pay the write premium for nothing.
- Low request volume: At under 50 requests/day, cache writes may cost more than reads save. Run the math with your actual prompt length and request rate.
- Frequently changing system prompts: If your system prompt changes every hour or day (A/B testing, personalization), TTL selection becomes tricky and hit rates drop.
- One-off batch jobs: A batch that runs once and never repeats gets no cache reads. Use Anthropic’s Batch API for cost savings instead.
The honest assessment: if your system prompt is under 2K tokens and you run under 1,000 requests/day, the savings are real but modest (under $50/month on most providers). At that scale, model selection and prompt length optimization likely offer better ROI than caching architecture.
FAQ
Q: Does prompt caching work with streaming responses?
Yes. All three providers support prompt caching with streaming. The cache hit check happens before token generation begins, so your streaming latency still benefits from reduced time-to-first-token on cache hits. The usage statistics (including cache read tokens) appear in the final delta of the stream.
Q: What happens if I exceed Anthropic’s 4-breakpoint limit?
The API returns a 400 error. If you’re using automatic caching (which consumes one slot), you can add up to 3 explicit breakpoints. If you need more granularity, restructure your prompt to consolidate static sections rather than adding more breakpoints.
Q: Is prompt caching the same as semantic caching?
No. Prompt caching is exact prefix matching at the token level — it requires identical byte sequences to hit. Semantic caching (tools like GPTCache, Redis + embeddings) matches semantically similar queries and returns cached responses. They’re complementary: use prompt caching to reduce per-request compute costs, and semantic caching to avoid calling the LLM at all for near-duplicate queries.
Q: Will using prompt caching affect response quality?
No. Cache hits use the exact same KV states that would have been computed fresh, so responses are statistically identical. The only observable difference is lower latency and cost. There’s no quality-cost tradeoff involved.
Q: How do I choose between Anthropic and OpenAI for cost optimization?
Run the math with your actual numbers. OpenAI gives 50% savings with zero engineering work. Anthropic gives 90% savings with minimal implementation effort. At 10,000 requests/day with a 5K-token system prompt, Anthropic saves roughly twice as much per month despite higher base prices, assuming 80%+ cache hit rate. Below about 5,000 requests/day, the difference narrows significantly, and OpenAI’s simplicity may win on total cost including engineering time.
