<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>API Cost Optimization on RockB</title><link>https://baeseokjae.github.io/tags/api-cost-optimization/</link><description>Recent content in API Cost Optimization on RockB</description><image><title>RockB</title><url>https://baeseokjae.github.io/images/og-default.png</url><link>https://baeseokjae.github.io/images/og-default.png</link></image><generator>Hugo</generator><language>en-us</language><lastBuildDate>Tue, 21 Apr 2026 01:02:58 +0000</lastBuildDate><atom:link href="https://baeseokjae.github.io/tags/api-cost-optimization/index.xml" rel="self" type="application/rss+xml"/><item><title>LLM Prompt Caching Guide 2026: Cut API Costs 70% with Anthropic and OpenAI</title><link>https://baeseokjae.github.io/posts/llm-prompt-caching-guide-2026/</link><pubDate>Tue, 21 Apr 2026 01:02:58 +0000</pubDate><guid>https://baeseokjae.github.io/posts/llm-prompt-caching-guide-2026/</guid><description>LLM prompt caching guide 2026: Anthropic, OpenAI, Gemini code examples, cost calculators, anti-patterns, and production monitoring tips.</description><content:encoded><![CDATA[<p>Prompt caching is the single highest-ROI optimization available for production LLM applications. If you run 10,000 requests per day with an 8K-token cached system prompt on Anthropic Claude, you save roughly $5,650/month — with only a few lines of code changed. OpenAI&rsquo;s automatic caching requires zero code changes and gives you a 50% discount on repeated input tokens. Anthropic&rsquo;s explicit caching offers up to 90% savings. This guide covers both, plus Gemini, with production code examples, real cost numbers, and the anti-patterns that silently destroy your cache hit rate.</p>
<h2 id="how-prompt-caching-works-kv-cache-prefix-matching-and-why-order-matters">How Prompt Caching Works: KV Cache, Prefix Matching, and Why Order Matters</h2>
<p>Prompt caching works by storing the key-value (KV) computation for a prefix of your prompt in GPU memory, then reusing those stored activations for subsequent requests that share the same prefix. When your request arrives, the provider checks whether the incoming prompt&rsquo;s beginning matches a cached prefix. If it does — a cache hit — the model skips recomputing that prefix and starts generating immediately. Hugging Face technical analysis measured roughly a 5.21x speedup on T4 GPUs from KV cache reuse alone. The cost reduction follows the same logic: you pay a lower rate for cached input tokens because the provider doesn&rsquo;t need to run full inference on that portion of the prompt.</p>
<p><strong>Why order matters critically:</strong> Prefix matching is exact and sequential. If your prompt reads <code>system → context → user query</code>, the cache key covers everything from the start up to your designated breakpoint. Change anything before the breakpoint — even a single character — and the entire cached prefix is invalidated. This means timestamps, session IDs, or user-specific data embedded early in your prompt will kill your cache hit rate entirely. The universal rule: place static content first, dynamic content last. Tools definitions → system instructions → document context → few-shot examples → current conversation history → user query. This ordering directly determines your API bill.</p>
<p>Minimum token requirements vary by provider: Anthropic requires at least 1,024 tokens in the cached prefix; OpenAI caches in 128-token increments with a 1,024-token minimum. Short prompts below these thresholds simply don&rsquo;t qualify for caching and should be excluded from your optimization planning.</p>
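<p>A quick pre-flight check makes the thresholds concrete. The sketch below uses the rough 4-characters-per-token heuristic (a billing-accurate count requires the provider&rsquo;s own tokenizer, e.g. <code>tiktoken</code>); the 1,024 default matches the minimums above:</p>

```python
def cache_eligible(prompt: str, min_tokens: int = 1024) -> bool:
    """Rough eligibility check via the ~4 characters/token heuristic.

    Only flags prompts that are clearly below the caching minimum;
    use the provider's tokenizer for billing-accurate counts.
    """
    return len(prompt) / 4 >= min_tokens

print(cache_eligible("Summarize this."))  # False: far below 1,024 tokens
print(cache_eligible("x" * 8000))         # True: ~2,000 estimated tokens
```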
<h2 id="provider-comparison-openai-vs-anthropic-vs-gemini">Provider Comparison: OpenAI vs Anthropic vs Gemini</h2>
<p>Prompt caching is now supported by all three major LLM providers — OpenAI, Anthropic, and Google Gemini — but they implement it in fundamentally different ways with meaningfully different economics. OpenAI&rsquo;s caching is fully automatic: you write no special code, the API detects repeated prefixes, and you see a 50% discount on cached tokens with no TTL configuration available. Anthropic gives you the highest savings rate at 90% but requires explicit <code>cache_control</code> markers (simplified significantly by the February 2026 automatic caching update). Gemini sits between the two, offering implicit automatic caching for Gemini 2.5 models and named cache objects for explicit control with configurable TTL. Choosing between providers comes down to your optimization priorities: zero-friction savings (OpenAI), maximum cost reduction with fine-grained control (Anthropic), or configurable persistence for document-heavy workloads (Gemini). Most teams using Anthropic as their primary provider see the February 2026 changes as a reason to migrate previously-uncached workflows — the implementation barrier dropped significantly.</p>
<table>
  <thead>
      <tr>
          <th>Feature</th>
          <th>OpenAI</th>
          <th>Anthropic</th>
          <th>Gemini</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Caching type</td>
          <td>Automatic</td>
          <td>Automatic + Explicit</td>
          <td>Implicit + Explicit</td>
      </tr>
      <tr>
          <td>Cost savings</td>
          <td>50% on input</td>
          <td>90% on input</td>
          <td>~90% on input</td>
      </tr>
      <tr>
          <td>TTL</td>
          <td>5–10 min</td>
          <td>5 min or 1 hour</td>
          <td>Configurable</td>
      </tr>
      <tr>
          <td>Minimum tokens</td>
          <td>1,024 (128-token increments)</td>
          <td>1,024</td>
          <td>Varies</td>
      </tr>
      <tr>
          <td>Code changes required</td>
          <td>None</td>
          <td>Minimal (cache_control)</td>
          <td>Named cache objects</td>
      </tr>
      <tr>
          <td>Control granularity</td>
          <td>None (auto)</td>
          <td>Up to 4 breakpoints</td>
          <td>Named cache objects</td>
      </tr>
      <tr>
          <td>2026 update</td>
          <td>GPT-5.1: 24h retention</td>
          <td>Feb 2026: auto caching</td>
          <td>Gemini 2.5 implicit caching</td>
      </tr>
  </tbody>
</table>
<h2 id="openai-prompt-caching-automatic-zero-config">OpenAI Prompt Caching: Automatic, Zero-Config</h2>
<p>OpenAI prompt caching is automatic and requires zero code changes — the API detects repeated input prefixes and applies a 50% discount on cached input tokens automatically. You don&rsquo;t set any flags; you just observe the discount in your usage dashboard and billing. The GPT-5.1 series introduced 24-hour cache retention, making it viable for system prompts used across long workdays or batch pipelines that span multiple processing windows. Cache hits appear in the <code>usage</code> object of the API response as <code>cached_tokens</code>, so you can monitor performance without any instrumentation changes.</p>
<p>OpenAI caches in 128-token increments, meaning your cached prefix must be at least 1,024 tokens and matches extend in 128-token steps. A 1,100-token prefix gets cached at 1,024 tokens, with the remaining 76 tokens billed at full price. This granularity matters for borderline cases but rarely affects the economics of real system prompts, which typically run 2,000–10,000 tokens.</p>
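<p>The increment rule is easy to express directly. A small sketch of the billing behavior described above:</p>

```python
def openai_cached_tokens(prefix_tokens: int) -> int:
    """Tokens billed at the cached rate: 1,024-token minimum, 128-token steps."""
    if prefix_tokens < 1024:
        return 0
    return (prefix_tokens // 128) * 128

print(openai_cached_tokens(1100))  # 1024 -> the remaining 76 tokens bill at full price
print(openai_cached_tokens(2600))  # 2560
print(openai_cached_tokens(900))   # 0 -> below the minimum, nothing is cached
```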
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># No special configuration needed — caching is automatic</span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>chat<span style="color:#f92672">.</span>completions<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>    messages<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>        {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;system&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#75715e"># Static system prompt (1024+ tokens for caching eligibility)</span>
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">&#34;You are an expert software engineer specializing in Python...&#34;</span>
</span></span><span style="display:flex;"><span>            <span style="color:#75715e"># ... (long static content)</span>
</span></span><span style="display:flex;"><span>        },
</span></span><span style="display:flex;"><span>        {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;content&#34;</span>: user_query  <span style="color:#75715e"># Dynamic — place last</span>
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>    ]
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Check cache hit in response</span>
</span></span><span style="display:flex;"><span>cached_tokens <span style="color:#f92672">=</span> response<span style="color:#f92672">.</span>usage<span style="color:#f92672">.</span>prompt_tokens_details<span style="color:#f92672">.</span>cached_tokens
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Cached tokens: </span><span style="color:#e6db74">{</span>cached_tokens<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span></code></pre></div><p><strong>The tradeoff vs Anthropic:</strong> OpenAI&rsquo;s automatic approach is the right choice for teams that want savings with zero engineering overhead. You get 50% off repeated input tokens with no prompt restructuring. The downside is loss of control — you can&rsquo;t force specific breakpoints, can&rsquo;t choose TTL, and can&rsquo;t target multiple cache boundaries within a single prompt. For high-volume applications where every dollar matters, Anthropic&rsquo;s 90% savings on cache reads typically justifies the additional implementation work.</p>
<h2 id="anthropic-prompt-caching-90-savings-with-explicit-breakpoints">Anthropic Prompt Caching: 90% Savings with Explicit Breakpoints</h2>
<p>Anthropic prompt caching delivers up to 90% cost reduction on cached input tokens, the highest discount available from any major provider in 2026. Cache reads for Claude Sonnet 4.5 cost $0.30/1M tokens versus $3.00/1M for standard input — exactly a 10x reduction. The February 2026 automatic caching update simplified implementation significantly: a single top-level <code>cache_control</code> marker now causes the API to auto-place the breakpoint on the last cacheable block, eliminating the need to annotate every section individually. For most use cases, this single-marker approach is sufficient.</p>
<p>For fine-grained control, Anthropic supports up to 4 explicit cache breakpoints per prompt. Automatic caching consumes 1 of those 4 slots — adding automatic caching plus 4 explicit breakpoints triggers a 400 error. The cache invalidation hierarchy is tools → system → messages: changing anything earlier in this chain invalidates caches for everything that follows. Place your least-changing content at the top (tool definitions), most-changing content at the bottom (current user message).</p>
<p><strong>5-minute vs 1-hour TTL:</strong> Choose based on request cadence, not preference. If requests arrive more than every 5 minutes on average, 1-hour TTL pays for itself immediately — you pay 2x base input price on writes instead of 1.25x, but cache reads stay at 0.1x for both. The 1-hour write premium recovers after just 2 cache hits. If your traffic is bursty with long idle gaps, 5-minute TTL may be more economical. One team learned this the hard way: a library update silently changed their TTL from 1-hour to 5-minutes, causing a $13.86/day bill increase before anyone noticed.</p>
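<p>To see the break-even concretely, here is a sketch with a hypothetical workload: 96 requests/day arriving roughly 15 minutes apart, so a 5-minute cache expires before every request and must be rewritten, while a 1-hour cache stays warm after a single write. Only the cached-prefix tokens are counted:</p>

```python
def daily_prefix_cost(prefix_tokens, requests_per_day, rewrites_per_day,
                      write_multiplier, base_price_per_m=3.00):
    """Daily cost of the cached prefix only: rewrites at the write multiplier,
    remaining requests as cache reads at 0.1x the base input price."""
    millions = prefix_tokens / 1_000_000
    writes = rewrites_per_day * millions * base_price_per_m * write_multiplier
    reads = (requests_per_day - rewrites_per_day) * millions * base_price_per_m * 0.1
    return writes + reads

# 8K-token prefix, 96 requests/day, ~15 minutes between requests
five_min = daily_prefix_cost(8000, 96, rewrites_per_day=96, write_multiplier=1.25)
one_hour = daily_prefix_cost(8000, 96, rewrites_per_day=1, write_multiplier=2.00)
print(f"5-min TTL:  ${five_min:.2f}/day")   # every request pays the 1.25x write
print(f"1-hour TTL: ${one_hour:.2f}/day")   # one 2x write, then 0.1x reads
```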
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> anthropic
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> anthropic<span style="color:#f92672">.</span>Anthropic()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># February 2026 approach: single cache_control at top level (auto places breakpoint)</span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>messages<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;claude-sonnet-4-5&#34;</span>,
</span></span><span style="display:flex;"><span>    max_tokens<span style="color:#f92672">=</span><span style="color:#ae81ff">1024</span>,
</span></span><span style="display:flex;"><span>    system<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>        {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;text&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;text&#34;</span>: <span style="color:#e6db74">&#34;You are an expert software engineer...&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#75715e"># This triggers automatic cache placement on the last cacheable block</span>
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;cache_control&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;ephemeral&#34;</span>}
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>    ],
</span></span><span style="display:flex;"><span>    messages<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>        {<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: user_query}
</span></span><span style="display:flex;"><span>    ]
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Monitor cache usage</span>
</span></span><span style="display:flex;"><span>usage <span style="color:#f92672">=</span> response<span style="color:#f92672">.</span>usage
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Input tokens: </span><span style="color:#e6db74">{</span>usage<span style="color:#f92672">.</span>input_tokens<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Cache creation tokens: </span><span style="color:#e6db74">{</span>usage<span style="color:#f92672">.</span>cache_creation_input_tokens<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Cache read tokens: </span><span style="color:#e6db74">{</span>usage<span style="color:#f92672">.</span>cache_read_input_tokens<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span></code></pre></div><p><strong>Multi-turn conversation caching:</strong> In multi-turn chat, Anthropic&rsquo;s automatic caching advances the cache breakpoint forward as the conversation grows — without requiring you to update <code>cache_control</code> markers manually. The 20-block lookback window limits how far back the provider searches for matching prefixes. Keep your conversation history compaction logic in sync with this window to avoid unnecessary cache misses in very long conversations.</p>
<h3 id="explicit-multi-breakpoint-example">Explicit Multi-Breakpoint Example</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># For fine-grained control: multiple explicit breakpoints</span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>messages<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;claude-sonnet-4-5&#34;</span>,
</span></span><span style="display:flex;"><span>    max_tokens<span style="color:#f92672">=</span><span style="color:#ae81ff">1024</span>,
</span></span><span style="display:flex;"><span>    system<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>        {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;text&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;text&#34;</span>: <span style="color:#e6db74">&#34;You are an expert software engineer...&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;cache_control&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;ephemeral&#34;</span>}  <span style="color:#75715e"># Breakpoint 1: system prompt</span>
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>    ],
</span></span><span style="display:flex;"><span>    messages<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>        {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;content&#34;</span>: [
</span></span><span style="display:flex;"><span>                {
</span></span><span style="display:flex;"><span>                    <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;text&#34;</span>,
</span></span><span style="display:flex;"><span>                    <span style="color:#e6db74">&#34;text&#34;</span>: large_document_context,  <span style="color:#75715e"># Your reference docs</span>
</span></span><span style="display:flex;"><span>                    <span style="color:#e6db74">&#34;cache_control&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;ephemeral&#34;</span>}  <span style="color:#75715e"># Breakpoint 2: context</span>
</span></span><span style="display:flex;"><span>                },
</span></span><span style="display:flex;"><span>                {
</span></span><span style="display:flex;"><span>                    <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;text&#34;</span>,
</span></span><span style="display:flex;"><span>                    <span style="color:#e6db74">&#34;text&#34;</span>: user_query  <span style="color:#75715e"># Dynamic — no cache_control</span>
</span></span><span style="display:flex;"><span>                }
</span></span><span style="display:flex;"><span>            ]
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>    ]
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><h2 id="gemini-prompt-caching-implicit-caching-and-named-cache-objects">Gemini Prompt Caching: Implicit Caching and Named Cache Objects</h2>
<p>Gemini prompt caching operates through two mechanisms: implicit caching (where the API automatically detects and reuses repeated content) and explicit named cache objects for precise control. Gemini 2.5 expanded implicit caching capabilities, making it the most hands-off option for teams already using Google&rsquo;s infrastructure. Named cache objects persist across requests with configurable TTL, behaving more like a traditional database cache than the prefix-matching approach used by OpenAI and Anthropic. Savings are approximately 90% on cached content, comparable to Anthropic&rsquo;s rates.</p>
<p>The named cache approach works well for RAG pipelines that repeatedly query the same knowledge base — you cache the document corpus once, assign it a cache ID, and reference that ID in subsequent requests rather than retransmitting the full content. This makes Gemini caching particularly well-suited for document Q&amp;A applications where the reference material doesn&rsquo;t change between queries.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> google.generativeai <span style="color:#66d9ef">as</span> genai
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>genai<span style="color:#f92672">.</span>configure(api_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;YOUR_API_KEY&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Create a named cache for long-lived content</span>
</span></span><span style="display:flex;"><span>cache <span style="color:#f92672">=</span> genai<span style="color:#f92672">.</span>caching<span style="color:#f92672">.</span>CachedContent<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gemini-2.5-flash&#34;</span>,
</span></span><span style="display:flex;"><span>    contents<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>        {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;parts&#34;</span>: [{<span style="color:#e6db74">&#34;text&#34;</span>: large_document_context}]
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>    ],
</span></span><span style="display:flex;"><span>    ttl<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;3600s&#34;</span>  <span style="color:#75715e"># 1-hour TTL</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Reference the cache in subsequent requests</span>
</span></span><span style="display:flex;"><span>model <span style="color:#f92672">=</span> genai<span style="color:#f92672">.</span>GenerativeModel<span style="color:#f92672">.</span>from_cached_content(cache)
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> model<span style="color:#f92672">.</span>generate_content(user_query)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Clean up when done</span>
</span></span><span style="display:flex;"><span>cache<span style="color:#f92672">.</span>delete()
</span></span></code></pre></div><h2 id="production-cost-calculator-real-dollar-amounts">Production Cost Calculator: Real Dollar Amounts</h2>
<p>Prompt caching economics depend on three variables: prompt length (in tokens), daily request volume, and cache hit rate. The formula is simple — compare (cache read cost × hit rate + cache write cost × miss rate) against (full input cost × all requests). In practice, applications with 2,000-token system prompts running 100 requests/day save around $16/month on Anthropic; growth-stage applications with 8,000-token prefixes at 10,000 requests/day save roughly $5,650/month. At enterprise scale — 100,000 requests/day with a 10,000-token cached prefix — annual savings approach $800K on Anthropic. OpenAI&rsquo;s 50% discount produces roughly half these savings for the same workload. The numbers below use Anthropic Claude Sonnet 4.5 pricing ($3.00/1M standard input, $0.30/1M cache read, $3.75/1M cache write) with a representative 85-90% cache hit rate, which healthy production systems consistently achieve.</p>
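<p>The formula above can be turned into a quick calculator. This is a sketch with one simplification: every cache miss is billed as a fresh cache write, and only the cached-prefix tokens are counted (user queries and output tokens bill normally either way):</p>

```python
def monthly_savings(prefix_tokens, requests_per_day, hit_rate,
                    input_price=3.00, read_price=0.30, write_price=3.75):
    """Estimated monthly savings on the cached prefix, prices per 1M tokens
    (Claude Sonnet 4.5 defaults). Misses are billed as cache writes."""
    daily_millions = prefix_tokens * requests_per_day / 1_000_000
    baseline = daily_millions * input_price
    cached = daily_millions * (hit_rate * read_price + (1 - hit_rate) * write_price)
    return (baseline - cached) * 30

print(f"Growth:     ${monthly_savings(8_000, 10_000, 0.90):,.0f}/month")
print(f"Enterprise: ${monthly_savings(10_000, 100_000, 0.85):,.0f}/month")
```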
<h3 id="hobby-2k-system-prompt-100-requestsday">Hobby: 2K System Prompt, 100 Requests/Day</h3>
<p>Without caching: 2,000 tokens × 100 requests = 200K tokens/day × $3.00/1M = $0.60/day ($18/month)</p>
<p>With caching: 1 cache write ($0.0075) + 99 cache reads (2,000 × 99 × $0.30/1M = $0.059) + user query tokens ≈ $0.066/day</p>
<p><strong>Monthly savings: ~$16/month</strong> (89% reduction on the cached prefix)</p>
<h3 id="growth-8k-cached-prefix-10k-requestsday">Growth: 8K Cached Prefix, 10K Requests/Day</h3>
<p>Without caching: 8,000 × 10,000 = 80M tokens/day × $3.00/1M = $240/day ($7,200/month)</p>
<p>With caching (90% hit rate): ~$21.60/day in cache reads plus ~$30/day in cache writes on misses, about $51.60/day vs $240/day baseline</p>
<p><strong>Monthly savings: ~$5,650/month</strong> (78% reduction)</p>
<h3 id="enterprise-10k-cached-prefix-100k-requestsday">Enterprise: 10K Cached Prefix, 100K Requests/Day</h3>
<p>Without caching: 10,000 × 100,000 = 1B tokens/day × $3.00/1M = $3,000/day</p>
<p>With caching (85% hit rate): ~$255/day in cache reads plus ~$562.50/day in cache writes on misses, about $817.50/day</p>
<p><strong>Monthly savings: ~$65,500/month (~$786K/year)</strong></p>
<p>These numbers explain why prompt caching is treated as a P0 optimization by any team running LLMs at scale.</p>
<h2 id="anti-patterns-that-kill-your-cache-hit-rate">Anti-Patterns That Kill Your Cache Hit Rate</h2>
<p>Cache anti-patterns are the silent killers of LLM API budgets. A well-designed prompt structure can achieve 80-90% cache hit rates in production; the same application with anti-patterns typically sees 10-30% — meaning you&rsquo;re paying near-full price and getting none of the latency benefits. Below are the most common patterns to avoid, each with a concrete fix.</p>
<p><strong>Anti-pattern 1: Timestamps or session IDs in the system prompt</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># WRONG — kills cache every request</span>
</span></span><span style="display:flex;"><span>system <span style="color:#f92672">=</span> <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;You are an AI assistant. Current time: </span><span style="color:#e6db74">{</span>datetime<span style="color:#f92672">.</span>now()<span style="color:#e6db74">}</span><span style="color:#e6db74">. Session: </span><span style="color:#e6db74">{</span>session_id<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># RIGHT — put dynamic data elsewhere</span>
</span></span><span style="display:flex;"><span>system <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;You are an AI assistant.&#34;</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Inject time/session into the user message if needed</span>
</span></span></code></pre></div><p><strong>Anti-pattern 2: User content before static content</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># WRONG — user name appears before the cacheable instructions</span>
</span></span><span style="display:flex;"><span>messages <span style="color:#f92672">=</span> [
</span></span><span style="display:flex;"><span>    {<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Hi, I&#39;m </span><span style="color:#e6db74">{</span>user_name<span style="color:#e6db74">}</span><span style="color:#e6db74">. </span><span style="color:#e6db74">{</span>user_query<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>}
</span></span><span style="display:flex;"><span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># RIGHT — static instructions in system, user identity in messages</span>
</span></span><span style="display:flex;"><span>system <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;You are an expert assistant with access to the following knowledge base: [static docs]&#34;</span>
</span></span><span style="display:flex;"><span>messages <span style="color:#f92672">=</span> [{<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: user_query}]
</span></span></code></pre></div><p><strong>Anti-pattern 3: Rotating few-shot examples</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># WRONG — shuffled examples invalidate cache every time</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> random
</span></span><span style="display:flex;"><span>examples <span style="color:#f92672">=</span> random<span style="color:#f92672">.</span>sample(all_examples, <span style="color:#ae81ff">5</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># RIGHT — fixed ordered examples in system prompt, random examples in user messages</span>
</span></span><span style="display:flex;"><span>fixed_examples <span style="color:#f92672">=</span> all_examples[:<span style="color:#ae81ff">5</span>]  <span style="color:#75715e"># Static, always the same</span>
</span></span></code></pre></div><p><strong>Anti-pattern 4: Dynamic tool definitions</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># WRONG — enabling different tools per user breaks prefix matching</span>
</span></span><span style="display:flex;"><span>tools <span style="color:#f92672">=</span> get_user_tools(user_id)  <span style="color:#75715e"># Different per user</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># RIGHT — use a fixed superset of tools, filter in application logic</span>
</span></span><span style="display:flex;"><span>tools <span style="color:#f92672">=</span> ALL_TOOLS  <span style="color:#75715e"># Identical for every request</span>
</span></span></code></pre></div><p><strong>Anti-pattern 5: Prompts below minimum threshold</strong></p>
<p>Short prompts (&lt; 1,024 tokens) don&rsquo;t qualify for caching on any major provider, and some models set the bar higher (Claude Haiku requires 2,048 tokens). If your system prompt is 800 tokens, add structured documentation, examples, or reasoning guidelines to push it above the threshold; the cost of the additional tokens is trivial next to the caching savings you unlock.</p>
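<p>As a quick sanity check, a rough character-count heuristic (about 4 characters per token for English text) is enough to tell whether a prompt clears the threshold; use your provider&rsquo;s token-counting endpoint for exact numbers. A minimal sketch, with illustrative helper names:</p>

```python
# Quick check that a system prompt clears the 1,024-token caching minimum.
# Uses the common ~4 characters/token heuristic for English text; swap in
# your provider's token-counting endpoint when you need exact counts.

CACHE_MIN_TOKENS = 1024

def estimated_tokens(text: str) -> int:
    """Crude estimate: roughly 4 characters per token for English prose."""
    return max(1, len(text) // 4)

def qualifies_for_caching(prompt: str, min_tokens: int = CACHE_MIN_TOKENS) -> bool:
    return estimated_tokens(prompt) >= min_tokens

print(qualifies_for_caching("You are a helpful assistant."))  # False
print(qualifies_for_caching("x" * 8000))                      # True
```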
<h2 id="monitoring-cache-hit-rates-in-production">Monitoring Cache Hit Rates in Production</h2>
<p>Production systems should target 70-90% cache hit rates. Rates below 50% indicate a structural problem with your prompt ordering — revisit the anti-patterns section. Each provider exposes cache metrics differently, but all include the data in API responses.</p>
<p><strong>Anthropic monitoring:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">track_cache_metrics</span>(response, metrics_client):
</span></span><span style="display:flex;"><span>    usage <span style="color:#f92672">=</span> response<span style="color:#f92672">.</span>usage
</span></span><span style="display:flex;"><span>    total_input <span style="color:#f92672">=</span> (usage<span style="color:#f92672">.</span>input_tokens <span style="color:#f92672">+</span> 
</span></span><span style="display:flex;"><span>                   usage<span style="color:#f92672">.</span>cache_creation_input_tokens <span style="color:#f92672">+</span> 
</span></span><span style="display:flex;"><span>                   usage<span style="color:#f92672">.</span>cache_read_input_tokens)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    hit_rate <span style="color:#f92672">=</span> usage<span style="color:#f92672">.</span>cache_read_input_tokens <span style="color:#f92672">/</span> total_input <span style="color:#66d9ef">if</span> total_input <span style="color:#f92672">&gt;</span> <span style="color:#ae81ff">0</span> <span style="color:#66d9ef">else</span> <span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    metrics_client<span style="color:#f92672">.</span>gauge(<span style="color:#e6db74">&#34;llm.cache_hit_rate&#34;</span>, hit_rate)
</span></span><span style="display:flex;"><span>    metrics_client<span style="color:#f92672">.</span>increment(<span style="color:#e6db74">&#34;llm.cache_reads&#34;</span>, usage<span style="color:#f92672">.</span>cache_read_input_tokens)
</span></span><span style="display:flex;"><span>    metrics_client<span style="color:#f92672">.</span>increment(<span style="color:#e6db74">&#34;llm.cache_writes&#34;</span>, usage<span style="color:#f92672">.</span>cache_creation_input_tokens)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> hit_rate <span style="color:#f92672">&lt;</span> <span style="color:#ae81ff">0.5</span>:
</span></span><span style="display:flex;"><span>        alert(<span style="color:#e6db74">&#34;Cache hit rate below 50% — check prompt structure&#34;</span>)  <span style="color:#75715e"># placeholder alerting hook</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> hit_rate
</span></span></code></pre></div><p><strong>OpenAI monitoring:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">track_openai_cache</span>(response):
</span></span><span style="display:flex;"><span>    details <span style="color:#f92672">=</span> response<span style="color:#f92672">.</span>usage<span style="color:#f92672">.</span>prompt_tokens_details
</span></span><span style="display:flex;"><span>    cached <span style="color:#f92672">=</span> details<span style="color:#f92672">.</span>cached_tokens <span style="color:#66d9ef">if</span> details <span style="color:#66d9ef">else</span> <span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span>    total <span style="color:#f92672">=</span> response<span style="color:#f92672">.</span>usage<span style="color:#f92672">.</span>prompt_tokens
</span></span><span style="display:flex;"><span>    hit_rate <span style="color:#f92672">=</span> cached <span style="color:#f92672">/</span> total <span style="color:#66d9ef">if</span> total <span style="color:#f92672">&gt;</span> <span style="color:#ae81ff">0</span> <span style="color:#66d9ef">else</span> <span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> hit_rate
</span></span></code></pre></div><p>Key metrics to track in production:</p>
<ul>
<li><strong>Cache hit rate</strong> (target: 70%+; alert threshold: 50%)</li>
<li><strong>Cache creation cost</strong> (should be small relative to cache read savings)</li>
<li><strong>Time-to-first-token</strong> (cache hits typically reduce TTFT by 40-80%)</li>
<li><strong>Daily cache savings</strong> (compare cached read cost vs estimated uncached cost)</li>
</ul>
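<p>The daily-savings line above is plain arithmetic once you have the token counts from your metrics pipeline. A sketch using Anthropic-style multipliers (1.25&times; for cache writes, 0.1&times; for cache reads) and an illustrative $3.00 per million input tokens base price:</p>

```python
# Compute the "daily cache savings" metric from token counts already in your
# metrics pipeline. Prices are illustrative: $3.00 per million input tokens,
# cache writes billed at 1.25x, cache reads at 0.1x (Anthropic-style).

BASE_PRICE_PER_MTOK = 3.00
WRITE_MULT = 1.25   # premium on cache_creation_input_tokens
READ_MULT = 0.10    # discount on cache_read_input_tokens

def daily_cache_savings(uncached_tokens: int, write_tokens: int, read_tokens: int) -> float:
    """Dollars saved versus running every input token uncached."""
    actual = (uncached_tokens
              + write_tokens * WRITE_MULT
              + read_tokens * READ_MULT) * BASE_PRICE_PER_MTOK / 1e6
    baseline = (uncached_tokens + write_tokens + read_tokens) * BASE_PRICE_PER_MTOK / 1e6
    return baseline - actual

savings = daily_cache_savings(
    uncached_tokens=2_000_000,   # dynamic user content
    write_tokens=1_000_000,      # cache misses that wrote the prefix
    read_tokens=20_000_000,      # cache hits reading the prefix
)
print(f"${savings:.2f}/day")
```

<p>Feeding this from the token counters in the tracking function above turns raw usage data into the stakeholder-facing dollar figure the weekly reports need.</p>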
<p>Set up weekly cost attribution reports that separate cached vs uncached spending. This makes optimization work visible to stakeholders and helps justify the engineering investment in prompt structure.</p>
<h2 id="prompt-caching-with-rag-multi-turn-chat-and-agentic-systems">Prompt Caching with RAG, Multi-Turn Chat, and Agentic Systems</h2>
<p>Prompt caching interacts differently with each major LLM application pattern, and the configuration choices that maximize savings in a simple chatbot may perform poorly in an agentic system. RAG pipelines benefit most from caching the retrieval instructions and knowledge base preamble while letting retrieved documents flow through a second breakpoint. Multi-turn chat applications benefit from Anthropic&rsquo;s automatic cache advancement, which moves the cache boundary forward as conversation history grows — no manual re-marking needed.</p>
<p>Agentic systems using tool-calling loops (AutoGen, LangGraph, CrewAI) require careful static/dynamic separation: cache the tool definitions and agent persona, and let tool call results remain uncached.</p>
<p>In all three patterns, the 31% semantic similarity rate observed across production LLM queries (Burnwise 2026 analysis) means that even applications with moderate request volumes see real cache hits — not just the high-frequency request patterns typically highlighted in provider documentation. Gemini&rsquo;s named cache objects are uniquely well-suited to document corpora shared across many query types, making Gemini the preferred choice for multi-tenant RAG deployments where the same document set serves many users.</p>
<h3 id="rag-pipelines">RAG Pipelines</h3>
<p>RAG applications are the ideal use case for prompt caching. The retrieved documents change per query, but your system prompt, retrieval instructions, and output format guidelines are static. Structure your RAG prompt as:</p>
<ol>
<li>System instructions (static, cached)</li>
<li>Retrieved documents (semi-static per document set, explicitly cached with breakpoint)</li>
<li>User question (dynamic, not cached)</li>
</ol>
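<p>A sketch of this three-layer structure using Anthropic&rsquo;s explicit breakpoints; the helper only assembles the request kwargs, and its name is illustrative:</p>

```python
# Assemble Anthropic request kwargs for the three-layer RAG structure:
# static instructions at breakpoint 1, retrieved documents at breakpoint 2,
# and the user question left uncached. Helper name is illustrative.

def build_rag_request(system_prompt: str, documents: str, question: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": 1024,
        "system": [
            # Breakpoint 1: static instructions, identical across all queries
            {"type": "text", "text": system_prompt,
             "cache_control": {"type": "ephemeral"}},
            # Breakpoint 2: retrieved documents, stable per document set
            {"type": "text", "text": documents,
             "cache_control": {"type": "ephemeral"}},
        ],
        "messages": [
            # Dynamic portion: changes per query, never cached
            {"role": "user", "content": question},
        ],
    }

kwargs = build_rag_request("You are a research assistant...",
                           "<retrieved docs>", "What changed in v2?")
# response = client.messages.create(**kwargs)
```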
<p>For Gemini, use named cache objects for your document corpus — create the cache once per document set and reference it by ID across all queries against that corpus.</p>
<h3 id="multi-turn-conversations">Multi-Turn Conversations</h3>
<p>Anthropic&rsquo;s automatic cache advancement handles multi-turn chat without manual cache_control updates per message. The breakpoint moves forward automatically as conversation history grows. Watch for the 20-block lookback window — conversations longer than ~20 exchanges may see the oldest context fall outside the cacheable window. Implement a summarization or context compaction step before hitting this limit.</p>
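<p>The compaction step can be as simple as folding the oldest turns into a single summary message before the history outgrows the window. The sketch below substitutes a plain placeholder where a real implementation would make a cheap summarization call; the thresholds are illustrative:</p>

```python
# Keep conversation history inside the cacheable lookback window by folding
# the oldest turns into one summary message. The "summary" here is a plain
# truncation placeholder; in production you would generate it with a cheap
# model call. The 16-message threshold is an illustrative choice.
# (A real implementation must also preserve user/assistant alternation.)

MAX_MESSAGES = 16  # stay comfortably under the ~20-block lookback window

def compact_history(messages: list[dict], keep_recent: int = 10) -> list[dict]:
    """Collapse old turns into a summary once history grows too long."""
    if len(messages) <= MAX_MESSAGES:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = f"[Summary of {len(old)} earlier messages omitted for brevity]"
    return [{"role": "user", "content": summary}] + recent

history = [{"role": "user", "content": f"msg {i}"} for i in range(30)]
compacted = compact_history(history)
print(len(compacted))  # 11: one summary message plus the 10 most recent
```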
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Multi-turn with Anthropic — cache_control only on the system, </span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># automatic caching handles the rest</span>
</span></span><span style="display:flex;"><span>messages <span style="color:#f92672">=</span> conversation_history  <span style="color:#75715e"># Growing list of messages</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>messages<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;claude-sonnet-4-5&#34;</span>,
</span></span><span style="display:flex;"><span>    system<span style="color:#f92672">=</span>[{<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;text&#34;</span>, <span style="color:#e6db74">&#34;text&#34;</span>: system_prompt, 
</span></span><span style="display:flex;"><span>             <span style="color:#e6db74">&#34;cache_control&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;ephemeral&#34;</span>}}],
</span></span><span style="display:flex;"><span>    messages<span style="color:#f92672">=</span>messages,
</span></span><span style="display:flex;"><span>    max_tokens<span style="color:#f92672">=</span><span style="color:#ae81ff">1024</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Append response to conversation_history for next turn</span>
</span></span></code></pre></div><h3 id="agentic-systems">Agentic Systems</h3>
<p>Agentic systems (AutoGen, LangGraph, CrewAI) make many tool calls in a loop, often with overlapping system prompts and tool definitions. Cache your tool registry and agent persona at the top of the prompt, and let the dynamic tool call results flow through the uncached portion. The consistency requirement is strict — if your tool definitions change between agent steps (e.g., tools are conditionally available), you&rsquo;ll get cache misses. Prefer a static superset of tools and handle conditional availability in application logic.</p>
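<p>One way to keep the tool list static while still enforcing per-user availability is to gate execution in application logic, as in this sketch (the tool registry and permission map are illustrative):</p>

```python
# Send the identical tool superset on every request so the cached prefix
# stays stable, then enforce per-user availability only when the model
# actually calls a tool. ALL_TOOLS and USER_PERMISSIONS are illustrative.

ALL_TOOLS = [
    {"name": "search_docs", "description": "Search the knowledge base",
     "input_schema": {"type": "object", "properties": {}}},
    {"name": "delete_record", "description": "Delete a record",
     "input_schema": {"type": "object", "properties": {}}},
]

USER_PERMISSIONS = {
    "free_tier": {"search_docs"},
    "admin": {"search_docs", "delete_record"},
}

def run_tool(name: str, args: dict) -> dict:
    """Stub dispatcher standing in for real tool implementations."""
    return {"ok": True, "tool": name}

def execute_tool_call(user_tier: str, tool_name: str, tool_input: dict) -> dict:
    """Gate execution in app logic instead of varying the request's tool list."""
    if tool_name not in USER_PERMISSIONS.get(user_tier, set()):
        # Return an error result for the model to relay; the request-side
        # tool list never changed, so the cache prefix stays intact.
        return {"error": f"{tool_name} is not available on this plan"}
    return run_tool(tool_name, tool_input)

print(execute_tool_call("free_tier", "delete_record", {}))  # permission error
print(execute_tool_call("admin", "delete_record", {}))      # executes
```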
<h2 id="when-prompt-caching-genuinely-doesnt-help">When Prompt Caching Genuinely Doesn&rsquo;t Help</h2>
<p>Prompt caching is not a universal win. Avoid optimizing for it in these scenarios:</p>
<ul>
<li><strong>Short prompts (&lt; 1,024 tokens):</strong> You don&rsquo;t meet the minimum threshold. Engineering time is better spent elsewhere.</li>
<li><strong>Highly unique contexts:</strong> If every request has a completely different long context (e.g., analyzing a unique document per user), you write a cache but never read it — you pay the write premium for nothing.</li>
<li><strong>Low request volume:</strong> At under 50 requests/day, cache writes may cost more than reads save. Run the math with your actual prompt length and request rate.</li>
<li><strong>Frequently changing system prompts:</strong> If your system prompt changes every hour or day (A/B testing, personalization), TTL selection becomes tricky and hit rates drop.</li>
<li><strong>One-off batch jobs:</strong> A batch that runs once and never repeats gets no cache reads. Use Anthropic&rsquo;s Batch API for cost savings instead.</li>
</ul>
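<p>For the low-volume case, the breakeven arithmetic is simple. With Anthropic-style multipliers, each cached token costs an extra 0.25&times; base price to write and saves 0.9&times; on every read, so caching pays off once you average roughly one read per 3.6 writes; the real risk is prefixes that are written but never read. A sketch:</p>

```python
# Breakeven sketch for explicit caching with a 1.25x write premium and 0.1x
# read price (Anthropic-style multipliers). Caching wins whenever
# reads * 0.90 > writes * 0.25, i.e. roughly one read per 3.6 writes
# (assuming equal token counts per write and per read).

WRITE_PREMIUM = 0.25   # extra cost vs plain input (1.25x - 1.0x)
READ_DISCOUNT = 0.90   # saved cost vs plain input (1.0x - 0.1x)

def caching_pays_off(daily_writes: int, daily_reads: int) -> bool:
    """True when read savings exceed the write premium."""
    return daily_reads * READ_DISCOUNT > daily_writes * WRITE_PREMIUM

print(caching_pays_off(daily_writes=40, daily_reads=12))  # True
print(caching_pays_off(daily_writes=40, daily_reads=10))  # False
```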
<p>The honest assessment: if your system prompt is under 2K tokens and you run under 1,000 requests/day, the savings are real but modest (under $50/month on most providers). At that scale, model selection and prompt length optimization likely offer better ROI than caching architecture.</p>
<h2 id="faq">FAQ</h2>
<p><strong>Q: Does prompt caching work with streaming responses?</strong></p>
<p>Yes. All three providers support prompt caching with streaming. The cache hit check happens before token generation begins, so streamed responses still get the reduced time-to-first-token on cache hits. Usage statistics (including cache read tokens) arrive in the stream itself: Anthropic reports input-side usage, cache fields included, in the message_start event, and OpenAI returns usage in a final chunk when you enable the include_usage stream option.</p>
<p><strong>Q: What happens if I exceed Anthropic&rsquo;s 4-breakpoint limit?</strong></p>
<p>The API returns a 400 error. If you&rsquo;re using automatic caching (which consumes one slot), you can add up to 3 explicit breakpoints. If you need more granularity, restructure your prompt to consolidate static sections rather than adding more breakpoints.</p>
<p><strong>Q: Is prompt caching the same as semantic caching?</strong></p>
<p>No. Prompt caching is exact prefix matching at the token level — it requires identical byte sequences to hit. Semantic caching (tools like GPTCache, Redis + embeddings) matches semantically similar queries and returns cached responses. They&rsquo;re complementary: use prompt caching to reduce per-request compute costs, and semantic caching to avoid calling the LLM at all for near-duplicate queries.</p>
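<p>To make the distinction concrete, here is a toy semantic cache; real systems use learned embeddings, while this sketch uses bag-of-words cosine similarity purely for illustration:</p>

```python
# Toy semantic cache illustrating the complementary layer: near-duplicate
# queries skip the LLM call entirely. Real systems (GPTCache, Redis +
# embeddings) use learned embeddings; this sketch uses bag-of-words cosine
# similarity purely for illustration.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.entries: list[tuple[Counter, str]] = []
        self.threshold = threshold

    def get(self, query: str):
        q = embed(query)
        for vec, response in self.entries:
            if cosine(q, vec) >= self.threshold:
                return response  # semantic hit: no LLM call at all
        return None

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.put("how do I reset my password", "Click 'Forgot password' on the login page.")
print(cache.get("how do i reset my password please"))  # near-duplicate: hit
print(cache.get("what is your refund policy"))         # None
```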
<p><strong>Q: Will using prompt caching affect response quality?</strong></p>
<p>No. Cache hits reuse the exact KV states that would have been computed fresh, so the model&rsquo;s output distribution is identical to an uncached request; any run-to-run variation comes from sampling, not from caching. The only observable differences are lower latency and cost. There&rsquo;s no quality-cost tradeoff involved.</p>
<p><strong>Q: How do I choose between Anthropic and OpenAI for cost optimization?</strong></p>
<p>Run the math with your actual numbers. OpenAI gives 50% savings with zero engineering work. Anthropic gives up to 90% savings on cache reads with minimal implementation effort. At 10,000 requests/day with a 5K-token system prompt, Anthropic saves roughly twice as much per month despite higher base prices, assuming an 80%+ cache hit rate. Below about 5,000 requests/day the difference narrows significantly, and OpenAI&rsquo;s simplicity may win on total cost once engineering time is included.</p>
]]></content:encoded></item></channel></rss>