<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Llm-Cost on RockB</title><link>https://baeseokjae.github.io/tags/llm-cost/</link><description>Recent content in Llm-Cost on RockB</description><image><title>RockB</title><url>https://baeseokjae.github.io/images/og-default.png</url><link>https://baeseokjae.github.io/images/og-default.png</link></image><generator>Hugo</generator><language>en-us</language><lastBuildDate>Wed, 13 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://baeseokjae.github.io/tags/llm-cost/index.xml" rel="self" type="application/rss+xml"/><item><title>AI Developer Cost Optimization 2026: Token Budgets, Caching &amp; Multi-Model Routing</title><link>https://baeseokjae.github.io/posts/ai-developer-cost-optimization-2026/</link><pubDate>Wed, 13 May 2026 00:00:00 +0000</pubDate><guid>https://baeseokjae.github.io/posts/ai-developer-cost-optimization-2026/</guid><description>Enterprise token costs dropped 67% in 2025-2026 through multi-model routing, prompt caching, and token budgets. Here are the 9 strategies cutting real LLM spend in 2026.</description><content:encoded><![CDATA[<p>Enterprise token costs fell 67% year-over-year in 2025–2026 — not because models got dramatically cheaper overnight, but because engineering teams finally learned to route intelligently, cache aggressively, and set hard budget limits on every agentic step. The average enterprise account now runs 4.7 distinct models (up from 2.1 in Q1 2025), open-source models captured 38% of enterprise token volume for the first time ever, and teams that adopted these nine strategies are seeing cost reductions that outpace every model pricing cut combined.</p>
<h2 id="ai-developer-cost-optimization-2026-how-enterprise-token-spend-dropped-67">AI Developer Cost Optimization 2026: How Enterprise Token Spend Dropped 67%</h2>
<p>Enterprise AI token spend dropped 67% year-over-year in 2025–2026, and the driver is not what most developers expect. Model pricing cuts account for only a fraction of that reduction. The bigger factor is architectural discipline: organizations that implemented structured cost optimization across prompting, routing, caching, and workflow design saw compound savings that dwarf any individual pricing reduction. The average enterprise account now uses 4.7 distinct models — a 124% increase from 2.1 in Q1 2025 — reflecting a fundamental shift from &ldquo;use the best model for everything&rdquo; to &ldquo;match model capability precisely to task complexity.&rdquo; Open-source models captured 38% of enterprise token volume in Q1 2026, the first time that share has crossed the one-third threshold, driven by Llama, Mistral, and Qwen deployments for classification, summarization, and structured extraction tasks that do not require frontier reasoning. This article covers nine strategies responsible for most of those savings: prompt caching, multi-model routing, token budget enforcement, batch processing, semantic caching, context management, output length control, model tiering, and agent workflow optimization — with specific pricing numbers, implementation patterns, and a cost calculator to show what each strategy actually saves at scale.</p>
<h3 id="why-2026-is-the-inflection-point-for-llm-cost-engineering">Why 2026 Is the Inflection Point for LLM Cost Engineering</h3>
<p>Before 2025, most teams used a single model for all workloads and accepted token costs as a fixed operating expense. The emergence of capable sub-$1 models (Gemini Flash at $0.075/M input, Claude Haiku 4.5 at $0.08/M input) combined with infrastructure primitives like prompt caching APIs and batch endpoints created the conditions for systematic cost engineering. Teams that built routing and caching infrastructure now treat token spend as an engineerable variable rather than a pricing-dependent constant.</p>
<hr>
<h2 id="prompt-caching-the-90-discount-youre-probably-not-using">Prompt Caching: The 90% Discount You&rsquo;re Probably Not Using</h2>
<p>Anthropic&rsquo;s prompt caching API delivers a <strong>90% discount on cached input tokens</strong> — the largest single cost lever available to developers in 2026 — yet adoption remains low because the implementation pattern is non-obvious and the benefit only materializes when the same context is reused across many requests. The discount works by storing a hash of a marked context block on Anthropic&rsquo;s servers for up to five minutes (extended on cache hit); subsequent requests that include an identical block pay $0.30 per million tokens instead of $3.00 per million for Claude Sonnet 4. For an application that prepends a 10,000-token system prompt to every user request — a common pattern in enterprise chat, RAG pipelines, and coding assistants — prompt caching reduces the cost of that prefix from $0.03 per request to $0.003 per request. At 100,000 requests per day, that is $2,700 per day in savings from a single configuration change. The cache hit rate is the critical variable: workloads with high prefix reuse (shared system prompts, static RAG context, tool definitions) see cache hit rates above 90%, while highly variable prompts see near zero. The engineering work is to identify which parts of your prompt are static and move them to the front of the context so the caching breakpoint captures maximum token volume.</p>
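<p>The arithmetic above can be sketched as a quick estimator. This is a first-order sketch: it ignores the one-time 25% cache-write premium and treats every request as a cache hit unless you lower the hypothetical <code>hit_rate</code> parameter:</p>

```python
# Savings estimator for prompt caching: cached input tokens bill at 10%
# of the base input rate (the 25% cache-write surcharge is ignored here
# for a first-order estimate).
def caching_savings_per_day(prefix_tokens: int, requests_per_day: int,
                            base_rate_per_m: float, hit_rate: float = 1.0) -> float:
    """Daily dollars saved by caching a static prompt prefix."""
    full_cost = prefix_tokens / 1_000_000 * base_rate_per_m
    cached_cost = full_cost * 0.10  # 90% discount on cache reads
    return requests_per_day * hit_rate * (full_cost - cached_cost)

# The article's example: 10k-token prefix, $3/M input, 100k requests/day
print(caching_savings_per_day(10_000, 100_000, 3.00))  # 2700.0
```

<p>Plugging in your own prefix size and hit rate shows quickly whether caching pays for a given workload.</p>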
<h3 id="how-to-implement-prompt-caching-in-the-anthropic-sdk">How to Implement Prompt Caching in the Anthropic SDK</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> anthropic
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> anthropic<span style="color:#f92672">.</span>Anthropic()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>messages<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;claude-sonnet-4-5&#34;</span>,
</span></span><span style="display:flex;"><span>    max_tokens<span style="color:#f92672">=</span><span style="color:#ae81ff">1024</span>,
</span></span><span style="display:flex;"><span>    system<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>        {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;text&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;text&#34;</span>: <span style="color:#e6db74">&#34;You are a senior software engineer...&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#75715e"># Mark static context for caching</span>
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;cache_control&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;ephemeral&#34;</span>}
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>    ],
</span></span><span style="display:flex;"><span>    messages<span style="color:#f92672">=</span>[{<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: user_query}]
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Check cache performance (the +1 in the denominator avoids division by zero on a fresh cache)</span>
</span></span><span style="display:flex;"><span>usage <span style="color:#f92672">=</span> response<span style="color:#f92672">.</span>usage
</span></span><span style="display:flex;"><span>cache_hit_rate <span style="color:#f92672">=</span> usage<span style="color:#f92672">.</span>cache_read_input_tokens <span style="color:#f92672">/</span> (
</span></span><span style="display:flex;"><span>    usage<span style="color:#f92672">.</span>cache_read_input_tokens <span style="color:#f92672">+</span> usage<span style="color:#f92672">.</span>cache_creation_input_tokens <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><p>The <code>cache_control: {&quot;type&quot;: &quot;ephemeral&quot;}</code> marker tells Anthropic&rsquo;s API to cache everything up to and including that content block. Put your largest static blocks — system prompts, document context, tool schemas — first and mark them for caching. Dynamic user content goes last and is never cached.</p>
<h3 id="when-prompt-caching-does-not-help">When Prompt Caching Does Not Help</h3>
<p>Prompt caching is ineffective for highly personalized prompts where the first several thousand tokens vary per user, for workloads with very low request volume (cache overhead cost exceeds savings), and for short prompts where the static prefix is less than ~1,000 tokens. For these cases, the strategies in the following sections deliver better returns.</p>
<hr>
<h2 id="multi-model-routing-using-cheap-models-for-80-of-your-requests">Multi-Model Routing: Using Cheap Models for 80% of Your Requests</h2>
<p>Multi-model routing — automatically directing each request to the cheapest model that can handle it reliably — is the strategy most responsible for the 67% cost reduction seen across enterprise accounts in 2026. The pricing differential between frontier and economy models is now extreme: <strong>GPT-4o costs $2.50/M input tokens and $10/M output tokens</strong>, while <strong>Gemini Flash costs $0.075/M input and $0.30/M output</strong> — a 33× cost gap on both input and output. Claude Sonnet 4 sits at $3/$15 per million, while Claude Haiku 4.5 drops to $0.08/$0.40 — a 37× cost differential within the same model family. The practical implication is that routing 70–80% of your requests to economy-tier models while reserving frontier models for complex reasoning tasks cuts overall spend roughly 3–5× for most workloads, since blended cost is dominated by the share of traffic that still hits the frontier tier. Enterprise teams with 4.7 models per account are not experimenting — they have built routing infrastructure that classifies incoming tasks by complexity and capability requirements and dispatches accordingly. The key insight is that most production LLM requests are not frontier-model tasks: classification, entity extraction, summarization, simple Q&amp;A, template filling, and format conversion all perform reliably on $0.08–$0.15/M models.</p>
<h3 id="building-a-task-classifier-for-routing">Building a Task Classifier for Routing</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> enum <span style="color:#f92672">import</span> Enum
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">TaskComplexity</span>(Enum):
</span></span><span style="display:flex;"><span>    SIMPLE <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;simple&#34;</span>      <span style="color:#75715e"># Classification, extraction, formatting</span>
</span></span><span style="display:flex;"><span>    MEDIUM <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;medium&#34;</span>      <span style="color:#75715e"># Summarization, Q&amp;A with context</span>
</span></span><span style="display:flex;"><span>    COMPLEX <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;complex&#34;</span>    <span style="color:#75715e"># Multi-step reasoning, code generation, analysis</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>MODEL_MAP <span style="color:#f92672">=</span> {
</span></span><span style="display:flex;"><span>    TaskComplexity<span style="color:#f92672">.</span>SIMPLE: <span style="color:#e6db74">&#34;claude-haiku-4-5&#34;</span>,    <span style="color:#75715e"># $0.08/$0.40 per M</span>
</span></span><span style="display:flex;"><span>    TaskComplexity<span style="color:#f92672">.</span>MEDIUM: <span style="color:#e6db74">&#34;gemini-flash&#34;</span>,          <span style="color:#75715e"># $0.075/$0.30 per M</span>
</span></span><span style="display:flex;"><span>    TaskComplexity<span style="color:#f92672">.</span>COMPLEX: <span style="color:#e6db74">&#34;claude-sonnet-4-5&#34;</span>,   <span style="color:#75715e"># $3.00/$15.00 per M</span>
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># requires_reasoning / requires_synthesis: application-specific heuristics (keyword or classifier checks), defined elsewhere</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">classify_and_route</span>(prompt: str, context_size: int) <span style="color:#f92672">-&gt;</span> str:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> context_size <span style="color:#f92672">&gt;</span> <span style="color:#ae81ff">50_000</span> <span style="color:#f92672">or</span> requires_reasoning(prompt):
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> MODEL_MAP[TaskComplexity<span style="color:#f92672">.</span>COMPLEX]
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">elif</span> context_size <span style="color:#f92672">&gt;</span> <span style="color:#ae81ff">5_000</span> <span style="color:#f92672">or</span> requires_synthesis(prompt):
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> MODEL_MAP[TaskComplexity<span style="color:#f92672">.</span>MEDIUM]
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">else</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> MODEL_MAP[TaskComplexity<span style="color:#f92672">.</span>SIMPLE]
</span></span></code></pre></div><p>Teams that implement routing report that 75–85% of production requests fall into the simple or medium tier, meaning only 15–25% of requests genuinely need, and should pay for, frontier-model capability.</p>
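<p>To see how the blended rate falls out of a routing split, here is an illustrative calculation using the list prices quoted above. The traffic mix is hypothetical; real splits vary by workload:</p>

```python
# Blended per-million-token input cost under a routing split.
PRICES = {  # $ per million input tokens (article's figures)
    "claude-haiku-4-5": 0.08,
    "gemini-flash": 0.075,
    "claude-sonnet-4-5": 3.00,
}

def blended_input_cost(mix: dict[str, float]) -> float:
    """mix maps model name -> share of requests (shares sum to 1)."""
    return sum(PRICES[model] * share for model, share in mix.items())

# 50% simple, 30% medium, 20% complex
cost = blended_input_cost({"claude-haiku-4-5": 0.5,
                           "gemini-flash": 0.3,
                           "claude-sonnet-4-5": 0.2})
print(round(cost, 4))  # 0.6625
```

<p>At $0.6625/M blended versus $3.00/M for all-Sonnet traffic, this split is roughly 4.5× cheaper, and the complex tier still dominates the bill, which is why pushing even a few more points of traffic downward keeps paying off.</p>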
<hr>
<h2 id="token-budget-enforcement-preventing-runaway-costs-in-agent-workflows">Token Budget Enforcement: Preventing Runaway Costs in Agent Workflows</h2>
<p>Agentic AI workflows introduced a cost failure mode that did not exist in single-turn applications: <strong>unbounded token consumption across multi-step reasoning loops</strong>. A single misconfigured agent that enters a debugging loop, over-retrieves context at every step, or fails to terminate gracefully can consume millions of tokens and generate hundreds of dollars in cost within a single task execution. Token budget enforcement — setting hard per-step and per-task limits that the model is made aware of and that the orchestration layer enforces — is the primary defense. Anthropic&rsquo;s API supports a <code>budget_tokens</code> parameter for extended thinking that tells Claude exactly how many reasoning tokens it is allowed to consume; similar per-step limits can be enforced at the orchestration layer for any model by tracking cumulative token consumption and truncating or terminating steps that exceed thresholds. Enterprise teams running production agent workflows in 2026 universally set four limits: maximum tokens per LLM call (via <code>max_tokens</code>), maximum tokens per agent step (tracked in orchestration), maximum steps per task, and maximum total tokens per task. These four limits create a cost ceiling for every workflow execution regardless of task complexity or model behavior.</p>
<h3 id="implementing-token-budgets-in-agent-orchestration">Implementing Token Budgets in Agent Orchestration</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">BudgetedAgent</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Assumes client = anthropic.Anthropic() and a BudgetExceededError exception are defined at module scope</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> __init__(self, max_tokens_per_step<span style="color:#f92672">=</span><span style="color:#ae81ff">2000</span>, max_steps<span style="color:#f92672">=</span><span style="color:#ae81ff">10</span>):
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>max_tokens_per_step <span style="color:#f92672">=</span> max_tokens_per_step
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>max_steps <span style="color:#f92672">=</span> max_steps
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>total_tokens_used <span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>steps_taken <span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">run_step</span>(self, prompt: str, tools: list) <span style="color:#f92672">-&gt;</span> dict:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> self<span style="color:#f92672">.</span>steps_taken <span style="color:#f92672">&gt;=</span> self<span style="color:#f92672">.</span>max_steps:
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">raise</span> BudgetExceededError(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Max steps (</span><span style="color:#e6db74">{</span>self<span style="color:#f92672">.</span>max_steps<span style="color:#e6db74">}</span><span style="color:#e6db74">) reached&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>messages<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>            model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;claude-sonnet-4-5&#34;</span>,
</span></span><span style="display:flex;"><span>            max_tokens<span style="color:#f92672">=</span>self<span style="color:#f92672">.</span>max_tokens_per_step,  <span style="color:#75715e"># Hard per-call limit</span>
</span></span><span style="display:flex;"><span>            messages<span style="color:#f92672">=</span>[{<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: prompt}],
</span></span><span style="display:flex;"><span>            tools<span style="color:#f92672">=</span>tools,
</span></span><span style="display:flex;"><span>        )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        tokens_used <span style="color:#f92672">=</span> response<span style="color:#f92672">.</span>usage<span style="color:#f92672">.</span>input_tokens <span style="color:#f92672">+</span> response<span style="color:#f92672">.</span>usage<span style="color:#f92672">.</span>output_tokens
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>total_tokens_used <span style="color:#f92672">+=</span> tokens_used
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>steps_taken <span style="color:#f92672">+=</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> response
</span></span></code></pre></div><p>For extended thinking workloads, Anthropic&rsquo;s <code>thinking</code> parameter with <code>budget_tokens</code> gives Claude explicit awareness of its reasoning budget, which tends to produce more concise reasoning paths compared to unconstrained thinking — a secondary cost benefit on top of the hard ceiling.</p>
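<p>A minimal sketch of wiring a thinking budget into a request. The <code>thinking</code>/<code>budget_tokens</code> fields follow Anthropic&rsquo;s Messages API shape, where <code>max_tokens</code> must exceed the thinking budget; the helper name and token numbers are illustrative:</p>

```python
# Hedged sketch: requesting extended thinking with an explicit reasoning
# budget. The thinking/budget_tokens fields follow Anthropic's Messages API;
# thinking_params and the specific numbers are illustrative.
def thinking_params(budget_tokens: int, answer_tokens: int) -> dict:
    """Build request kwargs; max_tokens must exceed the thinking budget."""
    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": budget_tokens + answer_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
    }

params = thinking_params(budget_tokens=4000, answer_tokens=1024)
# response = client.messages.create(**params, messages=[...])
print(params["max_tokens"])  # 5024
```

<p>Sizing <code>max_tokens</code> as budget plus expected answer length keeps the hard ceiling explicit instead of guessing a single combined number.</p>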
<h3 id="setting-budget-alerts-before-hard-limits">Setting Budget Alerts Before Hard Limits</h3>
<p>Hard limits terminate workflows abruptly, which can produce corrupted state. A two-tier approach — alert at 70% of budget, terminate at 100% — gives the orchestration layer a chance to gracefully summarize progress and checkpoint state before the task is cancelled, enabling resumable workflows that do not waste tokens re-executing completed steps.</p>
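<p>The two-tier pattern can be sketched as a small tracker; the class and return values are hypothetical names for whatever your orchestration layer uses:</p>

```python
# Minimal sketch of the alert-then-terminate pattern: warn at 70% of the
# token budget so the agent can checkpoint, hard-stop at 100%.
class TokenBudget:
    def __init__(self, total: int, alert_fraction: float = 0.7):
        self.total = total
        self.alert_at = int(total * alert_fraction)
        self.used = 0

    def record(self, tokens: int) -> str:
        """Returns 'ok', 'alert' (checkpoint now), or 'terminate'."""
        self.used += tokens
        if self.used >= self.total:
            return "terminate"
        if self.used >= self.alert_at:
            return "alert"
        return "ok"

budget = TokenBudget(total=100_000)
print(budget.record(60_000))  # ok
print(budget.record(15_000))  # alert (75% used: summarize and checkpoint)
print(budget.record(30_000))  # terminate (105% used)
```

<p>On <code>alert</code>, the orchestrator asks the agent to summarize progress and persist state; on <code>terminate</code>, the task resumes later from that checkpoint instead of re-running completed steps.</p>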
<hr>
<h2 id="batch-api-the-50-discount-for-non-real-time-workloads">Batch API: The 50% Discount for Non-Real-Time Workloads</h2>
<p>Both Anthropic and OpenAI offer <strong>50% discounts on batch processing endpoints</strong> — the largest flat discount available for non-real-time workloads — yet most teams process everything through synchronous real-time APIs regardless of whether the use case actually requires it. The Anthropic Message Batches API and OpenAI Batch API accept request queues that are processed asynchronously with results available within 24 hours, at exactly half the per-token cost of synchronous endpoints. The 50% discount applies to both input and output tokens. Use cases that are natural fits for batch processing are extensive: nightly document classification runs, offline embedding generation, data labeling pipelines, evaluation dataset scoring, content moderation queues, report generation, and any workflow where a human is not waiting synchronously for a response. For teams running large-scale RAG pipelines or regularly processing document corpora, shifting ingestion and classification to batch APIs alone can reduce monthly token spend by 20–30% without any change to model selection or prompt design.</p>
<h3 id="structuring-requests-for-the-anthropic-batch-api">Structuring Requests for the Anthropic Batch API</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> anthropic
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> json
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> anthropic<span style="color:#f92672">.</span>Anthropic()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Build batch requests (documents: a list of strings loaded elsewhere)</span>
</span></span><span style="display:flex;"><span>requests <span style="color:#f92672">=</span> [
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;custom_id&#34;</span>: <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;doc-</span><span style="color:#e6db74">{</span>i<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;params&#34;</span>: {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;model&#34;</span>: <span style="color:#e6db74">&#34;claude-haiku-4-5&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;max_tokens&#34;</span>: <span style="color:#ae81ff">256</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;messages&#34;</span>: [
</span></span><span style="display:flex;"><span>                {<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Classify this document: </span><span style="color:#e6db74">{</span>doc<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>}
</span></span><span style="display:flex;"><span>            ]
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> i, doc <span style="color:#f92672">in</span> enumerate(documents)
</span></span><span style="display:flex;"><span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Submit batch</span>
</span></span><span style="display:flex;"><span>batch <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>messages<span style="color:#f92672">.</span>batches<span style="color:#f92672">.</span>create(requests<span style="color:#f92672">=</span>requests)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Batch ID: </span><span style="color:#e6db74">{</span>batch<span style="color:#f92672">.</span>id<span style="color:#e6db74">}</span><span style="color:#e6db74">, Status: </span><span style="color:#e6db74">{</span>batch<span style="color:#f92672">.</span>processing_status<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Poll for results (or use webhook)</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> time
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">while</span> <span style="color:#66d9ef">True</span>:
</span></span><span style="display:flex;"><span>    batch <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>messages<span style="color:#f92672">.</span>batches<span style="color:#f92672">.</span>retrieve(batch<span style="color:#f92672">.</span>id)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> batch<span style="color:#f92672">.</span>processing_status <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;ended&#34;</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">break</span>
</span></span><span style="display:flex;"><span>    time<span style="color:#f92672">.</span>sleep(<span style="color:#ae81ff">60</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Process results</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> result <span style="color:#f92672">in</span> client<span style="color:#f92672">.</span>messages<span style="color:#f92672">.</span>batches<span style="color:#f92672">.</span>results(batch<span style="color:#f92672">.</span>id):
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;</span><span style="color:#e6db74">{</span>result<span style="color:#f92672">.</span>custom_id<span style="color:#e6db74">}</span><span style="color:#e6db74">: </span><span style="color:#e6db74">{</span>result<span style="color:#f92672">.</span>result<span style="color:#f92672">.</span>message<span style="color:#f92672">.</span>content<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span></code></pre></div><p>The key engineering discipline for batch workloads is separating your request pipeline into synchronous (user-facing, latency-sensitive) and asynchronous (background, latency-tolerant) lanes and routing each category to the appropriate API endpoint automatically.</p>
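<p>A minimal sketch of that lane split; the field names are illustrative, and the real signal for what counts as batchable depends on your product:</p>

```python
# Route requests to the sync (real-time) or batch (50%-discount) lane
# based on whether a user is waiting. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class LLMRequest:
    prompt: str
    user_facing: bool       # someone is waiting on the response
    deadline_hours: float   # acceptable turnaround time

def choose_lane(req: LLMRequest) -> str:
    """'sync' for latency-sensitive traffic, 'batch' otherwise."""
    if req.user_facing or req.deadline_hours < 24:
        return "sync"
    return "batch"

print(choose_lane(LLMRequest("Summarize this ticket", True, 0.01)))    # sync
print(choose_lane(LLMRequest("Classify archived doc", False, 48.0)))  # batch
```

<p>Making the lane choice automatic at the dispatch layer means the discount applies by default rather than depending on each team remembering the batch endpoint exists.</p>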
<hr>
<h2 id="semantic-caching-stop-paying-for-the-same-llm-response-twice">Semantic Caching: Stop Paying for the Same LLM Response Twice</h2>
<p>Semantic caching addresses a cost pattern that standard prompt caching cannot solve: <strong>users asking semantically equivalent questions with different surface wording</strong>. Where prompt caching works at the token-exact level within a single request, semantic caching works at the meaning level across multiple requests by storing previous LLM responses in a vector database and returning cached answers when a new query is sufficiently similar to a cached one. A question like &ldquo;What is the refund policy?&rdquo; and &ldquo;How do I get a refund?&rdquo; may share zero cached tokens but are semantically equivalent in most support contexts and should return the same answer without an LLM call. Implementations typically use Redis with vector search extensions or Qdrant as the vector store, compute embeddings for incoming queries using a cheap embedding model (OpenAI text-embedding-3-small at $0.02/M tokens or a local model), and set a cosine similarity threshold above which a cached response is returned directly. Teams report cache hit rates of 30–60% for support, documentation, and FAQ workloads — at those hit rates, semantic caching reduces LLM calls (and token costs) by the same percentage, with embedding costs representing less than 1% of the savings.</p>
<h3 id="semantic-cache-implementation-with-redis">Semantic Cache Implementation with Redis</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> redis
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> json
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> numpy <span style="color:#66d9ef">as</span> np
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI()
</span></span><span style="display:flex;"><span>r <span style="color:#f92672">=</span> redis<span style="color:#f92672">.</span>Redis()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>SIMILARITY_THRESHOLD <span style="color:#f92672">=</span> <span style="color:#ae81ff">0.92</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">get_embedding</span>(text: str) <span style="color:#f92672">-&gt;</span> list[float]:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> client<span style="color:#f92672">.</span>embeddings<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>        model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;text-embedding-3-small&#34;</span>,
</span></span><span style="display:flex;"><span>        input<span style="color:#f92672">=</span>text
</span></span><span style="display:flex;"><span>    )<span style="color:#f92672">.</span>data[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>embedding
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">cosine_similarity</span>(a: list[float], b: list[float]) <span style="color:#f92672">-&gt;</span> float:
</span></span><span style="display:flex;"><span>    a, b <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>asarray(a), np<span style="color:#f92672">.</span>asarray(b)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> float(a <span style="color:#f92672">@</span> b <span style="color:#f92672">/</span> (np<span style="color:#f92672">.</span>linalg<span style="color:#f92672">.</span>norm(a) <span style="color:#f92672">*</span> np<span style="color:#f92672">.</span>linalg<span style="color:#f92672">.</span>norm(b)))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">cosine_similarity</span>(a: list[float], b: list[float]) <span style="color:#f92672">-&gt;</span> float:
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Cosine similarity between two embedding vectors.</span>
</span></span><span style="display:flex;"><span>    va, vb <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>asarray(a), np<span style="color:#f92672">.</span>asarray(b)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> float(np<span style="color:#f92672">.</span>dot(va, vb) <span style="color:#f92672">/</span> (np<span style="color:#f92672">.</span>linalg<span style="color:#f92672">.</span>norm(va) <span style="color:#f92672">*</span> np<span style="color:#f92672">.</span>linalg<span style="color:#f92672">.</span>norm(vb)))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">semantic_cache_lookup</span>(query: str) <span style="color:#f92672">-&gt;</span> str <span style="color:#f92672">|</span> <span style="color:#66d9ef">None</span>:
</span></span><span style="display:flex;"><span>    query_embedding <span style="color:#f92672">=</span> get_embedding(query)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Search cached embeddings for similar queries</span>
</span></span><span style="display:flex;"><span>    cached_keys <span style="color:#f92672">=</span> r<span style="color:#f92672">.</span>keys(<span style="color:#e6db74">&#34;cache:*&#34;</span>)
</span></span><span style="display:flex;"><span>    best_similarity <span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span>    best_response <span style="color:#f92672">=</span> <span style="color:#66d9ef">None</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> key <span style="color:#f92672">in</span> cached_keys:
</span></span><span style="display:flex;"><span>        cached_data <span style="color:#f92672">=</span> json<span style="color:#f92672">.</span>loads(r<span style="color:#f92672">.</span>get(key))
</span></span><span style="display:flex;"><span>        similarity <span style="color:#f92672">=</span> cosine_similarity(query_embedding, cached_data[<span style="color:#e6db74">&#34;embedding&#34;</span>])
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> similarity <span style="color:#f92672">&gt;</span> best_similarity:
</span></span><span style="display:flex;"><span>            best_similarity <span style="color:#f92672">=</span> similarity
</span></span><span style="display:flex;"><span>            best_response <span style="color:#f92672">=</span> cached_data[<span style="color:#e6db74">&#34;response&#34;</span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> best_similarity <span style="color:#f92672">&gt;=</span> SIMILARITY_THRESHOLD:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> best_response
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> <span style="color:#66d9ef">None</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">llm_call_with_cache</span>(query: str) <span style="color:#f92672">-&gt;</span> str:
</span></span><span style="display:flex;"><span>    cached <span style="color:#f92672">=</span> semantic_cache_lookup(query)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> cached:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> cached  <span style="color:#75715e"># No token cost</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>chat<span style="color:#f92672">.</span>completions<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>        model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o-mini&#34;</span>,
</span></span><span style="display:flex;"><span>        messages<span style="color:#f92672">=</span>[{<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: query}]
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>    answer <span style="color:#f92672">=</span> response<span style="color:#f92672">.</span>choices[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>message<span style="color:#f92672">.</span>content
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Store in cache</span>
</span></span><span style="display:flex;"><span>    embedding <span style="color:#f92672">=</span> get_embedding(query)
</span></span><span style="display:flex;"><span>    r<span style="color:#f92672">.</span>setex(
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;cache:</span><span style="color:#e6db74">{</span>hashlib<span style="color:#f92672">.</span>sha256(query<span style="color:#f92672">.</span>encode())<span style="color:#f92672">.</span>hexdigest()<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>,  <span style="color:#75715e"># stable key; hash() varies across processes</span>
</span></span><span style="display:flex;"><span>        <span style="color:#ae81ff">3600</span>,  <span style="color:#75715e"># 1-hour TTL</span>
</span></span><span style="display:flex;"><span>        json<span style="color:#f92672">.</span>dumps({<span style="color:#e6db74">&#34;embedding&#34;</span>: embedding, <span style="color:#e6db74">&#34;response&#34;</span>: answer})
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> answer
</span></span></code></pre></div><p>The linear scan over Redis keys above is fine for a prototype, but production implementations swap it for a dedicated vector index (Qdrant&rsquo;s similarity search, or Redis&rsquo;s own vector search module), which scales to millions of cached entries with sub-millisecond lookup times.</p>
<hr>
<h2 id="context-window-management-only-include-what-the-model-needs">Context Window Management: Only Include What the Model Needs</h2>
<p>Context window costs scale linearly with every token included in the prompt — and most production prompts include far more context than the model requires to answer the question accurately. <strong>Conversation history accumulation</strong> is the most common source of avoidable context cost: a naive implementation that appends every message to the context window makes cumulative token cost grow quadratically over a session, meaning a 20-turn conversation costs roughly four times as much as a 10-turn conversation even if the task has not grown in complexity. Selective context strategies — summarization, sliding windows, relevance filtering, and hierarchical retrieval — keep context lean without sacrificing response quality. The principle is to include only the context that changes the model&rsquo;s answer: background that is always true belongs in a cached system prompt, recent relevant turns belong in the conversation window, and historical context should be compressed to a running summary. For RAG workloads, retrieving 3–5 highly relevant chunks consistently outperforms retrieving 20 moderately relevant chunks, both in cost and in answer quality — because irrelevant context introduces noise that degrades generation quality while adding token cost.</p>
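<p>The quadratic effect is easy to check with a toy model. The sketch below uses hypothetical figures (a flat 100 tokens per turn) to compare a naive append-everything loop against a sliding window:</p>

```python
# Toy model of cumulative input-token cost over a session.
# Assumes a flat 100 tokens per turn (hypothetical figure).
TOKENS_PER_TURN = 100

def naive_session_tokens(turns: int) -> int:
    # Every request resends the full history: turn i costs i * TOKENS_PER_TURN.
    return sum(i * TOKENS_PER_TURN for i in range(1, turns + 1))

def windowed_session_tokens(turns: int, window: int = 6) -> int:
    # Every request resends at most `window` recent turns.
    return sum(min(i, window) * TOKENS_PER_TURN for i in range(1, turns + 1))

ten = naive_session_tokens(10)          # 5,500 tokens
twenty = naive_session_tokens(20)       # 21,000 tokens, ~3.8x the 10-turn session
windowed = windowed_session_tokens(20)  # 10,500 tokens
```

<p>The windowed version grows only linearly once the window fills, which is what keeps long sessions affordable.</p>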
<h3 id="context-trimming-strategies-by-workload-type">Context Trimming Strategies by Workload Type</h3>
<p><strong>Conversational agents:</strong> maintain a rolling window of the last N turns plus a running summary of earlier context. Summarize every 10 turns using a cheap model (Haiku, GPT-4o mini) to compress history at low cost.</p>
<p><strong>RAG pipelines:</strong> use a reranker (Cohere Rerank or a cross-encoder) to select the top 3–5 chunks from a larger initial retrieval set. The reranker call costs a fraction of the tokens saved by excluding low-relevance chunks from the LLM context.</p>
<p><strong>Agent tool results:</strong> truncate verbose tool outputs to the fields the next step actually needs. A web search result containing full HTML, metadata, and body text can often be compressed to 10% of its original size by extracting only the relevant content before including it in the prompt.</p>
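<p>Tool-result truncation can be sketched in a few lines. The result shape and field names below are hypothetical, but the pattern applies to most tool outputs: whitelist the fields the next step needs and cap free-text length.</p>

```python
# Whitelist-and-truncate for verbose tool outputs before they enter the
# prompt. The raw result shape below is hypothetical.
def trim_search_result(result: dict, max_snippet_chars: int = 500) -> dict:
    return {
        "title": result.get("title", ""),
        "url": result.get("url", ""),
        # Keep a capped slice of body text; drop markup and metadata entirely.
        "snippet": result.get("body", "")[:max_snippet_chars],
    }

raw = {
    "title": "Pricing update",
    "url": "https://example.com/pricing",
    "body": "x" * 20_000,        # full page text
    "raw_markup": "y" * 80_000,  # never useful to the model
    "headers": {"server": "nginx"},
}
trimmed = trim_search_result(raw)  # roughly 0.5 KB instead of ~100 KB
```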
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">build_lean_context</span>(conversation_history: list, max_turns: int <span style="color:#f92672">=</span> <span style="color:#ae81ff">6</span>) <span style="color:#f92672">-&gt;</span> list:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> len(conversation_history) <span style="color:#f92672">&lt;=</span> max_turns:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> conversation_history
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Summarize older turns</span>
</span></span><span style="display:flex;"><span>    older_turns <span style="color:#f92672">=</span> conversation_history[:<span style="color:#f92672">-</span>max_turns]
</span></span><span style="display:flex;"><span>    summary_prompt <span style="color:#f92672">=</span> <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Summarize this conversation concisely:</span><span style="color:#ae81ff">\n</span><span style="color:#e6db74">{</span>json<span style="color:#f92672">.</span>dumps(older_turns)<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>
</span></span><span style="display:flex;"><span>    summary <span style="color:#f92672">=</span> cheap_model_call(summary_prompt)  <span style="color:#75715e"># Haiku or GPT-4o mini</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    recent_turns <span style="color:#f92672">=</span> conversation_history[<span style="color:#f92672">-</span>max_turns:]
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> [{<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;system&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Earlier context: </span><span style="color:#e6db74">{</span>summary<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>}] <span style="color:#f92672">+</span> recent_turns
</span></span></code></pre></div><hr>
<h2 id="cost-calculator-what-you-actually-spend-across-models">Cost Calculator: What You Actually Spend Across Models</h2>
<p>Enterprise teams that implemented multi-model routing, prompt caching, and batch processing in 2025–2026 achieved a <strong>67% cost reduction</strong> on average — but the variance is enormous. Teams without instrumentation routinely overspend by 5–10× because they cannot see which workloads are consuming disproportionate token volume or which requests are bypassing caching infrastructure. Understanding your actual token spend requires breaking costs down by model, request type, and optimization strategy applied. The table below shows current pricing as of May 2026 and the effective cost after applying the strategies covered in this article. Output tokens cost 3–5× more per token than input tokens across all major providers, which means output length control and structured formatting are disproportionately high-leverage. The worked example following the table quantifies the compounded effect of stacking four optimization strategies on a 1-million-request-per-month workload — a scale representative of mid-size production applications.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Input ($/M)</th>
          <th>Output ($/M)</th>
          <th>Batch Input ($/M)</th>
          <th>Cached Input ($/M)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GPT-4o</td>
          <td>$2.50</td>
          <td>$10.00</td>
          <td>$1.25</td>
          <td>$1.25</td>
      </tr>
      <tr>
          <td>Claude Sonnet 4</td>
          <td>$3.00</td>
          <td>$15.00</td>
          <td>$1.50</td>
          <td>$0.30</td>
      </tr>
      <tr>
          <td>GPT-4o mini</td>
          <td>$0.15</td>
          <td>$0.60</td>
          <td>$0.075</td>
          <td>—</td>
      </tr>
      <tr>
          <td>Claude Haiku 4.5</td>
          <td>$0.08</td>
          <td>$0.40</td>
          <td>$0.04</td>
          <td>$0.008</td>
      </tr>
      <tr>
          <td>Gemini Flash</td>
          <td>$0.075</td>
          <td>$0.30</td>
          <td>—</td>
          <td>—</td>
      </tr>
  </tbody>
</table>
<h3 id="sample-cost-calculation-1m-requests-per-month">Sample Cost Calculation: 1M Requests Per Month</h3>
<p>Assume a production application with 1 million requests per month, each averaging 2,000 input tokens and 500 output tokens.</p>
<p><strong>Baseline (all GPT-4o, no optimization):</strong></p>
<ul>
<li>Input: 2B tokens × $2.50/M = $5,000</li>
<li>Output: 500M tokens × $10.00/M = $5,000</li>
<li><strong>Monthly total: $10,000</strong></li>
</ul>
<p><strong>After multi-model routing (80% Haiku, 20% GPT-4o):</strong></p>
<ul>
<li>Haiku input: 1.6B × $0.08/M = $128</li>
<li>Haiku output: 400M × $0.40/M = $160</li>
<li>GPT-4o input: 400M × $2.50/M = $1,000</li>
<li>GPT-4o output: 100M × $10.00/M = $1,000</li>
<li><strong>Monthly total: $2,288 (77% reduction)</strong></li>
</ul>
<p><strong>After adding prompt caching (70% cache hit on Haiku):</strong></p>
<ul>
<li>Haiku cached input: 1.12B × $0.008/M = $9</li>
<li>Haiku uncached input: 480M × $0.08/M = $38</li>
<li><strong>Monthly total: $2,207 (additional 3% reduction)</strong></li>
</ul>
<p><strong>After batch API for offline workloads (40% of requests):</strong></p>
<ul>
<li>400K requests shifted to batch at 50% discount</li>
<li><strong>Monthly total: ~$1,800 (estimated 18% additional reduction)</strong></li>
</ul>
<p><strong>After semantic caching (45% cache hit rate):</strong></p>
<ul>
<li>450K requests served from cache at near-zero cost</li>
<li><strong>Monthly total: ~$1,000 (estimated 44% additional reduction)</strong></li>
</ul>
<p>Combining all four strategies achieves approximately <strong>90% total cost reduction</strong> from the unoptimized baseline — consistent with the enterprise savings patterns observed across the industry in 2025–2026. The compound nature of these optimizations is the core insight: each strategy independently reduces costs, and they stack multiplicatively.</p>
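<p>The baseline and routing figures above reduce to a few lines of arithmetic. This sketch uses the pricing table from this article; extending it with cached, batch, and semantic-cache tiers follows the same pattern:</p>

```python
# Reproduces the baseline and multi-model-routing figures above.
# Prices are dollars per million tokens, taken from the table in this article.
PRICES = {
    "gpt-4o": {"in": 2.50, "out": 10.00},
    "claude-haiku-4.5": {"in": 0.08, "out": 0.40},
}

def monthly_cost(requests: int, in_tok: int, out_tok: int, model: str) -> float:
    p = PRICES[model]
    return (requests * in_tok * p["in"] + requests * out_tok * p["out"]) / 1e6

REQS, IN_TOK, OUT_TOK = 1_000_000, 2_000, 500

baseline = monthly_cost(REQS, IN_TOK, OUT_TOK, "gpt-4o")
routed = (monthly_cost(800_000, IN_TOK, OUT_TOK, "claude-haiku-4.5")
          + monthly_cost(200_000, IN_TOK, OUT_TOK, "gpt-4o"))
# baseline -> $10,000/month; routed -> $2,288/month (77% lower)
```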
<h3 id="output-length-control-enforcing-max_tokens-and-stop-sequences">Output Length Control: Enforcing max_tokens and Stop Sequences</h3>
<p>Output token cost is often higher per token than input cost (Claude Sonnet 4 charges $15/M output vs. $3/M input — 5× the rate), making output length control disproportionately high-leverage. Set <code>max_tokens</code> on every request to a value appropriate for the task — a classification request needs 10–50 output tokens, not 1,024. Use stop sequences to terminate generation as soon as the required content is produced rather than waiting for the model to generate trailing whitespace or unnecessary verbosity. Structured output formats (JSON schemas) reduce output length by eliminating prose framing, markdown headers, and explanatory text that adds tokens without adding information value for downstream parsing.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Classification: strict max_tokens</span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>messages<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;claude-haiku-4-5&#34;</span>,
</span></span><span style="display:flex;"><span>    max_tokens<span style="color:#f92672">=</span><span style="color:#ae81ff">10</span>,  <span style="color:#75715e"># Category label only</span>
</span></span><span style="display:flex;"><span>    stop_sequences<span style="color:#f92672">=</span>[<span style="color:#e6db74">&#34;.&#34;</span>],  <span style="color:#75715e"># whitespace-only stop sequences are rejected by the API</span>
</span></span><span style="display:flex;"><span>    messages<span style="color:#f92672">=</span>[{<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Classify as POSITIVE, NEGATIVE, or NEUTRAL: </span><span style="color:#e6db74">{</span>text<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>}]
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># JSON extraction: structured output limits tokens</span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>messages<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;claude-haiku-4-5&#34;</span>,
</span></span><span style="display:flex;"><span>    max_tokens<span style="color:#f92672">=</span><span style="color:#ae81ff">200</span>,
</span></span><span style="display:flex;"><span>    messages<span style="color:#f92672">=</span>[{
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Extract name and email as JSON only: </span><span style="color:#e6db74">{</span>text<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>
</span></span><span style="display:flex;"><span>    }]
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><hr>
<h2 id="frequently-asked-questions">Frequently Asked Questions</h2>
<p><strong>Q: What is the single highest-impact cost optimization for most applications in 2026?</strong></p>
<p>Multi-model routing consistently delivers the highest cost reduction for most teams — often 70–85% on its own — because the price differential between frontier models ($2.50–$3.00/M input) and economy models ($0.075–$0.15/M input) is 20–40×, and the majority of production requests (classification, extraction, simple Q&amp;A) do not require frontier capability. Prompt caching is the highest-impact optimization for applications with large static system prompts that are reused across many requests, delivering 90% savings on cached tokens.</p>
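<p>As a minimal illustration of tier matching, a heuristic router might look like the sketch below. The thresholds and keyword hints are invented for the example; production routers typically use a small classifier model instead of string matching.</p>

```python
# Hypothetical complexity heuristic: route to the frontier tier only when
# the prompt is long or mentions work that needs deep reasoning.
FRONTIER_HINTS = ("prove", "design", "multi-step", "architecture", "debug")

def pick_model(prompt: str) -> str:
    text = prompt.lower()
    if len(text) > 4000 or any(hint in text for hint in FRONTIER_HINTS):
        return "gpt-4o"       # frontier tier
    return "gpt-4o-mini"      # economy tier

tier = pick_model("Classify this review as positive or negative.")  # economy
```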
<p><strong>Q: How does prompt caching differ from semantic caching, and when should I use each?</strong></p>
<p>Prompt caching works at the token-exact level within a single request: Anthropic stores a hash of your marked context block and charges 90% less when subsequent requests use an identical block. Semantic caching works across requests at the meaning level: you store previous responses in a vector database and return cached answers when a new query is sufficiently similar. Use prompt caching for static system prompts, document context, and tool definitions that are identical across requests. Use semantic caching for user queries that vary in wording but are semantically equivalent — FAQ systems, support bots, and documentation assistants.</p>
<p><strong>Q: What token budget limits should I set for production agent workflows?</strong></p>
<p>A conservative starting configuration: 2,000–4,000 tokens per LLM call, 10,000–20,000 tokens per agent step (including context), 10–20 steps per task, and 100,000–500,000 total tokens per task. These numbers should be calibrated against your actual workload: run a representative sample of tasks without limits, observe the 95th-percentile token consumption, and set hard limits at 2–3× the median. The goal is catching runaway loops and over-retrieval without terminating legitimately complex tasks.</p>
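<p>A minimal budget guard for an agent loop might look like this sketch. The limits mirror the conservative starting configuration above; the class and method names are hypothetical.</p>

```python
# Hypothetical budget guard: hard limits per step, per task, and on step
# count, mirroring the conservative starting configuration above.
class BudgetExceeded(Exception):
    pass

class TokenBudget:
    def __init__(self, per_step: int = 20_000, per_task: int = 500_000,
                 max_steps: int = 20):
        self.per_step, self.per_task, self.max_steps = per_step, per_task, max_steps
        self.task_tokens = 0
        self.steps = 0

    def record_step(self, tokens: int) -> None:
        # Call once per agent step with that step's total token usage.
        self.steps += 1
        self.task_tokens += tokens
        if tokens > self.per_step:
            raise BudgetExceeded(f"step used {tokens} tokens (limit {self.per_step})")
        if self.task_tokens > self.per_task or self.steps > self.max_steps:
            raise BudgetExceeded("task budget exhausted; likely a runaway loop")

budget = TokenBudget()
budget.record_step(8_000)  # within limits
```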
<p><strong>Q: When does it NOT make sense to use the Batch API?</strong></p>
<p>The Batch API&rsquo;s 24-hour processing window makes it unsuitable for any user-facing, latency-sensitive workload. Do not use it for chatbots, copilots, real-time document processing, or any task where a human is waiting synchronously for a response. The 50% discount is only relevant when the latency trade-off is acceptable — which covers a surprising share of enterprise workloads (nightly pipelines, offline classification, evaluation runs, report generation) but not interactive products.</p>
<p><strong>Q: How do I measure whether my cost optimizations are actually working?</strong></p>
<p>Instrument four metrics from day one: cost per request (total token spend divided by request count), cache hit rate (for both prompt caching and semantic caching), model distribution (what percentage of requests route to each tier), and cost per task or workflow (for agentic workloads). Set up daily cost dashboards that break down spend by model, workload type, and optimization layer. A 10% increase in cache hit rate or a 5% shift of requests to cheaper model tiers shows up immediately in cost-per-request metrics and validates that your optimization infrastructure is functioning correctly.</p>
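<p>Three of those metrics can be derived directly from a per-request log. The log schema below is hypothetical:</p>

```python
# Hypothetical per-request log; each record carries model, dollar cost,
# and whether any cache layer served it.
request_log = [
    {"model": "gpt-4o-mini", "cost": 0.0004, "cache_hit": True},
    {"model": "gpt-4o-mini", "cost": 0.0011, "cache_hit": False},
    {"model": "gpt-4o", "cost": 0.0100, "cache_hit": False},
]

n = len(request_log)
cost_per_request = sum(r["cost"] for r in request_log) / n
cache_hit_rate = sum(r["cache_hit"] for r in request_log) / n
model_share = {
    m: sum(r["model"] == m for r in request_log) / n
    for m in {r["model"] for r in request_log}
}
```

<p>Cost per task for agentic workloads is the same aggregation keyed by a task or workflow ID instead of by request.</p>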
]]></content:encoded></item></channel></rss>