LLM API prices dropped roughly 80% between 2024 and 2026. The same production workload that cost $3,000/month in 2024 now runs for approximately $150/month. This guide covers every major provider’s current rates, the hidden costs that inflate real bills, and which model wins for each use case.
LLM API Pricing Overview: April 2026 Snapshot
LLM API pricing in 2026 is segmented into three clear tiers: budget (under $1/M input tokens), mid-range ($1–$5/M), and premium ($5+/M). DeepSeek V3.2 leads the budget tier at $0.14/M input tokens — the cheapest major LLM API available as of April 2026. Google’s Gemini 2.5 Flash-Lite sits at $0.10/$0.40 per 1M input/output tokens, making it the cheapest actively supported proprietary model. In the mid tier, Claude Sonnet 4.6 at $3/$15 and Gemini 2.5 Pro at $1.25/$10 compete on quality-per-dollar. The premium tier is anchored by GPT-5.5 at $5/$30 and Claude Opus 4.7 at $5/$25. Across the entire market, inference costs have dropped by a factor of roughly 1,000 in just three years — a compression rate unlike anything seen in prior software infrastructure categories. Critically, the advertised per-token price is only part of the real cost: context window usage, output-to-input ratios, rate limits, and caching behavior all affect total monthly spend. Budget for approximately 1.7x your base token calculation when accounting for these hidden multipliers.
| Provider | Model | Input ($/1M) | Output ($/1M) | Context |
|---|---|---|---|---|
| OpenAI | GPT-5 | $0.625 | $5.00 | 128K |
| OpenAI | GPT-5.5 | $5.00 | $30.00 | 128K |
| Anthropic | Claude Haiku 4.5 | $1.00 | $5.00 | 200K |
| Anthropic | Claude Sonnet 4.6 | $3.00 | $15.00 | 200K |
| Anthropic | Claude Opus 4.7 | $5.00 | $25.00 | 200K |
| Gemini 2.5 Flash-Lite | $0.10 | $0.40 | 1M | |
| Gemini 2.5 Flash | $0.15 | $0.60 | 1M | |
| Gemini 2.5 Pro | $1.25 | $10.00 | 2M | |
| DeepSeek | V3.2 | $0.14 | $0.28 | 64K |
GPT-5 Pricing Breakdown (All Variants)
GPT-5 is OpenAI’s current base frontier model, priced at $0.625/M input tokens and $5.00/M output tokens — a significant reduction from GPT-4o’s launch price. OpenAI now offers four variants under the GPT-5 umbrella. GPT-5 base is the default for most developers: capable, fast, and priced competitively against Claude Sonnet at roughly 80% cheaper on input. GPT-5.2 and GPT-5.4 are intermediate reasoning variants priced between $1–$3/M input. GPT-5.5, the flagship multimodal reasoning model, sits at $5/$30 per 1M tokens — the most expensive OpenAI offering. All GPT-5 variants include a 50% batch API discount for async workloads, and prompt caching delivers up to 90% savings on repeated input segments. If you’re running document analysis or RAG pipelines with recurring system prompts, caching alone can cut your effective input cost below $0.10/M tokens. The main limitation: GPT-5 base has a 128K context window, which forces chunking strategies for long documents where Gemini 2.5 Pro’s 2M context wins outright.
| Variant | Input ($/1M) | Output ($/1M) | Best For |
|---|---|---|---|
| GPT-5 | $0.625 | $5.00 | General tasks, high-volume |
| GPT-5.2 | $1.00 | $8.00 | Light reasoning |
| GPT-5.4 | $3.00 | $15.00 | Complex reasoning |
| GPT-5.5 | $5.00 | $30.00 | Flagship multimodal tasks |
Claude API Pricing: Haiku 4.5 vs Sonnet 4.6 vs Opus 4.7
Anthropic’s Claude API in 2026 follows a clear three-tier structure that maps well to use-case complexity. Claude Haiku 4.5 at $1/$5 per 1M tokens is the go-to for high-volume, latency-sensitive tasks: classification, entity extraction, customer support routing, and summarization at scale. Claude Sonnet 4.6 at $3/$15 is the developer workhorse — strong instruction-following, coding ability, and extended context (200K tokens) at a price that most mid-size production workloads can absorb. Claude Opus 4.7 at $5/$25 is reserved for the hardest tasks: complex multi-step reasoning, agentic workflows, and scenarios where output quality has direct revenue impact. Anthropic also offers batch processing at 50% discount across all tiers, and prompt caching cuts cache-hit input costs by up to 90%. In practice, a production system using Sonnet 4.6 with aggressive caching and batch processing for async jobs can reach effective costs closer to $0.80/M input tokens — competitive with DeepSeek for structured workloads. The 200K context window across all Claude models is a practical advantage for document-heavy applications without the context management overhead required by 128K models.
Gemini API Pricing: Flash-Lite, Flash, Pro, and 2M Context Cost
Google’s Gemini 2.5 lineup spans the widest price range of any provider in 2026, from $0.10/M input for Flash-Lite to $1.25/M for Pro — with the critical 2M token context window that no other major provider matches at scale. Gemini 2.5 Flash-Lite at $0.10/$0.40 is effectively the cheapest proprietary LLM available: suitable for simple classification, translation, and structured data extraction where DeepSeek’s quality edge isn’t needed. Gemini 2.5 Flash at $0.15/$0.60 adds stronger reasoning and better instruction-following, making it the default choice for cost-conscious developers who still need reliable output. Gemini 2.5 Pro at $1.25/$10 is where the 2M context window becomes the selling point — processing entire codebases, legal documents, or long research papers in a single call. However, be aware of the context cost trap: a single call with a 2M token context at $1.25/M input costs $2.50 per call before any output. For applications that routinely use large contexts, Gemini 2.5 Pro bills can scale faster than expected. Google also offers batch processing discounts and a free tier with rate-limited access for prototyping.
DeepSeek V3.2 API: Why It’s the Cheapest LLM at $0.14/M
DeepSeek V3.2 is the most disruptive LLM of 2026, priced at $0.14/M input and $0.28/M output — roughly 4–10x cheaper than comparable proprietary models on a per-token basis. DeepSeek is a Chinese open-source model that drove OpenAI, Anthropic, and Google to slash prices by 60–80% after its V3 release in late 2024 proved that frontier-quality reasoning didn’t require frontier-level infrastructure costs. V3.2 improved multilingual performance, code generation accuracy, and instruction-following fidelity to the point where independent benchmarks place it within 5–10% of GPT-5 base on most standard tasks. The tradeoffs are real: 64K context window (vs 128K–2M for competitors), higher latency on complex tasks, and data residency concerns for regulated industries since inference runs through DeepSeek’s Chinese infrastructure. For many high-volume, cost-sensitive workloads — content generation, bulk classification, translation pipelines — the 10x cost advantage outweighs these limitations. Teams building on DeepSeek should evaluate data sensitivity carefully and consider self-hosting the open-source weights via providers like Together AI or Fireworks if data governance is a concern.
Head-to-Head Comparison: All Major LLMs
Selecting the right LLM requires comparing quality, cost, and context simultaneously. The table below uses a normalized quality score based on aggregate benchmark performance (MMLU, HumanEval, MATH) weighted equally, indexed to GPT-5 base = 100.
| Model | Quality Index | Input ($/1M) | Output ($/1M) | Context | Best For |
|---|---|---|---|---|---|
| GPT-5.5 | 115 | $5.00 | $30.00 | 128K | Flagship reasoning, multimodal |
| Claude Opus 4.7 | 112 | $5.00 | $25.00 | 200K | Agentic workflows, hard reasoning |
| Gemini 2.5 Pro | 108 | $1.25 | $10.00 | 2M | Long-context, document analysis |
| GPT-5 | 100 | $0.625 | $5.00 | 128K | General-purpose, high-volume |
| Claude Sonnet 4.6 | 98 | $3.00 | $15.00 | 200K | Coding, instruction-following |
| DeepSeek V3.2 | 93 | $0.14 | $0.28 | 64K | Bulk tasks, cost-sensitive |
| Claude Haiku 4.5 | 85 | $1.00 | $5.00 | 200K | Classification, summarization |
| Gemini 2.5 Flash | 84 | $0.15 | $0.60 | 1M | Fast, cheap, structured output |
| Gemini 2.5 Flash-Lite | 76 | $0.10 | $0.40 | 1M | Simple tasks, translation |
Cost-per-quality leader: GPT-5 base offers the best quality-per-dollar in the mid tier. DeepSeek V3.2 wins on absolute cost. Gemini 2.5 Pro wins on context-per-dollar.
Hidden Costs That Double Your LLM Bill
The advertised per-token price is only part of the story — production LLM API bills routinely run 1.5–2x the base token calculation once you account for hidden cost multipliers. Output tokens typically cost 2–5x more than input tokens due to autoregressive compute requirements: at GPT-5.5’s $5/$30 ratio, every 1,000 output tokens costs 6x more than 1,000 input tokens. In applications with high output-to-input ratios (code generation, long-form writing), this single factor can double your effective cost. Context window overhead is the second trap: long system prompts, retrieved document chunks, and conversation history all count as input tokens on every request. A 10,000-token system prompt sent 100 times per day costs $0.625 in daily input overhead at GPT-5 pricing — $19/month per user before any actual task tokens. Rate limit architecture also adds cost: burst traffic above rate limits forces retries or queue management infrastructure, adding engineering overhead. Finally, failed requests still incur partial billing in some tiers. The practical rule: budget 1.7x your estimated base token cost for production systems.
How to Calculate Your Real LLM Cost
Your actual cost per request = (system prompt tokens × input rate) + (user input tokens × input rate) + (expected output tokens × output rate). For a RAG application with a 5,000-token system prompt, 500-token query, 3,000-token retrieved context, and 800-token response using Claude Sonnet 4.6: input = (5,000 + 500 + 3,000) × $0.000003 = $0.0255; output = 800 × $0.000015 = $0.012. Total per request: $0.0375. At 10,000 daily requests, that’s $375/day — $11,250/month — before caching. Apply prompt caching to the 5,000-token system prompt (90% discount on cache hits at ~80% cache hit rate) and the effective daily input cost drops from ~$300 to ~$60, reducing the monthly bill to roughly $2,100. This is why caching ROI analysis should precede any model-switching decision — the provider you’re already on with caching often beats a cheaper model without it.
Output Ratio Trap by Use Case
| Use Case | Typical Input:Output Ratio | Cost Impact |
|---|---|---|
| Summarization | 10:1 | Low — output is small |
| Code generation | 1:3 | High — output is verbose |
| Classification | 50:1 | Minimal |
| Long-form writing | 1:5 | Very high |
| RAG QA | 5:1 | Moderate |
For code generation at GPT-5 base pricing, every 1,000 output tokens costs $0.005 — 8x more than the equivalent input tokens. A coding assistant generating 500-token completions 20,000 times/day generates $50/day in output costs alone.
Cost Optimization Strategies: Batch APIs, Prompt Caching, Model Routing
Three strategies can reduce your LLM API spend by 60–90% without degrading output quality. First, batch processing APIs: all major providers now offer 50% discounts for async workloads processed within 24 hours. If your application has offline pipelines — daily report generation, bulk content processing, nightly data enrichment — switching to batch APIs halves your token cost immediately with zero code changes beyond the API endpoint. Second, prompt caching: Anthropic and OpenAI both offer prompt caching that reduces cache-hit input tokens by up to 90%. For applications with long, stable system prompts or frequently repeated document chunks, this is the single highest-ROI optimization. A 10,000-token cached system prompt that previously cost $0.03/request at Claude Sonnet pricing drops to $0.003/request on cache hits. Third, model routing: use cheaper models for simple subtasks and expensive models only for complex reasoning. A classification step using Gemini Flash-Lite at $0.10/M followed by a generation step using Claude Sonnet only for complex cases can reduce average cost by 40–70% compared to routing everything through a single premium model. Implement routing based on input complexity signals: length, vocabulary, query type.
Batch API Discount Summary
| Provider | Batch Discount | Turnaround | Endpoint |
|---|---|---|---|
| OpenAI | 50% off | 24 hours | /v1/batches |
| Anthropic | 50% off | 24 hours | Message Batches API |
| Variable | 24–48 hours | Batch prediction API | |
| DeepSeek | No batch API | — | — |
Which LLM API Should You Choose? Use-Case Decision Guide
Choosing the right LLM API depends on your specific use case, volume, latency requirements, and data sensitivity constraints — not just the headline price. For high-volume classification and structured extraction at scale (10M+ tokens/day), DeepSeek V3.2 or Gemini 2.5 Flash-Lite wins on cost: sub-$0.15/M input with acceptable quality for deterministic tasks. For coding assistants and developer tools, Claude Sonnet 4.6 or GPT-5 base are the default choices — both excel at code generation and instruction-following with strong benchmark performance. For long-document analysis — legal review, codebase comprehension, financial report parsing — Gemini 2.5 Pro’s 2M context window is the decisive factor, avoiding the chunking overhead that 128K models require. For agentic workflows requiring complex multi-step reasoning and tool use, Claude Opus 4.7 leads on reliability and instruction adherence in extended chains. For startups with tight budgets prototyping production features, GPT-5 base at $0.625/M input provides frontier quality at a price that enables iteration without budget anxiety. For regulated industries with data residency requirements, avoid DeepSeek and evaluate Anthropic or OpenAI’s enterprise tiers with DPA agreements.
Quick Decision Matrix
- < $1/M and speed: Gemini 2.5 Flash-Lite or DeepSeek V3.2
- Coding and agents: Claude Sonnet 4.6 or GPT-5 base
- Long documents: Gemini 2.5 Pro (2M context)
- Best reasoning, no cost limit: GPT-5.5 or Claude Opus 4.7
- Enterprise compliance: Anthropic or OpenAI enterprise tiers
One pattern that works well in production: start every new application with GPT-5 base or Claude Sonnet 4.6 to establish quality baselines, then systematically replace subtasks with cheaper models as you gain confidence in output quality for each specific task type. Most teams find that 60–70% of their LLM calls can be downgraded to cheaper models without user-visible quality degradation — that’s where the cost savings actually live, not in negotiating enterprise discounts.
LLM API Price Trends: 80% Drop and What Comes Next
LLM API prices have dropped approximately 80% from 2024 to 2026 across all major providers — a compression rate driven primarily by three forces: DeepSeek’s open-source disruption proving efficient training at lower cost, hardware improvements in H100/H200 clusters, and competitive pressure between OpenAI, Anthropic, and Google forcing regular price cuts. The cost of LLM inference dropped by a factor of roughly 1,000 in just three years — faster than any comparable infrastructure category including cloud compute and storage. The pattern that’s emerging in 2026: commodity tasks (classification, translation, summarization) are approaching near-zero pricing as open-weight models commoditize the capability. The remaining price premium is concentrated in frontier reasoning, multimodal capability, and long-context handling. Looking forward, the next wave of price reductions will likely come from speculative decoding, model distillation, and hardware-specific optimizations (TPU v6, MI300X adoption). For teams building on LLM APIs today, the rational strategy is to architect for model interchangeability — abstract provider-specific SDKs behind a routing layer so you can switch as prices shift — and lock in provider credits during promotional periods to hedge against any short-term price reversals.
FAQ
These are the most common questions developers ask when evaluating LLM APIs for production use in 2026. The short answers: DeepSeek V3.2 is the cheapest at $0.14/M input tokens, GPT-5 base offers the best frontier quality-per-dollar at $0.625/M input, and Gemini 2.5 Pro wins for long-document workloads with its 2M context window at $1.25/M. Batch processing discounts (50% across OpenAI and Anthropic) and prompt caching (up to 90% savings on repeated inputs) are the two highest-ROI optimizations available regardless of provider. Real production bills run approximately 1.7x the base token estimate once output ratios, context overhead, and retry costs are accounted for. Model routing — using cheap models for simple subtasks and expensive models only when necessary — typically reduces average effective cost by 40–70% compared to a single-model architecture.
What is the cheapest LLM API in 2026?
DeepSeek V3.2 at $0.14/M input tokens is the cheapest major LLM API as of April 2026. Gemini 2.5 Flash-Lite at $0.10/M is slightly cheaper but is a smaller, lower-capability model. For proprietary models with strong quality, GPT-5 base at $0.625/M input is the most affordable frontier model.
How much does GPT-5 cost per 1,000 tokens?
GPT-5 base costs $0.000625 per 1,000 input tokens and $0.005 per 1,000 output tokens. GPT-5.5 costs $0.005 per 1,000 input and $0.030 per 1,000 output. All variants include 50% batch API discounts for async workloads.
Is Claude more expensive than GPT in 2026?
It depends on the tier. Claude Sonnet 4.6 at $3/$15 is more expensive than GPT-5 base at $0.625/$5 on input but comparable on output quality per dollar. Claude Haiku 4.5 at $1/$5 is cheaper than GPT-5 mid-tier variants. Claude Opus 4.7 at $5/$25 is slightly cheaper than GPT-5.5 at $5/$30 on output.
What is prompt caching and how much can it save?
Prompt caching stores repeated input segments (system prompts, document chunks) server-side so they don’t need to be reprocessed on each request. Anthropic and OpenAI both offer up to 90% savings on cache-hit tokens. For applications with long system prompts sent on every request, caching is typically the single highest-ROI cost optimization available.
Should I use DeepSeek for production workloads?
DeepSeek V3.2 is production-ready for many use cases — bulk content generation, translation, classification — but has limitations: 64K context window, higher latency on complex tasks, and inference runs through Chinese infrastructure. For workloads with data residency requirements or sensitive data, evaluate self-hosting DeepSeek’s open-source weights via providers like Together AI or Fireworks instead of using the hosted API directly.
