<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Openai on RockB</title><link>https://baeseokjae.github.io/tags/openai/</link><description>Recent content in Openai on RockB</description><image><title>RockB</title><url>https://baeseokjae.github.io/images/og-default.png</url><link>https://baeseokjae.github.io/images/og-default.png</link></image><generator>Hugo</generator><language>en-us</language><lastBuildDate>Tue, 21 Apr 2026 01:02:58 +0000</lastBuildDate><atom:link href="https://baeseokjae.github.io/tags/openai/index.xml" rel="self" type="application/rss+xml"/><item><title>LLM Prompt Caching Guide 2026: Cut API Costs 70% with Anthropic and OpenAI</title><link>https://baeseokjae.github.io/posts/llm-prompt-caching-guide-2026/</link><pubDate>Tue, 21 Apr 2026 01:02:58 +0000</pubDate><guid>https://baeseokjae.github.io/posts/llm-prompt-caching-guide-2026/</guid><description>LLM prompt caching guide 2026: Anthropic, OpenAI, Gemini code examples, cost calculators, anti-patterns, and production monitoring tips.</description><content:encoded><![CDATA[<p>Prompt caching is the single highest-ROI optimization available for production LLM applications. If you run 10,000 requests per day with an 8K-token cached system prompt on Anthropic Claude, you save roughly $5,700/month — with a few lines of code change. OpenAI&rsquo;s automatic caching requires zero code changes and gives you a 50% discount on repeated input tokens. Anthropic&rsquo;s explicit caching offers up to 90% savings. This guide covers both, plus Gemini, with production code examples, real cost numbers, and the anti-patterns that silently destroy your cache hit rate.</p>
<h2 id="how-prompt-caching-works-kv-cache-prefix-matching-and-why-order-matters">How Prompt Caching Works: KV Cache, Prefix Matching, and Why Order Matters</h2>
<p>Prompt caching works by storing the key-value (KV) computation for a prefix of your prompt in GPU memory, then reusing those stored activations for subsequent requests that share the same prefix. When your request arrives, the provider checks whether the incoming prompt&rsquo;s beginning matches a cached prefix. If it does — a cache hit — the model skips recomputing that prefix and starts generating immediately. A Hugging Face technical analysis measured roughly a 5.21x speedup on T4 GPUs from KV cache reuse alone. The cost reduction follows the same logic: you pay a lower rate for cached input tokens because the provider doesn&rsquo;t need to run full inference on that portion of the prompt.</p>
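<p>As a toy illustration of this prefix rule (a sketch only, not any provider&rsquo;s actual implementation), a cache hit extends exactly as far as the first token that differs:</p>

```python
# Toy model of prefix matching: the reusable portion of the cache is the
# longest exact token-for-token match starting at position zero.
def shared_prefix_len(cached_tokens, prompt_tokens):
    n = 0
    for a, b in zip(cached_tokens, prompt_tokens):
        if a != b:
            break  # first mismatch ends the reusable prefix
        n += 1
    return n

cached = ["system:", "You", "are", "an", "assistant.", "docs:", "[ref]"]
same_prefix = cached + ["Q:", "hello"]           # full 7-token cache hit
edited_early = ["system:", "Now:"] + cached[1:]  # one token inserted near the top
```

<p>One inserted token near the top of the prompt shrinks the reusable prefix from the full prompt to almost nothing — exactly the failure mode the anti-patterns section below catalogs.</p>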
<p><strong>Why order matters critically:</strong> Prefix matching is exact and sequential. If your prompt reads <code>system → context → user query</code>, the cache key covers everything from the start up to your designated breakpoint. Change anything before the breakpoint — even a single character — and the entire cached prefix is invalidated. This means timestamps, session IDs, or user-specific data embedded early in your prompt will kill your cache hit rate entirely. The universal rule: place static content first, dynamic content last. Tool definitions → system instructions → document context → few-shot examples → current conversation history → user query. This ordering directly determines your API bill.</p>
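<p>The ordering rule can be sketched as a small assembly helper (the helper name and message shape are illustrative, not tied to any particular SDK):</p>

```python
# Sketch of the static-first ordering rule: content that never changes
# goes before anything that varies per request.
def build_messages(static_instructions, document_context, history, user_query):
    return [
        # Static block first: identical bytes on every request, so it caches.
        {"role": "system", "content": static_instructions + "\n\n" + document_context},
        # Conversation history grows append-only, preserving the shared prefix.
        *history,
        # The most volatile content goes last.
        {"role": "user", "content": user_query},
    ]

msgs = build_messages("You are a helpful assistant.", "[reference docs]", [], "What is caching?")
```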
<p>Minimum token requirements vary by provider: Anthropic requires at least 1,024 tokens in the cached prefix; OpenAI caches in 128-token increments with a 1,024-token minimum. Short prompts below these thresholds simply don&rsquo;t qualify for caching and should be excluded from your optimization planning.</p>
<h2 id="provider-comparison-openai-vs-anthropic-vs-gemini">Provider Comparison: OpenAI vs Anthropic vs Gemini</h2>
<p>Prompt caching is now supported by all three major LLM providers — OpenAI, Anthropic, and Google Gemini — but they implement it in fundamentally different ways with meaningfully different economics. OpenAI&rsquo;s caching is fully automatic: you write no special code, the API detects repeated prefixes, and you see a 50% discount on cached tokens with no TTL configuration available. Anthropic gives you the highest savings rate at 90% but requires explicit <code>cache_control</code> markers (simplified significantly by the February 2026 automatic caching update). Gemini sits between the two, offering implicit automatic caching for Gemini 2.5 models and named cache objects for explicit control with configurable TTL. Choosing between providers comes down to your optimization priorities: zero-friction savings (OpenAI), maximum cost reduction with fine-grained control (Anthropic), or configurable persistence for document-heavy workloads (Gemini). Most teams using Anthropic as their primary provider see the February 2026 changes as a reason to enable caching on previously uncached workflows — the implementation barrier has dropped significantly.</p>
<table>
  <thead>
      <tr>
          <th>Feature</th>
          <th>OpenAI</th>
          <th>Anthropic</th>
          <th>Gemini</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Caching type</td>
          <td>Automatic</td>
          <td>Automatic + Explicit</td>
          <td>Implicit + Explicit</td>
      </tr>
      <tr>
          <td>Cost savings</td>
          <td>50% on input</td>
          <td>90% on input</td>
          <td>~90% on input</td>
      </tr>
      <tr>
          <td>TTL</td>
          <td>5–10 min</td>
          <td>5 min or 1 hour</td>
          <td>Configurable</td>
      </tr>
      <tr>
          <td>Minimum tokens</td>
          <td>1,024 (128-token increments)</td>
          <td>1,024</td>
          <td>Varies</td>
      </tr>
      <tr>
          <td>Code changes required</td>
          <td>None</td>
          <td>Minimal (cache_control)</td>
          <td>Named cache objects</td>
      </tr>
      <tr>
          <td>Control granularity</td>
          <td>None (auto)</td>
          <td>Up to 4 breakpoints</td>
          <td>Named cache objects</td>
      </tr>
      <tr>
          <td>2026 update</td>
          <td>GPT-5.1: 24h retention</td>
          <td>Feb 2026: auto caching</td>
          <td>Gemini 2.5 implicit caching</td>
      </tr>
  </tbody>
</table>
<h2 id="openai-prompt-caching-automatic-zero-config">OpenAI Prompt Caching: Automatic, Zero-Config</h2>
<p>OpenAI prompt caching is automatic and requires zero code changes — the API detects repeated input prefixes and applies a 50% discount on cached input tokens automatically. You don&rsquo;t set any flags; you just observe the discount in your usage dashboard and billing. The GPT-5.1 series introduced 24-hour cache retention, making it viable for system prompts used across long workdays or batch pipelines that span multiple processing windows. Cache hits appear in the <code>usage</code> object of the API response as <code>cached_tokens</code>, so you can monitor performance without any instrumentation changes.</p>
<p>OpenAI caches in 128-token increments, meaning your cached prefix must be at least 1,024 tokens and matches extend in 128-token steps. A 1,100-token prefix gets cached at 1,024 tokens, with the remaining 76 tokens billed at full price. This granularity matters for borderline cases but rarely affects the economics of real system prompts, which typically run 2,000–10,000 tokens.</p>
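<p>The increment rule is easy to express as arithmetic (a sketch of the billing behavior described above, not an official API call):</p>

```python
# 128-token increment rule: nothing below 1,024 tokens is cached, and above
# that the cached portion rounds down to the nearest 128-token boundary.
def openai_cacheable_tokens(prefix_tokens):
    if prefix_tokens < 1024:
        return 0  # below the minimum, nothing is cached
    return (prefix_tokens // 128) * 128
```

<p>For the 1,100-token example, the helper returns 1,024 — the remaining 76 tokens bill at the standard rate.</p>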
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># No special configuration needed — caching is automatic</span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>chat<span style="color:#f92672">.</span>completions<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>    messages<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>        {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;system&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#75715e"># Static system prompt (1024+ tokens for caching eligibility)</span>
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">&#34;You are an expert software engineer specializing in Python...&#34;</span>
</span></span><span style="display:flex;"><span>            <span style="color:#75715e"># ... (long static content)</span>
</span></span><span style="display:flex;"><span>        },
</span></span><span style="display:flex;"><span>        {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;content&#34;</span>: user_query  <span style="color:#75715e"># Dynamic — place last</span>
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>    ]
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Check cache hit in response</span>
</span></span><span style="display:flex;"><span>cached_tokens <span style="color:#f92672">=</span> response<span style="color:#f92672">.</span>usage<span style="color:#f92672">.</span>prompt_tokens_details<span style="color:#f92672">.</span>cached_tokens
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Cached tokens: </span><span style="color:#e6db74">{</span>cached_tokens<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span></code></pre></div><p><strong>The tradeoff vs Anthropic:</strong> OpenAI&rsquo;s automatic approach is the right choice for teams that want savings with zero engineering overhead. You get 50% off repeated input tokens with no prompt restructuring. The downside is loss of control — you can&rsquo;t force specific breakpoints, can&rsquo;t choose TTL, and can&rsquo;t target multiple cache boundaries within a single prompt. For high-volume applications where every dollar matters, Anthropic&rsquo;s 90% savings on cache reads typically justifies the additional implementation work.</p>
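<p>A rough way to quantify that tradeoff is the blended input price per million tokens at a given hit rate, using the rate multipliers quoted in this guide (assumptions for illustration: OpenAI bills misses at 1.0x and hits at 0.5x; Anthropic bills misses as 1.25x cache writes at the 5-minute TTL and hits as 0.1x cache reads):</p>

```python
# Blended input price per 1M tokens at a given cache hit rate.
def blended_price(base_per_1m, hit_rate, miss_mult, hit_mult):
    return base_per_1m * ((1 - hit_rate) * miss_mult + hit_rate * hit_mult)

# $3.00/1M base input price, 90% hit rate
openai_cost = blended_price(3.00, 0.9, miss_mult=1.0, hit_mult=0.5)
anthropic_cost = blended_price(3.00, 0.9, miss_mult=1.25, hit_mult=0.1)
```

<p>At a 90% hit rate the blended price works out to $1.65/1M on OpenAI versus about $0.645/1M on Anthropic — a roughly 2.5x gap that compounds quickly at volume.</p>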
<h2 id="anthropic-prompt-caching-90-savings-with-explicit-breakpoints">Anthropic Prompt Caching: 90% Savings with Explicit Breakpoints</h2>
<p>Anthropic prompt caching delivers up to 90% cost reduction on cached input tokens, the highest discount available from any major provider in 2026. Cache reads for Claude Sonnet 4.5 cost $0.30/1M tokens versus $3.00/1M for standard input — exactly a 10x reduction. The February 2026 automatic caching update simplified implementation significantly: a single top-level <code>cache_control</code> marker now causes the API to auto-place the breakpoint on the last cacheable block, eliminating the need to annotate every section individually. For most use cases, this single-marker approach is sufficient.</p>
<p>For fine-grained control, Anthropic supports up to 4 explicit cache breakpoints per prompt. Automatic caching consumes 1 of those 4 slots — adding automatic caching plus 4 explicit breakpoints triggers a 400 error. The cache invalidation hierarchy is tools → system → messages: changing anything earlier in this chain invalidates caches for everything that follows. Place your least-changing content at the top (tool definitions), most-changing content at the bottom (current user message).</p>
<p><strong>5-minute vs 1-hour TTL:</strong> Choose based on request cadence, not preference. If requests arrive more than every 5 minutes on average, 1-hour TTL pays for itself immediately — you pay 2x base input price on writes instead of 1.25x, but cache reads stay at 0.1x for both. The 1-hour write premium recovers after just 2 cache hits. If your traffic is bursty with long idle gaps, 5-minute TTL may be more economical. One team learned this the hard way: a library update silently changed their TTL from 1-hour to 5-minutes, causing a $13.86/day bill increase before anyone noticed.</p>
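<p>The break-even claim checks out under a small model. Assume n requests arrive within one hour, each after the 5-minute cache has already expired (so the 5-minute TTL re-writes every time), with costs expressed as multiples of the base input price for the cached prefix:</p>

```python
# 5-minute TTL under sparse traffic: every request pays the 1.25x write premium.
def cost_5min_ttl(n_requests, base=1.0):
    return n_requests * 1.25 * base

# 1-hour TTL: one 2x write, then 0.1x reads for the rest of the hour.
def cost_1hour_ttl(n_requests, base=1.0):
    return 2.0 * base + (n_requests - 1) * 0.1 * base
```

<p>At n = 1 the 1-hour cache costs more (2.0x vs 1.25x); by n = 2 it is already cheaper (2.1x vs 2.5x), matching the two-hit recovery stated above.</p>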
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> anthropic
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> anthropic<span style="color:#f92672">.</span>Anthropic()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># February 2026 approach: single cache_control at top level (auto places breakpoint)</span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>messages<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;claude-sonnet-4-5&#34;</span>,
</span></span><span style="display:flex;"><span>    max_tokens<span style="color:#f92672">=</span><span style="color:#ae81ff">1024</span>,
</span></span><span style="display:flex;"><span>    system<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>        {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;text&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;text&#34;</span>: <span style="color:#e6db74">&#34;You are an expert software engineer...&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#75715e"># This triggers automatic cache placement on the last cacheable block</span>
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;cache_control&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;ephemeral&#34;</span>}
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>    ],
</span></span><span style="display:flex;"><span>    messages<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>        {<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: user_query}
</span></span><span style="display:flex;"><span>    ]
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Monitor cache usage</span>
</span></span><span style="display:flex;"><span>usage <span style="color:#f92672">=</span> response<span style="color:#f92672">.</span>usage
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Input tokens: </span><span style="color:#e6db74">{</span>usage<span style="color:#f92672">.</span>input_tokens<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Cache creation tokens: </span><span style="color:#e6db74">{</span>usage<span style="color:#f92672">.</span>cache_creation_input_tokens<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Cache read tokens: </span><span style="color:#e6db74">{</span>usage<span style="color:#f92672">.</span>cache_read_input_tokens<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span></code></pre></div><p><strong>Multi-turn conversation caching:</strong> In multi-turn chat, Anthropic&rsquo;s automatic caching advances the cache breakpoint forward as the conversation grows — without requiring you to update <code>cache_control</code> markers manually. The 20-block lookback window limits how far back the provider searches for matching prefixes. Keep your conversation history compaction logic in sync with this window to avoid unnecessary cache misses in very long conversations.</p>
<h3 id="explicit-multi-breakpoint-example">Explicit Multi-Breakpoint Example</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># For fine-grained control: multiple explicit breakpoints</span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>messages<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;claude-sonnet-4-5&#34;</span>,
</span></span><span style="display:flex;"><span>    max_tokens<span style="color:#f92672">=</span><span style="color:#ae81ff">1024</span>,
</span></span><span style="display:flex;"><span>    system<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>        {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;text&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;text&#34;</span>: <span style="color:#e6db74">&#34;You are an expert software engineer...&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;cache_control&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;ephemeral&#34;</span>}  <span style="color:#75715e"># Breakpoint 1: system prompt</span>
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>    ],
</span></span><span style="display:flex;"><span>    messages<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>        {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;content&#34;</span>: [
</span></span><span style="display:flex;"><span>                {
</span></span><span style="display:flex;"><span>                    <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;text&#34;</span>,
</span></span><span style="display:flex;"><span>                    <span style="color:#e6db74">&#34;text&#34;</span>: large_document_context,  <span style="color:#75715e"># Your reference docs</span>
</span></span><span style="display:flex;"><span>                    <span style="color:#e6db74">&#34;cache_control&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;ephemeral&#34;</span>}  <span style="color:#75715e"># Breakpoint 2: context</span>
</span></span><span style="display:flex;"><span>                },
</span></span><span style="display:flex;"><span>                {
</span></span><span style="display:flex;"><span>                    <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;text&#34;</span>,
</span></span><span style="display:flex;"><span>                    <span style="color:#e6db74">&#34;text&#34;</span>: user_query  <span style="color:#75715e"># Dynamic — no cache_control</span>
</span></span><span style="display:flex;"><span>                }
</span></span><span style="display:flex;"><span>            ]
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>    ]
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><h2 id="gemini-prompt-caching-implicit-caching-and-named-cache-objects">Gemini Prompt Caching: Implicit Caching and Named Cache Objects</h2>
<p>Gemini prompt caching operates through two mechanisms: implicit caching (where the API automatically detects and reuses repeated content) and explicit named cache objects for precise control. Gemini 2.5 expanded implicit caching capabilities, making it the most hands-off option for teams already using Google&rsquo;s infrastructure. Named cache objects persist across requests with configurable TTL, behaving more like a traditional database cache than the prefix-matching approach used by OpenAI and Anthropic. Savings are approximately 90% on cached content, comparable to Anthropic&rsquo;s rates.</p>
<p>The named cache approach works well for RAG pipelines that repeatedly query the same knowledge base — you cache the document corpus once, assign it a cache ID, and reference that ID in subsequent requests rather than retransmitting the full content. This makes Gemini caching particularly well-suited for document Q&amp;A applications where the reference material doesn&rsquo;t change between queries.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> google.generativeai <span style="color:#66d9ef">as</span> genai
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>genai<span style="color:#f92672">.</span>configure(api_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;YOUR_API_KEY&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Create a named cache for long-lived content</span>
</span></span><span style="display:flex;"><span>cache <span style="color:#f92672">=</span> genai<span style="color:#f92672">.</span>caching<span style="color:#f92672">.</span>CachedContent<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gemini-2.5-flash&#34;</span>,
</span></span><span style="display:flex;"><span>    contents<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>        {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;parts&#34;</span>: [{<span style="color:#e6db74">&#34;text&#34;</span>: large_document_context}]
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>    ],
</span></span><span style="display:flex;"><span>    ttl<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;3600s&#34;</span>  <span style="color:#75715e"># 1-hour TTL</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Reference the cache in subsequent requests</span>
</span></span><span style="display:flex;"><span>model <span style="color:#f92672">=</span> genai<span style="color:#f92672">.</span>GenerativeModel<span style="color:#f92672">.</span>from_cached_content(cache)
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> model<span style="color:#f92672">.</span>generate_content(user_query)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Clean up when done</span>
</span></span><span style="display:flex;"><span>cache<span style="color:#f92672">.</span>delete()
</span></span></code></pre></div><h2 id="production-cost-calculator-real-dollar-amounts">Production Cost Calculator: Real Dollar Amounts</h2>
<p>Prompt caching economics depend on three variables: prompt length (in tokens), daily request volume, and cache hit rate. The formula is simple — compare (cache write cost on misses + cache read cost on hits) against (full input cost for every request). In practice, applications with 2,000-token system prompts running 100 requests/day save around $12/month on Anthropic; growth-stage applications with 8,000-token prefixes at 10,000 requests/day save over $5,600/month. At enterprise scale — 100,000 requests/day with a 10,000-token cached prefix — annual savings approach $800K on Anthropic. OpenAI&rsquo;s 50% discount produces roughly half these savings for the same workload. The numbers below use Anthropic Claude Sonnet 4.5 pricing ($3.00/1M standard input, $0.30/1M cache read, $3.75/1M cache write) with a representative 85-90% cache hit rate, which healthy production systems consistently achieve.</p>
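<p>The formula above as a runnable sketch (defaults are the Sonnet 4.5 list prices quoted in this guide; cache misses are billed as cache writes, hits as cache reads):</p>

```python
# Monthly savings from caching a fixed prompt prefix, in dollars.
def monthly_savings(prefix_tokens, requests_per_day, hit_rate,
                    input_per_1m=3.00, read_per_1m=0.30,
                    write_per_1m=3.75, days=30):
    daily_tokens = prefix_tokens * requests_per_day
    baseline = daily_tokens * input_per_1m / 1e6                 # no caching
    cached = (daily_tokens * hit_rate * read_per_1m / 1e6        # cache reads
              + daily_tokens * (1 - hit_rate) * write_per_1m / 1e6)  # writes on misses
    return (baseline - cached) * days

growth = monthly_savings(8_000, 10_000, 0.90)
```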
<h3 id="hobby-2k-system-prompt-100-requestsday">Hobby: 2K System Prompt, 100 Requests/Day</h3>
<p>Without caching: 2,000 tokens × 100 requests = 200K tokens/day × $3.00/1M = $0.60/day ($18/month)</p>
<p>With caching (idealized: traffic frequent enough to keep the prefix warm): 1 cache write (2,000 × $3.75/1M = $0.0075) + 99 cache reads (2,000 × 99 × $0.30/1M ≈ $0.059) + user query tokens ≈ $0.066/day. In practice, 100 requests spread across a day repeatedly outlast the 5-minute TTL, forcing extra cache writes — which is why realistic savings land closer to 70% than 90%.</p>
<p><strong>Monthly savings: ~$12.60/month</strong> (70% reduction)</p>
<h3 id="growth-8k-cached-prefix-10k-requestsday">Growth: 8K Cached Prefix, 10K Requests/Day</h3>
<p>Without caching: 8,000 × 10,000 = 80M tokens/day × $3.00/1M = $240/day ($7,200/month)</p>
<p>With caching (90% hit rate): 72M cached tokens/day × $0.30/1M ≈ $21.60 in cache reads + 8M missed tokens/day × $3.75/1M = $30.00 in cache writes ≈ $51.60/day vs $240/day baseline</p>
<p><strong>Monthly savings: ~$5,650/month</strong></p>
<h3 id="enterprise-10k-cached-prefix-100k-requestsday">Enterprise: 10K Cached Prefix, 100K Requests/Day</h3>
<p>Without caching: 10,000 × 100,000 = 1B tokens/day × $3.00/1M = $3,000/day</p>
<p>With caching (85% hit rate): 850M cached tokens/day × $0.30/1M = $255 in cache reads + 150M missed tokens/day × $3.75/1M = $562.50 in cache writes ≈ $817.50/day</p>
<p><strong>Monthly savings: ~$65,500/month (~$786K/year)</strong></p>
<p>These numbers explain why prompt caching is treated as a P0 optimization by any team running LLMs at scale.</p>
<h2 id="anti-patterns-that-kill-your-cache-hit-rate">Anti-Patterns That Kill Your Cache Hit Rate</h2>
<p>Cache anti-patterns are the silent killers of LLM API budgets. A well-designed prompt structure can achieve 80-90% cache hit rates in production; the same application with anti-patterns typically sees 10-30% — meaning you&rsquo;re paying near-full price and getting none of the latency benefits. Below are the most common patterns to avoid, each with a concrete fix.</p>
<p><strong>Anti-pattern 1: Timestamps or session IDs in the system prompt</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># WRONG — kills cache every request</span>
</span></span><span style="display:flex;"><span>system <span style="color:#f92672">=</span> <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;You are an AI assistant. Current time: </span><span style="color:#e6db74">{</span>datetime<span style="color:#f92672">.</span>now()<span style="color:#e6db74">}</span><span style="color:#e6db74">. Session: </span><span style="color:#e6db74">{</span>session_id<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># RIGHT — put dynamic data elsewhere</span>
</span></span><span style="display:flex;"><span>system <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;You are an AI assistant.&#34;</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Inject time/session into the user message if needed</span>
</span></span></code></pre></div><p><strong>Anti-pattern 2: User content before static content</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># WRONG — user name appears before the cacheable instructions</span>
</span></span><span style="display:flex;"><span>messages <span style="color:#f92672">=</span> [
</span></span><span style="display:flex;"><span>    {<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Hi, I&#39;m </span><span style="color:#e6db74">{</span>user_name<span style="color:#e6db74">}</span><span style="color:#e6db74">. </span><span style="color:#e6db74">{</span>user_query<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>}
</span></span><span style="display:flex;"><span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># RIGHT — static instructions in system, user identity in messages</span>
</span></span><span style="display:flex;"><span>system <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;You are an expert assistant with access to the following knowledge base: [static docs]&#34;</span>
</span></span><span style="display:flex;"><span>messages <span style="color:#f92672">=</span> [{<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: user_query}]
</span></span></code></pre></div><p><strong>Anti-pattern 3: Rotating few-shot examples</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># WRONG — shuffled examples invalidate cache every time</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> random
</span></span><span style="display:flex;"><span>examples <span style="color:#f92672">=</span> random<span style="color:#f92672">.</span>sample(all_examples, <span style="color:#ae81ff">5</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># RIGHT — fixed, ordered examples in the system prompt; per-request examples belong in the user message</span>
</span></span><span style="display:flex;"><span>fixed_examples <span style="color:#f92672">=</span> all_examples[:<span style="color:#ae81ff">5</span>]  <span style="color:#75715e"># Static, always the same</span>
</span></span></code></pre></div><p><strong>Anti-pattern 4: Dynamic tool definitions</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># WRONG — enabling different tools per user breaks prefix matching</span>
</span></span><span style="display:flex;"><span>tools <span style="color:#f92672">=</span> get_user_tools(user_id)  <span style="color:#75715e"># Different per user</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># RIGHT — use a fixed superset of tools, filter in application logic</span>
</span></span><span style="display:flex;"><span>tools <span style="color:#f92672">=</span> ALL_TOOLS  <span style="color:#75715e"># Identical for every request</span>
</span></span></code></pre></div><p><strong>Anti-pattern 5: Prompts below minimum threshold</strong></p>
<p>Short prompts (&lt; 1,024 tokens) don&rsquo;t qualify for caching on any major provider. If your system prompt is 800 tokens, add structured documentation, examples, or reasoning guidelines to push above the threshold — the cost of additional tokens is trivial compared to the caching savings you unlock.</p>
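<p>A quick pre-flight check makes this concrete. The sketch below is illustrative (the 4-characters-per-token heuristic and the <code>worth_caching</code> helper are ours, not any SDK&rsquo;s); use your provider&rsquo;s real tokenizer for exact counts:</p>

```python
# Rough pre-flight check: is this system prompt long enough to cache?
# Uses the common ~4 chars/token heuristic; swap in a real tokenizer
# for exact counts before making billing decisions.
CACHE_MIN_TOKENS = 1024  # minimum cacheable prefix on most major providers

def estimated_tokens(text: str) -> int:
    # Very rough heuristic: ~4 characters per token of English text
    return len(text) // 4

def worth_caching(system_prompt: str) -> bool:
    return estimated_tokens(system_prompt) >= CACHE_MIN_TOKENS

short_prompt = "You are a helpful assistant." * 10  # ~70 tokens
long_prompt = "x" * 8000                            # ~2,000 tokens

print(worth_caching(short_prompt))  # False — pad with docs/examples first
print(worth_caching(long_prompt))   # True
```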
<h2 id="monitoring-cache-hit-rates-in-production">Monitoring Cache Hit Rates in Production</h2>
<p>Production systems should target 70-90% cache hit rates. Rates below 50% indicate a structural problem with your prompt ordering — revisit the anti-patterns section. Each provider exposes cache metrics differently, but all include the data in API responses.</p>
<p><strong>Anthropic monitoring:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">track_cache_metrics</span>(response, metrics_client):
</span></span><span style="display:flex;"><span>    usage <span style="color:#f92672">=</span> response<span style="color:#f92672">.</span>usage
</span></span><span style="display:flex;"><span>    total_input <span style="color:#f92672">=</span> (usage<span style="color:#f92672">.</span>input_tokens <span style="color:#f92672">+</span> 
</span></span><span style="display:flex;"><span>                   usage<span style="color:#f92672">.</span>cache_creation_input_tokens <span style="color:#f92672">+</span> 
</span></span><span style="display:flex;"><span>                   usage<span style="color:#f92672">.</span>cache_read_input_tokens)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    hit_rate <span style="color:#f92672">=</span> usage<span style="color:#f92672">.</span>cache_read_input_tokens <span style="color:#f92672">/</span> total_input <span style="color:#66d9ef">if</span> total_input <span style="color:#f92672">&gt;</span> <span style="color:#ae81ff">0</span> <span style="color:#66d9ef">else</span> <span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    metrics_client<span style="color:#f92672">.</span>gauge(<span style="color:#e6db74">&#34;llm.cache_hit_rate&#34;</span>, hit_rate)
</span></span><span style="display:flex;"><span>    metrics_client<span style="color:#f92672">.</span>increment(<span style="color:#e6db74">&#34;llm.cache_reads&#34;</span>, usage<span style="color:#f92672">.</span>cache_read_input_tokens)
</span></span><span style="display:flex;"><span>    metrics_client<span style="color:#f92672">.</span>increment(<span style="color:#e6db74">&#34;llm.cache_writes&#34;</span>, usage<span style="color:#f92672">.</span>cache_creation_input_tokens)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> hit_rate <span style="color:#f92672">&lt;</span> <span style="color:#ae81ff">0.5</span>:
</span></span><span style="display:flex;"><span>        alert(<span style="color:#e6db74">&#34;Cache hit rate below 50% — check prompt structure&#34;</span>)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> hit_rate
</span></span></code></pre></div><p><strong>OpenAI monitoring:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">track_openai_cache</span>(response):
</span></span><span style="display:flex;"><span>    details <span style="color:#f92672">=</span> response<span style="color:#f92672">.</span>usage<span style="color:#f92672">.</span>prompt_tokens_details
</span></span><span style="display:flex;"><span>    cached <span style="color:#f92672">=</span> details<span style="color:#f92672">.</span>cached_tokens <span style="color:#66d9ef">if</span> details <span style="color:#66d9ef">else</span> <span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span>    total <span style="color:#f92672">=</span> response<span style="color:#f92672">.</span>usage<span style="color:#f92672">.</span>prompt_tokens
</span></span><span style="display:flex;"><span>    hit_rate <span style="color:#f92672">=</span> cached <span style="color:#f92672">/</span> total <span style="color:#66d9ef">if</span> total <span style="color:#f92672">&gt;</span> <span style="color:#ae81ff">0</span> <span style="color:#66d9ef">else</span> <span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> hit_rate
</span></span></code></pre></div><p>Key metrics to track in production:</p>
<ul>
<li><strong>Cache hit rate</strong> (target: 70%+; alert threshold: 50%)</li>
<li><strong>Cache creation cost</strong> (should be small relative to cache read savings)</li>
<li><strong>Time-to-first-token</strong> (cache hits typically reduce TTFT by 40-80%)</li>
<li><strong>Daily cache savings</strong> (compare cached read cost vs estimated uncached cost)</li>
</ul>
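<p>The daily cache savings metric can be computed directly from the usage counters you are already tracking. The sketch below uses example Sonnet-class prices (uncached input $3/MTok, ~90% cache-read discount, ~25% write premium); substitute your provider&rsquo;s current rates:</p>

```python
# Illustrative daily-savings estimate from aggregated usage counters.
# Prices are example numbers for a Sonnet-class model, per million tokens;
# check the current price sheet before relying on them.
PRICE_INPUT = 3.00        # uncached input
PRICE_CACHE_READ = 0.30   # ~90% discount
PRICE_CACHE_WRITE = 3.75  # ~25% write premium

def daily_cache_savings(cache_read_tokens: int, cache_write_tokens: int) -> float:
    """Savings vs. paying full input price for every one of those tokens."""
    uncached_cost = (cache_read_tokens + cache_write_tokens) * PRICE_INPUT / 1e6
    actual_cost = (cache_read_tokens * PRICE_CACHE_READ
                   + cache_write_tokens * PRICE_CACHE_WRITE) / 1e6
    return uncached_cost - actual_cost

# Example day: 1,000 requests with an 8K-token prefix at a 95% hit rate
reads = 7_600_000   # 1,000 × 8,000 × 0.95
writes = 400_000    # 1,000 × 8,000 × 0.05
print(f"${daily_cache_savings(reads, writes):.2f}/day saved")  # $20.22/day saved
```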
<p>Set up weekly cost attribution reports that separate cached vs uncached spending. This makes optimization work visible to stakeholders and helps justify the engineering investment in prompt structure.</p>
<h2 id="prompt-caching-with-rag-multi-turn-chat-and-agentic-systems">Prompt Caching with RAG, Multi-Turn Chat, and Agentic Systems</h2>
<p>Prompt caching interacts differently with each major LLM application pattern, and the configuration choices that maximize savings in a simple chatbot may perform poorly in an agentic system. RAG pipelines benefit most from caching the retrieval instructions and knowledge base preamble while letting retrieved documents flow through a second breakpoint. Multi-turn chat applications benefit from Anthropic&rsquo;s automatic cache advancement, which moves the cache boundary forward as conversation history grows, with no manual re-marking needed. Agentic systems using tool-calling loops (AutoGen, LangGraph, CrewAI) require careful static/dynamic separation: cache the tool definitions and agent persona, and let tool call results remain uncached.</p>
<p>A note on traffic overlap: the 31% semantic similarity rate observed across production LLM queries (Burnwise 2026 analysis) shows substantial repetition in real workloads, but prefix caches match exact token sequences, so that overlap turns into cache hits only when the repeated content sits in an identical prefix. The static-first prompt ordering this guide recommends is what converts that repetition into savings, even at moderate request volumes. Gemini&rsquo;s named cache objects are uniquely well-suited to document corpora shared across many different query types, making them the preferred choice for multi-tenant RAG deployments where the same document set serves many users.</p>
<h3 id="rag-pipelines">RAG Pipelines</h3>
<p>RAG applications are the ideal use case for prompt caching. The retrieved documents change per query, but your system prompt, retrieval instructions, and output format guidelines are static. Structure your RAG prompt as:</p>
<ol>
<li>System instructions (static, cached)</li>
<li>Retrieved documents (semi-static per document set, explicitly cached with breakpoint)</li>
<li>User question (dynamic, not cached)</li>
</ol>
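<p>The three-layer structure maps onto Anthropic&rsquo;s explicit caching as two breakpoints plus an uncached question. This sketch builds only the request payload (the <code>build_rag_request</code> helper and prompt text are illustrative, not part of the SDK); pass the result to <code>client.messages.create(**kwargs)</code>:</p>

```python
# Sketch of the three-layer RAG prompt for Anthropic explicit caching:
# static instructions and the document set each get a cache_control
# breakpoint; the user question stays uncached.
RAG_INSTRUCTIONS = "You answer questions using only the provided documents."

def build_rag_request(documents: list[str], question: str) -> dict:
    system = [
        # Layer 1: static instructions — cached across every query
        {"type": "text", "text": RAG_INSTRUCTIONS,
         "cache_control": {"type": "ephemeral"}},
        # Layer 2: document set — cached across queries against this corpus
        {"type": "text", "text": "\n\n".join(documents),
         "cache_control": {"type": "ephemeral"}},
    ]
    # Layer 3: the question — dynamic, never cached
    messages = [{"role": "user", "content": question}]
    return {"model": "claude-sonnet-4-5", "max_tokens": 1024,
            "system": system, "messages": messages}

kwargs = build_rag_request(["Doc A text...", "Doc B text..."], "What does Doc A say?")
# response = client.messages.create(**kwargs)
```

Repeated queries against the same corpus hit both breakpoints; switching corpora invalidates only layer 2 while layer 1 stays warm.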
<p>For Gemini, use named cache objects for your document corpus — create the cache once per document set and reference it by ID across all queries against that corpus.</p>
<h3 id="multi-turn-conversations">Multi-Turn Conversations</h3>
<p>Anthropic&rsquo;s automatic cache advancement handles multi-turn chat without manual cache_control updates per message. The breakpoint moves forward automatically as conversation history grows. Watch for the 20-block lookback window — conversations longer than ~20 exchanges may see the oldest context fall outside the cacheable window. Implement a summarization or context compaction step before hitting this limit.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Multi-turn with Anthropic — cache_control only on the system, </span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># automatic caching handles the rest</span>
</span></span><span style="display:flex;"><span>messages <span style="color:#f92672">=</span> conversation_history  <span style="color:#75715e"># Growing list of messages</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>messages<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;claude-sonnet-4-5&#34;</span>,
</span></span><span style="display:flex;"><span>    system<span style="color:#f92672">=</span>[{<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;text&#34;</span>, <span style="color:#e6db74">&#34;text&#34;</span>: system_prompt, 
</span></span><span style="display:flex;"><span>             <span style="color:#e6db74">&#34;cache_control&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;ephemeral&#34;</span>}}],
</span></span><span style="display:flex;"><span>    messages<span style="color:#f92672">=</span>messages,
</span></span><span style="display:flex;"><span>    max_tokens<span style="color:#f92672">=</span><span style="color:#ae81ff">1024</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Append response to conversation_history for next turn</span>
</span></span></code></pre></div><h3 id="agentic-systems">Agentic Systems</h3>
<p>Agentic systems (AutoGen, LangGraph, CrewAI) make many tool calls in a loop, often with overlapping system prompts and tool definitions. Cache your tool registry and agent persona at the top of the prompt, and let the dynamic tool call results flow through the uncached portion. The consistency requirement is strict — if your tool definitions change between agent steps (e.g., tools are conditionally available), you&rsquo;ll get cache misses. Prefer a static superset of tools and handle conditional availability in application logic.</p>
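<p>The static-superset pattern can be enforced at tool-execution time rather than in the request. In this sketch the tool registry, plan names, and <code>execute_tool_call</code> helper are all illustrative; the point is that the <code>tools</code> array sent to the model never changes between requests:</p>

```python
# Static tool superset: the same ALL_TOOLS list goes out on every request
# (keeping the cached prefix identical), and per-user availability is
# enforced when a tool call actually executes. All names are illustrative.
ALL_TOOLS = [
    {"name": "search_docs", "description": "Search the knowledge base",
     "input_schema": {"type": "object", "properties": {}}},
    {"name": "create_ticket", "description": "Open a support ticket",
     "input_schema": {"type": "object", "properties": {}}},
]

USER_PERMISSIONS = {"free": {"search_docs"},
                    "pro": {"search_docs", "create_ticket"}}

def execute_tool_call(tool_name: str, plan: str) -> str:
    # Return a tool-result error to the model instead of shrinking the
    # tool list — shrinking the list would break prefix matching.
    if tool_name not in USER_PERMISSIONS[plan]:
        return f"Error: {tool_name} is not available on the {plan} plan."
    return f"(ran {tool_name})"

print(execute_tool_call("create_ticket", "free"))
print(execute_tool_call("create_ticket", "pro"))
```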
<h2 id="when-prompt-caching-genuinely-doesnt-help">When Prompt Caching Genuinely Doesn&rsquo;t Help</h2>
<p>Prompt caching is not a universal win. Avoid optimizing for it in these scenarios:</p>
<ul>
<li><strong>Short prompts (&lt; 1,024 tokens):</strong> You don&rsquo;t meet the minimum threshold. Engineering time is better spent elsewhere.</li>
<li><strong>Highly unique contexts:</strong> If every request has a completely different long context (e.g., analyzing a unique document per user), you write a cache but never read it — you pay the write premium for nothing.</li>
<li><strong>Low request volume:</strong> At under 50 requests/day, cache writes may cost more than reads save. Run the math with your actual prompt length and request rate.</li>
<li><strong>Frequently changing system prompts:</strong> If your system prompt changes every hour or day (A/B testing, personalization), TTL selection becomes tricky and hit rates drop.</li>
<li><strong>One-off batch jobs:</strong> A batch that runs once and never repeats gets no cache reads. Use Anthropic&rsquo;s Batch API for cost savings instead.</li>
</ul>
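<p>The write-premium math is worth seeing once. With example Sonnet-class prices, a single cache read more than pays back one write; the real low-volume risk is the cache expiring (5-minute default TTL on Anthropic) before any read lands, so you pay write premiums repeatedly for nothing. Prices below are illustrative:</p>

```python
# Back-of-envelope break-even: how many cache reads does one cache write
# need before caching is net positive? Example Sonnet-class prices ($/MTok).
PRICE_INPUT = 3.00
PRICE_CACHE_READ = 0.30
PRICE_CACHE_WRITE = 3.75

def reads_to_break_even() -> float:
    """Reads (per MTok) needed so read savings cover the write premium."""
    write_premium = PRICE_CACHE_WRITE - PRICE_INPUT    # extra cost per write
    saving_per_read = PRICE_INPUT - PRICE_CACHE_READ   # saved per read
    return write_premium / saving_per_read

print(f"{reads_to_break_even():.2f} reads per write")  # 0.28 reads per write
```

Arithmetically, even one hit per write wins; the loss scenario is writes that expire unread.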
<p>The honest assessment: if your system prompt is under 2K tokens and you run under 1,000 requests/day, the savings are real but modest (under $50/month on most providers). At that scale, model selection and prompt length optimization likely offer better ROI than caching architecture.</p>
<h2 id="faq">FAQ</h2>
<p><strong>Q: Does prompt caching work with streaming responses?</strong></p>
<p>Yes. All three providers support prompt caching with streaming. The cache hit check happens before token generation begins, so streaming latency still benefits from the reduced time-to-first-token on cache hits. The usage statistics (including cache read tokens) arrive at the end of the stream: Anthropic reports them in the final <code>message_delta</code> event, and OpenAI includes them in the last chunk when you set <code>stream_options={&#34;include_usage&#34;: true}</code>.</p>
<p><strong>Q: What happens if I exceed Anthropic&rsquo;s 4-breakpoint limit?</strong></p>
<p>The API returns a 400 error. If you&rsquo;re using automatic caching (which consumes one slot), you can add up to 3 explicit breakpoints. If you need more granularity, restructure your prompt to consolidate static sections rather than adding more breakpoints.</p>
<p><strong>Q: Is prompt caching the same as semantic caching?</strong></p>
<p>No. Prompt caching is exact prefix matching at the token level — it requires identical byte sequences to hit. Semantic caching (tools like GPTCache, Redis + embeddings) matches semantically similar queries and returns cached responses. They&rsquo;re complementary: use prompt caching to reduce per-request compute costs, and semantic caching to avoid calling the LLM at all for near-duplicate queries.</p>
<p><strong>Q: Will using prompt caching affect response quality?</strong></p>
<p>No. Cache hits reuse the exact KV states that would have been computed fresh, so the model&rsquo;s output distribution is unchanged: with identical sampling settings, a cached and an uncached request behave the same. The only observable differences are lower latency and cost. There&rsquo;s no quality-cost tradeoff involved.</p>
<p><strong>Q: How do I choose between Anthropic and OpenAI for cost optimization?</strong></p>
<p>Run the math with your actual numbers. OpenAI gives 50% savings with zero engineering work. Anthropic gives 90% savings with minimal implementation effort. At 10,000 requests/day with a 5K-token system prompt, Anthropic saves roughly twice as much per month despite higher base prices, assuming 80%+ cache hit rate. Below about 5,000 requests/day, the difference narrows significantly, and OpenAI&rsquo;s simplicity may win on total cost including engineering time.</p>
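<p>To make that comparison concrete, here is the math as code, using example prices (gpt-4o-class input at $2.50/MTok with a 50% cached discount; Sonnet-class at $3.00/MTok with a 90% discount) and ignoring cache-write premiums for simplicity:</p>

```python
# Illustrative monthly-savings comparison between a flat cached-token
# discount (OpenAI-style, ~50%) and explicit caching (Anthropic-style,
# ~90%). Prices and scenario numbers are examples; plug in your own.
def monthly_cached_savings(base_price: float, discount: float,
                           prefix_tokens: int, requests_per_day: int,
                           hit_rate: float) -> float:
    saved_per_hit = prefix_tokens * base_price * discount / 1e6
    return saved_per_hit * requests_per_day * hit_rate * 30

openai_savings = monthly_cached_savings(2.50, 0.50, 5_000, 10_000, 0.80)
anthropic_savings = monthly_cached_savings(3.00, 0.90, 5_000, 10_000, 0.80)
print(f"OpenAI: ${openai_savings:,.0f}/mo  Anthropic: ${anthropic_savings:,.0f}/mo")
# OpenAI: $1,500/mo  Anthropic: $3,240/mo — roughly 2x despite higher base price
```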
]]></content:encoded></item><item><title>OpenAI Responses API Tutorial 2026: Build Stateful AI Apps in Python</title><link>https://baeseokjae.github.io/posts/openai-responses-api-tutorial-2026/</link><pubDate>Tue, 21 Apr 2026 00:11:38 +0000</pubDate><guid>https://baeseokjae.github.io/posts/openai-responses-api-tutorial-2026/</guid><description>Complete OpenAI Responses API tutorial 2026: stateful conversations, built-in tools, function calling, and migration from Chat Completions.</description><content:encoded><![CDATA[<p>The OpenAI Responses API is the new primary interface for building stateful, agentic AI applications — replacing the Assistants API (being sunset H1 2026) and extending beyond what Chat Completions can do. This tutorial walks through everything from your first API call to building multi-step agents with built-in tools like web search and file retrieval.</p>
<h2 id="what-is-the-openai-responses-api">What Is the OpenAI Responses API?</h2>
<p>The OpenAI Responses API is a stateful, tool-native interface for building AI agents and multi-turn applications — launched in March 2025 as OpenAI&rsquo;s replacement for the Assistants API and a significant evolution beyond Chat Completions. Unlike Chat Completions, which is stateless (every request requires you to resend the full conversation history), Responses API maintains conversation state server-side using <code>previous_response_id</code>. A 10-turn conversation with Chat Completions resends your entire history on turn 10, making it up to 5x more expensive for long dialogues. Responses API sends only the new message each turn — the server already holds context. Built-in tools (web search at $25–50/1K queries, file search at $2.50/1K queries) are first-class citizens rather than custom function definitions, and reasoning tokens from o3 and o4-mini are preserved between turns instead of being discarded. OpenAI has moved all example code in the openai-python repository to Responses API patterns — it is where the platform is going.</p>
<h3 id="key-architecture-concepts">Key Architecture Concepts</h3>
<p>The Responses API is built around three core primitives that differ from Chat Completions:</p>
<ul>
<li><strong>Response objects</strong> — Each API call returns a Response object with an <code>id</code> field. Pass this as <code>previous_response_id</code> in the next call to chain turns without resending history.</li>
<li><strong>Built-in tools</strong> — <code>web_search_preview</code>, <code>file_search</code>, and <code>computer_use_preview</code> are activated by including them in the <code>tools</code> array. No custom server infrastructure required.</li>
<li><strong>Semantic streaming events</strong> — Instead of raw token deltas, streaming emits structured events like <code>response.output_item.added</code>, <code>response.content_part.added</code>, and <code>response.done</code>.</li>
</ul>
<h2 id="chat-completions-vs-responses-api-vs-assistants-api">Chat Completions vs Responses API vs Assistants API</h2>
<p>The Responses API occupies a distinct position: it is more capable than Chat Completions for stateful and agentic workflows, while being simpler and cheaper than the Assistants API that it is replacing. Understanding which to use requires knowing what each one manages for you versus what you manage yourself. Chat Completions gives you maximum control (you own all state, all persistence, all tool execution loops) at the cost of client-side complexity. Responses API moves state management and tool orchestration server-side while keeping the request/response model familiar. Assistants API managed Threads, Runs, and Files as persistent objects — a full lifecycle that developers found overly complex for most use cases. OpenAI is converging on Responses API as the primary stateful API.</p>
<table>
  <thead>
      <tr>
          <th>Feature</th>
          <th>Chat Completions</th>
          <th>Responses API</th>
          <th>Assistants API</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>State management</td>
          <td>Client-side</td>
          <td>Server-side</td>
          <td>Server-side (Threads)</td>
      </tr>
      <tr>
          <td>Built-in tools</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes (Code Interpreter, etc.)</td>
      </tr>
      <tr>
          <td>Reasoning token preservation</td>
          <td>No</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Pricing overhead</td>
          <td>Lowest</td>
          <td>Medium</td>
          <td>Highest</td>
      </tr>
      <tr>
          <td>Streaming events</td>
          <td>Raw token deltas</td>
          <td>Semantic events</td>
          <td>SSE stream</td>
      </tr>
      <tr>
          <td>Status</td>
          <td>Active</td>
          <td>Active (primary)</td>
          <td>Sunset H1 2026</td>
      </tr>
      <tr>
          <td>Multi-provider support</td>
          <td>Wide</td>
          <td>Open Responses spec</td>
          <td>OpenAI only</td>
      </tr>
  </tbody>
</table>
<p>The migration path from Assistants to Responses is the most urgent — H1 2026 sunset means any Threads/Runs code needs to be ported now.</p>
<h2 id="getting-started-your-first-responses-api-call">Getting Started: Your First Responses API Call</h2>
<p>Making your first Responses API call requires the <code>openai</code> Python package (version ≥ 1.66.0 for full Responses support) and an API key. The shape of the request is close to Chat Completions but uses a different method and a different response schema. The critical difference from Chat Completions is the <code>input</code> parameter instead of <code>messages</code>, and the <code>model</code> field supporting all GPT-4o, o3, and o4-mini identifiers. The response is a <code>Response</code> object with an <code>id</code> field that enables state chaining, <code>output</code> containing the model&rsquo;s reply, and usage statistics. You do not need to configure threads, assistants, or vector stores before making your first call — just the model and the input.</p>
<p><strong>Install and authenticate:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>pip install <span style="color:#e6db74">&#34;openai&gt;=1.66.0&#34;</span>  <span style="color:#75715e"># quotes stop the shell from treating &gt;= as a redirect</span>
</span></span><span style="display:flex;"><span>export OPENAI_API_KEY<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;sk-...&#34;</span>
</span></span></code></pre></div><p><strong>Your first call (Python):</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Explain the difference between Responses API and Chat Completions in one paragraph.&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>print(response<span style="color:#f92672">.</span>output[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>content[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>text)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Response ID: </span><span style="color:#e6db74">{</span>response<span style="color:#f92672">.</span>id<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)  <span style="color:#75715e"># Save this for multi-turn</span>
</span></span></code></pre></div><p><strong>JavaScript/TypeScript equivalent:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-javascript" data-lang="javascript"><span style="display:flex;"><span><span style="color:#66d9ef">import</span> <span style="color:#a6e22e">OpenAI</span> <span style="color:#a6e22e">from</span> <span style="color:#e6db74">&#34;openai&#34;</span>;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">const</span> <span style="color:#a6e22e">client</span> <span style="color:#f92672">=</span> <span style="color:#66d9ef">new</span> <span style="color:#a6e22e">OpenAI</span>();
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">const</span> <span style="color:#a6e22e">response</span> <span style="color:#f92672">=</span> <span style="color:#66d9ef">await</span> <span style="color:#a6e22e">client</span>.<span style="color:#a6e22e">responses</span>.<span style="color:#a6e22e">create</span>({
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">model</span><span style="color:#f92672">:</span> <span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">input</span><span style="color:#f92672">:</span> <span style="color:#e6db74">&#34;Explain the difference between Responses API and Chat Completions.&#34;</span>
</span></span><span style="display:flex;"><span>});
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">console</span>.<span style="color:#a6e22e">log</span>(<span style="color:#a6e22e">response</span>.<span style="color:#a6e22e">output</span>[<span style="color:#ae81ff">0</span>].<span style="color:#a6e22e">content</span>[<span style="color:#ae81ff">0</span>].<span style="color:#a6e22e">text</span>);
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">console</span>.<span style="color:#a6e22e">log</span>(<span style="color:#e6db74">`Response ID: </span><span style="color:#e6db74">${</span><span style="color:#a6e22e">response</span>.<span style="color:#a6e22e">id</span><span style="color:#e6db74">}</span><span style="color:#e6db74">`</span>);
</span></span></code></pre></div><p>The response object structure is different from <code>ChatCompletion</code> — <code>output</code> is a list of items, each with a <code>content</code> list. Text is at <code>response.output[0].content[0].text</code>.</p>
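<p>Because <code>response.output[0].content[0].text</code> assumes a single text item, a small defensive helper avoids index errors when the output contains tool calls or multiple parts. The helper and the mock object below are illustrative, based only on the shape described above:</p>

```python
# Defensive text extraction for a Responses API result. The
# SimpleNamespace mock stands in for a real Response object here;
# attribute names follow the output[*].content[*].text shape above.
from types import SimpleNamespace

def extract_text(response) -> str:
    parts = []
    for item in getattr(response, "output", []):
        for part in getattr(item, "content", None) or []:
            text = getattr(part, "text", None)
            if text:
                parts.append(text)
    return "".join(parts)

mock = SimpleNamespace(output=[
    SimpleNamespace(content=[SimpleNamespace(text="Hello "),
                             SimpleNamespace(text="world")]),
])
print(extract_text(mock))  # Hello world
```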
<h2 id="server-side-state-management-with-previous_response_id">Server-Side State Management with previous_response_id</h2>
<p>Server-side state management via <code>previous_response_id</code> is the most significant capability that Responses API adds over Chat Completions. When you pass a <code>previous_response_id</code> to a new request, the OpenAI server reconstructs the conversation context internally — you only send the new user message, not the full history. This eliminates the most expensive part of long conversations: re-tokenizing and re-encoding historical messages on every turn. For a 10-turn conversation with 500 tokens per turn, Chat Completions sends approximately 5,000 tokens on turn 10 (full history) while Responses API sends roughly 500 tokens (just the new input). At scale across thousands of daily active users, this is not a marginal difference. Reasoning tokens from o3 and o4-mini are also preserved — the model&rsquo;s internal chain-of-thought from turn 3 informs turn 7, producing more coherent agentic behavior than Chat Completions where that reasoning context is lost.</p>
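<p>The turn-10 numbers follow from simple arithmetic, sketched here with the same illustrative 500-tokens-per-turn assumption:</p>

```python
# The arithmetic behind the turn-10 comparison: a stateless API resends
# the whole history each turn; a stateful one sends only the new message.
TOKENS_PER_TURN = 500
TURNS = 10

# Chat Completions (stateless): turn n resends turns 1..n
stateless_turn_10 = TOKENS_PER_TURN * TURNS
stateless_total = sum(TOKENS_PER_TURN * n for n in range(1, TURNS + 1))

# Responses API (stateful): every turn sends only the new message
stateful_turn_10 = TOKENS_PER_TURN
stateful_total = TOKENS_PER_TURN * TURNS

print(stateless_turn_10, stateful_turn_10)  # 5000 500 (tokens sent on turn 10)
print(stateless_total, stateful_total)      # 27500 5000 (total over 10 turns)
```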
<p><strong>Multi-turn conversation example:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Turn 1</span>
</span></span><span style="display:flex;"><span>response_1 <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;I&#39;m building a Python web scraper. Where should I start?&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">&#34;Assistant:&#34;</span>, response_1<span style="color:#f92672">.</span>output[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>content[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>text)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Turn 2 — only send new message, server holds context</span>
</span></span><span style="display:flex;"><span>response_2 <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>    previous_response_id<span style="color:#f92672">=</span>response_1<span style="color:#f92672">.</span>id,
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Which HTTP library would you recommend for async scraping?&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">&#34;Assistant:&#34;</span>, response_2<span style="color:#f92672">.</span>output[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>content[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>text)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Turn 3 — chain continues</span>
</span></span><span style="display:flex;"><span>response_3 <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>    previous_response_id<span style="color:#f92672">=</span>response_2<span style="color:#f92672">.</span>id,
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Show me a basic example using that library.&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">&#34;Assistant:&#34;</span>, response_3<span style="color:#f92672">.</span>output[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>content[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>text)
</span></span></code></pre></div><p>Store <code>response.id</code> in your database alongside the user session. When the user returns, load their latest <code>response_id</code> and pass it as <code>previous_response_id</code> — the conversation resumes with full context.</p>
<h3 id="managing-state-in-production">Managing State in Production</h3>
<p>For production applications, treat <code>previous_response_id</code> like a foreign key in your session table:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> sqlite3
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI()
</span></span><span style="display:flex;"><span>db <span style="color:#f92672">=</span> sqlite3<span style="color:#f92672">.</span>connect(<span style="color:#e6db74">&#34;sessions.db&#34;</span>)
</span></span><span style="display:flex;"><span>db<span style="color:#f92672">.</span>execute(<span style="color:#e6db74">&#34;CREATE TABLE IF NOT EXISTS sessions (user_id TEXT PRIMARY KEY, last_response_id TEXT)&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">chat</span>(user_id: str, message: str) <span style="color:#f92672">-&gt;</span> str:
</span></span><span style="display:flex;"><span>    row <span style="color:#f92672">=</span> db<span style="color:#f92672">.</span>execute(<span style="color:#e6db74">&#34;SELECT last_response_id FROM sessions WHERE user_id=?&#34;</span>, (user_id,))<span style="color:#f92672">.</span>fetchone()
</span></span><span style="display:flex;"><span>    prev_id <span style="color:#f92672">=</span> row[<span style="color:#ae81ff">0</span>] <span style="color:#66d9ef">if</span> row <span style="color:#66d9ef">else</span> <span style="color:#66d9ef">None</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>        model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>        input<span style="color:#f92672">=</span>message,
</span></span><span style="display:flex;"><span>        previous_response_id<span style="color:#f92672">=</span>prev_id
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    new_id <span style="color:#f92672">=</span> response<span style="color:#f92672">.</span>id
</span></span><span style="display:flex;"><span>    db<span style="color:#f92672">.</span>execute(<span style="color:#e6db74">&#34;INSERT OR REPLACE INTO sessions VALUES (?, ?)&#34;</span>, (user_id, new_id))
</span></span><span style="display:flex;"><span>    db<span style="color:#f92672">.</span>commit()
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> response<span style="color:#f92672">.</span>output[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>content[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>text
</span></span></code></pre></div><h2 id="built-in-tools-web-search-file-search-and-computer-use">Built-in Tools: Web Search, File Search, and Computer Use</h2>
<p>Built-in tools in the Responses API replace custom infrastructure that developers previously had to build and maintain themselves. Web search (<code>web_search_preview</code>) lets the model query the live web and return cited results without you managing a search API key or result parsing logic. File search (<code>file_search</code>) enables semantic retrieval over uploaded documents using OpenAI-hosted vector stores — at $2.50 per 1,000 queries with the first gigabyte of storage free and $0.10/GB/day after that. Computer use (<code>computer_use_preview</code>) allows the model to control a browser or desktop environment, opening the door to automation workflows that were previously limited to specialized tools. These tools are activated by listing them in the <code>tools</code> array of your request — no separate SDK, no custom endpoints. The model decides when to invoke them based on the user&rsquo;s input, executes them server-side, and returns the enriched response in a single API call.</p>
<p><strong>Web search tool:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>    tools<span style="color:#f92672">=</span>[{<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;web_search_preview&#34;</span>}],
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;What are the latest OpenAI API pricing changes in 2026?&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Response includes citations</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> item <span style="color:#f92672">in</span> response<span style="color:#f92672">.</span>output:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> item<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;message&#34;</span>:
</span></span><span style="display:flex;"><span>        print(item<span style="color:#f92672">.</span>content[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>text)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">elif</span> item<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;web_search_call&#34;</span>:
</span></span><span style="display:flex;"><span>        print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Searched: </span><span style="color:#e6db74">{</span>item<span style="color:#f92672">.</span>query<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span></code></pre></div><p><strong>File search with vector store:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Upload files first</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">with</span> open(<span style="color:#e6db74">&#34;docs/api_reference.pdf&#34;</span>, <span style="color:#e6db74">&#34;rb&#34;</span>) <span style="color:#66d9ef">as</span> f:
</span></span><span style="display:flex;"><span>    file <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>files<span style="color:#f92672">.</span>create(file<span style="color:#f92672">=</span>f, purpose<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;assistants&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Create vector store</span>
</span></span><span style="display:flex;"><span>vs <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>vector_stores<span style="color:#f92672">.</span>create(name<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;API Docs&#34;</span>)
</span></span><span style="display:flex;"><span>client<span style="color:#f92672">.</span>vector_stores<span style="color:#f92672">.</span>files<span style="color:#f92672">.</span>create(vector_store_id<span style="color:#f92672">=</span>vs<span style="color:#f92672">.</span>id, file_id<span style="color:#f92672">=</span>file<span style="color:#f92672">.</span>id)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Query with file search</span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>    tools<span style="color:#f92672">=</span>[{
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;file_search&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;vector_store_ids&#34;</span>: [vs<span style="color:#f92672">.</span>id]
</span></span><span style="display:flex;"><span>    }],
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;What are the rate limits for the Responses API?&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>print(response<span style="color:#f92672">.</span>output[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>content[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>text)
</span></span></code></pre></div><p><strong>Tool pricing summary:</strong></p>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>Cost</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>web_search_preview</code></td>
          <td>$25–50 per 1,000 queries</td>
      </tr>
      <tr>
          <td><code>file_search</code></td>
          <td>$2.50 per 1,000 queries + $0.10/GB/day storage (first GB free)</td>
      </tr>
      <tr>
          <td><code>computer_use_preview</code></td>
          <td>Billed at model token rates + compute</td>
      </tr>
  </tbody>
</table>
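<p>As a sanity check on these rates, here is a small cost estimator. It is a sketch: the numbers are hardcoded from the table above, <code>web_search_preview</code> pricing varies by model (hence the rate parameter), and a 30-day month is assumed; verify against current OpenAI pricing before budgeting.</p>

```python
def monthly_tool_cost(web_queries: int, file_queries: int, storage_gb: float,
                      web_rate_per_1k: float = 25.0) -> float:
    """Estimate monthly built-in tool spend in USD (rates from the table above)."""
    web = web_queries / 1000 * web_rate_per_1k       # $25-50 per 1K queries
    file_search = file_queries / 1000 * 2.50         # $2.50 per 1K queries
    storage = max(storage_gb - 1.0, 0) * 0.10 * 30   # first GB free, then $0.10/GB/day
    return round(web + file_search + storage, 2)

# 50K web searches, 200K file searches, 5 GB of vector-store documents
print(monthly_tool_cost(50_000, 200_000, 5.0))  # 1762.0
```

<p>At these volumes, web search dominates the bill ($1,250) while the vector-store storage term ($12) is nearly negligible.</p>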
<h2 id="function-calling-with-the-responses-api">Function Calling with the Responses API</h2>
<p>Function calling in the Responses API follows the same five-step loop as Chat Completions, but integrates cleanly with server-side state so you do not need to manually reconstruct conversation history after each tool execution. The loop is: define tools → send request → model returns <code>function_call</code> items in <code>output</code> → execute functions locally → send results back with <code>previous_response_id</code> → model generates final response. Strict mode (<code>strict: true</code>) uses constrained decoding at token generation time to guarantee 100% schema compliance — critical for production agents where a malformed JSON response would break your execution logic. Parallel tool calls allow the model to request multiple function executions in a single response; you run all of them simultaneously and return all results in one follow-up request.</p>
<p><strong>Five-step function calling loop:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> json
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Step 1: Define tools</span>
</span></span><span style="display:flex;"><span>tools <span style="color:#f92672">=</span> [
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;function&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;get_weather&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;description&#34;</span>: <span style="color:#e6db74">&#34;Get current weather for a city&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;parameters&#34;</span>: {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;object&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;properties&#34;</span>: {
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;city&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;string&#34;</span>},
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;units&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;string&#34;</span>, <span style="color:#e6db74">&#34;enum&#34;</span>: [<span style="color:#e6db74">&#34;celsius&#34;</span>, <span style="color:#e6db74">&#34;fahrenheit&#34;</span>]}
</span></span><span style="display:flex;"><span>            },
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;required&#34;</span>: [<span style="color:#e6db74">&#34;city&#34;</span>, <span style="color:#e6db74">&#34;units&#34;</span>],
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;additionalProperties&#34;</span>: <span style="color:#66d9ef">False</span>
</span></span><span style="display:flex;"><span>        },
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;strict&#34;</span>: <span style="color:#66d9ef">True</span>  <span style="color:#75715e"># Step 1b: Enable strict mode</span>
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Step 2: Send request</span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>    tools<span style="color:#f92672">=</span>tools,
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;What&#39;s the weather in Tokyo and Berlin?&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Step 3: Check for tool calls</span>
</span></span><span style="display:flex;"><span>tool_calls <span style="color:#f92672">=</span> [item <span style="color:#66d9ef">for</span> item <span style="color:#f92672">in</span> response<span style="color:#f92672">.</span>output <span style="color:#66d9ef">if</span> item<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;function_call&#34;</span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Step 4: Execute functions</span>
</span></span><span style="display:flex;"><span>results <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> tc <span style="color:#f92672">in</span> tool_calls:
</span></span><span style="display:flex;"><span>    args <span style="color:#f92672">=</span> json<span style="color:#f92672">.</span>loads(tc<span style="color:#f92672">.</span>arguments)
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Your actual implementation</span>
</span></span><span style="display:flex;"><span>    weather_data <span style="color:#f92672">=</span> {<span style="color:#e6db74">&#34;temperature&#34;</span>: <span style="color:#ae81ff">18</span>, <span style="color:#e6db74">&#34;condition&#34;</span>: <span style="color:#e6db74">&#34;partly cloudy&#34;</span>}
</span></span><span style="display:flex;"><span>    results<span style="color:#f92672">.</span>append({
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;function_call_output&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;call_id&#34;</span>: tc<span style="color:#f92672">.</span>call_id,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;output&#34;</span>: json<span style="color:#f92672">.</span>dumps(weather_data)
</span></span><span style="display:flex;"><span>    })
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Step 5: Send results, get final response</span>
</span></span><span style="display:flex;"><span>final <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>    previous_response_id<span style="color:#f92672">=</span>response<span style="color:#f92672">.</span>id,
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span>results
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>print(final<span style="color:#f92672">.</span>output[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>content[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>text)
</span></span></code></pre></div><h3 id="parallel-tool-calls">Parallel Tool Calls</h3>
<p>When the model needs multiple data points, it can request them all at once. Execute in parallel and return all results together:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> asyncio
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">async</span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">execute_tool</span>(tc):
</span></span><span style="display:flex;"><span>    args <span style="color:#f92672">=</span> json<span style="color:#f92672">.</span>loads(tc<span style="color:#f92672">.</span>arguments)
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Async execution of each tool call</span>
</span></span><span style="display:flex;"><span>    result <span style="color:#f92672">=</span> <span style="color:#66d9ef">await</span> fetch_data(args)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> {
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;function_call_output&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;call_id&#34;</span>: tc<span style="color:#f92672">.</span>call_id,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;output&#34;</span>: json<span style="color:#f92672">.</span>dumps(result)
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>tool_calls <span style="color:#f92672">=</span> [item <span style="color:#66d9ef">for</span> item <span style="color:#f92672">in</span> response<span style="color:#f92672">.</span>output <span style="color:#66d9ef">if</span> item<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;function_call&#34;</span>]
</span></span><span style="display:flex;"><span>results <span style="color:#f92672">=</span> <span style="color:#66d9ef">await</span> asyncio<span style="color:#f92672">.</span>gather(<span style="color:#f92672">*</span>[execute_tool(tc) <span style="color:#66d9ef">for</span> tc <span style="color:#f92672">in</span> tool_calls])
</span></span></code></pre></div><p>For dependent operations (tool B requires tool A&rsquo;s output), set <code>parallel_tool_calls: False</code>, or use a reasoning model such as o3 or o4-mini, which sequences calls naturally as part of its reasoning.</p>
<h2 id="strict-mode-and-schema-enforcement-for-production">Strict Mode and Schema Enforcement for Production</h2>
<p>Strict mode in the Responses API&rsquo;s function calling achieves 100% schema compliance by applying constrained decoding at the token generation level — the model cannot produce a token that would violate your JSON schema. This is fundamentally different from prompt-level instructions (&ldquo;always return valid JSON&rdquo;) which can fail under adversarial inputs or long context. For production agents processing thousands of tool call cycles, even a 0.1% JSON parse failure rate creates operational overhead: error logging, retry logic, fallback handling, user-facing error states. Strict mode eliminates this class of failure entirely at generation time. The requirement is that your schema uses only supported types (<code>string</code>, <code>number</code>, <code>boolean</code>, <code>object</code>, <code>array</code>, <code>null</code>), sets <code>additionalProperties: false</code> on all objects, and marks all properties as <code>required</code>. These constraints are strict mode&rsquo;s trade-off: less flexible schemas in exchange for guaranteed compliance.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>tool_schema <span style="color:#f92672">=</span> {
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;function&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;create_ticket&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;description&#34;</span>: <span style="color:#e6db74">&#34;Create a support ticket in the system&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;parameters&#34;</span>: {
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;object&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;properties&#34;</span>: {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;title&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;string&#34;</span>},
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;priority&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;string&#34;</span>, <span style="color:#e6db74">&#34;enum&#34;</span>: [<span style="color:#e6db74">&#34;low&#34;</span>, <span style="color:#e6db74">&#34;medium&#34;</span>, <span style="color:#e6db74">&#34;high&#34;</span>, <span style="color:#e6db74">&#34;critical&#34;</span>]},
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;assignee_id&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: [<span style="color:#e6db74">&#34;string&#34;</span>, <span style="color:#e6db74">&#34;null&#34;</span>]},
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;tags&#34;</span>: {
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;array&#34;</span>,
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;items&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;string&#34;</span>}
</span></span><span style="display:flex;"><span>            }
</span></span><span style="display:flex;"><span>        },
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;required&#34;</span>: [<span style="color:#e6db74">&#34;title&#34;</span>, <span style="color:#e6db74">&#34;priority&#34;</span>, <span style="color:#e6db74">&#34;assignee_id&#34;</span>, <span style="color:#e6db74">&#34;tags&#34;</span>],
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;additionalProperties&#34;</span>: <span style="color:#66d9ef">False</span>
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;strict&#34;</span>: <span style="color:#66d9ef">True</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>With <code>strict: True</code>, if the model cannot fit a value into your schema, it will use <code>null</code> for nullable fields rather than hallucinating invalid values.</p>
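<p>On the consuming side, this guarantee simplifies parsing. The payload below is a hand-written example of the shape strict mode guarantees for the <code>create_ticket</code> schema above: valid JSON, every required key present, and nullable fields explicitly <code>null</code>.</p>

```python
import json

# Hand-written example of a strict-mode arguments string (not a live API response)
arguments = (
    '{"title": "Login page returns 500", "priority": "high", '
    '"assignee_id": null, "tags": ["auth", "regression"]}'
)

args = json.loads(arguments)  # strict mode guarantees this parse cannot fail
# No .get() defaults or KeyError handling needed: all required keys are present
assignee = args["assignee_id"] or "unassigned"
print(assignee, args["priority"])  # unassigned high
```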
<h2 id="streaming-with-semantic-events">Streaming with Semantic Events</h2>
<p>Streaming in the Responses API uses structured semantic events rather than the raw <code>choices[0].delta.content</code> tokens you get from Chat Completions. This matters for building reactive UIs and agent orchestration loops: you know exactly when a tool call starts, when content is being added, and when the response is complete — without parsing partial JSON or managing your own buffer state. Semantic events include <code>response.output_item.added</code> (new output item starting), <code>response.content_part.added</code> (new content part), <code>response.output_text.delta</code> (token-by-token text), <code>response.function_call_arguments.delta</code> (streaming function call arguments), and <code>response.completed</code> (full response complete with final object). This is a meaningful ergonomic improvement for streaming agents because tool call arguments arrive incrementally — you can start validation or UI feedback before the full JSON is assembled.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">with</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>stream(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>    tools<span style="color:#f92672">=</span>[{<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;web_search_preview&#34;</span>}],
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Search for the latest news on OpenAI Responses API&#34;</span>
</span></span><span style="display:flex;"><span>) <span style="color:#66d9ef">as</span> stream:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> event <span style="color:#f92672">in</span> stream:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> event<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;response.output_text.delta&#34;</span>:
</span></span><span style="display:flex;"><span>            print(event<span style="color:#f92672">.</span>delta, end<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;&#34;</span>, flush<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">elif</span> event<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;response.output_item.added&#34;</span>:
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> event<span style="color:#f92672">.</span>item<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;web_search_call&#34;</span>:
</span></span><span style="display:flex;"><span>                print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;</span><span style="color:#ae81ff">\n</span><span style="color:#e6db74">[Searching: </span><span style="color:#e6db74">{</span>event<span style="color:#f92672">.</span>item<span style="color:#f92672">.</span>query<span style="color:#e6db74">}</span><span style="color:#e6db74">]&#34;</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">elif</span> event<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;response.completed&#34;</span>:
</span></span><span style="display:flex;"><span>            print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;</span><span style="color:#ae81ff">\n\n</span><span style="color:#e6db74">Final response ID: </span><span style="color:#e6db74">{</span>event<span style="color:#f92672">.</span>response<span style="color:#f92672">.</span>id<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span></code></pre></div><h2 id="cost-architecture-when-to-use-which-api">Cost Architecture: When to Use Which API</h2>
<p>The Responses API sits between Chat Completions (lowest cost) and Assistants API (highest overhead) in terms of cost structure. For short, single-turn interactions, Chat Completions is still cheaper — there is no state storage overhead and no per-query tool pricing. For conversations longer than 3–4 turns, the Responses API often wins because you stop paying to resend history: in a 10-turn conversation with 500 tokens of new context per turn, turn 10 alone costs roughly 5,000 input tokens on Chat Completions versus roughly 500 on the Responses API. The break-even point depends on your average conversation length and the token costs for your chosen model. Built-in tools add per-use costs but replace infrastructure you would otherwise build: a self-hosted web search integration requires API keys, result parsing, prompt injection into context, and ongoing maintenance. At $25–50/1K queries, <code>web_search_preview</code> is often cheaper than developer time for low-to-medium volume applications.</p>
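<p>The history-resend arithmetic is easy to sketch. This follows the per-turn numbers above and is a simplification: it counts input tokens only and assumes the Responses API bills just the newly sent turn, ignoring output tokens and cache discounts.</p>

```python
def per_turn_input_tokens(turn: int, tokens_per_turn: int) -> tuple[int, int]:
    """Input tokens billed at a given turn: (chat_completions, responses_api)."""
    # Chat Completions resends every prior turn plus the new one;
    # Responses API (under the simplification above) sends only the new turn.
    return turn * tokens_per_turn, tokens_per_turn

def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> tuple[int, int]:
    """Total input tokens billed over the whole conversation."""
    chat = sum(t * tokens_per_turn for t in range(1, turns + 1))
    return chat, turns * tokens_per_turn

print(per_turn_input_tokens(10, 500))    # (5000, 500)
print(cumulative_input_tokens(10, 500))  # (27500, 5000)
```

<p>The quadratic growth of the Chat Completions total is why the gap widens with every turn.</p>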
<table>
  <thead>
      <tr>
          <th>Scenario</th>
          <th>Recommended API</th>
          <th>Reason</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Single-turn completions, high volume</td>
          <td>Chat Completions</td>
          <td>No state overhead</td>
      </tr>
      <tr>
          <td>Multi-turn chat (3+ turns)</td>
          <td>Responses API</td>
          <td>Avoids history resend cost</td>
      </tr>
      <tr>
          <td>Document Q&amp;A with file retrieval</td>
          <td>Responses API + file_search</td>
          <td>Built-in vector store</td>
      </tr>
      <tr>
          <td>Web-augmented research agents</td>
          <td>Responses API + web_search</td>
          <td>No custom search infra</td>
      </tr>
      <tr>
          <td>Legacy Assistants code</td>
          <td>Migrate to Responses</td>
          <td>Assistants sunset H1 2026</td>
      </tr>
      <tr>
          <td>Multi-provider portability</td>
          <td>Responses API (Open Responses spec)</td>
          <td>Works on Ollama, vLLM, etc.</td>
      </tr>
  </tbody>
</table>
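<p>The break-even arithmetic above can be sketched in a few lines. This is a rough model, not official pricing: it assumes an illustrative flat per-token input price and the simplification that a Responses API turn bills only the newly sent message.</p>

```python
# Rough cost model for an N-turn conversation.
# INPUT_PRICE is illustrative (adjust for your model), not official pricing.
INPUT_PRICE = 2.50 / 1_000_000  # dollars per input token

def chat_completions_input_tokens(turns, tokens_per_turn=500):
    # Turn k resends the k-1 prior messages plus the new one
    return sum(k * tokens_per_turn for k in range(1, turns + 1))

def responses_input_tokens(turns, tokens_per_turn=500):
    # Each turn sends only the new message; history lives server-side
    return turns * tokens_per_turn

for turns in (1, 3, 10):
    cc = chat_completions_input_tokens(turns)
    ra = responses_input_tokens(turns)
    print(f"{turns:2d} turns: {cc:6d} vs {ra:5d} input tokens "
          f"(${cc * INPUT_PRICE:.4f} vs ${ra * INPUT_PRICE:.4f})")
```

<p>At ten turns the cumulative input is 27,500 vs 5,000 tokens — a 5.5x gap — which is why the savings compound as conversations get longer.</p>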
<h2 id="the-open-responses-specification">The Open Responses Specification</h2>
<p>The Open Responses specification is a multi-provider API standard backed by OpenAI, Nvidia, Vercel, OpenRouter, Hugging Face, LM Studio, Ollama, and vLLM — defining a shared interface for stateful AI responses that any compatible server can implement. This matters for developers building on the Responses API because it means your code is not locked to OpenAI infrastructure. Ollama added Open Responses support in v0.13.3 (non-stateful flavor for local models), and vLLM ships a fully compatible server for self-hosted deployments. Azure OpenAI also supports the Responses API through its own hosted endpoint. The specification defines the request/response schema, streaming event format, and tool calling protocol — the same <code>previous_response_id</code> chaining, same <code>tools</code> array format, same semantic streaming events. Write once, run on OpenAI, Azure, local Ollama, or any vLLM deployment.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Point to any Open Responses-compatible server</span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI(
</span></span><span style="display:flex;"><span>    api_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;ollama&#34;</span>,  <span style="color:#75715e"># or your local API key</span>
</span></span><span style="display:flex;"><span>    base_url<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;http://localhost:11434/v1&#34;</span>  <span style="color:#75715e"># local Ollama; the client appends /responses</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Same code works — just the endpoint changes</span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;llama3.2&#34;</span>,
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Explain stateful conversation management.&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><h2 id="migrating-from-chat-completions-to-responses-api">Migrating from Chat Completions to Responses API</h2>
<p>Migrating from Chat Completions to Responses API is the most straightforward upgrade path because the model IDs are identical, the tool definition format is compatible, and you can migrate incrementally — route new features to Responses API while leaving existing Chat Completions code untouched. The surface-level change is <code>client.chat.completions.create()</code> → <code>client.responses.create()</code>, <code>messages</code> → <code>input</code>, and manually managed history → <code>previous_response_id</code>. For streaming, swap <code>for chunk in stream</code> token handling for semantic event processing. The deeper change is architectural: you stop owning conversation state in your database and delegate it to OpenAI&rsquo;s server, keeping only the <code>response_id</code> as a foreign key.</p>
<p><strong>Before (Chat Completions):</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>history <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">chat</span>(message: str) <span style="color:#f92672">-&gt;</span> str:
</span></span><span style="display:flex;"><span>    history<span style="color:#f92672">.</span>append({<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: message})
</span></span><span style="display:flex;"><span>    response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>chat<span style="color:#f92672">.</span>completions<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>        model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>        messages<span style="color:#f92672">=</span>history  <span style="color:#75715e"># Full history every time</span>
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>    reply <span style="color:#f92672">=</span> response<span style="color:#f92672">.</span>choices[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>message<span style="color:#f92672">.</span>content
</span></span><span style="display:flex;"><span>    history<span style="color:#f92672">.</span>append({<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;assistant&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: reply})
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> reply
</span></span></code></pre></div><p><strong>After (Responses API):</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>last_response_id <span style="color:#f92672">=</span> <span style="color:#66d9ef">None</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">chat</span>(message: str) <span style="color:#f92672">-&gt;</span> str:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">global</span> last_response_id
</span></span><span style="display:flex;"><span>    response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>        model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>        input<span style="color:#f92672">=</span>message,
</span></span><span style="display:flex;"><span>        previous_response_id<span style="color:#f92672">=</span>last_response_id  <span style="color:#75715e"># Just the ID</span>
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>    last_response_id <span style="color:#f92672">=</span> response<span style="color:#f92672">.</span>id
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> response<span style="color:#f92672">.</span>output[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>content[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>text
</span></span></code></pre></div><h2 id="migrating-from-assistants-api-before-h1-2026-sunset">Migrating from Assistants API Before H1 2026 Sunset</h2>
<p>The Assistants API is being sunset in H1 2026, which means any production code using Threads, Runs, Messages, or Assistants objects needs to be ported to Responses API before that date. The migration is not a one-to-one mapping — the conceptual model is different — but the capabilities are equivalent or improved. Threads (persistent conversation containers) map to <code>previous_response_id</code> chains. Runs (execution units with polling) are replaced by single synchronous or streaming Responses API calls. Messages objects (structured conversation history) are replaced by the <code>output</code> array in each Response. Assistants (reusable agent configurations with tools and system prompts) map to per-request <code>instructions</code> and <code>tools</code> parameters, or can be encapsulated in a Python class. The main operational change: you no longer poll for Run completion — Responses API calls block until complete (or stream incrementally).</p>
<p><strong>Assistants API pattern (to replace):</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># OLD: Assistants API (sunset H1 2026)</span>
</span></span><span style="display:flex;"><span>thread <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>beta<span style="color:#f92672">.</span>threads<span style="color:#f92672">.</span>create()
</span></span><span style="display:flex;"><span>client<span style="color:#f92672">.</span>beta<span style="color:#f92672">.</span>threads<span style="color:#f92672">.</span>messages<span style="color:#f92672">.</span>create(thread_id<span style="color:#f92672">=</span>thread<span style="color:#f92672">.</span>id, role<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;user&#34;</span>, content<span style="color:#f92672">=</span>message)
</span></span><span style="display:flex;"><span>run <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>beta<span style="color:#f92672">.</span>threads<span style="color:#f92672">.</span>runs<span style="color:#f92672">.</span>create_and_poll(thread_id<span style="color:#f92672">=</span>thread<span style="color:#f92672">.</span>id, assistant_id<span style="color:#f92672">=</span>assistant_id)
</span></span><span style="display:flex;"><span>messages <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>beta<span style="color:#f92672">.</span>threads<span style="color:#f92672">.</span>messages<span style="color:#f92672">.</span>list(thread_id<span style="color:#f92672">=</span>thread<span style="color:#f92672">.</span>id)
</span></span><span style="display:flex;"><span>reply <span style="color:#f92672">=</span> messages<span style="color:#f92672">.</span>data[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>content[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>text<span style="color:#f92672">.</span>value
</span></span></code></pre></div><p><strong>Responses API equivalent:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># NEW: Responses API</span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>    instructions<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;You are a helpful assistant specializing in Python development.&#34;</span>,
</span></span><span style="display:flex;"><span>    tools<span style="color:#f92672">=</span>[{<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;file_search&#34;</span>, <span style="color:#e6db74">&#34;vector_store_ids&#34;</span>: [vs_id]}],
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span>message,
</span></span><span style="display:flex;"><span>    previous_response_id<span style="color:#f92672">=</span>prev_response_id  <span style="color:#75715e"># replaces Thread</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>reply <span style="color:#f92672">=</span> response<span style="color:#f92672">.</span>output_text  <span style="color:#75715e"># robust: tool-call items may precede the message in output[]</span>
</span></span><span style="display:flex;"><span>prev_response_id <span style="color:#f92672">=</span> response<span style="color:#f92672">.</span>id  <span style="color:#75715e"># store for next turn</span>
</span></span></code></pre></div><h2 id="building-a-complete-agent-end-to-end-tutorial">Building a Complete Agent: End-to-End Tutorial</h2>
<p>A complete Responses API agent combines server-side state, built-in tools, and function calling into a workflow that handles multi-step reasoning without manual orchestration loops. The following agent answers research questions by searching the web, retrieving relevant files, and synthesizing a cited response — all in a single Responses API call that handles tool execution internally when using built-in tools, or across two calls when using custom functions.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> json
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Agent configuration</span>
</span></span><span style="display:flex;"><span>TOOLS <span style="color:#f92672">=</span> [
</span></span><span style="display:flex;"><span>    {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;web_search_preview&#34;</span>},
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;function&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;save_to_notes&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;description&#34;</span>: <span style="color:#e6db74">&#34;Save a research finding to the user&#39;s notes&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;parameters&#34;</span>: {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;object&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;properties&#34;</span>: {
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;title&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;string&#34;</span>},
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;content&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;string&#34;</span>},
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;tags&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;array&#34;</span>, <span style="color:#e6db74">&#34;items&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;string&#34;</span>}}
</span></span><span style="display:flex;"><span>            },
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;required&#34;</span>: [<span style="color:#e6db74">&#34;title&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>, <span style="color:#e6db74">&#34;tags&#34;</span>],
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;additionalProperties&#34;</span>: <span style="color:#66d9ef">False</span>
</span></span><span style="display:flex;"><span>        },
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;strict&#34;</span>: <span style="color:#66d9ef">True</span>
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>SYSTEM_PROMPT <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;&#34;&#34;You are a research assistant. When asked a question:
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">1. Search the web for current information
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">2. Synthesize findings with citations
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">3. If the user asks to save findings, use the save_to_notes function
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">Always cite your sources.&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">ResearchAgent</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> __init__(self):
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>notes <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>last_response_id <span style="color:#f92672">=</span> <span style="color:#66d9ef">None</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">run</span>(self, user_message: str) <span style="color:#f92672">-&gt;</span> str:
</span></span><span style="display:flex;"><span>        response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>            model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>            instructions<span style="color:#f92672">=</span>SYSTEM_PROMPT,
</span></span><span style="display:flex;"><span>            tools<span style="color:#f92672">=</span>TOOLS,
</span></span><span style="display:flex;"><span>            input<span style="color:#f92672">=</span>user_message,
</span></span><span style="display:flex;"><span>            previous_response_id<span style="color:#f92672">=</span>self<span style="color:#f92672">.</span>last_response_id
</span></span><span style="display:flex;"><span>        )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Handle function calls (built-in tools execute automatically)</span>
</span></span><span style="display:flex;"><span>        function_calls <span style="color:#f92672">=</span> [i <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> response<span style="color:#f92672">.</span>output <span style="color:#66d9ef">if</span> i<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;function_call&#34;</span>]
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> function_calls:
</span></span><span style="display:flex;"><span>            results <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">for</span> fc <span style="color:#f92672">in</span> function_calls:
</span></span><span style="display:flex;"><span>                args <span style="color:#f92672">=</span> json<span style="color:#f92672">.</span>loads(fc<span style="color:#f92672">.</span>arguments)
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">if</span> fc<span style="color:#f92672">.</span>name <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;save_to_notes&#34;</span>:
</span></span><span style="display:flex;"><span>                    self<span style="color:#f92672">.</span>notes<span style="color:#f92672">.</span>append(args)
</span></span><span style="display:flex;"><span>                    result <span style="color:#f92672">=</span> {<span style="color:#e6db74">&#34;saved&#34;</span>: <span style="color:#66d9ef">True</span>, <span style="color:#e6db74">&#34;note_count&#34;</span>: len(self<span style="color:#f92672">.</span>notes)}
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">else</span>:
</span></span><span style="display:flex;"><span>                    result <span style="color:#f92672">=</span> {<span style="color:#e6db74">&#34;error&#34;</span>: <span style="color:#e6db74">&#34;unknown function&#34;</span>}  <span style="color:#75715e"># avoid NameError for unhandled names</span>
</span></span><span style="display:flex;"><span>                results<span style="color:#f92672">.</span>append({
</span></span><span style="display:flex;"><span>                    <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;function_call_output&#34;</span>,
</span></span><span style="display:flex;"><span>                    <span style="color:#e6db74">&#34;call_id&#34;</span>: fc<span style="color:#f92672">.</span>call_id,
</span></span><span style="display:flex;"><span>                    <span style="color:#e6db74">&#34;output&#34;</span>: json<span style="color:#f92672">.</span>dumps(result)
</span></span><span style="display:flex;"><span>                })
</span></span><span style="display:flex;"><span>            <span style="color:#75715e"># Get final response after function execution</span>
</span></span><span style="display:flex;"><span>            response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>                model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>                previous_response_id<span style="color:#f92672">=</span>response<span style="color:#f92672">.</span>id,
</span></span><span style="display:flex;"><span>                input<span style="color:#f92672">=</span>results
</span></span><span style="display:flex;"><span>            )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>last_response_id <span style="color:#f92672">=</span> response<span style="color:#f92672">.</span>id
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> response<span style="color:#f92672">.</span>output_text  <span style="color:#75715e"># robust when tool-call items precede the message</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Usage</span>
</span></span><span style="display:flex;"><span>agent <span style="color:#f92672">=</span> ResearchAgent()
</span></span><span style="display:flex;"><span>print(agent<span style="color:#f92672">.</span>run(<span style="color:#e6db74">&#34;What are the key features of the OpenAI Responses API launched in 2025?&#34;</span>))
</span></span><span style="display:flex;"><span>print(agent<span style="color:#f92672">.</span>run(<span style="color:#e6db74">&#34;Save those findings to my notes with the tag &#39;openai-api&#39;&#34;</span>))
</span></span><span style="display:flex;"><span>print(agent<span style="color:#f92672">.</span>run(<span style="color:#e6db74">&#34;What questions do I still have based on what we&#39;ve discussed?&#34;</span>))
</span></span></code></pre></div><hr>
<h2 id="faq">FAQ</h2>
<p>The OpenAI Responses API introduces a fundamentally different programming model compared to Chat Completions and the Assistants API, which sunsets in H1 2026. The most common questions from developers migrating existing applications center on state management, cost implications, and tool compatibility. These answers address the questions that come up most frequently when teams evaluate or implement the Responses API in production systems — covering <code>previous_response_id</code> chaining, the Assistants API sunset timeline, multi-provider portability via the Open Responses specification, cost savings on long conversations, and the interaction between custom function calling and built-in tools. Each answer is self-contained and reflects Responses API behavior as of April 2026. The Responses API launched in March 2025 and has since become OpenAI&rsquo;s primary recommended interface for stateful and agentic applications, with the openai-python library updated to use Responses API patterns throughout its examples.</p>
<h3 id="what-is-the-difference-between-openai-responses-api-and-chat-completions">What is the difference between OpenAI Responses API and Chat Completions?</h3>
<p>The key difference is state management. Chat Completions is stateless — you send the full conversation history on every request and manage persistence yourself. Responses API maintains conversation state server-side via <code>previous_response_id</code>, so each turn only sends the new message. Responses API also includes built-in tools (web search, file search) that Chat Completions lacks, and preserves reasoning tokens between turns for o3 and o4-mini models.</p>
<h3 id="when-will-the-assistants-api-be-sunset">When will the Assistants API be sunset?</h3>
<p>OpenAI has announced the Assistants API will be sunset in H1 2026. This means any production code using Threads, Runs, Messages, or the Assistants beta endpoints needs to be migrated to the Responses API before that deadline. The migration is well-documented and the Responses API provides all equivalent capabilities — stateful conversations, file retrieval, and tool use.</p>
<h3 id="is-the-openai-responses-api-available-on-azure-openai">Is the OpenAI Responses API available on Azure OpenAI?</h3>
<p>Yes. Azure OpenAI supports the Responses API through its hosted endpoint. Additionally, the Open Responses specification backed by Nvidia, Vercel, OpenRouter, and others enables the same API surface on Ollama (v0.13.3+), vLLM, and other compatible servers. The <code>base_url</code> parameter in the OpenAI Python client lets you point to any compatible server.</p>
<h3 id="how-does-previous_response_id-save-money-on-long-conversations">How does <code>previous_response_id</code> save money on long conversations?</h3>
<p>In a 10-turn conversation with Chat Completions, turn 10 sends the entire 9-turn history plus the new message — potentially thousands of tokens of input. With Responses API, turn 10 only sends the new message (a few hundred tokens) because the server already holds the full context. OpenAI estimates Chat Completions can be up to 5x more expensive for long conversations due to this history re-tokenization cost.</p>
<h3 id="can-i-use-both-function-calling-and-built-in-tools-in-the-same-responses-api-call">Can I use both function calling and built-in tools in the same Responses API call?</h3>
<p>Yes. You can include both custom function definitions and built-in tools (like <code>web_search_preview</code> or <code>file_search</code>) in the same <code>tools</code> array. The model will call whichever tools are relevant to the user&rsquo;s request. Built-in tools execute server-side and their results appear automatically in <code>response.output</code>, while custom function calls require your client to execute them and return results via a follow-up request with <code>previous_response_id</code>.</p>
]]></content:encoded></item></channel></rss>