GPT-5.5 Batch API and Flex mode both offer 50% off standard pricing — $2.50 per 1M input tokens and $15 per 1M output tokens versus the standard $5/$30 — giving high-volume AI coding teams a direct path to halving their monthly API spend without changing models or degrading output quality.
What Is GPT-5.5 Batch API and Flex Mode?
GPT-5.5 Batch API and Flex mode are two distinct pricing and execution tiers from OpenAI that both deliver 50% cost savings compared to standard API rates, but differ significantly in how and when results are returned. The Batch API is a fire-and-forget system: you submit up to 50,000 requests in a single JSONL file (up to 200MB), and OpenAI guarantees results within 24 hours. Flex mode, currently in beta as of April 2026, is interactive — requests are processed in real time but with variable latency ranging from a few seconds to several minutes, depending on platform load. GPT-5.5 launched on April 23, 2026, at standard pricing of $5 per 1M input tokens and $30 per 1M output tokens. Both Batch and Flex bring that cost down to $2.50/$15 — the same price as GPT-5.4 standard, but with GPT-5.5’s higher capability, including an 82.7% score on Terminal-Bench 2.0 and 58.6% on SWE-Bench Pro. For engineering teams running nightly code reviews, eval pipelines, or test generation jobs, the practical implication is straightforward: you get a better model at the same cost you were already paying.
Batch vs Flex: The Core Distinction
Batch is fully asynchronous with a 24-hour SLA and no interaction mid-job. Flex is interactive but non-priority — you may encounter HTTP 429 errors during peak traffic windows. Neither tier is suitable for production user-facing requests where sub-second latency is required.
GPT-5.5 Pricing Tiers at a Glance (Standard vs Flex vs Batch vs Priority)
OpenAI now offers four pricing and execution tiers for GPT-5.5, each targeting a different latency-cost tradeoff. Priority tier sits at the top of the speed stack — requests jump the queue for the fastest possible response, priced at a premium above standard rates. Standard tier at $5 per 1M input and $30 per 1M output is the default for most API calls today. Flex and Batch both land at $2.50/$15 — exactly 50% off standard — but serve different use cases. Flex accepts interactive API calls with variable latency, making it usable inside agent loops or CI/CD pipelines where a few extra seconds per call is acceptable. Batch, by contrast, is non-interactive: you upload a file, wait up to 24 hours, and download the results. One important pricing edge case: prompts exceeding 272K tokens are charged at 2x input and 1.5x output rates for the entire session — plan your context window sizes accordingly for large codebase analysis tasks.
| Tier | Input (per 1M) | Output (per 1M) | Latency | Interactive? |
|---|---|---|---|---|
| Priority | >$5 | >$30 | Fastest | Yes |
| Standard | $5.00 | $30.00 | Fast | Yes |
| Flex | $2.50 | $15.00 | Seconds–minutes | Yes (beta) |
| Batch | $2.50 | $15.00 | Up to 24h | No |
When Priority Makes Sense
Priority is reserved for latency-critical production paths where a delayed response directly impacts user experience — think real-time IDE completions or live pair programming assistants. Everything else should flow down the pricing tiers.
Which AI Coding Workflows Qualify for Batch or Flex Mode?
Between 40 and 60% of typical engineering team API workloads are batch-eligible according to usage analysis of common API consumption patterns — a substantial portion of spend that most teams are leaving on the table. The key qualifying criterion for Batch is that the task can tolerate async results: you don’t need the answer in real time. For Flex, the criterion is softer: the task is interactive but not latency-critical — a few extra seconds or minutes is acceptable. Concrete batch-eligible workflows include nightly code review runs across the entire diff since the last merge, automated unit test generation during off-hours CI jobs, eval grading pipelines for fine-tuning or regression testing, embedding refreshes when documentation or codebase content changes, and PR summary generation for engineering digests. Flex-eligible workflows include agent loops that chain multiple model calls where intermediate latency isn’t user-visible, data enrichment tasks that run in the background during active development, and CI/CD steps that run post-merge rather than blocking the merge queue. Standard or Priority should be reserved for inline IDE completions, live chat interfaces, real-time code explanations, and any workflow where a human is actively waiting on the response.
| Workflow | Recommended Tier | Reason |
|---|---|---|
| Nightly code review | Batch | Async, no user waiting |
| Test generation in CI | Batch / Flex | Off-critical path |
| Embedding refresh | Batch | Pure throughput |
| Eval grading | Batch | Fire-and-forget |
| Agent loop (internal calls) | Flex | Interactive but non-urgent |
| PR summary digest | Batch | Scheduled job |
| Inline IDE completion | Standard / Priority | User is actively waiting |
| Live chat assistant | Standard | Latency-sensitive |
How to Audit Your Current API Usage
Pull your OpenAI usage logs for the last 30 days and tag each request type as “user-facing” or “background.” Any background request is a Batch or Flex candidate. Most teams find 40–60% of their volume is immediately reclassifiable.
How to Implement GPT-5.5 Batch API in Python (Step-by-Step)
The GPT-5.5 Batch API requires openai>=2.1.0 and follows a three-step pattern: upload a JSONL file containing your requests, submit the batch job, then poll for completion and download the results. Each line in the JSONL file is a self-contained API request object with a custom custom_id for result matching. The system supports up to 50,000 requests per file and files up to 200MB, with all results guaranteed within 24 hours. Here is a complete working implementation for running nightly code reviews across a list of diffs:
import json
import time
from openai import OpenAI
client = OpenAI() # reads OPENAI_API_KEY from environment
# Step 1: Prepare the JSONL batch file
diffs = [
{"id": "pr-101", "diff": "...<git diff content>..."},
{"id": "pr-102", "diff": "...<git diff content>..."},
]
with open("batch_requests.jsonl", "w") as f:
for item in diffs:
request = {
"custom_id": item["id"],
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": "gpt-5.5",
"messages": [
{
"role": "system",
"content": "You are a senior code reviewer. Identify bugs, security issues, and style violations."
},
{
"role": "user",
"content": f"Review this diff:\n\n{item['diff']}"
}
],
"max_tokens": 1024
}
}
f.write(json.dumps(request) + "\n")
# Step 2: Upload the file
with open("batch_requests.jsonl", "rb") as f:
upload = client.files.create(file=f, purpose="batch")
print(f"Uploaded file: {upload.id}")
# Step 3: Submit the batch job
batch = client.batches.create(
input_file_id=upload.id,
endpoint="/v1/chat/completions",
completion_window="24h"
)
print(f"Batch submitted: {batch.id} — status: {batch.status}")
# Step 4: Poll for completion (in production, use a scheduled job instead)
while batch.status not in ("completed", "failed", "cancelled"):
time.sleep(60)
batch = client.batches.retrieve(batch.id)
print(f"Status: {batch.status} — completed: {batch.request_counts.completed}/{batch.request_counts.total}")
# Step 5: Download and process results
if batch.status == "completed":
content = client.files.content(batch.output_file_id)
for line in content.text.strip().split("\n"):
result = json.loads(line)
custom_id = result["custom_id"]
review_text = result["response"]["body"]["choices"][0]["message"]["content"]
print(f"\n--- Review for {custom_id} ---\n{review_text}")
Error Handling and Partial Failures
Batch jobs can partially succeed — individual requests may fail while others complete. Always check batch.error_file_id after completion and download the error file alongside the output file. Log failed custom_id values and resubmit them in the next batch cycle.
Flex Processing: When to Choose It Over Batch
Flex processing is OpenAI’s interactive-but-discounted tier, currently in beta for GPT-5.5, o3, and o4-mini as of April 2026. It cuts standard rates by 50% while preserving the real-time request-response pattern — meaning your code calls client.chat.completions.create() normally, but with service_tier="flex" added. The tradeoff is variable latency: responses arrive within seconds under low load, but can take several minutes when the platform is busy. Flex may also return HTTP 429 errors during peak windows, so retry logic is mandatory. The practical use case is agent pipelines where model calls happen in a background thread or async queue — the loop keeps running, and a slightly delayed intermediate response doesn’t break anything. A CI/CD step that analyzes test failures after a build completes is a good Flex candidate: the developer isn’t watching the terminal, and a 90-second response versus a 3-second one is irrelevant.
# Using Flex mode — same SDK call, different service tier
response = client.chat.completions.create(
model="gpt-5.5",
service_tier="flex", # the only required change
messages=[
{"role": "user", "content": "Analyze these test failures and suggest fixes: ..."}
]
)
Retry Logic for Flex 429 Errors
import time
from openai import RateLimitError
def flex_call_with_retry(messages, max_retries=5):
for attempt in range(max_retries):
try:
return client.chat.completions.create(
model="gpt-5.5",
service_tier="flex",
messages=messages
)
except RateLimitError:
wait = 2 ** attempt # exponential backoff: 1s, 2s, 4s, 8s, 16s
print(f"Flex 429 — retrying in {wait}s (attempt {attempt + 1}/{max_retries})")
time.sleep(wait)
raise RuntimeError("Flex call failed after max retries")
Stacking Discounts — Batch + Prompt Caching for Maximum Savings
Prompt caching and Batch API discounts stack multiplicatively, creating the most cost-efficient configuration available for high-volume GPT-5.5 workloads. OpenAI’s prompt caching automatically kicks in for prompts exceeding 1,024 tokens when the same prefix appears repeatedly — cached input tokens are priced at 50% of the standard input rate. When you’re already on Batch pricing ($2.50/1M input), cached tokens drop further to approximately $1.25/1M. For a team running nightly code reviews where the system prompt and codebase context stay constant across 500 PR reviews, the combined discount on the static prefix can approach 75% off standard rates. The key implementation detail is keeping your system prompt and shared context at the front of every request in the batch file, unchanged, so OpenAI’s caching infrastructure can recognize and serve the shared prefix from cache. Variable content — the specific diff or file being reviewed — goes at the end of the messages array.
SYSTEM_PROMPT = """You are a senior code reviewer for a Python backend team.
Our style guide: PEP 8, type hints required, no bare except clauses,
all public functions must have docstrings. Flag: security vulnerabilities,
N+1 query patterns, missing input validation, and hardcoded secrets."""
# This 150-token system prompt gets cached after the first request in the batch.
# All subsequent requests in the batch reuse it at ~$1.25/1M input tokens.
def make_batch_request(pr_id, diff):
return {
"custom_id": pr_id,
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": "gpt-5.5",
"messages": [
{"role": "system", "content": SYSTEM_PROMPT}, # cached prefix
{"role": "user", "content": f"Review this diff:\n\n{diff}"} # variable
]
}
}
Estimating Your Combined Savings
For 1M input tokens on a batch with 70% cached prefixes: 700K tokens at $1.25/1M + 300K tokens at $2.50/1M = $0.875 + $0.75 = $1.625 total, versus $5.00 at standard uncached rates. That’s a 67.5% reduction.
Real-World ROI: What 50% Savings Looks Like for a Dev Team
A 10-developer engineering team running AI-assisted workflows at scale provides a concrete reference point for the financial impact of switching batch-eligible workloads from Standard to Batch or Flex. Assume the team consumes 50M input tokens and 10M output tokens monthly — a realistic figure for teams running inline completions, code review bots, test generators, and documentation tools. At standard GPT-5.5 rates ($5/$30), that’s $250 input + $300 output = $550/month. If 50% of that workload is batch-eligible (a conservative estimate given the 40–60% industry benchmark), switching those jobs to Batch pricing reduces the eligible portion from $275 to $137.50 — a monthly saving of $137.50, or $1,650/year. For a team spending $2,000–$5,000/month on API costs, the savings scale proportionally. A team at $5,000/month with 50% batch-eligible workload saves $1,250/month — $15,000/year — without any change to output quality, since Batch jobs use the same model weights as Standard.
| Monthly Spend | Batch-Eligible % | Monthly Savings | Annual Savings |
|---|---|---|---|
| $500 | 50% | $125 | $1,500 |
| $2,000 | 50% | $500 | $6,000 |
| $5,000 | 50% | $1,250 | $15,000 |
| $10,000 | 60% | $3,000 | $36,000 |
The Cost-Neutral GPT-5.5 Upgrade
GPT-5.5 Batch pricing ($2.50/$15) equals GPT-5.4 standard pricing — meaning any team currently using GPT-5.4 can upgrade to GPT-5.5 Batch and get better benchmark performance (82.7% vs 75.1% on Terminal-Bench 2.0) at identical cost. This is the clearest no-compromise upgrade path available in the market as of April 2026.
GPT-5.5 Coding Benchmarks — Is the Upgrade Worth It?
GPT-5.5 demonstrates measurable improvements over GPT-5.4 across the benchmarks that matter most for software engineering tasks, making it the more capable model at the same effective cost when used with Batch pricing. On Terminal-Bench 2.0, which tests complex CLI workflows and multi-step shell interactions, GPT-5.5 scores 82.7% versus GPT-5.4’s 75.1% — a 7.6 percentage point improvement that translates directly to fewer retries and more reliable automated tooling. On SWE-Bench Pro, which evaluates real-world GitHub issue resolution with actual repository context, GPT-5.5 achieves 58.6%. On OSWorld-Verified, which measures autonomous computer environment operation (browser control, file system navigation, application interaction), GPT-5.5 reaches 78.7%. These gains matter for teams using AI in CI/CD pipelines: higher SWE-Bench scores mean the model resolves more issues on the first attempt, reducing the number of tokens consumed per successful fix — which compounds the cost savings from Batch pricing.
| Benchmark | GPT-5.4 | GPT-5.5 | Delta |
|---|---|---|---|
| Terminal-Bench 2.0 | 75.1% | 82.7% | +7.6pp |
| SWE-Bench Pro | — | 58.6% | — |
| OSWorld-Verified | — | 78.7% | — |
Token Efficiency Under Higher Benchmark Scores
A model that resolves an issue in one pass consumes fewer output tokens than one requiring two or three attempts. For Batch workloads where output costs dominate ($15/1M vs $2.50/1M input), higher first-pass accuracy has a direct, measurable impact on monthly spend beyond the 50% tier discount.
Limitations and Gotchas (24h SLA, 429 Errors, Token Overages)
Understanding the operational constraints of Batch and Flex mode is essential before routing production workloads through either tier. The Batch API’s 24-hour SLA is a hard ceiling, not a typical time — under normal load, most batches complete in 2–6 hours, but you must architect your pipeline to handle the full 24-hour window. Do not use Batch for any workflow where a stakeholder is waiting on the result today. Flex mode’s 429 errors are a more operationally complex issue: during peak platform load, Flex requests may be rejected entirely rather than queued, meaning your retry logic must handle outright failures, not just slow responses. The token overage pricing for prompts exceeding 272K tokens deserves special attention — at 2x input and 1.5x output for the entire session, a single oversized request can cost 3–4x what you expected. This is particularly relevant for large codebase analysis tasks where you might naively concatenate entire files into context. Batch API also has a 200MB file size limit and a 50,000 request cap per submission — teams with very large nightly jobs may need to split submissions across multiple batch files.
| Constraint | Batch | Flex |
|---|---|---|
| Max requests per job | 50,000 | N/A (per-call) |
| Max file size | 200MB | N/A |
| Completion SLA | 24 hours | Variable (seconds–minutes) |
| 429 errors possible? | No | Yes (peak traffic) |
| Prompt caching compatible? | Yes | Yes |
| Token overage threshold | 272K tokens (2x/1.5x pricing) | 272K tokens (2x/1.5x pricing) |
File Size Planning for Large Batch Jobs
At 200MB per file with average request sizes of 4KB (1,000-token prompt + metadata), you can fit approximately 50,000 requests — which coincides with the request cap. If your prompts are larger (10–20KB each due to large code context), the file size limit becomes the binding constraint before the request cap.
Final Verdict: How to Architect a Cost-Optimized GPT-5.5 Coding Pipeline
The optimal cost structure for a GPT-5.5 coding pipeline routes workloads across all four pricing tiers based on latency requirements and interactivity needs, with the goal of minimizing spend without sacrificing response quality or user experience. Every API call in your system should have an explicit tier assignment, not a default fallback to Standard. For any team serious about cost control, the practical architecture looks like this: route user-facing IDE completions and live chat to Standard or Priority; route all background agent loops, CI post-processing, and non-urgent enrichment to Flex with retry logic; route all scheduled jobs, nightly runs, eval pipelines, and embedding refreshes to Batch. Layer prompt caching on top of Batch for maximum compound savings. The result is a tiered system where only the smallest fraction of your requests — the truly latency-critical ones — pay full standard prices, while 50–70% of your volume runs at half cost or less. GPT-5.5’s superior benchmark scores mean that even on the cheaper tiers, you’re getting better results than you did from GPT-5.4 at standard pricing. The upgrade path is effectively cost-neutral for batch-heavy teams, and actively cost-reducing for teams that haven’t yet segmented their workloads by latency requirement.
Quick-start decision tree:
- Is a human actively waiting? → Standard or Priority
- Is it interactive but non-urgent (agent loop, CI step)? → Flex
- Is it a scheduled or async job? → Batch
- Does the batch have a shared prompt prefix >1,024 tokens? → Batch + Caching
FAQ
The most common questions about GPT-5.5 Batch API and Flex mode center on three practical concerns: which tier to use for which workload, how to handle operational edge cases like 429 errors and batch file limits, and whether the upgrade from GPT-5.4 is worth the migration effort. The short answer on the last point is yes — GPT-5.5 Batch pricing equals GPT-5.4 standard pricing ($2.50/$15 per 1M tokens), so any team running background workloads gets a capability upgrade at zero incremental cost. GPT-5.5 launched on April 23, 2026, with measurably higher coding benchmark scores than its predecessor. The answers below address the most common implementation questions from engineering teams evaluating both tiers, covering Python SDK integration, retry logic for Flex 429 errors, prompt caching compatibility, and the concrete ROI case for switching batch-eligible workloads. Each answer is written to stand alone without requiring context from earlier in the article.
What is the difference between GPT-5.5 Batch API and Flex mode?
Batch API is fully asynchronous — you submit a file of up to 50,000 requests and receive results within 24 hours with no real-time interaction. Flex mode is interactive: you make standard API calls but with service_tier="flex", and responses arrive with variable latency (seconds to minutes) rather than the consistent speed of the Standard tier. Both cost $2.50/1M input and $15/1M output — 50% off Standard rates.
Can I use GPT-5.5 Batch API in my existing CI/CD pipeline?
Yes. The Batch API integrates with any CI/CD system that can run Python or Node.js scripts. The typical pattern is: (1) generate the JSONL request file at the end of a build, (2) submit the batch, (3) store the batch ID, (4) have the next day’s build or a separate scheduled job poll for completion and download results. Do not block the current pipeline on batch completion — treat it as a separate async workflow.
Does prompt caching work with GPT-5.5 Batch API?
Yes, prompt caching and Batch API discounts stack. Cached input tokens (prefixes exceeding 1,024 tokens that repeat across requests in a batch) are priced at approximately $1.25/1M — 75% off standard rates. Keep your system prompt and shared context as a fixed prefix at the top of every batch request to maximize cache hit rates.
What happens if a Flex request gets a 429 error?
A 429 during Flex processing means the platform is under high load and your request was not queued. Implement exponential backoff: wait 1 second after the first failure, 2 seconds after the second, 4 after the third, and so on up to a configured maximum. If all retries are exhausted, fall back to the Standard tier for that specific request. Never use Flex for user-facing requests where a 429 would break the user experience.
Is GPT-5.5 better than GPT-5.4 for coding tasks at the same cost?
Yes, when using Batch pricing. GPT-5.5 Batch pricing ($2.50/$15 per 1M tokens) equals GPT-5.4 Standard pricing, but GPT-5.5 scores 82.7% on Terminal-Bench 2.0 versus GPT-5.4’s 75.1%. For teams currently using GPT-5.4 at standard rates and running any batch-eligible workloads, switching to GPT-5.5 Batch delivers higher capability at identical or lower cost — a direct upgrade with no tradeoffs.
