<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Gpt-5.5 on RockB</title><link>https://baeseokjae.github.io/tags/gpt-5.5/</link><description>Recent content in Gpt-5.5 on RockB</description><image><title>RockB</title><url>https://baeseokjae.github.io/images/og-default.png</url><link>https://baeseokjae.github.io/images/og-default.png</link></image><generator>Hugo</generator><language>en-us</language><lastBuildDate>Sat, 25 Apr 2026 12:04:50 +0000</lastBuildDate><atom:link href="https://baeseokjae.github.io/tags/gpt-5.5/index.xml" rel="self" type="application/rss+xml"/><item><title>GPT-5.5 Batch API and Flex Mode: 50% Cost Savings for High-Volume AI Coding Tasks</title><link>https://baeseokjae.github.io/posts/gpt-5-5-batch-flex-pricing-guide-2026/</link><pubDate>Sat, 25 Apr 2026 12:04:50 +0000</pubDate><guid>https://baeseokjae.github.io/posts/gpt-5-5-batch-flex-pricing-guide-2026/</guid><description>GPT-5.5 Batch and Flex mode cut your API bill by 50%. Learn which coding workflows qualify and how to implement batch jobs in Python.</description><content:encoded><![CDATA[<p>GPT-5.5 Batch API and Flex mode both offer 50% off standard pricing — $2.50 per 1M input tokens and $15 per 1M output tokens versus the standard $5/$30 — giving high-volume AI coding teams a direct path to halving their monthly API spend without changing models or degrading output quality.</p>
<h2 id="what-is-gpt-55-batch-api-and-flex-mode">What Is GPT-5.5 Batch API and Flex Mode?</h2>
<p>GPT-5.5 Batch API and Flex mode are two distinct pricing and execution tiers from OpenAI that both deliver 50% cost savings compared to standard API rates, but differ significantly in how and when results are returned. The Batch API is a fire-and-forget system: you submit up to 50,000 requests in a single JSONL file (up to 200MB), and OpenAI guarantees results within 24 hours. Flex mode, currently in beta as of April 2026, is interactive — requests are processed in real time but with variable latency ranging from a few seconds to several minutes, depending on platform load. GPT-5.5 launched on April 23, 2026, at standard pricing of $5 per 1M input tokens and $30 per 1M output tokens. Both Batch and Flex bring that cost down to $2.50/$15 — the same price as GPT-5.4 standard, but with GPT-5.5&rsquo;s higher capability, including an 82.7% score on Terminal-Bench 2.0 and 58.6% on SWE-Bench Pro. For engineering teams running nightly code reviews, eval pipelines, or test generation jobs, the practical implication is straightforward: you get a better model at the same cost you were already paying.</p>
<h3 id="batch-vs-flex-the-core-distinction">Batch vs Flex: The Core Distinction</h3>
<p>Batch is fully asynchronous with a 24-hour SLA and no interaction mid-job. Flex is interactive but non-priority — you may encounter HTTP 429 errors during peak traffic windows. Neither tier is suitable for production user-facing requests where sub-second latency is required.</p>
<h2 id="gpt-55-pricing-tiers-at-a-glance-standard-vs-flex-vs-batch-vs-priority">GPT-5.5 Pricing Tiers at a Glance (Standard vs Flex vs Batch vs Priority)</h2>
<p>OpenAI now offers four pricing and execution tiers for GPT-5.5, each targeting a different latency-cost tradeoff. Priority tier sits at the top of the speed stack — requests jump the queue for the fastest possible response, priced at a premium above standard rates. Standard tier at $5 per 1M input and $30 per 1M output is the default for most API calls today. Flex and Batch both land at $2.50/$15 — exactly 50% off standard — but serve different use cases. Flex accepts interactive API calls with variable latency, making it usable inside agent loops or CI/CD pipelines where a few extra seconds per call is acceptable. Batch, by contrast, is non-interactive: you upload a file, wait up to 24 hours, and download the results. One important pricing edge case: prompts exceeding 272K tokens are charged at 2x input and 1.5x output rates for the entire session — plan your context window sizes accordingly for large codebase analysis tasks.</p>
<table>
  <thead>
      <tr>
          <th>Tier</th>
          <th>Input (per 1M)</th>
          <th>Output (per 1M)</th>
          <th>Latency</th>
          <th>Interactive?</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Priority</td>
          <td>&gt;$5</td>
          <td>&gt;$30</td>
          <td>Fastest</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Standard</td>
          <td>$5.00</td>
          <td>$30.00</td>
          <td>Fast</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Flex</td>
          <td>$2.50</td>
          <td>$15.00</td>
          <td>Seconds–minutes</td>
          <td>Yes (beta)</td>
      </tr>
      <tr>
          <td>Batch</td>
          <td>$2.50</td>
          <td>$15.00</td>
          <td>Up to 24h</td>
          <td>No</td>
      </tr>
  </tbody>
</table>
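<p>To see how the tiers interact with the 272K-token overage rule described above, here is a small cost sketch. The rates and multipliers come from this article; the function itself is just illustrative arithmetic:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Estimate GPT-5.5 cost for a workload at a given tier.
# Rates ($ per 1M tokens) are taken from the pricing table above.
RATES = {"standard": (5.00, 30.00), "flex": (2.50, 15.00), "batch": (2.50, 15.00)}

def workload_cost(tier, input_tokens, output_tokens, prompt_tokens=0):
    in_rate, out_rate = RATES[tier]
    if prompt_tokens &gt; 272_000:
        # Oversized prompts are billed at 2x input / 1.5x output for the session.
        in_rate, out_rate = in_rate * 2, out_rate * 1.5
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

print(workload_cost("standard", 50_000_000, 10_000_000))  # 550.0
print(workload_cost("batch", 50_000_000, 10_000_000))     # 275.0
</code></pre></div>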
<h3 id="when-priority-makes-sense">When Priority Makes Sense</h3>
<p>Priority is reserved for latency-critical production paths where a delayed response directly impacts user experience — think real-time IDE completions or live pair programming assistants. Everything else should flow down the pricing tiers.</p>
<h2 id="which-ai-coding-workflows-qualify-for-batch-or-flex-mode">Which AI Coding Workflows Qualify for Batch or Flex Mode?</h2>
<p>Usage analysis of common API consumption patterns suggests that 40–60% of a typical engineering team&rsquo;s API workload is batch-eligible — a substantial portion of spend that most teams are leaving on the table. The key qualifying criterion for Batch is that the task can tolerate async results: you don&rsquo;t need the answer in real time. For Flex, the criterion is softer: the task is interactive but not latency-critical — a few extra seconds or minutes is acceptable. Concrete batch-eligible workflows include nightly code review runs across the entire diff since the last merge, automated unit test generation during off-hours CI jobs, eval grading pipelines for fine-tuning or regression testing, embedding refreshes when documentation or codebase content changes, and PR summary generation for engineering digests. Flex-eligible workflows include agent loops that chain multiple model calls where intermediate latency isn&rsquo;t user-visible, data enrichment tasks that run in the background during active development, and CI/CD steps that run post-merge rather than blocking the merge queue. Standard or Priority should be reserved for inline IDE completions, live chat interfaces, real-time code explanations, and any workflow where a human is actively waiting on the response.</p>
<table>
  <thead>
      <tr>
          <th>Workflow</th>
          <th>Recommended Tier</th>
          <th>Reason</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Nightly code review</td>
          <td>Batch</td>
          <td>Async, no user waiting</td>
      </tr>
      <tr>
          <td>Test generation in CI</td>
          <td>Batch / Flex</td>
          <td>Off-critical path</td>
      </tr>
      <tr>
          <td>Embedding refresh</td>
          <td>Batch</td>
          <td>Pure throughput</td>
      </tr>
      <tr>
          <td>Eval grading</td>
          <td>Batch</td>
          <td>Fire-and-forget</td>
      </tr>
      <tr>
          <td>Agent loop (internal calls)</td>
          <td>Flex</td>
          <td>Interactive but non-urgent</td>
      </tr>
      <tr>
          <td>PR summary digest</td>
          <td>Batch</td>
          <td>Scheduled job</td>
      </tr>
      <tr>
          <td>Inline IDE completion</td>
          <td>Standard / Priority</td>
          <td>User is actively waiting</td>
      </tr>
      <tr>
          <td>Live chat assistant</td>
          <td>Standard</td>
          <td>Latency-sensitive</td>
      </tr>
  </tbody>
</table>
<h3 id="how-to-audit-your-current-api-usage">How to Audit Your Current API Usage</h3>
<p>Pull your OpenAI usage logs for the last 30 days and tag each request type as &ldquo;user-facing&rdquo; or &ldquo;background.&rdquo; Any background request is a Batch or Flex candidate. Most teams find 40–60% of their volume is immediately reclassifiable.</p>
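<p>A minimal sketch of that audit, assuming you have exported usage data to a CSV with one row per request and tagged each row yourself (the file name and column names here are hypothetical):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import csv
from collections import Counter

# Tags you assign to each request source when exporting usage data.
USER_FACING = {"ide-completion", "live-chat"}

counts = Counter()
with open("usage_export.csv") as f:  # hypothetical export file
    for row in csv.DictReader(f):
        bucket = "user-facing" if row["endpoint_tag"] in USER_FACING else "background"
        counts[bucket] += int(row["total_tokens"])

total = sum(counts.values())
for bucket, tokens in counts.items():
    # The background share is your Batch/Flex candidate pool.
    print(f"{bucket}: {tokens / total:.0%} of token volume")
</code></pre></div>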
<h2 id="how-to-implement-gpt-55-batch-api-in-python-step-by-step">How to Implement GPT-5.5 Batch API in Python (Step-by-Step)</h2>
<p>The GPT-5.5 Batch API requires <code>openai&gt;=2.1.0</code> and follows a three-step pattern: upload a JSONL file containing your requests, submit the batch job, then poll for completion and download the results. Each line in the JSONL file is a self-contained API request object with a custom <code>custom_id</code> for result matching. The system supports up to 50,000 requests per file and files up to 200MB, with all results guaranteed within 24 hours. Here is a complete working implementation for running nightly code reviews across a list of diffs:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> json
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> time
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI()  <span style="color:#75715e"># reads OPENAI_API_KEY from environment</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Step 1: Prepare the JSONL batch file</span>
</span></span><span style="display:flex;"><span>diffs <span style="color:#f92672">=</span> [
</span></span><span style="display:flex;"><span>    {<span style="color:#e6db74">&#34;id&#34;</span>: <span style="color:#e6db74">&#34;pr-101&#34;</span>, <span style="color:#e6db74">&#34;diff&#34;</span>: <span style="color:#e6db74">&#34;...&lt;git diff content&gt;...&#34;</span>},
</span></span><span style="display:flex;"><span>    {<span style="color:#e6db74">&#34;id&#34;</span>: <span style="color:#e6db74">&#34;pr-102&#34;</span>, <span style="color:#e6db74">&#34;diff&#34;</span>: <span style="color:#e6db74">&#34;...&lt;git diff content&gt;...&#34;</span>},
</span></span><span style="display:flex;"><span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">with</span> open(<span style="color:#e6db74">&#34;batch_requests.jsonl&#34;</span>, <span style="color:#e6db74">&#34;w&#34;</span>) <span style="color:#66d9ef">as</span> f:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> item <span style="color:#f92672">in</span> diffs:
</span></span><span style="display:flex;"><span>        request <span style="color:#f92672">=</span> {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;custom_id&#34;</span>: item[<span style="color:#e6db74">&#34;id&#34;</span>],
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;method&#34;</span>: <span style="color:#e6db74">&#34;POST&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;url&#34;</span>: <span style="color:#e6db74">&#34;/v1/chat/completions&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;body&#34;</span>: {
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;model&#34;</span>: <span style="color:#e6db74">&#34;gpt-5.5&#34;</span>,
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;messages&#34;</span>: [
</span></span><span style="display:flex;"><span>                    {
</span></span><span style="display:flex;"><span>                        <span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;system&#34;</span>,
</span></span><span style="display:flex;"><span>                        <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">&#34;You are a senior code reviewer. Identify bugs, security issues, and style violations.&#34;</span>
</span></span><span style="display:flex;"><span>                    },
</span></span><span style="display:flex;"><span>                    {
</span></span><span style="display:flex;"><span>                        <span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>,
</span></span><span style="display:flex;"><span>                        <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Review this diff:</span><span style="color:#ae81ff">\n\n</span><span style="color:#e6db74">{</span>item[<span style="color:#e6db74">&#39;diff&#39;</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>
</span></span><span style="display:flex;"><span>                    }
</span></span><span style="display:flex;"><span>                ],
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;max_tokens&#34;</span>: <span style="color:#ae81ff">1024</span>
</span></span><span style="display:flex;"><span>            }
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>        f<span style="color:#f92672">.</span>write(json<span style="color:#f92672">.</span>dumps(request) <span style="color:#f92672">+</span> <span style="color:#e6db74">&#34;</span><span style="color:#ae81ff">\n</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Step 2: Upload the file</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">with</span> open(<span style="color:#e6db74">&#34;batch_requests.jsonl&#34;</span>, <span style="color:#e6db74">&#34;rb&#34;</span>) <span style="color:#66d9ef">as</span> f:
</span></span><span style="display:flex;"><span>    upload <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>files<span style="color:#f92672">.</span>create(file<span style="color:#f92672">=</span>f, purpose<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;batch&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Uploaded file: </span><span style="color:#e6db74">{</span>upload<span style="color:#f92672">.</span>id<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Step 3: Submit the batch job</span>
</span></span><span style="display:flex;"><span>batch <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>batches<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    input_file_id<span style="color:#f92672">=</span>upload<span style="color:#f92672">.</span>id,
</span></span><span style="display:flex;"><span>    endpoint<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;/v1/chat/completions&#34;</span>,
</span></span><span style="display:flex;"><span>    completion_window<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;24h&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Batch submitted: </span><span style="color:#e6db74">{</span>batch<span style="color:#f92672">.</span>id<span style="color:#e6db74">}</span><span style="color:#e6db74"> — status: </span><span style="color:#e6db74">{</span>batch<span style="color:#f92672">.</span>status<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Step 4: Poll for completion (in production, use a scheduled job instead)</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">while</span> batch<span style="color:#f92672">.</span>status <span style="color:#f92672">not</span> <span style="color:#f92672">in</span> (<span style="color:#e6db74">&#34;completed&#34;</span>, <span style="color:#e6db74">&#34;failed&#34;</span>, <span style="color:#e6db74">&#34;cancelled&#34;</span>, <span style="color:#e6db74">&#34;expired&#34;</span>):
</span></span><span style="display:flex;"><span>    time<span style="color:#f92672">.</span>sleep(<span style="color:#ae81ff">60</span>)
</span></span><span style="display:flex;"><span>    batch <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>batches<span style="color:#f92672">.</span>retrieve(batch<span style="color:#f92672">.</span>id)
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Status: </span><span style="color:#e6db74">{</span>batch<span style="color:#f92672">.</span>status<span style="color:#e6db74">}</span><span style="color:#e6db74"> — completed: </span><span style="color:#e6db74">{</span>batch<span style="color:#f92672">.</span>request_counts<span style="color:#f92672">.</span>completed<span style="color:#e6db74">}</span><span style="color:#e6db74">/</span><span style="color:#e6db74">{</span>batch<span style="color:#f92672">.</span>request_counts<span style="color:#f92672">.</span>total<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Step 5: Download and process results</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">if</span> batch<span style="color:#f92672">.</span>status <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;completed&#34;</span>:
</span></span><span style="display:flex;"><span>    content <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>files<span style="color:#f92672">.</span>content(batch<span style="color:#f92672">.</span>output_file_id)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> line <span style="color:#f92672">in</span> content<span style="color:#f92672">.</span>text<span style="color:#f92672">.</span>strip()<span style="color:#f92672">.</span>split(<span style="color:#e6db74">&#34;</span><span style="color:#ae81ff">\n</span><span style="color:#e6db74">&#34;</span>):
</span></span><span style="display:flex;"><span>        result <span style="color:#f92672">=</span> json<span style="color:#f92672">.</span>loads(line)
</span></span><span style="display:flex;"><span>        custom_id <span style="color:#f92672">=</span> result[<span style="color:#e6db74">&#34;custom_id&#34;</span>]
</span></span><span style="display:flex;"><span>        review_text <span style="color:#f92672">=</span> result[<span style="color:#e6db74">&#34;response&#34;</span>][<span style="color:#e6db74">&#34;body&#34;</span>][<span style="color:#e6db74">&#34;choices&#34;</span>][<span style="color:#ae81ff">0</span>][<span style="color:#e6db74">&#34;message&#34;</span>][<span style="color:#e6db74">&#34;content&#34;</span>]
</span></span><span style="display:flex;"><span>        print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;</span><span style="color:#ae81ff">\n</span><span style="color:#e6db74">--- Review for </span><span style="color:#e6db74">{</span>custom_id<span style="color:#e6db74">}</span><span style="color:#e6db74"> ---</span><span style="color:#ae81ff">\n</span><span style="color:#e6db74">{</span>review_text<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span></code></pre></div><h3 id="error-handling-and-partial-failures">Error Handling and Partial Failures</h3>
<p>Batch jobs can partially succeed — individual requests may fail while others complete. Always check <code>batch.error_file_id</code> after completion and download the error file alongside the output file. Log failed <code>custom_id</code> values and resubmit them in the next batch cycle.</p>
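<p>Continuing from the variables in the implementation above, a sketch of that resubmission pattern (the exact fields on each error line can vary, so treat the field access below as illustrative):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># After the polling loop: collect failed requests for the next cycle.
if batch.status == "completed" and batch.error_file_id:
    error_content = client.files.content(batch.error_file_id)
    failed_ids = []
    for line in error_content.text.strip().split("\n"):
        err = json.loads(line)
        failed_ids.append(err["custom_id"])
        print(f"Failed: {err['custom_id']}: {err.get('error')}")
    # Re-queue the matching diffs for the next nightly batch run.
    retry_queue = [d for d in diffs if d["id"] in failed_ids]
</code></pre></div>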
<h2 id="flex-processing-when-to-choose-it-over-batch">Flex Processing: When to Choose It Over Batch</h2>
<p>Flex processing is OpenAI&rsquo;s interactive-but-discounted tier, currently in beta for GPT-5.5, o3, and o4-mini as of April 2026. It cuts standard rates by 50% while preserving the real-time request-response pattern — meaning your code calls <code>client.chat.completions.create()</code> normally, but with <code>service_tier=&quot;flex&quot;</code> added. The tradeoff is variable latency: responses arrive within seconds under low load, but can take several minutes when the platform is busy. Flex may also return HTTP 429 errors during peak windows, so retry logic is mandatory. The practical use case is agent pipelines where model calls happen in a background thread or async queue — the loop keeps running, and a slightly delayed intermediate response doesn&rsquo;t break anything. A CI/CD step that analyzes test failures after a build completes is a good Flex candidate: the developer isn&rsquo;t watching the terminal, and a 90-second response versus a 3-second one is irrelevant.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Using Flex mode — same SDK call, different service tier</span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>chat<span style="color:#f92672">.</span>completions<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-5.5&#34;</span>,
</span></span><span style="display:flex;"><span>    service_tier<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;flex&#34;</span>,  <span style="color:#75715e"># the only required change</span>
</span></span><span style="display:flex;"><span>    messages<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>        {<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">&#34;Analyze these test failures and suggest fixes: ...&#34;</span>}
</span></span><span style="display:flex;"><span>    ]
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><h3 id="retry-logic-for-flex-429-errors">Retry Logic for Flex 429 Errors</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> time
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> RateLimitError
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">flex_call_with_retry</span>(messages, max_retries<span style="color:#f92672">=</span><span style="color:#ae81ff">5</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> attempt <span style="color:#f92672">in</span> range(max_retries):
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">try</span>:
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">return</span> client<span style="color:#f92672">.</span>chat<span style="color:#f92672">.</span>completions<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>                model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-5.5&#34;</span>,
</span></span><span style="display:flex;"><span>                service_tier<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;flex&#34;</span>,
</span></span><span style="display:flex;"><span>                messages<span style="color:#f92672">=</span>messages
</span></span><span style="display:flex;"><span>            )
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">except</span> RateLimitError:
</span></span><span style="display:flex;"><span>            wait <span style="color:#f92672">=</span> <span style="color:#ae81ff">2</span> <span style="color:#f92672">**</span> attempt  <span style="color:#75715e"># exponential backoff: 1s, 2s, 4s, 8s, 16s</span>
</span></span><span style="display:flex;"><span>            print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Flex 429 — retrying in </span><span style="color:#e6db74">{</span>wait<span style="color:#e6db74">}</span><span style="color:#e6db74">s (attempt </span><span style="color:#e6db74">{</span>attempt <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span><span style="color:#e6db74">}</span><span style="color:#e6db74">/</span><span style="color:#e6db74">{</span>max_retries<span style="color:#e6db74">}</span><span style="color:#e6db74">)&#34;</span>)
</span></span><span style="display:flex;"><span>            time<span style="color:#f92672">.</span>sleep(wait)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">raise</span> <span style="color:#a6e22e">RuntimeError</span>(<span style="color:#e6db74">&#34;Flex call failed after max retries&#34;</span>)
</span></span></code></pre></div><h2 id="stacking-discounts--batch--prompt-caching-for-maximum-savings">Stacking Discounts — Batch + Prompt Caching for Maximum Savings</h2>
<p>Prompt caching and Batch API discounts stack multiplicatively, creating the most cost-efficient configuration available for high-volume GPT-5.5 workloads. OpenAI&rsquo;s prompt caching automatically kicks in for prompts exceeding 1,024 tokens when the same prefix appears repeatedly — cached input tokens are priced at 50% of the standard input rate. When you&rsquo;re already on Batch pricing ($2.50/1M input), cached tokens drop further to approximately $1.25/1M. For a team running nightly code reviews where the system prompt and codebase context stay constant across 500 PR reviews, the combined discount on the static prefix can approach 75% off standard rates. The key implementation detail is keeping your system prompt and shared context at the front of every request in the batch file, unchanged, so OpenAI&rsquo;s caching infrastructure can recognize and serve the shared prefix from cache. Variable content — the specific diff or file being reviewed — goes at the end of the messages array.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>SYSTEM_PROMPT <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;&#34;&#34;You are a senior code reviewer for a Python backend team.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">Our style guide: PEP 8, type hints required, no bare except clauses,
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">all public functions must have docstrings. Flag: security vulnerabilities,
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">N+1 query patterns, missing input validation, and hardcoded secrets.&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Caching engages only once the shared prefix exceeds 1,024 tokens, so in</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># practice you pair this short system prompt with shared codebase context.</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Later requests in the batch then reuse the cached prefix at ~$1.25/1M input tokens.</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">make_batch_request</span>(pr_id, diff):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> {
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;custom_id&#34;</span>: pr_id,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;method&#34;</span>: <span style="color:#e6db74">&#34;POST&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;url&#34;</span>: <span style="color:#e6db74">&#34;/v1/chat/completions&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;body&#34;</span>: {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;model&#34;</span>: <span style="color:#e6db74">&#34;gpt-5.5&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;messages&#34;</span>: [
</span></span><span style="display:flex;"><span>                {<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;system&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: SYSTEM_PROMPT},  <span style="color:#75715e"># cached prefix</span>
</span></span><span style="display:flex;"><span>                {<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Review this diff:</span><span style="color:#ae81ff">\n\n</span><span style="color:#e6db74">{</span>diff<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>}  <span style="color:#75715e"># variable</span>
</span></span><span style="display:flex;"><span>            ]
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>    }
</span></span></code></pre></div><h3 id="estimating-your-combined-savings">Estimating Your Combined Savings</h3>
<p>For 1M input tokens on a batch with 70% cached prefixes: 700K tokens at $1.25/1M + 300K tokens at $2.50/1M = $0.875 + $0.75 = <strong>$1.625 total</strong>, versus $5.00 at standard uncached rates. That&rsquo;s a 67.5% reduction.</p>
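<p>The same arithmetic in a few lines of Python, so you can plug in your own cache hit rate:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Blended input cost per 1M tokens: Batch pricing with a 70% cached-prefix share.
cached, batch_rate, standard = 1.25, 2.50, 5.00  # $ per 1M input tokens
hit_rate = 0.70
blended = hit_rate * cached + (1 - hit_rate) * batch_rate
print(f"${blended:.3f}/1M input")                    # $1.625/1M input
print(f"{1 - blended / standard:.1%} off standard")  # 67.5% off standard
</code></pre></div>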
<h2 id="real-world-roi-what-50-savings-looks-like-for-a-dev-team">Real-World ROI: What 50% Savings Looks Like for a Dev Team</h2>
<p>A 10-developer engineering team running AI-assisted workflows at scale provides a concrete reference point for the financial impact of switching batch-eligible workloads from Standard to Batch or Flex. Assume the team consumes 50M input tokens and 10M output tokens monthly — a realistic figure for teams running inline completions, code review bots, test generators, and documentation tools. At standard GPT-5.5 rates ($5/$30), that&rsquo;s $250 input + $300 output = <strong>$550/month</strong>. If 50% of that workload is batch-eligible (a conservative estimate given the 40–60% industry benchmark), switching those jobs to Batch pricing reduces the eligible portion from $275 to $137.50 — a monthly saving of <strong>$137.50</strong>, or <strong>$1,650/year</strong>. For a team spending $2,000–$5,000/month on API costs, the savings scale proportionally. A team at $5,000/month with 50% batch-eligible workload saves <strong>$1,250/month</strong> — $15,000/year — without any change to output quality, since Batch jobs use the same model weights as Standard.</p>
<table>
  <thead>
      <tr>
          <th>Monthly Spend</th>
          <th>Batch-Eligible %</th>
          <th>Monthly Savings</th>
          <th>Annual Savings</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>$500</td>
          <td>50%</td>
          <td>$125</td>
          <td>$1,500</td>
      </tr>
      <tr>
          <td>$2,000</td>
          <td>50%</td>
          <td>$500</td>
          <td>$6,000</td>
      </tr>
      <tr>
          <td>$5,000</td>
          <td>50%</td>
          <td>$1,250</td>
          <td>$15,000</td>
      </tr>
      <tr>
          <td>$10,000</td>
          <td>60%</td>
          <td>$3,000</td>
          <td>$36,000</td>
      </tr>
  </tbody>
</table>
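<p>The table rows reduce to one line of arithmetic, since Batch halves the eligible share of spend:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">def monthly_savings(monthly_spend: float, batch_eligible: float) -&gt; float:
    """Savings from moving the batch-eligible share to 50%-off pricing."""
    return monthly_spend * batch_eligible * 0.5

print(monthly_savings(5_000, 0.50))   # 1250.0
print(monthly_savings(10_000, 0.60))  # 3000.0
</code></pre></div>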
<h3 id="the-cost-neutral-gpt-55-upgrade">The Cost-Neutral GPT-5.5 Upgrade</h3>
<p>GPT-5.5 Batch pricing ($2.50/$15) equals GPT-5.4 standard pricing — meaning any team currently using GPT-5.4 can upgrade to GPT-5.5 Batch and get better benchmark performance (82.7% vs 75.1% on Terminal-Bench 2.0) at identical cost. This is the clearest no-compromise upgrade path available in the market as of April 2026.</p>
<h2 id="gpt-55-coding-benchmarks--is-the-upgrade-worth-it">GPT-5.5 Coding Benchmarks — Is the Upgrade Worth It?</h2>
<p>GPT-5.5 demonstrates measurable improvements over GPT-5.4 across the benchmarks that matter most for software engineering tasks, making it the more capable model at the same effective cost when used with Batch pricing. On Terminal-Bench 2.0, which tests complex CLI workflows and multi-step shell interactions, GPT-5.5 scores 82.7% versus GPT-5.4&rsquo;s 75.1% — a 7.6 percentage point improvement that translates directly to fewer retries and more reliable automated tooling. On SWE-Bench Pro, which evaluates real-world GitHub issue resolution with actual repository context, GPT-5.5 achieves 58.6%. On OSWorld-Verified, which measures autonomous computer environment operation (browser control, file system navigation, application interaction), GPT-5.5 reaches 78.7%. These gains matter for teams using AI in CI/CD pipelines: higher SWE-Bench scores mean the model resolves more issues on the first attempt, reducing the number of tokens consumed per successful fix — which compounds the cost savings from Batch pricing.</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>GPT-5.4</th>
          <th>GPT-5.5</th>
          <th>Delta</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Terminal-Bench 2.0</td>
          <td>75.1%</td>
          <td>82.7%</td>
          <td>+7.6pp</td>
      </tr>
      <tr>
          <td>SWE-Bench Pro</td>
          <td>—</td>
          <td>58.6%</td>
          <td>—</td>
      </tr>
      <tr>
          <td>OSWorld-Verified</td>
          <td>—</td>
          <td>78.7%</td>
          <td>—</td>
      </tr>
  </tbody>
</table>
<h3 id="token-efficiency-under-higher-benchmark-scores">Token Efficiency Under Higher Benchmark Scores</h3>
<p>A model that resolves an issue in one pass consumes fewer output tokens than one requiring two or three attempts. For Batch workloads where output costs dominate ($15/1M vs $2.50/1M input), higher first-pass accuracy has a direct, measurable impact on monthly spend beyond the 50% tier discount.</p>
<h2 id="limitations-and-gotchas-24h-sla-429-errors-token-overages">Limitations and Gotchas (24h SLA, 429 Errors, Token Overages)</h2>
<p>Understanding the operational constraints of Batch and Flex mode is essential before routing production workloads through either tier. The Batch API&rsquo;s 24-hour SLA is a hard ceiling, not a typical time — under normal load, most batches complete in 2–6 hours, but you must architect your pipeline to handle the full 24-hour window. Do not use Batch for any workflow where a stakeholder is waiting on the result today. Flex mode&rsquo;s 429 errors are a more operationally complex issue: during peak platform load, Flex requests may be rejected entirely rather than queued, meaning your retry logic must handle outright failures, not just slow responses. The token overage pricing for prompts exceeding 272K tokens deserves special attention — at 2x input and 1.5x output for the entire session, a single oversized request can cost 3–4x what you expected. This is particularly relevant for large codebase analysis tasks where you might naively concatenate entire files into context. Batch API also has a 200MB file size limit and a 50,000 request cap per submission — teams with very large nightly jobs may need to split submissions across multiple batch files.</p>
<table>
  <thead>
      <tr>
          <th>Constraint</th>
          <th>Batch</th>
          <th>Flex</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Max requests per job</td>
          <td>50,000</td>
          <td>N/A (per-call)</td>
      </tr>
      <tr>
          <td>Max file size</td>
          <td>200MB</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>Completion SLA</td>
          <td>24 hours</td>
          <td>Variable (seconds–minutes)</td>
      </tr>
      <tr>
          <td>429 errors possible?</td>
          <td>No</td>
          <td>Yes (peak traffic)</td>
      </tr>
      <tr>
          <td>Prompt caching compatible?</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Token overage threshold</td>
          <td>272K tokens (2x/1.5x pricing)</td>
          <td>272K tokens (2x/1.5x pricing)</td>
      </tr>
  </tbody>
</table>
<h3 id="file-size-planning-for-large-batch-jobs">File Size Planning for Large Batch Jobs</h3>
<p>At 200MB per file with average request sizes of 4KB (1,000-token prompt + metadata), you can fit approximately 50,000 requests — which coincides with the request cap. If your prompts are larger (10–20KB each due to large code context), the file size limit becomes the binding constraint before the request cap.</p>
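<p>When a nightly job risks hitting either limit, split the JSONL before upload. A minimal sketch using the limits from this article (the splitting logic itself is illustrative):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">MAX_REQUESTS = 50_000
MAX_BYTES = 200 * 1024 * 1024  # 200MB Batch API file limit

def split_batch_file(path, prefix="batch_part"):
    """Split a JSONL file so every part respects both Batch API limits."""
    part, count, size = 0, 0, 0
    out = open(f"{prefix}_{part}.jsonl", "w")
    with open(path) as src:
        for line in src:
            line_bytes = len(line.encode("utf-8"))
            if count &gt;= MAX_REQUESTS or size + line_bytes &gt; MAX_BYTES:
                out.close()
                part, count, size = part + 1, 0, 0
                out = open(f"{prefix}_{part}.jsonl", "w")
            out.write(line)
            count += 1
            size += line_bytes
    out.close()
    return part + 1  # number of files written
</code></pre></div>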
<h2 id="final-verdict-how-to-architect-a-cost-optimized-gpt-55-coding-pipeline">Final Verdict: How to Architect a Cost-Optimized GPT-5.5 Coding Pipeline</h2>
<p>The optimal cost structure for a GPT-5.5 coding pipeline routes workloads across all four pricing tiers based on latency requirements and interactivity needs, with the goal of minimizing spend without sacrificing response quality or user experience. Every API call in your system should have an explicit tier assignment, not a default fallback to Standard. For any team serious about cost control, the practical architecture looks like this: route user-facing IDE completions and live chat to Standard or Priority; route all background agent loops, CI post-processing, and non-urgent enrichment to Flex with retry logic; route all scheduled jobs, nightly runs, eval pipelines, and embedding refreshes to Batch. Layer prompt caching on top of Batch for maximum compound savings. The result is a tiered system where only the smallest fraction of your requests — the truly latency-critical ones — pay full standard prices, while 50–70% of your volume runs at half cost or less. GPT-5.5&rsquo;s superior benchmark scores mean that even on the cheaper tiers, you&rsquo;re getting better results than you did from GPT-5.4 at standard pricing. The upgrade path is effectively cost-neutral for batch-heavy teams, and actively cost-reducing for teams that haven&rsquo;t yet segmented their workloads by latency requirement.</p>
<p><strong>Quick-start decision tree</strong> (a routing sketch in code follows the list):</p>
<ol>
<li>Is a human actively waiting? → <strong>Standard or Priority</strong></li>
<li>Is it interactive but non-urgent (agent loop, CI step)? → <strong>Flex</strong></li>
<li>Is it a scheduled or async job? → <strong>Batch</strong></li>
<li>Does the batch have a shared prompt prefix &gt;1,024 tokens? → <strong>Batch + Caching</strong></li>
</ol>
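<p>The routing sketch referenced above, with the same four branches (function and argument names are illustrative):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">def choose_tier(human_waiting: bool, interactive: bool,
                shared_prefix_tokens: int = 0) -&gt; str:
    """Mirror of the quick-start decision tree above."""
    if human_waiting:
        return "standard"  # or "priority" on latency-critical paths
    if interactive:
        return "flex"      # agent loops, post-merge CI steps
    # Scheduled or async jobs go to Batch; caching stacks on a big shared prefix.
    return "batch+caching" if shared_prefix_tokens &gt; 1024 else "batch"
</code></pre></div>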
<hr>
<h2 id="faq">FAQ</h2>
<p>The most common questions about GPT-5.5 Batch API and Flex mode center on three practical concerns: which tier to use for which workload, how to handle operational edge cases like 429 errors and batch file limits, and whether the upgrade from GPT-5.4 is worth the migration effort. The short answer on the last point is yes — GPT-5.5 Batch pricing equals GPT-5.4 standard pricing ($2.50/$15 per 1M tokens), so any team running background workloads gets a capability upgrade at zero incremental cost. GPT-5.5 launched on April 23, 2026, with measurably higher coding benchmark scores than its predecessor. The answers below address the most common implementation questions from engineering teams evaluating both tiers, covering Python SDK integration, retry logic for Flex 429 errors, prompt caching compatibility, and the concrete ROI case for switching batch-eligible workloads. Each answer is written to stand alone without requiring context from earlier in the article.</p>
<h3 id="what-is-the-difference-between-gpt-55-batch-api-and-flex-mode">What is the difference between GPT-5.5 Batch API and Flex mode?</h3>
<p>Batch API is fully asynchronous — you submit a file of up to 50,000 requests and receive results within 24 hours with no real-time interaction. Flex mode is interactive: you make standard API calls but with <code>service_tier=&quot;flex&quot;</code>, and responses arrive with variable latency (seconds to minutes) rather than the consistent speed of the Standard tier. Both cost $2.50/1M input and $15/1M output — 50% off Standard rates.</p>
<h3 id="can-i-use-gpt-55-batch-api-in-my-existing-cicd-pipeline">Can I use GPT-5.5 Batch API in my existing CI/CD pipeline?</h3>
<p>Yes. The Batch API integrates with any CI/CD system that can run Python or Node.js scripts. The typical pattern is: (1) generate the JSONL request file at the end of a build, (2) submit the batch, (3) store the batch ID, (4) have the next day&rsquo;s build or a separate scheduled job poll for completion and download results. Do not block the current pipeline on batch completion — treat it as a separate async workflow.</p>
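<p>A minimal sketch of that pattern, splitting submission and collection across pipeline runs with a local state file (the file name is hypothetical; <code>client</code> and <code>upload</code> are as in the implementation section above):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import json

# End of build: submit and record the batch ID.
batch = client.batches.create(
    input_file_id=upload.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
with open(".batch_state.json", "w") as f:  # hypothetical state file
    json.dump({"batch_id": batch.id}, f)

# Next day's scheduled job: look up the ID and collect results.
with open(".batch_state.json") as f:
    batch_id = json.load(f)["batch_id"]

batch = client.batches.retrieve(batch_id)
if batch.status == "completed":
    results = client.files.content(batch.output_file_id)
</code></pre></div>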
<h3 id="does-prompt-caching-work-with-gpt-55-batch-api">Does prompt caching work with GPT-5.5 Batch API?</h3>
<p>Yes, prompt caching and Batch API discounts stack. Cached input tokens (prefixes exceeding 1,024 tokens that repeat across requests in a batch) are priced at approximately $1.25/1M — 75% off standard rates. Keep your system prompt and shared context as a fixed prefix at the top of every batch request to maximize cache hit rates.</p>
<h3 id="what-happens-if-a-flex-request-gets-a-429-error">What happens if a Flex request gets a 429 error?</h3>
<p>A 429 during Flex processing means the platform is under high load and your request was not queued. Implement exponential backoff: wait 1 second after the first failure, 2 seconds after the second, 4 after the third, and so on up to a configured maximum. If all retries are exhausted, fall back to the Standard tier for that specific request. Never use Flex for user-facing requests where a 429 would break the user experience.</p>
<h3 id="is-gpt-55-better-than-gpt-54-for-coding-tasks-at-the-same-cost">Is GPT-5.5 better than GPT-5.4 for coding tasks at the same cost?</h3>
<p>Yes, when using Batch pricing. GPT-5.5 Batch pricing ($2.50/$15 per 1M tokens) equals GPT-5.4 Standard pricing, but GPT-5.5 scores 82.7% on Terminal-Bench 2.0 versus GPT-5.4&rsquo;s 75.1%. For teams currently using GPT-5.4 at standard rates and running any batch-eligible workloads, switching to GPT-5.5 Batch delivers higher capability at identical or lower cost — a direct upgrade with no tradeoffs.</p>
]]></content:encoded></item><item><title>OpenAI Hosted Shell and Apply Patch: GPT-5.5 Compute Tools for Autonomous Code Execution</title><link>https://baeseokjae.github.io/posts/openai-hosted-shell-apply-patch-guide-2026/</link><pubDate>Sat, 25 Apr 2026 10:05:54 +0000</pubDate><guid>https://baeseokjae.github.io/posts/openai-hosted-shell-apply-patch-guide-2026/</guid><description>Complete guide to GPT-5.5&amp;#39;s hosted shell and apply_patch tools for building autonomous coding agents via the OpenAI Responses API.</description><content:encoded><![CDATA[<p>GPT-5.5&rsquo;s hosted shell and <code>apply_patch</code> tools let you run autonomous coding agents that explore filesystems, execute commands, and apply precise code edits — all inside an OpenAI-managed Debian 12 sandbox with no infrastructure to maintain.</p>
<h2 id="what-are-openais-compute-tools-hosted-shell-and-apply-patch-explained">What Are OpenAI&rsquo;s Compute Tools? Hosted Shell and Apply Patch Explained</h2>
<p>OpenAI&rsquo;s compute tools are two purpose-built capabilities in the Responses API that give models direct access to code execution environments and structured file-editing primitives. The <strong>hosted shell</strong> tool provisions an ephemeral Debian 12 container where GPT-5.5 can run arbitrary shell commands — installing packages, running test suites, inspecting file trees, and producing downloadable artifacts via <code>/mnt/data</code>. The <strong><code>apply_patch</code> tool</strong> gives the model a structured way to propose file modifications using the V4A diff format, which supports <code>create_file</code>, <code>update_file</code>, and <code>delete_file</code> operations with surgical precision. Together, these two tools form a closed loop: the model explores a codebase with shell commands, identifies what needs to change, and applies those changes via structured patches — without the host application needing to interpret or re-execute diffs. As of April 2026, these tools are only available through the Responses API (not the Chat Completions API) and require GPT-5.5 or compatible models. The combination represents OpenAI&rsquo;s most direct answer to Claude Code, GitHub Copilot Agent, and similar agentic coding platforms.</p>
<h2 id="gpt-55-spud-the-model-that-powers-these-tools">GPT-5.5 (Spud): The Model That Powers These Tools</h2>
<p>GPT-5.5, codenamed &ldquo;Spud,&rdquo; was released on April 23, 2026 — the first fully retrained base model since GPT-4.5. It is specifically optimized for agentic, multi-step workflows that involve tool use across long contexts. GPT-5.5 achieves <strong>82.7% on Terminal-Bench 2.0</strong>, the state-of-the-art benchmark for complex command-line workflows, and <strong>58.6% on SWE-Bench Pro</strong> for real-world GitHub issue resolution (compared to Claude Opus 4.7&rsquo;s 64.3% on the same benchmark). The model supports a <strong>1M token context window</strong> and natively integrates with hosted shell, <code>apply_patch</code>, computer use, Skills, MCP servers, and web search. Pricing is $5 per 1M input tokens and $30 per 1M output tokens — double the GPT-5.4 rate, reflecting the higher capability level. GPT-5.5 Pro ($30/$180 per 1M tokens) offers enhanced reasoning but notably does <strong>not</strong> support <code>apply_patch</code>, making standard GPT-5.5 the correct choice for autonomous code-editing agents. If your workflow requires multi-file refactoring, bug patching, or test generation at scale, GPT-5.5 is the model to use.</p>
<h2 id="how-the-hosted-shell-works-debian-12-container-architecture">How the Hosted Shell Works: Debian 12 Container Architecture</h2>
<p>The hosted shell provisions an OpenAI-managed Debian 12 environment with controlled internet access that is isolated from your application&rsquo;s runtime and credentials. When you include <code>{&quot;type&quot;: &quot;shell&quot;}</code> in the <code>tools</code> array and set <code>container</code> to <code>&quot;container_auto&quot;</code>, OpenAI automatically allocates a fresh container for each session. The model can execute any shell command — <code>apt-get install</code>, <code>pytest</code>, <code>git log</code>, <code>find</code>, <code>curl</code> — and the output streams back from the container runtime into the model&rsquo;s context. Files written to <code>/mnt/data</code> inside the container become downloadable artifacts available after the session. Container pricing is separate from token costs: $0.03 for a 1GB session or $1.92 for a 64GB session, billed per 20-minute session window (pricing active from March 31, 2026). The architecture deliberately separates the <strong>control harness</strong> (your application code, API keys, environment variables) from the <strong>compute layer</strong> (the sandboxed container), which prevents the model from exfiltrating credentials or making unauthorized network calls. Containers are ephemeral by default — state does not persist between API calls unless you mount a persistent volume or use the <code>/mnt/data</code> artifact mechanism.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-5.5&#34;</span>,
</span></span><span style="display:flex;"><span>    tools<span style="color:#f92672">=</span>[{<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;shell&#34;</span>}],
</span></span><span style="display:flex;"><span>    container<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;container_auto&#34;</span>,
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;List all Python files in /workspace and count total lines of code.&#34;</span>,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> event <span style="color:#f92672">in</span> response:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> event<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;shell_call&#34;</span>:
</span></span><span style="display:flex;"><span>        print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Running: </span><span style="color:#e6db74">{</span>event<span style="color:#f92672">.</span>command<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">elif</span> event<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;shell_call_output&#34;</span>:
</span></span><span style="display:flex;"><span>        print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Output: </span><span style="color:#e6db74">{</span>event<span style="color:#f92672">.</span>output[:<span style="color:#ae81ff">200</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span></code></pre></div><h3 id="container-session-lifecycle">Container Session Lifecycle</h3>
<p>A container session begins when the first shell command executes and ends after 20 minutes of inactivity or when explicitly closed. Within a session, the container maintains full filesystem state — installed packages, created files, environment variables set by earlier commands. This allows multi-turn interactions where the model installs dependencies in one turn and runs tests in the next without re-provisioning. When building long-running agents, structure your prompts to batch related operations within a single session window to minimize container provisioning overhead and cost.</p>
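<p>In practice, that means folding related steps into one request rather than provisioning a container per step. A sketch reusing the call shape from the example above:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># One container session handles install, test run, and artifact in sequence,
# instead of paying provisioning overhead three times.
response = client.responses.create(
    model="gpt-5.5",
    tools=[{"type": "shell"}],
    container="container_auto",
    input=(
        "In a single session: (1) pip install -r requirements.txt, "
        "(2) run pytest -q, (3) write a failure summary to /mnt/data/report.txt."
    ),
)
</code></pre></div>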
<h2 id="the-apply-patch-tool-v4a-diff-format-for-precise-code-edits">The Apply Patch Tool: V4A Diff Format for Precise Code Edits</h2>
<p>The <code>apply_patch</code> tool gives GPT-5.5 a structured mechanism for proposing file modifications that your application can review, approve, or reject before execution. Unlike shell-based <code>sed</code> or <code>patch</code> commands that operate inside the sandbox, <code>apply_patch</code> emits structured <code>apply_patch_call</code> objects in the model&rsquo;s response output — the actual file changes happen in <strong>your</strong> filesystem, not the container&rsquo;s, giving you full control over what gets modified. The tool uses the <strong>V4A diff format</strong>, a compact patch syntax that supports three operations: <code>create_file</code> (with full content), <code>update_file</code> (with context lines and replacements), and <code>delete_file</code>. Enable it by adding <code>{&quot;type&quot;: &quot;apply_patch&quot;}</code> to your tools array. The model generates patches that are precise, machine-readable, and auditable — each patch specifies exactly which lines change and why, making code review tractable even for large refactors. This design reflects a key architectural choice: the model proposes, the human (or application) disposes. You can add an approval gate, write the patches to a staging directory, run your test suite against them, and only apply on green.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-5.5&#34;</span>,
</span></span><span style="display:flex;"><span>    tools<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>        {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;shell&#34;</span>},
</span></span><span style="display:flex;"><span>        {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;apply_patch&#34;</span>},
</span></span><span style="display:flex;"><span>    ],
</span></span><span style="display:flex;"><span>    container<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;container_auto&#34;</span>,
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;&#34;&#34;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Read src/auth.py. The JWT token expiry is hardcoded to 3600 seconds.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Refactor it to read from an environment variable JWT_EXPIRY_SECONDS with
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    a fallback of 3600. Apply the patch when ready.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;&#34;&#34;</span>,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> event <span style="color:#f92672">in</span> response:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> event<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;apply_patch_call&#34;</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Review patch before applying</span>
</span></span><span style="display:flex;"><span>        print(event<span style="color:#f92672">.</span>patch)
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Apply: event.apply() or handle manually</span>
</span></span></code></pre></div><h3 id="v4a-diff-format-in-practice">V4A Diff Format in Practice</h3>
<p>The V4A format is intentionally minimal. An <code>update_file</code> patch looks like this:</p>



<div class="goat svg-container ">
  
    <svg
      xmlns="http://www.w3.org/2000/svg"
      font-family="Menlo,Lucida Console,monospace"
      
        viewBox="0 0 496 73"
      >
      <g transform='translate(8,16)'>
<circle cx='0' cy='0' r='6' stroke='currentColor' fill='currentColor'></circle>
<circle cx='8' cy='0' r='6' stroke='currentColor' fill='currentColor'></circle>
<circle cx='16' cy='0' r='6' stroke='currentColor' fill='currentColor'></circle>
<text text-anchor='middle' x='0' y='20' fill='currentColor' style='font-size:1em'>@</text>
<text text-anchor='middle' x='0' y='36' fill='currentColor' style='font-size:1em'>-</text>
<text text-anchor='middle' x='0' y='52' fill='currentColor' style='font-size:1em'>+</text>
<text text-anchor='middle' x='8' y='20' fill='currentColor' style='font-size:1em'>@</text>
<text text-anchor='middle' x='8' y='36' fill='currentColor' style='font-size:1em'>J</text>
<text text-anchor='middle' x='8' y='52' fill='currentColor' style='font-size:1em'>J</text>
<text text-anchor='middle' x='16' y='36' fill='currentColor' style='font-size:1em'>W</text>
<text text-anchor='middle' x='16' y='52' fill='currentColor' style='font-size:1em'>W</text>
<text text-anchor='middle' x='24' y='36' fill='currentColor' style='font-size:1em'>T</text>
<text text-anchor='middle' x='24' y='52' fill='currentColor' style='font-size:1em'>T</text>
<text text-anchor='middle' x='32' y='4' fill='currentColor' style='font-size:1em'>U</text>
<text text-anchor='middle' x='32' y='36' fill='currentColor' style='font-size:1em'>_</text>
<text text-anchor='middle' x='32' y='52' fill='currentColor' style='font-size:1em'>_</text>
<text text-anchor='middle' x='40' y='4' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='40' y='36' fill='currentColor' style='font-size:1em'>E</text>
<text text-anchor='middle' x='40' y='52' fill='currentColor' style='font-size:1em'>E</text>
<text text-anchor='middle' x='48' y='4' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='48' y='36' fill='currentColor' style='font-size:1em'>X</text>
<text text-anchor='middle' x='48' y='52' fill='currentColor' style='font-size:1em'>X</text>
<text text-anchor='middle' x='56' y='4' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='56' y='36' fill='currentColor' style='font-size:1em'>P</text>
<text text-anchor='middle' x='56' y='52' fill='currentColor' style='font-size:1em'>P</text>
<text text-anchor='middle' x='64' y='4' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='64' y='36' fill='currentColor' style='font-size:1em'>I</text>
<text text-anchor='middle' x='64' y='52' fill='currentColor' style='font-size:1em'>I</text>
<text text-anchor='middle' x='72' y='4' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='72' y='36' fill='currentColor' style='font-size:1em'>R</text>
<text text-anchor='middle' x='72' y='52' fill='currentColor' style='font-size:1em'>R</text>
<text text-anchor='middle' x='80' y='36' fill='currentColor' style='font-size:1em'>Y</text>
<text text-anchor='middle' x='80' y='52' fill='currentColor' style='font-size:1em'>Y</text>
<text text-anchor='middle' x='88' y='4' fill='currentColor' style='font-size:1em'>F</text>
<text text-anchor='middle' x='96' y='4' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='96' y='36' fill='currentColor' style='font-size:1em'>=</text>
<text text-anchor='middle' x='96' y='52' fill='currentColor' style='font-size:1em'>=</text>
<text text-anchor='middle' x='104' y='4' fill='currentColor' style='font-size:1em'>l</text>
<text text-anchor='middle' x='112' y='4' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='112' y='36' fill='currentColor' style='font-size:1em'>3</text>
<text text-anchor='middle' x='112' y='52' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='120' y='4' fill='currentColor' style='font-size:1em'>:</text>
<text text-anchor='middle' x='120' y='36' fill='currentColor' style='font-size:1em'>6</text>
<text text-anchor='middle' x='120' y='52' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='128' y='36' fill='currentColor' style='font-size:1em'>0</text>
<text text-anchor='middle' x='128' y='52' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='136' y='4' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='136' y='36' fill='currentColor' style='font-size:1em'>0</text>
<text text-anchor='middle' x='136' y='52' fill='currentColor' style='font-size:1em'>(</text>
<text text-anchor='middle' x='144' y='4' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='144' y='52' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='152' y='4' fill='currentColor' style='font-size:1em'>c</text>
<text text-anchor='middle' x='152' y='52' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='160' y='4' fill='currentColor' style='font-size:1em'>/</text>
<text text-anchor='middle' x='160' y='52' fill='currentColor' style='font-size:1em'>.</text>
<text text-anchor='middle' x='168' y='4' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='168' y='52' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='176' y='4' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='176' y='52' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='184' y='4' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='184' y='52' fill='currentColor' style='font-size:1em'>v</text>
<text text-anchor='middle' x='192' y='4' fill='currentColor' style='font-size:1em'>h</text>
<text text-anchor='middle' x='192' y='52' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='200' y='4' fill='currentColor' style='font-size:1em'>.</text>
<text text-anchor='middle' x='200' y='52' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='208' y='4' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='208' y='52' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='216' y='4' fill='currentColor' style='font-size:1em'>y</text>
<text text-anchor='middle' x='216' y='52' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='224' y='52' fill='currentColor' style='font-size:1em'>.</text>
<text text-anchor='middle' x='232' y='52' fill='currentColor' style='font-size:1em'>g</text>
<text text-anchor='middle' x='240' y='52' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='248' y='52' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='256' y='52' fill='currentColor' style='font-size:1em'>(</text>
<text text-anchor='middle' x='264' y='52' fill='currentColor' style='font-size:1em'>"</text>
<text text-anchor='middle' x='272' y='52' fill='currentColor' style='font-size:1em'>J</text>
<text text-anchor='middle' x='280' y='52' fill='currentColor' style='font-size:1em'>W</text>
<text text-anchor='middle' x='288' y='52' fill='currentColor' style='font-size:1em'>T</text>
<text text-anchor='middle' x='296' y='52' fill='currentColor' style='font-size:1em'>_</text>
<text text-anchor='middle' x='304' y='52' fill='currentColor' style='font-size:1em'>E</text>
<text text-anchor='middle' x='312' y='52' fill='currentColor' style='font-size:1em'>X</text>
<text text-anchor='middle' x='320' y='52' fill='currentColor' style='font-size:1em'>P</text>
<text text-anchor='middle' x='328' y='52' fill='currentColor' style='font-size:1em'>I</text>
<text text-anchor='middle' x='336' y='52' fill='currentColor' style='font-size:1em'>R</text>
<text text-anchor='middle' x='344' y='52' fill='currentColor' style='font-size:1em'>Y</text>
<text text-anchor='middle' x='352' y='52' fill='currentColor' style='font-size:1em'>_</text>
<text text-anchor='middle' x='360' y='52' fill='currentColor' style='font-size:1em'>S</text>
<text text-anchor='middle' x='368' y='52' fill='currentColor' style='font-size:1em'>E</text>
<text text-anchor='middle' x='376' y='52' fill='currentColor' style='font-size:1em'>C</text>
<text text-anchor='middle' x='384' y='52' fill='currentColor' style='font-size:1em'>O</text>
<text text-anchor='middle' x='392' y='52' fill='currentColor' style='font-size:1em'>N</text>
<text text-anchor='middle' x='400' y='52' fill='currentColor' style='font-size:1em'>D</text>
<text text-anchor='middle' x='408' y='52' fill='currentColor' style='font-size:1em'>S</text>
<text text-anchor='middle' x='416' y='52' fill='currentColor' style='font-size:1em'>"</text>
<text text-anchor='middle' x='424' y='52' fill='currentColor' style='font-size:1em'>,</text>
<text text-anchor='middle' x='440' y='52' fill='currentColor' style='font-size:1em'>3</text>
<text text-anchor='middle' x='448' y='52' fill='currentColor' style='font-size:1em'>6</text>
<text text-anchor='middle' x='456' y='52' fill='currentColor' style='font-size:1em'>0</text>
<text text-anchor='middle' x='464' y='52' fill='currentColor' style='font-size:1em'>0</text>
<text text-anchor='middle' x='472' y='52' fill='currentColor' style='font-size:1em'>)</text>
<text text-anchor='middle' x='480' y='52' fill='currentColor' style='font-size:1em'>)</text>
</g>

    </svg>
  
</div>
<p>Context lines (unchanged code around the edit) help the patch engine locate the right position even if line numbers have shifted. <code>create_file</code> patches include the full file content inline. <code>delete_file</code> patches require only the filename. The format is designed for model output — terse enough to fit in long context windows, structured enough to parse deterministically.</p>
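<p>For completeness, the other two operations follow the same header style. The sketch below extrapolates from the <code>update_file</code> example above — the header keywords for creation and deletion are extrapolated rather than documented here, so verify them against the official tool reference before writing a parser:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-diff" data-lang="diff">*** Create File: src/config/timeouts.py
+DEFAULT_TIMEOUT_SECONDS = 30
+JWT_EXPIRY_SECONDS = 3600

*** Delete File: src/legacy_auth.py
</code></pre></div>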
<h2 id="building-an-autonomous-coding-agent-shell--apply-patch-workflow">Building an Autonomous Coding Agent: Shell + Apply Patch Workflow</h2>
<p>The most powerful pattern combines hosted shell for exploration and <code>apply_patch</code> for modifications in a four-phase loop: <strong>explore → plan → patch → verify</strong>. In the explore phase, the model uses shell commands to understand the codebase structure, identify failing tests, and locate the code that needs to change. In the plan phase, it reasons through the changes required. In the patch phase, it emits <code>apply_patch_call</code> objects for each file to modify. In the verify phase, it runs the test suite inside the container to confirm the changes are correct. This loop can run fully autonomously or with a human-in-the-loop approval gate between patch and verify. The shell tool handles exploration and verification; <code>apply_patch</code> handles modifications. Neither tool is sufficient alone — shell-only agents write changes via <code>sed</code> or <code>tee</code>, which is fragile and hard to audit; <code>apply_patch</code>-only agents cannot run tests to verify correctness. The combination is what makes the workflow production-grade.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> subprocess
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>SYSTEM_PROMPT <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;&#34;&#34;You are an autonomous coding agent. For each task:
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">1. Use shell to explore the codebase and understand the problem
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">2. Use shell to run existing tests to understand what&#39;s failing
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">3. Use apply_patch to propose precise code changes
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">4. Use shell to run tests again and verify your fix works
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">Report results when done.&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">run_agent</span>(task: str, workspace: str):
</span></span><span style="display:flex;"><span>    messages <span style="color:#f92672">=</span> [{<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Workspace: </span><span style="color:#e6db74">{</span>workspace<span style="color:#e6db74">}</span><span style="color:#ae81ff">\n\n</span><span style="color:#e6db74">Task: </span><span style="color:#e6db74">{</span>task<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>}]
</span></span><span style="display:flex;"><span>    patches_applied <span style="color:#f92672">=</span> []  <span style="color:#75715e"># accumulate across all loop iterations</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">while</span> <span style="color:#66d9ef">True</span>:
</span></span><span style="display:flex;"><span>        response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>            model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-5.5&#34;</span>,
</span></span><span style="display:flex;"><span>            tools<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>                {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;shell&#34;</span>},
</span></span><span style="display:flex;"><span>                {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;apply_patch&#34;</span>},
</span></span><span style="display:flex;"><span>            ],
</span></span><span style="display:flex;"><span>            container<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;container_auto&#34;</span>,
</span></span><span style="display:flex;"><span>            system<span style="color:#f92672">=</span>SYSTEM_PROMPT,
</span></span><span style="display:flex;"><span>            input<span style="color:#f92672">=</span>messages,
</span></span><span style="display:flex;"><span>            stream<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>        )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">for</span> event <span style="color:#f92672">in</span> response:
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> event<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;apply_patch_call&#34;</span>:
</span></span><span style="display:flex;"><span>                <span style="color:#75715e"># Apply patch to local filesystem</span>
</span></span><span style="display:flex;"><span>                result <span style="color:#f92672">=</span> subprocess<span style="color:#f92672">.</span>run(
</span></span><span style="display:flex;"><span>                    [<span style="color:#e6db74">&#34;patch&#34;</span>, <span style="color:#e6db74">&#34;-p0&#34;</span>],
</span></span><span style="display:flex;"><span>                    input<span style="color:#f92672">=</span>event<span style="color:#f92672">.</span>patch,
</span></span><span style="display:flex;"><span>                    capture_output<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>                    text<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>                    cwd<span style="color:#f92672">=</span>workspace  <span style="color:#75715e"># apply inside the workspace, not the caller cwd</span>
</span></span><span style="display:flex;"><span>                )
</span></span><span style="display:flex;"><span>                patches_applied<span style="color:#f92672">.</span>append({
</span></span><span style="display:flex;"><span>                    <span style="color:#e6db74">&#34;patch&#34;</span>: event<span style="color:#f92672">.</span>patch,
</span></span><span style="display:flex;"><span>                    <span style="color:#e6db74">&#34;success&#34;</span>: result<span style="color:#f92672">.</span>returncode <span style="color:#f92672">==</span> <span style="color:#ae81ff">0</span>,
</span></span><span style="display:flex;"><span>                    <span style="color:#e6db74">&#34;output&#34;</span>: result<span style="color:#f92672">.</span>stdout <span style="color:#f92672">or</span> result<span style="color:#f92672">.</span>stderr
</span></span><span style="display:flex;"><span>                })
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> response<span style="color:#f92672">.</span>status <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;completed&#34;</span>:
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">return</span> {
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;patches&#34;</span>: patches_applied,
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;summary&#34;</span>: response<span style="color:#f92672">.</span>output_text
</span></span><span style="display:flex;"><span>            }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Add tool results to message history and continue</span>
</span></span><span style="display:flex;"><span>        messages <span style="color:#f92672">=</span> response<span style="color:#f92672">.</span>messages
</span></span></code></pre></div><h2 id="real-world-use-cases-refactors-bug-fixes-and-migrations">Real-World Use Cases: Refactors, Bug Fixes, and Migrations</h2>
<p>Hosted shell and <code>apply_patch</code> unlock several high-value automated workflows that were previously too complex or risky to automate. <strong>Multi-file refactors</strong>: renaming a function across 50 files, updating import paths after a package reorganization, or migrating from one ORM to another. The model explores the codebase, identifies all affected files, and emits a sequence of <code>apply_patch_call</code> objects — one per file — that can be reviewed as a batch before application. <strong>Bug fixes from issue descriptions</strong>: given a GitHub issue URL or error stack trace, the agent reproduces the bug in the container, locates the root cause, patches it, and runs the test suite to confirm resolution. <strong>API migrations</strong>: when a third-party SDK releases a breaking change, the agent reads the migration guide (via shell <code>curl</code>), identifies all call sites in your codebase, and patches them to the new API. <strong>Test generation</strong>: the agent reads a source file, generates corresponding test cases in the container&rsquo;s scratch space, validates they pass, then uses <code>apply_patch</code> to write the test file into your repository. <strong>Dependency upgrades</strong>: the agent runs <code>pip install --upgrade</code> or <code>npm update</code>, runs your test suite, identifies breakages, patches the affected code, and repeats until tests pass.</p>
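<p>For the multi-file refactor case, a minimal review-gate sketch looks like the following. It assumes the event shapes used throughout this post (<code>event.type</code>, <code>event.patch</code>) and a hypothetical <code>approve_batch()</code> hook — a CLI prompt, a web UI, or an automated policy check — standing in for your review step:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Sketch: collect every proposed patch from a refactor run, then review
# and apply them as one batch. approve_batch() is a hypothetical hook.
pending = [event.patch for event in response if event.type == &#34;apply_patch_call&#34;]

print(f&#34;Model proposed {len(pending)} patches&#34;)
for i, patch in enumerate(pending, start=1):
    print(f&#34;--- patch {i} ---\n{patch}\n&#34;)

if approve_batch(pending):        # human or policy decides on the whole set
    for patch in pending:
        apply_patch(patch)        # e.g. the validated helper shown below
else:
    print(&#34;Batch rejected; nothing applied&#34;)
</code></pre></div>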
<h3 id="when-not-to-use-the-hosted-shell">When Not to Use the Hosted Shell</h3>
<p>The hosted shell is not appropriate for operations that require access to production systems, customer data, or credentials. The container isolation prevents credential theft by design, but this also means the agent cannot directly connect to your production database or internal services. For workflows that require such access, use the <code>apply_patch</code> tool in isolation (without hosted shell) combined with your own local execution environment, where you control what tools and credentials the agent can access.</p>
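<p>Concretely, that isolation pattern is just a smaller tools array. A sketch, assuming the same client setup as the other examples and a hypothetical <code>run_in_local_sandbox()</code> standing in for your own execution environment:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Sketch: apply_patch without the hosted shell. No container is created,
# so the model can only propose patches; execution stays in your hands.
response = client.responses.create(
    model=&#34;gpt-5.5&#34;,
    tools=[{&#34;type&#34;: &#34;apply_patch&#34;}],   # no shell, no container= argument
    input=&#34;Refactor src/auth.py to read JWT expiry from the environment.&#34;,
    stream=True,
)

for event in response:
    if event.type == &#34;apply_patch_call&#34;:
        run_in_local_sandbox(event.patch)   # hypothetical: your own runner
</code></pre></div>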
<h2 id="security-best-practices-sandboxing-path-validation-and-audit-logging">Security Best Practices: Sandboxing, Path Validation, and Audit Logging</h2>
<p>The hosted shell&rsquo;s container isolation eliminates the most dangerous attack vector — direct access to the host filesystem and credentials — but applications using <code>apply_patch</code> still need their own security controls. The key principle: <strong>never apply patches to arbitrary paths without validation</strong>. Validate that all patch targets are within your project root, reject patches that modify <code>.env</code> files, credentials, or CI/CD configuration, and require explicit approval for patches to production code paths. Implement an audit log that records every <code>apply_patch_call</code> with the full patch content, timestamp, model version, and the task prompt that generated it — this creates an immutable record for debugging and compliance. For multi-agent pipelines where one agent&rsquo;s output becomes another&rsquo;s input, add an intermediate validation step that checks patch syntax, target path safety, and changeset size before forwarding. Rate-limit the number of files a single agent run can modify to bound blast radius. Finally, always run your test suite after applying patches in CI, even if the agent reports success — test suite verification in the container is informative but not authoritative for your actual test environment.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> os
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> pathlib <span style="color:#f92672">import</span> Path
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>PROJECT_ROOT <span style="color:#f92672">=</span> Path(<span style="color:#e6db74">&#34;/workspace/myapp&#34;</span>)<span style="color:#f92672">.</span>resolve()
</span></span><span style="display:flex;"><span>BLOCKED_PATTERNS <span style="color:#f92672">=</span> [<span style="color:#e6db74">&#34;.env&#34;</span>, <span style="color:#e6db74">&#34;credentials&#34;</span>, <span style="color:#e6db74">&#34;secrets&#34;</span>, <span style="color:#e6db74">&#34;.aws&#34;</span>, <span style="color:#e6db74">&#34;.ssh&#34;</span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">safe_apply_patch</span>(patch_event, project_root<span style="color:#f92672">=</span>PROJECT_ROOT):
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;Validate and apply a patch only if targets are within project root.&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    lines <span style="color:#f92672">=</span> patch_event<span style="color:#f92672">.</span>patch<span style="color:#f92672">.</span>splitlines()
</span></span><span style="display:flex;"><span>    targets <span style="color:#f92672">=</span> [l<span style="color:#f92672">.</span>split(<span style="color:#e6db74">&#34;: &#34;</span>, <span style="color:#ae81ff">1</span>)[<span style="color:#ae81ff">1</span>] <span style="color:#66d9ef">for</span> l <span style="color:#f92672">in</span> lines <span style="color:#66d9ef">if</span> l<span style="color:#f92672">.</span>startswith(<span style="color:#e6db74">&#34;*** &#34;</span>) <span style="color:#f92672">and</span> <span style="color:#e6db74">&#34;: &#34;</span> <span style="color:#f92672">in</span> l]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> target <span style="color:#f92672">in</span> targets:
</span></span><span style="display:flex;"><span>        target_path <span style="color:#f92672">=</span> (project_root <span style="color:#f92672">/</span> target)<span style="color:#f92672">.</span>resolve()
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Prevent path traversal</span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> <span style="color:#f92672">not</span> target_path<span style="color:#f92672">.</span>is_relative_to(project_root):
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">raise</span> <span style="color:#a6e22e">ValueError</span>(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Path traversal attempt: </span><span style="color:#e6db74">{</span>target<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Block sensitive files</span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> any(p <span style="color:#f92672">in</span> str(target_path) <span style="color:#66d9ef">for</span> p <span style="color:#f92672">in</span> BLOCKED_PATTERNS):
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">raise</span> <span style="color:#a6e22e">ValueError</span>(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Blocked sensitive path: </span><span style="color:#e6db74">{</span>target<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Safe to apply</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> subprocess<span style="color:#f92672">.</span>run([<span style="color:#e6db74">&#34;patch&#34;</span>, <span style="color:#e6db74">&#34;-p0&#34;</span>], input<span style="color:#f92672">=</span>patch_event<span style="color:#f92672">.</span>patch,
</span></span><span style="display:flex;"><span>                          capture_output<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>, text<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>, cwd<span style="color:#f92672">=</span>project_root)
</span></span></code></pre></div><h2 id="pricing-breakdown-api-costs-container-sessions-and-when-to-use-gpt-55-pro">Pricing Breakdown: API Costs, Container Sessions, and When to Use GPT-5.5 Pro</h2>
<p>Understanding the cost structure is essential for building economically viable agents. Token costs and container costs are billed independently and accumulate differently across agent run types.</p>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>GPT-5.5</th>
          <th>GPT-5.5 Pro</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Input tokens</td>
          <td>$5 / 1M</td>
          <td>$30 / 1M</td>
      </tr>
      <tr>
          <td>Output tokens</td>
          <td>$30 / 1M</td>
          <td>$180 / 1M</td>
      </tr>
      <tr>
          <td>apply_patch</td>
          <td>Supported</td>
          <td><strong>Not supported</strong></td>
      </tr>
      <tr>
          <td>Container (1GB)</td>
          <td>$0.03/session</td>
          <td>$0.03/session</td>
      </tr>
      <tr>
          <td>Container (64GB)</td>
          <td>$1.92/session</td>
          <td>$1.92/session</td>
      </tr>
      <tr>
          <td>Context window</td>
          <td>1M tokens</td>
          <td>1M tokens</td>
      </tr>
  </tbody>
</table>
<p>GPT-5.5 Pro&rsquo;s 6x token cost premium is only justified for tasks that require deep multi-step reasoning without tool use — complex architectural analysis, security audit reports, or algorithmic design. For any workflow that uses <code>apply_patch</code>, standard GPT-5.5 is the only option, as Pro explicitly does not support it. For high-volume batch workflows (nightly dependency updates, automated test generation across a monorepo), cache your system prompts and codebase context using the Responses API&rsquo;s caching layer to reduce input token costs by up to 75%. A typical bug-fix agent run that explores 20 files and applies 3 patches costs approximately $0.08–$0.15 in tokens plus $0.03 for the container session — well under $0.20 per resolved issue.</p>
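<p>To make that arithmetic reproducible, here is a minimal cost sketch using the list prices from the table. The token counts are illustrative assumptions, not measurements — substitute your own usage data:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Cost sketch at GPT-5.5 list prices; token counts are assumed, not measured.
INPUT_PER_M, OUTPUT_PER_M = 5.00, 30.00   # $ per 1M tokens, standard tier
CONTAINER_1GB = 0.03                       # $ per session

input_tokens = 15_000    # assumed: ~20 explored files of context
output_tokens = 2_000    # assumed: reasoning plus 3 patches

token_cost = (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M
total = token_cost + CONTAINER_1GB
print(f&#34;tokens ${token_cost:.3f} + container ${CONTAINER_1GB:.2f} = ${total:.3f}&#34;)
# tokens $0.135 + container $0.03 = $0.165 — inside the range quoted above
</code></pre></div>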
<h3 id="container-cost-optimization">Container Cost Optimization</h3>
<p>Container sessions bill per 20-minute window, not per command. Batch multiple related operations within a single agent run to maximize utilization. If your workflow involves repeated runs against the same codebase (e.g., a nightly CI bot), use persistent volumes to avoid re-installing dependencies each session. For development and testing, use a local sandbox (Docker + the OpenAI API without <code>container_auto</code>) to avoid container costs entirely during iteration.</p>
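<p>A quick sanity check on why batching matters, assuming each 1GB session maps to one 20-minute window at the table rate (a run that spills past 20 minutes would bill additional windows):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Sketch: 12 small nightly jobs, assuming one 1GB session per 20-min window.
SESSION_COST = 0.03
jobs = 12

separate = jobs * SESSION_COST   # one container session per job: $0.36
batched = 1 * SESSION_COST       # all jobs in a single agent run:  $0.03
print(f&#34;separate: ${separate:.2f}, batched: ${batched:.2f}&#34;)
</code></pre></div>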
<h2 id="gpt-55-vs-claude-code-vs-github-copilot-agent-agentic-coding-comparison">GPT-5.5 vs Claude Code vs GitHub Copilot Agent: Agentic Coding Comparison</h2>
<p>The autonomous coding agent space now has three dominant approaches, each with distinct architectural trade-offs that affect what workflows they handle best.</p>
<table>
  <thead>
      <tr>
          <th>Capability</th>
          <th>GPT-5.5 (Shell + Patch)</th>
          <th>Claude Code</th>
          <th>GitHub Copilot Agent</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Hosted sandbox</td>
          <td>OpenAI-managed Debian 12</td>
          <td>Local process</td>
          <td>GitHub Actions runner</td>
      </tr>
      <tr>
          <td>Code editing primitive</td>
          <td>apply_patch (V4A)</td>
          <td>Direct file writes</td>
          <td>Direct file writes</td>
      </tr>
      <tr>
          <td>Benchmark (SWE-Bench Pro)</td>
          <td>58.6%</td>
          <td>64.3% (Opus 4.7)</td>
          <td>~52% (est.)</td>
      </tr>
      <tr>
          <td>Terminal-Bench 2.0</td>
          <td>82.7%</td>
          <td>Not published</td>
          <td>Not published</td>
      </tr>
      <tr>
          <td>Context window</td>
          <td>1M tokens</td>
          <td>200K tokens</td>
          <td>128K tokens</td>
      </tr>
      <tr>
          <td>PR integration</td>
          <td>Via API</td>
          <td>Native Git</td>
          <td>Native GitHub PRs</td>
      </tr>
      <tr>
          <td>Audit trail</td>
          <td>apply_patch_call log</td>
          <td>Git diff</td>
          <td>PR review thread</td>
      </tr>
      <tr>
          <td>Pricing model</td>
          <td>Per token + container</td>
          <td>Subscription / API</td>
          <td>Subscription</td>
      </tr>
  </tbody>
</table>
<p>GPT-5.5 leads on Terminal-Bench 2.0 (CLI workflows) and context length, making it the strongest choice for large monorepo refactors where full-codebase context matters. Claude Opus 4.7 leads on SWE-Bench Pro (real GitHub issues), making it stronger for nuanced bug diagnosis. Copilot Agent has the tightest GitHub integration but the smallest context window, limiting it to targeted, file-scoped changes. For teams already invested in the OpenAI API ecosystem, GPT-5.5 with hosted shell and <code>apply_patch</code> delivers a cohesive platform without additional infrastructure. For teams that need maximum accuracy on complex bugs, Claude Code remains the benchmark leader.</p>
<h2 id="getting-started-complete-code-example-with-shell-and-apply-patch">Getting Started: Complete Code Example with Shell and Apply Patch</h2>
<p>The following is a production-ready example that implements the full explore → patch → verify loop with error handling, patch validation, and result reporting. This pattern is suitable for CI/CD integration, nightly maintenance bots, or interactive developer tools.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> pathlib <span style="color:#f92672">import</span> Path
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> subprocess
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> json
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> logging
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>logging<span style="color:#f92672">.</span>basicConfig(level<span style="color:#f92672">=</span>logging<span style="color:#f92672">.</span>INFO)
</span></span><span style="display:flex;"><span>logger <span style="color:#f92672">=</span> logging<span style="color:#f92672">.</span>getLogger(__name__)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>PROJECT_ROOT <span style="color:#f92672">=</span> Path<span style="color:#f92672">.</span>cwd()
</span></span><span style="display:flex;"><span>BLOCKED_PATHS <span style="color:#f92672">=</span> {<span style="color:#e6db74">&#34;.env&#34;</span>, <span style="color:#e6db74">&#34;.aws&#34;</span>, <span style="color:#e6db74">&#34;.ssh&#34;</span>, <span style="color:#e6db74">&#34;credentials&#34;</span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>SYSTEM_PROMPT <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;&#34;&#34;You are a senior software engineer running inside an OpenAI compute environment.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">You have access to a hosted shell and the apply_patch tool.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">Your workflow for every task:
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">1. Use shell to understand the codebase structure (ls, find, cat key files)
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">2. Use shell to run existing tests and understand the current state
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">3. Plan your changes carefully before patching
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">4. Use apply_patch for each file modification — never use shell to write files directly
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">5. Use shell to run tests after patching and verify your changes work
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">6. Report results: files changed, tests passed/failed, any caveats
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">Be precise. Be minimal. Only change what the task requires.&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">validate_patch</span>(patch_text: str) <span style="color:#f92672">-&gt;</span> bool:
</span></span><span style="display:flex;"><span>    lines <span style="color:#f92672">=</span> patch_text<span style="color:#f92672">.</span>splitlines()
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> line <span style="color:#f92672">in</span> lines:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> line<span style="color:#f92672">.</span>startswith(<span style="color:#e6db74">&#34;*** &#34;</span>) <span style="color:#f92672">and</span> <span style="color:#e6db74">&#34;: &#34;</span> <span style="color:#f92672">in</span> line:
</span></span><span style="display:flex;"><span>            target <span style="color:#f92672">=</span> line<span style="color:#f92672">.</span>split(<span style="color:#e6db74">&#34;: &#34;</span>, <span style="color:#ae81ff">1</span>)[<span style="color:#ae81ff">1</span>]<span style="color:#f92672">.</span>strip()
</span></span><span style="display:flex;"><span>            target_path <span style="color:#f92672">=</span> (PROJECT_ROOT <span style="color:#f92672">/</span> target)<span style="color:#f92672">.</span>resolve()
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> <span style="color:#f92672">not</span> target_path<span style="color:#f92672">.</span>is_relative_to(PROJECT_ROOT):
</span></span><span style="display:flex;"><span>                logger<span style="color:#f92672">.</span>error(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Path traversal blocked: </span><span style="color:#e6db74">{</span>target<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">return</span> <span style="color:#66d9ef">False</span>
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> any(blocked <span style="color:#f92672">in</span> target <span style="color:#66d9ef">for</span> blocked <span style="color:#f92672">in</span> BLOCKED_PATHS):
</span></span><span style="display:flex;"><span>                logger<span style="color:#f92672">.</span>error(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Sensitive path blocked: </span><span style="color:#e6db74">{</span>target<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">return</span> <span style="color:#66d9ef">False</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> <span style="color:#66d9ef">True</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">apply_patch</span>(patch_text: str) <span style="color:#f92672">-&gt;</span> dict:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> <span style="color:#f92672">not</span> validate_patch(patch_text):
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> {<span style="color:#e6db74">&#34;success&#34;</span>: <span style="color:#66d9ef">False</span>, <span style="color:#e6db74">&#34;error&#34;</span>: <span style="color:#e6db74">&#34;Patch validation failed&#34;</span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    result <span style="color:#f92672">=</span> subprocess<span style="color:#f92672">.</span>run(
</span></span><span style="display:flex;"><span>        [<span style="color:#e6db74">&#34;patch&#34;</span>, <span style="color:#e6db74">&#34;-p0&#34;</span>, <span style="color:#e6db74">&#34;--dry-run&#34;</span>],
</span></span><span style="display:flex;"><span>        input<span style="color:#f92672">=</span>patch_text,
</span></span><span style="display:flex;"><span>        capture_output<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>        text<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>        cwd<span style="color:#f92672">=</span>PROJECT_ROOT
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> result<span style="color:#f92672">.</span>returncode <span style="color:#f92672">!=</span> <span style="color:#ae81ff">0</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> {<span style="color:#e6db74">&#34;success&#34;</span>: <span style="color:#66d9ef">False</span>, <span style="color:#e6db74">&#34;error&#34;</span>: result<span style="color:#f92672">.</span>stderr}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    result <span style="color:#f92672">=</span> subprocess<span style="color:#f92672">.</span>run(
</span></span><span style="display:flex;"><span>        [<span style="color:#e6db74">&#34;patch&#34;</span>, <span style="color:#e6db74">&#34;-p0&#34;</span>],
</span></span><span style="display:flex;"><span>        input<span style="color:#f92672">=</span>patch_text,
</span></span><span style="display:flex;"><span>        capture_output<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>        text<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>        cwd<span style="color:#f92672">=</span>PROJECT_ROOT
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> {
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;success&#34;</span>: result<span style="color:#f92672">.</span>returncode <span style="color:#f92672">==</span> <span style="color:#ae81ff">0</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;output&#34;</span>: result<span style="color:#f92672">.</span>stdout,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;error&#34;</span>: result<span style="color:#f92672">.</span>stderr <span style="color:#66d9ef">if</span> result<span style="color:#f92672">.</span>returncode <span style="color:#f92672">!=</span> <span style="color:#ae81ff">0</span> <span style="color:#66d9ef">else</span> <span style="color:#66d9ef">None</span>
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">run_coding_agent</span>(task: str) <span style="color:#f92672">-&gt;</span> dict:
</span></span><span style="display:flex;"><span>    logger<span style="color:#f92672">.</span>info(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Starting agent for task: </span><span style="color:#e6db74">{</span>task[:<span style="color:#ae81ff">80</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">...&#34;</span>)
</span></span><span style="display:flex;"><span>    audit_log <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span>    patches_applied <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>        model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-5.5&#34;</span>,
</span></span><span style="display:flex;"><span>        tools<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>            {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;shell&#34;</span>},
</span></span><span style="display:flex;"><span>            {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;apply_patch&#34;</span>},
</span></span><span style="display:flex;"><span>        ],
</span></span><span style="display:flex;"><span>        container<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;container_auto&#34;</span>,
</span></span><span style="display:flex;"><span>        system<span style="color:#f92672">=</span>SYSTEM_PROMPT,
</span></span><span style="display:flex;"><span>        input<span style="color:#f92672">=</span>task,
</span></span><span style="display:flex;"><span>        stream<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> event <span style="color:#f92672">in</span> response:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> event<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;shell_call&#34;</span>:
</span></span><span style="display:flex;"><span>            logger<span style="color:#f92672">.</span>info(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Shell: </span><span style="color:#e6db74">{</span>event<span style="color:#f92672">.</span>command[:<span style="color:#ae81ff">100</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">elif</span> event<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;apply_patch_call&#34;</span>:
</span></span><span style="display:flex;"><span>            logger<span style="color:#f92672">.</span>info(<span style="color:#e6db74">&#34;Patch proposed, validating...&#34;</span>)
</span></span><span style="display:flex;"><span>            audit_log<span style="color:#f92672">.</span>append({
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;apply_patch_call&#34;</span>,
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;patch&#34;</span>: event<span style="color:#f92672">.</span>patch,
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;task&#34;</span>: task,
</span></span><span style="display:flex;"><span>            })
</span></span><span style="display:flex;"><span>            result <span style="color:#f92672">=</span> apply_patch(event<span style="color:#f92672">.</span>patch)
</span></span><span style="display:flex;"><span>            patches_applied<span style="color:#f92672">.</span>append(result)
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> result[<span style="color:#e6db74">&#34;success&#34;</span>]:
</span></span><span style="display:flex;"><span>                logger<span style="color:#f92672">.</span>info(<span style="color:#e6db74">&#34;Patch applied successfully&#34;</span>)
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">else</span>:
</span></span><span style="display:flex;"><span>                logger<span style="color:#f92672">.</span>error(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Patch failed: </span><span style="color:#e6db74">{</span>result[<span style="color:#e6db74">&#39;error&#39;</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">elif</span> event<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;response.done&#34;</span>:
</span></span><span style="display:flex;"><span>            logger<span style="color:#f92672">.</span>info(<span style="color:#e6db74">&#34;Agent completed&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> {
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;patches_applied&#34;</span>: patches_applied,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;patches_succeeded&#34;</span>: sum(<span style="color:#ae81ff">1</span> <span style="color:#66d9ef">for</span> p <span style="color:#f92672">in</span> patches_applied <span style="color:#66d9ef">if</span> p[<span style="color:#e6db74">&#34;success&#34;</span>]),
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;audit_log&#34;</span>: audit_log,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;summary&#34;</span>: response<span style="color:#f92672">.</span>output_text <span style="color:#66d9ef">if</span> hasattr(response, <span style="color:#e6db74">&#34;output_text&#34;</span>) <span style="color:#66d9ef">else</span> <span style="color:#e6db74">&#34;&#34;</span>,
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">if</span> __name__ <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;__main__&#34;</span>:
</span></span><span style="display:flex;"><span>    result <span style="color:#f92672">=</span> run_coding_agent(
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;Find all hardcoded timeout values in src/ and replace them with &#34;</span>
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;constants defined in src/config/timeouts.py. Create that file if it &#34;</span>
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;doesn&#39;t exist. Run the test suite to verify nothing breaks.&#34;</span>
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>    print(json<span style="color:#f92672">.</span>dumps(result, indent<span style="color:#f92672">=</span><span style="color:#ae81ff">2</span>, default<span style="color:#f92672">=</span>str))
</span></span></code></pre></div><h2 id="faq">FAQ</h2>
<p><strong>Does the hosted shell have internet access?</strong>
Yes, with restrictions. OpenAI-managed containers have controlled internet access that allows common package manager operations (<code>apt-get</code>, <code>pip install</code>, <code>npm install</code>) and public API calls, but blocks access to internal networks and restricts certain outbound protocols. This is intentional: the container needs to install dependencies but should not be able to reach your internal databases or VPNs.</p>
<p><strong>Can I use apply_patch without the hosted shell?</strong>
Yes. The <code>apply_patch</code> tool operates independently of the hosted shell. If your application already manages code execution locally (e.g., in a Docker container you control), you can enable only <code>apply_patch</code> and handle all file operations yourself. The model will emit <code>apply_patch_call</code> events that your application applies to its own filesystem.</p>
<p><strong>Is GPT-5.5 better than Claude Code for autonomous coding?</strong>
It depends on the benchmark. GPT-5.5 scores higher on Terminal-Bench 2.0 (82.7% vs. unreported for Claude Code), making it stronger for CLI-heavy workflows. Claude Opus 4.7 scores higher on SWE-Bench Pro (64.3% vs. GPT-5.5&rsquo;s 58.6%), making it better for complex real-world bug resolution. For teams in the OpenAI ecosystem, GPT-5.5 with hosted shell and <code>apply_patch</code> is the most integrated solution.</p>
<p><strong>What happens if a patch fails to apply?</strong>
The <code>apply_patch</code> tool emits the patch as structured output — your application is responsible for applying it. If <code>patch -p0</code> fails (e.g., due to context mismatch), you can return the error to the model in a follow-up turn and ask it to generate a corrected patch. Build retry logic with a maximum of 2–3 attempts before surfacing the error to a human reviewer.</p>
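<p>A sketch of that retry logic — bounded attempts, with the patch error fed back as an ordinary user message. It reuses the event shapes and the validated <code>apply_patch()</code> helper from the examples above; the message-passing shape is an assumption:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Sketch: bounded retry when a patch fails to apply.
MAX_ATTEMPTS = 3

def apply_with_retry(client, task: str) -&gt; bool:
    messages = [{&#34;role&#34;: &#34;user&#34;, &#34;content&#34;: task}]
    for _ in range(MAX_ATTEMPTS):
        response = client.responses.create(
            model=&#34;gpt-5.5&#34;,
            tools=[{&#34;type&#34;: &#34;apply_patch&#34;}],
            input=messages,
            stream=True,
        )
        for event in response:
            if event.type == &#34;apply_patch_call&#34;:
                result = apply_patch(event.patch)   # validated helper from above
                if result[&#34;success&#34;]:
                    return True
                # Feed the failure back and ask for a corrected patch
                messages.append({
                    &#34;role&#34;: &#34;user&#34;,
                    &#34;content&#34;: f&#34;Patch failed to apply:\n{result[&#39;error&#39;]}\n&#34;
                               &#34;Regenerate it with more context lines.&#34;,
                })
    return False   # give up and surface to a human reviewer
</code></pre></div>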
<p><strong>How do I handle large codebases with GPT-5.5&rsquo;s 1M token context?</strong>
GPT-5.5&rsquo;s 1M token context is large enough to hold approximately 30,000–40,000 lines of code. For monorepos larger than this, use the shell tool to identify the relevant subset of files (via <code>grep</code>, <code>find</code>, or language-specific analysis tools) and pass only those files as context. Structure your prompts to load files lazily — let the model request the files it needs rather than dumping the entire codebase upfront.</p>
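<p>One way to implement that lazy-loading pattern: shortlist candidate files locally, pass only those inline, and tell the agent to fetch anything else itself via shell. A sketch — the search term and limits are placeholders:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Sketch: build a file shortlist with grep, cap it, and let the agent
# request anything else via the shell tool. Pattern/limits are placeholders.
import subprocess
from pathlib import Path

hits = subprocess.run(
    [&#34;grep&#34;, &#34;-rl&#34;, &#34;JWT_EXPIRY&#34;, &#34;src/&#34;],   # placeholder search term
    capture_output=True, text=True,
).stdout.splitlines()[:20]                     # cap the shortlist at 20 files

context = &#34;\n\n&#34;.join(f&#34;### {p}\n{Path(p).read_text()}&#34; for p in hits)
task = (
    f&#34;Relevant files:\n{context}\n\n&#34;
    &#34;Fix the expiry handling. Use shell to cat any other file you need &#34;
    &#34;rather than asking for the full repository.&#34;
)
</code></pre></div>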
]]></content:encoded></item></channel></rss>