<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Gpt-5.5 on RockB</title><link>https://baeseokjae.github.io/tags/gpt-5.5/</link><description>Recent content in Gpt-5.5 on RockB</description><image><title>RockB</title><url>https://baeseokjae.github.io/images/og-default.png</url><link>https://baeseokjae.github.io/images/og-default.png</link></image><generator>Hugo</generator><language>en-us</language><lastBuildDate>Sat, 25 Apr 2026 12:04:50 +0000</lastBuildDate><atom:link href="https://baeseokjae.github.io/tags/gpt-5.5/index.xml" rel="self" type="application/rss+xml"/><item><title>GPT-5.5 Batch API and Flex Mode: 50% Cost Savings for High-Volume AI Coding Tasks</title><link>https://baeseokjae.github.io/posts/gpt-5-5-batch-flex-pricing-guide-2026/</link><pubDate>Sat, 25 Apr 2026 12:04:50 +0000</pubDate><guid>https://baeseokjae.github.io/posts/gpt-5-5-batch-flex-pricing-guide-2026/</guid><description>GPT-5.5 Batch and Flex mode cut your API bill by 50%. Learn which coding workflows qualify and how to implement batch jobs in Python.</description><content:encoded><![CDATA[<p>GPT-5.5 Batch API and Flex mode both offer 50% off standard pricing — $2.50 per 1M input tokens and $15 per 1M output tokens versus the standard $5/$30 — giving high-volume AI coding teams a direct path to halving their monthly API spend without changing models or degrading output quality.</p>
<h2 id="what-is-gpt-55-batch-api-and-flex-mode">What Is GPT-5.5 Batch API and Flex Mode?</h2>
<p>GPT-5.5 Batch API and Flex mode are two distinct pricing and execution tiers from OpenAI that both deliver 50% cost savings compared to standard API rates, but differ significantly in how and when results are returned. The Batch API is a fire-and-forget system: you submit up to 50,000 requests in a single JSONL file (up to 200MB), and OpenAI guarantees results within 24 hours. Flex mode, currently in beta as of April 2026, is interactive — requests are processed in real time but with variable latency ranging from a few seconds to several minutes, depending on platform load. GPT-5.5 launched on April 23, 2026, at standard pricing of $5 per 1M input tokens and $30 per 1M output tokens. Both Batch and Flex bring that cost down to $2.50/$15 — the same price as GPT-5.4 standard, but with GPT-5.5&rsquo;s higher capability, including an 82.7% score on Terminal-Bench 2.0 and 58.6% on SWE-Bench Pro. For engineering teams running nightly code reviews, eval pipelines, or test generation jobs, the practical implication is straightforward: you get a better model at the same cost you were already paying.</p>
<h3 id="batch-vs-flex-the-core-distinction">Batch vs Flex: The Core Distinction</h3>
<p>Batch is fully asynchronous with a 24-hour SLA and no interaction mid-job. Flex is interactive but non-priority — you may encounter HTTP 429 errors during peak traffic windows. Neither tier is suitable for production user-facing requests where sub-second latency is required.</p>
<h2 id="gpt-55-pricing-tiers-at-a-glance-standard-vs-flex-vs-batch-vs-priority">GPT-5.5 Pricing Tiers at a Glance (Standard vs Flex vs Batch vs Priority)</h2>
<p>OpenAI now offers four pricing and execution tiers for GPT-5.5, each targeting a different latency-cost tradeoff. Priority tier sits at the top of the speed stack — requests jump the queue for the fastest possible response, priced at a premium above standard rates. Standard tier at $5 per 1M input and $30 per 1M output is the default for most API calls today. Flex and Batch both land at $2.50/$15 — exactly 50% off standard — but serve different use cases. Flex accepts interactive API calls with variable latency, making it usable inside agent loops or CI/CD pipelines where a few extra seconds per call is acceptable. Batch, by contrast, is non-interactive: you upload a file, wait up to 24 hours, and download the results. One important pricing edge case: prompts exceeding 272K tokens are charged at 2x input and 1.5x output rates for the entire session — plan your context window sizes accordingly for large codebase analysis tasks.</p>
<table>
  <thead>
      <tr>
          <th>Tier</th>
          <th>Input (per 1M)</th>
          <th>Output (per 1M)</th>
          <th>Latency</th>
          <th>Interactive?</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Priority</td>
          <td>&gt;$5</td>
          <td>&gt;$30</td>
          <td>Fastest</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Standard</td>
          <td>$5.00</td>
          <td>$30.00</td>
          <td>Fast</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Flex</td>
          <td>$2.50</td>
          <td>$15.00</td>
          <td>Seconds–minutes</td>
          <td>Yes (beta)</td>
      </tr>
      <tr>
          <td>Batch</td>
          <td>$2.50</td>
          <td>$15.00</td>
          <td>Up to 24h</td>
          <td>No</td>
      </tr>
  </tbody>
</table>
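<p>To see how the tiers interact with the 272K-token overage rule described above, here is a small cost sketch. The rates and multipliers come from this article; the function itself is just illustrative arithmetic:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Estimate GPT-5.5 cost for a workload at a given tier.
# Rates ($ per 1M tokens) are taken from the pricing table above.
RATES = {"standard": (5.00, 30.00), "flex": (2.50, 15.00), "batch": (2.50, 15.00)}

def workload_cost(tier, input_tokens, output_tokens, prompt_tokens=0):
    in_rate, out_rate = RATES[tier]
    if prompt_tokens &gt; 272_000:
        # Oversized prompts are billed at 2x input / 1.5x output for the session.
        in_rate, out_rate = in_rate * 2, out_rate * 1.5
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

print(workload_cost("standard", 50_000_000, 10_000_000))  # 550.0
print(workload_cost("batch", 50_000_000, 10_000_000))     # 275.0
</code></pre></div>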
<h3 id="when-priority-makes-sense">When Priority Makes Sense</h3>
<p>Priority is reserved for latency-critical production paths where a delayed response directly impacts user experience — think real-time IDE completions or live pair programming assistants. Everything else should flow down the pricing tiers.</p>
<h2 id="which-ai-coding-workflows-qualify-for-batch-or-flex-mode">Which AI Coding Workflows Qualify for Batch or Flex Mode?</h2>
<p>Usage analysis of common API consumption patterns suggests that 40–60% of a typical engineering team&rsquo;s API workload is batch-eligible — a substantial portion of spend that most teams are leaving on the table. The key qualifying criterion for Batch is that the task can tolerate async results: you don&rsquo;t need the answer in real time. For Flex, the criterion is softer: the task is interactive but not latency-critical — a few extra seconds or minutes is acceptable. Concrete batch-eligible workflows include nightly code review runs across the entire diff since the last merge, automated unit test generation during off-hours CI jobs, eval grading pipelines for fine-tuning or regression testing, embedding refreshes when documentation or codebase content changes, and PR summary generation for engineering digests. Flex-eligible workflows include agent loops that chain multiple model calls where intermediate latency isn&rsquo;t user-visible, data enrichment tasks that run in the background during active development, and CI/CD steps that run post-merge rather than blocking the merge queue. Standard or Priority should be reserved for inline IDE completions, live chat interfaces, real-time code explanations, and any workflow where a human is actively waiting on the response.</p>
<table>
  <thead>
      <tr>
          <th>Workflow</th>
          <th>Recommended Tier</th>
          <th>Reason</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Nightly code review</td>
          <td>Batch</td>
          <td>Async, no user waiting</td>
      </tr>
      <tr>
          <td>Test generation in CI</td>
          <td>Batch / Flex</td>
          <td>Off-critical path</td>
      </tr>
      <tr>
          <td>Embedding refresh</td>
          <td>Batch</td>
          <td>Pure throughput</td>
      </tr>
      <tr>
          <td>Eval grading</td>
          <td>Batch</td>
          <td>Fire-and-forget</td>
      </tr>
      <tr>
          <td>Agent loop (internal calls)</td>
          <td>Flex</td>
          <td>Interactive but non-urgent</td>
      </tr>
      <tr>
          <td>PR summary digest</td>
          <td>Batch</td>
          <td>Scheduled job</td>
      </tr>
      <tr>
          <td>Inline IDE completion</td>
          <td>Standard / Priority</td>
          <td>User is actively waiting</td>
      </tr>
      <tr>
          <td>Live chat assistant</td>
          <td>Standard</td>
          <td>Latency-sensitive</td>
      </tr>
  </tbody>
</table>
<h3 id="how-to-audit-your-current-api-usage">How to Audit Your Current API Usage</h3>
<p>Pull your OpenAI usage logs for the last 30 days and tag each request type as &ldquo;user-facing&rdquo; or &ldquo;background.&rdquo; Any background request is a Batch or Flex candidate. Most teams find 40–60% of their volume is immediately reclassifiable.</p>
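<p>A minimal sketch of that audit, assuming you have exported usage data to a CSV with one row per request and tagged each row yourself (the file name and column names here are hypothetical):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import csv
from collections import Counter

# Tags you assign to each request source when exporting usage data.
USER_FACING = {"ide-completion", "live-chat"}

counts = Counter()
with open("usage_export.csv") as f:  # hypothetical export file
    for row in csv.DictReader(f):
        bucket = "user-facing" if row["endpoint_tag"] in USER_FACING else "background"
        counts[bucket] += int(row["total_tokens"])

total = sum(counts.values())
for bucket, tokens in counts.items():
    # The background share is your Batch/Flex candidate pool.
    print(f"{bucket}: {tokens / total:.0%} of token volume")
</code></pre></div>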
<h2 id="how-to-implement-gpt-55-batch-api-in-python-step-by-step">How to Implement GPT-5.5 Batch API in Python (Step-by-Step)</h2>
<p>The GPT-5.5 Batch API requires <code>openai&gt;=2.1.0</code> and follows a three-step pattern: upload a JSONL file containing your requests, submit the batch job, then poll for completion and download the results. Each line in the JSONL file is a self-contained API request object with a custom <code>custom_id</code> for result matching. The system supports up to 50,000 requests per file and files up to 200MB, with all results guaranteed within 24 hours. Here is a complete working implementation for running nightly code reviews across a list of diffs:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> json
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> time
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI()  <span style="color:#75715e"># reads OPENAI_API_KEY from environment</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Step 1: Prepare the JSONL batch file</span>
</span></span><span style="display:flex;"><span>diffs <span style="color:#f92672">=</span> [
</span></span><span style="display:flex;"><span>    {<span style="color:#e6db74">&#34;id&#34;</span>: <span style="color:#e6db74">&#34;pr-101&#34;</span>, <span style="color:#e6db74">&#34;diff&#34;</span>: <span style="color:#e6db74">&#34;...&lt;git diff content&gt;...&#34;</span>},
</span></span><span style="display:flex;"><span>    {<span style="color:#e6db74">&#34;id&#34;</span>: <span style="color:#e6db74">&#34;pr-102&#34;</span>, <span style="color:#e6db74">&#34;diff&#34;</span>: <span style="color:#e6db74">&#34;...&lt;git diff content&gt;...&#34;</span>},
</span></span><span style="display:flex;"><span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">with</span> open(<span style="color:#e6db74">&#34;batch_requests.jsonl&#34;</span>, <span style="color:#e6db74">&#34;w&#34;</span>) <span style="color:#66d9ef">as</span> f:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> item <span style="color:#f92672">in</span> diffs:
</span></span><span style="display:flex;"><span>        request <span style="color:#f92672">=</span> {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;custom_id&#34;</span>: item[<span style="color:#e6db74">&#34;id&#34;</span>],
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;method&#34;</span>: <span style="color:#e6db74">&#34;POST&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;url&#34;</span>: <span style="color:#e6db74">&#34;/v1/chat/completions&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;body&#34;</span>: {
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;model&#34;</span>: <span style="color:#e6db74">&#34;gpt-5.5&#34;</span>,
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;messages&#34;</span>: [
</span></span><span style="display:flex;"><span>                    {
</span></span><span style="display:flex;"><span>                        <span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;system&#34;</span>,
</span></span><span style="display:flex;"><span>                        <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">&#34;You are a senior code reviewer. Identify bugs, security issues, and style violations.&#34;</span>
</span></span><span style="display:flex;"><span>                    },
</span></span><span style="display:flex;"><span>                    {
</span></span><span style="display:flex;"><span>                        <span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>,
</span></span><span style="display:flex;"><span>                        <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Review this diff:</span><span style="color:#ae81ff">\n\n</span><span style="color:#e6db74">{</span>item[<span style="color:#e6db74">&#39;diff&#39;</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>
</span></span><span style="display:flex;"><span>                    }
</span></span><span style="display:flex;"><span>                ],
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;max_tokens&#34;</span>: <span style="color:#ae81ff">1024</span>
</span></span><span style="display:flex;"><span>            }
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>        f<span style="color:#f92672">.</span>write(json<span style="color:#f92672">.</span>dumps(request) <span style="color:#f92672">+</span> <span style="color:#e6db74">&#34;</span><span style="color:#ae81ff">\n</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Step 2: Upload the file</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">with</span> open(<span style="color:#e6db74">&#34;batch_requests.jsonl&#34;</span>, <span style="color:#e6db74">&#34;rb&#34;</span>) <span style="color:#66d9ef">as</span> f:
</span></span><span style="display:flex;"><span>    upload <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>files<span style="color:#f92672">.</span>create(file<span style="color:#f92672">=</span>f, purpose<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;batch&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Uploaded file: </span><span style="color:#e6db74">{</span>upload<span style="color:#f92672">.</span>id<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Step 3: Submit the batch job</span>
</span></span><span style="display:flex;"><span>batch <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>batches<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    input_file_id<span style="color:#f92672">=</span>upload<span style="color:#f92672">.</span>id,
</span></span><span style="display:flex;"><span>    endpoint<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;/v1/chat/completions&#34;</span>,
</span></span><span style="display:flex;"><span>    completion_window<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;24h&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Batch submitted: </span><span style="color:#e6db74">{</span>batch<span style="color:#f92672">.</span>id<span style="color:#e6db74">}</span><span style="color:#e6db74"> — status: </span><span style="color:#e6db74">{</span>batch<span style="color:#f92672">.</span>status<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Step 4: Poll for completion (in production, use a scheduled job instead)</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">while</span> batch<span style="color:#f92672">.</span>status <span style="color:#f92672">not</span> <span style="color:#f92672">in</span> (<span style="color:#e6db74">&#34;completed&#34;</span>, <span style="color:#e6db74">&#34;failed&#34;</span>, <span style="color:#e6db74">&#34;cancelled&#34;</span>, <span style="color:#e6db74">&#34;expired&#34;</span>):
</span></span><span style="display:flex;"><span>    time<span style="color:#f92672">.</span>sleep(<span style="color:#ae81ff">60</span>)
</span></span><span style="display:flex;"><span>    batch <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>batches<span style="color:#f92672">.</span>retrieve(batch<span style="color:#f92672">.</span>id)
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Status: </span><span style="color:#e6db74">{</span>batch<span style="color:#f92672">.</span>status<span style="color:#e6db74">}</span><span style="color:#e6db74"> — completed: </span><span style="color:#e6db74">{</span>batch<span style="color:#f92672">.</span>request_counts<span style="color:#f92672">.</span>completed<span style="color:#e6db74">}</span><span style="color:#e6db74">/</span><span style="color:#e6db74">{</span>batch<span style="color:#f92672">.</span>request_counts<span style="color:#f92672">.</span>total<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Step 5: Download and process results</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">if</span> batch<span style="color:#f92672">.</span>status <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;completed&#34;</span>:
</span></span><span style="display:flex;"><span>    content <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>files<span style="color:#f92672">.</span>content(batch<span style="color:#f92672">.</span>output_file_id)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> line <span style="color:#f92672">in</span> content<span style="color:#f92672">.</span>text<span style="color:#f92672">.</span>strip()<span style="color:#f92672">.</span>split(<span style="color:#e6db74">&#34;</span><span style="color:#ae81ff">\n</span><span style="color:#e6db74">&#34;</span>):
</span></span><span style="display:flex;"><span>        result <span style="color:#f92672">=</span> json<span style="color:#f92672">.</span>loads(line)
</span></span><span style="display:flex;"><span>        custom_id <span style="color:#f92672">=</span> result[<span style="color:#e6db74">&#34;custom_id&#34;</span>]
</span></span><span style="display:flex;"><span>        review_text <span style="color:#f92672">=</span> result[<span style="color:#e6db74">&#34;response&#34;</span>][<span style="color:#e6db74">&#34;body&#34;</span>][<span style="color:#e6db74">&#34;choices&#34;</span>][<span style="color:#ae81ff">0</span>][<span style="color:#e6db74">&#34;message&#34;</span>][<span style="color:#e6db74">&#34;content&#34;</span>]
</span></span><span style="display:flex;"><span>        print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;</span><span style="color:#ae81ff">\n</span><span style="color:#e6db74">--- Review for </span><span style="color:#e6db74">{</span>custom_id<span style="color:#e6db74">}</span><span style="color:#e6db74"> ---</span><span style="color:#ae81ff">\n</span><span style="color:#e6db74">{</span>review_text<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span></code></pre></div><h3 id="error-handling-and-partial-failures">Error Handling and Partial Failures</h3>
<p>Batch jobs can partially succeed — individual requests may fail while others complete. Always check <code>batch.error_file_id</code> after completion and download the error file alongside the output file. Log failed <code>custom_id</code> values and resubmit them in the next batch cycle.</p>
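<p>Continuing from the variables in the implementation above, a sketch of that resubmission pattern (the exact fields on each error line can vary, so treat the field access below as illustrative):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># After the polling loop: collect failed requests for the next cycle.
if batch.status == "completed" and batch.error_file_id:
    error_content = client.files.content(batch.error_file_id)
    failed_ids = []
    for line in error_content.text.strip().split("\n"):
        err = json.loads(line)
        failed_ids.append(err["custom_id"])
        print(f"Failed: {err['custom_id']}: {err.get('error')}")
    # Re-queue the matching diffs for the next nightly batch run.
    retry_queue = [d for d in diffs if d["id"] in failed_ids]
</code></pre></div>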
<h2 id="flex-processing-when-to-choose-it-over-batch">Flex Processing: When to Choose It Over Batch</h2>
<p>Flex processing is OpenAI&rsquo;s interactive-but-discounted tier, currently in beta for GPT-5.5, o3, and o4-mini as of April 2026. It cuts standard rates by 50% while preserving the real-time request-response pattern — meaning your code calls <code>client.chat.completions.create()</code> normally, but with <code>service_tier=&quot;flex&quot;</code> added. The tradeoff is variable latency: responses arrive within seconds under low load, but can take several minutes when the platform is busy. Flex may also return HTTP 429 errors during peak windows, so retry logic is mandatory. The practical use case is agent pipelines where model calls happen in a background thread or async queue — the loop keeps running, and a slightly delayed intermediate response doesn&rsquo;t break anything. A CI/CD step that analyzes test failures after a build completes is a good Flex candidate: the developer isn&rsquo;t watching the terminal, and a 90-second response versus a 3-second one is irrelevant.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Using Flex mode — same SDK call, different service tier</span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>chat<span style="color:#f92672">.</span>completions<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-5.5&#34;</span>,
</span></span><span style="display:flex;"><span>    service_tier<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;flex&#34;</span>,  <span style="color:#75715e"># the only required change</span>
</span></span><span style="display:flex;"><span>    messages<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>        {<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">&#34;Analyze these test failures and suggest fixes: ...&#34;</span>}
</span></span><span style="display:flex;"><span>    ]
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><h3 id="retry-logic-for-flex-429-errors">Retry Logic for Flex 429 Errors</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> time
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> RateLimitError
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">flex_call_with_retry</span>(messages, max_retries<span style="color:#f92672">=</span><span style="color:#ae81ff">5</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> attempt <span style="color:#f92672">in</span> range(max_retries):
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">try</span>:
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">return</span> client<span style="color:#f92672">.</span>chat<span style="color:#f92672">.</span>completions<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>                model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-5.5&#34;</span>,
</span></span><span style="display:flex;"><span>                service_tier<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;flex&#34;</span>,
</span></span><span style="display:flex;"><span>                messages<span style="color:#f92672">=</span>messages
</span></span><span style="display:flex;"><span>            )
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">except</span> RateLimitError:
</span></span><span style="display:flex;"><span>            wait <span style="color:#f92672">=</span> <span style="color:#ae81ff">2</span> <span style="color:#f92672">**</span> attempt  <span style="color:#75715e"># exponential backoff: 1s, 2s, 4s, 8s, 16s</span>
</span></span><span style="display:flex;"><span>            print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Flex 429 — retrying in </span><span style="color:#e6db74">{</span>wait<span style="color:#e6db74">}</span><span style="color:#e6db74">s (attempt </span><span style="color:#e6db74">{</span>attempt <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span><span style="color:#e6db74">}</span><span style="color:#e6db74">/</span><span style="color:#e6db74">{</span>max_retries<span style="color:#e6db74">}</span><span style="color:#e6db74">)&#34;</span>)
</span></span><span style="display:flex;"><span>            time<span style="color:#f92672">.</span>sleep(wait)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">raise</span> <span style="color:#a6e22e">RuntimeError</span>(<span style="color:#e6db74">&#34;Flex call failed after max retries&#34;</span>)
</span></span></code></pre></div><h2 id="stacking-discounts--batch--prompt-caching-for-maximum-savings">Stacking Discounts — Batch + Prompt Caching for Maximum Savings</h2>
<p>Prompt caching and Batch API discounts stack multiplicatively, creating the most cost-efficient configuration available for high-volume GPT-5.5 workloads. OpenAI&rsquo;s prompt caching automatically kicks in for prompts exceeding 1,024 tokens when the same prefix appears repeatedly — cached input tokens are priced at 50% of the standard input rate. When you&rsquo;re already on Batch pricing ($2.50/1M input), cached tokens drop further to approximately $1.25/1M. For a team running nightly code reviews where the system prompt and codebase context stay constant across 500 PR reviews, the combined discount on the static prefix can approach 75% off standard rates. The key implementation detail is keeping your system prompt and shared context at the front of every request in the batch file, unchanged, so OpenAI&rsquo;s caching infrastructure can recognize and serve the shared prefix from cache. Variable content — the specific diff or file being reviewed — goes at the end of the messages array.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>SYSTEM_PROMPT <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;&#34;&#34;You are a senior code reviewer for a Python backend team.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">Our style guide: PEP 8, type hints required, no bare except clauses,
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">all public functions must have docstrings. Flag: security vulnerabilities,
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">N+1 query patterns, missing input validation, and hardcoded secrets.&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Caching engages only once the shared prefix exceeds 1,024 tokens, so in</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># practice you pair this short system prompt with shared codebase context.</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Later requests in the batch then reuse the cached prefix at ~$1.25/1M input tokens.</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">make_batch_request</span>(pr_id, diff):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> {
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;custom_id&#34;</span>: pr_id,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;method&#34;</span>: <span style="color:#e6db74">&#34;POST&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;url&#34;</span>: <span style="color:#e6db74">&#34;/v1/chat/completions&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;body&#34;</span>: {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;model&#34;</span>: <span style="color:#e6db74">&#34;gpt-5.5&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;messages&#34;</span>: [
</span></span><span style="display:flex;"><span>                {<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;system&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: SYSTEM_PROMPT},  <span style="color:#75715e"># cached prefix</span>
</span></span><span style="display:flex;"><span>                {<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Review this diff:</span><span style="color:#ae81ff">\n\n</span><span style="color:#e6db74">{</span>diff<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>}  <span style="color:#75715e"># variable</span>
</span></span><span style="display:flex;"><span>            ]
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>    }
</span></span></code></pre></div><h3 id="estimating-your-combined-savings">Estimating Your Combined Savings</h3>
<p>For 1M input tokens on a batch with 70% cached prefixes: 700K tokens at $1.25/1M + 300K tokens at $2.50/1M = $0.875 + $0.75 = <strong>$1.625 total</strong>, versus $5.00 at standard uncached rates. That&rsquo;s a 67.5% reduction.</p>
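<p>The same arithmetic in a few lines of Python, so you can plug in your own cache hit rate:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Blended input cost per 1M tokens: Batch pricing with a 70% cached-prefix share.
cached, batch_rate, standard = 1.25, 2.50, 5.00  # $ per 1M input tokens
hit_rate = 0.70
blended = hit_rate * cached + (1 - hit_rate) * batch_rate
print(f"${blended:.3f}/1M input")                    # $1.625/1M input
print(f"{1 - blended / standard:.1%} off standard")  # 67.5% off standard
</code></pre></div>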
<h2 id="real-world-roi-what-50-savings-looks-like-for-a-dev-team">Real-World ROI: What 50% Savings Looks Like for a Dev Team</h2>
<p>A 10-developer engineering team running AI-assisted workflows at scale provides a concrete reference point for the financial impact of switching batch-eligible workloads from Standard to Batch or Flex. Assume the team consumes 50M input tokens and 10M output tokens monthly — a realistic figure for teams running inline completions, code review bots, test generators, and documentation tools. At standard GPT-5.5 rates ($5/$30), that&rsquo;s $250 input + $300 output = <strong>$550/month</strong>. If 50% of that workload is batch-eligible (a conservative estimate given the 40–60% industry benchmark), switching those jobs to Batch pricing reduces the eligible portion from $275 to $137.50 — a monthly saving of <strong>$137.50</strong>, or <strong>$1,650/year</strong>. For a team spending $2,000–$5,000/month on API costs, the savings scale proportionally. A team at $5,000/month with 50% batch-eligible workload saves <strong>$1,250/month</strong> — $15,000/year — without any change to output quality, since Batch jobs use the same model weights as Standard.</p>
<table>
  <thead>
      <tr>
          <th>Monthly Spend</th>
          <th>Batch-Eligible %</th>
          <th>Monthly Savings</th>
          <th>Annual Savings</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>$500</td>
          <td>50%</td>
          <td>$125</td>
          <td>$1,500</td>
      </tr>
      <tr>
          <td>$2,000</td>
          <td>50%</td>
          <td>$500</td>
          <td>$6,000</td>
      </tr>
      <tr>
          <td>$5,000</td>
          <td>50%</td>
          <td>$1,250</td>
          <td>$15,000</td>
      </tr>
      <tr>
          <td>$10,000</td>
          <td>60%</td>
          <td>$3,000</td>
          <td>$36,000</td>
      </tr>
  </tbody>
</table>
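<p>The table rows reduce to one line of arithmetic, since Batch halves the eligible share of spend:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">def monthly_savings(monthly_spend: float, batch_eligible: float) -&gt; float:
    """Savings from moving the batch-eligible share to 50%-off pricing."""
    return monthly_spend * batch_eligible * 0.5

print(monthly_savings(5_000, 0.50))   # 1250.0
print(monthly_savings(10_000, 0.60))  # 3000.0
</code></pre></div>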
<h3 id="the-cost-neutral-gpt-55-upgrade">The Cost-Neutral GPT-5.5 Upgrade</h3>
<p>GPT-5.5 Batch pricing ($2.50/$15) equals GPT-5.4 standard pricing — meaning any team currently using GPT-5.4 can upgrade to GPT-5.5 Batch and get better benchmark performance (82.7% vs 75.1% on Terminal-Bench 2.0) at identical cost. This is the clearest no-compromise upgrade path available in the market as of April 2026.</p>
<h2 id="gpt-55-coding-benchmarks--is-the-upgrade-worth-it">GPT-5.5 Coding Benchmarks — Is the Upgrade Worth It?</h2>
<p>GPT-5.5 demonstrates measurable improvements over GPT-5.4 across the benchmarks that matter most for software engineering tasks, making it the more capable model at the same effective cost when used with Batch pricing. On Terminal-Bench 2.0, which tests complex CLI workflows and multi-step shell interactions, GPT-5.5 scores 82.7% versus GPT-5.4&rsquo;s 75.1% — a 7.6 percentage point improvement that translates directly to fewer retries and more reliable automated tooling. On SWE-Bench Pro, which evaluates real-world GitHub issue resolution with actual repository context, GPT-5.5 achieves 58.6%. On OSWorld-Verified, which measures autonomous computer environment operation (browser control, file system navigation, application interaction), GPT-5.5 reaches 78.7%. These gains matter for teams using AI in CI/CD pipelines: higher SWE-Bench scores mean the model resolves more issues on the first attempt, reducing the number of tokens consumed per successful fix — which compounds the cost savings from Batch pricing.</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>GPT-5.4</th>
          <th>GPT-5.5</th>
          <th>Delta</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Terminal-Bench 2.0</td>
          <td>75.1%</td>
          <td>82.7%</td>
          <td>+7.6pp</td>
      </tr>
      <tr>
          <td>SWE-Bench Pro</td>
          <td>—</td>
          <td>58.6%</td>
          <td>—</td>
      </tr>
      <tr>
          <td>OSWorld-Verified</td>
          <td>—</td>
          <td>78.7%</td>
          <td>—</td>
      </tr>
  </tbody>
</table>
<h3 id="token-efficiency-under-higher-benchmark-scores">Token Efficiency Under Higher Benchmark Scores</h3>
<p>A model that resolves an issue in one pass consumes fewer output tokens than one requiring two or three attempts. For Batch workloads where output costs dominate ($15/1M vs $2.50/1M input), higher first-pass accuracy has a direct, measurable impact on monthly spend beyond the 50% tier discount.</p>
<h2 id="limitations-and-gotchas-24h-sla-429-errors-token-overages">Limitations and Gotchas (24h SLA, 429 Errors, Token Overages)</h2>
<p>Understanding the operational constraints of Batch and Flex mode is essential before routing production workloads through either tier. The Batch API&rsquo;s 24-hour SLA is a hard ceiling, not a typical time — under normal load, most batches complete in 2–6 hours, but you must architect your pipeline to handle the full 24-hour window. Do not use Batch for any workflow where a stakeholder is waiting on the result today. Flex mode&rsquo;s 429 errors are a more operationally complex issue: during peak platform load, Flex requests may be rejected entirely rather than queued, meaning your retry logic must handle outright failures, not just slow responses. The token overage pricing for prompts exceeding 272K tokens deserves special attention — at 2x input and 1.5x output for the entire session, a single oversized request can cost 3–4x what you expected. This is particularly relevant for large codebase analysis tasks where you might naively concatenate entire files into context. Batch API also has a 200MB file size limit and a 50,000 request cap per submission — teams with very large nightly jobs may need to split submissions across multiple batch files.</p>
<table>
  <thead>
      <tr>
          <th>Constraint</th>
          <th>Batch</th>
          <th>Flex</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Max requests per job</td>
          <td>50,000</td>
          <td>N/A (per-call)</td>
      </tr>
      <tr>
          <td>Max file size</td>
          <td>200MB</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>Completion SLA</td>
          <td>24 hours</td>
          <td>Variable (seconds–minutes)</td>
      </tr>
      <tr>
          <td>429 errors possible?</td>
          <td>No</td>
          <td>Yes (peak traffic)</td>
      </tr>
      <tr>
          <td>Prompt caching compatible?</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Token overage threshold</td>
          <td>272K tokens (2x/1.5x pricing)</td>
          <td>272K tokens (2x/1.5x pricing)</td>
      </tr>
  </tbody>
</table>
<h3 id="file-size-planning-for-large-batch-jobs">File Size Planning for Large Batch Jobs</h3>
<p>At 200MB per file with average request sizes of 4KB (1,000-token prompt + metadata), you can fit approximately 50,000 requests — which coincides with the request cap. If your prompts are larger (10–20KB each due to large code context), the file size limit becomes the binding constraint before the request cap.</p>
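<p>When a nightly job risks hitting either limit, split the JSONL before upload. A minimal sketch using the limits from this article (the splitting logic itself is illustrative):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">MAX_REQUESTS = 50_000
MAX_BYTES = 200 * 1024 * 1024  # 200MB Batch API file limit

def split_batch_file(path, prefix="batch_part"):
    """Split a JSONL file so every part respects both Batch API limits."""
    part, count, size = 0, 0, 0
    out = open(f"{prefix}_{part}.jsonl", "w")
    with open(path) as src:
        for line in src:
            line_bytes = len(line.encode("utf-8"))
            if count &gt;= MAX_REQUESTS or size + line_bytes &gt; MAX_BYTES:
                out.close()
                part, count, size = part + 1, 0, 0
                out = open(f"{prefix}_{part}.jsonl", "w")
            out.write(line)
            count += 1
            size += line_bytes
    out.close()
    return part + 1  # number of files written
</code></pre></div>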
<h2 id="final-verdict-how-to-architect-a-cost-optimized-gpt-55-coding-pipeline">Final Verdict: How to Architect a Cost-Optimized GPT-5.5 Coding Pipeline</h2>
<p>The optimal cost structure for a GPT-5.5 coding pipeline routes workloads across all four pricing tiers based on latency requirements and interactivity needs, with the goal of minimizing spend without sacrificing response quality or user experience. Every API call in your system should have an explicit tier assignment, not a default fallback to Standard. For any team serious about cost control, the practical architecture looks like this: route user-facing IDE completions and live chat to Standard or Priority; route all background agent loops, CI post-processing, and non-urgent enrichment to Flex with retry logic; route all scheduled jobs, nightly runs, eval pipelines, and embedding refreshes to Batch. Layer prompt caching on top of Batch for maximum compound savings. The result is a tiered system where only the smallest fraction of your requests — the truly latency-critical ones — pay full standard prices, while 50–70% of your volume runs at half cost or less. GPT-5.5&rsquo;s superior benchmark scores mean that even on the cheaper tiers, you&rsquo;re getting better results than you did from GPT-5.4 at standard pricing. The upgrade path is effectively cost-neutral for batch-heavy teams, and actively cost-reducing for teams that haven&rsquo;t yet segmented their workloads by latency requirement.</p>
<p><strong>Quick-start decision tree</strong> (a routing sketch in code follows the list):</p>
<ol>
<li>Is a human actively waiting? → <strong>Standard or Priority</strong></li>
<li>Is it interactive but non-urgent (agent loop, CI step)? → <strong>Flex</strong></li>
<li>Is it a scheduled or async job? → <strong>Batch</strong></li>
<li>Does the batch have a shared prompt prefix &gt;1,024 tokens? → <strong>Batch + Caching</strong></li>
</ol>
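<p>The routing sketch referenced above, with the same four branches (function and argument names are illustrative):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">def choose_tier(human_waiting: bool, interactive: bool,
                shared_prefix_tokens: int = 0) -&gt; str:
    """Mirror of the quick-start decision tree above."""
    if human_waiting:
        return "standard"  # or "priority" on latency-critical paths
    if interactive:
        return "flex"      # agent loops, post-merge CI steps
    # Scheduled or async jobs go to Batch; caching stacks on a big shared prefix.
    return "batch+caching" if shared_prefix_tokens &gt; 1024 else "batch"
</code></pre></div>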
<hr>
<h2 id="faq">FAQ</h2>
<p>The most common questions about GPT-5.5 Batch API and Flex mode center on three practical concerns: which tier to use for which workload, how to handle operational edge cases like 429 errors and batch file limits, and whether the upgrade from GPT-5.4 is worth the migration effort. The short answer on the last point is yes — GPT-5.5 Batch pricing equals GPT-5.4 standard pricing ($2.50/$15 per 1M tokens), so any team running background workloads gets a capability upgrade at zero incremental cost. GPT-5.5 launched on April 23, 2026, with measurably higher coding benchmark scores than its predecessor. The answers below address the most common implementation questions from engineering teams evaluating both tiers, covering Python SDK integration, retry logic for Flex 429 errors, prompt caching compatibility, and the concrete ROI case for switching batch-eligible workloads. Each answer is written to stand alone without requiring context from earlier in the article.</p>
<h3 id="what-is-the-difference-between-gpt-55-batch-api-and-flex-mode">What is the difference between GPT-5.5 Batch API and Flex mode?</h3>
<p>Batch API is fully asynchronous — you submit a file of up to 50,000 requests and receive results within 24 hours with no real-time interaction. Flex mode is interactive: you make standard API calls but with <code>service_tier=&quot;flex&quot;</code>, and responses arrive with variable latency (seconds to minutes) rather than the consistent speed of the Standard tier. Both cost $2.50/1M input and $15/1M output — 50% off Standard rates.</p>
<h3 id="can-i-use-gpt-55-batch-api-in-my-existing-cicd-pipeline">Can I use GPT-5.5 Batch API in my existing CI/CD pipeline?</h3>
<p>Yes. The Batch API integrates with any CI/CD system that can run Python or Node.js scripts. The typical pattern is: (1) generate the JSONL request file at the end of a build, (2) submit the batch, (3) store the batch ID, (4) have the next day&rsquo;s build or a separate scheduled job poll for completion and download results. Do not block the current pipeline on batch completion — treat it as a separate async workflow.</p>
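<p>A minimal sketch of that pattern, splitting submission and collection across pipeline runs with a local state file (the file name is hypothetical; <code>client</code> and <code>upload</code> are as in the implementation section above):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import json

# End of build: submit and record the batch ID.
batch = client.batches.create(
    input_file_id=upload.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
with open(".batch_state.json", "w") as f:  # hypothetical state file
    json.dump({"batch_id": batch.id}, f)

# Next day's scheduled job: look up the ID and collect results.
with open(".batch_state.json") as f:
    batch_id = json.load(f)["batch_id"]

batch = client.batches.retrieve(batch_id)
if batch.status == "completed":
    results = client.files.content(batch.output_file_id)
</code></pre></div>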
<h3 id="does-prompt-caching-work-with-gpt-55-batch-api">Does prompt caching work with GPT-5.5 Batch API?</h3>
<p>Yes, prompt caching and Batch API discounts stack. Cached input tokens (prefixes exceeding 1,024 tokens that repeat across requests in a batch) are priced at approximately $1.25/1M — 75% off standard rates. Keep your system prompt and shared context as a fixed prefix at the top of every batch request to maximize cache hit rates.</p>
<h3 id="what-happens-if-a-flex-request-gets-a-429-error">What happens if a Flex request gets a 429 error?</h3>
<p>A 429 during Flex processing means the platform is under high load and your request was not queued. Implement exponential backoff: wait 1 second after the first failure, 2 seconds after the second, 4 after the third, and so on up to a configured maximum. If all retries are exhausted, fall back to the Standard tier for that specific request. Never use Flex for user-facing requests where a 429 would break the user experience.</p>
<h3 id="is-gpt-55-better-than-gpt-54-for-coding-tasks-at-the-same-cost">Is GPT-5.5 better than GPT-5.4 for coding tasks at the same cost?</h3>
<p>Yes, when using Batch pricing. GPT-5.5 Batch pricing ($2.50/$15 per 1M tokens) equals GPT-5.4 Standard pricing, but GPT-5.5 scores 82.7% on Terminal-Bench 2.0 versus GPT-5.4&rsquo;s 75.1%. For teams currently using GPT-5.4 at standard rates and running any batch-eligible workloads, switching to GPT-5.5 Batch delivers higher capability at identical or lower cost — a direct upgrade with no tradeoffs.</p>
]]></content:encoded></item><item><title>OpenAI Hosted Shell and Apply Patch: GPT-5.5 Compute Tools for Autonomous Code Execution</title><link>https://baeseokjae.github.io/posts/openai-hosted-shell-apply-patch-guide-2026/</link><pubDate>Sat, 25 Apr 2026 10:05:54 +0000</pubDate><guid>https://baeseokjae.github.io/posts/openai-hosted-shell-apply-patch-guide-2026/</guid><description>Complete guide to GPT-5.5&amp;#39;s hosted shell and apply_patch tools for building autonomous coding agents via the OpenAI Responses API.</description><content:encoded><![CDATA[<p>GPT-5.5&rsquo;s hosted shell and <code>apply_patch</code> tools let you run autonomous coding agents that explore filesystems, execute commands, and apply precise code edits — all inside an OpenAI-managed Debian 12 sandbox with no infrastructure to maintain.</p>
<h2 id="what-are-openais-compute-tools-hosted-shell-and-apply-patch-explained">What Are OpenAI&rsquo;s Compute Tools? Hosted Shell and Apply Patch Explained</h2>
<p>OpenAI&rsquo;s compute tools are two purpose-built capabilities in the Responses API that give models direct access to code execution environments and structured file-editing primitives. The <strong>hosted shell</strong> tool provisions an ephemeral Debian 12 container where GPT-5.5 can run arbitrary shell commands — installing packages, running test suites, inspecting file trees, and producing downloadable artifacts via <code>/mnt/data</code>. The <strong><code>apply_patch</code> tool</strong> gives the model a structured way to propose file modifications using the V4A diff format, which supports <code>create_file</code>, <code>update_file</code>, and <code>delete_file</code> operations with surgical precision. Together, these two tools form a closed loop: the model explores a codebase with shell commands, identifies what needs to change, and applies those changes via structured patches — without the host application needing to interpret or re-execute diffs. As of April 2026, these tools are only available through the Responses API (not the Chat Completions API) and require GPT-5.5 or compatible models. The combination represents OpenAI&rsquo;s most direct answer to Claude Code, GitHub Copilot Agent, and similar agentic coding platforms.</p>
<h2 id="gpt-55-spud-the-model-that-powers-these-tools">GPT-5.5 (Spud): The Model That Powers These Tools</h2>
<p>GPT-5.5, codenamed &ldquo;Spud,&rdquo; was released on April 23, 2026 — the first fully retrained base model since GPT-4.5. It is specifically optimized for agentic, multi-step workflows that involve tool use across long contexts. GPT-5.5 achieves <strong>82.7% on Terminal-Bench 2.0</strong>, the state-of-the-art benchmark for complex command-line workflows, and <strong>58.6% on SWE-Bench Pro</strong> for real-world GitHub issue resolution (compared to Claude Opus 4.7&rsquo;s 64.3% on the same benchmark). The model supports a <strong>1M token context window</strong> and natively integrates with hosted shell, <code>apply_patch</code>, computer use, Skills, MCP servers, and web search. Pricing is $5 per 1M input tokens and $30 per 1M output tokens — double the GPT-5.4 rate, reflecting the higher capability level. GPT-5.5 Pro ($30/$180 per 1M tokens) offers enhanced reasoning but notably does <strong>not</strong> support <code>apply_patch</code>, making standard GPT-5.5 the correct choice for autonomous code-editing agents. If your workflow requires multi-file refactoring, bug patching, or test generation at scale, GPT-5.5 is the model to use.</p>
<h2 id="how-the-hosted-shell-works-debian-12-container-architecture">How the Hosted Shell Works: Debian 12 Container Architecture</h2>
<p>The hosted shell provisions an OpenAI-managed Debian 12 environment with controlled internet access that is isolated from your application&rsquo;s runtime and credentials. When you include <code>{&quot;type&quot;: &quot;shell&quot;}</code> in the <code>tools</code> array and set <code>container</code> to <code>&quot;container_auto&quot;</code>, OpenAI automatically allocates a fresh container for each session. The model can execute any shell command — <code>apt-get install</code>, <code>pytest</code>, <code>git log</code>, <code>find</code>, <code>curl</code> — and the output streams back from the container runtime into the model&rsquo;s context. Files written to <code>/mnt/data</code> inside the container become downloadable artifacts available after the session. Container pricing is separate from token costs: $0.03 for a 1GB session or $1.92 for a 64GB session, billed per 20-minute session window (pricing active from March 31, 2026). The architecture deliberately separates the <strong>control harness</strong> (your application code, API keys, environment variables) from the <strong>compute layer</strong> (the sandboxed container), which prevents the model from exfiltrating credentials or making unauthorized network calls. Containers are ephemeral by default — state does not persist between API calls unless you mount a persistent volume or use the <code>/mnt/data</code> artifact mechanism.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-5.5&#34;</span>,
</span></span><span style="display:flex;"><span>    tools<span style="color:#f92672">=</span>[{<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;shell&#34;</span>}],
</span></span><span style="display:flex;"><span>    container<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;container_auto&#34;</span>,
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;List all Python files in /workspace and count total lines of code.&#34;</span>,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> event <span style="color:#f92672">in</span> response:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> event<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;shell_call&#34;</span>:
</span></span><span style="display:flex;"><span>        print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Running: </span><span style="color:#e6db74">{</span>event<span style="color:#f92672">.</span>command<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">elif</span> event<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;shell_call_output&#34;</span>:
</span></span><span style="display:flex;"><span>        print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Output: </span><span style="color:#e6db74">{</span>event<span style="color:#f92672">.</span>output[:<span style="color:#ae81ff">200</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span></code></pre></div><h3 id="container-session-lifecycle">Container Session Lifecycle</h3>
<p>A container session begins when the first shell command executes and ends after 20 minutes of inactivity or when explicitly closed. Within a session, the container maintains full filesystem state — installed packages, created files, environment variables set by earlier commands. This allows multi-turn interactions where the model installs dependencies in one turn and runs tests in the next without re-provisioning. When building long-running agents, structure your prompts to batch related operations within a single session window to minimize container provisioning overhead and cost.</p>
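<p>In practice, that means folding related steps into one request rather than provisioning a container per step. A sketch reusing the call shape from the example above:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># One container session handles install, test run, and artifact in sequence,
# instead of paying provisioning overhead three times.
response = client.responses.create(
    model="gpt-5.5",
    tools=[{"type": "shell"}],
    container="container_auto",
    input=(
        "In a single session: (1) pip install -r requirements.txt, "
        "(2) run pytest -q, (3) write a failure summary to /mnt/data/report.txt."
    ),
)
</code></pre></div>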
<h2 id="the-apply-patch-tool-v4a-diff-format-for-precise-code-edits">The Apply Patch Tool: V4A Diff Format for Precise Code Edits</h2>
<p>The <code>apply_patch</code> tool gives GPT-5.5 a structured mechanism for proposing file modifications that your application can review, approve, or reject before execution. Unlike shell-based <code>sed</code> or <code>patch</code> commands that operate inside the sandbox, <code>apply_patch</code> emits structured <code>apply_patch_call</code> objects in the model&rsquo;s response output — the actual file changes happen in <strong>your</strong> filesystem, not the container&rsquo;s, giving you full control over what gets modified. The tool uses the <strong>V4A diff format</strong>, a compact patch syntax that supports three operations: <code>create_file</code> (with full content), <code>update_file</code> (with context lines and replacements), and <code>delete_file</code>. Enable it by adding <code>{&quot;type&quot;: &quot;apply_patch&quot;}</code> to your tools array. The model generates patches that are precise, machine-readable, and auditable — each patch specifies exactly which lines change and why, making code review tractable even for large refactors. This design reflects a key architectural choice: the model proposes, the human (or application) disposes. You can add an approval gate, write the patches to a staging directory, run your test suite against them, and only apply on green.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-5.5&#34;</span>,
</span></span><span style="display:flex;"><span>    tools<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>        {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;shell&#34;</span>},
</span></span><span style="display:flex;"><span>        {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;apply_patch&#34;</span>},
</span></span><span style="display:flex;"><span>    ],
</span></span><span style="display:flex;"><span>    container<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;container_auto&#34;</span>,
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;&#34;&#34;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Read src/auth.py. The JWT token expiry is hardcoded to 3600 seconds.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Refactor it to read from an environment variable JWT_EXPIRY_SECONDS with
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    a fallback of 3600. Apply the patch when ready.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;&#34;&#34;</span>,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> event <span style="color:#f92672">in</span> response:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> event<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;apply_patch_call&#34;</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Review patch before applying</span>
</span></span><span style="display:flex;"><span>        print(event<span style="color:#f92672">.</span>patch)
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Apply: event.apply() or handle manually</span>
</span></span></code></pre></div><h3 id="v4a-diff-format-in-practice">V4A Diff Format in Practice</h3>
<p>The V4A format is intentionally minimal. An <code>update_file</code> patch looks like this:</p>



<div class="goat svg-container ">
  
    <svg
      xmlns="http://www.w3.org/2000/svg"
      font-family="Menlo,Lucida Console,monospace"
      
        viewBox="0 0 496 73"
      >
      <g transform='translate(8,16)'>
<circle cx='0' cy='0' r='6' stroke='currentColor' fill='currentColor'></circle>
<circle cx='8' cy='0' r='6' stroke='currentColor' fill='currentColor'></circle>
<circle cx='16' cy='0' r='6' stroke='currentColor' fill='currentColor'></circle>
<text text-anchor='middle' x='0' y='20' fill='currentColor' style='font-size:1em'>@</text>
<text text-anchor='middle' x='0' y='36' fill='currentColor' style='font-size:1em'>-</text>
<text text-anchor='middle' x='0' y='52' fill='currentColor' style='font-size:1em'>+</text>
<text text-anchor='middle' x='8' y='20' fill='currentColor' style='font-size:1em'>@</text>
<text text-anchor='middle' x='8' y='36' fill='currentColor' style='font-size:1em'>J</text>
<text text-anchor='middle' x='8' y='52' fill='currentColor' style='font-size:1em'>J</text>
<text text-anchor='middle' x='16' y='36' fill='currentColor' style='font-size:1em'>W</text>
<text text-anchor='middle' x='16' y='52' fill='currentColor' style='font-size:1em'>W</text>
<text text-anchor='middle' x='24' y='36' fill='currentColor' style='font-size:1em'>T</text>
<text text-anchor='middle' x='24' y='52' fill='currentColor' style='font-size:1em'>T</text>
<text text-anchor='middle' x='32' y='4' fill='currentColor' style='font-size:1em'>U</text>
<text text-anchor='middle' x='32' y='36' fill='currentColor' style='font-size:1em'>_</text>
<text text-anchor='middle' x='32' y='52' fill='currentColor' style='font-size:1em'>_</text>
<text text-anchor='middle' x='40' y='4' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='40' y='36' fill='currentColor' style='font-size:1em'>E</text>
<text text-anchor='middle' x='40' y='52' fill='currentColor' style='font-size:1em'>E</text>
<text text-anchor='middle' x='48' y='4' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='48' y='36' fill='currentColor' style='font-size:1em'>X</text>
<text text-anchor='middle' x='48' y='52' fill='currentColor' style='font-size:1em'>X</text>
<text text-anchor='middle' x='56' y='4' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='56' y='36' fill='currentColor' style='font-size:1em'>P</text>
<text text-anchor='middle' x='56' y='52' fill='currentColor' style='font-size:1em'>P</text>
<text text-anchor='middle' x='64' y='4' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='64' y='36' fill='currentColor' style='font-size:1em'>I</text>
<text text-anchor='middle' x='64' y='52' fill='currentColor' style='font-size:1em'>I</text>
<text text-anchor='middle' x='72' y='4' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='72' y='36' fill='currentColor' style='font-size:1em'>R</text>
<text text-anchor='middle' x='72' y='52' fill='currentColor' style='font-size:1em'>R</text>
<text text-anchor='middle' x='80' y='36' fill='currentColor' style='font-size:1em'>Y</text>
<text text-anchor='middle' x='80' y='52' fill='currentColor' style='font-size:1em'>Y</text>
<text text-anchor='middle' x='88' y='4' fill='currentColor' style='font-size:1em'>F</text>
<text text-anchor='middle' x='96' y='4' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='96' y='36' fill='currentColor' style='font-size:1em'>=</text>
<text text-anchor='middle' x='96' y='52' fill='currentColor' style='font-size:1em'>=</text>
<text text-anchor='middle' x='104' y='4' fill='currentColor' style='font-size:1em'>l</text>
<text text-anchor='middle' x='112' y='4' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='112' y='36' fill='currentColor' style='font-size:1em'>3</text>
<text text-anchor='middle' x='112' y='52' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='120' y='4' fill='currentColor' style='font-size:1em'>:</text>
<text text-anchor='middle' x='120' y='36' fill='currentColor' style='font-size:1em'>6</text>
<text text-anchor='middle' x='120' y='52' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='128' y='36' fill='currentColor' style='font-size:1em'>0</text>
<text text-anchor='middle' x='128' y='52' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='136' y='4' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='136' y='36' fill='currentColor' style='font-size:1em'>0</text>
<text text-anchor='middle' x='136' y='52' fill='currentColor' style='font-size:1em'>(</text>
<text text-anchor='middle' x='144' y='4' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='144' y='52' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='152' y='4' fill='currentColor' style='font-size:1em'>c</text>
<text text-anchor='middle' x='152' y='52' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='160' y='4' fill='currentColor' style='font-size:1em'>/</text>
<text text-anchor='middle' x='160' y='52' fill='currentColor' style='font-size:1em'>.</text>
<text text-anchor='middle' x='168' y='4' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='168' y='52' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='176' y='4' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='176' y='52' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='184' y='4' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='184' y='52' fill='currentColor' style='font-size:1em'>v</text>
<text text-anchor='middle' x='192' y='4' fill='currentColor' style='font-size:1em'>h</text>
<text text-anchor='middle' x='192' y='52' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='200' y='4' fill='currentColor' style='font-size:1em'>.</text>
<text text-anchor='middle' x='200' y='52' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='208' y='4' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='208' y='52' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='216' y='4' fill='currentColor' style='font-size:1em'>y</text>
<text text-anchor='middle' x='216' y='52' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='224' y='52' fill='currentColor' style='font-size:1em'>.</text>
<text text-anchor='middle' x='232' y='52' fill='currentColor' style='font-size:1em'>g</text>
<text text-anchor='middle' x='240' y='52' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='248' y='52' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='256' y='52' fill='currentColor' style='font-size:1em'>(</text>
<text text-anchor='middle' x='264' y='52' fill='currentColor' style='font-size:1em'>"</text>
<text text-anchor='middle' x='272' y='52' fill='currentColor' style='font-size:1em'>J</text>
<text text-anchor='middle' x='280' y='52' fill='currentColor' style='font-size:1em'>W</text>
<text text-anchor='middle' x='288' y='52' fill='currentColor' style='font-size:1em'>T</text>
<text text-anchor='middle' x='296' y='52' fill='currentColor' style='font-size:1em'>_</text>
<text text-anchor='middle' x='304' y='52' fill='currentColor' style='font-size:1em'>E</text>
<text text-anchor='middle' x='312' y='52' fill='currentColor' style='font-size:1em'>X</text>
<text text-anchor='middle' x='320' y='52' fill='currentColor' style='font-size:1em'>P</text>
<text text-anchor='middle' x='328' y='52' fill='currentColor' style='font-size:1em'>I</text>
<text text-anchor='middle' x='336' y='52' fill='currentColor' style='font-size:1em'>R</text>
<text text-anchor='middle' x='344' y='52' fill='currentColor' style='font-size:1em'>Y</text>
<text text-anchor='middle' x='352' y='52' fill='currentColor' style='font-size:1em'>_</text>
<text text-anchor='middle' x='360' y='52' fill='currentColor' style='font-size:1em'>S</text>
<text text-anchor='middle' x='368' y='52' fill='currentColor' style='font-size:1em'>E</text>
<text text-anchor='middle' x='376' y='52' fill='currentColor' style='font-size:1em'>C</text>
<text text-anchor='middle' x='384' y='52' fill='currentColor' style='font-size:1em'>O</text>
<text text-anchor='middle' x='392' y='52' fill='currentColor' style='font-size:1em'>N</text>
<text text-anchor='middle' x='400' y='52' fill='currentColor' style='font-size:1em'>D</text>
<text text-anchor='middle' x='408' y='52' fill='currentColor' style='font-size:1em'>S</text>
<text text-anchor='middle' x='416' y='52' fill='currentColor' style='font-size:1em'>"</text>
<text text-anchor='middle' x='424' y='52' fill='currentColor' style='font-size:1em'>,</text>
<text text-anchor='middle' x='440' y='52' fill='currentColor' style='font-size:1em'>3</text>
<text text-anchor='middle' x='448' y='52' fill='currentColor' style='font-size:1em'>6</text>
<text text-anchor='middle' x='456' y='52' fill='currentColor' style='font-size:1em'>0</text>
<text text-anchor='middle' x='464' y='52' fill='currentColor' style='font-size:1em'>0</text>
<text text-anchor='middle' x='472' y='52' fill='currentColor' style='font-size:1em'>)</text>
<text text-anchor='middle' x='480' y='52' fill='currentColor' style='font-size:1em'>)</text>
</g>

    </svg>
  
</div>
<p>Context lines (unchanged code around the edit) help the patch engine locate the right position even if line numbers have shifted. <code>create_file</code> patches include the full file content inline. <code>delete_file</code> patches require only the filename. The format is designed for model output — terse enough to fit in long context windows, structured enough to parse deterministically.</p>
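<p>For completeness, the other two operations follow the same header style. The sketch below extrapolates from the <code>update_file</code> example above — the header keywords for creation and deletion are extrapolated rather than documented here, so verify them against the official tool reference before writing a parser:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-diff" data-lang="diff">*** Create File: src/config/timeouts.py
+DEFAULT_TIMEOUT_SECONDS = 30
+JWT_EXPIRY_SECONDS = 3600

*** Delete File: src/legacy_auth.py
</code></pre></div>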
<h2 id="building-an-autonomous-coding-agent-shell--apply-patch-workflow">Building an Autonomous Coding Agent: Shell + Apply Patch Workflow</h2>
<p>The most powerful pattern combines hosted shell for exploration and <code>apply_patch</code> for modifications in a four-phase loop: <strong>explore → plan → patch → verify</strong>. In the explore phase, the model uses shell commands to understand the codebase structure, identify failing tests, and locate the code that needs to change. In the plan phase, it reasons through the changes required. In the patch phase, it emits <code>apply_patch_call</code> objects for each file to modify. In the verify phase, it runs the test suite inside the container to confirm the changes are correct. This loop can run fully autonomously or with a human-in-the-loop approval gate between patch and verify. The shell tool handles exploration and verification; <code>apply_patch</code> handles modifications. Neither tool is sufficient alone — shell-only agents write changes via <code>sed</code> or <code>tee</code>, which is fragile and hard to audit; <code>apply_patch</code>-only agents cannot run tests to verify correctness. The combination is what makes the workflow production-grade.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> subprocess
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>SYSTEM_PROMPT <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;&#34;&#34;You are an autonomous coding agent. For each task:
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">1. Use shell to explore the codebase and understand the problem
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">2. Use shell to run existing tests to understand what&#39;s failing
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">3. Use apply_patch to propose precise code changes
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">4. Use shell to run tests again and verify your fix works
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">Report results when done.&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">run_agent</span>(task: str, workspace: str):
</span></span><span style="display:flex;"><span>    messages <span style="color:#f92672">=</span> [{<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Workspace: </span><span style="color:#e6db74">{</span>workspace<span style="color:#e6db74">}</span><span style="color:#ae81ff">\n\n</span><span style="color:#e6db74">Task: </span><span style="color:#e6db74">{</span>task<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>}]
</span></span><span style="display:flex;"><span>    patches_applied <span style="color:#f92672">=</span> []  <span style="color:#75715e"># accumulate across all loop iterations</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">while</span> <span style="color:#66d9ef">True</span>:
</span></span><span style="display:flex;"><span>        response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>            model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-5.5&#34;</span>,
</span></span><span style="display:flex;"><span>            tools<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>                {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;shell&#34;</span>},
</span></span><span style="display:flex;"><span>                {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;apply_patch&#34;</span>},
</span></span><span style="display:flex;"><span>            ],
</span></span><span style="display:flex;"><span>            container<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;container_auto&#34;</span>,
</span></span><span style="display:flex;"><span>            system<span style="color:#f92672">=</span>SYSTEM_PROMPT,
</span></span><span style="display:flex;"><span>            input<span style="color:#f92672">=</span>messages,
</span></span><span style="display:flex;"><span>            stream<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>        )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">for</span> event <span style="color:#f92672">in</span> response:
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> event<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;apply_patch_call&#34;</span>:
</span></span><span style="display:flex;"><span>                <span style="color:#75715e"># Apply patch to local filesystem</span>
</span></span><span style="display:flex;"><span>                result <span style="color:#f92672">=</span> subprocess<span style="color:#f92672">.</span>run(
</span></span><span style="display:flex;"><span>                    [<span style="color:#e6db74">&#34;patch&#34;</span>, <span style="color:#e6db74">&#34;-p0&#34;</span>],
</span></span><span style="display:flex;"><span>                    input<span style="color:#f92672">=</span>event<span style="color:#f92672">.</span>patch,
</span></span><span style="display:flex;"><span>                    capture_output<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>                    text<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>                    cwd<span style="color:#f92672">=</span>workspace  <span style="color:#75715e"># apply inside the workspace, not the caller cwd</span>
</span></span><span style="display:flex;"><span>                )
</span></span><span style="display:flex;"><span>                patches_applied<span style="color:#f92672">.</span>append({
</span></span><span style="display:flex;"><span>                    <span style="color:#e6db74">&#34;patch&#34;</span>: event<span style="color:#f92672">.</span>patch,
</span></span><span style="display:flex;"><span>                    <span style="color:#e6db74">&#34;success&#34;</span>: result<span style="color:#f92672">.</span>returncode <span style="color:#f92672">==</span> <span style="color:#ae81ff">0</span>,
</span></span><span style="display:flex;"><span>                    <span style="color:#e6db74">&#34;output&#34;</span>: result<span style="color:#f92672">.</span>stdout <span style="color:#f92672">or</span> result<span style="color:#f92672">.</span>stderr
</span></span><span style="display:flex;"><span>                })
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> response<span style="color:#f92672">.</span>status <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;completed&#34;</span>:
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">return</span> {
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;patches&#34;</span>: patches_applied,
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;summary&#34;</span>: response<span style="color:#f92672">.</span>output_text
</span></span><span style="display:flex;"><span>            }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Add tool results to message history and continue</span>
</span></span><span style="display:flex;"><span>        messages <span style="color:#f92672">=</span> response<span style="color:#f92672">.</span>messages
</span></span></code></pre></div><h2 id="real-world-use-cases-refactors-bug-fixes-and-migrations">Real-World Use Cases: Refactors, Bug Fixes, and Migrations</h2>
<p>Hosted shell and <code>apply_patch</code> unlock several high-value automated workflows that were previously too complex or risky to automate. <strong>Multi-file refactors</strong>: renaming a function across 50 files, updating import paths after a package reorganization, or migrating from one ORM to another. The model explores the codebase, identifies all affected files, and emits a sequence of <code>apply_patch_call</code> objects — one per file — that can be reviewed as a batch before application. <strong>Bug fixes from issue descriptions</strong>: given a GitHub issue URL or error stack trace, the agent reproduces the bug in the container, locates the root cause, patches it, and runs the test suite to confirm resolution. <strong>API migrations</strong>: when a third-party SDK releases a breaking change, the agent reads the migration guide (via shell <code>curl</code>), identifies all call sites in your codebase, and patches them to the new API. <strong>Test generation</strong>: the agent reads a source file, generates corresponding test cases in the container&rsquo;s scratch space, validates they pass, then uses <code>apply_patch</code> to write the test file into your repository. <strong>Dependency upgrades</strong>: the agent runs <code>pip install --upgrade</code> or <code>npm update</code>, runs your test suite, identifies breakages, patches the affected code, and repeats until tests pass.</p>
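<p>For the multi-file refactor case, a minimal review-gate sketch looks like the following. It assumes the event shapes used throughout this post (<code>event.type</code>, <code>event.patch</code>) and a hypothetical <code>approve_batch()</code> hook — a CLI prompt, a web UI, or an automated policy check — standing in for your review step:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Sketch: collect every proposed patch from a refactor run, then review
# and apply them as one batch. approve_batch() is a hypothetical hook.
pending = [event.patch for event in response if event.type == &#34;apply_patch_call&#34;]

print(f&#34;Model proposed {len(pending)} patches&#34;)
for i, patch in enumerate(pending, start=1):
    print(f&#34;--- patch {i} ---\n{patch}\n&#34;)

if approve_batch(pending):        # human or policy decides on the whole set
    for patch in pending:
        apply_patch(patch)        # e.g. the validated helper shown below
else:
    print(&#34;Batch rejected; nothing applied&#34;)
</code></pre></div>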
<h3 id="when-not-to-use-the-hosted-shell">When Not to Use the Hosted Shell</h3>
<p>The hosted shell is not appropriate for operations that require access to production systems, customer data, or credentials. The container isolation prevents credential theft by design, but this also means the agent cannot directly connect to your production database or internal services. For workflows that require such access, use the <code>apply_patch</code> tool in isolation (without hosted shell) combined with your own local execution environment, where you control what tools and credentials the agent can access.</p>
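<p>Concretely, that isolation pattern is just a smaller tools array. A sketch, assuming the same client setup as the other examples and a hypothetical <code>run_in_local_sandbox()</code> standing in for your own execution environment:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Sketch: apply_patch without the hosted shell. No container is created,
# so the model can only propose patches; execution stays in your hands.
response = client.responses.create(
    model=&#34;gpt-5.5&#34;,
    tools=[{&#34;type&#34;: &#34;apply_patch&#34;}],   # no shell, no container= argument
    input=&#34;Refactor src/auth.py to read JWT expiry from the environment.&#34;,
    stream=True,
)

for event in response:
    if event.type == &#34;apply_patch_call&#34;:
        run_in_local_sandbox(event.patch)   # hypothetical: your own runner
</code></pre></div>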
<h2 id="security-best-practices-sandboxing-path-validation-and-audit-logging">Security Best Practices: Sandboxing, Path Validation, and Audit Logging</h2>
<p>The hosted shell&rsquo;s container isolation eliminates the most dangerous attack vector — direct access to the host filesystem and credentials — but applications using <code>apply_patch</code> still need their own security controls. The key principle: <strong>never apply patches to arbitrary paths without validation</strong>. Validate that all patch targets are within your project root, reject patches that modify <code>.env</code> files, credentials, or CI/CD configuration, and require explicit approval for patches to production code paths. Implement an audit log that records every <code>apply_patch_call</code> with the full patch content, timestamp, model version, and the task prompt that generated it — this creates an immutable record for debugging and compliance. For multi-agent pipelines where one agent&rsquo;s output becomes another&rsquo;s input, add an intermediate validation step that checks patch syntax, target path safety, and changeset size before forwarding. Rate-limit the number of files a single agent run can modify to bound blast radius. Finally, always run your test suite after applying patches in CI, even if the agent reports success — test suite verification in the container is informative but not authoritative for your actual test environment.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> os
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> pathlib <span style="color:#f92672">import</span> Path
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>PROJECT_ROOT <span style="color:#f92672">=</span> Path(<span style="color:#e6db74">&#34;/workspace/myapp&#34;</span>)<span style="color:#f92672">.</span>resolve()
</span></span><span style="display:flex;"><span>BLOCKED_PATTERNS <span style="color:#f92672">=</span> [<span style="color:#e6db74">&#34;.env&#34;</span>, <span style="color:#e6db74">&#34;credentials&#34;</span>, <span style="color:#e6db74">&#34;secrets&#34;</span>, <span style="color:#e6db74">&#34;.aws&#34;</span>, <span style="color:#e6db74">&#34;.ssh&#34;</span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">safe_apply_patch</span>(patch_event, project_root<span style="color:#f92672">=</span>PROJECT_ROOT):
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;Validate and apply a patch only if targets are within project root.&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    lines <span style="color:#f92672">=</span> patch_event<span style="color:#f92672">.</span>patch<span style="color:#f92672">.</span>splitlines()
</span></span><span style="display:flex;"><span>    targets <span style="color:#f92672">=</span> [l<span style="color:#f92672">.</span>split(<span style="color:#e6db74">&#34;: &#34;</span>, <span style="color:#ae81ff">1</span>)[<span style="color:#ae81ff">1</span>] <span style="color:#66d9ef">for</span> l <span style="color:#f92672">in</span> lines <span style="color:#66d9ef">if</span> l<span style="color:#f92672">.</span>startswith(<span style="color:#e6db74">&#34;*** &#34;</span>) <span style="color:#f92672">and</span> <span style="color:#e6db74">&#34;: &#34;</span> <span style="color:#f92672">in</span> l]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> target <span style="color:#f92672">in</span> targets:
</span></span><span style="display:flex;"><span>        target_path <span style="color:#f92672">=</span> (project_root <span style="color:#f92672">/</span> target)<span style="color:#f92672">.</span>resolve()
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Prevent path traversal</span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> <span style="color:#f92672">not</span> target_path<span style="color:#f92672">.</span>is_relative_to(project_root):
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">raise</span> <span style="color:#a6e22e">ValueError</span>(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Path traversal attempt: </span><span style="color:#e6db74">{</span>target<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Block sensitive files</span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> any(p <span style="color:#f92672">in</span> str(target_path) <span style="color:#66d9ef">for</span> p <span style="color:#f92672">in</span> BLOCKED_PATTERNS):
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">raise</span> <span style="color:#a6e22e">ValueError</span>(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Blocked sensitive path: </span><span style="color:#e6db74">{</span>target<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Safe to apply</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> subprocess<span style="color:#f92672">.</span>run([<span style="color:#e6db74">&#34;patch&#34;</span>, <span style="color:#e6db74">&#34;-p0&#34;</span>], input<span style="color:#f92672">=</span>patch_event<span style="color:#f92672">.</span>patch,
</span></span><span style="display:flex;"><span>                          capture_output<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>, text<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>, cwd<span style="color:#f92672">=</span>project_root)
</span></span></code></pre></div><h2 id="pricing-breakdown-api-costs-container-sessions-and-when-to-use-gpt-55-pro">Pricing Breakdown: API Costs, Container Sessions, and When to Use GPT-5.5 Pro</h2>
<p>Understanding the cost structure is essential for building economically viable agents. Token costs and container costs are billed independently and accumulate differently across agent run types.</p>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>GPT-5.5</th>
          <th>GPT-5.5 Pro</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Input tokens</td>
          <td>$5 / 1M</td>
          <td>$30 / 1M</td>
      </tr>
      <tr>
          <td>Output tokens</td>
          <td>$30 / 1M</td>
          <td>$180 / 1M</td>
      </tr>
      <tr>
          <td>apply_patch</td>
          <td>Supported</td>
          <td><strong>Not supported</strong></td>
      </tr>
      <tr>
          <td>Container (1GB)</td>
          <td>$0.03/session</td>
          <td>$0.03/session</td>
      </tr>
      <tr>
          <td>Container (64GB)</td>
          <td>$1.92/session</td>
          <td>$1.92/session</td>
      </tr>
      <tr>
          <td>Context window</td>
          <td>1M tokens</td>
          <td>1M tokens</td>
      </tr>
  </tbody>
</table>
<p>GPT-5.5 Pro&rsquo;s 6x token cost premium is only justified for tasks that require deep multi-step reasoning without tool use — complex architectural analysis, security audit reports, or algorithmic design. For any workflow that uses <code>apply_patch</code>, standard GPT-5.5 is the only option, as Pro explicitly does not support it. For high-volume batch workflows (nightly dependency updates, automated test generation across a monorepo), cache your system prompts and codebase context using the Responses API&rsquo;s caching layer to reduce input token costs by up to 75%. A typical bug-fix agent run that explores 20 files and applies 3 patches costs approximately $0.08–$0.15 in tokens plus $0.03 for the container session — well under $0.20 per resolved issue.</p>
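<p>To make that arithmetic reproducible, here is a minimal cost sketch using the list prices from the table. The token counts are illustrative assumptions, not measurements — substitute your own usage data:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Cost sketch at GPT-5.5 list prices; token counts are assumed, not measured.
INPUT_PER_M, OUTPUT_PER_M = 5.00, 30.00   # $ per 1M tokens, standard tier
CONTAINER_1GB = 0.03                       # $ per session

input_tokens = 15_000    # assumed: ~20 explored files of context
output_tokens = 2_000    # assumed: reasoning plus 3 patches

token_cost = (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M
total = token_cost + CONTAINER_1GB
print(f&#34;tokens ${token_cost:.3f} + container ${CONTAINER_1GB:.2f} = ${total:.3f}&#34;)
# tokens $0.135 + container $0.03 = $0.165 — inside the range quoted above
</code></pre></div>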
<h3 id="container-cost-optimization">Container Cost Optimization</h3>
<p>Container sessions bill per 20-minute window, not per command. Batch multiple related operations within a single agent run to maximize utilization. If your workflow involves repeated runs against the same codebase (e.g., a nightly CI bot), use persistent volumes to avoid re-installing dependencies each session. For development and testing, use a local sandbox (Docker + the OpenAI API without <code>container_auto</code>) to avoid container costs entirely during iteration.</p>
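<p>A quick sanity check on why batching matters, assuming each 1GB session maps to one 20-minute window at the table rate (a run that spills past 20 minutes would bill additional windows):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Sketch: 12 small nightly jobs, assuming one 1GB session per 20-min window.
SESSION_COST = 0.03
jobs = 12

separate = jobs * SESSION_COST   # one container session per job: $0.36
batched = 1 * SESSION_COST       # all jobs in a single agent run:  $0.03
print(f&#34;separate: ${separate:.2f}, batched: ${batched:.2f}&#34;)
</code></pre></div>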
<h2 id="gpt-55-vs-claude-code-vs-github-copilot-agent-agentic-coding-comparison">GPT-5.5 vs Claude Code vs GitHub Copilot Agent: Agentic Coding Comparison</h2>
<p>The autonomous coding agent space now has three dominant approaches, each with distinct architectural trade-offs that affect what workflows they handle best.</p>
<table>
  <thead>
      <tr>
          <th>Capability</th>
          <th>GPT-5.5 (Shell + Patch)</th>
          <th>Claude Code</th>
          <th>GitHub Copilot Agent</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Hosted sandbox</td>
          <td>OpenAI-managed Debian 12</td>
          <td>Local process</td>
          <td>GitHub Actions runner</td>
      </tr>
      <tr>
          <td>Code editing primitive</td>
          <td>apply_patch (V4A)</td>
          <td>Direct file writes</td>
          <td>Direct file writes</td>
      </tr>
      <tr>
          <td>Benchmark (SWE-Bench Pro)</td>
          <td>58.6%</td>
          <td>64.3% (Opus 4.7)</td>
          <td>~52% (est.)</td>
      </tr>
      <tr>
          <td>Terminal-Bench 2.0</td>
          <td>82.7%</td>
          <td>Not published</td>
          <td>Not published</td>
      </tr>
      <tr>
          <td>Context window</td>
          <td>1M tokens</td>
          <td>200K tokens</td>
          <td>128K tokens</td>
      </tr>
      <tr>
          <td>PR integration</td>
          <td>Via API</td>
          <td>Native Git</td>
          <td>Native GitHub PRs</td>
      </tr>
      <tr>
          <td>Audit trail</td>
          <td>apply_patch_call log</td>
          <td>Git diff</td>
          <td>PR review thread</td>
      </tr>
      <tr>
          <td>Pricing model</td>
          <td>Per token + container</td>
          <td>Subscription / API</td>
          <td>Subscription</td>
      </tr>
  </tbody>
</table>
<p>GPT-5.5 leads on Terminal-Bench 2.0 (CLI workflows) and context length, making it the strongest choice for large monorepo refactors where full-codebase context matters. Claude Opus 4.7 leads on SWE-Bench Pro (real GitHub issues), making it stronger for nuanced bug diagnosis. Copilot Agent has the tightest GitHub integration but the smallest context window, limiting it to targeted, file-scoped changes. For teams already invested in the OpenAI API ecosystem, GPT-5.5 with hosted shell and <code>apply_patch</code> delivers a cohesive platform without additional infrastructure. For teams that need maximum accuracy on complex bugs, Claude Code remains the benchmark leader.</p>
<h2 id="getting-started-complete-code-example-with-shell-and-apply-patch">Getting Started: Complete Code Example with Shell and Apply Patch</h2>
<p>The following is a production-ready example that implements the full explore → patch → verify loop with error handling, patch validation, and result reporting. This pattern is suitable for CI/CD integration, nightly maintenance bots, or interactive developer tools.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> pathlib <span style="color:#f92672">import</span> Path
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> subprocess
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> json
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> logging
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>logging<span style="color:#f92672">.</span>basicConfig(level<span style="color:#f92672">=</span>logging<span style="color:#f92672">.</span>INFO)
</span></span><span style="display:flex;"><span>logger <span style="color:#f92672">=</span> logging<span style="color:#f92672">.</span>getLogger(__name__)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>PROJECT_ROOT <span style="color:#f92672">=</span> Path<span style="color:#f92672">.</span>cwd()
</span></span><span style="display:flex;"><span>BLOCKED_PATHS <span style="color:#f92672">=</span> {<span style="color:#e6db74">&#34;.env&#34;</span>, <span style="color:#e6db74">&#34;.aws&#34;</span>, <span style="color:#e6db74">&#34;.ssh&#34;</span>, <span style="color:#e6db74">&#34;credentials&#34;</span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>SYSTEM_PROMPT <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;&#34;&#34;You are a senior software engineer running inside an OpenAI compute environment.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">You have access to a hosted shell and the apply_patch tool.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">Your workflow for every task:
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">1. Use shell to understand the codebase structure (ls, find, cat key files)
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">2. Use shell to run existing tests and understand the current state
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">3. Plan your changes carefully before patching
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">4. Use apply_patch for each file modification — never use shell to write files directly
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">5. Use shell to run tests after patching and verify your changes work
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">6. Report results: files changed, tests passed/failed, any caveats
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">Be precise. Be minimal. Only change what the task requires.&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">validate_patch</span>(patch_text: str) <span style="color:#f92672">-&gt;</span> bool:
</span></span><span style="display:flex;"><span>    lines <span style="color:#f92672">=</span> patch_text<span style="color:#f92672">.</span>splitlines()
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> line <span style="color:#f92672">in</span> lines:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> line<span style="color:#f92672">.</span>startswith(<span style="color:#e6db74">&#34;*** &#34;</span>) <span style="color:#f92672">and</span> <span style="color:#e6db74">&#34;: &#34;</span> <span style="color:#f92672">in</span> line:
</span></span><span style="display:flex;"><span>            target <span style="color:#f92672">=</span> line<span style="color:#f92672">.</span>split(<span style="color:#e6db74">&#34;: &#34;</span>, <span style="color:#ae81ff">1</span>)[<span style="color:#ae81ff">1</span>]<span style="color:#f92672">.</span>strip()
</span></span><span style="display:flex;"><span>            target_path <span style="color:#f92672">=</span> (PROJECT_ROOT <span style="color:#f92672">/</span> target)<span style="color:#f92672">.</span>resolve()
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> <span style="color:#f92672">not</span> target_path<span style="color:#f92672">.</span>is_relative_to(PROJECT_ROOT):
</span></span><span style="display:flex;"><span>                logger<span style="color:#f92672">.</span>error(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Path traversal blocked: </span><span style="color:#e6db74">{</span>target<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">return</span> <span style="color:#66d9ef">False</span>
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> any(blocked <span style="color:#f92672">in</span> target <span style="color:#66d9ef">for</span> blocked <span style="color:#f92672">in</span> BLOCKED_PATHS):
</span></span><span style="display:flex;"><span>                logger<span style="color:#f92672">.</span>error(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Sensitive path blocked: </span><span style="color:#e6db74">{</span>target<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">return</span> <span style="color:#66d9ef">False</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> <span style="color:#66d9ef">True</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">apply_patch</span>(patch_text: str) <span style="color:#f92672">-&gt;</span> dict:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> <span style="color:#f92672">not</span> validate_patch(patch_text):
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> {<span style="color:#e6db74">&#34;success&#34;</span>: <span style="color:#66d9ef">False</span>, <span style="color:#e6db74">&#34;error&#34;</span>: <span style="color:#e6db74">&#34;Patch validation failed&#34;</span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    result <span style="color:#f92672">=</span> subprocess<span style="color:#f92672">.</span>run(
</span></span><span style="display:flex;"><span>        [<span style="color:#e6db74">&#34;patch&#34;</span>, <span style="color:#e6db74">&#34;-p0&#34;</span>, <span style="color:#e6db74">&#34;--dry-run&#34;</span>],
</span></span><span style="display:flex;"><span>        input<span style="color:#f92672">=</span>patch_text,
</span></span><span style="display:flex;"><span>        capture_output<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>        text<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>        cwd<span style="color:#f92672">=</span>PROJECT_ROOT
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> result<span style="color:#f92672">.</span>returncode <span style="color:#f92672">!=</span> <span style="color:#ae81ff">0</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> {<span style="color:#e6db74">&#34;success&#34;</span>: <span style="color:#66d9ef">False</span>, <span style="color:#e6db74">&#34;error&#34;</span>: result<span style="color:#f92672">.</span>stderr}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    result <span style="color:#f92672">=</span> subprocess<span style="color:#f92672">.</span>run(
</span></span><span style="display:flex;"><span>        [<span style="color:#e6db74">&#34;patch&#34;</span>, <span style="color:#e6db74">&#34;-p0&#34;</span>],
</span></span><span style="display:flex;"><span>        input<span style="color:#f92672">=</span>patch_text,
</span></span><span style="display:flex;"><span>        capture_output<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>        text<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>        cwd<span style="color:#f92672">=</span>PROJECT_ROOT
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> {
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;success&#34;</span>: result<span style="color:#f92672">.</span>returncode <span style="color:#f92672">==</span> <span style="color:#ae81ff">0</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;output&#34;</span>: result<span style="color:#f92672">.</span>stdout,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;error&#34;</span>: result<span style="color:#f92672">.</span>stderr <span style="color:#66d9ef">if</span> result<span style="color:#f92672">.</span>returncode <span style="color:#f92672">!=</span> <span style="color:#ae81ff">0</span> <span style="color:#66d9ef">else</span> <span style="color:#66d9ef">None</span>
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">run_coding_agent</span>(task: str) <span style="color:#f92672">-&gt;</span> dict:
</span></span><span style="display:flex;"><span>    logger<span style="color:#f92672">.</span>info(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Starting agent for task: </span><span style="color:#e6db74">{</span>task[:<span style="color:#ae81ff">80</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">...&#34;</span>)
</span></span><span style="display:flex;"><span>    audit_log <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span>    patches_applied <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>        model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-5.5&#34;</span>,
</span></span><span style="display:flex;"><span>        tools<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>            {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;shell&#34;</span>},
</span></span><span style="display:flex;"><span>            {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;apply_patch&#34;</span>},
</span></span><span style="display:flex;"><span>        ],
</span></span><span style="display:flex;"><span>        container<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;container_auto&#34;</span>,
</span></span><span style="display:flex;"><span>        system<span style="color:#f92672">=</span>SYSTEM_PROMPT,
</span></span><span style="display:flex;"><span>        input<span style="color:#f92672">=</span>task,
</span></span><span style="display:flex;"><span>        stream<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> event <span style="color:#f92672">in</span> response:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> event<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;shell_call&#34;</span>:
</span></span><span style="display:flex;"><span>            logger<span style="color:#f92672">.</span>info(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Shell: </span><span style="color:#e6db74">{</span>event<span style="color:#f92672">.</span>command[:<span style="color:#ae81ff">100</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">elif</span> event<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;apply_patch_call&#34;</span>:
</span></span><span style="display:flex;"><span>            logger<span style="color:#f92672">.</span>info(<span style="color:#e6db74">&#34;Patch proposed, validating...&#34;</span>)
</span></span><span style="display:flex;"><span>            audit_log<span style="color:#f92672">.</span>append({
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;apply_patch_call&#34;</span>,
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;patch&#34;</span>: event<span style="color:#f92672">.</span>patch,
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;task&#34;</span>: task,
</span></span><span style="display:flex;"><span>            })
</span></span><span style="display:flex;"><span>            result <span style="color:#f92672">=</span> apply_patch(event<span style="color:#f92672">.</span>patch)
</span></span><span style="display:flex;"><span>            patches_applied<span style="color:#f92672">.</span>append(result)
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> result[<span style="color:#e6db74">&#34;success&#34;</span>]:
</span></span><span style="display:flex;"><span>                logger<span style="color:#f92672">.</span>info(<span style="color:#e6db74">&#34;Patch applied successfully&#34;</span>)
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">else</span>:
</span></span><span style="display:flex;"><span>                logger<span style="color:#f92672">.</span>error(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Patch failed: </span><span style="color:#e6db74">{</span>result[<span style="color:#e6db74">&#39;error&#39;</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">elif</span> event<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;response.done&#34;</span>:
</span></span><span style="display:flex;"><span>            logger<span style="color:#f92672">.</span>info(<span style="color:#e6db74">&#34;Agent completed&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> {
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;patches_applied&#34;</span>: patches_applied,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;patches_succeeded&#34;</span>: sum(<span style="color:#ae81ff">1</span> <span style="color:#66d9ef">for</span> p <span style="color:#f92672">in</span> patches_applied <span style="color:#66d9ef">if</span> p[<span style="color:#e6db74">&#34;success&#34;</span>]),
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;audit_log&#34;</span>: audit_log,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;summary&#34;</span>: response<span style="color:#f92672">.</span>output_text <span style="color:#66d9ef">if</span> hasattr(response, <span style="color:#e6db74">&#34;output_text&#34;</span>) <span style="color:#66d9ef">else</span> <span style="color:#e6db74">&#34;&#34;</span>,
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">if</span> __name__ <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;__main__&#34;</span>:
</span></span><span style="display:flex;"><span>    result <span style="color:#f92672">=</span> run_coding_agent(
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;Find all hardcoded timeout values in src/ and replace them with &#34;</span>
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;constants defined in src/config/timeouts.py. Create that file if it &#34;</span>
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;doesn&#39;t exist. Run the test suite to verify nothing breaks.&#34;</span>
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>    print(json<span style="color:#f92672">.</span>dumps(result, indent<span style="color:#f92672">=</span><span style="color:#ae81ff">2</span>, default<span style="color:#f92672">=</span>str))
</span></span></code></pre></div><h2 id="faq">FAQ</h2>
<p><strong>Does the hosted shell have internet access?</strong>
Yes, with restrictions. OpenAI-managed containers have controlled internet access that allows common package manager operations (<code>apt-get</code>, <code>pip install</code>, <code>npm install</code>) and public API calls, but blocks access to internal networks and restricts certain outbound protocols. This is intentional: the container needs to install dependencies but should not be able to reach your internal databases or VPNs.</p>
<p><strong>Can I use apply_patch without the hosted shell?</strong>
Yes. The <code>apply_patch</code> tool operates independently of the hosted shell. If your application already manages code execution locally (e.g., in a Docker container you control), you can enable only <code>apply_patch</code> and handle all file operations yourself. The model will emit <code>apply_patch_call</code> events that your application applies to its own filesystem.</p>
<p><strong>Is GPT-5.5 better than Claude Code for autonomous coding?</strong>
It depends on the benchmark. GPT-5.5 scores higher on Terminal-Bench 2.0 (82.7% vs. unreported for Claude Code), making it stronger for CLI-heavy workflows. Claude Opus 4.7 scores higher on SWE-Bench Pro (64.3% vs. GPT-5.5&rsquo;s 58.6%), making it better for complex real-world bug resolution. For teams in the OpenAI ecosystem, GPT-5.5 with hosted shell and <code>apply_patch</code> is the most integrated solution.</p>
<p><strong>What happens if a patch fails to apply?</strong>
The <code>apply_patch</code> tool emits the patch as structured output — your application is responsible for applying it. If <code>patch -p0</code> fails (e.g., due to context mismatch), you can return the error to the model in a follow-up turn and ask it to generate a corrected patch. Build retry logic with a maximum of 2–3 attempts before surfacing the error to a human reviewer.</p>
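<p>A sketch of that retry logic — bounded attempts, with the patch error fed back as an ordinary user message. It reuses the event shapes and the validated <code>apply_patch()</code> helper from the examples above; the message-passing shape is an assumption:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Sketch: bounded retry when a patch fails to apply.
MAX_ATTEMPTS = 3

def apply_with_retry(client, task: str) -&gt; bool:
    messages = [{&#34;role&#34;: &#34;user&#34;, &#34;content&#34;: task}]
    for _ in range(MAX_ATTEMPTS):
        response = client.responses.create(
            model=&#34;gpt-5.5&#34;,
            tools=[{&#34;type&#34;: &#34;apply_patch&#34;}],
            input=messages,
            stream=True,
        )
        for event in response:
            if event.type == &#34;apply_patch_call&#34;:
                result = apply_patch(event.patch)   # validated helper from above
                if result[&#34;success&#34;]:
                    return True
                # Feed the failure back and ask for a corrected patch
                messages.append({
                    &#34;role&#34;: &#34;user&#34;,
                    &#34;content&#34;: f&#34;Patch failed to apply:\n{result[&#39;error&#39;]}\n&#34;
                               &#34;Regenerate it with more context lines.&#34;,
                })
    return False   # give up and surface to a human reviewer
</code></pre></div>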
<p><strong>How do I handle large codebases with GPT-5.5&rsquo;s 1M token context?</strong>
GPT-5.5&rsquo;s 1M token context is large enough to hold approximately 30,000–40,000 lines of code. For monorepos larger than this, use the shell tool to identify the relevant subset of files (via <code>grep</code>, <code>find</code>, or language-specific analysis tools) and pass only those files as context. Structure your prompts to load files lazily — let the model request the files it needs rather than dumping the entire codebase upfront.</p>
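<p>One way to implement that lazy-loading pattern: shortlist candidate files locally, pass only those inline, and tell the agent to fetch anything else itself via shell. A sketch — the search term and limits are placeholders:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Sketch: build a file shortlist with grep, cap it, and let the agent
# request anything else via the shell tool. Pattern/limits are placeholders.
import subprocess
from pathlib import Path

hits = subprocess.run(
    [&#34;grep&#34;, &#34;-rl&#34;, &#34;JWT_EXPIRY&#34;, &#34;src/&#34;],   # placeholder search term
    capture_output=True, text=True,
).stdout.splitlines()[:20]                     # cap the shortlist at 20 files

context = &#34;\n\n&#34;.join(f&#34;### {p}\n{Path(p).read_text()}&#34; for p in hits)
task = (
    f&#34;Relevant files:\n{context}\n\n&#34;
    &#34;Fix the expiry handling. Use shell to cat any other file you need &#34;
    &#34;rather than asking for the full repository.&#34;
)
</code></pre></div>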
]]></content:encoded></item></channel></rss>