<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Coding Benchmark on RockB</title><link>https://baeseokjae.github.io/tags/coding-benchmark/</link><description>Recent content in Coding Benchmark on RockB</description><image><title>RockB</title><url>https://baeseokjae.github.io/images/og-default.png</url><link>https://baeseokjae.github.io/images/og-default.png</link></image><generator>Hugo</generator><language>en-us</language><lastBuildDate>Mon, 18 May 2026 06:04:48 +0000</lastBuildDate><atom:link href="https://baeseokjae.github.io/tags/coding-benchmark/index.xml" rel="self" type="application/rss+xml"/><item><title>Claude Sonnet 5 vs GPT-5.4 for Coding: SWE-bench Benchmark Comparison 2026</title><link>https://baeseokjae.github.io/posts/claude-sonnet-5-vs-gpt-5-4-coding-2026/</link><pubDate>Mon, 18 May 2026 06:04:48 +0000</pubDate><guid>https://baeseokjae.github.io/posts/claude-sonnet-5-vs-gpt-5-4-coding-2026/</guid><description>Claude Sonnet 5 scores 82.1% on SWE-bench Verified; GPT-5.4 leads SWE-bench Pro at 57.7%. Here&amp;#39;s what the gap means for developers choosing an AI coding tool in 2026.</description><content:encoded><![CDATA[<p>Claude Sonnet 5 scores 82.1% on SWE-bench Verified and about 46% on SWE-bench Pro, while GPT-5.4 scores 57.7% on SWE-bench Pro with a comparable Verified score around 85%. For most coding workflows, Sonnet 5 delivers a stronger autonomous code-editing experience, but GPT-5.4&rsquo;s reasoning levels give it an edge in cost flexibility for high-stakes reasoning tasks.</p>
<h2 id="what-is-the-swe-bench-benchmark-and-why-does-it-matter-for-coding">What Is the SWE-bench Benchmark and Why Does It Matter for Coding?</h2>
<p>SWE-bench is the most respected real-world coding benchmark in 2026, built from actual GitHub issues submitted to production Python repositories including Django, Flask, and scikit-learn. Unlike HumanEval, which tests isolated function writing and is now saturated at 95%+ for frontier models, SWE-bench requires a model to read a bug report, navigate a real codebase, write a patch, and pass the repository&rsquo;s own test suite. The benchmark therefore tests the full software engineering loop, not just code generation from a clean prompt. SWE-bench Verified contains 500 human-validated tasks, while SWE-bench Pro uses harder tasks from private and less-contaminated repositories. As of May 2026, Claude Sonnet 5 holds an 82.1% SWE-bench Verified score (the first model to break the 80% barrier) and GPT-5.4 leads SWE-bench Pro at 57.7%, reflecting different strengths: Sonnet 5 excels at agentic, autonomous patch generation, while GPT-5.4 integrates broader reasoning and computer-use capabilities in a single model.</p>
<h3 id="swe-bench-verified-vs-pro-which-score-should-you-trust">SWE-bench Verified vs. Pro: Which Score Should You Trust?</h3>
<p>SWE-bench Verified (500 tasks) is the most widely cited leaderboard, but contamination is a documented problem: models trained on public GitHub data have likely seen many of these tasks. SWE-bench Pro uses harder, newer, or partially private tasks to reduce this effect. A model scoring 82% on Verified but only 46% on Pro may be benefiting from contamination on Verified, although Pro&rsquo;s harder tasks account for part of any drop. A model scoring 57.7% on Pro alongside a similar Verified score shows a smaller gap, which suggests better generalization. For developers making purchase decisions, Pro scores are more predictive of real-world performance.</p>
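<p>The Verified-to-Pro comparison above can be made concrete with a little arithmetic. The sketch below uses only the scores quoted in this article and treats the approximate values (~46%, ~85%) as exact. The function names and the &ldquo;retention&rdquo; ratio are illustrative choices, not an official SWE-bench metric.</p>

```python
# Scores (percent) are the figures quoted in this article; "~" values are treated as exact.
SCORES = {
    "Claude Sonnet 5": {"verified": 82.1, "pro": 46.0},
    "GPT-5.4": {"verified": 85.0, "pro": 57.7},
}


def verified_to_pro_gap(verified: float, pro: float) -> dict:
    """Return the absolute gap in percentage points and the share of the
    Verified score that survives on Pro (retention)."""
    if verified <= 0:
        raise ValueError("verified score must be positive")
    return {
        "gap_points": round(verified - pro, 1),
        "retention": round(pro / verified, 3),
    }


def rank_by_retention(scores: dict) -> list:
    """Order model names from highest to lowest Pro/Verified retention."""
    def retention(name: str) -> float:
        entry = scores[name]
        return entry["pro"] / entry["verified"]
    return sorted(scores, key=retention, reverse=True)


if __name__ == "__main__":
    for name in rank_by_retention(SCORES):
        entry = SCORES[name]
        print(name, verified_to_pro_gap(entry["verified"], entry["pro"]))
```

<p>A smaller gap is only a rough signal: it cannot separate contamination from the added difficulty of Pro tasks, so treat it as a prompt for further testing rather than a verdict.</p>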
<h2 id="claude-sonnet-5-vs-gpt-54-head-to-head-benchmark-numbers">Claude Sonnet 5 vs GPT-5.4: Head-to-Head Benchmark Numbers</h2>
<p>Claude Sonnet 5 and GPT-5.4 were released about a month apart in early 2026 (Sonnet 5 on February 3 and GPT-5.4 on March 5), and they target slightly different parts of the developer workflow. Sonnet 5 is optimized for agentic code editing, autonomous bug fixing, and extended multi-step development sessions, while GPT-5.4 is a general-purpose reasoning model that incorporates GPT-5.3 Codex&rsquo;s coding capabilities alongside computer-use and knowledge-work capabilities. Sonnet 5&rsquo;s headline SWE-bench Verified number makes it look dominant, but GPT-5.4&rsquo;s architecture gives it advantages in specific categories, particularly knowledge-heavy reasoning (83% GDPval) and desktop automation (75% OSWorld). Below is a benchmark comparison across the metrics that matter most to developers.</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Claude Sonnet 5</th>
          <th>GPT-5.4</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SWE-bench Verified</td>
          <td><strong>82.1%</strong></td>
          <td>~85%</td>
          <td>Sonnet 5 first to break 80%; GPT-5.4 benefits from later training</td>
      </tr>
      <tr>
          <td>SWE-bench Pro</td>
          <td>~46%</td>
          <td><strong>57.7%</strong></td>
          <td>Pro is harder, less contaminated</td>
      </tr>
      <tr>
          <td>HumanEval</td>
          <td>95%+</td>
          <td>95%+</td>
          <td>Saturated — no longer differentiates</td>
      </tr>
      <tr>
          <td>OSWorld (computer use)</td>
          <td>N/A</td>
          <td><strong>75%</strong></td>
          <td>GPT-5.4 exceeds human baseline (72.4%)</td>
      </tr>
      <tr>
          <td>GDPval (knowledge work)</td>
          <td>N/A</td>
          <td><strong>83%</strong></td>
          <td>44 occupations, industry-professional level</td>
      </tr>
      <tr>
          <td>Context Window</td>
          <td>1M tokens</td>
          <td>1.05M tokens</td>
          <td>Near parity</td>
      </tr>
      <tr>
          <td>Blind Human Eval (coding)</td>
          <td><strong>47%</strong> preferred</td>
          <td>29% preferred</td>
          <td>LM Council benchmarks, Q1 2026</td>
      </tr>
  </tbody>
</table>
<p>The blind human evaluation result is the most practically useful number here: in head-to-head comparisons where evaluators didn&rsquo;t know which model generated the code, Claude Sonnet 5 output was preferred 47% of the time versus 29% for GPT-5.4 and 24% for Gemini. For actual developers reviewing code diffs, Sonnet 5&rsquo;s output reads as more correct and production-ready.</p>
<h2 id="pricing-comparison-which-model-costs-less-to-run">Pricing Comparison: Which Model Costs Less to Run?</h2>
<p>Pricing at the API level is close but not identical, and the right choice depends heavily on how you use the models.</p>
<p>Claude Sonnet 5 is priced at <strong>$3 per million input tokens</strong> and <strong>$15 per million output tokens</strong> — unchanged from Claude Sonnet 4.5. With prompt caching enabled, cached reads drop to $0.30/MTok (90% reduction on repeated context), and the Batch API halves both input and output costs. For a typical autonomous coding workflow where the system prompt and repository context are cached, effective input costs drop to roughly $0.30-$0.60/MTok.</p>
<p>GPT-5.4 standard is priced at <strong>$2.50 per million input tokens</strong> and <strong>$15 per million output tokens</strong>, slightly cheaper on input and identical on output. However, GPT-5.4 applies a long-context surcharge beyond 272K tokens (2x on input, 1.5x on output), so long-context sessions with a full codebase loaded cost significantly more. GPT-5.4 Pro runs at $30/$180 per million tokens, a 12x premium appropriate only for the highest-stakes enterprise tasks.</p>
<table>
  <thead>
      <tr>
          <th>Pricing Factor</th>
          <th>Claude Sonnet 5</th>
          <th>GPT-5.4 Standard</th>
          <th>GPT-5.4 Pro</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Input (standard)</td>
          <td>$3.00/MTok</td>
          <td>$2.50/MTok</td>
          <td>$30/MTok</td>
      </tr>
      <tr>
          <td>Output</td>
          <td>$15.00/MTok</td>
          <td>$15.00/MTok</td>
          <td>$180/MTok</td>
      </tr>
      <tr>
          <td>Cached input</td>
          <td><strong>$0.30/MTok</strong></td>
          <td>~$0.63/MTok</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>Long-context surcharge</td>
          <td>None</td>
          <td>2x input, 1.5x output beyond 272K</td>
          <td>2x beyond 272K</td>
      </tr>
      <tr>
          <td>Batch discount</td>
          <td>50% off</td>
          <td>50% off</td>
          <td>N/A</td>
      </tr>
  </tbody>
</table>
<p>For teams running high-volume coding pipelines with large repository contexts, Claude Sonnet 5&rsquo;s flat pricing plus prompt caching makes it significantly cheaper in practice, despite the slightly higher nominal input rate.</p>
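<p>To see how caching and the long-context surcharge interact, here is a rough per-request cost estimator built from the list prices quoted above. Two things are assumptions rather than documented billing rules: that the surcharge applies to the whole request once the input exceeds 272K tokens, and that it also multiplies the cached-input rate. The <code>Pricing</code> class and <code>request_cost</code> function are hypothetical helpers, not part of either vendor&rsquo;s SDK.</p>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pricing:
    """List prices in USD per million tokens, as quoted in this article."""
    input_per_mtok: float
    output_per_mtok: float
    cached_per_mtok: float
    long_context_threshold: Optional[int] = None  # input tokens; None = no surcharge
    long_input_multiplier: float = 1.0
    long_output_multiplier: float = 1.0


SONNET_5 = Pricing(3.00, 15.00, 0.30)
GPT_5_4 = Pricing(2.50, 15.00, 0.63, long_context_threshold=272_000,
                  long_input_multiplier=2.0, long_output_multiplier=1.5)


def request_cost(p: Pricing, input_tokens: int, output_tokens: int,
                 cached_fraction: float = 0.0) -> float:
    """Estimate the USD cost of one request.

    Assumption: when input_tokens exceeds the threshold, the multipliers
    apply to the entire request, including cached input.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    in_mult = out_mult = 1.0
    if p.long_context_threshold is not None and input_tokens > p.long_context_threshold:
        in_mult, out_mult = p.long_input_multiplier, p.long_output_multiplier
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    input_cost = (fresh * p.input_per_mtok + cached * p.cached_per_mtok) * in_mult / 1e6
    output_cost = output_tokens * p.output_per_mtok * out_mult / 1e6
    return input_cost + output_cost


if __name__ == "__main__":
    # A 400K-token repository context with 90% of it served from cache.
    for label, pricing in (("Sonnet 5", SONNET_5), ("GPT-5.4", GPT_5_4)):
        print(label, round(request_cost(pricing, 400_000, 10_000, 0.9), 4))
```

<p>Under these assumptions a short uncached request is slightly cheaper on GPT-5.4, while the large cached request is cheaper on Sonnet 5. Check current vendor pricing pages before relying on the exact figures.</p>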
<h2 id="agentic-coding-capabilities-dev-team-mode-vs-reasoning-effort-levels">Agentic Coding Capabilities: Dev Team Mode vs. Reasoning Effort Levels</h2>
<p>This is where the two models diverge most sharply in their architecture and philosophy.</p>
<p><strong>Claude Sonnet 5</strong> introduces Anthropic Agent Teams (also called Dev Team Mode), which enables the model to spawn specialized sub-agents (Backend, QA, Technical Writer) that work in parallel on different parts of a task. In a typical workflow, a Sonnet 5 orchestrator agent reads the issue, spawns a backend agent to write the patch, a QA agent to write tests, and a documentation agent to update comments, then merges the results. This reduces wall-clock time for complex multi-file changes. According to results verified by Vals AI, the model showed a 0% code-editing error rate on internal benchmarks (down from 9% for Sonnet 4) and sustained focus across 30+ hour autonomous coding sessions.</p>
<p><strong>GPT-5.4</strong> takes a different approach with its <code>reasoning_effort</code> parameter, which accepts five levels: <code>none</code>, <code>low</code>, <code>medium</code>, <code>high</code>, and <code>xhigh</code>. This controls how many reasoning tokens the model spends before responding — an <code>xhigh</code> request costs 3-5x more than <code>low</code> but produces significantly better results on ambiguous or architecturally complex problems. GPT-5.4 also integrates native Computer Use, allowing it to control a desktop, run code in a hosted shell, and interact with GUIs directly — capabilities Sonnet 5 accesses only through external tooling.</p>
<p>For most coding use cases (bug fixes, PR generation, code review), Sonnet 5&rsquo;s parallel agent architecture is the more practical choice. For tasks that require dynamic reasoning investment — say, debugging a race condition in a distributed system — GPT-5.4&rsquo;s tunable reasoning levels give developers more precise control over the cost-quality tradeoff.</p>
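<p>The orchestrator pattern described for Sonnet 5 (fan out to role-specific sub-agents, then merge) can be illustrated with plain Python concurrency. This is a generic sketch of the pattern, not the Agent Teams API: <code>run_subagent</code> is a hypothetical placeholder where a real model call would go, and the role names are taken from the workflow described above.</p>

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(role: str, issue: str) -> str:
    """Hypothetical placeholder for a model call; returns a labeled stub result."""
    return f"[{role}] draft for: {issue}"


def orchestrate(issue: str, roles=("backend", "qa", "docs")) -> dict:
    """Run one sub-agent per role in parallel and collect results by role."""
    with ThreadPoolExecutor(max_workers=len(roles)) as pool:
        # Submit all sub-agents first so they run concurrently.
        futures = {role: pool.submit(run_subagent, role, issue) for role in roles}
        # Collect in submission order; result() blocks until that role finishes.
        return {role: future.result() for role, future in futures.items()}


if __name__ == "__main__":
    for role, output in orchestrate("Fix null check in parser").items():
        print(role, "->", output)
```

<p>In a real pipeline the merge step is the hard part: the orchestrator has to reconcile a patch, tests, and documentation that were written independently.</p>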
<h2 id="real-world-developer-performance-what-benchmark-scores-dont-tell-you">Real-World Developer Performance: What Benchmark Scores Don&rsquo;t Tell You</h2>
<p>SWE-bench scores explain what a model can do on Python GitHub issues from 2024 and earlier. They don&rsquo;t tell you how the model behaves on:</p>
<ul>
<li><strong>Your proprietary codebase</strong>: On SWE-bench Pro&rsquo;s private subset, models score markedly lower than on public tasks. Claude Opus 4.1 dropped from 22.7% to 17.8% (about a 22% relative drop) and GPT-5 dropped from 23.1% to 14.9% (about 35%) when tested on private repositories. Assume similar drops for both Sonnet 5 and GPT-5.4.</li>
<li><strong>Non-Python languages</strong>: SWE-bench is Python-only. For TypeScript, Rust, or Go codebases, both models&rsquo; real performance is unknown from public benchmarks.</li>
<li><strong>Instruction following under ambiguity</strong>: Sonnet 5&rsquo;s 0% internal error rate on code edits suggests it&rsquo;s significantly less likely to make destructive changes or hallucinate function signatures. GPT-5.4 at <code>medium</code> reasoning effort is comparable; at <code>low</code>, it makes more mistakes.</li>
<li><strong>Latency</strong>: Sonnet 5 is faster for standard code completion tasks. GPT-5.4 at <code>xhigh</code> reasoning can be slower than a full Sonnet 5 agentic session.</li>
</ul>
<p>For teams that want to validate before committing, the most reliable approach is to build a private mini-benchmark from 20-30 real bugs or feature requests from your own backlog, run both models against it, and measure pass rate. This takes a few hours but produces data specific to your language, stack, and code complexity.</p>
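<p>A private mini-benchmark like this needs very little scaffolding. The sketch below shows one possible shape, with everything in it hypothetical: <code>Task.check</code> stands in for &ldquo;apply the patch and run your test suite&rdquo;, and the two model functions are stubs where real API calls would go.</p>

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    task_id: str
    description: str
    # Returns True if the candidate patch passes; in practice this would
    # apply the patch to a checkout and run the repository's tests.
    check: Callable[[str], bool]


def pass_rate(tasks: List[Task], generate_patch: Callable[[str], str]) -> float:
    """Fraction of tasks whose generated patch passes its check."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = sum(1 for task in tasks if task.check(generate_patch(task.description)))
    return passed / len(tasks)


def compare(tasks: List[Task], models: Dict[str, Callable[[str], str]]) -> Dict[str, float]:
    """Run every model against the same task list and report pass rates."""
    return {name: pass_rate(tasks, generate) for name, generate in models.items()}


if __name__ == "__main__":
    # Toy tasks and stub models, only to show the control flow.
    tasks = [
        Task("T1", "handle empty input", lambda patch: "fix" in patch),
        Task("T2", "fix off-by-one in pager", lambda patch: "fix" in patch),
    ]
    models = {
        "model_a": lambda description: "fix: " + description,
        "model_b": lambda description: description,
    }
    print(compare(tasks, models))
```

<p>With only 20-30 tasks, small differences in pass rate are mostly noise, so look for large gaps and read the failing patches rather than trusting the number alone.</p>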
<h2 id="which-model-should-you-use-decision-framework">Which Model Should You Use? Decision Framework</h2>
<p><strong>Choose Claude Sonnet 5 if:</strong></p>
<ul>
<li>Your primary use case is autonomous bug fixing, PR generation, or full-cycle code editing</li>
<li>You want native agentic orchestration without managing separate tools</li>
<li>You&rsquo;re running high-volume pipelines where prompt caching provides a material cost advantage</li>
<li>You use Claude Code, Cursor, or any IDE integration built on the Anthropic API</li>
<li>Human preference for code quality is your top metric</li>
</ul>
<p><strong>Choose GPT-5.4 if:</strong></p>
<ul>
<li>You need computer-use automation (controlling a desktop, Selenium replacement, GUI testing)</li>
<li>You&rsquo;re building complex reasoning pipelines where <code>reasoning_effort</code> tuning matters</li>
<li>You&rsquo;re already on OpenAI&rsquo;s platform and want a drop-in replacement for gpt-5.2</li>
<li>Your use case combines knowledge work (GDPval-style tasks) with code generation</li>
<li>You need gpt-5.4-pro&rsquo;s ceiling for the most complex enterprise tasks</li>
</ul>
<p><strong>The honest middle ground</strong>: Most engineering teams in 2026 use both. Sonnet 5 runs as the default coding agent; GPT-5.4 handles specific tasks where computer-use or high-reasoning-effort mode is necessary. The pricing difference is negligible at moderate scale — the decision should be capability-first.</p>
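<p>The two-model setup described above comes down to a small routing rule. The sketch below is one way to express it. The model identifier strings, the task-type labels, and the effort chosen for each branch are all assumptions made for illustration; only the <code>reasoning_effort</code> level names come from this article.</p>

```python
from typing import Optional

# Hypothetical identifiers; check each provider's model list for the real names.
SONNET = "claude-sonnet-5"
GPT = "gpt-5.4"

# Effort levels named in this article for GPT-5.4.
EFFORT_LEVELS = ("none", "low", "medium", "high", "xhigh")

# Illustrative labels for tasks where extra reasoning budget tends to pay off.
REASONING_HEAVY = {"race_condition_debugging", "architecture_review"}


def route(task_type: str, needs_computer_use: bool = False) -> dict:
    """Pick a model (and, for GPT-5.4, a reasoning effort) for a task."""
    effort: Optional[str]
    if needs_computer_use:
        # Desktop or GUI automation goes to the model with native computer use.
        model, effort = GPT, "medium"  # effort choice is an assumption
    elif task_type in REASONING_HEAVY:
        model, effort = GPT, "high"
    else:
        # Default: bug fixes, PR generation, code review.
        model, effort = SONNET, None
    if effort is not None and effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown effort level: {effort}")
    return {"model": model, "reasoning_effort": effort}


if __name__ == "__main__":
    print(route("bug_fix"))
    print(route("race_condition_debugging"))
    print(route("gui_test", needs_computer_use=True))
```

<p>Keeping the rule in one function makes it easy to revisit as prices and benchmark results change.</p>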
<h2 id="swe-bench-2026-leaderboard-context">SWE-bench 2026 Leaderboard Context</h2>
<p>The May 2026 SWE-bench leaderboard reveals a tiered market with a clear gap between models designed for autonomous code editing and general-purpose reasoning models that include coding as one capability among many. Claude Mythos Preview leads at 93.9% Verified and 77.8% Pro — but it is not yet available through the standard Anthropic API, making it a benchmark reference point rather than a practical option. The immediately accessible tier includes Claude Opus 4.7 Adaptive (87.6% Verified), GPT-5.4 (~85% Verified, 57.7% Pro), Claude Sonnet 5 (82.1% Verified), and GPT-5.3 Codex (85% Verified). Gemini 3.1 Pro sits around 70% Verified and trails significantly on Pro, suggesting Google&rsquo;s model generalizes less well to unfamiliar codebases. The key insight from the 2026 leaderboard is that the Verified-to-Pro gap is a better signal of real-world reliability than the Verified score alone — and on that metric, GPT-5.4 currently leads the accessible market.</p>
<p>To put Sonnet 5 and GPT-5.4 in broader context, here&rsquo;s where they sit in the May 2026 leaderboard:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>SWE-bench Verified</th>
          <th>SWE-bench Pro</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Claude Mythos Preview</td>
          <td>93.9%</td>
          <td>77.8%</td>
      </tr>
      <tr>
          <td>Claude Opus 4.7 Adaptive</td>
          <td>87.6%</td>
          <td>~50%</td>
      </tr>
      <tr>
          <td>GPT-5.4</td>
          <td>~85%</td>
          <td><strong>57.7%</strong></td>
      </tr>
      <tr>
          <td>Claude Sonnet 5</td>
          <td><strong>82.1%</strong></td>
          <td>~46%</td>
      </tr>
      <tr>
          <td>GPT-5.3 Codex</td>
          <td>85.0%</td>
          <td>~45%</td>
      </tr>
      <tr>
          <td>Gemini 3.1 Pro</td>
          <td>~70%</td>
          <td>~38%</td>
      </tr>
  </tbody>
</table>
<p>Claude Mythos Preview leads both leaderboards but is not yet available for general API access. For what&rsquo;s actually available to developers today, the Sonnet 5 / GPT-5.4 pairing represents the effective frontier.</p>
<h2 id="faq">FAQ</h2>
<p><strong>Is Claude Sonnet 5 better than GPT-5.4 for coding?</strong>
In blind human evaluations, Claude Sonnet 5&rsquo;s output is preferred for autonomous code editing and patch generation. GPT-5.4 leads on SWE-bench Pro and has an edge for tasks requiring computer-use or high-reasoning-effort mode. For pure coding workflows, Sonnet 5 is the current preference; for multi-modal or reasoning-heavy tasks, GPT-5.4 is competitive.</p>
<p><strong>What does 82.1% SWE-bench mean in practice?</strong>
It means Claude Sonnet 5 successfully fixed 82.1% of 500 real GitHub issues — reading the bug report, finding the right file, writing a patch, and passing the repository&rsquo;s own tests — without human guidance. This is the highest score achieved by any generally available model as of February 2026.</p>
<p><strong>How do I compare SWE-bench Verified vs. Pro scores?</strong>
Verified scores are inflated by data contamination (models trained on public GitHub data have likely seen these tasks). Pro scores use harder, less-contaminated tasks and are more predictive of real-world performance. A model with a large gap between Verified and Pro scores (like dropping from 82% to 46%) may be partially memorizing the Verified dataset.</p>
<p><strong>What is GPT-5.4&rsquo;s reasoning_effort parameter?</strong>
It&rsquo;s a parameter (<code>none</code>, <code>low</code>, <code>medium</code>, <code>high</code>, <code>xhigh</code>) that controls how many reasoning tokens GPT-5.4 uses before responding. Higher settings improve accuracy on complex problems but cost 3-5x more. For routine code completion, <code>low</code> or <code>medium</code> is cost-effective; for architectural decisions or complex debugging, <code>high</code> or <code>xhigh</code> is recommended.</p>
<p><strong>Can I use both Claude Sonnet 5 and GPT-5.4 in the same pipeline?</strong>
Yes, and many teams do. A common pattern is to use Sonnet 5 as the primary autonomous coding agent (leveraging Agent Teams for parallel execution) and route specific subtasks — computer-use automation, reasoning-heavy debugging — to GPT-5.4. Both models support function calling, tool use, and similar API patterns, making integration straightforward.</p>
]]></content:encoded></item></channel></rss>