<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Claude Sonnet 5 on RockB</title><link>https://baeseokjae.github.io/tags/claude-sonnet-5/</link><description>Recent content in Claude Sonnet 5 on RockB</description><image><title>RockB</title><url>https://baeseokjae.github.io/images/og-default.png</url><link>https://baeseokjae.github.io/images/og-default.png</link></image><generator>Hugo</generator><language>en-us</language><lastBuildDate>Mon, 18 May 2026 06:04:48 +0000</lastBuildDate><atom:link href="https://baeseokjae.github.io/tags/claude-sonnet-5/index.xml" rel="self" type="application/rss+xml"/><item><title>Claude Sonnet 5 vs GPT-5.4 for Coding: SWE-bench Benchmark Comparison 2026</title><link>https://baeseokjae.github.io/posts/claude-sonnet-5-vs-gpt-5-4-coding-2026/</link><pubDate>Mon, 18 May 2026 06:04:48 +0000</pubDate><guid>https://baeseokjae.github.io/posts/claude-sonnet-5-vs-gpt-5-4-coding-2026/</guid><description>Claude Sonnet 5 scores 82.1% on SWE-bench Verified, while GPT-5.4 leads SWE-bench Pro at 57.7%. Here&amp;#39;s what the gap means for developers choosing an AI coding tool in 2026.</description><content:encoded><![CDATA[<p>Claude Sonnet 5 scores 82.1% on SWE-bench Verified and about 46% on SWE-bench Pro, while GPT-5.4 scores 57.7% on SWE-bench Pro and about 85% on Verified. For most coding workflows, Sonnet 5 delivers a stronger autonomous code-editing experience, but GPT-5.4&rsquo;s reasoning levels give it an edge in cost-flexibility for high-stakes reasoning tasks.</p>
<h2 id="what-is-the-swe-bench-benchmark-and-why-does-it-matter-for-coding">What Is the SWE-bench Benchmark and Why Does It Matter for Coding?</h2>
<p>SWE-bench is one of the most widely cited real-world coding benchmarks in 2026, built from actual GitHub issues submitted to production Python repositories including Django, Flask, and scikit-learn. Unlike HumanEval, which tests isolated function writing and is now saturated at 95%+ for frontier models, SWE-bench requires a model to read a bug report, navigate a real codebase, write a patch, and pass the repository&rsquo;s own test suite. The benchmark therefore tests the full software engineering loop, not just code generation from a clean prompt. SWE-bench Verified contains 500 human-validated tasks, while SWE-bench Pro uses harder tasks from private and less-contaminated repositories. As of May 2026, Claude Sonnet 5 holds an 82.1% SWE-bench Verified score (the first model to break the 80% barrier) and GPT-5.4 leads SWE-bench Pro among generally available models at 57.7%, reflecting different strengths: Sonnet 5 excels at agentic, autonomous patch generation, while GPT-5.4 integrates broader reasoning and computer-use capabilities in a single model.</p>
<h3 id="swe-bench-verified-vs-pro-which-score-should-you-trust">SWE-bench Verified vs. Pro: Which Score Should You Trust?</h3>
<p>SWE-bench Verified (500 tasks) is the most widely cited leaderboard, but contamination is a documented problem: models trained on public GitHub data have likely seen many of these tasks. SWE-bench Pro uses harder, newer, or partially private tasks to reduce this effect. A large gap, such as 82% on Verified but only 46% on Pro, may signal contamination, although part of the drop reflects Pro simply being harder. A smaller gap, such as GPT-5.4&rsquo;s roughly 85% on Verified against 57.7% on Pro, suggests better generalization. For developers making purchase decisions, Pro scores are more predictive of real-world performance.</p>
<h2 id="claude-sonnet-5-vs-gpt-54-head-to-head-benchmark-numbers">Claude Sonnet 5 vs GPT-5.4: Head-to-Head Benchmark Numbers</h2>
<p>Claude Sonnet 5 and GPT-5.4 were released about a month apart in early 2026 (Sonnet 5 on February 3 and GPT-5.4 on March 5), and they target slightly different parts of the developer workflow. Sonnet 5 is optimized for agentic code editing, autonomous bug fixing, and extended multi-step development sessions, while GPT-5.4 is a general-purpose reasoning model that combines GPT-5.3 Codex&rsquo;s coding capabilities with computer-use and knowledge-work capabilities. Sonnet 5&rsquo;s headline SWE-bench Verified score is strong, but GPT-5.4 has advantages in specific categories, particularly knowledge-heavy reasoning (83% GDPval) and desktop automation (75% OSWorld). Below is a benchmark comparison across the metrics that matter most to developers.</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Claude Sonnet 5</th>
          <th>GPT-5.4</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SWE-bench Verified</td>
          <td><strong>82.1%</strong></td>
          <td>~85%</td>
          <td>Sonnet 5 first to break 80%; GPT-5.4 benefits from later training</td>
      </tr>
      <tr>
          <td>SWE-bench Pro</td>
          <td>~46%</td>
          <td><strong>57.7%</strong></td>
          <td>Pro is harder, less contaminated</td>
      </tr>
      <tr>
          <td>HumanEval</td>
          <td>95%+</td>
          <td>95%+</td>
          <td>Saturated — no longer differentiates</td>
      </tr>
      <tr>
          <td>OSWorld (computer use)</td>
          <td>N/A</td>
          <td><strong>75%</strong></td>
          <td>GPT-5.4 exceeds human baseline (72.4%)</td>
      </tr>
      <tr>
          <td>GDPval (knowledge work)</td>
          <td>N/A</td>
          <td><strong>83%</strong></td>
          <td>44 occupations, industry-professional level</td>
      </tr>
      <tr>
          <td>Context Window</td>
          <td><strong>1M tokens</strong></td>
          <td>1.05M tokens</td>
          <td>Near parity</td>
      </tr>
      <tr>
          <td>Blind Human Eval (coding)</td>
          <td><strong>47%</strong> preferred</td>
          <td>29% preferred</td>
          <td>LM Council benchmarks, Q1 2026</td>
      </tr>
  </tbody>
</table>
<p>The blind human evaluation result is the most practically useful number here: in head-to-head comparisons where evaluators didn&rsquo;t know which model generated the code, Claude Sonnet 5 output was preferred 47% of the time versus 29% for GPT-5.4 and 24% for Gemini. For actual developers reviewing code diffs, Sonnet 5&rsquo;s output reads as more correct and production-ready.</p>
<h2 id="pricing-comparison-which-model-costs-less-to-run">Pricing Comparison: Which Model Costs Less to Run?</h2>
<p>Pricing at the API level is close but not identical, and the right choice depends heavily on how you use the models.</p>
<p>Claude Sonnet 5 is priced at <strong>$3 per million input tokens</strong> and <strong>$15 per million output tokens</strong> — unchanged from Claude Sonnet 4.5. With prompt caching enabled, cached reads drop to $0.30/MTok (90% reduction on repeated context), and the Batch API halves both input and output costs. For a typical autonomous coding workflow where the system prompt and repository context are cached, effective input costs drop to roughly $0.30-$0.60/MTok.</p>
<p>GPT-5.4 standard is priced at <strong>$2.50 per million input tokens</strong> and <strong>$15 per million output tokens</strong>, slightly cheaper on input and identical on output. However, GPT-5.4&rsquo;s pricing rises beyond 272K tokens of context (2x on input, 1.5x on output), so long-context sessions with a full codebase loaded cost significantly more. GPT-5.4 Pro runs at $30/$180 per million tokens, a 12x premium appropriate only for the highest-stakes enterprise tasks.</p>
<table>
  <thead>
      <tr>
          <th>Pricing Factor</th>
          <th>Claude Sonnet 5</th>
          <th>GPT-5.4 Standard</th>
          <th>GPT-5.4 Pro</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Input (standard)</td>
          <td>$3.00/MTok</td>
          <td>$2.50/MTok</td>
          <td>$30/MTok</td>
      </tr>
      <tr>
          <td>Output</td>
          <td>$15.00/MTok</td>
          <td>$15.00/MTok</td>
          <td>$180/MTok</td>
      </tr>
      <tr>
          <td>Cached input</td>
          <td><strong>$0.30/MTok</strong></td>
          <td>~$0.63/MTok</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>Long-context surcharge</td>
          <td>None</td>
          <td>2x input, 1.5x output beyond 272K</td>
          <td>2x beyond 272K</td>
      </tr>
      <tr>
          <td>Batch discount</td>
          <td><strong>50% off</strong></td>
          <td>50% off</td>
          <td>N/A</td>
      </tr>
  </tbody>
</table>
<p>For teams running high-volume coding pipelines with large repository contexts, Claude Sonnet 5&rsquo;s flat pricing plus prompt caching makes it significantly cheaper in practice, despite the slightly higher nominal input rate.</p>
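<p>To make the comparison concrete, here is a minimal Python sketch of the input-side arithmetic, using only the rates from the table above. It assumes the long-context multiplier applies only to tokens past the 272K threshold and that cached tokens are surcharged the same way. Neither detail is confirmed by the pricing figures quoted here, and the function names are illustrative.</p>
<pre><code class="language-python"># Rates in USD per million tokens, taken from the pricing table above.
SONNET_INPUT = 3.00
SONNET_CACHED = 0.30
GPT_INPUT = 2.50
GPT_CACHED = 0.63
GPT_LONG_CONTEXT_THRESHOLD = 272_000  # tokens
GPT_LONG_CONTEXT_MULTIPLIER = 2.0     # input-side surcharge


def sonnet_input_cost(tokens, cached_fraction=0.0):
    """Input cost in USD with flat pricing and a cached share of the prompt."""
    cached = tokens * cached_fraction
    fresh = tokens - cached
    return (fresh * SONNET_INPUT + cached * SONNET_CACHED) / 1_000_000


def gpt_input_cost(tokens, cached_fraction=0.0):
    """Input cost in USD; tokens past the threshold count double (assumption)."""
    blended = GPT_INPUT * (1 - cached_fraction) + GPT_CACHED * cached_fraction
    base_tokens = min(tokens, GPT_LONG_CONTEXT_THRESHOLD)
    extra_tokens = max(tokens - GPT_LONG_CONTEXT_THRESHOLD, 0)
    weighted = base_tokens + extra_tokens * GPT_LONG_CONTEXT_MULTIPLIER
    return weighted * blended / 1_000_000


if __name__ == "__main__":
    for tokens in (100_000, 800_000):
        s = sonnet_input_cost(tokens, cached_fraction=0.9)
        g = gpt_input_cost(tokens, cached_fraction=0.9)
        print(f"{tokens:,} input tokens: Sonnet 5 ${s:.3f}, GPT-5.4 ${g:.3f}")
</code></pre>
<p>Output tokens and cache-write fees are left out, so treat the result as a lower bound on real spend.</p>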
<h2 id="agentic-coding-capabilities-dev-team-mode-vs-reasoning-effort-levels">Agentic Coding Capabilities: Dev Team Mode vs. Reasoning Effort Levels</h2>
<p>This is where the two models diverge most sharply in their architecture and philosophy.</p>
<p><strong>Claude Sonnet 5</strong> introduces Anthropic Agent Teams (also called Dev Team Mode), which enables the model to spawn specialized sub-agents (Backend, QA, Technical Writer) that work in parallel on different parts of a task. In a typical workflow, a Sonnet 5 orchestrator agent reads the issue, spawns a backend agent to write the patch, a QA agent to write tests, and a documentation agent to update comments, then merges the results. This reduces wall-clock time for complex multi-file changes. On internal benchmarks verified by Vals AI, the model showed a 0% code-editing error rate (down from 9% for Sonnet 4) and sustained focus across autonomous coding sessions of 30+ hours.</p>
<p><strong>GPT-5.4</strong> takes a different approach with its <code>reasoning_effort</code> parameter, which accepts five levels: <code>none</code>, <code>low</code>, <code>medium</code>, <code>high</code>, and <code>xhigh</code>. This controls how many reasoning tokens the model spends before responding — an <code>xhigh</code> request costs 3-5x more than <code>low</code> but produces significantly better results on ambiguous or architecturally complex problems. GPT-5.4 also integrates native Computer Use, allowing it to control a desktop, run code in a hosted shell, and interact with GUIs directly — capabilities Sonnet 5 accesses only through external tooling.</p>
<p>For most coding use cases (bug fixes, PR generation, code review), Sonnet 5&rsquo;s parallel agent architecture is the more practical choice. For tasks that require dynamic reasoning investment — say, debugging a race condition in a distributed system — GPT-5.4&rsquo;s tunable reasoning levels give developers more precise control over the cost-quality tradeoff.</p>
<h2 id="real-world-developer-performance-what-benchmark-scores-dont-tell-you">Real-World Developer Performance: What Benchmark Scores Don&rsquo;t Tell You</h2>
<p>SWE-bench scores show what a model can do on Python GitHub issues from 2024 and earlier. They don&rsquo;t tell you how the model behaves on:</p>
<ul>
<li><strong>Your proprietary codebase</strong>: On SWE-bench Pro&rsquo;s private subset, models score noticeably lower than on its public set. Claude Opus 4.1 dropped from 22.7% to 17.8% (about 22% lower in relative terms) and GPT-5 dropped from 23.1% to 14.9% (about 35% lower) when tested on private repositories. Assume similar drops for both Sonnet 5 and GPT-5.4.</li>
<li><strong>Non-Python languages</strong>: SWE-bench Verified is Python-only. For TypeScript, Rust, or Go codebases, neither model&rsquo;s real performance can be read off that score.</li>
<li><strong>Instruction following under ambiguity</strong>: Sonnet 5&rsquo;s 0% internal error rate on code edits suggests it&rsquo;s significantly less likely to make destructive changes or hallucinate function signatures. GPT-5.4 at <code>medium</code> reasoning effort is comparable; at <code>low</code>, it makes more mistakes.</li>
<li><strong>Latency</strong>: Sonnet 5 is faster for standard code completion tasks. GPT-5.4 at <code>xhigh</code> reasoning can be slower than a full Sonnet 5 agentic session.</li>
</ul>
<p>For teams that want to validate before committing, the most reliable approach is to build a private mini-benchmark from 20-30 real bugs or feature requests from your own backlog, run both models against it, and measure pass rate. This takes a few hours but produces data specific to your language, stack, and code complexity.</p>
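<p>The scoring step of such a mini-benchmark can be sketched in a few lines of Python. This assumes you have already run each model on each task and recorded whether the task&rsquo;s own tests passed. The model labels, task IDs, and results below are placeholders, not measurements.</p>
<pre><code class="language-python">from collections import defaultdict


def pass_rates(results):
    """Return {model: fraction of tasks passed} from (model, task_id, passed) rows."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for model, _task_id, ok in results:
        total[model] += 1
        passed[model] += 1 if ok else 0
    return {model: passed[model] / total[model] for model in total}


# Placeholder rows: each one records whether the model's patch made the
# task's own tests pass. Replace with results from your own backlog.
results = [
    ("model-a", "BUG-101", True),
    ("model-a", "BUG-102", False),
    ("model-b", "BUG-101", True),
    ("model-b", "BUG-102", True),
]

for model, rate in sorted(pass_rates(results).items()):
    print(f"{model}: {rate:.0%} of tasks passed")
</code></pre>
<p>With only 20-30 tasks, a difference of one or two passes is within noise, so it helps to read the failing patches as well as the totals.</p>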
<h2 id="which-model-should-you-use-decision-framework">Which Model Should You Use? Decision Framework</h2>
<p><strong>Choose Claude Sonnet 5 if:</strong></p>
<ul>
<li>Your primary use case is autonomous bug fixing, PR generation, or full-cycle code editing</li>
<li>You want native agentic orchestration without managing separate tools</li>
<li>You&rsquo;re running high-volume pipelines where prompt caching provides a material cost advantage</li>
<li>You use Claude Code, Cursor, or any IDE integration built on the Anthropic API</li>
<li>Human preference for code quality is your top metric</li>
</ul>
<p><strong>Choose GPT-5.4 if:</strong></p>
<ul>
<li>You need computer-use automation (controlling a desktop, Selenium replacement, GUI testing)</li>
<li>You&rsquo;re building complex reasoning pipelines where <code>reasoning_effort</code> tuning matters</li>
<li>You&rsquo;re already on OpenAI&rsquo;s platform and want a drop-in replacement for gpt-5.2</li>
<li>Your use case combines knowledge work (GDPval-style tasks) with code generation</li>
<li>You need gpt-5.4-pro&rsquo;s ceiling for the most complex enterprise tasks</li>
</ul>
<p><strong>The honest middle ground</strong>: Most engineering teams in 2026 use both. Sonnet 5 runs as the default coding agent; GPT-5.4 handles specific tasks where computer-use or high-reasoning-effort mode is necessary. The pricing difference is negligible at moderate scale — the decision should be capability-first.</p>
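<p>A two-model setup like this usually comes down to a small routing table. The sketch below is one hypothetical way to express it in Python. The task categories, the <code>gpt-5.4</code> model string, and the effort level chosen for each category are illustrative assumptions, not documented defaults.</p>
<pre><code class="language-python">from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    model: str
    reasoning_effort: Optional[str] = None  # only meaningful for GPT-5.4


# Hypothetical task categories mapped to the split described above.
ROUTES = {
    "bug_fix": Route("claude-sonnet-5"),
    "pr_generation": Route("claude-sonnet-5"),
    "code_review": Route("claude-sonnet-5"),
    "computer_use": Route("gpt-5.4", reasoning_effort="medium"),
    "hard_debugging": Route("gpt-5.4", reasoning_effort="xhigh"),
}

# Unknown categories fall back to the default coding agent.
DEFAULT_ROUTE = Route("claude-sonnet-5")


def route_task(kind):
    """Pick a model (and effort level, if any) for a task category."""
    return ROUTES.get(kind, DEFAULT_ROUTE)
</code></pre>
<p>Keeping the mapping in data rather than in branching logic makes it easy to adjust once your own benchmark results come in.</p>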
<h2 id="swe-bench-2026-leaderboard-context">SWE-bench 2026 Leaderboard Context</h2>
<p>The May 2026 SWE-bench leaderboard reveals a tiered market with a clear gap between models designed for autonomous code editing and general-purpose reasoning models that include coding as one capability among many. Claude Mythos Preview leads at 93.9% Verified and 77.8% Pro — but it is not yet available through the standard Anthropic API, making it a benchmark reference point rather than a practical option. The immediately accessible tier includes Claude Opus 4.7 Adaptive (87.6% Verified), GPT-5.4 (~85% Verified, 57.7% Pro), Claude Sonnet 5 (82.1% Verified), and GPT-5.3 Codex (85% Verified). Gemini 3.1 Pro sits around 70% Verified and trails significantly on Pro, suggesting Google&rsquo;s model generalizes less well to unfamiliar codebases. The key insight from the 2026 leaderboard is that the Verified-to-Pro gap is a better signal of real-world reliability than the Verified score alone — and on that metric, GPT-5.4 currently leads the accessible market.</p>
<p>To put Sonnet 5 and GPT-5.4 in broader context, here&rsquo;s where they sit in the May 2026 leaderboard:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>SWE-bench Verified</th>
          <th>SWE-bench Pro</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Claude Mythos Preview</td>
          <td>93.9%</td>
          <td>77.8%</td>
      </tr>
      <tr>
          <td>Claude Opus 4.7 Adaptive</td>
          <td>87.6%</td>
          <td>~50%</td>
      </tr>
      <tr>
          <td>GPT-5.4</td>
          <td>~85%</td>
          <td><strong>57.7%</strong></td>
      </tr>
      <tr>
          <td>Claude Sonnet 5</td>
          <td><strong>82.1%</strong></td>
          <td>~46%</td>
      </tr>
      <tr>
          <td>GPT-5.3 Codex</td>
          <td>85.0%</td>
          <td>~45%</td>
      </tr>
      <tr>
          <td>Gemini 3.1 Pro</td>
          <td>~70%</td>
          <td>~38%</td>
      </tr>
  </tbody>
</table>
<p>Because Claude Mythos Preview is not yet available for general API access, the practical choice for most developers today is among the models below it, with Sonnet 5 and GPT-5.4 as the main contenders in this comparison.</p>
<h2 id="faq">FAQ</h2>
<p><strong>Is Claude Sonnet 5 better than GPT-5.4 for coding?</strong>
In blind human evaluations, Claude Sonnet 5&rsquo;s output is preferred for autonomous code editing and patch generation. GPT-5.4 scores higher on SWE-bench Pro (57.7% versus about 46%) and slightly higher on Verified, and it has an edge for tasks requiring computer-use or high-reasoning-effort mode. For pure coding workflows, Sonnet 5 is the current preference; for multi-modal or reasoning-heavy tasks, GPT-5.4 is competitive.</p>
<p><strong>What does 82.1% SWE-bench mean in practice?</strong>
It means Claude Sonnet 5 successfully fixed 82.1% of 500 real GitHub issues — reading the bug report, finding the right file, writing a patch, and passing the repository&rsquo;s own tests — without human guidance. This is the highest score achieved by any generally available model as of February 2026.</p>
<p><strong>How do I compare SWE-bench Verified vs. Pro scores?</strong>
Verified scores may be inflated by data contamination (models trained on public GitHub data have likely seen these tasks). Pro scores use harder, less-contaminated tasks and are more predictive of real-world performance. A model with a large gap between Verified and Pro scores (like dropping from 82% to 46%) may be partially memorizing the Verified dataset, though some of the gap reflects Pro being harder.</p>
<p><strong>What is GPT-5.4&rsquo;s reasoning_effort parameter?</strong>
It&rsquo;s a parameter (<code>none</code>, <code>low</code>, <code>medium</code>, <code>high</code>, <code>xhigh</code>) that controls how many reasoning tokens GPT-5.4 uses before responding. Higher settings improve accuracy on complex problems, but an <code>xhigh</code> request costs 3-5x more than a <code>low</code> one. For routine code completion, <code>low</code> or <code>medium</code> is cost-effective; for architectural decisions or complex debugging, <code>high</code> or <code>xhigh</code> is recommended.</p>
<p><strong>Can I use both Claude Sonnet 5 and GPT-5.4 in the same pipeline?</strong>
Yes, and many teams do. A common pattern is to use Sonnet 5 as the primary autonomous coding agent (leveraging Agent Teams for parallel execution) and route specific subtasks — computer-use automation, reasoning-heavy debugging — to GPT-5.4. Both models support function calling, tool use, and similar API patterns, making integration straightforward.</p>
]]></content:encoded></item><item><title>Claude Sonnet 5 Review: 82.1% SWE-bench, Dev Team Mode &amp; Pricing Guide</title><link>https://baeseokjae.github.io/posts/claude-sonnet-5-review-2026/</link><pubDate>Sun, 17 May 2026 09:04:37 +0000</pubDate><guid>https://baeseokjae.github.io/posts/claude-sonnet-5-review-2026/</guid><description>Claude Sonnet 5 hits 82.1% SWE-bench Verified, introduces Dev Team multi-agent mode, and costs $3/MTok. Full developer review with pricing and migration guide.</description><content:encoded><![CDATA[<p>Claude Sonnet 5 is Anthropic&rsquo;s mid-tier frontier model released February 3, 2026, scoring 82.1% on SWE-bench Verified — the highest coding benchmark score ever recorded at launch. It introduces Dev Team multi-agent mode, a 1 million token context window, and holds the same $3 per million input token price as its predecessor. For most development teams, it&rsquo;s the most capable coding model available at a non-flagship price.</p>
<h2 id="what-is-claude-sonnet-5-fennec-model-overview--release-details">What Is Claude Sonnet 5? (Fennec Model Overview &amp; Release Details)</h2>
<p>Claude Sonnet 5 — internally codenamed &ldquo;Fennec&rdquo; after the large-eared desert fox — is Anthropic&rsquo;s third-generation Sonnet model and the first AI model to break the 80% ceiling on SWE-bench Verified. It was officially released on February 3, 2026, simultaneously across the Anthropic API, Amazon Bedrock, and Google Vertex AI, with the identifier <code>claude-sonnet-5@20260203</code> first spotted in Vertex AI deployment logs days before the announcement. The codename Fennec is not arbitrary marketing: it nods to the model&rsquo;s 1 million token context window — metaphorically &ldquo;large ears&rdquo; for listening to entire codebases. Unlike Claude Opus 4.7, which targets deep multi-step reasoning at a premium price, Sonnet 5 is positioned as the workhorse model for engineering teams who need frontier-grade coding capability without flagship-grade cost. It replaced Claude Sonnet 4.6 as the default model for Claude Code Free and Pro users on launch day. The model runs on Google&rsquo;s Antigravity TPU infrastructure, which Anthropic credits for the latency improvements over Sonnet 4.6. For API users, the migration path from <code>claude-sonnet-4-6</code> to <code>claude-sonnet-5</code> is a one-line model ID change — same tool format, same system prompt conventions.</p>
<h2 id="821-swe-bench-verified--what-this-score-actually-means-for-developers">82.1% SWE-bench Verified — What This Score Actually Means for Developers</h2>
<p>SWE-bench Verified is the most rigorous public benchmark for autonomous software engineering: a model receives a real GitHub issue and a full repository, then must write, test, and verify a patch entirely on its own — no hints, no human guidance. Claude Sonnet 5&rsquo;s 82.1% score means it successfully resolved over four in five of these real-world bugs on the first attempt. To put that in historical context: before February 2026, the entire AI industry was stalled in the high 70s on SWE-bench Verified — the 80% barrier had been treated as a near-term ceiling by researchers. Sonnet 5 broke it by 2.1 percentage points on launch day. For comparison: Gemini 3.1 Pro sits at 80.6%, GPT-5.4 at approximately 80%, and Claude Sonnet 4.6 at 79.6%. The practical implication for development teams is not just percentage points — it&rsquo;s the class of task the model can handle reliably. At 82.1%, Sonnet 5 can take a raw bug report with no additional context and independently write, test, and verify a patch. That&rsquo;s junior developer parity for isolated, well-scoped bugs. It still struggles with ambiguous cross-system issues requiring institutional knowledge, but for the bread-and-butter of ticket work — fixing broken tests, resolving regression bugs, implementing clearly specified features — it&rsquo;s more reliable than many junior engineers in controlled conditions.</p>
<h3 id="how-swe-bench-compares-to-real-development-work">How SWE-bench Compares to Real Development Work</h3>
<p>SWE-bench tasks are isolated from production context, which means scores don&rsquo;t translate directly to &ldquo;can replace an engineer.&rdquo; The benchmark tests patch-writing on single-file or small-scope changes. Real codebases involve implicit conventions, deployment dependencies, and stakeholder judgment that no benchmark captures. Teams using Sonnet 5 report the most gains in: isolated bug fixes with clear reproduction steps, adding tests to existing functions, and implementing well-specified API endpoint changes. The weakest results appear in cross-service refactors and tasks requiring knowledge of undocumented internal conventions.</p>
<h2 id="dev-team-mode-explained-multi-agent-collaboration-in-practice">Dev Team Mode Explained: Multi-Agent Collaboration in Practice</h2>
<p>Dev Team mode is Claude Sonnet 5&rsquo;s most architecturally novel feature: when enabled, the model acts as a Team Leader agent that decomposes complex tasks into parallel sub-tasks and delegates them to specialized sub-agents (Backend Specialist, QA Tester, Technical Writer, Frontend Engineer, and others depending on the task). Instead of a single model reasoning sequentially through a large codebase change, Dev Team spawns parallel reasoning threads that work simultaneously and report back to a coordinator. A task like &ldquo;add OAuth2 login to our API&rdquo; might split into: the Backend Specialist drafting the authentication middleware, the QA Tester writing integration test cases, and the Technical Writer generating the API documentation, all executing in parallel. The Team Leader then reconciles outputs, resolves conflicts, and delivers a unified result. In the Claude Code IDE extension, users can monitor parallel sub-agent progress in real time through the agent panel. Dev Team mode is particularly effective for tasks that naturally decompose: greenfield feature implementation, multi-layer test suite generation, and large-scale documentation updates. It is less useful for tightly coupled changes where agents would block on each other&rsquo;s outputs, and early users report occasional coordination overhead on simple tasks that a single-agent pass would handle faster.</p>
<h3 id="background-reasoning-what-changed-from-visible-thinking">Background Reasoning: What Changed from Visible Thinking</h3>
<p>Earlier Claude models with extended thinking surfaced reasoning as visible <code>&lt;thinking&gt;</code> blocks, which added latency and noise for users who just wanted the answer. Sonnet 5 introduces background reasoning: the model still performs extended multi-step reasoning, but it runs internally without producing visible thinking output. The result is faster wall-clock response times for most tasks and cleaner output for production integrations. Background reasoning is always active and cannot be toggled off. Developers who relied on visible thinking chains for debugging or explainability will need to ask the model to summarize its reasoning on request rather than inspecting raw <code>&lt;thinking&gt;</code> blocks.</p>
<h2 id="claude-sonnet-5-pricing-guide-api-costs-caching-and-batch-discounts">Claude Sonnet 5 Pricing Guide: API Costs, Caching, and Batch Discounts</h2>
<p>Claude Sonnet 5 is priced at $3.00 per million input tokens and $15.00 per million output tokens — identical to Claude Sonnet 4.6&rsquo;s pricing. This is the most significant pricing fact for teams evaluating the upgrade: you get the benchmark jump from 79.6% to 82.1% SWE-bench Verified at zero additional cost per token. For context, Claude Opus 4.7 costs $5.00 per million input tokens and $25.00 per million output tokens — Sonnet 5 delivers comparable coding performance at 60% of the input cost. Prompt caching provides a 90% discount on cached reads: $0.30 per million tokens versus $3.00 standard, which is critical for workflows that repeatedly process the same large codebase context. The Batch API offers a 50% discount for async workloads — $1.50 input / $7.50 output per million tokens — making it the right choice for CI/CD pipeline integrations, nightly code review passes, and bulk test generation. Claude Managed Agents adds $0.08 per session-hour of runtime on top of standard token costs, covering Dev Team mode sessions and other agent orchestration overhead.</p>
<h3 id="full-pricing-breakdown-table">Full Pricing Breakdown Table</h3>
<table>
  <thead>
      <tr>
          <th>Tier</th>
          <th>Input ($/MTok)</th>
          <th>Output ($/MTok)</th>
          <th>Best For</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Sonnet 5 Standard</td>
          <td>$3.00</td>
          <td>$15.00</td>
          <td>Interactive coding, API calls</td>
      </tr>
      <tr>
          <td>Sonnet 5 Cached Reads</td>
          <td>$0.30</td>
          <td>$15.00</td>
          <td>Repeated codebase context</td>
      </tr>
      <tr>
          <td>Sonnet 5 Batch API</td>
          <td>$1.50</td>
          <td>$7.50</td>
          <td>CI/CD, bulk jobs</td>
      </tr>
      <tr>
          <td>Managed Agents</td>
          <td>$3.00 + $0.08 per session-hour</td>
          <td>$15.00</td>
          <td>Dev Team mode sessions</td>
      </tr>
      <tr>
          <td>Opus 4.7 Standard</td>
          <td>$5.00</td>
          <td>$25.00</td>
          <td>Deep reasoning, complex analysis</td>
      </tr>
  </tbody>
</table>
<p>For most development workflows, prompt caching plus Batch API can reduce effective Sonnet 5 costs to well below $1.00 per million input tokens for cached context. Teams running large-context codebase reviews should enable prompt caching as the default, not an optimization — the 90% discount makes it economically irrational not to.</p>
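<p>The tiers in the table can be combined into a rough per-job estimate. The Python sketch below uses only the rates listed above. It assumes the batch discount also applies to cached reads and it ignores cache-write fees, which this guide does not list. The function and parameter names are illustrative.</p>
<pre><code class="language-python"># Sonnet 5 rates from the table above, in USD.
INPUT = 3.00          # per million input tokens
OUTPUT = 15.00        # per million output tokens
CACHED_READ = 0.30    # per million cached input tokens
BATCH_DISCOUNT = 0.50
SESSION_HOUR = 0.08   # Managed Agents runtime, per session-hour


def estimate_cost(input_tokens, output_tokens, cached_fraction=0.0,
                  batch=False, session_hours=0.0):
    """Rough job cost in USD; cache-write fees are not modeled."""
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    token_cost = (fresh * INPUT + cached * CACHED_READ
                  + output_tokens * OUTPUT) / 1_000_000
    if batch:
        # Assumption: the batch discount stacks with cached-read pricing.
        token_cost *= 1 - BATCH_DISCOUNT
    return token_cost + session_hours * SESSION_HOUR


if __name__ == "__main__":
    nightly = estimate_cost(2_000_000, 100_000, cached_fraction=0.9, batch=True)
    print(f"nightly review pass: ${nightly:.2f}")
</code></pre>
<p>Output tokens dominate the total in this example, so caching helps most on jobs that read far more than they write.</p>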
<h2 id="claude-sonnet-5-vs-gpt-54-vs-gemini-31-pro-head-to-head-benchmarks">Claude Sonnet 5 vs GPT-5.4 vs Gemini 3.1 Pro: Head-to-Head Benchmarks</h2>
<p>Claude Sonnet 5 leads the current coding model field at 82.1% SWE-bench Verified, but benchmark leadership doesn&rsquo;t tell the whole story. GPT-5.4 scores approximately 80% on SWE-bench and outperforms Sonnet 5 on Terminal-Bench at 75.1%, a benchmark focused on shell command execution, CLI tool use, and terminal-native tasks. GPT-5.4 also benefits from deep GitHub Copilot integration, making it the default choice for teams heavily invested in the Microsoft/VS Code ecosystem. Gemini 3.1 Pro scores 80.6% on SWE-bench and is priced more aggressively at $2.00 input / $12.00 output per million tokens, making it the best price-to-performance option if Sonnet 5&rsquo;s 1.5 percentage point benchmark edge doesn&rsquo;t justify the cost premium for your workload. Sonnet 5&rsquo;s concrete advantages are multi-file refactoring, code review at repository scale, and tasks requiring deep cross-file dependency understanding, areas where the 1 million token context window provides a structural advantage over GPT-5.4&rsquo;s 128K limit, though not over Gemini 3.1 Pro&rsquo;s 2M.</p>
<h3 id="model-comparison-table">Model Comparison Table</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>SWE-bench</th>
          <th>Context</th>
          <th>Input $/MTok</th>
          <th>Best At</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Claude Sonnet 5</td>
          <td>82.1%</td>
          <td>1M tokens</td>
          <td>$3.00</td>
          <td>Multi-file refactoring, code review</td>
      </tr>
      <tr>
          <td>Gemini 3.1 Pro</td>
          <td>80.6%</td>
          <td>2M tokens</td>
          <td>$2.00</td>
          <td>Price-performance, long context</td>
      </tr>
      <tr>
          <td>GPT-5.4</td>
          <td>~80%</td>
          <td>128K tokens</td>
          <td>~$3.50</td>
          <td>Terminal-Bench, Copilot integration</td>
      </tr>
      <tr>
          <td>Claude Sonnet 4.6</td>
          <td>79.6%</td>
          <td>200K tokens</td>
          <td>$3.00</td>
          <td>Stable, proven production workloads</td>
      </tr>
  </tbody>
</table>
<p>The real-world advice from practitioners who use all three: pick your model based on the ecosystem, not just the benchmark. If your team runs Claude Code, Sonnet 5 is the obvious default. If you&rsquo;re on GitHub Copilot Enterprise, GPT-5.4 has better native integration. If you&rsquo;re cost-constrained and running large-scale async jobs, Gemini 3.1 Pro&rsquo;s Batch API pricing may win on economics.</p>
<h2 id="1-million-token-context-window-real-world-enterprise-use-cases">1 Million Token Context Window: Real-World Enterprise Use Cases</h2>
<p>Claude Sonnet 5&rsquo;s 1 million token context window is five times the size of Claude Opus 4.5&rsquo;s 200K limit and more than seven times the GPT-5.4 context window, enabling a qualitatively different class of development task. A 1 million token window fits approximately 750,000 words of text, which translates to entire mid-size open-source repositories, including all source files, test suites, and documentation, in a single context load. Before context windows at this scale, codebase-level understanding required chunking and retrieval-augmented generation, a lossy process that forced the model to work with fragments rather than the full picture. With Sonnet 5, teams can load complete repository state, run holistic refactors across dozens of files, and ask the model to reason about cross-cutting concerns (like security policy enforcement or dependency upgrade impact) without losing context between steps. Enterprise use cases where this matters most include: compliance audit passes across entire microservice architectures, dependency security reviews after a CVE disclosure, and codebase migration projects where the full before-and-after state needs to be in context simultaneously. To put the cost in perspective, each request that re-reads a cached 1M token codebase costs $0.30 in input at cached-read pricing, approximately the cost of one minute of a junior developer&rsquo;s time.</p>
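<p>Before loading a whole repository, it helps to check whether it plausibly fits. The Python sketch below applies the rough ratio quoted above (750,000 words per 1M tokens) to a whitespace word count. This is a heuristic that likely undercounts for source code, since code tends to split into more tokens than prose. The file-suffix list and function names are illustrative assumptions.</p>
<pre><code class="language-python">from pathlib import Path

WORDS_PER_TOKEN = 0.75      # from 750,000 words per 1M tokens
CONTEXT_LIMIT = 1_000_000   # tokens
SOURCE_SUFFIXES = {".py", ".ts", ".go", ".rs", ".md"}  # illustrative


def estimate_tokens(text):
    """Rough token estimate from a whitespace word count."""
    return round(len(text.split()) / WORDS_PER_TOKEN)


def estimate_repo_tokens(root):
    """Sum the estimate over matching source files under root."""
    total = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in SOURCE_SUFFIXES:
            total += estimate_tokens(path.read_text(errors="ignore"))
    return total


if __name__ == "__main__":
    tokens = estimate_repo_tokens(".")
    share = tokens / CONTEXT_LIMIT
    print(f"about {tokens:,} tokens, {share:.0%} of a 1M token window")
</code></pre>
<p>For an exact figure you would need the provider&rsquo;s own token counting rather than a word-based estimate.</p>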
<h2 id="should-you-upgrade-sonnet-5-vs-claude-sonnet-46-decision-guide">Should You Upgrade? Sonnet 5 vs Claude Sonnet 4.6 Decision Guide</h2>
<p>The upgrade decision from Claude Sonnet 4.6 to Sonnet 5 is simple for most teams: same price, meaningfully better coding performance, and a 5x larger context window. There are only two reasons to stay on Sonnet 4.6: active production deployments that have been tuned around its behavior and carry regression risk, or workflows that depend on Sonnet 4.6&rsquo;s visible thinking blocks for explainability. Sonnet 5&rsquo;s background reasoning is not configurable. If your application surface requires visible reasoning chains (customer-facing explanations, audit trails for compliance, debugging middleware), you&rsquo;ll need to implement an alternative via prompted self-explanation rather than native thinking blocks. Early users also report that Sonnet 5 occasionally over-reasons on simple tasks, spending more inference compute than necessary before producing a short answer. For high-volume, low-complexity workloads (simple completions, single-function edits), the lower-priced Haiku 4.5 may be more economical. Sonnet 5 is the right default for anything involving multi-file changes, complex bug diagnosis, or tasks that benefit from extended reasoning.</p>
<h3 id="migration-checklist-sonnet-46--sonnet-5">Migration Checklist: Sonnet 4.6 → Sonnet 5</h3>
<ul>
<li>Update model ID from <code>claude-sonnet-4-6</code> to <code>claude-sonnet-5</code> (or <code>claude-sonnet-5@20260203</code> for pinned versions)</li>
<li>Audit any code that parses or displays <code>&lt;thinking&gt;</code> blocks — Sonnet 5 doesn&rsquo;t emit them by default</li>
<li>Enable prompt caching for any workflow that loads large context repeatedly (cached reads are billed at a 90% discount)</li>
<li>Test Dev Team mode on your most complex multi-file task before rolling out broadly</li>
<li>Monitor token usage — the larger context window can increase costs if prompts are not optimized</li>
</ul>
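<p>The first two checklist items can be partly automated. The following is a minimal sketch, not an official migration tool: it assumes your configuration is a plain dictionary with a <code>model</code> key, and the helper names are hypothetical. The scan is deliberately broad, flagging any line that mentions "thinking" so a person can review it.</p>

```python
import re

# Model IDs as listed in the checklist; adjust if your deployment pins a version.
OLD_MODEL = "claude-sonnet-4-6"
NEW_MODEL = "claude-sonnet-5"

# Broad on purpose: any mention of "thinking" is a candidate for manual review.
THINKING_REF = re.compile("thinking", re.IGNORECASE)


def migrate_model_id(config: dict) -> dict:
    """Return a copy of config with the old model ID swapped for the new one."""
    updated = dict(config)
    if updated.get("model") == OLD_MODEL:
        updated["model"] = NEW_MODEL
    return updated


def find_thinking_references(source: str) -> list[int]:
    """Return 1-based line numbers that mention thinking blocks."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if THINKING_REF.search(line)
    ]
```

<p>Running <code>find_thinking_references</code> over each source file yields a review list rather than a verdict. Some hits will be false positives, which is the intended trade-off for an audit pass.</p>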
<h2 id="verdict-who-should-use-claude-sonnet-5-in-2026">Verdict: Who Should Use Claude Sonnet 5 in 2026?</h2>
<p>Claude Sonnet 5 is the best coding model available at its price point as of May 2026, full stop. The 82.1% SWE-bench Verified score is a genuine breakthrough, not a marginal increment, and the fact that it comes at the same $3/MTok input price as its predecessor makes the upgrade argument almost trivial for teams already using Claude. Dev Team mode is useful for complex multi-component tasks, though it adds coordination overhead that isn&rsquo;t worth it for simple changes. The 1 million token context window is transformative for enterprise teams managing large codebases: the ability to reason across an entire repository without chunking is a qualitative shift in what AI-assisted development can do. The model is ideal for engineering teams using Claude Code as their primary coding interface, API developers building automated code review or test generation pipelines, and enterprises running compliance or security audits across large codebases. It&rsquo;s less ideal for teams needing visible reasoning chains for explainability, simple high-volume workloads where Haiku 4.5 is more economical, and teams fully committed to the GitHub Copilot ecosystem where GPT-5.4 integration is tighter. For the majority of development teams, Claude Sonnet 5 is the new default.</p>
<h2 id="faq">FAQ</h2>
<p>Claude Sonnet 5 is Anthropic&rsquo;s most capable mid-tier model as of 2026, scoring 82.1% on SWE-bench Verified — a first for any AI model at this price point. Released February 3, 2026 under the internal codename Fennec, it is priced at $3.00 per million input tokens and $15.00 per million output tokens, the same as Claude Sonnet 4.6. The model introduces two major new capabilities: Dev Team multi-agent mode, which decomposes complex tasks across specialized sub-agents, and a 1 million token context window that enables repository-level reasoning across entire codebases. Prompt caching brings cached read costs to $0.30/MTok — a 90% reduction — making large-context workflows economically viable at scale. Sonnet 5 is the default model for Claude Code Free and Pro users and is available on the Anthropic API, Amazon Bedrock, and Google Vertex AI. The following FAQ covers the most common questions from developers evaluating the upgrade, comparing pricing tiers, and deciding how to integrate Sonnet 5 into existing workflows.</p>
<h3 id="what-is-claude-sonnet-5s-swe-bench-verified-score">What is Claude Sonnet 5&rsquo;s SWE-bench Verified score?</h3>
<p>Claude Sonnet 5 scores 82.1% on SWE-bench Verified, making it the first AI model to break the 80% ceiling on this benchmark. For comparison, Gemini 3.1 Pro scores 80.6%, GPT-5.4 approximately 80%, and Claude Sonnet 4.6 79.6%.</p>
<h3 id="how-much-does-claude-sonnet-5-cost-per-million-tokens">How much does Claude Sonnet 5 cost per million tokens?</h3>
<p>Claude Sonnet 5 is priced at $3.00 per million input tokens and $15.00 per million output tokens, the same as Claude Sonnet 4.6. Prompt caching reduces cached reads to $0.30/MTok (a 90% discount). The Batch API offers 50% off for async workloads, at $1.50 per million input tokens and $7.50 per million output tokens.</p>
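<p>As a worked example of how these rates combine, the sketch below estimates the cost of a single request from the prices quoted in this article. The function name and parameters are illustrative. It does not model cache-write charges, and it assumes the batch discount applies uniformly to the whole request, which you should confirm against current pricing documentation.</p>

```python
# Prices in USD per million tokens, as quoted in this article.
INPUT_PER_MTOK = 3.00
OUTPUT_PER_MTOK = 15.00
CACHED_READ_PER_MTOK = 0.30
BATCH_DISCOUNT = 0.50


def request_cost(input_tokens, output_tokens, cached_tokens=0, batch=False):
    """Estimate the USD cost of one request.

    cached_tokens is the part of the input served from the prompt cache.
    Cache-write charges are not modeled here.
    """
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    cost = (
        fresh_tokens * INPUT_PER_MTOK
        + cached_tokens * CACHED_READ_PER_MTOK
        + output_tokens * OUTPUT_PER_MTOK
    ) / 1_000_000
    if batch:
        # Assumption: the 50% batch discount applies to every component.
        cost *= 1 - BATCH_DISCOUNT
    return round(cost, 4)
```

<p>For instance, <code>request_cost(1_000_000, 0, cached_tokens=1_000_000)</code> reproduces the $0.30 figure for a fully cached 1M token read, while the same input without caching comes to $3.00.</p>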
<h3 id="what-is-dev-team-mode-in-claude-sonnet-5">What is Dev Team mode in Claude Sonnet 5?</h3>
<p>Dev Team mode is a multi-agent architecture where Claude Sonnet 5 acts as a Team Leader that decomposes complex tasks into parallel sub-tasks and delegates them to specialized agents (Backend Specialist, QA Tester, Technical Writer, etc.). It&rsquo;s designed for large-scale feature implementation and multi-component changes.</p>
<h3 id="what-is-the-context-window-size-for-claude-sonnet-5">What is the context window size for Claude Sonnet 5?</h3>
<p>Claude Sonnet 5 has a 1 million token context window, five times the size of Claude Opus 4.5&rsquo;s 200K limit and more than seven times the GPT-5.4 context window. This enables loading entire mid-size repositories into a single context for repository-level reasoning.</p>
<h3 id="should-i-upgrade-from-claude-sonnet-46-to-claude-sonnet-5">Should I upgrade from Claude Sonnet 4.6 to Claude Sonnet 5?</h3>
<p>Yes, for most teams. The upgrade costs nothing extra (same pricing), provides a meaningful benchmark improvement (79.6% → 82.1% on SWE-bench Verified), and adds a 5x larger context window. The main exceptions are workflows that depend on visible reasoning (<code>&lt;thinking&gt;</code>) blocks and simple high-volume tasks where Haiku 4.5 is more economical.</p>
]]></content:encoded></item></channel></rss>