<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Ai-Benchmarks on RockB</title><link>https://baeseokjae.github.io/tags/ai-benchmarks/</link><description>Recent content in Ai-Benchmarks on RockB</description><image><title>RockB</title><url>https://baeseokjae.github.io/images/og-default.png</url><link>https://baeseokjae.github.io/images/og-default.png</link></image><generator>Hugo</generator><language>en-us</language><lastBuildDate>Sat, 09 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://baeseokjae.github.io/tags/ai-benchmarks/index.xml" rel="self" type="application/rss+xml"/><item><title>Terminal-Bench 2.0 Explained: The New Standard for AI Agent Benchmarks (2026 Guide)</title><link>https://baeseokjae.github.io/posts/terminal-bench-2026-guide/</link><pubDate>Sat, 09 May 2026 00:00:00 +0000</pubDate><guid>https://baeseokjae.github.io/posts/terminal-bench-2026-guide/</guid><description>Terminal-Bench 2.0 evaluates AI agents on real terminal tasks — compiling code, administering systems, running security workflows. Here&amp;#39;s how the 2026 leaderboard breaks down and how to use the scores to choose your model.</description><content:encoded><![CDATA[<p>Terminal-Bench 2.0 is the benchmark the DevOps and MLOps communities have needed for years. Unlike SWE-bench, which focuses narrowly on Python bug fixes in open-source repos, Terminal-Bench drops AI agents into a live terminal environment and asks them to do what senior engineers actually spend their days doing: compile unfamiliar codebases, configure servers, train models, write and debug scripts, and complete multi-step system administration tasks. As of May 2026, 39 models have been evaluated and the average score sits at 56.4% — a figure that reveals just how hard real terminal work is for even the most capable AI agents.</p>
<h2 id="terminal-bench-20-the-ai-agent-benchmark-that-matters-for-devops-teams">Terminal-Bench 2.0: The AI Agent Benchmark That Matters for DevOps Teams</h2>
<p>Terminal-Bench 2.0 matters because it is the first widely adopted benchmark that tests AI agents against the full stack of operational software engineering work, not just code synthesis. With 39 models evaluated and an average score of 56.4% as of May 7, 2026, the results are a candid picture of where agentic AI actually stands when the rubber meets the road in a real shell. The benchmark evaluates models across four task categories — tool calling, reasoning, agents, and code — all executed inside live terminal sessions with no scaffolding shortcuts. This design choice is deliberate: real DevOps work happens in terminals, not in sandboxed IDE plugins. A model that aces autocomplete but cannot navigate a broken Makefile, debug a failing systemd service, or orchestrate a multi-command data pipeline delivers limited value to an infrastructure team. Terminal-Bench was last updated on May 7, 2026, and has quickly become the benchmark that platform engineers and MLOps practitioners reach for when evaluating which AI coding model to integrate into their CI/CD pipelines, on-call runbooks, or infrastructure-as-code workflows.</p>
<h2 id="what-terminal-bench-20-actually-tests-real-terminal-tasks-not-theory">What Terminal-Bench 2.0 Actually Tests: Real Terminal Tasks, Not Theory</h2>
<p>Terminal-Bench 2.0 tasks are drawn from the actual work of software engineers, sysadmins, and data scientists — not from textbook problems or toy programming exercises. The benchmark covers six core task domains: compiling code from source with missing or misconfigured dependencies, training machine learning models end-to-end, setting up and validating server configurations, performing system administration operations (user management, cron jobs, log analysis, disk management), executing security tasks such as identifying vulnerabilities and configuring access controls, and completing data science workflows that span data ingestion, transformation, and output validation. Every task runs inside a real terminal environment, which means the model must issue shell commands, handle error output, iterate on failures, and verify its own work without a human in the loop. This multi-step, error-tolerant structure is what separates Terminal-Bench from code completion benchmarks: a model cannot succeed by generating a block of Python in isolation — it must plan a sequence of terminal operations, observe results, adapt, and close the loop. The benchmark&rsquo;s four evaluation categories — tool calling, reasoning, agents, and code — are not independent modules; most tasks require all four competencies working together, which is precisely why the average score remains below 60% even among well-resourced frontier models.</p>
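<p>To make that loop concrete, here is a minimal sketch (not the official Terminal-Bench harness) of the command, observe, retry cycle described above. It assumes a hypothetical <code>ask_model()</code> helper that returns the next shell command given the task and the transcript so far; everything else uses only the Python standard library.</p>
<pre><code class="language-python"># Minimal sketch of the command/observe/retry loop that Terminal-Bench-style
# tasks demand. Illustrative only: ask_model() is a hypothetical stand-in for
# whichever model API you are evaluating.
import subprocess

def ask_model(task: str, transcript: list) -> str:
    """Hypothetical: return the next shell command, or 'DONE' when finished."""
    raise NotImplementedError("wire this to the model under evaluation")

def run_task(task: str, max_steps: int = 20, timeout: int = 120) -> list:
    transcript = []
    for _ in range(max_steps):
        command = ask_model(task, transcript)
        if command.strip() == "DONE":
            break
        # Execute in a real shell and capture stdout, stderr, and the exit code.
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout
        )
        # Feed the observed output back so the model can adapt to failures.
        transcript.append({
            "command": command,
            "exit_code": result.returncode,
            "output": (result.stdout + result.stderr)[-4000:],  # truncate context
        })
    return transcript
</code></pre>
<p>A real harness adds per-task verification on top of this loop, since the benchmark scores task outcomes rather than individual command exit codes; that verification step is where multi-step error recovery gets exposed.</p>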
<h2 id="the-2026-leaderboard-gpt-55-at-827-vs-claude-at-65-69">The 2026 Leaderboard: GPT-5.5 at 82.7% vs Claude at 65-69%</h2>
<p>GPT-5.5 leads the Terminal-Bench 2.0 leaderboard with a score between 82.0% and 82.7%, roughly 26 percentage points above the 39-model average of 56.4% — a performance margin that is hard to dismiss as noise. Below GPT-5.5, the leaderboard reveals a cluster of strong performers in the 65–77% range that reflects meaningfully different model architectures and training priorities. GPT-5.3 Codex occupies second place at 77.3%, demonstrating that OpenAI&rsquo;s Codex-specialized training pays dividends on terminal-heavy tasks. Gemini 3 Flash lands at 76.2%, which is a headline number in its own right given its cost position (more on that later). Claude Opus 4.7 Adaptive reaches 69.4%, MiMo-V2.5-Pro scores 68.4%, and Claude Opus 4.6 trails at 65.4% — a gap of roughly 17 points from GPT-5.5 that is consequential for teams where terminal automation is a core workflow. The leaderboard tells a clear story: GPT-5.5 and GPT-5.3 Codex are the current leaders for raw terminal-task performance, Gemini 3 Flash over-delivers for its price tier, and the Claude Opus family — while competitive on other benchmarks — shows a specific weakness in the kinds of multi-step shell operations Terminal-Bench measures.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Terminal-Bench 2.0 Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GPT-5.5</td>
          <td>82.0–82.7%</td>
      </tr>
      <tr>
          <td>GPT-5.3 Codex</td>
          <td>77.3%</td>
      </tr>
      <tr>
          <td>Gemini 3 Flash</td>
          <td>76.2%</td>
      </tr>
      <tr>
          <td>Claude Opus 4.7 Adaptive</td>
          <td>69.4%</td>
      </tr>
      <tr>
          <td>MiMo-V2.5-Pro</td>
          <td>68.4%</td>
      </tr>
      <tr>
          <td>Claude Opus 4.6</td>
          <td>65.4%</td>
      </tr>
      <tr>
          <td>39-model average</td>
          <td>56.4%</td>
      </tr>
  </tbody>
</table>
<h2 id="terminal-bench-vs-swe-bench-why-you-need-both-benchmarks">Terminal-Bench vs SWE-bench: Why You Need Both Benchmarks</h2>
<p>Terminal-Bench and SWE-bench measure adjacent but non-overlapping competencies, and the 2026 data makes that distinction concrete in a way that has real purchasing consequences. SWE-bench focuses on GitHub issue resolution inside Python repositories: a model reads a bug report, navigates a codebase, produces a patch, and passes the original test suite. Terminal-Bench focuses on multi-step terminal operations that span DevOps, scripting, and system administration — workflows that rarely reduce to a single diff in a single file. The clearest illustration of this divergence is Claude Opus 4.6: it leads SWE-bench Verified at 80.8% but scores only 65.4% on Terminal-Bench 2.0 — a 15-point gap that reflects genuinely different strengths. Claude Opus 4.6 is exceptional at understanding large Python codebases and producing correct patches; it is less consistent at orchestrating multi-command shell sequences, handling system-level errors, and completing operational workflows that require environment awareness beyond the contents of a single repository. No model as of May 2026 leads both benchmarks simultaneously. This is not a coincidence — the two benchmarks require different training emphases, different reasoning patterns, and different tool-use behaviors. For practitioners, the implication is that you need to consult both benchmarks, not just one, before committing to a model for your team.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Terminal-Bench 2.0</th>
          <th>SWE-bench Verified</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Claude Opus 4.6</td>
          <td>65.4%</td>
          <td>80.8% (leads)</td>
      </tr>
      <tr>
          <td>GPT-5.5</td>
          <td>82.7% (leads)</td>
          <td>—</td>
      </tr>
      <tr>
          <td>GPT-5.3 Codex</td>
          <td>77.3%</td>
          <td>77.3%</td>
      </tr>
      <tr>
          <td>Claude Opus 4.7 Adaptive</td>
          <td>69.4%</td>
          <td>87.6%</td>
      </tr>
      <tr>
          <td>Gemini 3 Flash</td>
          <td>76.2%</td>
          <td>—</td>
      </tr>
  </tbody>
</table>
<h2 id="what-terminal-bench-performance-predicts-in-production">What Terminal-Bench Performance Predicts in Production</h2>
<p>Terminal-Bench 2.0 scores are the best available proxy for how an AI agent will perform on CI/CD pipeline automation, infrastructure scripting, and system administration workflows in production environments. Models that score above 70% on Terminal-Bench consistently demonstrate the ability to complete multi-step shell operations without getting stuck on error recovery — a capability that is directly relevant if you are using AI agents to run build pipelines, manage server configurations, or automate deployment validation. Models in the 55–65% range tend to succeed on well-structured single-domain tasks (compile this project, run this test suite) but struggle when tasks require cross-domain chaining — for example, setting up a Python environment, training a model, verifying the output metrics, and adjusting hyperparameters based on results, all without human intervention. Models below 50% are currently not reliable enough for unattended terminal automation on anything but the most narrowly scoped scripts. SWE-bench, by contrast, predicts large codebase refactoring, bug resolution in legacy Python services, and PR automation workflows. If your AI use cases span both operational scripting and codebase maintenance, you need a model that scores competitively on both benchmarks — and as of May 2026, GPT-5.3 Codex is the closest thing to a balanced performer, with strong results on both leaderboards.</p>
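<p>The thresholds in this section translate directly into a gating policy for what you allow an agent to do unattended. The sketch below is one way to encode them; the cutoffs mirror the score bands discussed above, while the policy wording is illustrative shorthand rather than anything published by the benchmark.</p>
<pre><code class="language-python"># Map a Terminal-Bench 2.0 score to an automation policy. The cutoffs follow
# the score bands discussed above; the labels are illustrative shorthand, not
# an official tiering from the benchmark.
def automation_policy(terminal_bench_score: float) -> str:
    if terminal_bench_score >= 70:
        return "unattended multi-step terminal automation is a reasonable bet"
    if terminal_bench_score >= 55:
        return "restrict to well-structured single-domain tasks; review chained workflows"
    return "human oversight required; narrowly scoped scripts only"

for model, score in [("GPT-5.5", 82.7), ("Gemini 3 Flash", 76.2),
                     ("Claude Opus 4.6", 65.4), ("39-model average", 56.4)]:
    print(f"{model}: {automation_policy(score)}")
</code></pre>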
<h3 id="task-categories-and-their-production-analogs">Task Categories and Their Production Analogs</h3>
<table>
  <thead>
      <tr>
          <th>Terminal-Bench Category</th>
          <th>Production Analog</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Tool calling</td>
          <td>CI/CD script automation, Makefile debugging</td>
      </tr>
      <tr>
          <td>Reasoning</td>
          <td>Root-cause analysis, incident diagnosis</td>
      </tr>
      <tr>
          <td>Agents</td>
          <td>Multi-step deployment pipelines, runbook execution</td>
      </tr>
      <tr>
          <td>Code</td>
          <td>Ad-hoc scripting, data transformation</td>
      </tr>
  </tbody>
</table>
<h2 id="the-cost-performance-curve-gemini-3-flash-at-76-for-33x-less">The Cost-Performance Curve: Gemini 3 Flash at 76% for 33x Less</h2>
<p>Gemini 3 Flash&rsquo;s Terminal-Bench score of 76.2% at approximately 33 times lower cost than Claude Opus represents one of the most significant cost-performance inflection points in the 2026 AI landscape. To put the comparison in concrete terms: if your team is currently paying Claude Opus-tier pricing for terminal automation tasks, you are paying a 33x premium for a model that scores roughly 13 percentage points lower than GPT-5.5 and about 7 points lower than Gemini 3 Flash. For the majority of DevOps use cases — running build validation, setting up test environments, scripting routine system administration — that 7-point performance gap between Flash and Opus is unlikely to matter in practice. The tasks where the gap starts to matter are the edge-case, multi-step workflows that require deep error recovery and cross-domain reasoning: exactly the scenarios you would want to test in your own environment before making a model selection. The cost-performance calculus becomes even more decisive at high volume. A team running 10,000 terminal automation tasks per month at Opus pricing could achieve comparable outcomes on most tasks at Flash pricing, redirecting the cost savings toward additional evaluation cycles, model fine-tuning, or simply a larger allowance for agent retries. The practical recommendation: use Gemini 3 Flash as your default for routine terminal tasks, and escalate to GPT-5.5 for the complex, high-stakes operations that justify the cost premium.</p>
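<p>To see how the 33x multiplier plays out at volume, here is a back-of-the-envelope calculation for the 10,000-task scenario above. The per-task cost is a placeholder assumption for illustration only; the only number carried over from the comparison is the roughly 33x price ratio, so substitute your own observed per-task spend.</p>
<pre><code class="language-python"># Back-of-the-envelope monthly cost comparison for terminal automation.
# ASSUMPTION: flash_cost_per_task is a placeholder figure; only the ~33x
# ratio between Flash and Opus pricing is taken from the comparison above.
tasks_per_month = 10_000
flash_cost_per_task = 0.03                      # hypothetical, USD per task
opus_cost_per_task = flash_cost_per_task * 33   # ~33x premium

flash_monthly = tasks_per_month * flash_cost_per_task
opus_monthly = tasks_per_month * opus_cost_per_task

print(f"Gemini 3 Flash: ${flash_monthly:,.0f}/month")
print(f"Claude Opus:    ${opus_monthly:,.0f}/month")
print(f"Headroom for retries and evals: ${opus_monthly - flash_monthly:,.0f}/month")
</code></pre>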
<h2 id="how-to-use-terminal-bench-scores-to-choose-your-ai-coding-model">How to Use Terminal-Bench Scores to Choose Your AI Coding Model</h2>
<p>Choosing an AI coding model based on Terminal-Bench scores requires mapping the benchmark&rsquo;s task categories to your team&rsquo;s actual workflow distribution, not simply picking the highest score on the leaderboard. Start by categorizing your team&rsquo;s AI-assisted work into two buckets: terminal and operational tasks (CI/CD, system administration, scripting, data pipelines) versus code comprehension and codebase tasks (bug fixing, PR review, refactoring, documentation). If the majority of your AI usage falls into the first bucket, Terminal-Bench should be your primary selection benchmark. If the majority falls into the second bucket, weight SWE-bench Verified or Pro more heavily. If your workload is split roughly evenly, look for models that score above 70% on Terminal-Bench and above 75% on SWE-bench Verified — a short list in 2026 that currently includes GPT-5.3 Codex as the most balanced option. Cost matters as much as capability for high-volume use: Gemini 3 Flash at 76.2% is the most defensible choice for teams that want strong terminal performance without Opus-tier pricing. Qwen3-Coder-Next is worth tracking for cost-sensitive teams building on open-source infrastructure — it achieves 70.6% on SWE-bench at just 3 billion active parameters under the Apache 2.0 license, making it the cheapest frontier performer currently available. Run a private eval on 10–15 representative terminal tasks from your own backlog before finalizing any model selection; public leaderboard scores give you the shortlist, but your workload-specific performance tells you the final answer.</p>
<h3 id="decision-framework-terminal-vs-codebase-workloads">Decision Framework: Terminal vs Codebase Workloads</h3>
<table>
  <thead>
      <tr>
          <th>Primary Use Case</th>
          <th>Recommended Model</th>
          <th>Rationale</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Heavy terminal / DevOps automation</td>
          <td>GPT-5.5</td>
          <td>Leads Terminal-Bench at 82.7%</td>
      </tr>
      <tr>
          <td>Cost-optimized terminal automation</td>
          <td>Gemini 3 Flash</td>
          <td>76.2% at 33x lower cost than Opus</td>
      </tr>
      <tr>
          <td>Large codebase bug fixing (Python)</td>
          <td>Claude Opus 4.6</td>
          <td>Leads SWE-bench Verified at 80.8%</td>
      </tr>
      <tr>
          <td>Balanced terminal + codebase</td>
          <td>GPT-5.3 Codex</td>
          <td>Strong on both benchmarks (77.3% terminal)</td>
      </tr>
      <tr>
          <td>Open-source / cost-constrained</td>
          <td>Qwen3-Coder-Next</td>
          <td>70.6% SWE-bench, Apache 2.0, 3B active params</td>
      </tr>
  </tbody>
</table>
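<p>For a quick first pass, the framework above can be made executable: blend each model&rsquo;s Terminal-Bench 2.0 and SWE-bench Verified scores by your workload mix and rank the results. The benchmark figures are the ones quoted in this post; the linear blend itself is a simplifying assumption, not a published selection methodology, and it is no substitute for the private eval recommended above.</p>
<pre><code class="language-python"># First-pass shortlisting: blend Terminal-Bench 2.0 and SWE-bench Verified
# scores by workload mix. The linear blend is a simplifying assumption, not a
# published methodology; the scores are those quoted in this post.
SCORES = {
    # model: (terminal_bench, swe_bench_verified); None where not quoted here
    "GPT-5.5":                  (82.7, None),
    "GPT-5.3 Codex":            (77.3, 77.3),
    "Gemini 3 Flash":           (76.2, None),
    "Claude Opus 4.7 Adaptive": (69.4, 87.6),
    "Claude Opus 4.6":          (65.4, 80.8),
}

def blended_score(terminal_share: float, model: str):
    """terminal_share: fraction of your AI workload that is terminal/ops work."""
    tb, swe = SCORES[model]
    if tb is None or swe is None:
        return None  # cannot blend without both scores; run a private eval
    return terminal_share * tb + (1 - terminal_share) * swe

# Example: a team whose workload is roughly 60% terminal automation, 40% codebase work
for model in SCORES:
    s = blended_score(0.6, model)
    label = f"{s:.1f}" if s is not None else "n/a (missing a score)"
    print(f"{model}: {label}")
</code></pre>
<p>A blended number like this only produces a shortlist; the final call still comes from running the finalists on 10–15 representative tasks from your own backlog.</p>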
<h2 id="terminal-bench-20-results-table-all-39-models-ranked">Terminal-Bench 2.0 Results Table: All 39 Models Ranked</h2>
<p>The complete Terminal-Bench 2.0 leaderboard as of May 7, 2026 spans 39 evaluated models with scores ranging from GPT-5.5&rsquo;s 82.7% ceiling down to sub-40% for weaker open-source models. The average score of 56.4% is the most important summary statistic: it means the typical evaluated model fails on nearly half of Terminal-Bench&rsquo;s real terminal tasks, which is a strong argument for treating model selection for DevOps workflows with the same rigor you would apply to any other infrastructure decision. The top tier (scores above 70%) currently includes GPT-5.5, GPT-5.3 Codex, and Gemini 3 Flash — three models from two providers, which reflects how concentrated top-tier terminal-task performance is in 2026. The second tier (65–70%) includes Claude Opus 4.7 Adaptive, MiMo-V2.5-Pro, and Claude Opus 4.6, all of which are viable for terminal automation on structured tasks but show higher failure rates on complex, multi-step workflows. The third tier (50–65%) contains the majority of evaluated models, many of which are capable coding assistants that simply have not been optimized for terminal-environment reasoning. Models below 50% are experimental for unattended terminal use and should be treated as requiring human oversight on all but the simplest scripted operations. The leaderboard is updated as new models are submitted for evaluation; check the official Terminal-Bench repository for the most current rankings, as frontier model releases in 2026 have been frequent enough to shift the leaderboard meaningfully within weeks of a new model launch.</p>
<table>
  <thead>
      <tr>
          <th>Tier</th>
          <th>Score Range</th>
          <th>Notable Models</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Tier 1 (Top performers)</td>
          <td>70%+</td>
          <td>GPT-5.5 (82.7%), GPT-5.3 Codex (77.3%), Gemini 3 Flash (76.2%)</td>
      </tr>
      <tr>
          <td>Tier 2 (Competitive)</td>
          <td>65–70%</td>
          <td>Claude Opus 4.7 Adaptive (69.4%), MiMo-V2.5-Pro (68.4%), Claude Opus 4.6 (65.4%)</td>
      </tr>
      <tr>
          <td>Tier 3 (Developing)</td>
          <td>50–65%</td>
          <td>Majority of the 39 evaluated models</td>
      </tr>
      <tr>
          <td>Tier 4 (Experimental)</td>
          <td>Below 50%</td>
          <td>Smaller open-source models; human oversight required</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="frequently-asked-questions">Frequently Asked Questions</h2>
<p><strong>Q: What is Terminal-Bench 2.0 and how is it different from SWE-bench?</strong></p>
<p>Terminal-Bench 2.0 evaluates AI agents on multi-step terminal tasks in live shell environments — compiling code, training models, configuring servers, running security checks, and completing data science workflows. SWE-bench evaluates models on resolving GitHub issues in Python codebases by producing a patch that passes an existing test suite. The two benchmarks measure different competencies: Terminal-Bench predicts CI/CD automation and scripting performance; SWE-bench predicts large codebase refactoring and bug resolution. No single model leads both benchmarks as of May 2026.</p>
<p><strong>Q: Which model scores highest on Terminal-Bench 2.0?</strong></p>
<p>GPT-5.5 leads the Terminal-Bench 2.0 leaderboard as of May 7, 2026, with a score of 82.0–82.7%. GPT-5.3 Codex is second at 77.3%, and Gemini 3 Flash is third at 76.2%. Among Claude models, Claude Opus 4.7 Adaptive leads at 69.4%, followed by Claude Opus 4.6 at 65.4%. The 39-model average is 56.4%.</p>
<p><strong>Q: Is Gemini 3 Flash a good choice for terminal automation given its lower cost?</strong></p>
<p>For the majority of routine terminal automation tasks — build validation, environment setup, scripting, log analysis — Gemini 3 Flash at 76.2% is an excellent choice. It runs at approximately 33 times lower cost than Claude Opus while delivering 7 percentage points less performance than GPT-5.5. For high-stakes, complex multi-step workflows requiring deep error recovery, GPT-5.5 or GPT-5.3 Codex are safer choices. Run a private eval on your specific workload before committing.</p>
<p><strong>Q: Why does Claude Opus 4.6 score lower on Terminal-Bench despite leading SWE-bench Verified?</strong></p>
<p>Claude Opus 4.6 leads SWE-bench Verified at 80.8% because it has been optimized for understanding large Python codebases and producing correct, context-aware patches — a capability that requires deep code comprehension but not necessarily multi-step terminal environment reasoning. Terminal-Bench requires orchestrating shell commands, handling live error output, adapting plans based on system feedback, and completing operational workflows across multiple tool categories. These different competencies reflect different training emphases. The 15-point gap between Claude Opus 4.6&rsquo;s SWE-bench (80.8%) and Terminal-Bench (65.4%) scores is one of the clearest illustrations in the 2026 benchmark landscape of why no single score can capture overall model capability.</p>
<p><strong>Q: What is Qwen3-Coder-Next and why is it relevant to Terminal-Bench users?</strong></p>
<p>Qwen3-Coder-Next achieves 70.6% on SWE-bench Verified at just 3 billion active parameters under the Apache 2.0 open-source license, making it the cheapest frontier-competitive coder currently available. While its Terminal-Bench score has not been separately highlighted in the leaderboard summary data, its efficiency profile makes it highly relevant for cost-constrained teams or organizations that require an open-source, self-hostable model for compliance reasons. It represents the best available option for teams that need SWE-bench-competitive performance without paying proprietary API pricing — and it is worth evaluating on your own terminal task suite to establish whether its codebase strengths extend to your operational workflows.</p>
]]></content:encoded></item></channel></rss>