SWE-bench measures how well an LLM can resolve real-world GitHub issues end-to-end — not toy problems. As of May 2026, headline scores range from 93.9% (Claude Mythos Preview on SWE-bench Verified) down to a roughly 23% model average on the harder, contamination-resistant Pro variant. Here’s how to read those numbers without being misled.
What Is SWE-bench and Why Developers Should Care
SWE-bench is an open-source benchmark developed by Princeton NLP that evaluates LLMs on real software engineering tasks drawn from merged pull requests across popular open-source repositories. Unlike HumanEval — which tests whether a model can write a single function to pass unit tests — SWE-bench requires a model to read a full repository, understand the failing test, locate the root cause across multiple files, and produce a patch that actually makes the tests pass. As of May 2026, 89 models have been evaluated on SWE-bench Verified, with an average pass rate of 63.4% and a top score of 93.9%, achieved by Claude Mythos Preview. Released in 2023, the benchmark has become the de facto standard for evaluating AI coding agents. If you are evaluating an AI coding assistant, SWE-bench Verified is the first leaderboard you should consult — but as this guide explains, it is not the last word on real-world performance.
How SWE-bench Tasks Work
Each task starts with a GitHub issue description and a snapshot of the repository at the time the issue was filed. The LLM must produce a patch without access to the solution. An automated harness applies the patch to the repo and runs the test suite. If previously failing tests now pass — and no previously passing tests break — the task is marked resolved. This end-to-end setup makes SWE-bench far more demanding than code completion benchmarks, which is why top scores on HumanEval (often 90%+) rarely translate to equivalent SWE-bench performance.
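To make the pass/fail mechanics concrete, here is a minimal sketch of that harness logic in Python. It is illustrative only: the real harness runs each task in an isolated container, and the function names and the two test-list parameters here are assumptions, not the benchmark's actual API.

```python
import subprocess

def tests_pass(repo_dir: str, test_ids: list[str]) -> bool:
    """Run the given tests with pytest; True only if all of them pass."""
    result = subprocess.run(
        ["python", "-m", "pytest", *test_ids],
        cwd=repo_dir, capture_output=True,
    )
    return result.returncode == 0

def is_resolved(repo_dir: str, patch_path: str,
                fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    """A task counts as resolved only if the patch applies cleanly,
    the previously failing tests now pass, and none of the previously
    passing tests regress."""
    applied = subprocess.run(["git", "apply", patch_path], cwd=repo_dir)
    if applied.returncode != 0:
        return False  # a patch that does not apply is an automatic failure
    return tests_pass(repo_dir, fail_to_pass) and tests_pass(repo_dir, pass_to_pass)
```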
The Three SWE-bench Variants: Original, Verified, and Pro
SWE-bench has three distinct variants that serve different evaluation purposes, and confusing them is the most common mistake developers make when reading leaderboard data. The original SWE-bench contains 2,294 tasks scraped from 12 Python repositories. SWE-bench Verified is a curated subset of 500 tasks that human annotators validated as solvable and unambiguous — this is the version most leaderboards report and the one you will see cited in model launch announcements. SWE-bench Pro is the newest and hardest variant, released in early 2026 by Scale AI, containing 1,865 tasks across 41 repositories written in Python, Go, TypeScript, and JavaScript. The average fix in Pro requires changes across 107 lines spanning 4.1 files. These three variants are not interchangeable: a model’s score on Verified tells you almost nothing about how it will perform on Pro, and the gap between them is often 40+ percentage points for the same model.
| Variant | Tasks | Languages | Purpose |
|---|---|---|---|
| Original | 2,294 | Python only | Research baseline |
| Verified | 500 | Python only | Curated, human-validated |
| Pro | 1,865 | Python, Go, TS, JS | Contamination-resistant, multi-language |
Which Variant Should You Use?
For general tool selection, start with SWE-bench Verified because it has the most evaluated models (89+) and the longest track record. If your team works primarily in Go or TypeScript, weight SWE-bench Pro scores more heavily since Verified only tests Python repos. If you are making a high-stakes infrastructure decision, treat Pro scores as the ground truth and treat Verified scores as a ceiling — the model is unlikely to do better in production than its Pro score suggests.
The Contamination Problem: Why High Verified Scores Can Be Misleading
Data contamination is the single most important concept for interpreting SWE-bench Verified scores, and understanding it will save you from making an expensive mistake when choosing an AI coding tool. Contamination occurs when a model’s training data includes the solutions to the test cases it is being evaluated on — in effect, the model has “seen the answers.” Since SWE-bench Verified draws from public GitHub repositories, any model trained on large web crawls after those PRs were merged is likely to have encountered the solutions. OpenAI stopped reporting SWE-bench Verified scores in 2025 after internal analysis revealed that every frontier model showed measurable training data contamination. The most striking evidence for contamination is the gap between Verified and Pro scores: Claude Mythos Preview scores 93.9% on Verified but only 45.9% on Pro, a 48-percentage-point difference on the same type of task. That gap is explainable primarily by the fact that Pro uses more recent, private, or previously unpublished code that is far less likely to have appeared in training data.
What the 48-Point Gap Tells You
A 48-point drop for the same model on the same type of task is not an engineering limitation — it is a measurement artifact. This does not mean Claude Mythos Preview is a bad coding model; it means that the Verified leaderboard overstates true capability for all models that trained on pre-2025 web data. The practical implication: when a vendor cites a Verified score above 80%, your first question should be the corresponding Pro score. If the gap is larger than 25 points, apply a skepticism discount to the Verified number. If the vendor does not publish a Pro score at all, assume contamination is a factor.
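That rule of thumb is simple enough to write down as a check. The sketch below just encodes the heuristic from this section; the function name and the 25-point threshold are this guide's convention, not an official metric.

```python
def verified_score_is_credible(verified: float, pro: float | None) -> bool:
    """Heuristic from this section: take a Verified score at face value
    only if a published Pro score sits within 25 points of it.
    Scores are percentages, e.g. verified=93.9, pro=45.9."""
    if pro is None:
        return False  # no published Pro score: assume contamination is a factor
    return (verified - pro) <= 25.0
```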
Reading the 2026 Leaderboard: Top Models and What Their Scores Actually Mean
The SWE-bench Verified leaderboard as of May 2026 covers 89 evaluated AI models with an average score of 63.4%. Claude Mythos Preview leads at 93.9%, followed by Claude Opus 4.7 (Adaptive) at 87.6%. These headline numbers represent best-case performance: the models are typically given multiple retries and scaffolding agents that handle tool use and context management, and they are evaluated on tasks skewed toward Python repositories, where training data exposure is highest. For developers, the more useful signal is the spread between a model’s Verified and Pro scores, and whether the model can handle the specific language and repository size that match your codebase. A model ranked fifth on Verified but performing consistently on Pro in Go or TypeScript may be a better choice for a polyglot backend team than the headline leader.
| Model | SWE-bench Verified | SWE-bench Pro | Gap |
|---|---|---|---|
| Claude Mythos Preview | 93.9% | 45.9% | 48 pts |
| Claude Opus 4.7 (Adaptive) | 87.6% | ~42% (est.) | ~45 pts |
| Average (89 models) | 63.4% | ~23% | ~40 pts |
How to Interpret Scores for Real Codebases
Benchmark scores assume a clean reproduction environment, a well-defined failing test, and no proprietary dependencies. Real codebases rarely satisfy all three. A model that resolves 87% of curated open-source GitHub issues may still struggle on your internal monorepo if your build system is unusual, your test suite is slow, or your domain logic requires understanding business context that is not in the repository. Use SWE-bench scores to narrow the field to two or three models, then run each on ten representative tickets from your actual backlog. The benchmark tells you the ceiling; your private eval tells you the floor.
SWE-bench Limitations: What It Doesn’t Test
SWE-bench is a rigorous benchmark for a specific slice of software engineering work — resolving standalone, testable bugs in open-source Python (or Go/TypeScript for Pro) repositories. That specificity is a feature for research but a limitation for practitioners choosing tools. SWE-bench does not test security vulnerability identification or remediation; it does not evaluate code review quality, documentation generation, architecture suggestions, or pairing workflows. It does not assess performance on private codebases with custom build pipelines, non-standard testing frameworks, or confidential business logic. It does not measure latency, cost per task, or how well a model handles an ambiguous requirements document. A model could score 90% on SWE-bench Verified and still be a poor fit for a data engineering team that primarily needs help writing and optimizing SQL pipelines or debugging Spark jobs, because those task types do not appear in the benchmark at all.
Benchmarks That Complement SWE-bench
For a more complete evaluation picture, pair SWE-bench data with LiveCodeBench (competitive programming, updated monthly), HumanEval+ (function synthesis with adversarial edge cases), and CRUXEval (code reasoning and execution prediction). For security-focused teams, add CyberSecEval results. None of these benchmarks replicate production conditions, but together they cover more of the capability surface than SWE-bench alone.
How to Use SWE-bench Scores to Pick an LLM for Your Use Case
Using SWE-bench scores to pick an LLM requires matching the benchmark’s evaluation conditions to your actual workflow rather than chasing the highest headline number. Start by identifying your dominant task type: if you primarily need an agent that opens a PR to fix a reported bug end-to-end, SWE-bench is directly relevant. If you mostly need inline autocomplete, refactoring suggestions, or code explanation, SWE-bench is a weak proxy at best. Once you have confirmed SWE-bench relevance, use the following three-step framework: first, filter to models with published Pro scores above 35%, which eliminates models whose Verified performance is likely inflated by contamination; second, filter to models tested on your primary language (Pro for Go/TS, Verified for Python); third, run a private eval on 10–20 tasks from your own backlog before committing to a subscription or API plan. This process takes two to three hours and will produce more reliable signal than any public leaderboard.
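For illustration, the first two filtering steps can be expressed in a few lines of Python. The `ModelResult` record and its fields are hypothetical stand-ins for whatever leaderboard data you collect; step three, the private eval, necessarily happens outside any script.

```python
from dataclasses import dataclass, field

@dataclass
class ModelResult:
    name: str
    verified: float               # SWE-bench Verified score, in percent
    pro: float | None             # SWE-bench Pro score, None if unpublished
    pro_languages: set[str] = field(default_factory=set)

def shortlist(candidates: list[ModelResult], language: str) -> list[ModelResult]:
    """Steps 1 and 2 of the framework: require a published Pro score
    above 35%, then keep only models tested on your primary language."""
    # e.g. shortlist(leaderboard_rows, language="go") -> models worth a private eval
    return [
        m for m in candidates
        if m.pro is not None and m.pro > 35.0
        and language in m.pro_languages
    ]
```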
Decision Framework by Team Type
| Team Type | Primary Benchmark | Secondary Signal |
|---|---|---|
| Python / Django / Flask backend | SWE-bench Verified + Pro | Private eval on your repos |
| Go / TypeScript / Node.js backend | SWE-bench Pro only | LiveCodeBench (multi-language) |
| Security / pentest | CyberSecEval | SWE-bench Pro (lower weight) |
| Data engineering / SQL / Spark | HumanEval+ | Internal task eval only |
| Frontend / React | LiveCodeBench | SWE-bench Pro (TS tasks) |
Beyond SWE-bench: Private Benchmarking and Real-World Validation
Building a private benchmark is the most reliable way to select and maintain an LLM coding tool when public benchmarks are vulnerable to contamination or simply do not cover your stack. A private benchmark collects 20–50 resolved issues from your own backlog — bugs or features that were already fixed by your team, with the solution hidden from the model. You run each candidate model on these tasks under the same conditions (same scaffolding, same context budget, same pass/fail criteria), score the results, and update the benchmark quarterly as your codebase evolves. This approach is immune to contamination because your code was never in any model’s training data, and it directly measures performance on the exact task types, languages, and complexity levels your team encounters. Several teams at scale — including engineering organizations at Stripe and Shopify — have reported that their private evals produce rankings that diverge significantly from public SWE-bench leaderboards, with the private eval being more predictive of developer satisfaction after 30 days of use.
How to Build Your Private Eval in One Day
Select 20 closed GitHub issues from the past 6 months where the fix was isolated — meaning it touched fewer than 5 files and had a clear before/after test. Strip the solution from each branch, write a harness that applies model-generated patches and runs the test suite, and baseline each issue with your current tool. Run two or three candidate models and record pass rate, mean time per task, and cost. Score models on a weighted basis: pass rate (60%), cost per resolved task (25%), latency (15%). Rerun quarterly. The total effort is approximately four hours of setup and two hours per quarterly refresh.
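Below is a minimal sketch of the scoring step, assuming you have already recorded pass rate, cost, and latency for each candidate. The min-max normalization and the inversion of the two "lower is better" metrics are one reasonable implementation choice, not a standard methodology.

```python
from dataclasses import dataclass

@dataclass
class EvalRun:
    model: str
    pass_rate: float          # fraction of eval tasks resolved, 0..1
    cost_per_resolved: float  # dollars per resolved task
    mean_latency_s: float     # mean seconds per task

def weighted_scores(runs: list[EvalRun]) -> dict[str, float]:
    """Combine the three metrics with the weights from this guide:
    pass rate 60%, cost 25%, latency 15%. Cost and latency are
    min-max normalized and inverted so that lower values score higher.
    Assumes every model resolved at least one task."""
    def norm(values: list[float], invert: bool) -> list[float]:
        lo, hi = min(values), max(values)
        if hi == lo:
            return [1.0] * len(values)  # identical values carry no signal
        scaled = [(v - lo) / (hi - lo) for v in values]
        return [1.0 - s for s in scaled] if invert else scaled

    passes = norm([r.pass_rate for r in runs], invert=False)
    costs = norm([r.cost_per_resolved for r in runs], invert=True)
    lats = norm([r.mean_latency_s for r in runs], invert=True)
    return {
        r.model: 0.60 * p + 0.25 * c + 0.15 * l
        for r, p, c, l in zip(runs, passes, costs, lats)
    }
```

Run over two or three candidates, this yields a single 0–1 score per model that encodes the 60/25/15 weighting directly, which makes the quarterly refresh a one-command comparison.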
FAQ
The following questions cover the most common points of confusion when interpreting SWE-bench data for real purchasing and tooling decisions. SWE-bench is a legitimate and rigorous benchmark, but its scores require context to be actionable — especially given the contamination gap between the Verified and Pro variants. The short answers below are designed to be standalone: each can be read without the rest of this guide and still give you a correct mental model. If you are evaluating AI coding tools for a team of more than five engineers, work through all five questions before shortlisting models, because each addresses a different failure mode that leads developers to make expensive mistakes when picking an LLM based on leaderboard data alone. The contamination problem, the product-versus-model distinction, and the update cadence question in particular trip up experienced engineers who are otherwise well-equipped to read technical benchmarks.
What is a good SWE-bench Verified score in 2026?
As of May 2026, the average across 89 evaluated models is 63.4%. Scores above 80% are achieved by top frontier models. For practical tool selection, treat anything below 50% as a disqualifier for agent-based coding workflows and treat anything above 85% with skepticism unless the model also publishes a strong Pro score, because high Verified scores are increasingly associated with training data contamination.
Why did OpenAI stop reporting SWE-bench scores?
OpenAI stopped reporting SWE-bench Verified scores after internal analysis found that every frontier model showed measurable training data contamination on the Verified task set. Because Verified draws from public GitHub repositories merged before the model’s training cutoff, the solutions are likely present in the training data — making scores an unreliable measure of true generalization capability.
What is SWE-bench Pro and how is it different from Verified?
SWE-bench Pro is a harder, contamination-resistant variant released by Scale AI in 2026. It contains 1,865 tasks across 41 repositories in Python, Go, TypeScript, and JavaScript, with an average fix requiring 107 lines of changes across 4.1 files. Unlike Verified, Pro uses more recent and less publicly exposed code, making it significantly harder for models to benefit from memorized training data. Top models score 45–50% on Pro versus 85–93% on Verified.
Can I use SWE-bench scores to compare Cursor, Copilot, and other coding assistants?
SWE-bench benchmarks the underlying LLM, not the entire product. Two tools can use the same model and produce very different results depending on their scaffolding, context retrieval, tool use design, and retry logic. Use SWE-bench scores to narrow your model shortlist, then evaluate the full product through a free trial on your own codebase. Product-level differences in UX, IDE integration, and cost often matter more than a 5-point difference on the leaderboard.
How often is the SWE-bench leaderboard updated?
The SWE-bench Verified leaderboard is updated continuously as vendors submit results — there is no fixed release schedule. SWE-bench Pro, managed by Scale AI, updates on a less frequent cadence as new submissions are validated. For the most current standings, check the official SWE-bench GitHub repository, the Scale AI leaderboard for Pro, and aggregator sites such as BenchLM, which tracks 44+ models and updates monthly.
