SWE-Bench

Kotlin Benchmark for AI Coding Agents: How Well Do AI Agents Write Kotlin Compared to Python and TypeScript?

Introduction: The Rise of Kotlin-Specific AI Coding Benchmarks

For years, AI coding benchmarks have been dominated by Python and TypeScript. SWE-bench, HumanEval, and MBPP all lean heavily on these languages, leaving Kotlin developers wondering how well AI agents actually handle JVM-based code, Android development, and Kotlin-specific idioms. In July 2026, JetBrains changed that by releasing the official Kotlin Benchmark for AI Coding Agents, a rigorous, open-source evaluation framework built on Multi-SWE-bench infrastructure that measures how well AI coding agents resolve real-world Kotlin software engineering tasks. ...

Exploiting AI Agent Benchmarks: The 2026 Crisis of Trust in Agent Evaluation

Introduction: The Year AI Agent Benchmarks Broke

If you have been following AI agent benchmarks in 2026, you have likely seen headline numbers that look too good to be true. A tool called BenchJack scored 100% on SWE-bench Verified, SWE-bench Pro, and Terminal-Bench without solving a single task; most runs never even called a language model. OpenAI retired SWE-bench Verified after discovering that 59.4% of hard failed tasks had broken test cases rejecting functionally correct patches. And the same frontier model scored 64.7% on one wrapper and 57.5% on another, a 7.2-point gap from scaffolding alone. This is the state of exploiting AI agent benchmarks in 2026: a system where the incentives to publish high scores have outpaced the rigor of the evaluations themselves. ...

Claude Sonnet 5 vs GPT-5.4 for Coding: SWE-bench Benchmark Comparison 2026

Claude Sonnet 5 scores 82.1% on SWE-bench Verified and 46%+ on SWE-bench Pro, while GPT-5.4 scores 57.7% on SWE-bench Pro with comparable Verified scores around 85%. For most coding workflows, Sonnet 5 delivers a stronger autonomous code-editing experience, but GPT-5.4’s reasoning levels give it an edge in cost-flexibility for high-stakes reasoning tasks.

What Is the SWE-bench Benchmark and Why Does It Matter for Coding?

SWE-bench is the most respected real-world coding benchmark in 2026, built from actual GitHub issues submitted to production Python repositories including Django, Flask, and Scikit-learn. Unlike HumanEval, which tests isolated function writing and is now saturated at 95%+ for frontier models, SWE-bench requires a model to read a bug report, navigate a real codebase, write a patch, and pass the repository’s own test suite. This means the benchmark tests the full software engineering loop, not just code generation from a clean prompt. SWE-bench Verified contains 500 human-validated tasks, while SWE-bench Pro uses harder tasks from private and less-contaminated repositories. As of May 2026, Claude Sonnet 5 holds an 82.1% SWE-bench Verified score (the first model to break the 80% barrier) and GPT-5.4 leads SWE-bench Pro at 57.7%, reflecting fundamentally different strengths: Sonnet 5 excels at agentic, autonomous patch generation, while GPT-5.4 integrates broader reasoning and computer-use capabilities in a single model. ...

GLM-5.1 Review 2026: #1 SWE-bench Pro, MIT License, $1/M Tokens

GLM-5.1 is the first open-weight model to claim the #1 position on SWE-Bench Pro, scoring 58.4, ahead of GPT-5.4 (57.7) and Claude Opus 4.6 (57.3). Released April 7, 2026 by Z.AI under an MIT license, it costs $1.40/M input tokens versus Claude Opus 4.7’s $5.00/M, making it the most cost-effective frontier-class coding model available today.

What Is GLM-5.1? The Open-Source Frontier Model from Z.AI

GLM-5.1 is a 754B-parameter Mixture-of-Experts language model developed by Z.AI (formerly Zhipu AI) and released on April 7, 2026, under the MIT license. It activates only 40B parameters per forward pass via its sparse MoE routing, which delivers frontier-tier reasoning at significantly lower inference cost than dense models of comparable quality. The architecture combines DeepSeek Sparse Attention (DSA) for efficient long-context processing, a 203K-token context window, and asynchronous reinforcement learning via Z.AI’s proprietary “slime” training framework. In independent benchmarking by BenchLM, GLM-5.1 ranks 14th out of 115 models with an overall composite score of 83/100. What sets it apart is the combination of open weights, commercial-use permissive licensing, and a demonstrated capability peak at software engineering tasks that no prior open-weight model has matched. Teams can access it via the Z.AI API, self-host via Hugging Face and Ollama, or integrate it as a drop-in replacement for the OpenAI SDK through vLLM’s OpenAI-compatible endpoint. ...
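The cost and sparsity claims quoted above reduce to two simple ratios. The following is a minimal Python sketch that uses only the figures stated in this excerpt; the constant and function names are illustrative, and output-token pricing is left out because the excerpt does not give it.

```python
# Figures as quoted in the excerpt above; all names are illustrative.
GLM_INPUT_USD_PER_M = 1.40   # GLM-5.1 input price per million tokens
OPUS_INPUT_USD_PER_M = 5.00  # Claude Opus 4.7 input price per million tokens
TOTAL_PARAMS_B = 754         # total parameters, in billions
ACTIVE_PARAMS_B = 40         # parameters active per forward pass, in billions


def input_price_ratio(cheaper: float, pricier: float) -> float:
    """How many times more the pricier model costs per input token."""
    return pricier / cheaper


def active_share(active_b: float, total_b: float) -> float:
    """Fraction of the model's parameters used on each forward pass."""
    return active_b / total_b


price_ratio = input_price_ratio(GLM_INPUT_USD_PER_M, OPUS_INPUT_USD_PER_M)
share = active_share(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)

print(f"Input price ratio: {price_ratio:.2f}x")
print(f"Active parameter share: {share:.1%}")
```

Under these figures the input-price ratio is roughly 3.6x, and about 5.3% of the parameters are active on each forward pass. A full cost comparison would also need output-token prices and typical token counts per task, neither of which is stated here.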

LM Council Benchmarks: The Independent LLM Leaderboard Developers Should Trust

Claude Opus 4.6 resolves 80.8% of real GitHub issues on SWE-bench Verified while GPT-5.5 leads Terminal-Bench 2.0 at 82.7% — numbers that mean something precisely because they come from independent evaluation pipelines, not vendor press releases. Choosing an LLM in 2026 without understanding how these benchmarks work is like buying a server based solely on manufacturer marketing sheets. This guide covers the LM Council evaluation framework, the top independent leaderboards developers actually rely on, and how to read benchmark results without getting misled. ...

SWE-bench Explained: How to Use Coding Benchmarks to Pick an LLM (2026 Guide)

SWE-bench measures how well an LLM can resolve real-world GitHub issues end-to-end, not toy problems. As of May 2026, scores range from 93.9% (Claude Mythos Preview on Verified) to 23% on the harder, contamination-resistant Pro variant. Here’s how to read those numbers without being misled.

What Is SWE-bench and Why Developers Should Care

SWE-bench is an open-source benchmark developed by Princeton NLP that evaluates LLMs on real software engineering tasks drawn from merged pull requests across popular open-source repositories. Unlike HumanEval, which tests whether a model can write a function to pass unit tests, SWE-bench requires a model to read a full repository, understand the failing test, locate the root cause across multiple files, and produce a patch that actually makes tests pass. As of May 2026, 89 models have been evaluated on SWE-bench Verified, with an average pass rate of 63.4% and a top score of 93.9% achieved by Claude Mythos Preview. The benchmark was released by Princeton in 2023 and has become the de facto standard for evaluating AI coding agents. If you are evaluating an AI coding assistant, SWE-bench Verified is the first leaderboard you should consult, but as this guide explains, it is not the last word on real-world performance. ...
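To make the pass criterion described above concrete: a task is generally counted as resolved only when the tests that failed before the patch now pass and the tests that already passed still pass. The sketch below is a simplified, hypothetical scorer and not the SWE-bench harness itself; `TaskResult`, its field names, and the sample data are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical per-task outcome after applying a model's patch."""
    task_id: str
    fail_to_pass: dict[str, bool]  # tests the fix must turn green -> passed?
    pass_to_pass: dict[str, bool]  # tests that must stay green -> passed?


def is_resolved(result: TaskResult) -> bool:
    """Resolved = every target test now passes and nothing regressed."""
    return (
        bool(result.fail_to_pass)
        and all(result.fail_to_pass.values())
        and all(result.pass_to_pass.values())
    )


def resolve_rate(results: list[TaskResult]) -> float:
    """Percentage of tasks resolved; 0.0 for an empty run."""
    if not results:
        return 0.0
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)


# Made-up sample data, not real benchmark results.
samples = [
    TaskResult("task-a", {"test_bug": True}, {"test_other": True}),
    TaskResult("task-b", {"test_bug": True}, {"test_other": False}),  # regression
    TaskResult("task-c", {"test_bug": False}, {"test_other": True}),  # bug not fixed
]

print(f"Resolve rate: {resolve_rate(samples):.1f}%")
```

Only the first sample counts as resolved: the second fixes the bug but breaks an existing test, and the third leaves the target test failing. A headline percentage therefore depends entirely on how trustworthy the underlying test cases are.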

GLM-4.7 Coding Guide 2026: The Open-Source LLM Beating Claude Sonnet

GLM-4.7 from Zhipu AI scores 73.8% on SWE-bench and 84.9% on LiveCodeBench V6, numbers that match or beat Claude Sonnet 4.5 on coding benchmarks. It’s fully open-source (Apache 2.0), runs locally, and costs $0 per token. If you’re paying $20+/month for a commercial coding assistant and your use case is standard development tasks, GLM-4.7 deserves a serious look.

What Is GLM-4.7 and Why Are Developers Switching?

GLM-4.7 is Zhipu AI’s flagship open-source large language model, optimized for multi-turn reasoning and software development tasks. Launched in early 2026, it sits at the top of the open-source coding benchmark leaderboard: 73.8% on SWE-bench and 84.9% on LiveCodeBench V6, putting it within 2-3 percentage points of Claude Sonnet 4.5. What makes GLM-4.7 different from previous open-source coding models isn’t just benchmark scores; it’s the “Preserved Thinking” architecture that maintains reasoning quality across extended, multi-turn coding sessions. Most open-source models degrade noticeably after 5-6 back-and-forth exchanges as context fills up. GLM-4.7 scores 8.5/10 for complex reasoning consistency across 10+ turns, a gap that shows up directly when you’re doing iterative refactoring or debugging complex systems. Zhipu AI also made a hardware bet: GLM series models are trained entirely on Huawei Ascend chips, not NVIDIA, which matters for organizations concerned about supply chain dependencies. The combination of competitive benchmarks, zero licensing costs, and hardware independence is driving 40% year-over-year growth in open-source coding model adoption according to GitHub’s 2026 developer survey. ...

LLM Benchmarks Guide for Developers 2026: SWE-bench, GPQA, LiveCodeBench Explained

LLM benchmark scores flood every model release announcement, but as of 2026, most of those scores tell you almost nothing useful. This guide explains which benchmarks still matter for developers, which are saturated or compromised, and how to pick the right signal for your actual workload.

Why LLM Benchmarks Matter for Developers (And Why Most Are Now Useless)

LLM benchmarks are standardized test suites that measure model capabilities across defined tasks (coding, reasoning, math, or domain knowledge) so developers can compare models without running every candidate through their own production workload. Done right, they save weeks of internal evaluation. Done wrong, they create a false confidence loop where a model scores 92% on a benchmark and then fails on the first real customer ticket you throw at it. As of May 2026, the benchmark landscape has split sharply: a small set of hard, contamination-resistant evaluations still provide genuine signal, while the legacy suites (MMLU, HumanEval, GSM8K) have been effectively retired by the community because frontier models have saturated them. MMLU, once the canonical academic reasoning suite, now sees frontier models cluster at 85–90% with no meaningful spread between Claude, GPT, and Gemini variants. HumanEval similarly sees 93%+ scores across top-tier models as of April 2026. When every serious model aces the same test, the test stops being useful. The benchmarks worth tracking now are the ones that are still hard enough to differentiate, and that requires understanding why they’re hard. ...

Devstral 2 Review 2026: Mistral's Open-Source Coding Agent Hits 72.2% SWE-bench

Devstral 2 is Mistral AI’s most capable open-weight coding model, achieving 72.2% on SWE-bench Verified, the highest score ever recorded by an open-source model at its parameter count. Released in late 2025 alongside the Mistral Vibe CLI, it costs $0.40 per million input tokens, making it up to 7x cheaper than Claude Sonnet for typical coding workloads.

What Is Devstral 2? Overview of Mistral’s Latest Open-Source Coding Agent

Devstral 2 is a 123-billion parameter open-weight large language model purpose-built for agentic software engineering tasks: it can autonomously navigate codebases, edit multiple files, run tools, and resolve GitHub issues end-to-end. Released by Mistral AI in December 2025, it achieves 72.2% on SWE-bench Verified (the industry-standard benchmark for autonomous bug-fixing), placing it at the frontier of all open-weight models and ahead of significantly larger competitors including DeepSeek V3.2 (672B) and Kimi K2 (1T). Unlike most frontier coding models, Devstral 2 is released under the Apache 2.0 license, meaning developers can download, self-host, fine-tune, and deploy it commercially without restriction. In human evaluations against DeepSeek V3.2, Devstral 2 wins 42.8% of coding tasks versus a 28.6% loss rate, a meaningful real-world advantage that SWE-bench alone doesn’t fully capture. The model supports a 256K-token context window, enabling comprehension of entire repositories in a single pass. For teams that need frontier-grade coding intelligence without proprietary lock-in, Devstral 2 is the clearest option available in 2026. ...