Claude Opus 4.6 Review 2026: The New SWE-Bench Leader for Coding


Claude Opus 4.6 scores 80.8% on SWE-bench Verified, the highest for any general-purpose AI model at launch, and delivers an 83% jump in ARC-AGI-2 reasoning (from 37.6% to 68.8%). Agent Teams demonstrated building a 100,000-line C compiler that boots Linux. For most developer teams the question isn’t “is it better” but “where is it better, and does that justify the cost?”

Benchmark Breakdown: SWE-Bench, ARC-AGI-2, and Terminal-Bench

Claude Opus 4.6 is the current SWE-bench Verified leader at 80.8%, statistically level with Opus 4.5’s 80.9%: essentially a tie, but a tie at the top. The more dramatic story is ARC-AGI-2: Opus 4.6 scores 68.8% against 37.6% for Opus 4.5, an 83% relative improvement on the benchmark designed to measure fluid reasoning and novel problem-solving rather than memorized patterns. GPQA Diamond (graduate-level science questions) reached 91.3%, the highest score ever recorded on that test. These are not incremental gains; the reasoning architecture changed fundamentally.

Where Opus 4.6 falls short is Terminal-Bench 2.0, scoring 65.4% against GPT-5.3 Codex’s 77.3%. Terminal-Bench measures agentic, multi-step shell and CLI tasks, and the gap here explains a lot about why GPT-5.3 Codex wins head-to-head in highly autonomous terminal workflows even as Opus 4.6 leads on SWE-bench, which tests code quality, correctness, and test-passing rates. Response latency also improved: 2.9 seconds per 1,000 tokens versus 3.2s on Opus 4.5, a 9.4% speedup that matters when running long agent chains. ...
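The relative figures quoted above are plain percentage deltas; here is a minimal sketch to reproduce them (the helper name relative_change is ours for illustration, not part of any benchmark tooling):

```python
def relative_change(old: float, new: float) -> float:
    """Percentage change from old to new (negative means a decrease)."""
    return (new - old) / old * 100

# SWE-bench Verified: Opus 4.5 (80.9) -> Opus 4.6 (80.8)
print(f"SWE-bench delta: {relative_change(80.9, 80.8):+.2f}%")  # -0.12%, a statistical tie

# ARC-AGI-2: 37.6 -> 68.8
print(f"ARC-AGI-2 gain:  {relative_change(37.6, 68.8):+.1f}%")  # +83.0%, the headline jump

# Latency per 1,000 tokens: 3.2s -> 2.9s (lower is better)
print(f"Latency change:  {relative_change(3.2, 2.9):+.1f}%")    # -9.4%, i.e. a 9.4% speedup
```

Note that the latency speedup is measured against the old 3.2s baseline: since lower is better, a 0.3s reduction works out to 9.4%, not 10.3%.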

April 28, 2026 · 13 min · baeseokjae