Anthropic

Claude Opus 4.6 Review 2026: The New SWE-Bench Leader for Coding

Claude Opus 4.6 scores 80.8% on SWE-bench Verified — the highest for any general-purpose AI model at launch — and delivers an 83% jump in ARC-AGI-2 reasoning (from 37.6% to 68.8%). Agent Teams demonstrated building a 100,000-line C compiler that boots Linux. For most developer teams the question isn’t “is it better” but “where is it better and does that justify the cost.” Benchmark Breakdown: SWE-Bench, ARC-AGI-2, and Terminal-Bench Claude Opus 4.6 is the current SWE-bench Verified leader at 80.8%, an incremental step up from Opus 4.5’s 80.9% — essentially a tie, but a tie at the top. The more dramatic story is ARC-AGI-2: Opus 4.6 scores 68.8% compared to 37.6% on Opus 4.5, an 83% relative improvement on the benchmark designed to measure fluid reasoning and novel problem-solving rather than memorized patterns. GPQA Diamond (graduate-level science questions) reached 91.3%, the highest score ever recorded on that test. These are not incremental gains — the reasoning architecture changed fundamentally. Where Opus 4.6 falls short is Terminal-Bench 2.0, scoring 65.4% against GPT-5.3 Codex’s 77.3%. Terminal-Bench measures agentic, multi-step shell and CLI tasks, and the gap here explains a lot about why GPT-5.3 Codex wins head-to-head in highly autonomous terminal workflows even as Opus 4.6 leads on SWE-bench, which tests code quality, correctness, and test-passing rates. Response latency also improved: 2.9 seconds per 1,000 tokens versus 3.2s on Opus 4.5, a 9.4% speedup that matters when running long agent chains. ...

LLM Function Calling and Tool Use Guide 2026: OpenAI, Anthropic, Google

Function calling is the bridge between a language model’s text output and the real world. Instead of asking a model to guess what the weather is, you hand it a get_weather tool definition, and it decides when to call it, what arguments to pass, and how to incorporate the result. As of 2026, every major provider—OpenAI, Anthropic, and Google—supports this pattern, but the APIs look meaningfully different. This guide walks through each one with working Python code and covers parallel calls, agent loops, security, and how to pick the right approach. ...

Claude API 300K Output Tokens: Complete Guide to Long-Form Generation (2026)

The Claude API now supports up to 300,000 output tokens per request — roughly 460 pages of text in a single API call — but only through the Message Batches API with a specific beta header. The synchronous API remains capped at 64K tokens. This guide explains exactly how to enable 300K output, which models support it, when to use it, and what it costs. What Are Claude API 300K Output Tokens? Claude API 300K output tokens refers to Anthropic’s maximum per-request generation limit, available on Claude Sonnet 4.6, Opus 4.6, and Opus 4.7 via the asynchronous Message Batches API. At approximately 650 words per 1,000 tokens, 300,000 tokens translates to roughly 195,000 words — the equivalent of a 460-page technical document or a full software codebase migration in a single API call. This capability is unlocked by passing the output-300k-2026-03-24 beta header with your batch request; without it, even Sonnet 4.6 caps at 64K tokens on synchronous calls. The 300K limit represents a 4.7× increase over the previous 64K ceiling and is the highest output token limit of any major LLM API in 2026 — GPT-4o Long Output tops out at 64K, and Gemini 1.5 Pro at 8K. For enterprises running document generation, codebase analysis, or legal drafting pipelines, this change fundamentally alters the economics of LLM-based automation. ...

Claude Opus 4.7 Tokenizer Cost Trap: Up to 35% More Tokens Explained

Claude Opus 4.7 launched on April 16, 2026 at the same $5/$25 per million token price as Opus 4.6 — but a redesigned tokenizer silently inflates English and code inputs by 1.20x–1.47x, meaning your real bill can jump 12–35% with zero sticker price change. What Changed: The Claude Opus 4.7 Tokenizer Update Explained Claude Opus 4.7’s tokenizer is a deliberate architectural redesign, not an incremental tweak. Anthropic replaced the byte-pair encoding vocabulary used in Opus 4.6 with a new multilingual-optimized tokenizer that assigns denser, more efficient representations to non-Latin scripts (Chinese, Japanese, Korean, Arabic) at the cost of slightly less efficient encoding for English text and structured code. In plain terms: the same English sentence or Python function now produces more tokens on Opus 4.7 than it did on Opus 4.6. Measurements from real production traffic show 1.20x–1.47x token inflation for English and code, while CJK content sees only 1.005x–1.07x change, and non-Latin multilingual content actually benefits with 20–35% fewer tokens. This means a $1,000 monthly invoice on Opus 4.6 can become $1,120–$1,350 on Opus 4.7 if you migrate without auditing your workload first. The model itself scores 87.6% on SWE-bench Verified (up from 80.8%), so the performance gain is real — but so is the tax. ...

Claude Opus 4.7 budget_tokens Removal: Migration from Extended Thinking

Claude Opus 4.7, released April 16, 2026, silently removed budget_tokens from its extended thinking API. Any code that passes budget_tokens to Opus 4.7 receives an immediate 400 Bad Request error. The fix is a four-step migration: switch to adaptive thinking type, replace budget_tokens with the effort parameter, update agentic loops to use task_budget, and strip temperature, top_p, and top_k. This guide walks through each step with exact before/after code. What Changed in Claude Opus 4.7: budget_tokens Is Gone Claude Opus 4.7 removed budget_tokens entirely from the extended thinking configuration, replacing it with an adaptive thinking system that automatically allocates reasoning compute based on task complexity. The change affects every application that previously used thinking: { type: "enabled", budget_tokens: N } to control how much the model “thinks” before responding. Released April 16, 2026, Opus 4.7 also removes temperature, top_p, and top_k parameters — three additional fields that silently accepted values in 4.6 but now return 400 errors in 4.7. Pricing remains unchanged at $5/M input tokens and $25/M output tokens, and the model shows a 13% coding benchmark lift over Opus 4.6 on Anthropic’s internal 93-task evaluation. For teams upgrading by changing only the model string, these breaking changes arrive without warning in production — there is no deprecation header or soft-failure mode in the API response before the hard 400 begins. ...

LLM Prompt Caching Guide 2026: Cut API Costs 70% with Anthropic and OpenAI

Prompt caching is the single highest-ROI optimization available for production LLM applications. If you run 10,000 requests per day with an 8K-token cached system prompt on Anthropic Claude, you save roughly $576/month — with a few lines of code change. OpenAI’s automatic caching requires zero code changes and gives you a 50% discount on repeated input tokens. Anthropic’s explicit caching offers up to 90% savings. This guide covers both, plus Gemini, with production code examples, real cost numbers, and the anti-patterns that silently destroy your cache hit rate. ...

Claude Code Best Practices 2026: 15 Habits of Developers Who Ship Faster

The difference between a developer who saves 10 minutes a day with Claude Code and one who saves 3–4 hours comes down to configuration and habit. Claude Code, launched as v1.0 by Anthropic in November 2025, is not a chat interface — it’s a programmable agent runtime that operates directly inside your terminal, reads and edits your codebase autonomously, and can be extended with persistent memory, custom skills, and external tool integrations. Developer surveys in 2026 report an average 40% reduction in coding task time for teams using it properly. The 15 habits below are what separates the 40% cohort from everyone else. ...