<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Model Selection on RockB</title><link>https://baeseokjae.github.io/tags/model-selection/</link><description>Recent content in Model Selection on RockB</description><image><title>RockB</title><url>https://baeseokjae.github.io/images/og-default.png</url><link>https://baeseokjae.github.io/images/og-default.png</link></image><generator>Hugo</generator><language>en-us</language><lastBuildDate>Mon, 18 May 2026 21:08:00 +0000</lastBuildDate><atom:link href="https://baeseokjae.github.io/tags/model-selection/index.xml" rel="self" type="application/rss+xml"/><item><title>GitHub Model Selection Guide: Choosing Claude vs Codex for GitHub Coding Agents</title><link>https://baeseokjae.github.io/posts/github-model-selection-coding-agents-2026/</link><pubDate>Mon, 18 May 2026 21:08:00 +0000</pubDate><guid>https://baeseokjae.github.io/posts/github-model-selection-coding-agents-2026/</guid><description>A practical guide to choosing between Claude and Codex models for GitHub coding agents — with benchmarks, cost breakdown, and task-based decision matrix.</description><content:encoded><![CDATA[<p>GitHub now lets you pick your AI model when kicking off a coding agent task. Claude Sonnet 4.6, Claude Opus 4.6, GPT-5.2-Codex, and GPT-5.4 are all available — and which one you choose has a direct impact on code quality, task completion rate, and your monthly bill. This guide cuts through the noise with benchmarks, cost data, and a concrete decision framework so you can stop guessing and start shipping.</p>
<h2 id="what-are-github-coding-agents-claude-codex-and-copilot-explained">What Are GitHub Coding Agents? (Claude, Codex, and Copilot Explained)</h2>
<p>GitHub coding agents are autonomous AI systems that accept a task description, plan a multi-step workflow, read and write files in your repository, run tests, and open pull requests, all without requiring you to supervise every step. As of April 2026, GitHub supports three agent providers: Claude (from Anthropic), Codex (from OpenAI), and GitHub Copilot&rsquo;s built-in agent powered by the Auto model. Each agent runs inside a sandboxed environment with access to your repo, CI logs, and optionally external tools. The key difference from Copilot&rsquo;s inline autocomplete is that agents handle full tasks asynchronously: you describe what you want done (fix this bug, refactor this module, add these tests), the agent works autonomously, and you review the resulting PR. Claude agents have been available on GitHub.com since April 14, 2026, alongside the Codex agents that launched earlier in 2026. All three agents require a paid Copilot subscription: Pro, Pro+, Business, or Enterprise.</p>
<h3 id="how-github-agents-differ-from-copilot-autocomplete">How GitHub Agents Differ from Copilot Autocomplete</h3>
<p>GitHub agents operate at the task level while Copilot autocomplete operates at the line level. Autocomplete suggests the next few lines of code as you type; agents read your entire repository context, write across multiple files, run your test suite, fix failing tests, and produce a reviewable PR. The mental model shift: autocomplete is a co-pilot; an agent is a junior developer you can hand a ticket to. Claude agents are particularly strong at understanding large codebases holistically before making changes: they analyze dependencies and architecture before touching a single line. Codex agents are optimized for speed, running parallel subagents (up to 8 simultaneously) and completing tasks asynchronously in OpenAI&rsquo;s cloud sandbox. Both require you to define the task clearly; vague prompts produce vague PRs from either system.</p>
<h2 id="github-agent-hq-the-multi-model-command-center">GitHub Agent HQ: The Multi-Model Command Center</h2>
<p>GitHub Agent HQ launched February 4, 2026, as a multi-agent command center where you can assign the same task — or complementary tasks — to Claude, Codex, and GitHub Copilot simultaneously. Agent HQ is accessible from github.com, GitHub Mobile, and VS Code. The key value proposition is parallel coverage: run Claude for a complex refactoring task while Codex handles async test generation for the same feature, then compare the resulting PRs side by side. Early Agent HQ users report error rates of 5–10% per run, meaning human oversight remains essential — but the throughput gains justify the coordination overhead for teams with high PR volumes. Multi-agent workflows typically cost 2–4x single-agent setups in premium request consumption, but teams report proportional productivity gains on large feature work. Agent HQ treats model selection as a first-class decision: you choose the model at task kick-off, not as an afterthought.</p>
<h3 id="navigating-the-agent-hq-interface">Navigating the Agent HQ Interface</h3>
<p>When you open Agent HQ on github.com, you see a task creation panel with a model picker at the top. The interface shows available agents (Claude, Codex, Copilot), the model variants within each agent, and a premium request cost indicator before you commit. You can run multiple agents concurrently on the same feature as long as their tasks won&rsquo;t modify overlapping files; Agent HQ warns you if assigned tasks risk merge conflicts. The task queue shows in-progress agents with live status updates, and completed tasks link directly to their PRs for review. VS Code integration surfaces Agent HQ results inline so you can accept, edit, or reject changes without leaving your editor.</p>
<h2 id="claude-agent-on-github--available-models-and-strengths">Claude Agent on GitHub — Available Models and Strengths</h2>
<p>Claude agents on GitHub are available in four models: Claude Sonnet 4.6, Claude Opus 4.6, Claude Sonnet 4.5, and Claude Opus 4.5. Claude Opus 4.6 scores 80.8% on SWE-bench Verified, while Claude Sonnet 4.6 scores 79.6%, a gap of 1.2 percentage points that makes Sonnet the default choice for most tasks given that it costs one-third as many premium requests. Claude&rsquo;s primary architectural strength is deep context understanding: it can load a large codebase, trace dependency graphs, and make coherent changes across dozens of files without losing track of constraints introduced 10,000 tokens earlier in the context. Anthropic&rsquo;s models consistently lead blind code quality evaluations: Claude Code (which powers the GitHub agent) achieved a 67% win rate against Codex in early 2026 head-to-head tests. The tradeoff is token efficiency: Claude uses approximately 4x more tokens than Codex on identical tasks, which matters if you&rsquo;re on a plan with premium request limits.</p>
<h3 id="claude-sonnet-46-vs-claude-opus-46-when-to-upgrade">Claude Sonnet 4.6 vs Claude Opus 4.6: When to Upgrade</h3>
<p>Claude Sonnet 4.6 costs 1 premium request per agent session; Claude Opus 4.6 costs 3 premium requests (the 3x multiplier). Given the narrow SWE-bench gap (79.6% vs 80.8%), Sonnet is the default for most tasks. Upgrade to Opus when: (1) the task requires architectural understanding across a large legacy codebase, (2) you need the agent to catch subtle logic errors in complex business rules, or (3) the task has high stakes and a failed PR costs more than the premium request difference. For routine tasks — adding tests, writing docstrings, fixing lint errors, implementing straightforward features — Sonnet delivers near-identical output at one-third the cost.</p>
<h2 id="codex-agent-on-github--available-models-and-strengths">Codex Agent on GitHub — Available Models and Strengths</h2>
<p>The Codex agent on GitHub runs on three model variants: GPT-5.2-Codex, GPT-5.3-Codex, and GPT-5.4. Codex was designed cloud-native from the ground up — it runs asynchronously in OpenAI&rsquo;s sandboxed environment, meaning your task keeps running even after you close the browser. Codex shipped subagents to general availability on March 14, 2026, supporting up to 8 parallel agents working simultaneously on decomposed subtasks. On Terminal-Bench 2.0, Codex leads at 77.3%, reflecting its strength in command-line-heavy workflows, build automation, and scripting tasks. Codex reads the AGENTS.md open standard, which is now supported by thousands of open-source projects — if your repo has an AGENTS.md file, Codex will follow its instructions for how to run tests, which commands to use, and which files to avoid. Codex trades some context depth for speed and parallelism: it&rsquo;s optimized for throughput on well-defined tasks, not deep architectural reasoning.</p>
<h3 id="gpt-52-codex-vs-gpt-53-codex-vs-gpt-54">GPT-5.2-Codex vs GPT-5.3-Codex vs GPT-5.4</h3>
<p>GPT-5.4 is the most capable Codex model on GitHub and the right choice for complex multi-file tasks and business-critical code paths. GPT-5.3-Codex is the mid-tier option, useful when you want better-than-baseline results but need to conserve premium requests at scale. GPT-5.2-Codex is the entry-level Codex model, appropriate for simple automation, CI script generation, repetitive boilerplate, and other tasks you&rsquo;d hand to a junior developer with unambiguous requirements. If your team runs hundreds of Codex tasks per month, model tier selection becomes a significant cost lever.</p>
<h2 id="head-to-head-benchmark-claude-vs-codex-performance">Head-to-Head Benchmark: Claude vs Codex Performance</h2>
<p>Claude and Codex are the two dominant agent platforms on GitHub, and their benchmark profiles reflect fundamentally different design philosophies. Claude Opus 4.6 scores 80.8% on SWE-bench Verified, the industry standard for real-world software engineering tasks, while Codex leads Terminal-Bench 2.0 at 77.3%, a benchmark focused on command-line automation and scripting workflows. In blind code quality evaluations conducted in early 2026, Claude Code achieved a 67% win rate against Codex across diverse coding tasks. The token efficiency gap is substantial: Claude uses 4x more tokens than Codex on identical benchmark tasks, which does not mean Claude is slower per se, but does mean higher cost per task at equivalent capability tiers. For interactive and refactoring tasks that require holding large amounts of code in context, Claude&rsquo;s benchmark lead is real. For parallel async automation where throughput matters more than depth, Codex&rsquo;s subagent architecture (up to 8 parallel agents) provides an architectural advantage that no single-agent score captures.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Claude Opus 4.6</th>
          <th>Claude Sonnet 4.6</th>
          <th>GPT-5.4 (Codex)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SWE-bench Verified</td>
          <td>80.8%</td>
          <td>79.6%</td>
          <td>Not reported</td>
      </tr>
      <tr>
          <td>Terminal-Bench 2.0</td>
          <td>Not reported</td>
          <td>Not reported</td>
          <td>77.3% (Codex)</td>
      </tr>
      <tr>
          <td>Code quality (blind eval)</td>
          <td>67% win rate vs Codex (measured for Claude Code)</td>
          <td>—</td>
          <td>—</td>
      </tr>
      <tr>
          <td>Token efficiency</td>
          <td>~4x Codex</td>
          <td>~3x Codex</td>
          <td>Baseline</td>
      </tr>
      <tr>
          <td>Premium request cost</td>
          <td>3x multiplier</td>
          <td>1x</td>
          <td>Varies by tier</td>
      </tr>
      <tr>
          <td>Max parallel agents</td>
          <td>1 per task</td>
          <td>1 per task</td>
          <td>8 subagents (GA)</td>
      </tr>
      <tr>
          <td>Data residency</td>
          <td>Anthropic API (via the GitHub integration)</td>
          <td>Anthropic API (via the GitHub integration)</td>
          <td>OpenAI cloud sandbox</td>
      </tr>
  </tbody>
</table>
<h2 id="how-to-select-the-right-model-for-your-task-type">How to Select the Right Model for Your Task Type</h2>
<p>The right model choice follows from the nature of the task, not just the benchmark scores. Claude excels at tasks that require understanding large amounts of existing code before making changes: architectural refactoring, complex bug investigation, cross-cutting feature implementation. Codex excels at well-defined, parallelizable tasks where speed and async execution matter more than deep context reasoning: test generation at scale, CI/CD pipeline automation, migration scripts, and tasks that split into as many as 8 subtasks running simultaneously. The clearest signal: if you&rsquo;d give the task to a senior developer who needs to understand the whole system before touching it, use Claude Opus or Sonnet. If you&rsquo;d give it to a team of developers each working on a clearly defined slice, use Codex with subagents enabled.</p>
<h3 id="task-based-decision-matrix">Task-Based Decision Matrix</h3>
<table>
  <thead>
      <tr>
          <th>Task Type</th>
          <th>Recommended Model</th>
          <th>Reason</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Large codebase refactoring</td>
          <td>Claude Opus 4.6</td>
          <td>Needs deep context understanding</td>
      </tr>
      <tr>
          <td>Bug fix with unclear root cause</td>
          <td>Claude Sonnet 4.6</td>
          <td>Context depth at lower cost</td>
      </tr>
      <tr>
          <td>Test suite generation (broad coverage)</td>
          <td>GPT-5.4 Codex + subagents</td>
          <td>Parallelism wins here</td>
      </tr>
      <tr>
          <td>CI/CD pipeline automation</td>
          <td>GPT-5.2-Codex or GPT-5.3-Codex</td>
          <td>Scripting-heavy, well-defined</td>
      </tr>
      <tr>
          <td>Feature implementation (greenfield)</td>
          <td>Claude Sonnet 4.6</td>
          <td>Complex reasoning needed</td>
      </tr>
      <tr>
          <td>Repetitive boilerplate generation</td>
          <td>GPT-5.2-Codex</td>
          <td>Fast, cheap, reliable</td>
      </tr>
      <tr>
          <td>Security audit and fix</td>
          <td>Claude Opus 4.6</td>
          <td>Subtle logic analysis critical</td>
      </tr>
      <tr>
          <td>Migration scripts</td>
          <td>GPT-5.3-Codex</td>
          <td>Parallel execution helpful</td>
      </tr>
      <tr>
          <td>Docstring and comment generation</td>
          <td>Claude Sonnet 4.6</td>
          <td>Language quality matters</td>
      </tr>
      <tr>
          <td>Build tooling setup</td>
          <td>GPT-5.2-Codex</td>
          <td>Terminal-Bench 2.0 strength</td>
      </tr>
  </tbody>
</table>
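<p>The matrix above can be transcribed into a small lookup table, which is handy if you want to suggest a model from a script or issue template. This is a minimal sketch: the <code>DECISION_MATRIX</code> and <code>recommend</code> names are hypothetical, the keys are the task types from the table in lowercase, and falling back to Auto for unlisted task types is an assumption rather than a GitHub behavior.</p>

```python
# Hypothetical lookup table transcribing the decision matrix above.
# Keys are the task types from the table, lowercased for matching.
DECISION_MATRIX = {
    "large codebase refactoring": ("Claude Opus 4.6", "Needs deep context understanding"),
    "bug fix with unclear root cause": ("Claude Sonnet 4.6", "Context depth at lower cost"),
    "test suite generation (broad coverage)": ("GPT-5.4 Codex + subagents", "Parallelism wins here"),
    "ci/cd pipeline automation": ("GPT-5.2-Codex or GPT-5.3-Codex", "Scripting-heavy, well-defined"),
    "feature implementation (greenfield)": ("Claude Sonnet 4.6", "Complex reasoning needed"),
    "repetitive boilerplate generation": ("GPT-5.2-Codex", "Fast, cheap, reliable"),
    "security audit and fix": ("Claude Opus 4.6", "Subtle logic analysis critical"),
    "migration scripts": ("GPT-5.3-Codex", "Parallel execution helpful"),
    "docstring and comment generation": ("Claude Sonnet 4.6", "Language quality matters"),
    "build tooling setup": ("GPT-5.2-Codex", "Terminal-bench strength"),
}


def recommend(task_type: str) -> tuple[str, str]:
    """Return (model, reason) for a task type from the matrix.

    Unlisted task types fall back to Auto; that fallback is an
    assumption of this sketch, not something the matrix specifies.
    """
    key = task_type.strip().lower()
    return DECISION_MATRIX.get(key, ("Auto", "No matrix entry; let Auto pick"))


model, reason = recommend("Security audit and fix")
print(f"{model}: {reason}")
```

<p>Keeping the matrix as data rather than as a chain of conditionals makes it easy to adjust when your team&rsquo;s experience with a model differs from the table.</p>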
<h3 id="when-auto-model-selection-is-enough">When Auto Model Selection Is Enough</h3>
<p>GitHub Copilot&rsquo;s Auto model selection automatically picks from GPT-4.1, GPT-5 mini, Claude Haiku 4.5, and Claude Sonnet 4.5 based on your prompt. Auto is appropriate for quick tasks with clear scope where you don&rsquo;t need a specific model&rsquo;s strengths. Override Auto when: (1) you need Claude&rsquo;s deep context reasoning for a large refactor, (2) you need Codex&rsquo;s parallel subagents for a throughput-heavy task, or (3) you&rsquo;re on a tight premium request budget and Auto tends to pick expensive models for simple tasks. Auto also lacks the ability to spawn Codex subagents — for parallel multi-agent workflows, you must select Codex explicitly.</p>
<h2 id="pricing-and-premium-request-costs-for-claude-and-codex">Pricing and Premium Request Costs for Claude and Codex</h2>
<p>GitHub Copilot pricing starts at Free, with Pro at $10/month, Pro+ at $39/month, Business at $19/user/month, and Enterprise at $39/user/month. Premium requests, the currency for agent usage, are allocated per subscription tier: Pro gets 300 premium requests per month, Pro+ gets significantly more. The cost multipliers for Claude models are critical to understand before you start running agents at scale: Claude Sonnet 4.6 costs 1 premium request per agent session; Claude Opus 4.6 costs 3 premium requests per session (the 3x multiplier). Codex premium request costs vary by GPT model tier. On Business plans, admins can set per-user premium request caps to prevent runaway agent usage. Multi-agent workflows in Agent HQ cost 2–4x single-agent setups because each agent in the workflow consumes its own premium request allocation. Plan accordingly: a team running 50 Opus tasks per week will burn 150 premium requests every week before any other Copilot usage.</p>
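<p>The arithmetic behind these numbers is easy to script for your own workload. The sketch below uses only the multipliers stated above (1 for Sonnet 4.6, 3 for Opus 4.6); the function and variable names are hypothetical, and treating the 300-request Pro allowance as a single pool is a simplification for illustration.</p>

```python
# Premium request multipliers stated in the article. Codex multipliers are
# not given there, so Codex models are left out of this sketch.
PREMIUM_MULTIPLIER = {"Claude Sonnet 4.6": 1, "Claude Opus 4.6": 3}


def weekly_requests(sessions_per_week: dict[str, int]) -> int:
    """Total premium requests consumed by one week of agent sessions."""
    return sum(PREMIUM_MULTIPLIER[model] * count
               for model, count in sessions_per_week.items())


opus_week = weekly_requests({"Claude Opus 4.6": 50})
sonnet_week = weekly_requests({"Claude Sonnet 4.6": 50})

# Share of premium requests saved by running the same sessions on Sonnet.
savings = 1 - sonnet_week / opus_week

# Weeks a 300-request monthly allowance would last at the Opus burn rate.
weeks_of_opus = 300 / opus_week

print(opus_week, sonnet_week, round(savings, 2), weeks_of_opus)
# prints: 150 50 0.67 2.0
```

<p>Swap in your own session counts to see how quickly a given mix of Sonnet and Opus sessions would use up an allowance.</p>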
<h3 id="cost-optimization-strategies">Cost Optimization Strategies</h3>
<p>The most impactful cost optimization is defaulting to Claude Sonnet 4.6 instead of Opus for routine tasks: a 67% cost reduction per session (1 premium request instead of 3) for a SWE-bench gap of 1.2 percentage points. For Codex tasks, use GPT-5.2-Codex for well-defined automation and reserve GPT-5.4 for complex multi-file operations. On Business plans, set up Admin policies to require manager approval before agents can use premium models (Opus, GPT-5.4); this prevents developers from defaulting to the most expensive option out of habit. Track premium request consumption in the GitHub Copilot usage dashboard, which breaks down consumption by agent, model, and user. Consider batching small Codex tasks into a single subagent workflow rather than spawning separate agent sessions: you pay for one session while up to 8 subagents work in parallel.</p>
<h2 id="running-both-agents-together-multi-agent-workflows-in-agent-hq">Running Both Agents Together: Multi-Agent Workflows in Agent HQ</h2>
<p>Running Claude and Codex together in Agent HQ is the highest-leverage pattern for teams with complex feature work. The canonical workflow: assign Claude Sonnet 4.6 to the architectural analysis and core feature implementation, then assign Codex (GPT-5.4 with subagents) to generate test coverage for the new code in parallel. Because the tasks touch different files, Agent HQ allows concurrent execution without merge conflicts. Claude handles the code that requires deep reasoning; Codex handles the test scaffolding that benefits from parallel generation. When both PRs are ready, you merge the Claude PR first (it changes the source files), then review the Codex test PR against the updated code. This pattern reduces time-to-PR by 40–60% on large features compared to sequential single-agent workflows, based on early Agent HQ user reports.</p>
<h3 id="avoiding-merge-conflicts-in-multi-agent-workflows">Avoiding Merge Conflicts in Multi-Agent Workflows</h3>
<p>Agent HQ will warn you if two assigned tasks touch the same files, but it won&rsquo;t catch every potential conflict, especially when agents create new files with the same name. Before assigning multiple agents: (1) define clear file ownership in your task descriptions (&ldquo;Agent 1: modify src/api/, Agent 2: modify tests/api/&rdquo;), (2) use AGENTS.md to specify which directories the Codex agent should avoid, and (3) plan your merge order before both agents are in flight. Claude agents don&rsquo;t currently read AGENTS.md natively (it&rsquo;s an OpenAI/Codex open standard), so put Claude&rsquo;s constraints in the task description itself or in a CLAUDE.md file. Error rates in Agent HQ run 5–10% per run, so build in a review step before merging any multi-agent workflow output.</p>
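<p>File ownership is easy to sanity-check before you assign anything. The sketch below is a hypothetical pre-flight helper, not part of Agent HQ: it takes the directories you plan to give each agent and reports any pair where one directory equals or contains the other. It only compares the directory prefixes you declare, so it will not detect agents creating same-named files elsewhere.</p>

```python
from pathlib import PurePosixPath


def overlaps(a: str, b: str) -> bool:
    """True if one directory is the same as, or nested inside, the other."""
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    return pa == pb or pa in pb.parents or pb in pa.parents


def find_conflicts(ownership: dict[str, list[str]]) -> list[tuple[str, str, str, str]]:
    """Compare every pair of agents and list overlapping directories.

    `ownership` maps a (hypothetical) agent label to the directories
    that agent is allowed to modify.
    """
    conflicts = []
    agents = sorted(ownership)
    for i, first in enumerate(agents):
        for second in agents[i + 1:]:
            for dir_a in ownership[first]:
                for dir_b in ownership[second]:
                    if overlaps(dir_a, dir_b):
                        conflicts.append((first, dir_a, second, dir_b))
    return conflicts


# The split from the example task descriptions above: no overlap.
plan = {"claude": ["src/api/"], "codex": ["tests/api/"]}
print(find_conflicts(plan))  # prints: []
```

<p>A non-empty result means the task descriptions need tightening before both agents are in flight.</p>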
<h2 id="enterprise-setup-admin-policies-and-repository-configuration">Enterprise Setup: Admin Policies and Repository Configuration</h2>
<p>Enterprise and Business plan admins control which models are available to developers through GitHub Copilot&rsquo;s admin policy interface. By default, all available Claude and Codex models are enabled for users on Business and Enterprise plans. Admins can restrict model access to specific tiers, for example allowing only Claude Sonnet and GPT-5.2-Codex to prevent premium request overconsumption. Repository-level configuration is available via AGENTS.md for Codex (define test commands, excluded directories, and agent behavior) and via a CLAUDE.md file or task description conventions for Claude. For Enterprise deployments, GitHub provides a usage dashboard showing agent activity by model, user, and repository, which is essential for compliance and cost allocation. Data residency is a key consideration: Codex runs in OpenAI&rsquo;s cloud sandbox, and the Claude agent on GitHub routes requests through Anthropic&rsquo;s API, so code leaves your infrastructure in both cases. The Claude Code CLI keeps files and command execution on your machine, but it still sends prompt context to Anthropic for inference. Teams with strict data governance requirements should review both vendors&rsquo; data handling terms before enabling either agent.</p>
<h3 id="setting-up-agentsmd-for-codex">Setting Up AGENTS.md for Codex</h3>
<p>AGENTS.md is a configuration file at the root of your repository that tells Codex how to behave in your specific project. A minimal AGENTS.md covers: (1) how to run tests (<code>pytest</code>, <code>npm test</code>, <code>go test ./...</code>), (2) which directories to avoid (<code>/secrets</code>, <code>/migrations/legacy</code>), (3) which files are generated and should not be edited manually, and (4) any project-specific conventions (branch naming, commit message format). Thousands of open-source projects now ship AGENTS.md as part of their contribution setup. If you want Claude to follow similar constraints, put them in your task description or in a CLAUDE.md file in your repository root — Claude agents will read this file automatically when present.</p>
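<p>To make the four items concrete, here is a minimal sketch that renders an AGENTS.md from them. AGENTS.md is free-form Markdown, so the section headings, the <code>render_agents_md</code> helper, the generated-file path, and the convention strings are all illustrative assumptions rather than a required schema.</p>

```python
def render_agents_md(test_commands, avoid_dirs, generated_files, conventions):
    """Build AGENTS.md text with one section per item in the list above.

    The section titles are illustrative; AGENTS.md does not mandate them.
    """
    sections = [
        ("Running tests", [f"`{cmd}`" for cmd in test_commands]),
        ("Directories to avoid", [f"`{d}`" for d in avoid_dirs]),
        ("Generated files (do not edit by hand)", [f"`{g}`" for g in generated_files]),
        ("Conventions", list(conventions)),
    ]
    lines = ["# AGENTS.md", ""]
    for title, items in sections:
        lines.append(f"## {title}")
        lines.extend(f"- {item}" for item in items)
        lines.append("")
    return "\n".join(lines)


text = render_agents_md(
    test_commands=["pytest"],
    avoid_dirs=["/secrets", "/migrations/legacy"],
    generated_files=["src/generated/schema.py"],  # hypothetical path
    conventions=[
        "Branch names: feature/TICKET-ID",  # hypothetical convention
        "Commit messages: imperative mood, 72-character subject",  # hypothetical
    ],
)
print(text)
```

<p>Write the result to the repository root (for example with <code>pathlib.Path("AGENTS.md").write_text(text)</code>) and commit it like any other project file.</p>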
<h2 id="decision-framework-quick-reference-guide-for-model-selection">Decision Framework: Quick Reference Guide for Model Selection</h2>
<p>The optimal model selection for GitHub coding agents comes down to four variables: task complexity, data residency requirements, budget constraints, and whether parallelism is a primary goal. Use this framework at the start of each task: if the task requires understanding the full codebase before making changes and touches fewer than 20 files, start with Claude Sonnet 4.6. If the task requires generating many independent outputs (tests, docs, scripts) across dozens of files, use Codex with subagents. If budget is the primary constraint, default to Auto model selection for simple tasks and manual Sonnet selection for anything complex. If your organization has data residency requirements that prevent sending code to third-party cloud infrastructure, check vendor terms before using either agent: Codex runs in OpenAI&rsquo;s cloud sandbox, and the Claude agent on GitHub routes through Anthropic&rsquo;s API. Claude Opus is reserved for high-stakes architectural work where the 3x premium request cost is justified by the task&rsquo;s business impact.</p>
<h3 id="summary-decision-tree">Summary Decision Tree</h3>
<ol>
<li><strong>Does the task require understanding the whole codebase?</strong> → Claude Sonnet 4.6 (or Opus for large/complex repos)</li>
<li><strong>Does the task decompose into 3+ independent parallel subtasks?</strong> → Codex GPT-5.4 with subagents</li>
<li><strong>Is the task purely scripting/CLI automation?</strong> → Codex (Terminal-Bench 2.0 leader)</li>
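<li><strong>Do you have strict data residency requirements?</strong> → Review vendor data handling terms before choosing: Codex runs in OpenAI&rsquo;s cloud sandbox, and the GitHub Claude agent routes through Anthropic&rsquo;s API</li>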
<li><strong>Is the task simple and well-defined with clear requirements?</strong> → Auto model selection or GPT-5.2-Codex</li>
<li><strong>Is code quality the top priority and budget secondary?</strong> → Claude Opus 4.6</li>
</ol>
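<p>The decision tree can also be written as a short function. This is a sketch under stated assumptions: the <code>Task</code> fields and <code>pick_model</code> name are hypothetical, the questions are evaluated in the order listed (the list itself does not specify a priority), and the final fallback to Claude Sonnet 4.6 is an assumption. Question 4 (data residency) is left out because it depends on your organization&rsquo;s policies and vendor agreements rather than on the task itself.</p>

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task description; each field maps to one question."""
    needs_whole_codebase: bool = False
    large_or_complex_repo: bool = False
    independent_subtasks: int = 1
    cli_automation_only: bool = False
    simple_and_well_defined: bool = False
    quality_over_budget: bool = False


def pick_model(task: Task) -> str:
    """Walk the questions in listed order and return the first match."""
    if task.needs_whole_codebase:
        return "Claude Opus 4.6" if task.large_or_complex_repo else "Claude Sonnet 4.6"
    if task.independent_subtasks >= 3:
        return "Codex GPT-5.4 with subagents"
    if task.cli_automation_only:
        return "Codex"
    if task.simple_and_well_defined:
        return "Auto model selection or GPT-5.2-Codex"
    if task.quality_over_budget:
        return "Claude Opus 4.6"
    # Assumption: nothing matched, so use the article's general default.
    return "Claude Sonnet 4.6"


print(pick_model(Task(needs_whole_codebase=True)))
print(pick_model(Task(independent_subtasks=5)))
```

<p>Because the first matching question wins, reorder the checks if your team weighs, say, code quality above parallelism.</p>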
<hr>
<h2 id="faq">FAQ</h2>
<p><strong>Q: What models are available for GitHub coding agents in 2026?</strong>
Claude agents offer Claude Sonnet 4.6, Claude Opus 4.6, Claude Sonnet 4.5, and Claude Opus 4.5. Codex agents offer GPT-5.2-Codex, GPT-5.3-Codex, and GPT-5.4. GitHub Copilot&rsquo;s Auto model picks from GPT-4.1, GPT-5 mini, Claude Haiku 4.5, and Claude Sonnet 4.5. Model selection was made available for Claude and Codex agents on github.com on April 14, 2026.</p>
<p><strong>Q: How does GitHub&rsquo;s premium request pricing work for Claude vs Codex?</strong>
Claude Sonnet 4.6 costs 1 premium request per agent session; Claude Opus 4.6 costs 3 premium requests (3x multiplier). Codex model costs vary by tier. GitHub Copilot Pro includes 300 premium requests/month; Business and Enterprise plans have higher allocations. Multi-agent Agent HQ workflows consume premium requests for each agent independently.</p>
<p><strong>Q: Is it safe to send proprietary code to GitHub&rsquo;s Codex agent?</strong>
Codex runs in OpenAI&rsquo;s cloud sandbox, meaning your code is processed on OpenAI&rsquo;s infrastructure. If your organization has data residency or IP protection requirements, check your GitHub Enterprise agreement and OpenAI&rsquo;s data handling terms before using Codex. The Claude agent on GitHub routes through Anthropic&rsquo;s API, and the Claude Code CLI, although it runs on your machine, also sends prompt context to Anthropic for inference. Review your legal requirements before sending sensitive code to any cloud agent.</p>
<p><strong>Q: Can I run Claude and Codex on the same task in Agent HQ?</strong>
Yes, as long as the agents&rsquo; tasks don&rsquo;t touch the same files. GitHub Agent HQ allows you to assign different tasks within the same feature to different agents and compare PRs side by side. Assign Claude to core implementation and Codex to test generation, merge the implementation PR first, then review the test PR against it. Agent HQ warns about potential file conflicts before you commit to multi-agent execution.</p>
<p><strong>Q: When should I override GitHub&rsquo;s Auto model selection?</strong>
Override Auto when you need Claude&rsquo;s deep context reasoning (large refactoring, complex bug investigation), when you need Codex&rsquo;s parallel subagents for high-throughput tasks, or when Auto consistently picks expensive models for simple tasks. Auto is designed for general-purpose use and doesn&rsquo;t optimize for specialist use cases. For any task that feels like more than 30 minutes of developer work, manual model selection is worth the extra 10 seconds.</p>
]]></content:encoded></item></channel></rss>