<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Eval-Harness on RockB</title><link>https://baeseokjae.github.io/tags/eval-harness/</link><description>Recent content in Eval-Harness on RockB</description><image><title>RockB</title><url>https://baeseokjae.github.io/images/og-default.png</url><link>https://baeseokjae.github.io/images/og-default.png</link></image><generator>Hugo</generator><language>en-us</language><lastBuildDate>Fri, 19 Jun 2026 12:00:00 +0000</lastBuildDate><atom:link href="https://baeseokjae.github.io/tags/eval-harness/index.xml" rel="self" type="application/rss+xml"/><item><title>Open Source Agent Eval Harness Comparison 2026</title><link>https://baeseokjae.github.io/posts/open-source-agent-eval-harness-comparison-2026/</link><pubDate>Fri, 19 Jun 2026 12:00:00 +0000</pubDate><guid>https://baeseokjae.github.io/posts/open-source-agent-eval-harness-comparison-2026/</guid><description>A 2026 comparison of 11 open-source agent evaluation harnesses: MASEval, ClawBench, reaatech, ProofAgent, OpenAgentBench, Kensa, PawBench, CUBE, evalh, ...</description><content:encoded><![CDATA[<p>The 2026 open-source agent eval harness market is undergoing a Cambrian explosion. Unlike 2024–2025 where the dominant tools focused on scoring LLM outputs — comparing a generated answer to a ground-truth label — this year&rsquo;s crop evaluates the entire agent system: harness configuration, tool-use trajectory, orchestration topology, and failure recovery as a unified stack. I spent the last month digging into 11 open-source eval frameworks that emerged in the past 12 months. The key finding: framework choice matters as much as model choice. PawBench demonstrates this directly — identical models across different harnesses produce up to an 11.5-point spread on the same task set. 
If you&rsquo;re still treating eval as &ldquo;run a model, check the answer,&rdquo; the tools below will change how you think about agent quality.</p>
<h2 id="the-2026-landscape-shift">The 2026 Landscape Shift</h2>
<p>Three things have changed since the early eval days of DeepEval, Ragas, and PromptFoo. First, trace-based scoring replaced final-answer checking as the architectural standard — tools score the <em>process</em> (tool selection order, recovery from errors, state management) not just the output. Second, reproducibility is recognized as the #1 unsolved problem: ClawBench found that 47% of benchmark variance is seed noise, which means a model that looks 10 points better might just be luckier with random seeds. Third, the market is fragmenting into focused niches — adversarial security harnesses (ProofAgent), coding-agent-specific evaluators (Kensa), TypeScript-native pipelines (reaatech), and YAML-driven regression gates (evalh). No single framework dominates; the right choice depends entirely on your architecture.</p>
<p>Below I cover the 11 tools grouped by what they do best.</p>
<h2 id="multi-agent-system-evaluation-maseval">Multi-Agent System Evaluation: MASEval</h2>
<p>MASEval (MIT, PyPI, arXiv 2603.08835) is the strongest option for evaluating full multi-agent systems. Unlike tools that treat agents as independent LLM calls, MASEval treats the entire orchestration topology as the unit of analysis — agent-to-agent communication patterns, error propagation across agents, and delegation decisions.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> maseval <span style="color:#f92672">import</span> SystemEvaluator, AgentTrace
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> maseval.metrics <span style="color:#f92672">import</span> (
</span></span><span style="display:flex;"><span>    TopologyCoherence, DelegationOptimality, RecoveryEfficiency
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>evaluator <span style="color:#f92672">=</span> SystemEvaluator(
</span></span><span style="display:flex;"><span>    adapters<span style="color:#f92672">=</span>[<span style="color:#e6db74">&#34;autogen&#34;</span>, <span style="color:#e6db74">&#34;langchain&#34;</span>, <span style="color:#e6db74">&#34;custom&#34;</span>],
</span></span><span style="display:flex;"><span>    metrics<span style="color:#f92672">=</span>[TopologyCoherence(), DelegationOptimality(), RecoveryEfficiency()]
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Evaluate a recorded multi-agent session from its trace</span>
</span></span><span style="display:flex;"><span>trace <span style="color:#f92672">=</span> AgentTrace<span style="color:#f92672">.</span>load(<span style="color:#e6db74">&#34;session_001.jsonl&#34;</span>)
</span></span><span style="display:flex;"><span>result <span style="color:#f92672">=</span> evaluator<span style="color:#f92672">.</span>evaluate(trace)
</span></span><span style="display:flex;"><span>print(result<span style="color:#f92672">.</span>report())
</span></span></code></pre></div><p>MASEval ships with adapters for AutoGen, LangChain, and custom frameworks, making it framework-agnostic. If you&rsquo;re running <a href="/posts/ai-harness-engineering-guide-2026/">multi-agent systems with LangGraph or CrewAI</a>, MASEval is the only harness in this comparison that scores how well your agents coordinate rather than how well each individual LLM call answers. The trade-off: no CI integration out of the box. You get detailed system-level reports, but you need to wire them into a pipeline yourself.</p>
<h2 id="reproducibility-first-benchmarking-clawbench">Reproducibility-First Benchmarking: ClawBench</h2>
<p>ClawBench (MIT, 180+ stars) approaches evaluation from the opposite direction of most tools. Rather than adding more metrics, it subtracts noise. Core v1 ships with 19 signal-curated tasks from a 40-task pool — dropping 21 that produce unreliable scores. Every run executes 3 mandatory trials with statistical significance tests, and the framework quantifies exactly how much of your score variance comes from seed noise versus actual model capability.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>clawbench run --model claude-opus-4.7 <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --harness openclaw <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --trials <span style="color:#ae81ff">3</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --variance-report full
</span></span></code></pre></div><p>The variance decomposition is ClawBench&rsquo;s killer feature. I ran a regression suite where GPT-5.5 appeared 5 points ahead of DeepSeek V4 on a single run, but ClawBench&rsquo;s bootstrap confidence intervals showed the difference was within noise — saving me from a false-positive model swap. For teams running leaderboard-style evaluations where a wrong read could derail model selection, ClawBench&rsquo;s statistical rigor is worth the narrower task scope. It&rsquo;s not for CI regression gates; it&rsquo;s for high-stakes model and harness decisions.</p>
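<p>To make the &ldquo;within noise&rdquo; check concrete, here is a minimal percentile-bootstrap sketch in plain Python. It is not ClawBench&rsquo;s implementation: the function names, the resample count, and the 95% interval are my assumptions, and it only illustrates the general technique of asking whether zero lies inside the confidence interval for the difference in mean scores.</p>

```python
import random
import statistics


def bootstrap_diff_ci(scores_a, scores_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a) - mean(scores_b).

    Hypothetical helper for illustration; not part of any harness API.
    """
    rng = random.Random(seed)  # fixed seed so the check itself is reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each model's trial scores with replacement.
        resample_a = rng.choices(scores_a, k=len(scores_a))
        resample_b = rng.choices(scores_b, k=len(scores_b))
        diffs.append(statistics.fmean(resample_a) - statistics.fmean(resample_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def is_within_noise(scores_a, scores_b, **kwargs):
    """True when the CI for the score difference contains zero."""
    lower, upper = bootstrap_diff_ci(scores_a, scores_b, **kwargs)
    return lower <= 0.0 <= upper
```

<p>With only three trials per model the interval is coarse, so treat a result like this as a sanity check on a single-run gap rather than a formal significance test.</p>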
<h2 id="typescript-native-production-eval-reaatechagent-eval-harness">TypeScript-Native Production Eval: reaatech/agent-eval-harness</h2>
<p>Every other harness in this comparison is Python. reaatech/agent-eval-harness (TypeScript, pnpm monorepo) is the first serious TypeScript-native eval framework. It ships 12 composable packages covering trajectory analysis, tool-use correctness, cost-per-task, latency budgets, golden trajectory curation, and an MCP server for LLM client integration.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-typescript" data-lang="typescript"><span style="display:flex;"><span><span style="color:#66d9ef">import</span> { <span style="color:#a6e22e">Suite</span>, <span style="color:#a6e22e">TrajectoryMetric</span>, <span style="color:#a6e22e">CostMetric</span>, <span style="color:#a6e22e">Gate</span> } <span style="color:#66d9ef">from</span> <span style="color:#e6db74">&#39;@agent-eval/harness&#39;</span>;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">const</span> <span style="color:#a6e22e">suite</span> <span style="color:#f92672">=</span> <span style="color:#66d9ef">new</span> <span style="color:#a6e22e">Suite</span>({
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">metrics</span><span style="color:#f92672">:</span> [<span style="color:#66d9ef">new</span> <span style="color:#a6e22e">TrajectoryMetric</span>(), <span style="color:#66d9ef">new</span> <span style="color:#a6e22e">CostMetric</span>()],
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">gate</span>: <span style="color:#66d9ef">new</span> <span style="color:#a6e22e">Gate</span>({ <span style="color:#a6e22e">maxCostPerTask</span>: <span style="color:#66d9ef">0.05</span>, <span style="color:#a6e22e">maxLatencyMs</span>: <span style="color:#66d9ef">15000</span> })
</span></span><span style="display:flex;"><span>});
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">const</span> <span style="color:#a6e22e">report</span> <span style="color:#f92672">=</span> <span style="color:#66d9ef">await</span> <span style="color:#a6e22e">suite</span>.<span style="color:#a6e22e">run</span>(<span style="color:#a6e22e">trajectories</span>);
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">console</span>.<span style="color:#a6e22e">log</span>(<span style="color:#a6e22e">report</span>.<span style="color:#a6e22e">toJUnit</span>());
</span></span></code></pre></div><p>The JUnit/XML output and GitHub Annotations support mean it integrates into CI the same way Playwright or Vitest does — failing a PR when an agent&rsquo;s trajectory deviates from the golden path or when cost-per-task exceeds budget. If your stack is TypeScript end-to-end (agents, API layer, CI), this harness avoids the Python impedance mismatch. If you&rsquo;re in Python, skip it — the ecosystem is still young with less community adoption than the Python tools.</p>
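<p>For readers on the Python side, the idea of gating on golden-path deviation plus a cost and latency budget can be sketched with the standard library alone. This is an assumption-laden illustration, not the reaatech API: the function names and the 0.8 similarity threshold are hypothetical, and the cost and latency limits simply mirror the values in the TypeScript snippet above.</p>

```python
from difflib import SequenceMatcher


def trajectory_similarity(golden, observed):
    """Order-sensitive similarity in [0, 1] between two tool-call sequences."""
    return SequenceMatcher(a=golden, b=observed, autojunk=False).ratio()


def gate(golden, observed, cost_usd, latency_ms,
         min_similarity=0.8, max_cost_usd=0.05, max_latency_ms=15_000):
    """Return a list of human-readable failures; an empty list means pass.

    All thresholds are illustrative defaults, not values from any harness.
    """
    failures = []
    similarity = trajectory_similarity(golden, observed)
    if similarity < min_similarity:
        failures.append(f"trajectory similarity {similarity:.2f} below {min_similarity}")
    if cost_usd > max_cost_usd:
        failures.append(f"cost ${cost_usd:.3f} exceeds ${max_cost_usd:.3f}")
    if latency_ms > max_latency_ms:
        failures.append(f"latency {latency_ms} ms exceeds {max_latency_ms} ms")
    return failures
```

<p>A CI job would fail the build whenever <code>gate(...)</code> returns a non-empty list, which is the same contract a JUnit report encodes.</p>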
<h2 id="adversarial-security-eval-proofagent">Adversarial Security Eval: ProofAgent</h2>
<p>ProofAgent is positioned as &ldquo;pytest for AI agents&rdquo; and lives up to that claim. It ships 183 bundled adversarial traps across 11 families, including GDPR, HIPAA, PCI, SOX, prompt injection, jailbreak, PII leakage, and malware-generation solicitation. The scoring uses a 3-Harness-Juror Delphi consensus: three independent evaluators score the agent&rsquo;s response, and if they disagree, the system re-votes until consensus is reached.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> proofagent <span style="color:#f92672">import</span> Harness, JurorPool
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>harness <span style="color:#f92672">=</span> Harness(
</span></span><span style="display:flex;"><span>    traps<span style="color:#f92672">=</span>[<span style="color:#e6db74">&#34;prompt_injection&#34;</span>, <span style="color:#e6db74">&#34;pii_leakage&#34;</span>, <span style="color:#e6db74">&#34;hipaa_boundary&#34;</span>],
</span></span><span style="display:flex;"><span>    jurors<span style="color:#f92672">=</span>JurorPool(size<span style="color:#f92672">=</span><span style="color:#ae81ff">3</span>, consensus<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;delphi&#34;</span>)
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>results <span style="color:#f92672">=</span> harness<span style="color:#f92672">.</span>run_against(my_agent_endpoint)
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">assert</span> results<span style="color:#f92672">.</span>overall_pass_rate <span style="color:#f92672">&gt;</span> <span style="color:#ae81ff">0.95</span>
</span></span></code></pre></div><p>The assertion-style output makes it natural to gate deployments on security eval scores. If you&rsquo;re shipping an agent that handles user data, running ProofAgent as a pre-deploy gate should be table stakes. If your agent is purely internal tooling with no user-facing prompts, the trap coverage is overkill — use a simpler approach.</p>
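<p>The re-vote loop behind a Delphi-style consensus is easy to picture in code. The sketch below is my own reading of the idea, not ProofAgent&rsquo;s implementation: jurors are modeled as plain callables that see the previous round&rsquo;s votes, and the <code>max_rounds</code> cap with a majority fallback is an assumption I added so the loop always terminates.</p>

```python
from collections import Counter


def delphi_consensus(jurors, response, max_rounds=5):
    """Re-poll jurors until they agree; return (verdict, rounds_used).

    Each juror is a callable juror(response, previous_votes) -> verdict.
    The round cap and majority fallback are assumptions for this sketch.
    """
    previous_votes = None
    for round_number in range(1, max_rounds + 1):
        # Every juror sees the same anonymous summary of the prior round.
        votes = [juror(response, previous_votes) for juror in jurors]
        if len(set(votes)) == 1:
            return votes[0], round_number
        previous_votes = votes
    # No unanimity within the cap: fall back to the most common verdict.
    majority_verdict, _ = Counter(previous_votes).most_common(1)[0]
    return majority_verdict, max_rounds
```

<p>In a real harness each juror would be an LLM judge or rule-based evaluator; feeding the prior votes back in is what distinguishes this from a one-shot majority vote.</p>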
<h2 id="stateful-control-system-verification-openagentbench">Stateful Control System Verification: OpenAgentBench</h2>
<p>OpenAgentBench evaluates agents as stateful control systems, not transcript generators. It scores tool-choice optimality (did the agent pick the right tool or brute-force with the wrong one?), privilege safety (did it try to escalate permissions?), memory hygiene (did it leak context across sessions?), and recovery behavior (how did it handle tool failures?).</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> openagentbench <span style="color:#f92672">import</span> AgentTester
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> openagentbench.checks <span style="color:#f92672">import</span> (
</span></span><span style="display:flex;"><span>    ToolOptimality, PrivilegeSafety, RecoveryGrace
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>tester <span style="color:#f92672">=</span> AgentTester(checks<span style="color:#f92672">=</span>[ToolOptimality(), PrivilegeSafety(), RecoveryGrace()])
</span></span><span style="display:flex;"><span>report <span style="color:#f92672">=</span> tester<span style="color:#f92672">.</span>run(<span style="color:#e6db74">&#34;my_agent.py&#34;</span>, scenarios<span style="color:#f92672">=</span>[<span style="color:#e6db74">&#34;billing_flow&#34;</span>])
</span></span></code></pre></div><p>The Agent Chaos Lab injects failures mid-session — revoking a tool mid-call, dropping a network request, corrupting a memory cell — and measures whether the agent recovers gracefully. This is the closest thing to chaos engineering for agents. For production-critical agents that handle money, credentials, or user data, OpenAgentBench&rsquo;s state verification catches failure modes that no output-scoring metric ever would.</p>
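<p>Mid-session failure injection can be approximated in a few lines if you want to try the idea before adopting a full harness. Everything named here (<code>ToolRevoked</code>, <code>FaultInjector</code>, <code>run_with_retry</code>) is hypothetical and unrelated to OpenAgentBench&rsquo;s actual API; the sketch wraps a tool so that chosen calls fail, then records whether a toy retry loop recovers.</p>

```python
class ToolRevoked(Exception):
    """Raised by the injector to simulate a tool disappearing mid-call."""


class FaultInjector:
    """Wrap a tool callable and fail on specific call numbers (1-indexed)."""

    def __init__(self, tool, fail_on_calls):
        self.tool = tool
        self.fail_on_calls = set(fail_on_calls)
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.calls in self.fail_on_calls:
            raise ToolRevoked(f"injected failure on call {self.calls}")
        return self.tool(*args, **kwargs)


def run_with_retry(tool, argument, max_attempts=3):
    """Toy agent step: retry the tool and report whether it recovered."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = tool(argument)
        except ToolRevoked:
            continue  # a real agent might switch tools or back off here
        return {"result": result, "attempts": attempt, "recovered": attempt > 1}
    return {"result": None, "attempts": max_attempts, "recovered": False}
```

<p>The useful signal is the <code>recovered</code> flag and the attempt count, since an output-only metric would score a first-try success and a recovered failure identically.</p>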
<h2 id="coding-agent-eval-kensa">Coding Agent Eval: Kensa</h2>
<p>Kensa (MIT, free) is purpose-built for evaluating coding agents: Claude Code, OpenAI Codex, Cursor, OpenCode, Gemini CLI. Its workflow is capture, then generate, then eval: you record real agent sessions via auto-instrumentation, an LLM synthesizes new test scenarios from those captures, and then you run the eval.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>kensa capture --agent claude-code --output my_traces.jsonl
</span></span><span style="display:flex;"><span>kensa generate scenarios --from my_traces.jsonl --output scenarios.jsonl
</span></span><span style="display:flex;"><span>kensa eval --scenarios scenarios.jsonl --models claude-opus-4.7,gpt-5.5
</span></span></code></pre></div><p>The auto-instrumentation uses OpenTelemetry and requires no changes to the agent — it hooks into the process environment. Kensa&rsquo;s 5 built-in skills (audit, generate scenarios, generate judges, validate judge, diagnose errors) make it useful for teams building custom eval pipelines for coding agents. It&rsquo;s the most practical tool I&rsquo;ve found for running a regression suite against new coding agent versions. For evaluating general-purpose non-coding agents, look elsewhere — Kensa&rsquo;s scenario generation is heavily optimized for code-editing tasks.</p>
<h2 id="model-x-harness-co-evaluation-pawbench-and-cube">Model x Harness Co-Evaluation: PawBench and CUBE</h2>
<p>PawBench (Apache 2.0, 57 stars) is unique in that it scores the model <em>and</em> the harness together. It runs 150 tasks from 6 source benchmarks across 9 models and 3 harnesses, then produces a matrix showing harness gaps even when the model is fixed. For a single model the spread between harnesses reaches 11.5 points; the harness averages sit closer together, 5.6 points apart: QwenPaw averages 74.9, OpenClaw 72.9, and Hermes 69.3. If you&rsquo;re debating which harness to standardize on, PawBench gives you data, not opinions.</p>
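<p>If you have your own model-by-harness score matrix, the two summary numbers discussed here (per-model spread and per-harness mean) take only a few lines to compute. The helper names and the nested-dict layout are my assumptions, not PawBench&rsquo;s output format.</p>

```python
import statistics


def harness_spread(scores):
    """Map each model to (max - min) of its scores across harnesses.

    `scores` is assumed to look like {model: {harness: score}}.
    """
    return {
        model: max(by_harness.values()) - min(by_harness.values())
        for model, by_harness in scores.items()
    }


def harness_means(scores):
    """Average each harness over all models that report a score for it."""
    collected = {}
    for by_harness in scores.values():
        for harness, score in by_harness.items():
            collected.setdefault(harness, []).append(score)
    return {harness: statistics.fmean(values) for harness, values in collected.items()}
```

<p>The largest value in <code>harness_spread</code> is the &ldquo;up to&rdquo; figure, while the gap between entries of <code>harness_means</code> is the averaged one, which is why the two can differ substantially.</p>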
<p>CUBE Harness (AI Alliance) takes the opposite approach: it standardizes the evaluation <em>protocol</em> so harness choice becomes irrelevant. Based on the CUBE Standard benchmark protocol, it uses Ray for parallel episode execution and outputs RL-ready trajectories with full LLM call logging. Still alpha-stage, but if CUBE gains adoption, it could consolidate the fragmented harness landscape. Watch this space rather than adopting it today.</p>
<h2 id="config-driven-regression-evalh">Config-Driven Regression: evalh</h2>
<p>evalh is the most boring tool in this comparison, and that&rsquo;s its strength. One YAML config drives everything — metrics, variants (N x M cases for A/B testing), executor backends (6 adapter families), and observability integrations (8 backends).</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#f92672">suite</span>: <span style="color:#ae81ff">billing-agent-regression</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">variants</span>:
</span></span><span style="display:flex;"><span>  - <span style="color:#f92672">model</span>: <span style="color:#ae81ff">claude-opus-4.7</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">config</span>: <span style="color:#ae81ff">production.toml</span>
</span></span><span style="display:flex;"><span>  - <span style="color:#f92672">model</span>: <span style="color:#ae81ff">gpt-5.5</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">config</span>: <span style="color:#ae81ff">experimental.toml</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">metrics</span>:
</span></span><span style="display:flex;"><span>  - <span style="color:#f92672">text_checks</span>: [<span style="color:#ae81ff">contains_invoice_total, no_pii]</span>
</span></span><span style="display:flex;"><span>  - <span style="color:#f92672">llm_judge</span>: { <span style="color:#f92672">prompt</span>: <span style="color:#ae81ff">judge_billing.txt }</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">gate</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">drift_detection</span>: <span style="color:#66d9ef">true</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">baseline</span>: <span style="color:#ae81ff">production-2026-06-01</span>
</span></span></code></pre></div><p>I use evalh for my day-to-day CI regression suite because it takes 10 minutes to configure a new eval — no boilerplate, just YAML. The drift detection mode promotes a baseline, then flags any regression across variants. For teams that want eval coverage without a multi-week integration project, evalh is the fastest path.</p>
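<p>Conceptually, a drift gate compares each metric against a promoted baseline and flags anything that dropped by more than a tolerance. The sketch below shows that comparison in isolation; it is not evalh&rsquo;s code, and the function name, the 0.02 tolerance, and the assumption that higher scores are better are all mine.</p>

```python
def detect_drift(baseline, candidate, tolerance=0.02):
    """Return {metric: (baseline_score, candidate_score)} for regressions.

    Assumes higher is better. A metric missing from the candidate run is
    treated as a regression so that silently dropped checks fail the gate.
    """
    regressions = {}
    for metric, baseline_score in baseline.items():
        candidate_score = candidate.get(metric)
        if candidate_score is None or baseline_score - candidate_score > tolerance:
            regressions[metric] = (baseline_score, candidate_score)
    return regressions
```

<p>Running this once per variant against the same baseline gives the &ldquo;flag any regression across variants&rdquo; behavior described above; the tolerance is what keeps ordinary run-to-run noise from failing every build.</p>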
<h2 id="lightweight-and-auto-improving-siddharth-1001-and-harness-bench">Lightweight and Auto-Improving: Siddharth-1001 and Harness Bench</h2>
<p>Siddharth-1001/agent-eval-harness (MIT) covers 4 metric dimensions (tool success, hallucination, latency, cost) with adapters for LangGraph, CrewAI, OpenAI Agents SDK, and Anthropic. The local dashboard and HTML export for side-by-side comparison make it useful for hacking on eval configurations before investing in production infrastructure.</p>
<p>Harness Bench takes this one step further with &ldquo;Improvers&rdquo; — automated research loops that analyze eval failures and propose harness mutations, similar to how DSPy optimizes prompts. You point it at a failing eval run, and it suggests changes to the harness configuration. Novel concept, early stage — the generated mutations still require human review before deployment.</p>
<h2 id="a-staged-ci-framework-for-2026">A Staged CI Framework for 2026</h2>
<p>No single harness covers all four eval stages. Here&rsquo;s what I run in practice:</p>
<table>
  <thead>
      <tr>
          <th>Stage</th>
          <th>Tool</th>
          <th>What It Checks</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Unit eval</td>
          <td>evalh + Siddharth-1001</td>
          <td>Per-task correctness, tool accuracy, latency</td>
      </tr>
      <tr>
          <td>Integration eval</td>
          <td>MASEval + OpenAgentBench</td>
          <td>Multi-agent coordination, state recovery</td>
      </tr>
      <tr>
          <td>Regression gate</td>
          <td>evalh (drift detection)</td>
          <td>Score regression from baseline</td>
      </tr>
      <tr>
          <td>Production security</td>
          <td>ProofAgent</td>
          <td>Adversarial resilience, compliance boundary</td>
      </tr>
  </tbody>
</table>
<p><a href="/posts/ai-harness-engineering-guide-2026/">AI Harness Engineering</a> covers the orchestration layer connecting these stages. For understanding the benchmarks these harnesses run against, see the <a href="/posts/llm-benchmarks-guide-2026/">LLM Benchmarks Guide</a> and <a href="/posts/swe-bench-coding-benchmarks-guide-2026/">SWE-bench Guide</a>.</p>
<h2 id="decision-matrix">Decision Matrix</h2>
<table>
  <thead>
      <tr>
          <th>Your Need</th>
          <th>Best Pick</th>
          <th>Runner-Up</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Multi-agent system eval</td>
          <td>MASEval</td>
          <td>OpenAgentBench</td>
      </tr>
      <tr>
          <td>High-stakes model selection</td>
          <td>ClawBench</td>
          <td>PawBench</td>
      </tr>
      <tr>
          <td>TypeScript CI pipeline</td>
          <td>reaatech/agent-eval-harness</td>
          <td>—</td>
      </tr>
      <tr>
          <td>Security/compliance gating</td>
          <td>ProofAgent</td>
          <td>—</td>
      </tr>
      <tr>
          <td>Coding agent regression</td>
          <td>Kensa</td>
          <td>evalh</td>
      </tr>
      <tr>
          <td>Stateful failure injection</td>
          <td>OpenAgentBench</td>
          <td>MASEval</td>
      </tr>
      <tr>
          <td>Quick YAML-driven regression</td>
          <td>evalh</td>
          <td>Siddharth-1001</td>
      </tr>
      <tr>
          <td>Harness benchmarking</td>
          <td>PawBench</td>
          <td>CUBE (alpha)</td>
      </tr>
      <tr>
          <td>Auto-improving harnesses</td>
          <td>Harness Bench</td>
          <td>—</td>
      </tr>
  </tbody>
</table>
<p>The fragmentation in this space is real — there isn&rsquo;t one framework to rule them all, and I don&rsquo;t expect one to emerge in 2026. Pick the tool that matches your eval stage and architecture, chain them together with a CI pipeline, and treat eval infrastructure as a first-class investment alongside model selection. The teams winning at production AI agents aren&rsquo;t the ones with better models; they&rsquo;re the ones that can measure whether their agents actually work.</p>
]]></content:encoded></item></channel></rss>