Eval-Harness

The 2026 open-source agent eval harness market is undergoing a Cambrian explosion. In 2024–2025, the dominant tools focused on scoring LLM outputs by comparing a generated answer to a ground-truth label. This year’s crop evaluates the entire agent system: harness configuration, tool-use trajectory, orchestration topology, and failure recovery as a unified stack. I spent the last month digging into 11 open-source eval frameworks that emerged in the past 12 months. The key finding: framework choice matters as much as model choice. PawBench demonstrates this directly: identical models across different harnesses produce up to an 11.5-point spread on the same task set. If you’re still treating eval as “run a model, check the answer,” the tools below will change how you think about agent quality. ...
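To make the "spread" idea concrete, here is a minimal sketch of how a per-model harness spread could be computed: hold the model and task set fixed, vary only the harness, and take the gap between the best and worst score. The model names, harness names, and scores are invented placeholders, not PawBench data, and `harness_spread` is a hypothetical helper rather than part of any framework.

```python
from collections import defaultdict

# Hypothetical placeholder scores: (model, harness) -> score in points on one
# fixed task set. These numbers are invented for illustration only; they are
# NOT PawBench results.
scores = {
    ("model-a", "harness-x"): 62.0,
    ("model-a", "harness-y"): 66.0,
    ("model-a", "harness-z"): 70.0,
    ("model-b", "harness-x"): 55.0,
    ("model-b", "harness-y"): 58.0,
}


def harness_spread(scores):
    """Return, per model, the max-minus-min score across harnesses."""
    by_model = defaultdict(list)
    for (model, _harness), score in scores.items():
        by_model[model].append(score)
    # Spread is the gap between the best and worst harness for each model.
    return {model: max(vals) - min(vals) for model, vals in by_model.items()}


spreads = harness_spread(scores)
for model, spread in sorted(spreads.items()):
    print(f"{model}: {spread:.1f}-point spread across harnesses")
```

A large spread for a single model means the harness, not the model, accounts for much of the score difference, which is why comparing models run under different harnesses can mislead.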