LLM Testing

AWS Agent-EvalKit Developer Tutorial 2026

AWS Agent-EvalKit: Open-Source AI Agent Evaluation for Developers — Tutorial & Deep Dive

AWS Agent-EvalKit is an open-source toolkit (Apache 2.0, released June 11, 2026) that runs AI agent evaluation directly inside your coding assistant via slash commands. Instead of treating agent evaluation as a post-deployment activity, it brings a six-phase workflow — Plan, Data, Trace, Run Agent, Eval, Report — into Claude Code, Kiro CLI, or Kilo Code, combining code-based evaluators with LLM-as-judge scoring through Amazon Bedrock. I’ve been running evaluations against AI agents for the last two years, and the pattern I kept seeing was: teams either buy a managed eval platform or cobble together Python scripts and a prompt template. Agent-EvalKit splits the difference — it’s a CLI that reads your agent source code, generates test cases, instruments tracing, runs the trials, and recommends fixes with file-level accuracy. In this tutorial, I’ll walk through installing it, running your first evaluation, and the real-world case study where it caught a hallucination problem that output-level testing missed entirely. ...

Open Source Agent Eval Harness Comparison 2026

The 2026 open-source agent eval harness market is undergoing a Cambrian explosion. Unlike 2024–2025, when the dominant tools focused on scoring LLM outputs (comparing a generated answer to a ground-truth label), this year’s crop evaluates the entire agent system: harness configuration, tool-use trajectory, orchestration topology, and failure recovery as a unified stack. I spent the last month digging into 11 open-source eval frameworks that emerged in the past 12 months. The key finding: framework choice matters as much as model choice. PawBench demonstrates this directly: identical models across different harnesses produce up to an 11.5-point spread on the same task set. If you’re still treating eval as “run a model, check the answer,” the tools below will change how you think about agent quality. ...

AI Agent Testing Guide 2026: Practical Evaluation Framework for Multi-Step Agents

AI agent testing in 2026 requires a fundamentally different approach from traditional software QA: because agents plan, call tools, and adapt across multiple steps, you must evaluate the entire decision trajectory, not just the final output. This guide walks through the complete evaluation stack, from golden dataset construction to CI/CD deployment gates.

Why Traditional Software Testing Breaks for Multi-Step AI Agents

Traditional software testing assumes deterministic, predictable behavior: given input X, the function reliably returns Y. Multi-step AI agents violate this assumption at every layer. An agent doesn’t just map inputs to outputs: it perceives context, selects tools, interprets intermediate results, adjusts its plan, and eventually produces an answer through a sequence of decisions that can vary on every run. As of 2026, 79% of organizations have adopted AI agents to some extent, and 57% already have agents in production (Multimodal.dev). Yet over 40% of agentic AI projects are at risk of cancellation by 2027 if governance, observability, and ROI clarity are not established (Gartner). The root cause is almost always testing inadequacy: teams apply unit-test thinking to systems that require trajectory evaluation. A unit test catches a wrong return value; what it cannot catch is an agent that reaches the right answer through a broken series of tool calls that would fail at scale or under edge-case inputs. ...
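The difference between an output-level check and a trajectory-level check can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the guide or from any particular framework: `ToolCall`, `final_answer_passes`, `trajectory_passes`, and the tool names in the sample run are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One recorded step of an agent run (hypothetical trace format)."""
    name: str
    ok: bool  # True if the tool call completed without error


def final_answer_passes(answer: str, expected: str) -> bool:
    """Output-level check: compares only the final answer."""
    return answer.strip() == expected.strip()


def trajectory_passes(calls: list[ToolCall], required_order: list[str]) -> bool:
    """Trajectory-level check: every call must succeed, and the required
    tools must appear in the given order (other calls may sit in between)."""
    if any(not call.ok for call in calls):
        return False
    # Membership tests on an iterator consume it, so this verifies that
    # required_order is an in-order subsequence of the recorded tool names.
    names = iter(call.name for call in calls)
    return all(step in names for step in required_order)


# Hypothetical run: the final answer is correct, but one tool call failed.
run = [
    ToolCall("search_orders", ok=True),
    ToolCall("get_refund_policy", ok=False),
    ToolCall("issue_refund", ok=True),
]
required = ["search_orders", "get_refund_policy", "issue_refund"]

output_ok = final_answer_passes("Refund issued", "Refund issued")
trajectory_ok = trajectory_passes(run, required)
print(f"output check: {output_ok}, trajectory check: {trajectory_ok}")
```

In this made-up run the output check passes while the trajectory check fails, which is the case a unit-test mindset misses. A real harness would likely score trajectories more leniently (for example, allowing retries), so treat the strict all-calls-succeed rule as one possible design choice.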

OpenAI Acquires PromptFoo: What It Means for AI Security Testing in 2026

OpenAI acquiring PromptFoo is not a talent grab — it is a strategic acknowledgment that AI security testing is no longer optional infrastructure. With 93% of organizations now shipping AI-generated code and only 12% applying equivalent security standards, the attack surface is enormous and growing. PromptFoo was the most mature open-source tool purpose-built for LLM red-teaming, and OpenAI buying it means the company is betting that security evaluation needs to be a first-class part of the developer workflow, not an afterthought bolted on by a third-party CLI. ...