Agent Testing

AWS Agent-EvalKit is an open-source toolkit (Apache 2.0, released June 11, 2026) that runs AI agent evaluation directly inside your coding assistant via slash commands. Instead of treating agent evaluation as a post-deployment activity, it brings a six-phase workflow (Plan, Data, Trace, Run Agent, Eval, Report) into Claude Code, Kiro CLI, or Kilo Code, combining code-based evaluators with LLM-as-judge scoring through Amazon Bedrock. I’ve been running evaluations against AI agents for the last two years, and the pattern I kept seeing was that teams either buy a managed eval platform or cobble together Python scripts and a prompt template. Agent-EvalKit splits the difference: it’s a CLI that reads your agent source code, generates test cases, instruments tracing, runs the trials, and recommends fixes down to the file level. In this tutorial, I’ll walk through installing it, running your first evaluation, and a real-world case study where it caught a hallucination problem that output-level testing missed entirely. ...

LangWatch is an open-source monitoring, evaluation, and optimization platform for LLM applications and AI agents. It provides tracing, real-time evaluation, agent simulation, and prompt management in a single unified system, with cloud plans starting at €59/month and free self-hosting with no feature gates.

What Is LangWatch? (The LLM Observability Platform Explained)

LangWatch is an open-source LLMOps platform that combines production monitoring, automated evaluation, agent simulation testing, and prompt optimization in a single unified system. It was founded to address the fragmented tooling problem facing AI teams, where developers typically need 3–5 separate tools for tracing, evals, prompt management, and cost control, and it consolidates all of these workflows under one roof. As of 2026, the platform has surpassed 3,000 GitHub stars and supports 10+ LLM providers and gateways, including OpenAI, Azure, AWS Bedrock, Google Gemini, Deepseek, Groq, MistralAI, VertexAI, and LiteLLM. The platform is built natively on OpenTelemetry, so enterprise teams can integrate it with existing observability stacks without vendor lock-in. The LLM observability market it operates in is expanding fast: from $1.97 billion in 2025, it is projected to reach $2.69 billion in 2026 at a 36.3% CAGR, and $9.26 billion by 2030. LangWatch positions itself as the platform for developers who want production-grade AI monitoring without stitching together half a dozen point solutions. ...

AWS Agent-EvalKit: Open-Source AI Agent Evaluation for Developers — Tutorial & Deep Dive

LangWatch Review 2026: LLM and Agent Application Monitoring Platform