<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Pytest on RockB</title><link>https://baeseokjae.github.io/tags/pytest/</link><description>Recent content in Pytest on RockB</description><image><title>RockB</title><url>https://baeseokjae.github.io/images/og-default.png</url><link>https://baeseokjae.github.io/images/og-default.png</link></image><generator>Hugo</generator><language>en-us</language><lastBuildDate>Tue, 12 May 2026 21:03:44 +0000</lastBuildDate><atom:link href="https://baeseokjae.github.io/tags/pytest/index.xml" rel="self" type="application/rss+xml"/><item><title>DeepEval Tutorial 2026: Pytest-Native LLM Evaluation for Production AI</title><link>https://baeseokjae.github.io/posts/deepeval-tutorial-2026/</link><pubDate>Tue, 12 May 2026 21:03:44 +0000</pubDate><guid>https://baeseokjae.github.io/posts/deepeval-tutorial-2026/</guid><description>Step-by-step DeepEval tutorial covering pytest-native LLM testing, RAG metrics, G-Eval, agent evaluation, and CI/CD integration in 2026.</description><content:encoded><![CDATA[<p>DeepEval is an open-source, pytest-native framework for evaluating LLM outputs using 50+ research-backed metrics — no labeled data required for most production use cases. Install it with <code>pip install deepeval</code>, write test cases like Python unit tests, and run <code>deepeval test run</code> from the CLI to catch regressions before they reach users.</p>
<h2 id="what-is-deepeval-and-why-pytest-native-llm-evaluation-matters-in-2026">What Is DeepEval and Why Pytest-Native LLM Evaluation Matters in 2026</h2>
<p>DeepEval is an open-source LLM evaluation framework built by Confident AI that treats model quality testing the same way software engineers treat unit testing: write test cases in Python, run them from the CLI, and fail the build when outputs degrade. As of May 2026, DeepEval has 15,291 GitHub stars, 250+ contributors, and is used by 150,000+ developers running over 100 million daily evaluations — including more than 50% of Fortune 500 companies for LLM quality assurance. The Apache 2.0 license means no usage restrictions in commercial products.</p>
<p>The &ldquo;pytest-native&rdquo; design is the key differentiator from notebook-centric tools like RAGAS or Weights &amp; Biases Weave. Your evaluation suite lives in <code>tests/</code> alongside your application code. Every push triggers the same CI pipeline. When a model upgrade changes tone subtly enough to tank your faithfulness score, you catch it in the PR — not in a Monday incident review. In 2026, evaluation woven into CI/CD pipelines is table stakes for teams shipping production LLM features; DeepEval is the framework that makes that pattern feel natural to Python engineers already fluent in pytest.</p>
<h2 id="how-to-install-deepeval-and-configure-your-evaluation-environment">How to Install DeepEval and Configure Your Evaluation Environment</h2>
<p>DeepEval installs as a standard Python package and requires only an LLM API key to run most metrics out of the box. The evaluation judge defaults to GPT-4o but supports any OpenAI-compatible endpoint, Anthropic Claude, or a local Ollama model — making it viable in air-gapped deployments. The setup takes under five minutes for a working evaluation harness against a real application endpoint.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>pip install deepeval
</span></span></code></pre></div><p>Set your judge model credentials:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>export OPENAI_API_KEY<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;sk-...&#34;</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># or for Claude-based judgment:</span>
</span></span><span style="display:flex;"><span>export ANTHROPIC_API_KEY<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;sk-ant-...&#34;</span>
</span></span></code></pre></div><p>Optionally log in to the Confident AI cloud dashboard; this step also creates a <code>.deepeval</code> config in the current directory:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>deepeval login  <span style="color:#75715e"># optional: connects to Confident AI cloud dashboard</span>
</span></span></code></pre></div><p>For teams using Anthropic Claude as the judge, configure the model in code:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> deepeval.models <span style="color:#f92672">import</span> DeepEvalBaseLLM
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> anthropic <span style="color:#f92672">import</span> Anthropic
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">ClaudeJudge</span>(DeepEvalBaseLLM):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> __init__(self):
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>client <span style="color:#f92672">=</span> Anthropic()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">load_model</span>(self):
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> self<span style="color:#f92672">.</span>client
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">generate</span>(self, prompt: str) <span style="color:#f92672">-&gt;</span> str:
</span></span><span style="display:flex;"><span>        response <span style="color:#f92672">=</span> self<span style="color:#f92672">.</span>client<span style="color:#f92672">.</span>messages<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>            model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;claude-sonnet-4-6&#34;</span>,
</span></span><span style="display:flex;"><span>            max_tokens<span style="color:#f92672">=</span><span style="color:#ae81ff">1024</span>,
</span></span><span style="display:flex;"><span>            messages<span style="color:#f92672">=</span>[{<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: prompt}]
</span></span><span style="display:flex;"><span>        )
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> response<span style="color:#f92672">.</span>content[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>text
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">async</span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">a_generate</span>(self, prompt: str) <span style="color:#f92672">-&gt;</span> str:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> self<span style="color:#f92672">.</span>generate(prompt)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">get_model_name</span>(self) <span style="color:#f92672">-&gt;</span> str:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> <span style="color:#e6db74">&#34;claude-sonnet-4-6&#34;</span>
</span></span></code></pre></div><h3 id="verifying-the-installation">Verifying the Installation</h3>
<p>Run <code>deepeval test run</code> on a trivial test to confirm the environment is wired up:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>deepeval test run tests/test_smoke.py -v
</span></span></code></pre></div><p>A passing smoke test confirms your judge model is reachable, metrics can score, and the CLI finds your test files. Add this as a pre-push hook or the first step in your CI pipeline.</p>
<h2 id="core-concepts-llmtestcase-evaluationdataset-and-metrics">Core Concepts: LLMTestCase, EvaluationDataset, and Metrics</h2>
<p>DeepEval&rsquo;s core data structure is <code>LLMTestCase</code>, a typed container that holds everything needed to evaluate one model interaction: the input, the model&rsquo;s actual output, an optional expected output, and any retrieval context for RAG pipelines. Metrics accept <code>LLMTestCase</code> instances directly, which means evaluation logic is decoupled from your application code — you can swap metrics without touching the test runner. <code>EvaluationDataset</code> wraps a list of test cases and integrates with <code>@pytest.mark.parametrize</code> to run the full suite as individual pytest items, each with its own pass/fail result in the test report. DeepEval recommends keeping metric counts per case to 2–3 generic system metrics plus 1–2 use-case-specific metrics — five metrics maximum per evaluation run — to avoid combinatorial noise in your signal.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> deepeval.test_case <span style="color:#f92672">import</span> LLMTestCase
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> deepeval.dataset <span style="color:#f92672">import</span> EvaluationDataset
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Single test case</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">case</span> <span style="color:#f92672">=</span> LLMTestCase(
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;What is the capital of France?&#34;</span>,
</span></span><span style="display:flex;"><span>    actual_output<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;The capital of France is Paris.&#34;</span>,
</span></span><span style="display:flex;"><span>    expected_output<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Paris&#34;</span>,
</span></span><span style="display:flex;"><span>    retrieval_context<span style="color:#f92672">=</span>[<span style="color:#e6db74">&#34;France is a country in Western Europe. Its capital city is Paris.&#34;</span>]
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Dataset for bulk evaluation</span>
</span></span><span style="display:flex;"><span>dataset <span style="color:#f92672">=</span> EvaluationDataset(test_cases<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>    LLMTestCase(
</span></span><span style="display:flex;"><span>        input<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Explain gradient descent&#34;</span>,
</span></span><span style="display:flex;"><span>        actual_output<span style="color:#f92672">=</span>my_llm(<span style="color:#e6db74">&#34;Explain gradient descent&#34;</span>),
</span></span><span style="display:flex;"><span>        expected_output<span style="color:#f92672">=</span><span style="color:#66d9ef">None</span>,  <span style="color:#75715e"># referenceless metrics don&#39;t need this</span>
</span></span><span style="display:flex;"><span>    ),
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># ... more cases</span>
</span></span><span style="display:flex;"><span>])
</span></span></code></pre></div><h3 id="running-assertions-with-assert_test">Running Assertions with assert_test()</h3>
<p><code>assert_test()</code> is the primitive that connects DeepEval metrics to pytest:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> pytest
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> deepeval <span style="color:#f92672">import</span> assert_test
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> deepeval.metrics <span style="color:#f92672">import</span> AnswerRelevancyMetric
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>metric <span style="color:#f92672">=</span> AnswerRelevancyMetric(threshold<span style="color:#f92672">=</span><span style="color:#ae81ff">0.7</span>, model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">@pytest.mark.parametrize</span>(<span style="color:#e6db74">&#34;test_case&#34;</span>, dataset)
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">test_answer_quality</span>(test_case):
</span></span><span style="display:flex;"><span>    assert_test(test_case, [metric])
</span></span></code></pre></div><p>When the metric score drops below <code>threshold</code>, <code>assert_test()</code> raises <code>AssertionError</code> with the exact score and reason — the same failure contract as any other pytest assertion.</p>
<h2 id="built-in-metrics-deep-dive--rag-hallucination-and-answer-relevancy">Built-in Metrics Deep Dive — RAG, Hallucination, and Answer Relevancy</h2>
<p>DeepEval ships 50+ research-backed metrics covering RAG pipelines, hallucination detection, safety, conversational systems, and agent behavior. The five core RAG metrics each catch a distinct failure mode: <code>AnswerRelevancyMetric</code> checks whether the response addresses the query, <code>FaithfulnessMetric</code> verifies claims are grounded in retrieved context, <code>ContextualPrecisionMetric</code> scores whether retrieved chunks are actually relevant, <code>ContextualRecallMetric</code> checks whether necessary information was retrieved at all, and <code>ContextualRelevancyMetric</code> evaluates overall context quality. For hallucination detection specifically, <code>HallucinationMetric</code> uses an LLM judge to identify statements in the output that contradict or go beyond the provided context — a critical gate before responses reach end users. In production RAG systems, running all five metrics on a sample of daily traffic catches retrieval drift that would otherwise surface only through user complaints weeks later.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> deepeval.metrics <span style="color:#f92672">import</span> (
</span></span><span style="display:flex;"><span>    AnswerRelevancyMetric,
</span></span><span style="display:flex;"><span>    FaithfulnessMetric,
</span></span><span style="display:flex;"><span>    ContextualPrecisionMetric,
</span></span><span style="display:flex;"><span>    ContextualRecallMetric,
</span></span><span style="display:flex;"><span>    HallucinationMetric,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># RAG pipeline evaluation</span>
</span></span><span style="display:flex;"><span>rag_case <span style="color:#f92672">=</span> LLMTestCase(
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;What causes transformer attention to scale quadratically?&#34;</span>,
</span></span><span style="display:flex;"><span>    actual_output<span style="color:#f92672">=</span>rag_pipeline<span style="color:#f92672">.</span>query(<span style="color:#e6db74">&#34;What causes transformer attention to scale quadratically?&#34;</span>),
</span></span><span style="display:flex;"><span>    retrieval_context<span style="color:#f92672">=</span>rag_pipeline<span style="color:#f92672">.</span>retrieve(<span style="color:#e6db74">&#34;What causes transformer attention to scale quadratically?&#34;</span>),
</span></span><span style="display:flex;"><span>    context<span style="color:#f92672">=</span>rag_pipeline<span style="color:#f92672">.</span>retrieve(<span style="color:#e6db74">&#34;What causes transformer attention to scale quadratically?&#34;</span>),  <span style="color:#75715e"># HallucinationMetric reads context, not retrieval_context</span>
</span></span><span style="display:flex;"><span>    expected_output<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Self-attention scores every token pair, so cost grows with the square of sequence length.&#34;</span>,  <span style="color:#75715e"># required by ContextualPrecisionMetric</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>metrics <span style="color:#f92672">=</span> [
</span></span><span style="display:flex;"><span>    AnswerRelevancyMetric(threshold<span style="color:#f92672">=</span><span style="color:#ae81ff">0.7</span>),
</span></span><span style="display:flex;"><span>    FaithfulnessMetric(threshold<span style="color:#f92672">=</span><span style="color:#ae81ff">0.8</span>),
</span></span><span style="display:flex;"><span>    ContextualPrecisionMetric(threshold<span style="color:#f92672">=</span><span style="color:#ae81ff">0.6</span>),
</span></span><span style="display:flex;"><span>    HallucinationMetric(threshold<span style="color:#f92672">=</span><span style="color:#ae81ff">0.1</span>),  <span style="color:#75715e"># lower is better; fail above 10% hallucination rate</span>
</span></span><span style="display:flex;"><span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">test_rag_quality</span>():
</span></span><span style="display:flex;"><span>    assert_test(rag_case, metrics)
</span></span></code></pre></div><h3 id="hallucination-detection-in-practice">Hallucination Detection in Practice</h3>
<p><code>HallucinationMetric</code> scores from 0.0 (no hallucination) to 1.0 (complete hallucination). The <code>threshold</code> parameter is an upper bound — set it to <code>0.1</code> to fail any case where more than 10% of the output contains ungrounded claims. Pair with <code>FaithfulnessMetric</code> (a lower bound measuring how well the output adheres to context) for full coverage of grounding failures.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>hallucination_metric <span style="color:#f92672">=</span> HallucinationMetric(
</span></span><span style="display:flex;"><span>    threshold<span style="color:#f92672">=</span><span style="color:#ae81ff">0.1</span>,
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>    include_reason<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,  <span style="color:#75715e"># surfaces which specific claims triggered the failure</span>
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><h2 id="g-eval-flexible-llm-as-a-judge-scoring-with-custom-criteria">G-Eval: Flexible LLM-as-a-Judge Scoring with Custom Criteria</h2>
<p>G-Eval is DeepEval&rsquo;s most versatile metric: define evaluation criteria in plain English, and DeepEval uses chain-of-thought (CoT) decomposition to automatically build a scoring rubric without hand-crafted examples or labeled datasets. Unlike rigid rule-based metrics, G-Eval handles subjective quality dimensions — tone, technical depth, safety posture, brand voice adherence — that don&rsquo;t map cleanly to predefined metrics. The approach follows the G-Eval research paper: the LLM judge first generates evaluation steps from your criteria description, then scores each case against those steps on a 0–1 continuous scale. In practice, this means a product team can add a &ldquo;sounds like a senior engineer, not a chatbot&rdquo; criterion in an afternoon without writing any scoring code. G-Eval&rsquo;s CoT reasoning also surfaces actionable failure explanations rather than just a score, making it practical for debugging regression batches during model upgrades.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> deepeval.metrics <span style="color:#f92672">import</span> GEval
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> deepeval.test_case <span style="color:#f92672">import</span> LLMTestCaseParams
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Custom criterion: evaluate technical depth for a developer tool</span>
</span></span><span style="display:flex;"><span>technical_depth_metric <span style="color:#f92672">=</span> GEval(
</span></span><span style="display:flex;"><span>    name<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;TechnicalDepth&#34;</span>,
</span></span><span style="display:flex;"><span>    criteria<span style="color:#f92672">=</span>(
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;Evaluate whether the response demonstrates genuine technical depth. &#34;</span>
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;A high-scoring response: uses precise terminology, provides concrete examples, &#34;</span>
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;acknowledges tradeoffs, and avoids vague generalities. &#34;</span>
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;A low-scoring response: uses buzzwords without explanation, makes unsupported claims, &#34;</span>
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;or oversimplifies complex topics.&#34;</span>
</span></span><span style="display:flex;"><span>    ),
</span></span><span style="display:flex;"><span>    evaluation_params<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>        LLMTestCaseParams<span style="color:#f92672">.</span>INPUT,
</span></span><span style="display:flex;"><span>        LLMTestCaseParams<span style="color:#f92672">.</span>ACTUAL_OUTPUT,
</span></span><span style="display:flex;"><span>    ],
</span></span><span style="display:flex;"><span>    threshold<span style="color:#f92672">=</span><span style="color:#ae81ff">0.6</span>,
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">case</span> <span style="color:#f92672">=</span> LLMTestCase(
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Explain why vector databases use HNSW indexing&#34;</span>,
</span></span><span style="display:flex;"><span>    actual_output<span style="color:#f92672">=</span>my_llm(<span style="color:#e6db74">&#34;Explain why vector databases use HNSW indexing&#34;</span>),
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">test_technical_depth</span>():
</span></span><span style="display:flex;"><span>    assert_test(<span style="color:#66d9ef">case</span>, [technical_depth_metric])
</span></span></code></pre></div><h3 id="when-to-use-g-eval-vs-built-in-metrics">When to Use G-Eval vs. Built-in Metrics</h3>
<p>Use built-in metrics (<code>FaithfulnessMetric</code>, <code>AnswerRelevancyMetric</code>) when the criterion maps directly to a researched definition — they&rsquo;re faster and more reproducible. Use G-Eval when you need to capture something domain-specific or stylistic that doesn&rsquo;t have a standard definition. G-Eval adds 1–3 seconds per evaluation due to CoT generation; run it on sampled batches in production rather than every request.</p>
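<p>One pragmatic way to keep G-Eval affordable is to put a deterministic sampler in front of the eval batch so each run scores the same subset. The helper below is our own sketch (the <code>sample_cases</code> name is not part of DeepEval); it pairs with DeepEval&rsquo;s bulk <code>evaluate()</code> entry point:</p>

```python
# Sketch: draw a fixed-size, reproducible batch of cases for G-Eval runs
import random


def sample_cases(cases, k=25, seed=0):
    """Return up to k cases, deterministically, so successive runs are comparable."""
    rng = random.Random(seed)
    cases = list(cases)
    return cases if len(cases) <= k else rng.sample(cases, k)


# Usage with the metric defined above (requires a judge-model API key):
# from deepeval import evaluate
# evaluate(test_cases=sample_cases(dataset.test_cases), metrics=[technical_depth_metric])
```

<p>Pinning the seed trades a little statistical coverage for run-to-run comparability, which matters when you are diffing G-Eval scores across model upgrades.</p>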
<h2 id="agent-evaluation--stepefficiency-tool-correctness-and-task-completion">Agent Evaluation — StepEfficiency, Tool Correctness, and Task Completion</h2>
<p>Agent evaluation is the fastest-growing segment of LLM testing in 2026, and DeepEval&rsquo;s agent-specific metrics are purpose-built for multi-step agentic systems. <code>StepEfficiencyMetric</code> is particularly important for production cost control: it penalizes redundant tool calls and unnecessary reasoning loops that inflate token usage without improving output quality. A 10-step agent completing a task solvable in 3 steps is a latency and cost problem even if the final answer is correct. <code>ToolCorrectnessMetric</code> evaluates whether the agent called the right tools with correct arguments, and <code>TaskCompletionMetric</code> measures whether the agent&rsquo;s final output actually satisfied the original goal. Together, these three metrics give you correctness, efficiency, and goal-orientation signals — the minimum viable coverage for any production agent system.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> deepeval.test_case <span style="color:#f92672">import</span> LLMTestCase, ToolCall
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> deepeval.metrics <span style="color:#f92672">import</span> (
</span></span><span style="display:flex;"><span>    TaskCompletionMetric,
</span></span><span style="display:flex;"><span>    ToolCorrectnessMetric,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>agent_case <span style="color:#f92672">=</span> LLMTestCase(
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Find all open GitHub issues labeled &#39;bug&#39; and summarize them&#34;</span>,
</span></span><span style="display:flex;"><span>    actual_output<span style="color:#f92672">=</span>agent<span style="color:#f92672">.</span>run(<span style="color:#e6db74">&#34;Find all open GitHub issues labeled &#39;bug&#39; and summarize them&#34;</span>),
</span></span><span style="display:flex;"><span>    tools_called<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>        ToolCall(name<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;github_search&#34;</span>, input<span style="color:#f92672">=</span>{<span style="color:#e6db74">&#34;query&#34;</span>: <span style="color:#e6db74">&#34;is:issue is:open label:bug&#34;</span>}),
</span></span><span style="display:flex;"><span>        ToolCall(name<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;summarize&#34;</span>, input<span style="color:#f92672">=</span>{<span style="color:#e6db74">&#34;text&#34;</span>: <span style="color:#e6db74">&#34;...issues content...&#34;</span>}),
</span></span><span style="display:flex;"><span>    ],
</span></span><span style="display:flex;"><span>    expected_tools<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>        ToolCall(name<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;github_search&#34;</span>, input<span style="color:#f92672">=</span>{<span style="color:#e6db74">&#34;query&#34;</span>: <span style="color:#e6db74">&#34;is:issue is:open label:bug&#34;</span>}),
</span></span><span style="display:flex;"><span>        ToolCall(name<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;summarize&#34;</span>),
</span></span><span style="display:flex;"><span>    ],
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>agent_metrics <span style="color:#f92672">=</span> [
</span></span><span style="display:flex;"><span>    TaskCompletionMetric(threshold<span style="color:#f92672">=</span><span style="color:#ae81ff">0.8</span>),
</span></span><span style="display:flex;"><span>    ToolCorrectnessMetric(threshold<span style="color:#f92672">=</span><span style="color:#ae81ff">0.9</span>),
</span></span><span style="display:flex;"><span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">test_agent_task</span>():
</span></span><span style="display:flex;"><span>    assert_test(agent_case, agent_metrics)
</span></span></code></pre></div><h3 id="measuring-step-efficiency-in-long-chains">Measuring Step Efficiency in Long Chains</h3>
<p>For agents with reasoning traces, capture the full tool call sequence in <code>tools_called</code> and set <code>StepEfficiencyMetric</code> with a <code>threshold</code> matching your acceptable overhead ratio. A threshold of <code>0.7</code> means the agent must complete the task in no more than ~43% more steps than the theoretical minimum:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> deepeval.metrics <span style="color:#f92672">import</span> StepEfficiencyMetric
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>efficiency <span style="color:#f92672">=</span> StepEfficiencyMetric(threshold<span style="color:#f92672">=</span><span style="color:#ae81ff">0.7</span>)
</span></span></code></pre></div><h2 id="how-to-integrate-deepeval-into-cicd-pipelines-with-github-actions">How to Integrate DeepEval into CI/CD Pipelines with GitHub Actions</h2>
<p>Integrating DeepEval into GitHub Actions turns your evaluation suite into a quality gate — pull requests that regress LLM output quality below your thresholds fail the CI check and cannot merge without explicit override. This is the same pattern software teams use for code coverage thresholds, applied to model quality metrics. The key implementation detail: DeepEval exits with a non-zero status code when any metric threshold is violated, which GitHub Actions interprets as a build failure. For API cost control, scope CI evaluations to a representative sample (20–50 cases) rather than the full production dataset; run full-dataset evaluations nightly or on release tags. In 2026, teams shipping features on top of third-party LLM APIs — where the underlying model can be silently updated by the provider — treat these CI gates as insurance against unannounced model behavior changes degrading user experience.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#75715e"># .github/workflows/llm-eval.yml</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">name</span>: <span style="color:#ae81ff">LLM Evaluation</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">on</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">pull_request</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">branches</span>: [<span style="color:#ae81ff">main]</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">push</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">branches</span>: [<span style="color:#ae81ff">main]</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">jobs</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">evaluate</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">runs-on</span>: <span style="color:#ae81ff">ubuntu-latest</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">steps</span>:
</span></span><span style="display:flex;"><span>      - <span style="color:#f92672">uses</span>: <span style="color:#ae81ff">actions/checkout@v4</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>      - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">Set up Python</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">uses</span>: <span style="color:#ae81ff">actions/setup-python@v5</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">with</span>:
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">python-version</span>: <span style="color:#e6db74">&#34;3.12&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>      - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">Install dependencies</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">run</span>: |<span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">          pip install deepeval
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">          pip install -r requirements.txt</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>      - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">Run DeepEval suite</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">env</span>:
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">OPENAI_API_KEY</span>: <span style="color:#ae81ff">${{ secrets.OPENAI_API_KEY }}</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">run</span>: |<span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">          deepeval test run tests/eval/ -v --exit-on-first-failure</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>      - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">Upload evaluation report</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">if</span>: <span style="color:#ae81ff">always()</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">uses</span>: <span style="color:#ae81ff">actions/upload-artifact@v4</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">with</span>:
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">name</span>: <span style="color:#ae81ff">deepeval-report</span>
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">path</span>: <span style="color:#ae81ff">.deepeval/</span>
</span></span></code></pre></div><h3 id="caching-evaluation-results-to-reduce-api-costs">Caching Evaluation Results to Reduce API Costs</h3>
<p>DeepEval supports result caching via <code>.deepeval/cache/</code>. Enable it to skip re-evaluating unchanged test cases between CI runs:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> deepeval <span style="color:#f92672">import</span> evaluate
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>results <span style="color:#f92672">=</span> evaluate(
</span></span><span style="display:flex;"><span>    test_cases<span style="color:#f92672">=</span>dataset,
</span></span><span style="display:flex;"><span>    metrics<span style="color:#f92672">=</span>metrics,
</span></span><span style="display:flex;"><span>    use_cache<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,      <span style="color:#75715e"># skip cases with identical inputs/outputs from prior runs</span>
</span></span><span style="display:flex;"><span>    run_async<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,      <span style="color:#75715e"># parallel evaluation — 3-5x faster on large suites</span>
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><h2 id="production-evaluation-patterns--async-referenceless-and-threshold-based">Production Evaluation Patterns — Async, Referenceless, and Threshold-Based</h2>
<p>Production LLM evaluation differs from offline testing in three ways: it must be non-blocking, it usually has no ground-truth labels, and it needs threshold-based alerting rather than binary pass/fail. DeepEval&rsquo;s async evaluation API runs metric scoring in a background task without blocking the response path — users get their answer immediately while evaluation happens asynchronously. Referenceless metrics like <code>AnswerRelevancyMetric</code> and <code>FaithfulnessMetric</code> score quality against the input and retrieved context alone, requiring no hand-labeled expected outputs. This means you can evaluate 100% of production traffic on day one without building a labeling pipeline first. Threshold-based alerting integrates with the Confident AI dashboard or your own observability stack: when the rolling average score for a metric drops below a threshold over a time window, trigger an alert before users notice the degradation. This is the closest LLM operations comes to classical SLO monitoring.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> asyncio
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> deepeval.metrics <span style="color:#f92672">import</span> AnswerRelevancyMetric, FaithfulnessMetric
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> deepeval.test_case <span style="color:#f92672">import</span> LLMTestCase
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">async</span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">evaluate_production_response</span>(user_input: str, llm_output: str, context: list[str]):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">case</span> <span style="color:#f92672">=</span> LLMTestCase(
</span></span><span style="display:flex;"><span>        input<span style="color:#f92672">=</span>user_input,
</span></span><span style="display:flex;"><span>        actual_output<span style="color:#f92672">=</span>llm_output,
</span></span><span style="display:flex;"><span>        retrieval_context<span style="color:#f92672">=</span>context,
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    metrics <span style="color:#f92672">=</span> [
</span></span><span style="display:flex;"><span>        AnswerRelevancyMetric(threshold<span style="color:#f92672">=</span><span style="color:#ae81ff">0.7</span>, async_mode<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>),
</span></span><span style="display:flex;"><span>        FaithfulnessMetric(threshold<span style="color:#f92672">=</span><span style="color:#ae81ff">0.8</span>, async_mode<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>),
</span></span><span style="display:flex;"><span>    ]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Non-blocking: scores are computed in background</span>
</span></span><span style="display:flex;"><span>    tasks <span style="color:#f92672">=</span> [m<span style="color:#f92672">.</span>a_measure(<span style="color:#66d9ef">case</span>) <span style="color:#66d9ef">for</span> m <span style="color:#f92672">in</span> metrics]
</span></span><span style="display:flex;"><span>    scores <span style="color:#f92672">=</span> <span style="color:#66d9ef">await</span> asyncio<span style="color:#f92672">.</span>gather(<span style="color:#f92672">*</span>tasks)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Log to your observability stack (log_metric is your own sink)</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> metric, score <span style="color:#f92672">in</span> zip(metrics, scores):
</span></span><span style="display:flex;"><span>        log_metric(name<span style="color:#f92672">=</span>metric<span style="color:#f92672">.</span>__class__<span style="color:#f92672">.</span>__name__, score<span style="color:#f92672">=</span>score, input<span style="color:#f92672">=</span>user_input)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> scores
</span></span></code></pre></div><h3 id="sampling-strategy-for-cost-effective-production-monitoring">Sampling Strategy for Cost-Effective Production Monitoring</h3>
<p>Evaluating every production request with LLM-as-a-Judge metrics is expensive. Use a stratified sampling approach: evaluate 100% of low-confidence outputs (detect these via your model&rsquo;s logprobs or a cheap classifier), 10% of standard traffic uniformly, and 100% of flagged conversations (user thumbs-down, error states, long sessions). This targets evaluation budget at the cases most likely to reveal real problems.</p>
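<p>A minimal sketch of this sampling gate in plain Python (the <code>should_evaluate</code> helper and its parameter names are illustrative, not part of DeepEval):</p>

```python
import random

def should_evaluate(confidence: float, flagged: bool,
                    sample_rate: float = 0.10,
                    confidence_floor: float = 0.6,
                    rng=random.random) -> bool:
    """Stratified sampling gate for production evaluation.

    Always evaluate flagged conversations and low-confidence outputs;
    uniformly sample the remaining standard traffic at sample_rate.
    """
    if flagged or confidence < confidence_floor:
        return True
    return rng() < sample_rate
```

<p>Tune <code>sample_rate</code> and <code>confidence_floor</code> against your evaluation budget; <code>rng</code> is injectable so the gate itself is unit-testable.</p>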
<h2 id="deepeval-vs-ragas-vs-trulens-vs-braintrust--when-to-use-each">DeepEval vs RAGAS vs TruLens vs Braintrust — When to Use Each</h2>
<p>DeepEval is the right choice when your team thinks in code, wants CI/CD integration as a first-class feature, and needs broad metric coverage across RAG, agents, safety, and custom criteria. It is not the only strong option in 2026, and picking the wrong tool creates friction that undermines adoption. RAGAS is purpose-built for RAG pipeline evaluation with deeper retrieval-chain diagnostics than DeepEval, but it lacks agent metrics and has no native CI/CD integration. TruLens focuses on observability integration — it pairs well with LangChain and LlamaIndex and provides tracing alongside eval, but its metric library is narrower. Braintrust (reviewed separately) prioritizes product and PM dashboards with A/B experiment workflows and a polished web UI, making it the right choice when non-engineers need to participate in evaluation — but it&rsquo;s a managed SaaS platform, not an open-source library. The decision usually comes down to who runs evaluations and where they live: engineers in CI choose DeepEval, data scientists in notebooks choose RAGAS, product teams in dashboards choose Braintrust.</p>
<table>
  <thead>
      <tr>
          <th>Framework</th>
          <th>Best For</th>
          <th style="text-align: center">CI/CD Native</th>
          <th style="text-align: center">Open Source</th>
          <th style="text-align: center">Agent Metrics</th>
          <th style="text-align: center">RAG Depth</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>DeepEval</strong></td>
          <td>Code-first engineering teams</td>
          <td style="text-align: center">✅</td>
          <td style="text-align: center">✅</td>
          <td style="text-align: center">✅</td>
          <td style="text-align: center">Good</td>
      </tr>
      <tr>
          <td><strong>RAGAS</strong></td>
          <td>RAG pipeline specialists</td>
          <td style="text-align: center">❌</td>
          <td style="text-align: center">✅</td>
          <td style="text-align: center">❌</td>
          <td style="text-align: center">Excellent</td>
      </tr>
      <tr>
          <td><strong>TruLens</strong></td>
          <td>LangChain/LlamaIndex observability</td>
          <td style="text-align: center">Partial</td>
          <td style="text-align: center">✅</td>
          <td style="text-align: center">Partial</td>
          <td style="text-align: center">Good</td>
      </tr>
      <tr>
          <td><strong>Braintrust</strong></td>
          <td>Product/PM A/B testing dashboards</td>
          <td style="text-align: center">Partial</td>
          <td style="text-align: center">❌</td>
          <td style="text-align: center">Limited</td>
          <td style="text-align: center">Good</td>
      </tr>
  </tbody>
</table>
<h3 id="migration-path-from-ragas-to-deepeval">Migration Path from RAGAS to DeepEval</h3>
<p>If your team currently uses RAGAS, DeepEval&rsquo;s RAG metrics cover the same ground and the migration is mostly a data-structure swap. Replace <code>ragas.metrics</code> imports with <code>deepeval.metrics</code>, convert your evaluation rows to <code>LLMTestCase</code> objects, and wrap each case in <code>assert_test()</code>. The pytest harness handles the rest. Teams that migrate typically report better CI integration and broader metric coverage, at the cost of RAGAS&rsquo;s more granular retrieval-chain attribution.</p>
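<p>The data-structure swap can be sketched as a small adapter. This assumes the classic RAGAS column names (<code>question</code>, <code>answer</code>, <code>contexts</code>, <code>ground_truth</code>); the helper itself is illustrative, not part of either library:</p>

```python
def ragas_row_to_llmtestcase_kwargs(row: dict) -> dict:
    """Map one classic RAGAS evaluation row onto the keyword arguments
    DeepEval's LLMTestCase expects (pass the result as **kwargs)."""
    return {
        "input": row["question"],
        "actual_output": row["answer"],
        "retrieval_context": list(row.get("contexts", [])),
        "expected_output": row.get("ground_truth"),  # optional for referenceless metrics
    }
```

<p>Feed each mapped dict into <code>LLMTestCase(**kwargs)</code> and wrap the case in <code>assert_test()</code> inside your pytest functions.</p>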
<hr>
<h2 id="frequently-asked-questions">Frequently Asked Questions</h2>
<p><strong>Q: Does DeepEval require internet access or a specific LLM API to run?</strong></p>
<p>DeepEval requires an LLM judge to score most metrics &mdash; by default it uses OpenAI GPT-4o. However, you can configure any OpenAI-compatible endpoint, including local Ollama models, using the <code>DeepEvalBaseLLM</code> base class. For completely offline evaluation, point the judge at a local model serving an OpenAI-compatible API (e.g., <code>ollama serve</code>). At the CLI level, the <code>--model</code> parameter likewise accepts a custom model.</p>
<p><strong>Q: How do I handle flaky evaluation results caused by LLM judge nondeterminism?</strong></p>
<p>Set <code>temperature=0</code> on your judge model for maximum reproducibility. For metrics where score variance still matters, run each case through the metric 3 times and take the median &mdash; DeepEval supports <code>n_retries</code> on most metrics for this purpose. In CI, treat scores more than 0.05 below the threshold as definite failures, and scores within 0.05 of the threshold as warnings that require human review rather than automatic build failure.</p>
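<p>A sketch of the median-and-band pattern, assuming only that the metric exposes <code>.measure()</code> and a <code>.score</code> attribute as DeepEval metrics do (the <code>gate</code> helper and its 0.05 margin are illustrative):</p>

```python
import statistics

def median_score(metric, test_case, runs: int = 3) -> float:
    """Score a test case several times and take the median to damp
    judge nondeterminism. Assumes metric.measure() sets metric.score."""
    scores = []
    for _ in range(runs):
        metric.measure(test_case)
        scores.append(metric.score)
    return statistics.median(scores)

def gate(score: float, threshold: float, margin: float = 0.05) -> str:
    """Hard-fail well below the threshold; warn inside the margin band."""
    if score < threshold - margin:
        return "fail"
    if score < threshold:
        return "warn"
    return "pass"
```

<p>In CI, map <code>"fail"</code> to a non-zero exit and surface <code>"warn"</code> as an annotation for human review.</p>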
<p><strong>Q: Can DeepEval evaluate streaming LLM responses?</strong></p>
<p>DeepEval evaluates complete outputs, not streams. Accumulate the full streamed response into a string before constructing the <code>LLMTestCase</code>. For production monitoring with streaming responses, buffer the output in your application layer, evaluate asynchronously after streaming completes, and log the score to your observability stack. The latency impact on the user experience is zero.</p>
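<p>The buffering step can be sketched as follows; <code>on_chunk</code> stands in for whatever your application uses to forward tokens to the client, and the helper name is illustrative:</p>

```python
def stream_and_capture(chunks, on_chunk):
    """Forward each streamed chunk to the client immediately while
    buffering the complete output for post-hoc evaluation."""
    parts = []
    for chunk in chunks:      # chunks: token/delta iterator from your provider SDK
        on_chunk(chunk)       # user sees the token with no added latency
        parts.append(chunk)
    return "".join(parts)     # full text for LLMTestCase(actual_output=...)
```

<p>Hand the returned string to your async evaluation path once streaming completes.</p>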
<p><strong>Q: What&rsquo;s the cheapest way to run DeepEval on a large evaluation dataset?</strong></p>
<p>Enable caching (<code>use_cache=True</code>) to skip re-evaluating unchanged cases. Use async evaluation (<code>run_async=True</code>) to parallelize API calls. Choose <code>gpt-4o-mini</code> or <code>claude-haiku-4-5</code> as the judge model for lower-stakes metrics — they score within 5-8% of GPT-4o on most standard metrics while costing 10-20x less. Reserve the strongest judge model for G-Eval and custom criteria that require deeper reasoning.</p>
<p><strong>Q: How does DeepEval handle multi-turn conversation evaluation?</strong></p>
<p>DeepEval supports conversational evaluation through <code>ConversationalTestCase</code>, which wraps a list of <code>LLMTestCase</code> objects representing each turn. Conversational metrics like <code>ConversationRelevancyMetric</code> and <code>KnowledgeRetentionMetric</code> score the full dialogue arc rather than individual turns. This is essential for evaluating chatbots, support agents, and multi-step assistants where quality degrades across turns rather than within a single response.</p>
]]></content:encoded></item></channel></rss>