<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Opentelemetry on RockB</title><link>https://baeseokjae.github.io/tags/opentelemetry/</link><description>Recent content in Opentelemetry on RockB</description><image><title>RockB</title><url>https://baeseokjae.github.io/images/og-default.png</url><link>https://baeseokjae.github.io/images/og-default.png</link></image><generator>Hugo</generator><language>en-us</language><lastBuildDate>Tue, 19 May 2026 09:04:46 +0000</lastBuildDate><atom:link href="https://baeseokjae.github.io/tags/opentelemetry/index.xml" rel="self" type="application/rss+xml"/><item><title>AI Agent Observability with OpenTelemetry: From Dev to Production in 2026</title><link>https://baeseokjae.github.io/posts/ai-agent-observability-opentelemetry-2026/</link><pubDate>Tue, 19 May 2026 09:04:46 +0000</pubDate><guid>https://baeseokjae.github.io/posts/ai-agent-observability-opentelemetry-2026/</guid><description>Complete guide to instrumenting AI agents with OpenTelemetry GenAI semantic conventions — from local Jaeger to production Grafana Cloud in 2026.</description><content:encoded><![CDATA[<p>OpenTelemetry is the standard way to add structured tracing, metrics, and logs to AI agents in 2026 — covering token usage, tool call latency, and multi-agent context propagation with a single SDK and vendor-neutral backends.</p>
<h2 id="why-traditional-observability-fails-for-ai-agents">Why Traditional Observability Fails for AI Agents</h2>
<p>Traditional APM tools like Datadog APM or New Relic were designed for deterministic request/response cycles: a user hits an endpoint, a function runs, a database query fires, a response returns. The execution path is fixed, latency is bounded, and errors are binary. AI agents break every one of these assumptions. An agent reasoning chain is non-deterministic: the same input prompt can trigger three tool calls in one run and seven in the next. Execution duration ranges from 500ms for a fast LLM call to 3+ minutes for a multi-step agent that searches the web, queries a database, and synthesizes results. Without agent-native spans, you cannot tell which tool call caused a timeout or why a particular run cost $0.40 while a similar one cost $0.03. Traditional APM measures function latency in microseconds and ignores tokens entirely. The market reflects this gap: LLM observability platforms are estimated at $2.69 billion in 2026 and projected to reach $9.26 billion by 2030, a 36.2% CAGR. OpenTelemetry&rsquo;s GenAI Semantic Conventions fill the gap with a purpose-built span model for LLM operations, agent reasoning loops, and tool executions that traditional APM never anticipated.</p>
<h3 id="what-makes-ai-agent-telemetry-different">What Makes AI Agent Telemetry Different?</h3>
<p>AI agents require three observability primitives that traditional APM lacks. First, <strong>token-based cost attribution</strong> — you need to know how many input and output tokens each LLM call consumed, mapped to a session, user, or feature. Second, <strong>reasoning chain tracing</strong> — a parent span for the agent loop with child spans for each tool call, LLM request, and decision step, linked by trace context so you can reconstruct the full execution tree. Third, <strong>non-deterministic failure modes</strong> — an agent might hallucinate a tool name, exceed its context window mid-run, or loop indefinitely; catching these requires span attributes that conventional HTTP APM never defines. GenAI conventions add <code>gen_ai.operation.name</code>, <code>gen_ai.system</code>, <code>gen_ai.request.model</code>, and <code>gen_ai.usage.input_tokens</code> to fill exactly these gaps.</p>
<h3 id="the-token-economy-problem">The Token Economy Problem</h3>
<p>A single user session might trigger dozens of LLM calls across multiple agents. Without per-call token tracking, your billing dashboard shows a lump sum while your engineers have no idea which feature, agent, or user is driving costs. OpenTelemetry&rsquo;s <code>gen_ai.client.token.usage</code> metric and corresponding span attributes let you aggregate token spend by <code>gen_ai.agent.name</code>, session ID, or custom attribute — giving you cost observability with the same instrumentation that drives latency dashboards.</p>
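<p>To make the aggregation concrete, here is a minimal sketch that sums token usage by any span attribute. The span records are hypothetical plain dictionaries standing in for whatever your tracing backend returns, and <code>tokens_by</code> is an illustrative helper, not part of any OpenTelemetry API.</p>

```python
from collections import defaultdict

# Hypothetical exported span records; a real backend query would return similar fields.
spans = [
    {"gen_ai.agent.name": "research_agent", "session.id": "s1",
     "gen_ai.usage.input_tokens": 1248, "gen_ai.usage.output_tokens": 342},
    {"gen_ai.agent.name": "research_agent", "session.id": "s2",
     "gen_ai.usage.input_tokens": 900, "gen_ai.usage.output_tokens": 210},
    {"gen_ai.agent.name": "writer_agent", "session.id": "s1",
     "gen_ai.usage.input_tokens": 400, "gen_ai.usage.output_tokens": 800},
]


def tokens_by(span_records, key):
    """Sum input and output tokens per distinct value of `key`."""
    totals = defaultdict(lambda: {"input": 0, "output": 0})
    for span in span_records:
        bucket = totals[span[key]]
        bucket["input"] += span.get("gen_ai.usage.input_tokens", 0)
        bucket["output"] += span.get("gen_ai.usage.output_tokens", 0)
    return dict(totals)


by_agent = tokens_by(spans, "gen_ai.agent.name")
by_session = tokens_by(spans, "session.id")
print(by_agent)
print(by_session)
```

<p>The same grouping key can be swapped for a feature flag or user ID, which is why attaching those attributes at span creation time matters.</p>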
<h2 id="opentelemetry-genai-semantic-conventions-the-2026-standard">OpenTelemetry GenAI Semantic Conventions: The 2026 Standard</h2>
<p>OpenTelemetry GenAI Semantic Conventions are the standardized attribute names, span structure, and metric definitions that give AI telemetry a common language across vendors and frameworks. In early 2026, GenAI client spans and the <code>gen_ai.client.token.usage</code> / <code>gen_ai.client.operation.duration</code> metrics exited experimental status and became stable, meaning you can rely on them in production without fear of breaking changes. Agent-specific attributes (<code>gen_ai.agent.name</code>, <code>gen_ai.tool.name</code>) and framework-level instrumentation remain experimental, so their names may still change, although most major observability vendors already ingest them in production. The conventions also define how to capture prompt and completion content safely: in span events rather than span attributes, so content capture stays opt-in and PII does not leak into your metrics store. Gartner predicts that by 2028 LLM observability investments will cover 50% of GenAI deployments, up from 15% in early 2026, and OpenTelemetry&rsquo;s vendor-neutral standard is what makes that investment transferable across backends.</p>
<h3 id="core-genai-span-attributes">Core GenAI Span Attributes</h3>
<p>The core attributes every AI agent span should carry (the two agent-level attributes at the bottom are still experimental):</p>
<table>
  <thead>
      <tr>
          <th>Attribute</th>
          <th>Type</th>
          <th>Example</th>
          <th>Purpose</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>gen_ai.system</code></td>
          <td>string</td>
          <td><code>openai</code>, <code>anthropic</code></td>
          <td>Identifies the LLM provider</td>
      </tr>
      <tr>
          <td><code>gen_ai.operation.name</code></td>
          <td>string</td>
          <td><code>chat</code>, <code>execute_tool</code></td>
          <td>Type of GenAI operation</td>
      </tr>
      <tr>
          <td><code>gen_ai.request.model</code></td>
          <td>string</td>
          <td><code>gpt-5</code>, <code>claude-opus-4</code></td>
          <td>Requested model name</td>
      </tr>
      <tr>
          <td><code>gen_ai.response.model</code></td>
          <td>string</td>
          <td><code>gpt-5-2026-05</code></td>
          <td>Actual model version used</td>
      </tr>
      <tr>
          <td><code>gen_ai.usage.input_tokens</code></td>
          <td>int</td>
          <td><code>1248</code></td>
          <td>Prompt tokens consumed</td>
      </tr>
      <tr>
          <td><code>gen_ai.usage.output_tokens</code></td>
          <td>int</td>
          <td><code>342</code></td>
          <td>Completion tokens generated</td>
      </tr>
      <tr>
          <td><code>gen_ai.agent.name</code></td>
          <td>string</td>
          <td><code>research_agent</code></td>
          <td>Identifies the agent (experimental)</td>
      </tr>
      <tr>
          <td><code>gen_ai.tool.name</code></td>
          <td>string</td>
          <td><code>web_search</code></td>
          <td>Tool called by the agent (experimental)</td>
      </tr>
  </tbody>
</table>
<h3 id="span-events-vs-span-attributes-for-content">Span Events vs Span Attributes for Content</h3>
<p>The conventions deliberately separate prompt and completion content from the main span attribute set. Content goes into <strong>span events</strong>, specifically <code>gen_ai.content.prompt</code> and <code>gen_ai.content.completion</code> events, rather than span attributes. This design has three consequences: content capture is opt-in (disabled by default), you can strip content at the collector level without losing metrics, and you avoid accidentally indexing PII into your tracing backend. For GDPR compliance, this is critical: you can run full token usage and latency observability without ever storing a single user message. The names used for content capture have changed between convention releases, so confirm which ones your instrumentation library actually emits.</p>
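<p>The separation is easiest to see on a toy record. The sketch below uses a hypothetical span represented as a plain dictionary and an illustrative <code>strip_content_events</code> helper; in a real pipeline the equivalent filtering would happen in your collector or exporter configuration rather than in application code.</p>

```python
CONTENT_EVENT_NAMES = {"gen_ai.content.prompt", "gen_ai.content.completion"}


def strip_content_events(span):
    """Return a copy of a span record with prompt/completion events removed."""
    kept = [event for event in span.get("events", [])
            if event["name"] not in CONTENT_EVENT_NAMES]
    return {**span, "events": kept}


# Hypothetical span record as a plain dict, for illustration only.
span = {
    "name": "chat gpt-5",
    "attributes": {"gen_ai.usage.input_tokens": 1248,
                   "gen_ai.usage.output_tokens": 342},
    "events": [
        {"name": "gen_ai.content.prompt", "body": "user message text"},
        {"name": "gen_ai.content.completion", "body": "model reply text"},
        {"name": "retry", "body": "rate limited, retrying"},
    ],
}

clean = strip_content_events(span)
print(clean["attributes"])                      # token counts are untouched
print([e["name"] for e in clean["events"]])     # content events are gone
```

<p>Because token counts live in attributes and content lives in events, dropping the events removes user text while leaving cost and latency data intact.</p>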
<h2 id="setting-up-opentelemetry-for-ai-agents-in-python-step-by-step">Setting Up OpenTelemetry for AI Agents in Python (Step-by-Step)</h2>
<p>Getting OpenTelemetry running for an AI agent takes about 20 minutes from zero to local Jaeger traces. The setup uses <code>opentelemetry-sdk</code>, a GenAI instrumentation library (<code>openlit</code>, <code>openinference</code>, or <code>opentelemetry-instrumentation-openai</code> depending on your framework), and a local Jaeger instance for development. In production, you swap the exporter endpoint to Grafana Cloud, Honeycomb, or any OTLP-compatible backend, and the instrumentation code stays identical. 85% of organizations with GenAI deployments planned to treat LLM observability as key infrastructure in 2026, and OpenTelemetry&rsquo;s backend-agnostic design is why they can avoid vendor lock-in at the SDK layer. The key insight is that auto-instrumentation handles the heavy lifting for LLM API calls, while manual spans wrap the agent loop itself. This two-layer approach, with auto-instrumented LLM calls nested inside manually traced agent runs, gives you complete visibility into both LLM-level metrics (tokens, latency per call) and agent-level behavior (iterations, tool success rates, end-to-end duration) without duplicating code across every model integration your agent might use. The five steps below take you from a fresh Python environment to a trace visible in Jaeger, then show the small exporter change needed to point that same setup at a production backend.</p>
<h3 id="step-1-install-dependencies">Step 1: Install Dependencies</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>pip install opentelemetry-sdk <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>            opentelemetry-exporter-otlp <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>            openlit <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>            openai  <span style="color:#75715e"># or anthropic, langchain, etc.</span>
</span></span></code></pre></div><p><code>openlit</code> is one of the simplest auto-instrumentation options: a single <code>openlit.init()</code> call instruments OpenAI, Anthropic, LangChain, and LlamaIndex clients automatically.</p>
<h3 id="step-2-configure-the-tracer-provider">Step 2: Configure the Tracer Provider</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> opentelemetry <span style="color:#f92672">import</span> trace
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> opentelemetry.sdk.trace <span style="color:#f92672">import</span> TracerProvider
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> opentelemetry.sdk.trace.export <span style="color:#f92672">import</span> BatchSpanProcessor
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> opentelemetry.exporter.otlp.proto.grpc.trace_exporter <span style="color:#f92672">import</span> OTLPSpanExporter
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> openlit
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Point to local Jaeger in dev, Grafana Cloud / Honeycomb in prod</span>
</span></span><span style="display:flex;"><span>otlp_endpoint <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;http://localhost:4317&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>provider <span style="color:#f92672">=</span> TracerProvider()
</span></span><span style="display:flex;"><span>provider<span style="color:#f92672">.</span>add_span_processor(
</span></span><span style="display:flex;"><span>    BatchSpanProcessor(OTLPSpanExporter(endpoint<span style="color:#f92672">=</span>otlp_endpoint))
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>trace<span style="color:#f92672">.</span>set_tracer_provider(provider)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Auto-instrument all supported LLM clients</span>
</span></span><span style="display:flex;"><span>openlit<span style="color:#f92672">.</span>init(
</span></span><span style="display:flex;"><span>    otlp_endpoint<span style="color:#f92672">=</span>otlp_endpoint,
</span></span><span style="display:flex;"><span>    capture_message_content<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>,  <span style="color:#75715e"># Opt-in; set True only in dev</span>
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><h3 id="step-3-instrument-your-agent-loop">Step 3: Instrument Your Agent Loop</h3>
<p>Auto-instrumentation covers LLM API calls. For the agent loop itself, add manual spans:</p>
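<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> opentelemetry <span style="color:#f92672">import</span> trace
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>MAX_ITERATIONS <span style="color:#f92672">=</span> <span style="color:#ae81ff">10</span>  <span style="color:#75715e"># Hard cap so a confused agent cannot loop forever</span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI()  <span style="color:#75715e"># Reads OPENAI_API_KEY from the environment</span>
</span></span><span style="display:flex;"><span>tracer <span style="color:#f92672">=</span> trace<span style="color:#f92672">.</span>get_tracer(<span style="color:#e6db74">&#34;my_agent&#34;</span>, <span style="color:#e6db74">&#34;1.0.0&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># extract_tool_calls, execute_tool, and extract_final_answer are your own helpers</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">run_agent</span>(task: str, session_id: str) <span style="color:#f92672">-&gt;</span> str:
</span></span><span style="display:flex;"><span>    messages <span style="color:#f92672">=</span> [{<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: task}]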
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">with</span> tracer<span style="color:#f92672">.</span>start_as_current_span(
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;agent.run&#34;</span>,
</span></span><span style="display:flex;"><span>        attributes<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;gen_ai.agent.name&#34;</span>: <span style="color:#e6db74">&#34;research_agent&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;session.id&#34;</span>: session_id,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;agent.task&#34;</span>: task[:<span style="color:#ae81ff">100</span>],  <span style="color:#75715e"># Truncate for index efficiency</span>
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>    ) <span style="color:#66d9ef">as</span> span:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">for</span> iteration <span style="color:#f92672">in</span> range(MAX_ITERATIONS):
</span></span><span style="display:flex;"><span>            span<span style="color:#f92672">.</span>set_attribute(<span style="color:#e6db74">&#34;agent.iterations&#34;</span>, iteration <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>            
</span></span><span style="display:flex;"><span>            <span style="color:#75715e"># LLM call — auto-instrumented by openlit</span>
</span></span><span style="display:flex;"><span>            response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>chat<span style="color:#f92672">.</span>completions<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>                model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-5&#34;</span>,
</span></span><span style="display:flex;"><span>                messages<span style="color:#f92672">=</span>messages
</span></span><span style="display:flex;"><span>            )
</span></span><span style="display:flex;"><span>            
</span></span><span style="display:flex;"><span>            tool_calls <span style="color:#f92672">=</span> extract_tool_calls(response)
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> <span style="color:#f92672">not</span> tool_calls:
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">break</span>
</span></span><span style="display:flex;"><span>                
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">for</span> tool_call <span style="color:#f92672">in</span> tool_calls:
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">with</span> tracer<span style="color:#f92672">.</span>start_as_current_span(
</span></span><span style="display:flex;"><span>                    <span style="color:#e6db74">&#34;agent.tool_call&#34;</span>,
</span></span><span style="display:flex;"><span>                    attributes<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>                        <span style="color:#e6db74">&#34;gen_ai.tool.name&#34;</span>: tool_call<span style="color:#f92672">.</span>name,
</span></span><span style="display:flex;"><span>                        <span style="color:#e6db74">&#34;gen_ai.tool.call.id&#34;</span>: tool_call<span style="color:#f92672">.</span>id,
</span></span><span style="display:flex;"><span>                    }
</span></span><span style="display:flex;"><span>                ) <span style="color:#66d9ef">as</span> tool_span:
</span></span><span style="display:flex;"><span>                    result <span style="color:#f92672">=</span> execute_tool(tool_call)
</span></span><span style="display:flex;"><span>                    tool_span<span style="color:#f92672">.</span>set_attribute(<span style="color:#e6db74">&#34;tool.success&#34;</span>, result<span style="color:#f92672">.</span>ok)
</span></span><span style="display:flex;"><span>                    <span style="color:#75715e"># Append the tool result to messages here so the next LLM call can see it</span>
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> extract_final_answer(response)
</span></span></code></pre></div><h3 id="step-4-run-local-jaeger-for-development">Step 4: Run Local Jaeger for Development</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>docker run -d --name jaeger <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -p 4317:4317 <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -p 16686:16686 <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  jaegertracing/all-in-one:latest
</span></span></code></pre></div><p>Open <code>http://localhost:16686</code> to see traces. Each agent run appears as a root span with nested LLM call spans and tool call spans — you can drill into any span to see token counts, model versions, and timing.</p>
<h3 id="step-5-switch-to-production-backend">Step 5: Switch to Production Backend</h3>
<p>Replace the OTLP endpoint with your production backend:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> os
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> opentelemetry.exporter.otlp.proto.http.trace_exporter <span style="color:#f92672">import</span> OTLPSpanExporter
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Grafana Cloud&#39;s OTLP gateway accepts OTLP over HTTP, so use the HTTP exporter with the full traces path</span>
</span></span><span style="display:flex;"><span>otlp_endpoint <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;https://otlp-gateway-prod-us-east-0.grafana.net/otlp/v1/traces&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Basic auth value is base64(&#34;instance_id:api_token&#34;); the env var name is a placeholder</span>
</span></span><span style="display:flex;"><span>exporter <span style="color:#f92672">=</span> OTLPSpanExporter(
</span></span><span style="display:flex;"><span>    endpoint<span style="color:#f92672">=</span>otlp_endpoint,
</span></span><span style="display:flex;"><span>    headers<span style="color:#f92672">=</span>{<span style="color:#e6db74">&#34;Authorization&#34;</span>: <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Basic </span><span style="color:#e6db74">{</span>os<span style="color:#f92672">.</span>environ[<span style="color:#e6db74">&#39;GRAFANA_OTLP_AUTH&#39;</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>},
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><p>The instrumentation code does not change; only the exporter setup does.</p>
<h2 id="the-6-essential-metrics-every-production-ai-agent-needs">The 6 Essential Metrics Every Production AI Agent Needs</h2>
<p>Production AI agent observability requires six distinct metrics that cover cost, reliability, performance, and capacity. These map directly to OpenTelemetry GenAI metric definitions and can be derived from spans if you do not emit them explicitly. Agent execution durations range from 500ms to 3+ minutes; without these metrics, identifying which tool call caused a timeout is nearly impossible. The six metrics form a complete diagnostic surface: token usage ties to cost, tool call success rate ties to reliability, LLM latency ties to user experience, loop iterations catch runaway loops, context window utilization prevents silent truncation, and end-to-end latency covers the full user-facing impact. Two of these, <code>gen_ai.client.token.usage</code> and <code>gen_ai.client.operation.duration</code>, are stable OTel metrics in 2026, meaning vendor-provided dashboards and alerting templates are available out of the box. The remaining four are derived from span attributes and durations on your agent spans. Tracking all six from day one of production deployment means you have a complete baseline when something goes wrong, rather than scrambling to add instrumentation after an incident. Each metric below includes the OTel attribute or metric name to use and, where one applies, a concrete alert threshold that distinguishes healthy agent behavior from a problem worth waking someone up for.</p>
<h3 id="1-token-usage-per-run">1. Token Usage per Run</h3>
<p><strong>Metric:</strong> <code>gen_ai.client.token.usage</code> (histogram, stable in OTel 2026)</p>
<p>Emit this metric with <code>gen_ai.token.type</code> (<code>input</code> / <code>output</code>), <code>gen_ai.system</code>, <code>gen_ai.request.model</code>, and <code>gen_ai.agent.name</code> as an additional attribute. This lets you build dashboards showing cost per agent and, if you also attach session or feature attributes, per session and per feature. For a production agent handling 10,000 sessions/day, a 10% reduction in input tokens can cut monthly spend by thousands of dollars, but you cannot optimize what you do not measure.</p>
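<p>The savings claim is simple arithmetic, sketched below. Every number except the 10,000 sessions/day figure is an assumption chosen for illustration (tokens per session, price per million tokens, days per month); substitute your own measured values and your provider&rsquo;s actual pricing.</p>

```python
# All values below are illustrative assumptions, not measurements or real prices.
SESSIONS_PER_DAY = 10_000
INPUT_TOKENS_PER_SESSION = 40_000        # assumed average across all LLM calls in a session
PRICE_PER_MILLION_INPUT_TOKENS = 2.50    # assumed USD price
DAYS_PER_MONTH = 30


def monthly_input_cost(tokens_per_session):
    """Monthly USD spend on input tokens for a given per-session average."""
    tokens_per_month = SESSIONS_PER_DAY * tokens_per_session * DAYS_PER_MONTH
    return tokens_per_month / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS


baseline = monthly_input_cost(INPUT_TOKENS_PER_SESSION)
reduced = monthly_input_cost(INPUT_TOKENS_PER_SESSION * 0.9)  # 10% fewer input tokens
savings = baseline - reduced
print(f"baseline ${baseline:,.0f}/month, savings ${savings:,.0f}/month")
```

<p>Under these assumptions the baseline is about $30,000 per month and a 10% input reduction saves about $3,000. The per-session token average is exactly the number that <code>gen_ai.client.token.usage</code> gives you.</p>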
<h3 id="2-tool-call-success-rate">2. Tool Call Success Rate</h3>
<p>Track <code>tool.success</code> as a boolean span attribute on each tool call span. Aggregate to a success rate metric by <code>gen_ai.tool.name</code>. A web search tool with a 95% success rate looks fine until you check that the 5% failures all cluster around a specific query pattern — only per-tool tracing surfaces that.</p>
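<p>A minimal sketch of that aggregation follows. The tool-call spans are hypothetical dictionaries carrying the two attributes used in this guide, and <code>success_rate_by_tool</code> is an illustrative helper; most backends can express the same grouping in their query language.</p>

```python
from collections import Counter

# Hypothetical tool-call spans carrying the attributes used in this guide.
tool_spans = [
    {"gen_ai.tool.name": "web_search", "tool.success": True},
    {"gen_ai.tool.name": "web_search", "tool.success": False},
    {"gen_ai.tool.name": "web_search", "tool.success": True},
    {"gen_ai.tool.name": "sql_query", "tool.success": True},
]


def success_rate_by_tool(span_records):
    """Return {tool_name: fraction of calls where tool.success is True}."""
    calls, successes = Counter(), Counter()
    for span in span_records:
        name = span["gen_ai.tool.name"]
        calls[name] += 1
        successes[name] += 1 if span["tool.success"] else 0
    return {name: successes[name] / calls[name] for name in calls}


rates = success_rate_by_tool(tool_spans)
for name, rate in sorted(rates.items()):
    print(f"{name}: {rate:.0%}")
```

<p>Grouping by a second attribute, such as a query category, is what reveals the clustered failures described above.</p>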
<h3 id="3-llm-latency-distribution-p50p95p99">3. LLM Latency Distribution (p50/p95/p99)</h3>
<p><strong>Metric:</strong> <code>gen_ai.client.operation.duration</code> (histogram, stable in OTel 2026)</p>
<p>Track latency distribution by model and operation type. p99 latency matters for user-facing agents — if your p99 is 12 seconds, some users experience 12-second waits even if your median is 800ms. Percentile tracking requires a histogram metric, not an average.</p>
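<p>The gap between averages and tail percentiles shows up even in a tiny sample. This sketch computes percentiles from raw, made-up latencies with the standard library; a metrics backend would instead estimate them from histogram buckets, so treat it as an illustration of the idea rather than of how your backend calculates them.</p>

```python
import statistics

# Hypothetical per-call LLM latencies in seconds: mostly fast, with a slow tail.
latencies = [0.8] * 97 + [6.0, 9.0, 12.0]

mean = statistics.fmean(latencies)
# quantiles(n=100) returns the 99 cut points p1..p99.
cuts = statistics.quantiles(latencies, n=100, method="inclusive")
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"mean={mean:.2f}s p50={p50:.2f}s p95={p95:.2f}s p99={p99:.2f}s")
```

<p>With this sample the mean and median both sit near one second while p99 is roughly nine seconds, which is the experience a small share of users actually get.</p>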
<h3 id="4-agent-loop-iterations">4. Agent Loop Iterations</h3>
<p>Record <code>agent.iterations</code> on the root agent span so that it holds the final count when the run completes. A healthy agent typically resolves in 1-5 iterations. Runs exceeding 10 iterations usually indicate prompt issues or tool failures causing the agent to retry. An alert on <code>agent.iterations &gt; 8</code> gives early warning of runaway loops before they exhaust token budgets.</p>
<h3 id="5-context-window-utilization">5. Context Window Utilization</h3>
<p>Calculate <code>(input_tokens / model_context_window) * 100</code> per LLM call. When utilization exceeds 85%, you risk silent context truncation where the model loses early conversation history. Track this as a gauge metric by model — it informs when to implement context compression strategies.</p>
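<p>The calculation can be sketched in a few lines. The model names and context window sizes here are placeholders, not real model limits; look up the actual window for each model you call.</p>

```python
# Placeholder model names and window sizes; check your provider's documentation.
CONTEXT_WINDOWS = {"example-model-small": 32_000, "example-model-large": 128_000}
WARN_THRESHOLD_PCT = 85.0


def context_utilization_pct(input_tokens, model):
    """Percentage of the model's context window consumed by the prompt."""
    return input_tokens / CONTEXT_WINDOWS[model] * 100


def needs_compression(input_tokens, model):
    """True when utilization crosses the warning threshold."""
    return context_utilization_pct(input_tokens, model) > WARN_THRESHOLD_PCT


print(context_utilization_pct(64_000, "example-model-large"))
print(needs_compression(115_000, "example-model-large"))
print(needs_compression(1_248, "example-model-large"))
```

<p>Recording the percentage as a span attribute on each LLM call lets you chart it per model alongside <code>gen_ai.usage.input_tokens</code>.</p>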
<h3 id="6-end-to-end-latency">6. End-to-End Latency</h3>
<p>The duration of the root <code>agent.run</code> span, not individual LLM calls. This is the user-facing latency that maps to actual experience. An agent might have fast LLM calls but slow tool executions; only end-to-end latency catches that. SLA alerts should be set on this metric.</p>
<h2 id="distributed-tracing-for-multi-agent-and-tool-calling-workflows">Distributed Tracing for Multi-Agent and Tool-Calling Workflows</h2>
<p>Distributed tracing across agent boundaries is the hardest part of multi-agent observability, and the part where getting it wrong makes all other telemetry useless. When a coordinator agent calls a subagent via an HTTP API or a message queue, the trace context must propagate so that the subagent&rsquo;s spans appear as children of the coordinator&rsquo;s span in the same trace. Without propagation, you get disconnected traces: one for the coordinator, one for the subagent, with no way to link them. OpenTelemetry&rsquo;s default propagator implements the W3C Trace Context standard (<code>traceparent</code> and <code>tracestate</code> HTTP headers); instrumented HTTP clients and servers apply it automatically, and you can call <code>inject</code> and <code>extract</code> yourself when they are not instrumented. For async message passing, you inject the trace context into message headers and extract it on the consumer side. In a real multi-agent system, for example a coordinator that fans out to a research subagent, a writing subagent, and a fact-checking subagent, proper context propagation means a single trace ID covers the entire execution tree. You can see in one Jaeger view that the coordinator took 45 seconds total, the research subagent took 32 of those seconds (mostly waiting on a web search tool), and the writing subagent ran in 8 seconds. Without propagation, you would have four disconnected traces, one per agent, with no causal relationship visible between them. The code examples below show propagation for both HTTP and message queue communication patterns.</p>
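<p>It helps to know what actually travels in the <code>traceparent</code> header. The sketch below parses and rebuilds one by hand using only the standard library; the helper names are illustrative, validation is simplified compared with the full W3C specification, and in real code the OpenTelemetry propagator does this for you.</p>

```python
import re
import secrets

# traceparent layout: version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Split a traceparent header into its fields, or return None if malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero trace or parent IDs are invalid.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": int(flags, 16) % 2 == 1,  # lowest bit is the sampled flag
    }


def child_traceparent(parent):
    """Build the header a callee sends onward: same trace ID, fresh span ID."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes, 16 hex characters
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{new_span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
outgoing = child_traceparent(ctx)
print(ctx)
print(outgoing)
```

<p>The trace ID stays constant across every hop while the span ID changes, which is what lets a backend stitch coordinator and subagent spans into one tree.</p>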
<h3 id="http-based-agent-communication">HTTP-Based Agent Communication</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> opentelemetry.propagate <span style="color:#f92672">import</span> inject
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> opentelemetry <span style="color:#f92672">import</span> trace
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> httpx
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>tracer <span style="color:#f92672">=</span> trace<span style="color:#f92672">.</span>get_tracer(<span style="color:#e6db74">&#34;coordinator_agent&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">call_subagent</span>(task: str, subagent_url: str) <span style="color:#f92672">-&gt;</span> dict:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">with</span> tracer<span style="color:#f92672">.</span>start_as_current_span(<span style="color:#e6db74">&#34;coordinator.call_subagent&#34;</span>) <span style="color:#66d9ef">as</span> span:
</span></span><span style="display:flex;"><span>        headers <span style="color:#f92672">=</span> {}
</span></span><span style="display:flex;"><span>        inject(headers)  <span style="color:#75715e"># Injects traceparent and tracestate headers</span>
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        span<span style="color:#f92672">.</span>set_attribute(<span style="color:#e6db74">&#34;subagent.url&#34;</span>, subagent_url)
</span></span><span style="display:flex;"><span>        span<span style="color:#f92672">.</span>set_attribute(<span style="color:#e6db74">&#34;gen_ai.agent.name&#34;</span>, <span style="color:#e6db74">&#34;coordinator&#34;</span>)
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        response <span style="color:#f92672">=</span> httpx<span style="color:#f92672">.</span>post(
</span></span><span style="display:flex;"><span>            subagent_url,
</span></span><span style="display:flex;"><span>            json<span style="color:#f92672">=</span>{<span style="color:#e6db74">&#34;task&#34;</span>: task},
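</span></span><span style="display:flex;"><span>            headers<span style="color:#f92672">=</span>headers,
</span></span><span style="display:flex;"><span>            timeout<span style="color:#f92672">=</span><span style="color:#ae81ff">120.0</span>,  <span style="color:#75715e"># httpx defaults to 5 seconds, too short for most agent runs</span>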
</span></span><span style="display:flex;"><span>        )
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> response<span style="color:#f92672">.</span>json()
</span></span></code></pre></div><p>On the subagent side:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> opentelemetry.propagate <span style="color:#f92672">import</span> extract
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> opentelemetry <span style="color:#f92672">import</span> trace
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> flask <span style="color:#f92672">import</span> Flask, request
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>app <span style="color:#f92672">=</span> Flask(__name__)
</span></span><span style="display:flex;"><span>tracer <span style="color:#f92672">=</span> trace<span style="color:#f92672">.</span>get_tracer(<span style="color:#e6db74">&#34;subagent&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">@app.post</span>(<span style="color:#e6db74">&#34;/run&#34;</span>)
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">run_subagent</span>():
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Extract trace context from incoming request headers</span>
</span></span><span style="display:flex;"><span>    ctx <span style="color:#f92672">=</span> extract(request<span style="color:#f92672">.</span>headers)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">with</span> tracer<span style="color:#f92672">.</span>start_as_current_span(
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;subagent.run&#34;</span>,
</span></span><span style="display:flex;"><span>        context<span style="color:#f92672">=</span>ctx,  <span style="color:#75715e"># Links this span to coordinator&#39;s trace</span>
</span></span><span style="display:flex;"><span>        attributes<span style="color:#f92672">=</span>{<span style="color:#e6db74">&#34;gen_ai.agent.name&#34;</span>: <span style="color:#e6db74">&#34;research_subagent&#34;</span>}
</span></span><span style="display:flex;"><span>    ) <span style="color:#66d9ef">as</span> span:
</span></span><span style="display:flex;"><span>        result <span style="color:#f92672">=</span> execute_research_task(request<span style="color:#f92672">.</span>json[<span style="color:#e6db74">&#34;task&#34;</span>])
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> {<span style="color:#e6db74">&#34;result&#34;</span>: result}
</span></span></code></pre></div><p>The result: coordinator call + subagent execution + all LLM calls inside both appear in a single trace in Jaeger or Grafana.</p>
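<p>For intuition about what <code>inject</code> and <code>extract</code> move between services: with the default W3C Trace Context propagator, the context travels in a <code>traceparent</code> header of the form <code>version-trace_id-parent_id-flags</code>. The sketch below builds and parses that header with the standard library only. The helper names <code>format_traceparent</code> and <code>parse_traceparent</code> are illustrative and are not part of the OpenTelemetry API.</p>

```python
import re
import secrets
from typing import Optional

# W3C Trace Context: version "00", 32-hex trace id, 16-hex parent span id, 2-hex flags
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def format_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Build the header value a coordinator would send (illustrative helper)."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> Optional[dict]:
    """Return the parsed fields, or None if the header is malformed."""
    match = _TRACEPARENT_RE.match(header.strip().lower())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero ids are invalid per the specification
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": span_id,
        # The lowest bit of the flags byte is the "sampled" flag
        "sampled": int(flags, 16) % 2 == 1,
    }


# Coordinator side: a new trace id plus the id of the span making the call
trace_id = secrets.token_hex(16)
coordinator_span_id = secrets.token_hex(8)
headers = {"traceparent": format_traceparent(trace_id, coordinator_span_id)}

# Subagent side: the same trace id comes out, so both spans land in one trace
parsed = parse_traceparent(headers["traceparent"])
```

<p>If the subagent receives no header, or a malformed one, it starts a new trace instead. In the backend that typically shows up as two disconnected traces rather than one.</p>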
<h3 id="message-queue-propagation-kafkaredis-streams">Message Queue Propagation (Kafka/Redis Streams)</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Producer (coordinator)</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> opentelemetry.propagate <span style="color:#f92672">import</span> inject
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">enqueue_task</span>(task: dict, producer):
</span></span><span style="display:flex;"><span>    headers <span style="color:#f92672">=</span> {}
</span></span><span style="display:flex;"><span>    inject(headers)
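</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Kafka header values must be bytes (kafka-python-style client assumed)</span>
</span></span><span style="display:flex;"><span>    producer<span style="color:#f92672">.</span>send(<span style="color:#e6db74">&#34;agent_tasks&#34;</span>, value<span style="color:#f92672">=</span>task, headers<span style="color:#f92672">=</span>[(k, v<span style="color:#f92672">.</span>encode()) <span style="color:#66d9ef">for</span> k, v <span style="color:#f92672">in</span> headers<span style="color:#f92672">.</span>items()])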
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Consumer (subagent worker)</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> opentelemetry.propagate <span style="color:#f92672">import</span> extract
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">process_task</span>(message):
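</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Header values arrive as bytes; decode them before extracting</span>
</span></span><span style="display:flex;"><span>    headers <span style="color:#f92672">=</span> {k: v<span style="color:#f92672">.</span>decode() <span style="color:#66d9ef">for</span> k, v <span style="color:#f92672">in</span> (message<span style="color:#f92672">.</span>headers <span style="color:#f92672">or</span> [])}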
</span></span><span style="display:flex;"><span>    ctx <span style="color:#f92672">=</span> extract(headers)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">with</span> tracer<span style="color:#f92672">.</span>start_as_current_span(<span style="color:#e6db74">&#34;subagent.process&#34;</span>, context<span style="color:#f92672">=</span>ctx):
</span></span><span style="display:flex;"><span>        execute_task(message<span style="color:#f92672">.</span>value)
</span></span></code></pre></div><h3 id="baggage-for-session-level-context">Baggage for Session-Level Context</h3>
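<p>Use OpenTelemetry Baggage to carry session IDs, user tiers, and feature flags across agent boundaries alongside the trace context. Baggage travels in request headers; it is not copied onto spans unless a service reads the value and sets it as a span attribute:</p>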
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> opentelemetry.baggage <span style="color:#f92672">import</span> set_baggage, get_baggage
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> opentelemetry <span style="color:#f92672">import</span> context
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Set at entry point</span>
</span></span><span style="display:flex;"><span>ctx <span style="color:#f92672">=</span> set_baggage(<span style="color:#e6db74">&#34;session.id&#34;</span>, session_id)
</span></span><span style="display:flex;"><span>ctx <span style="color:#f92672">=</span> set_baggage(<span style="color:#e6db74">&#34;user.tier&#34;</span>, <span style="color:#e6db74">&#34;premium&#34;</span>, context<span style="color:#f92672">=</span>ctx)
</span></span><span style="display:flex;"><span>
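</span></span><span style="display:flex;"><span><span style="color:#75715e"># Attach the context so the baggage is active and inject() adds it to outgoing headers</span>
</span></span><span style="display:flex;"><span>token <span style="color:#f92672">=</span> context<span style="color:#f92672">.</span>attach(ctx)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># In the subagent, read it from the context returned by extract()</span>
</span></span><span style="display:flex;"><span>session_id <span style="color:#f92672">=</span> get_baggage(<span style="color:#e6db74">&#34;session.id&#34;</span>, ctx)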
</span></span></code></pre></div><h2 id="choosing-your-observability-backend-self-hosted-vs-managed">Choosing Your Observability Backend (Self-Hosted vs Managed)</h2>
<p>The choice between self-hosted and managed observability backends for AI agents comes down to three factors: data residency requirements, engineering capacity for ops, and cost at scale. OTel in production nearly doubled year-over-year from 6% to 11% among enterprises in 2026, with 89% rating vendor compliance with GenAI conventions as critical. The good news: any backend that accepts OTLP works — you are not locked to any vendor at the SDK layer. The trade-off is operational overhead vs monthly SaaS spend. A managed backend like Grafana Cloud or Honeycomb costs roughly $20–$200/month for a medium-traffic AI agent deployment and requires zero ops work. A self-hosted Jaeger + VictoriaMetrics stack requires maintaining the infrastructure but gives you full control over data retention, no per-event pricing, and no data leaving your environment — critical for healthcare or financial services applications subject to HIPAA or SOC 2 requirements. Langfuse occupies a middle ground: open source and self-hostable, but with a managed cloud tier if you want LLM-native features without the ops overhead. The comparison table below shows GenAI-specific feature support across the major options so you can match backend capabilities to your observability requirements.</p>
<h3 id="comparison-table">Comparison Table</h3>
<table>
  <thead>
      <tr>
          <th>Backend</th>
          <th>Type</th>
          <th>GenAI Support</th>
          <th>Cost Model</th>
          <th>Best For</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Grafana Cloud</strong></td>
          <td>Managed</td>
          <td>Native GenAI dashboards</td>
          <td>Free tier + usage</td>
          <td>Most teams starting out</td>
      </tr>
      <tr>
          <td><strong>Honeycomb</strong></td>
          <td>Managed</td>
          <td>Full attribute querying</td>
          <td>Per event</td>
          <td>High-cardinality debugging</td>
      </tr>
      <tr>
          <td><strong>Langfuse</strong></td>
          <td>Managed + OSS</td>
          <td>LLM-native, 21K+ GitHub stars</td>
          <td>Free OSS / managed</td>
          <td>LLM-first observability</td>
      </tr>
      <tr>
          <td><strong>Jaeger</strong></td>
          <td>Self-hosted</td>
          <td>Standard OTel traces</td>
          <td>Infrastructure cost</td>
          <td>Dev/test, cost-sensitive</td>
      </tr>
      <tr>
          <td><strong>Grafana + Tempo</strong></td>
          <td>Self-hosted</td>
          <td>Custom dashboards</td>
          <td>Infrastructure cost</td>
          <td>Full control, data residency</td>
      </tr>
      <tr>
          <td><strong>VictoriaMetrics</strong></td>
          <td>Self-hosted</td>
          <td>Prometheus-compatible metrics</td>
          <td>Infrastructure cost</td>
          <td>Metrics-heavy workloads</td>
      </tr>
  </tbody>
</table>
<h3 id="managed-grafana-cloud">Managed: Grafana Cloud</h3>
<p>Grafana Cloud accepts OTLP traces, metrics, and logs from the same endpoint. Their AI/LLM dashboard templates include token usage panels, latency percentile histograms, and cost aggregation by agent. The free tier covers 50GB of logs and 10K traces/month — enough for a medium-traffic development environment.</p>
<h3 id="self-hosted-langfuse">Self-Hosted: Langfuse</h3>
<p>Langfuse is the most popular open-source LLM observability platform (21,000+ GitHub stars by early 2026). It provides a purpose-built UI for LLM traces with session views, prompt management, and evaluation tooling that generic APM tools lack. Deploy with Docker Compose for single-node or Kubernetes for production:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>git clone https://github.com/langfuse/langfuse
</span></span><span style="display:flex;"><span>cd langfuse
</span></span><span style="display:flex;"><span>docker compose up -d
</span></span></code></pre></div><p>Then point openlit at the Langfuse OTLP endpoint. Langfuse also maintains a Python SDK for direct integration if you prefer to skip the OTel SDK layer.</p>
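<p>As a rough sketch of what "pointing at the Langfuse OTLP endpoint" involves: the helper below builds the standard OTLP exporter environment variables for a Langfuse instance. It assumes Langfuse's OTLP/HTTP path <code>/api/public/otel</code> and Basic auth derived from the project's public and secret keys; confirm both against the Langfuse documentation for your version. The function name <code>langfuse_otlp_env</code> and the key values are hypothetical.</p>

```python
import base64


def langfuse_otlp_env(base_url: str, public_key: str, secret_key: str) -> dict:
    """Build OTLP exporter environment variables for a Langfuse instance.

    Assumes the OTLP/HTTP path /api/public/otel and Basic auth built from
    the project's public and secret keys.
    """
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    return {
        "OTEL_EXPORTER_OTLP_ENDPOINT": base_url.rstrip("/") + "/api/public/otel",
        "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
        "OTEL_EXPORTER_OTLP_HEADERS": f"Authorization=Basic {token}",
    }


# Placeholder keys for a local Docker Compose deployment (hypothetical values)
env = langfuse_otlp_env("http://localhost:3000", "pk-lf-example", "sk-lf-example")
```

<p>Exporting these variables before the application starts keeps the instrumentation code itself free of backend-specific settings.</p>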
<h3 id="self-hosted-jaeger--opentelemetry-collector">Self-Hosted: Jaeger + OpenTelemetry Collector</h3>
<p>The minimal self-hosted stack for production:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#75715e"># docker-compose.yml</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">services</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">otel-collector</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">image</span>: <span style="color:#ae81ff">otel/opentelemetry-collector-contrib:latest</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">command</span>: [<span style="color:#e6db74">&#34;--config=/etc/otel-collector-config.yaml&#34;</span>]
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">volumes</span>:
</span></span><span style="display:flex;"><span>      - <span style="color:#ae81ff">./otel-collector-config.yaml:/etc/otel-collector-config.yaml</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">ports</span>:
</span></span><span style="display:flex;"><span>      - <span style="color:#e6db74">&#34;4317:4317&#34;</span>  <span style="color:#75715e"># OTLP gRPC</span>
</span></span><span style="display:flex;"><span>      - <span style="color:#e6db74">&#34;4318:4318&#34;</span>  <span style="color:#75715e"># OTLP HTTP</span>
</span></span><span style="display:flex;"><span>  
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">jaeger</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">image</span>: <span style="color:#ae81ff">jaegertracing/all-in-one:latest</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">ports</span>:
</span></span><span style="display:flex;"><span>      - <span style="color:#e6db74">&#34;16686:16686&#34;</span>  <span style="color:#75715e"># UI</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#75715e"># otel-collector-config.yaml</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">receivers</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">otlp</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">protocols</span>:
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">grpc</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">endpoint</span>: <span style="color:#e6db74">&#34;0.0.0.0:4317&#34;</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">http</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">endpoint</span>: <span style="color:#e6db74">&#34;0.0.0.0:4318&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">processors</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#75715e"># Mask prompt content for GDPR compliance. blocked_values matches attribute values,</span>
</span></span><span style="display:flex;"><span>  <span style="color:#75715e"># so mask by key with blocked_key_patterns (recent collector-contrib releases)</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">redaction</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">allow_all_keys</span>: <span style="color:#66d9ef">true</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">blocked_key_patterns</span>:
</span></span><span style="display:flex;"><span>      - <span style="color:#e6db74">&#34;gen_ai.content.prompt&#34;</span>
</span></span><span style="display:flex;"><span>      - <span style="color:#e6db74">&#34;gen_ai.content.completion&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">exporters</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#75715e"># Recent collector releases no longer ship a dedicated jaeger exporter; Jaeger ingests OTLP natively</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">otlp/jaeger</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">endpoint</span>: <span style="color:#ae81ff">jaeger:4317</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">tls</span>:
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">insecure</span>: <span style="color:#66d9ef">true</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">service</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">pipelines</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">traces</span>:
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">receivers</span>: [<span style="color:#ae81ff">otlp]</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">processors</span>: [<span style="color:#ae81ff">redaction]</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">exporters</span>: [<span style="color:#ae81ff">otlp/jaeger]</span>
</span></span></code></pre></div><h2 id="production-deployment-checklist-dev-to-prod-in-one-guide">Production Deployment Checklist: Dev to Prod in One Guide</h2>
<p>Moving AI agent observability from a local Jaeger instance to production means addressing four concerns that do not exist in development: authentication, cardinality, sampling, and alerting. Handled badly, each one either degrades your observability (overly aggressive sampling) or destabilizes your backend (unbounded cardinality). The checklist below describes a production deployment in which observability is a first-class requirement from day one rather than something bolted on after the first incident. Authentication is the most urgent: an unauthenticated OTLP exporter can fail silently against a production backend, leaving you with zero traces and no error in application logs. Cardinality problems typically appear 2–4 weeks after launch, when someone adds a user-ID-based span attribute and your metrics cardinality explodes. Sampling decisions made before you understand your traffic patterns are almost always wrong, so start with 100% of traces in production for the first week, then tune down with tail-based sampling once you know what &ldquo;normal&rdquo; looks like. The alerting section below maps the essential metrics from the previous section to specific alert conditions that you can translate into rules in Grafana or your alerting tool of choice.</p>
<h3 id="authentication-and-transport-security">Authentication and Transport Security</h3>
<ul>
<li><input disabled="" type="checkbox"> Replace unauthenticated OTLP exporter with authenticated connection using API keys or mTLS</li>
<li><input disabled="" type="checkbox"> Rotate API keys for observability backends on the same schedule as other service credentials</li>
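<li><input disabled="" type="checkbox"> Ensure the OTLP exporter uses TLS with certificate verification enabled (for example, <code>insecure_skip_verify: false</code> in Collector TLS settings, and never <code>insecure: true</code> in production)</li>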
<li><input disabled="" type="checkbox"> Validate that <code>capture_message_content=False</code> is set in production (opt-in content capture only in dev/staging)</li>
</ul>
<h3 id="cardinality-management">Cardinality Management</h3>
<ul>
<li><input disabled="" type="checkbox"> Never put unbounded values (user IDs, session IDs, full URLs) in span names or attribute keys; record them only as span attribute values, and keep them off metric attributes</li>
<li><input disabled="" type="checkbox"> Cap <code>agent.task</code> attribute to 100 characters to avoid high-cardinality string fields</li>
<li><input disabled="" type="checkbox"> Use <code>gen_ai.request.model</code> (the standardized attribute) instead of a custom model attribute — this ensures consistent cardinality across frameworks</li>
<li><input disabled="" type="checkbox"> Review tool name attributes — if tool names are dynamically generated, normalize them to a fixed set</li>
</ul>
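<p>Two of the checklist items above, capping <code>agent.task</code> and normalizing dynamically generated tool names, can be sketched as small helpers applied before the values are set as span attributes. The tool set in <code>KNOWN_TOOLS</code>, the suffix pattern, and the helper names are assumptions chosen for illustration; adapt them to the tools your agent actually exposes.</p>

```python
import re

MAX_TASK_CHARS = 100
# Hypothetical fixed set of tool names the agent is expected to expose
KNOWN_TOOLS = {"web_search", "sql_query", "summarize"}

# Trailing generated suffix such as "_7f3a" or "-42" (must contain a digit)
_SUFFIX_RE = re.compile(r"[-_][0-9a-f]*[0-9][0-9a-f]*$")


def bounded_task(task: str, limit: int = MAX_TASK_CHARS) -> str:
    """Truncate free-text task descriptions before using them as an attribute value."""
    return task[:limit]


def normalize_tool_name(raw: str) -> str:
    """Map dynamically generated tool names onto a fixed, low-cardinality set."""
    base = _SUFFIX_RE.sub("", raw.strip().lower())
    return base if base in KNOWN_TOOLS else "other"


# Attribute values that are safe to attach to a span
attributes = {
    "agent.task": bounded_task("Summarize the quarterly report " * 20),
    "gen_ai.tool.name": normalize_tool_name("web_search_7f3a"),
}
```

<p>Collapsing unknown names into <code>"other"</code> keeps the attribute bounded even when a framework invents new tool identifiers at runtime.</p>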
<h3 id="sampling-strategy">Sampling Strategy</h3>
<p>For production AI agents, <strong>tail-based sampling</strong> is the right default: sample 100% of errored traces, 100% of traces exceeding your p95 latency threshold, and 5-10% of successful fast traces. Head-based sampling at 10% will randomly drop slow or errored traces, defeating the purpose.</p>
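<p>The decision logic behind tail-based sampling can be sketched in plain Python. This is a simplified model rather than the Collector's implementation: <code>TraceSummary</code> and <code>keep_trace</code> are hypothetical names, and the 5000 ms threshold and 10% baseline are example values.</p>

```python
import random
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """What is known about a trace once all of its spans have arrived."""
    has_error: bool
    duration_ms: float


def keep_trace(trace: TraceSummary, latency_threshold_ms: float = 5000,
               baseline_rate: float = 0.10, rng=random.random) -> bool:
    """Decide after the trace completes; any matching policy keeps it."""
    if trace.has_error:
        return True  # always keep errored traces
    if trace.duration_ms > latency_threshold_ms:
        return True  # always keep slow traces
    return rng() < baseline_rate  # keep a random share of healthy, fast traces


# A slow trace is kept no matter what the random draw is
slow_kept = keep_trace(TraceSummary(has_error=False, duration_ms=9000))
```

<p>A head-based sampler has to decide before <code>has_error</code> and <code>duration_ms</code> are known, which is why it cannot guarantee that the interesting traces survive.</p>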
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#75715e"># otel-collector sampling config</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">processors</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">tail_sampling</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">decision_wait</span>: <span style="color:#ae81ff">10s</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">policies</span>:
</span></span><span style="display:flex;"><span>      - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">errors-policy</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">type</span>: <span style="color:#ae81ff">status_code</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">status_code</span>: {<span style="color:#f92672">status_codes</span>: [<span style="color:#ae81ff">ERROR]}</span>
</span></span><span style="display:flex;"><span>      - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">slow-traces-policy</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">type</span>: <span style="color:#ae81ff">latency</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">latency</span>: {<span style="color:#f92672">threshold_ms</span>: <span style="color:#ae81ff">5000</span>}
</span></span><span style="display:flex;"><span>      - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">probabilistic-policy</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">type</span>: <span style="color:#ae81ff">probabilistic</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">probabilistic</span>: {<span style="color:#f92672">sampling_percentage</span>: <span style="color:#ae81ff">10</span>}
</span></span></code></pre></div><h3 id="alerting-configuration">Alerting Configuration</h3>
<p>Set up these five alerts at minimum:</p>
<table>
  <thead>
      <tr>
          <th>Alert</th>
          <th>Condition</th>
          <th>Severity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>High agent loop iterations</td>
          <td><code>agent.iterations &gt; 8</code> for &gt;1% of runs</td>
          <td>Warning</td>
      </tr>
      <tr>
          <td>Tool call failure spike</td>
          <td>Tool success rate drops below 90%</td>
          <td>Critical</td>
      </tr>
      <tr>
          <td>Token budget exceeded</td>
          <td><code>gen_ai.usage.input_tokens</code> &gt; model limit × 0.9</td>
          <td>Warning</td>
      </tr>
      <tr>
          <td>End-to-end latency p99</td>
          <td>Agent run duration p99 &gt; 30s</td>
          <td>Critical</td>
      </tr>
      <tr>
          <td>Trace loss</td>
          <td>No traces received from agent in 5 min</td>
          <td>Critical</td>
      </tr>
  </tbody>
</table>
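<p>To make three of the conditions in the table concrete, here is a minimal sketch that evaluates them over a window of recorded runs. In practice these checks would be expressed as queries in your alerting tool; the function names, input shapes, and alert labels are assumptions for illustration.</p>

```python
import math


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile; assumes a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_alerts(run_durations_s: list, iterations: list, tool_results: list) -> list:
    """Return the labels of the alert conditions that are currently firing."""
    firing = []
    # High agent loop iterations: more than 8 iterations in over 1% of runs
    over_limit = sum(1 for n in iterations if n > 8) / len(iterations)
    if over_limit > 0.01:
        firing.append("high_agent_loop_iterations")
    # Tool call failure spike: success rate below 90%
    success_rate = sum(tool_results) / len(tool_results)
    if success_rate < 0.90:
        firing.append("tool_call_failure_spike")
    # End-to-end latency: p99 of agent run duration above 30 seconds
    if percentile(run_durations_s, 99) > 30:
        firing.append("latency_p99")
    return firing


# Example window: 100 runs, three of them looping too long, two very slow
firing = evaluate_alerts(
    run_durations_s=[2.0] * 98 + [45.0, 50.0],
    iterations=[3] * 97 + [9, 10, 12],
    tool_results=[True] * 95 + [False] * 5,
)
```

<p>With this sample window, the iteration and latency conditions fire while the tool success rate (95%) stays above its threshold.</p>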
<h3 id="instrumentation-coverage-verification">Instrumentation Coverage Verification</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Add this to your CI pipeline to verify instrumentation is active</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> opentelemetry.trace <span style="color:#66d9ef">as</span> trace_api
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">test_tracer_configured</span>():
</span></span><span style="display:flex;"><span>    tracer <span style="color:#f92672">=</span> trace_api<span style="color:#f92672">.</span>get_tracer(<span style="color:#e6db74">&#34;test&#34;</span>)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">assert</span> <span style="color:#f92672">not</span> isinstance(
</span></span><span style="display:flex;"><span>        tracer, trace_api<span style="color:#f92672">.</span>ProxyTracer
</span></span><span style="display:flex;"><span>    ), <span style="color:#e6db74">&#34;TracerProvider not configured — spans will be no-ops&#34;</span>
</span></span></code></pre></div><p>A no-op TracerProvider is the silent failure mode: your code runs, no errors appear, and no traces arrive. This test catches that in CI before it reaches production.</p>
<h3 id="environment-variable-configuration">Environment Variable Configuration</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Production environment</span>
</span></span><span style="display:flex;"><span>OTEL_SERVICE_NAME<span style="color:#f92672">=</span>my-production-agent
</span></span><span style="display:flex;"><span>OTEL_RESOURCE_ATTRIBUTES<span style="color:#f92672">=</span>deployment.environment<span style="color:#f92672">=</span>production,service.version<span style="color:#f92672">=</span>1.2.3
</span></span><span style="display:flex;"><span>OTEL_EXPORTER_OTLP_ENDPOINT<span style="color:#f92672">=</span>https://your-otlp-endpoint:4317
</span></span><span style="display:flex;"><span>OTEL_EXPORTER_OTLP_HEADERS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Authorization=Bearer </span><span style="color:#e6db74">${</span>OTLP_API_KEY<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>
</span></span><span style="display:flex;"><span>OTEL_TRACES_SAMPLER<span style="color:#f92672">=</span>parentbased_always_on  <span style="color:#75715e"># Let collector handle tail sampling</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Dev override</span>
</span></span><span style="display:flex;"><span>OTEL_EXPORTER_OTLP_ENDPOINT<span style="color:#f92672">=</span>http://localhost:4317
</span></span><span style="display:flex;"><span>OPENLIT_CAPTURE_MESSAGE_CONTENT<span style="color:#f92672">=</span>true
</span></span></code></pre></div><p>Keep all observability configuration in environment variables — never hardcode endpoints in instrumentation code.</p>
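<p>Because a misconfigured exporter tends to fail quietly, it can help to validate these variables at startup. The following is a minimal sketch using only the standard library; <code>check_otel_env</code> is a hypothetical helper, and the rules it enforces (HTTPS and credentials required when <code>deployment.environment=production</code>) are assumptions based on the checklist above.</p>

```python
import os
from urllib.parse import urlparse


def check_otel_env(env=None) -> list:
    """Return a list of configuration problems; an empty list means OK."""
    env = os.environ if env is None else env
    problems = []
    for name in ("OTEL_SERVICE_NAME", "OTEL_EXPORTER_OTLP_ENDPOINT"):
        if not env.get(name):
            problems.append(f"{name} is not set")
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "")
    resource = env.get("OTEL_RESOURCE_ATTRIBUTES", "")
    is_production = "deployment.environment=production" in resource
    if is_production and endpoint:
        # Production traffic should be encrypted and authenticated
        if urlparse(endpoint).scheme != "https":
            problems.append("production endpoint must use https")
        if not env.get("OTEL_EXPORTER_OTLP_HEADERS"):
            problems.append("OTEL_EXPORTER_OTLP_HEADERS is not set (no credentials)")
    return problems


# Example: a production config that still points at the dev endpoint
problems = check_otel_env({
    "OTEL_SERVICE_NAME": "my-production-agent",
    "OTEL_RESOURCE_ATTRIBUTES": "deployment.environment=production,service.version=1.2.3",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
})
```

<p>Failing fast on a non-empty result at startup surfaces the mistake in deployment logs instead of as missing traces days later.</p>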
<hr>
<h2 id="faq">FAQ</h2>
<p><strong>What are the OpenTelemetry GenAI semantic conventions and why do they matter?</strong></p>
<p>OpenTelemetry GenAI semantic conventions are the standardized set of attribute names, span types, and metric definitions that give AI/LLM telemetry a consistent structure across every framework and vendor. They matter because without them, a LangChain trace uses different attribute names than an OpenAI trace, making aggregation and alerting across your stack impossible. In 2026, core client spans and token usage metrics are stable, meaning you can build production dashboards on them with little risk of breaking changes.</p>
<p><strong>How do I instrument an OpenAI or Anthropic client without changing my application code?</strong></p>
<p>Call <code>openlit.init()</code> before your first API call. It monkey-patches the OpenAI and Anthropic client libraries, so calls such as <code>client.chat.completions.create()</code> (OpenAI) or <code>client.messages.create()</code> (Anthropic) generate a span with GenAI attributes without any changes to your existing agent logic. You only need to configure the TracerProvider once at application startup.</p>
<p><strong>What is the difference between session-level and request-level observability for AI agents?</strong></p>
<p>Request-level observability covers a single LLM call: model, tokens, latency. Session-level observability covers the entire user interaction: all agent runs, all tool calls, cost totals, and outcome. Session-level requires threading a <code>session.id</code> through every span and aggregating across multiple traces. OpenTelemetry Baggage is the standard mechanism for carrying session context across agent boundaries; each service then reads the value from baggage and sets it as a span attribute, or registers a span processor that does so, instead of passing it through every function call by hand.</p>
<p><strong>How do I avoid storing sensitive user data (PII) in my traces?</strong></p>
<p>Set <code>capture_message_content=False</code> in openlit, and set it explicitly in production rather than relying on the library default. GenAI conventions separate prompt/completion content into span events rather than attributes, so stripping them at the OpenTelemetry Collector level is straightforward: add a <code>redaction</code> processor that blocks <code>gen_ai.content.prompt</code> and <code>gen_ai.content.completion</code> event attributes. This gives you full token usage and latency observability without storing any message content.</p>
<p><strong>Which observability backend should I use for AI agents in 2026?</strong></p>
<p>For most teams: Grafana Cloud for its generous free tier, native OTLP support, and pre-built LLM dashboard templates. For LLM-specific features like prompt management and evaluation: Langfuse (open source, 21K+ GitHub stars). For maximum flexibility and self-hosting: Jaeger + OpenTelemetry Collector + VictoriaMetrics. All three work with the same OTel SDK instrumentation — you switch backends by changing an endpoint URL, not rewriting instrumentation code.</p>
]]></content:encoded></item><item><title>Arize Phoenix Guide: Open-Source LLM Observability for Developers (2026)</title><link>https://baeseokjae.github.io/posts/arize-phoenix-observability-guide-2026/</link><pubDate>Sun, 17 May 2026 15:03:42 +0000</pubDate><guid>https://baeseokjae.github.io/posts/arize-phoenix-observability-guide-2026/</guid><description>Step-by-step guide to Arize Phoenix: install, instrument LLM apps, trace RAG pipelines, and run evals — all open-source, zero vendor lock-in.</description><content:encoded><![CDATA[<p>Arize Phoenix is a free, open-source LLM observability platform that gives developers full-stack visibility into LLM applications — tracing requests, evaluating outputs, and debugging RAG pipelines — without requiring a cloud subscription or vendor account. It runs locally in a Python process or scales to Docker and Kubernetes for production deployments.</p>
<h2 id="what-is-arize-phoenix-and-why-it-matters-in-2026">What Is Arize Phoenix and Why It Matters in 2026</h2>
<p>Arize Phoenix is an open-source observability platform built specifically for LLM applications, agents, and retrieval-augmented generation (RAG) pipelines. Unlike generic APM tools, Phoenix understands LLM-native concepts — spans, traces, embeddings, prompts, retrieved contexts, and model outputs — and surfaces them in a UI designed for AI engineers. As of 2026, Phoenix has surpassed 9,000 GitHub stars, making it one of the most-adopted open-source observability tools in the AI ecosystem. The platform is backed by Arize AI but released under a permissive open-source license, meaning you can run it entirely on your own infrastructure with no usage caps or feature gating.</p>
<p>The urgency behind Phoenix adoption is clear: the LLM observability market is growing from $1.97B in 2025 to $2.69B in 2026 at a 36.3% CAGR, and Gartner predicts that by 2028, observability will be embedded in 50% of GenAI deployments — up from just 15% today. Yet 57% of organizations already running AI agents in production rate observability as the lowest-quality part of their AI stack. Phoenix exists to close that gap for teams who can&rsquo;t afford to ship LLM apps blind, and who want to own their trace data rather than send it to a SaaS vendor.</p>
<h2 id="what-core-features-does-phoenix-offer">What Core Features Does Phoenix Offer?</h2>
<p>Arize Phoenix ships four interconnected capabilities that cover the full LLM development lifecycle: tracing, evaluation, dataset management, and a prompt playground. Together they form a workflow loop: trace what your app is doing, evaluate whether outputs meet quality thresholds, curate failure cases into datasets, and iterate on prompts in the playground before deploying changes. This feedback loop is the key reason teams migrate from generic logging to Phoenix — instead of reading raw JSON logs, engineers see structured span trees, latency breakdowns per retrieval step, and LLM judge scores alongside the actual model outputs.</p>
<p><strong>Tracing</strong> captures every span in an LLM workflow as an OpenTelemetry trace. A single user request to a RAG pipeline generates spans for the embedding call, vector DB retrieval, context concatenation, and final LLM generation — each with token counts, latency, and input/output payloads.</p>
<p><strong>Evaluation</strong> runs 50+ research-backed metrics including hallucination detection, relevance, Q&amp;A correctness, toxicity, and faithfulness. These can run in the Phoenix UI as one-off evals or in CI via the <code>phoenix.evals</code> Python API.</p>
<p><strong>Dataset management</strong> lets you export traces — especially failure cases — into labeled datasets for fine-tuning or regression testing.</p>
<p><strong>Prompt playground</strong> connects to your LLM provider APIs and lets you replay any captured trace against modified prompts to A/B test prompt changes against real historical inputs.</p>
<h2 id="how-do-you-install-phoenix-in-5-minutes">How Do You Install Phoenix in 5 Minutes?</h2>
<p>Phoenix installs via pip and launches as a local web server that requires no external dependencies for basic usage. The minimum viable setup takes under five minutes and works in any Python 3.9+ environment, including notebooks, Docker containers, and CI runners.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>pip install arize-phoenix arize-phoenix-otel openinference-instrumentation-openai
</span></span></code></pre></div><p>Then start the Phoenix server and point your app at it:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> phoenix <span style="color:#66d9ef">as</span> px
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Start Phoenix server (opens UI at http://localhost:6006)</span>
</span></span><span style="display:flex;"><span>session <span style="color:#f92672">=</span> px<span style="color:#f92672">.</span>launch_app()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Configure OpenTelemetry to send traces to Phoenix</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> phoenix.otel <span style="color:#f92672">import</span> register
</span></span><span style="display:flex;"><span>tracer_provider <span style="color:#f92672">=</span> register(
</span></span><span style="display:flex;"><span>    project_name<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;my-llm-app&#34;</span>,
</span></span><span style="display:flex;"><span>    endpoint<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;http://localhost:6006/v1/traces&#34;</span>,
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><p>For Docker, a single command pulls and starts the full server:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>docker run -p 6006:6006 -p 4317:4317 arizephoenix/phoenix:latest
</span></span></code></pre></div><p>Port <code>6006</code> serves the web UI and the OTLP HTTP endpoint (<code>/v1/traces</code>). Port <code>4317</code> is the OpenTelemetry OTLP gRPC ingest endpoint. To persist traces across restarts, mount a volume (for example at <code>/mnt/data</code>) and set <code>PHOENIX_WORKING_DIR</code> to that path.</p>
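<p>Once the container is up, a quick reachability check can save a round of debugging. The following is a minimal sketch using only the Python standard library; the helper names <code>port_open</code> and <code>check_phoenix_ports</code> are hypothetical, and the host and ports assume the <code>docker run</code> mapping shown above:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts
        return False


def check_phoenix_ports(host="localhost", ports=(6006, 4317)):
    """Map each port to whether it currently accepts TCP connections."""
    return {port: port_open(host, port) for port in ports}
</code></pre></div><p>Calling <code>check_phoenix_ports()</code> returns a dictionary keyed by port. A <code>False</code> value usually points to a missing <code>-p</code> mapping or a container that has not finished starting.</p>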
<h3 id="notebook-usage">Notebook Usage</h3>
<p>In Jupyter or Colab environments, <code>px.launch_app()</code> renders an embedded iframe directly in the notebook cell output. No separate terminal or process management required — Phoenix starts as a background thread within the kernel, making it ideal for exploratory data analysis on LLM outputs.</p>
<h2 id="how-does-opentelemetry-auto-instrumentation-work-with-phoenix">How Does OpenTelemetry Auto-Instrumentation Work with Phoenix?</h2>
<p>Phoenix uses OpenTelemetry (OTel) as its trace collection standard, which means it benefits from a growing ecosystem of vendor-neutral instrumentation libraries. Auto-instrumentation patches popular LLM SDKs at runtime, when you call the instrumentor&rsquo;s <code>instrument()</code> method: you add two lines of code and Phoenix captures every API call made through the patched client, with no manual span creation required.</p>
<p>OpenTelemetry instrumentation in Phoenix works through the <code>openinference</code> family of packages. These packages implement OpenInference, a set of OTel-compatible semantic conventions for LLM-specific data: input messages, output messages, token usage, model name, embedding vectors, retrieved documents, and tool calls. When you call <code>OpenAIInstrumentor().instrument()</code>, the instrumentor monkey-patches the OpenAI Python client so every <code>client.chat.completions.create()</code> call emits a span with the request/response payload attached.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> openinference.instrumentation.openai <span style="color:#f92672">import</span> OpenAIInstrumentor
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>OpenAIInstrumentor()<span style="color:#f92672">.</span>instrument(tracer_provider<span style="color:#f92672">=</span>tracer_provider)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Now every OpenAI call is automatically traced</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> openai
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> openai<span style="color:#f92672">.</span>OpenAI()
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>chat<span style="color:#f92672">.</span>completions<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>    messages<span style="color:#f92672">=</span>[{<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">&#34;Explain observability in one sentence.&#34;</span>}]
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Trace appears in Phoenix UI automatically</span>
</span></span></code></pre></div><p>Supported auto-instrumentation packages as of 2026:</p>
<table>
  <thead>
      <tr>
          <th>Package</th>
          <th>Instruments</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>openinference-instrumentation-openai</code></td>
          <td>OpenAI Chat, Embeddings, Responses API</td>
      </tr>
      <tr>
          <td><code>openinference-instrumentation-anthropic</code></td>
          <td>Claude Messages API</td>
      </tr>
      <tr>
          <td><code>openinference-instrumentation-langchain</code></td>
          <td>LangChain chains, agents, tools</td>
      </tr>
      <tr>
          <td><code>openinference-instrumentation-llama-index</code></td>
          <td>LlamaIndex query engines, retrievers</td>
      </tr>
      <tr>
          <td><code>openinference-instrumentation-crewai</code></td>
          <td>CrewAI agent crews and tasks</td>
      </tr>
      <tr>
          <td><code>openinference-instrumentation-litellm</code></td>
          <td>LiteLLM proxy (any provider)</td>
      </tr>
  </tbody>
</table>
<h3 id="custom-spans">Custom Spans</h3>
<p>For business logic that sits between LLM calls — pre-processing, validation, post-processing — you can add manual spans using the standard OTel tracer API:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> opentelemetry <span style="color:#f92672">import</span> trace
</span></span><span style="display:flex;"><span>tracer <span style="color:#f92672">=</span> trace<span style="color:#f92672">.</span>get_tracer(__name__)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">with</span> tracer<span style="color:#f92672">.</span>start_as_current_span(<span style="color:#e6db74">&#34;validate-user-query&#34;</span>) <span style="color:#66d9ef">as</span> span:
</span></span><span style="display:flex;"><span>    span<span style="color:#f92672">.</span>set_attribute(<span style="color:#e6db74">&#34;query.length&#34;</span>, len(user_query))
</span></span><span style="display:flex;"><span>    cleaned <span style="color:#f92672">=</span> preprocess(user_query)
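</span></span></code></pre></div><p>These custom spans appear in the same trace tree as the auto-instrumented LLM spans in the Phoenix UI: any LLM call made inside the <code>with</code> block is recorded as a child of the custom span, giving full end-to-end visibility including your non-LLM application code.</p>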
<h2 id="how-do-you-trace-rag-pipelines-with-llamaindex-and-langchain">How Do You Trace RAG Pipelines with LlamaIndex and LangChain?</h2>
<p>RAG pipeline tracing is Phoenix&rsquo;s strongest differentiator versus general-purpose observability tools. A RAG pipeline involves at least four distinct operations — query embedding, vector retrieval, context stuffing, and generation — and failures at any step produce subtly wrong outputs that are invisible without span-level visibility. Phoenix captures each step as a separate span and links them into a single trace tree, making it immediately obvious whether a bad answer came from poor retrieval or poor generation. In a typical LlamaIndex or LangChain RAG setup, a user question that returns a hallucinated answer could have failed at any of three points: the wrong documents were retrieved (retrieval failure), the correct documents were retrieved but the LLM ignored them (faithfulness failure), or the question was ambiguous and the embedding model found semantically unrelated chunks (embedding failure). Without Phoenix traces, distinguishing these failure modes requires manual logging and extensive print-statement debugging. With Phoenix, you see each span&rsquo;s latency, input, and output in a hierarchical tree within seconds of the query completing.</p>
<h3 id="llamaindex-rag-tracing">LlamaIndex RAG Tracing</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> openinference.instrumentation.llama_index <span style="color:#f92672">import</span> LlamaIndexInstrumentor
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>LlamaIndexInstrumentor()<span style="color:#f92672">.</span>instrument(tracer_provider<span style="color:#f92672">=</span>tracer_provider)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> llama_index.core <span style="color:#f92672">import</span> VectorStoreIndex, SimpleDirectoryReader
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Load and index documents</span>
</span></span><span style="display:flex;"><span>documents <span style="color:#f92672">=</span> SimpleDirectoryReader(<span style="color:#e6db74">&#34;./data&#34;</span>)<span style="color:#f92672">.</span>load_data()
</span></span><span style="display:flex;"><span>index <span style="color:#f92672">=</span> VectorStoreIndex<span style="color:#f92672">.</span>from_documents(documents)
</span></span><span style="display:flex;"><span>query_engine <span style="color:#f92672">=</span> index<span style="color:#f92672">.</span>as_query_engine()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># This query generates a full trace: embed → retrieve → generate</span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> query_engine<span style="color:#f92672">.</span>query(<span style="color:#e6db74">&#34;What are the main risks of LLM hallucination?&#34;</span>)
</span></span></code></pre></div><p>In Phoenix, this single query appears as a trace with child spans for:</p>
<ul>
<li><code>embedding</code> — the query vector computation (model, latency, token count)</li>
<li><code>retrieval</code> — the top-k documents returned (document IDs, similarity scores)</li>
<li><code>llm</code> — the generation call (prompt, completion, token usage, cost)</li>
</ul>
<h3 id="langchain-rag-tracing">LangChain RAG Tracing</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> openinference.instrumentation.langchain <span style="color:#f92672">import</span> LangChainInstrumentor
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>LangChainInstrumentor()<span style="color:#f92672">.</span>instrument(tracer_provider<span style="color:#f92672">=</span>tracer_provider)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> langchain_openai <span style="color:#f92672">import</span> ChatOpenAI, OpenAIEmbeddings
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> langchain_community.vectorstores <span style="color:#f92672">import</span> FAISS
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> langchain.chains <span style="color:#f92672">import</span> RetrievalQA
</span></span><span style="display:flex;"><span>
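</span></span><span style="display:flex;"><span><span style="color:#75715e"># FAISS.from_texts expects plain strings; replace these placeholders with your own chunks</span>
</span></span><span style="display:flex;"><span>texts <span style="color:#f92672">=</span> [<span style="color:#e6db74">&#34;Phoenix is an open-source LLM observability tool.&#34;</span>, <span style="color:#e6db74">&#34;Traces are collected over OpenTelemetry.&#34;</span>]
</span></span><span style="display:flex;"><span>embeddings <span style="color:#f92672">=</span> OpenAIEmbeddings()
</span></span><span style="display:flex;"><span>vectorstore <span style="color:#f92672">=</span> FAISS<span style="color:#f92672">.</span>from_texts(texts, embeddings)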
</span></span><span style="display:flex;"><span>qa_chain <span style="color:#f92672">=</span> RetrievalQA<span style="color:#f92672">.</span>from_chain_type(
</span></span><span style="display:flex;"><span>    llm<span style="color:#f92672">=</span>ChatOpenAI(model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>),
</span></span><span style="display:flex;"><span>    retriever<span style="color:#f92672">=</span>vectorstore<span style="color:#f92672">.</span>as_retriever(search_kwargs<span style="color:#f92672">=</span>{<span style="color:#e6db74">&#34;k&#34;</span>: <span style="color:#ae81ff">5</span>}),
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>result <span style="color:#f92672">=</span> qa_chain<span style="color:#f92672">.</span>invoke({<span style="color:#e6db74">&#34;query&#34;</span>: <span style="color:#e6db74">&#34;What is LLM observability?&#34;</span>})
</span></span></code></pre></div><p>Phoenix captures the full LangChain chain execution including each tool call, retriever invocation, and LLM generation as nested spans.</p>
<h2 id="how-do-you-run-llm-evaluations-in-phoenix">How Do You Run LLM Evaluations in Phoenix?</h2>
<p>Phoenix evaluations use LLM-as-a-judge to score traces against quality metrics — automatically and at scale. The <code>phoenix.evals</code> module provides pre-built eval templates backed by published research, so you don&rsquo;t need to write your own judge prompts for common tasks like hallucination detection, relevance scoring, or Q&amp;A correctness.</p>
<p>Running evals takes three steps: export traces from Phoenix, run the eval function, and ship scores back to Phoenix for visualization alongside the original traces.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> phoenix <span style="color:#66d9ef">as</span> px
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> phoenix.evals <span style="color:#f92672">import</span> (
</span></span><span style="display:flex;"><span>    HallucinationEvaluator,
</span></span><span style="display:flex;"><span>    OpenAIModel,
</span></span><span style="display:flex;"><span>    QAEvaluator,
</span></span><span style="display:flex;"><span>    RelevanceEvaluator,
</span></span><span style="display:flex;"><span>    run_evals,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> phoenix.session.evaluation <span style="color:#f92672">import</span> get_qa_with_reference, get_retrieved_documents
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> phoenix.trace <span style="color:#f92672">import</span> DocumentEvaluations, SpanEvaluations
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Connect to running Phoenix instance</span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> px<span style="color:#f92672">.</span>Client()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Export eval-ready dataframes (the evaluators need input/output/reference columns)</span>
</span></span><span style="display:flex;"><span>queries_df <span style="color:#f92672">=</span> get_qa_with_reference(client, project_name<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;my-rag-app&#34;</span>)
</span></span><span style="display:flex;"><span>retrieved_docs_df <span style="color:#f92672">=</span> get_retrieved_documents(client, project_name<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;my-rag-app&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Initialize the judge model</span>
</span></span><span style="display:flex;"><span>eval_model <span style="color:#f92672">=</span> OpenAIModel(model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Run evals; run_evals returns one dataframe per evaluator, in order</span>
</span></span><span style="display:flex;"><span>hallucination_df, qa_df <span style="color:#f92672">=</span> run_evals(
</span></span><span style="display:flex;"><span>    dataframe<span style="color:#f92672">=</span>queries_df,
</span></span><span style="display:flex;"><span>    evaluators<span style="color:#f92672">=</span>[HallucinationEvaluator(eval_model), QAEvaluator(eval_model)],
</span></span><span style="display:flex;"><span>    provide_explanation<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>relevance_df <span style="color:#f92672">=</span> run_evals(
</span></span><span style="display:flex;"><span>    dataframe<span style="color:#f92672">=</span>retrieved_docs_df,
</span></span><span style="display:flex;"><span>    evaluators<span style="color:#f92672">=</span>[RelevanceEvaluator(eval_model)],
</span></span><span style="display:flex;"><span>    provide_explanation<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>)[<span style="color:#ae81ff">0</span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Ship scores back to Phoenix, wrapped so they attach to the right spans and documents</span>
</span></span><span style="display:flex;"><span>client<span style="color:#f92672">.</span>log_evaluations(
</span></span><span style="display:flex;"><span>    SpanEvaluations(eval_name<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Hallucination&#34;</span>, dataframe<span style="color:#f92672">=</span>hallucination_df),
</span></span><span style="display:flex;"><span>    SpanEvaluations(eval_name<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;QA Correctness&#34;</span>, dataframe<span style="color:#f92672">=</span>qa_df),
</span></span><span style="display:flex;"><span>    DocumentEvaluations(eval_name<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Relevance&#34;</span>, dataframe<span style="color:#f92672">=</span>relevance_df),
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><p>After running, each evaluated span in the Phoenix UI shows its eval results inline: a label (for example <code>factual</code> or <code>hallucinated</code>), a numeric score, and, because <code>provide_explanation=True</code> is set, the judge&rsquo;s explanation. You can filter and sort by any eval metric to find the worst-performing traces for debugging.</p>
<h3 id="available-evaluation-metrics">Available Evaluation Metrics</h3>
<p>Phoenix ships 50+ evaluation metrics across five categories:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Metrics</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Retrieval quality</strong></td>
          <td>Relevance, NDCG, Precision@k, Recall@k</td>
      </tr>
      <tr>
          <td><strong>Generation quality</strong></td>
          <td>Hallucination, Faithfulness, Q&amp;A Correctness</td>
      </tr>
      <tr>
          <td><strong>Safety</strong></td>
          <td>Toxicity, PII detection, Prompt injection</td>
      </tr>
      <tr>
          <td><strong>Code</strong></td>
          <td>Code correctness, Execution success rate</td>
      </tr>
      <tr>
          <td><strong>Custom</strong></td>
          <td>Template-based LLM judge for any criteria</td>
      </tr>
  </tbody>
</table>
<h2 id="how-do-you-self-host-phoenix-with-docker-and-kubernetes">How Do You Self-Host Phoenix with Docker and Kubernetes?</h2>
<p>Self-hosting Phoenix gives teams complete data sovereignty: traces never leave your infrastructure, which matters for regulated industries or any team with sensitive data flowing through their LLM apps. Phoenix supports three self-hosting paths: standalone Docker for development, Docker Compose for small teams, and a Kubernetes Helm chart for production-scale deployments.</p>
<p>The Docker Compose setup is the recommended starting point for teams moving from local development to a shared instance:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#75715e"># docker-compose.yml</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">services</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">phoenix</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">image</span>: <span style="color:#ae81ff">arizephoenix/phoenix:latest</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">ports</span>:
</span></span><span style="display:flex;"><span>      - <span style="color:#e6db74">&#34;6006:6006&#34;</span>   <span style="color:#75715e"># Web UI and OTLP HTTP (/v1/traces)</span>
</span></span><span style="display:flex;"><span>      - <span style="color:#e6db74">&#34;4317:4317&#34;</span>   <span style="color:#75715e"># OTLP gRPC</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">volumes</span>:
</span></span><span style="display:flex;"><span>      - <span style="color:#ae81ff">phoenix-data:/mnt/data</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">environment</span>:
</span></span><span style="display:flex;"><span>      - <span style="color:#ae81ff">PHOENIX_WORKING_DIR=/mnt/data</span>
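</span></span><span style="display:flex;"><span>      - <span style="color:#ae81ff">PHOENIX_ENABLE_AUTH=true</span>
</span></span><span style="display:flex;"><span>      - <span style="color:#ae81ff">PHOENIX_SECRET=your-secret-key</span>  <span style="color:#75715e"># placeholder: use a long random value (at least 32 characters)</span>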
</span></span><span style="display:flex;"><span>      
</span></span><span style="display:flex;"><span><span style="color:#f92672">volumes</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">phoenix-data</span>:
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>docker compose up -d
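</span></span></code></pre></div><p>For Kubernetes, Arize provides an official Helm chart. Chart locations and value names can change between releases, so confirm the flags below against the chart&rsquo;s <code>values.yaml</code> before installing:</p>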
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>helm repo add arize-phoenix https://arize-ai.github.io/phoenix
</span></span><span style="display:flex;"><span>helm install phoenix arize-phoenix/phoenix <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --set persistence.enabled<span style="color:#f92672">=</span>true <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --set persistence.size<span style="color:#f92672">=</span>50Gi <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --set ingress.enabled<span style="color:#f92672">=</span>true <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --set ingress.host<span style="color:#f92672">=</span>phoenix.yourdomain.com
</span></span></code></pre></div><h3 id="environment-variables-for-production">Environment Variables for Production</h3>
<table>
  <thead>
      <tr>
          <th>Variable</th>
          <th>Purpose</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>PHOENIX_SECRET</code></td>
          <td>Secret used to sign authentication tokens (required when auth is enabled)</td>
      </tr>
      <tr>
          <td><code>PHOENIX_WORKING_DIR</code></td>
          <td>Persistent storage path for SQLite database</td>
      </tr>
      <tr>
          <td><code>PHOENIX_ENABLE_AUTH</code></td>
          <td>Turns authentication on (default: disabled)</td>
      </tr>
      <tr>
          <td><code>PHOENIX_SMTP_*</code></td>
          <td>Email configuration, used for messages such as password resets</td>
      </tr>
      <tr>
          <td><code>OTEL_EXPORTER_OTLP_ENDPOINT</code></td>
          <td>Override for custom OTLP collectors (set on the instrumented application, not the Phoenix server)</td>
      </tr>
  </tbody>
</table>
<p>Phoenix stores traces in SQLite by default, which handles millions of spans without external database dependencies. For high-throughput production workloads (10M+ spans/day), you can configure PostgreSQL as the backend database by setting <code>PHOENIX_SQL_DATABASE_URL</code> to a PostgreSQL connection string.</p>
<h2 id="arize-phoenix-vs-langfuse-vs-langsmith-which-should-you-choose">Arize Phoenix vs Langfuse vs LangSmith: Which Should You Choose?</h2>
<p>Choosing between Phoenix, Langfuse, and LangSmith depends primarily on your stack, data sovereignty requirements, and evaluation depth needs. All three are viable for 2026 production deployments — the differences are in philosophy and depth rather than basic feature gaps.</p>
<p>Arize Phoenix wins when you need the deepest RAG evaluation capabilities, are running a mixed ML+LLM stack (since Phoenix integrates with traditional Arize model monitoring), or want 50+ pre-built eval metrics without writing judge prompts from scratch. Its OpenTelemetry-first design also makes it future-proof — your traces are portable to any OTel-compatible backend.</p>
<p>Langfuse wins for teams with strict data sovereignty requirements who want the simplest self-hosted setup under MIT license. Its pricing model is the most predictable at scale, and its API-first design integrates cleanly into non-Python stacks.</p>
<p>LangSmith is the strongest fit for teams deeply invested in the LangChain/LangGraph ecosystem. Its tight integration with LangGraph agent debugging is unmatched, but it&rsquo;s a proprietary platform with limited self-hosting options, and its pricing scales poorly past moderate usage.</p>
<table>
  <thead>
      <tr>
          <th>Feature</th>
          <th>Arize Phoenix</th>
          <th>Langfuse</th>
          <th>LangSmith</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>License</td>
          <td>Apache 2.0</td>
          <td>MIT</td>
          <td>Proprietary</td>
      </tr>
      <tr>
          <td>Self-hostable</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Limited</td>
      </tr>
      <tr>
          <td>Built-in eval metrics</td>
          <td>50+</td>
          <td>Custom only</td>
          <td>~10 built-in</td>
      </tr>
      <tr>
          <td>RAG evaluation depth</td>
          <td>Best-in-class</td>
          <td>Basic</td>
          <td>Good</td>
      </tr>
      <tr>
          <td>OpenTelemetry native</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
      <tr>
          <td>LangChain integration</td>
          <td>Good</td>
          <td>Good</td>
          <td>Native</td>
      </tr>
      <tr>
          <td>LlamaIndex integration</td>
          <td>Native</td>
          <td>Good</td>
          <td>Basic</td>
      </tr>
      <tr>
          <td>Agent tracing</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Best (LangGraph)</td>
      </tr>
      <tr>
          <td>Playground</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>ML model monitoring</td>
          <td>Via Arize AX</td>
          <td>No</td>
          <td>No</td>
      </tr>
      <tr>
          <td>GitHub stars (2026)</td>
          <td>9,000+</td>
          <td>8,000+</td>
          <td>6,000+</td>
      </tr>
  </tbody>
</table>
<h3 id="when-to-choose-each">When to Choose Each</h3>
<p><strong>Choose Phoenix if:</strong></p>
<ul>
<li>Your app uses LlamaIndex or a custom RAG pipeline</li>
<li>You need hallucination/faithfulness eval out of the box</li>
<li>You run both traditional ML models and LLMs and want unified monitoring</li>
<li>You may scale to Arize AX&rsquo;s enterprise features later</li>
</ul>
<p><strong>Choose Langfuse if:</strong></p>
<ul>
<li>Data sovereignty is a hard requirement and you need the simplest self-hosted setup</li>
<li>Your team uses multiple languages (Ruby, Go, Java) — Langfuse has broader SDK coverage</li>
<li>You want predictable open-source pricing with no enterprise upsell pressure</li>
</ul>
<p><strong>Choose LangSmith if:</strong></p>
<ul>
<li>Your entire stack is LangChain/LangGraph</li>
<li>You need the tightest possible agent step-debugging experience</li>
<li>You&rsquo;re comfortable with proprietary tooling and SaaS pricing</li>
</ul>
<h2 id="when-does-the-arize-ax-enterprise-upgrade-make-sense">When Does the Arize AX Enterprise Upgrade Make Sense?</h2>
<p>Arize AX is the commercial enterprise platform that sits above Phoenix, sharing the same tracing foundation but adding features that matter at organizational scale. Phoenix to AX is an upgrade path, not a migration — your existing OpenTelemetry instrumentation works unchanged, and Phoenix traces can be forwarded to AX without re-instrumenting your codebase.</p>
<p>AX adds capabilities that Phoenix does not ship: role-based access control (RBAC) for multi-team environments, SSO integration (SAML, OIDC), advanced anomaly detection with alerting, production monitoring dashboards with SLA-grade uptime guarantees, dedicated support SLAs, and compliance reporting for SOC 2 and HIPAA-regulated deployments.</p>
<p>The upgrade makes economic sense when: your team has grown past 10-15 engineers sharing a single Phoenix instance and RBAC becomes a pain point; your legal team requires audit trails and SOC 2 compliance evidence; you need PagerDuty/OpsGenie integration for production LLM quality alerts; or your data volume exceeds what a self-managed PostgreSQL backend can handle without dedicated infrastructure investment.</p>
<p>For most startups and small engineering teams, Phoenix&rsquo;s open-source version handles millions of daily spans without operational overhead. AX is targeted at enterprises with dedicated ML platform teams and organizational compliance requirements.</p>
<h2 id="faq">FAQ</h2>
<p><strong>Q: Is Arize Phoenix completely free?</strong></p>
<p>Yes. Arize Phoenix is released under the Apache 2.0 license with no feature gating. You can run it locally, on your own servers, or in your own cloud account with no usage limits, no required API keys, and no phone-home telemetry. The commercial upgrade is Arize AX, a separate product with enterprise features — Phoenix itself remains fully open source.</p>
<p><strong>Q: Does Phoenix work with non-OpenAI models like Claude, Gemini, or open-source LLMs?</strong></p>
<p>Yes. Phoenix supports any model through OpenTelemetry instrumentation. For Anthropic Claude, use <code>openinference-instrumentation-anthropic</code>. For local models via Ollama or vLLM, use <code>openinference-instrumentation-litellm</code> with LiteLLM as a proxy. For Google Gemini, use the LiteLLM integration or manual spans. The <code>openinference</code> semantic conventions are model-provider agnostic.</p>
<p><strong>Q: How does Phoenix handle trace data storage and retention?</strong></p>
<p>By default, Phoenix stores all traces in a local SQLite database at <code>~/.phoenix/</code> (or the <code>PHOENIX_WORKING_DIR</code> path in Docker). There are no built-in retention limits — traces accumulate until you delete them. In production Docker deployments, mount a persistent volume to <code>/mnt/data</code>. For large-scale production, configure PostgreSQL as the backend to handle higher write throughput and enable standard database backup/retention policies.</p>
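<p>The storage choice above can be sketched as a small helper that inspects the environment and reports which backend would apply. This is an illustration only: <code>describe_storage</code> is a hypothetical function, and <code>PHOENIX_SQL_DATABASE_URL</code> is assumed here to be the variable that points Phoenix at PostgreSQL, so check the Phoenix configuration docs for your version before relying on it.</p>

```python
import os
from pathlib import Path


def describe_storage(env: dict) -> str:
    """Describe where traces would be stored for a given environment mapping."""
    # Assumption: a PostgreSQL URL in PHOENIX_SQL_DATABASE_URL overrides SQLite.
    db_url = env.get("PHOENIX_SQL_DATABASE_URL", "")
    if db_url.startswith("postgresql"):
        return "postgresql"
    # Otherwise fall back to SQLite in the working directory (default ~/.phoenix).
    working_dir = env.get("PHOENIX_WORKING_DIR") or str(Path.home() / ".phoenix")
    return f"sqlite under {working_dir}"


# Report the backend implied by the current process environment.
print(describe_storage(dict(os.environ)))
```

<p>In a Docker deployment with a volume mounted at <code>/mnt/data</code>, setting <code>PHOENIX_WORKING_DIR=/mnt/data</code> would make the helper report <code>sqlite under /mnt/data</code>.</p>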
<p><strong>Q: Can Phoenix run in CI/CD pipelines for automated LLM quality gates?</strong></p>
<p>Yes, and this is one of Phoenix&rsquo;s strongest use cases. The <code>phoenix.evals</code> Python API runs independently of the Phoenix UI server, so a CI job can run evaluations with <code>run_evals()</code>, check the scores programmatically, and fail the pipeline if quality drops below a threshold. A common pattern is to run Phoenix evals inside a pytest suite or a standalone script that blocks deployment when the hallucination rate exceeds a set limit.</p>
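<p>A minimal sketch of such a quality gate, assuming the evaluation step has already produced one label per example. The label strings, the <code>hallucination_rate</code> and <code>gate</code> helpers, and the 10% budget are hypothetical choices for illustration, not part of the Phoenix API; in a real pipeline the labels would come from the output of your <code>phoenix.evals</code> run.</p>

```python
def hallucination_rate(labels: list[str], bad_label: str = "hallucinated") -> float:
    """Return the fraction of labels equal to bad_label (0.0 for an empty list)."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == bad_label) / len(labels)


def gate(labels: list[str], max_rate: float = 0.10) -> int:
    """Return a process exit code: 0 when the rate is within budget, 1 otherwise."""
    rate = hallucination_rate(labels)
    print(f"hallucination rate: {rate:.1%} (budget {max_rate:.1%})")
    return 1 if rate > max_rate else 0


# Hypothetical labels; replace with the labels from your evaluation run.
example_labels = ["factual", "factual", "hallucinated", "factual"]
exit_code = gate(example_labels)

# In a CI script, finish with: raise SystemExit(exit_code)
# so that a non-zero code fails the job.
```

<p>Returning an exit code instead of exiting inside the helper keeps the gate easy to call from a pytest assertion as well as from a standalone script.</p>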
<p><strong>Q: What is the difference between Phoenix traces and traditional APM traces?</strong></p>
<p>Traditional APM traces (Datadog, Jaeger, Zipkin) capture latency, error rates, and resource usage but have no understanding of LLM-specific semantics — they see an HTTP call to <code>api.openai.com</code> but can&rsquo;t tell you what prompt was sent or whether the response was faithful to the retrieved context. Phoenix traces use OpenInference semantic conventions that embed LLM-specific data — input messages, output messages, retrieved documents, embedding vectors, token counts — directly into span attributes, making them queryable and evaluatable in LLM-specific ways.</p>
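<p>To make the contrast concrete, the sketch below compares the attributes a generic HTTP client span might carry with a handful of OpenInference-style attributes on an LLM span. The attribute keys are intended to follow OpenInference naming and should be checked against the current spec; the values and the <code>total_tokens</code> helper are made up for illustration.</p>

```python
# What a generic APM tracer typically records for the outbound call.
apm_span = {
    "http.method": "POST",
    "http.url": "https://api.openai.com/v1/chat/completions",
    "http.status_code": 200,
}

# OpenInference-style attributes for the same call (illustrative values).
llm_span = {
    "openinference.span.kind": "LLM",
    "llm.model_name": "gpt-4o-mini",
    "input.value": "What is our refund policy?",
    "output.value": "Refunds are available within 30 days.",
    "llm.token_count.prompt": 412,
    "llm.token_count.completion": 38,
}


def total_tokens(span: dict) -> int:
    """Sum prompt and completion token counts, treating missing keys as 0."""
    prompt = span.get("llm.token_count.prompt", 0)
    completion = span.get("llm.token_count.completion", 0)
    return prompt + completion


print(total_tokens(apm_span))  # 0: the HTTP span carries no token data
print(total_tokens(llm_span))  # 450
```

<p>Because the prompt, response, and token counts live in span attributes, they can be filtered, aggregated, and passed to evaluators, which is not possible with the HTTP-only span.</p>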
]]></content:encoded></item></channel></rss>