OpenTelemetry on RockB

AI Agent Observability with OpenTelemetry: From Dev to Production in 2026

Tue, 19 May 2026 09:04:46 +0000

OpenTelemetry is the standard way to add structured tracing, metrics, and logs to AI agents in 2026 — covering token usage, tool call latency, and multi-agent context propagation with a single SDK and vendor-neutral backends.

Why Traditional Observability Fails for AI Agents

Traditional APM tools like Datadog APM or New Relic were designed for deterministic request/response cycles: a user hits an endpoint, a function runs, a database query fires, a response returns. The execution path is fixed, latency is bounded, and errors are binary. AI agents break every one of these assumptions. An agent reasoning chain is non-deterministic — the same input prompt can trigger three tool calls in one run and seven in the next. Execution duration ranges from 500ms for a fast LLM call to 3+ minutes for a multi-step agent that searches the web, queries a database, and synthesizes results. Without agent-native spans, you cannot tell which tool call caused a timeout or why a particular run cost $0.40 while a similar one cost $0.03. Traditional APM measures function latency in microseconds and ignores tokens entirely. The LLM observability platform market recognized this gap — growing to an estimated $2.69 billion in 2026 and projected to reach $9.26 billion by 2030 at a 36.2% CAGR. OpenTelemetry’s GenAI Semantic Conventions fill that gap with a purpose-built span model for LLM operations, agent reasoning loops, and tool executions that traditional APM never anticipated.

What Makes AI Agent Telemetry Different?

AI agents require three observability primitives that traditional APM lacks. First, token-based cost attribution — you need to know how many input and output tokens each LLM call consumed, mapped to a session, user, or feature. Second, reasoning chain tracing — a parent span for the agent loop with child spans for each tool call, LLM request, and decision step, linked by trace context so you can reconstruct the full execution tree. Third, non-deterministic failure modes — an agent might hallucinate a tool name, exceed its context window mid-run, or loop indefinitely; catching these requires span attributes that conventional HTTP APM never defines. GenAI conventions add gen_ai.operation.name, gen_ai.system, gen_ai.request.model, and gen_ai.usage.input_tokens to fill exactly these gaps.

The Token Economy Problem

A single user session might trigger dozens of LLM calls across multiple agents. Without per-call token tracking, your billing dashboard shows a lump sum while your engineers have no idea which feature, agent, or user is driving costs. OpenTelemetry’s gen_ai.client.token.usage metric and corresponding span attributes let you aggregate token spend by gen_ai.agent.name, session ID, or custom attribute — giving you cost observability with the same instrumentation that drives latency dashboards.
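As a rough illustration of the aggregation described above, the sketch below sums estimated token spend per agent from a list of span-like records. The record layout, the agent names, and the per-token prices are made-up assumptions for the example; real prices come from your provider's price sheet, and real records come from your tracing backend.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1M tokens; substitute your provider's real rates.
PRICE_PER_MILLION = {"input": 2.00, "output": 8.00}

def cost_by_agent(spans):
    """Sum estimated LLM cost per gen_ai.agent.name across span records."""
    totals = defaultdict(float)
    for span in spans:
        agent = span.get("gen_ai.agent.name", "unknown")
        input_cost = span["gen_ai.usage.input_tokens"] * PRICE_PER_MILLION["input"] / 1_000_000
        output_cost = span["gen_ai.usage.output_tokens"] * PRICE_PER_MILLION["output"] / 1_000_000
        totals[agent] += input_cost + output_cost
    return dict(totals)

# Illustrative records, shaped like exported LLM call spans
spans = [
    {"gen_ai.agent.name": "research_agent", "gen_ai.usage.input_tokens": 1248, "gen_ai.usage.output_tokens": 342},
    {"gen_ai.agent.name": "research_agent", "gen_ai.usage.input_tokens": 5000, "gen_ai.usage.output_tokens": 1000},
    {"gen_ai.agent.name": "writer_agent", "gen_ai.usage.input_tokens": 800, "gen_ai.usage.output_tokens": 2000},
]
costs = cost_by_agent(spans)
```

The same grouping works for a session ID or any custom attribute: swap the key used to index `totals`.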

OpenTelemetry GenAI Semantic Conventions: The 2026 Standard

OpenTelemetry GenAI Semantic Conventions are the standardized attribute names, span structure, and metric definitions that give AI telemetry a common language across every vendor and framework. In early 2026, GenAI client spans and the gen_ai.client.token.usage / gen_ai.client.operation.duration metrics exited experimental status and became stable, meaning you can rely on them in production without fear of breaking changes. Agent-specific attributes (gen_ai.agent.name, gen_ai.tool.name) and framework-level instrumentation remain experimental in the specification, although most major observability vendors already support them in production. The conventions also define how to capture prompt and completion content safely: content goes in span events rather than span attributes, so content capture stays opt-in and PII does not leak into your metrics store. Gartner predicts that by 2028, LLM observability will be part of 50% of GenAI deployments, up from 15% in early 2026, and OpenTelemetry’s vendor-neutral standard is what makes that investment transferable across backends.

Core GenAI Span Attributes

The stable attributes every AI agent span should carry:

Attribute	Type	Example	Purpose
`gen_ai.system`	string	`openai`, `anthropic`	Identifies the LLM provider
`gen_ai.operation.name`	string	`chat`, `execute_tool`	Type of GenAI operation
`gen_ai.request.model`	string	`gpt-5`, `claude-opus-4`	Requested model name
`gen_ai.response.model`	string	`gpt-5-2026-05`	Actual model version used
`gen_ai.usage.input_tokens`	int	`1248`	Prompt tokens consumed
`gen_ai.usage.output_tokens`	int	`342`	Completion tokens generated
`gen_ai.agent.name`	string	`research_agent`	Identifies the agent (experimental)
`gen_ai.tool.name`	string	`web_search`	Tool called by the agent (experimental)

Span Events vs Span Attributes for Content

The conventions deliberately separate prompt and completion content from the main span attribute set. Content goes into span events — specifically gen_ai.content.prompt and gen_ai.content.completion events — rather than span attributes. This design means that a) content capture is opt-in (disabled by default), b) you can strip content at the collector level without losing metrics, and c) you avoid accidentally indexing PII into your tracing backend. For GDPR compliance, this is critical: you can run full token usage and latency observability without ever storing a single user message.

Setting Up OpenTelemetry for AI Agents in Python (Step-by-Step)

Getting OpenTelemetry running for an AI agent takes about 20 minutes from zero to local Jaeger traces. The setup uses opentelemetry-sdk, a GenAI instrumentation library (openlit, openinference, or opentelemetry-instrumentation-openai depending on your framework), and a local Jaeger instance for development. In production, you swap the exporter endpoint to Grafana Cloud, Honeycomb, or any OTLP-compatible backend — the instrumentation code stays identical. 85% of organizations with GenAI deployments planned for LLM observability as key infrastructure in 2026, and OpenTelemetry’s backend-agnostic design is why they can avoid vendor lock-in at the SDK layer. The key insight is that auto-instrumentation handles the heavy lifting for LLM API calls, while manual spans wrap the agent loop itself. This two-layer approach — auto-instrumented LLM calls nested inside manually-traced agent runs — gives you complete visibility into both LLM-level metrics (tokens, latency per call) and agent-level behavior (iterations, tool success rates, end-to-end duration) without duplicating code across every model integration your agent might use. The five steps below take you from a fresh Python environment to a trace visible in Jaeger, then show the one-line change needed to point that same setup at a production backend.

Step 1: Install Dependencies

pip install opentelemetry-sdk \
            opentelemetry-exporter-otlp \
            openlit \
            openai  # or anthropic, langchain, etc.

openlit is the simplest auto-instrumentation library for 2026 — one openlit.init() call instruments OpenAI, Anthropic, LangChain, and LlamaIndex clients automatically.

Step 2: Configure the Tracer Provider

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
import openlit

# Point to local Jaeger in dev, Grafana Cloud / Honeycomb in prod
otlp_endpoint = "http://localhost:4317"

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint))
)
trace.set_tracer_provider(provider)

# Auto-instrument all supported LLM clients
openlit.init(
    otlp_endpoint=otlp_endpoint,
    capture_message_content=False,  # Opt-in; set True only in dev
)

Step 3: Instrument Your Agent Loop

Auto-instrumentation covers LLM API calls. For the agent loop itself, add manual spans:

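from openai import OpenAI
from opentelemetry import trace

tracer = trace.get_tracer("my_agent", "1.0.0")
client = OpenAI()    # Reads OPENAI_API_KEY from the environment
MAX_ITERATIONS = 10  # Hard cap so a confused agent cannot loop forever

# extract_tool_calls, execute_tool, and extract_final_answer are your own
# application helpers; they are not part of OpenTelemetry or the OpenAI SDK.

def run_agent(task: str, session_id: str) -> str:
    messages = [{"role": "user", "content": task}]
    response = None
    with tracer.start_as_current_span(
        "agent.run",
        attributes={
            "gen_ai.agent.name": "research_agent",
            "session.id": session_id,
            "agent.task": task[:100],  # Truncate to keep span payloads small
        }
    ) as span:
        for iteration in range(MAX_ITERATIONS):
            span.set_attribute("agent.iterations", iteration + 1)

            # LLM call, auto-instrumented by openlit
            response = client.chat.completions.create(
                model="gpt-5",
                messages=messages
            )

            tool_calls = extract_tool_calls(response)
            if not tool_calls:
                break

            for tool_call in tool_calls:
                with tracer.start_as_current_span(
                    "agent.tool_call",
                    attributes={
                        "gen_ai.tool.name": tool_call.name,
                        "gen_ai.tool.call.id": tool_call.id,
                    }
                ) as tool_span:
                    result = execute_tool(tool_call)
                    tool_span.set_attribute("tool.success", result.ok)
                    # Append the tool result to messages here so the next
                    # LLM call can see it

        return extract_final_answer(response)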

Step 4: Run Local Jaeger for Development

docker run -d --name jaeger \
  -p 4317:4317 \
  -p 16686:16686 \
  jaegertracing/all-in-one:latest

Open http://localhost:16686 to see traces. Each agent run appears as a root span with nested LLM call spans and tool call spans — you can drill into any span to see token counts, model versions, and timing.

Step 5: Switch to Production Backend

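Replace the OTLP endpoint with your production backend:

# Grafana Cloud's OTLP gateway speaks OTLP over HTTP, so use the HTTP exporter
import base64
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

otlp_endpoint = "https://otlp-gateway-prod-us-east-0.grafana.net/otlp"

# Grafana Cloud uses HTTP Basic auth: "<instance ID>:<API token>", base64-encoded.
# GRAFANA_INSTANCE_ID and GRAFANA_API_KEY are placeholders for your own credentials.
credentials = f"{GRAFANA_INSTANCE_ID}:{GRAFANA_API_KEY}"
auth = base64.b64encode(credentials.encode()).decode()
exporter = OTLPSpanExporter(
    endpoint=f"{otlp_endpoint}/v1/traces",  # The HTTP exporter needs the full traces path
    headers={"Authorization": f"Basic {auth}"},
)

The agent instrumentation does not change; only the exporter configuration (endpoint, protocol, and auth header) does.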

The 6 Essential Metrics Every Production AI Agent Needs

Production AI agent observability requires six distinct metrics that cover cost, reliability, performance, and capacity. These map directly to OpenTelemetry GenAI metric definitions and can be derived from spans if you do not emit them explicitly. Agent execution durations range from 500ms to 3+ minutes; without these metrics, identifying which tool call caused a timeout is nearly impossible. The six metrics form a complete diagnostic surface: token usage ties to cost, tool call success rate ties to reliability, LLM latency ties to user experience, loop iterations catch infinite loops, context window utilization prevents silent truncation, and end-to-end latency covers the full user-facing impact. Two of these — gen_ai.client.token.usage and gen_ai.client.operation.duration — are stable OTel metrics in 2026, meaning vendor-provided dashboards and alerting templates are available out of the box. The remaining four are derived from span attributes on your agent spans. Tracking all six from day one of production deployment means you have a complete baseline when something goes wrong, rather than scrambling to add instrumentation after an incident. Each metric below includes the exact OTel attribute or metric name and a concrete alert threshold that distinguishes healthy agent behavior from a problem worth waking someone up for.

1. Token Usage per Run

Metric: gen_ai.client.token.usage (histogram, stable in OTel 2026)

Emit this metric with gen_ai.token.type (input / output), gen_ai.system, gen_ai.request.model, and a custom agent.name attribute. This lets you build dashboards showing cost per agent, per session, and per feature. For a production agent handling 10,000 sessions/day, a 10% reduction in input tokens can cut monthly spend by thousands of dollars — but you cannot optimize what you do not measure.

2. Tool Call Success Rate

Track tool.success as a boolean span attribute on each tool call span. Aggregate to a success rate metric by gen_ai.tool.name. A web search tool with a 95% success rate looks fine until you check that the 5% failures all cluster around a specific query pattern — only per-tool tracing surfaces that.
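A minimal sketch of that aggregation, assuming tool call spans have been exported as dictionaries carrying the two attributes named above. The sample records are invented for the example.

```python
from collections import Counter

def success_rate_by_tool(tool_spans):
    """Return the fraction of successful calls per gen_ai.tool.name."""
    calls, successes = Counter(), Counter()
    for span in tool_spans:
        name = span["gen_ai.tool.name"]
        calls[name] += 1
        if span["tool.success"]:
            successes[name] += 1
    return {name: successes[name] / calls[name] for name in calls}

# Invented sample: web_search fails once in four calls, sql_query never fails
tool_spans = [
    {"gen_ai.tool.name": "web_search", "tool.success": True},
    {"gen_ai.tool.name": "web_search", "tool.success": True},
    {"gen_ai.tool.name": "web_search", "tool.success": False},
    {"gen_ai.tool.name": "web_search", "tool.success": True},
    {"gen_ai.tool.name": "sql_query", "tool.success": True},
]
rates = success_rate_by_tool(tool_spans)
```

To find clustered failures, group by a second attribute (for example a query category) instead of the tool name alone.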

3. LLM Latency Distribution (p50/p95/p99)

Metric: gen_ai.client.operation.duration (histogram, stable in OTel 2026)

Track latency distribution by model and operation type. p99 latency matters for user-facing agents — if your p99 is 12 seconds, some users experience 12-second waits even if your median is 800ms. Percentile tracking requires a histogram metric, not an average.
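The gap between an average and a tail percentile is easy to see with synthetic numbers. The sketch below uses only the Python standard library and made-up durations; in production the percentiles would come from the histogram buckets in your metrics backend rather than from raw samples.

```python
import statistics

def latency_percentiles(durations_s):
    """Return p50, p95, and p99 from a list of call durations in seconds."""
    # quantiles(n=100) returns the 99 cut points p1..p99
    cuts = statistics.quantiles(durations_s, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Synthetic data: 98 fast calls (0.8 s) and 2 slow ones (12 s)
durations = [0.8] * 98 + [12.0] * 2
stats = latency_percentiles(durations)
mean = statistics.mean(durations)
```

With this data the mean is roughly one second and p50 and p95 both sit at 0.8 seconds, while p99 is 12 seconds. An average-based dashboard would hide exactly the calls that users complain about.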

4. Agent Loop Iterations

Set agent.iterations on the root agent span at completion. A healthy agent typically resolves in 1-5 iterations. Runs exceeding 10 iterations usually indicate prompt issues or tool failures causing the agent to retry. An alert on agent.iterations > 8 catches runaway loops before they exhaust token budgets.

5. Context Window Utilization

Calculate (input_tokens / model_context_window) * 100 per LLM call. When utilization exceeds 85%, you risk silent context truncation where the model loses early conversation history. Track this as a gauge metric by model — it informs when to implement context compression strategies.
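The calculation can be sketched as below. The model names and context window sizes are placeholders invented for the example; look up the real limits for the models you actually call.

```python
# Hypothetical context window sizes in tokens; replace with real model limits.
CONTEXT_WINDOWS = {"example-model-small": 32_000, "example-model-large": 200_000}

def context_utilization(input_tokens: int, model: str) -> float:
    """Percentage of the model's context window consumed by the prompt."""
    return input_tokens / CONTEXT_WINDOWS[model] * 100

def needs_compression(input_tokens: int, model: str, threshold_pct: float = 85.0) -> bool:
    """True when utilization crosses the threshold where truncation becomes a risk."""
    return context_utilization(input_tokens, model) > threshold_pct

half_full = context_utilization(16_000, "example-model-small")
at_risk = needs_compression(28_000, "example-model-small")
healthy = needs_compression(20_000, "example-model-small")
```

Recording the utilization as a span attribute on each LLM call span is one way to make it queryable alongside gen_ai.usage.input_tokens.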

6. End-to-End Latency

The duration of the root agent.run span, not individual LLM calls. This is the user-facing latency that maps to actual experience. An agent might have fast LLM calls but slow tool executions; only end-to-end latency catches that. SLA alerts should be set on this metric.

Distributed Tracing for Multi-Agent and Tool-Calling Workflows

Distributed tracing across agent boundaries is the hardest part of multi-agent observability — and the part where getting it wrong makes all other telemetry useless. When a coordinator agent calls a subagent via an HTTP API or a message queue, the trace context must propagate so that the subagent’s spans appear as children of the coordinator’s span in the same trace. Without propagation, you get disconnected traces: one for the coordinator, one for the subagent, with no way to link them. OpenTelemetry’s W3C Trace Context standard (traceparent and tracestate HTTP headers) handles this automatically for HTTP-based agent communication. For async message passing, you inject the trace context into message headers and extract it on the consumer side. In a real multi-agent system — for example, a coordinator that fans out to a research subagent, a writing subagent, and a fact-checking subagent — proper context propagation means a single trace ID covers the entire execution tree. You can see in one Jaeger view that the coordinator took 45 seconds total, the research subagent took 32 of those seconds (mostly waiting on a web search tool), and the writing subagent ran in 8 seconds. Without propagation, you would have three separate 3-node traces with no causal relationship visible between them. The code examples below show propagation for both HTTP and message queue communication patterns.
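To show what actually travels between agents, here is a standard-library sketch of the traceparent header format from the W3C Trace Context specification (version, trace ID, parent span ID, flags). It is for illustration only: in real code the OpenTelemetry inject() and extract() functions build and parse this header for you, and the helper names below are invented.

```python
import re
import secrets

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a version-00 traceparent header value."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its fields, rejecting malformed values."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid per the W3C Trace Context spec
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("trace-id and parent-id must not be all zeros")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }

# Coordinator side: one trace ID for the whole run, one span ID per span
trace_id = secrets.token_hex(16)
coordinator_span_id = secrets.token_hex(8)
header = build_traceparent(trace_id, coordinator_span_id)

# Subagent side: reuse the trace ID, record the coordinator span as parent
incoming = parse_traceparent(header)
subagent_span = {
    "trace_id": incoming["trace_id"],
    "parent_id": incoming["parent_id"],
    "span_id": secrets.token_hex(8),
}
```

Because the subagent span carries the coordinator's trace ID and names the coordinator span as its parent, the backend can stitch both into one execution tree.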

HTTP-Based Agent Communication

from opentelemetry.propagate import inject
from opentelemetry import trace
import httpx

tracer = trace.get_tracer("coordinator_agent")

def call_subagent(task: str, subagent_url: str) -> dict:
    with tracer.start_as_current_span("coordinator.call_subagent") as span:
        headers = {}
        inject(headers)  # Injects traceparent and tracestate headers

        span.set_attribute("subagent.url", subagent_url)
        span.set_attribute("gen_ai.agent.name", "coordinator")

        response = httpx.post(
            subagent_url,
            json={"task": task},
            headers=headers,
            timeout=120.0,  # httpx defaults to 5s, too short for multi-step subagents
        )
        response.raise_for_status()  # Raise on HTTP errors so the span records the exception
        return response.json()

On the subagent side:

from opentelemetry.propagate import extract
from opentelemetry import trace
from flask import Flask, request

app = Flask(__name__)
tracer = trace.get_tracer("subagent")

@app.post("/run")
def run_subagent():
    # Extract trace context from incoming request headers
    ctx = extract(request.headers)
    
    with tracer.start_as_current_span(
        "subagent.run",
        context=ctx,  # Links this span to coordinator's trace
        attributes={"gen_ai.agent.name": "research_subagent"}
    ) as span:
        result = execute_research_task(request.json["task"])
        return {"result": result}

The result: coordinator call + subagent execution + all LLM calls inside both appear in a single trace in Jaeger or Grafana.

Message Queue Propagation (Kafka/Redis Streams)

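from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("subagent_worker")

# Producer (coordinator)
def enqueue_task(task: dict, producer):
    carrier = {}
    inject(carrier)
    # Kafka header values must be bytes (kafka-python expects (str, bytes) tuples)
    headers = [(key, value.encode("utf-8")) for key, value in carrier.items()]
    producer.send("agent_tasks", value=task, headers=headers)

# Consumer (subagent worker)
def process_task(message):
    # Decode header values back to str before extracting the trace context
    carrier = {key: value.decode("utf-8") for key, value in (message.headers or [])}
    ctx = extract(carrier)
    with tracer.start_as_current_span("subagent.process", context=ctx):
        execute_task(message.value)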

Baggage for Session-Level Context

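Use OpenTelemetry Baggage to propagate session IDs, user IDs, and feature flags across agent boundaries without threading them through every function call manually. Baggage travels alongside the trace context; it is not copied onto spans automatically, so read it where you need it or register a span processor that copies it into span attributes:

from opentelemetry import baggage, context

# Set at entry point. set_baggage returns a new Context,
# which must be attached to become the current one.
ctx = baggage.set_baggage("session.id", session_id)
ctx = baggage.set_baggage("user.tier", "premium", context=ctx)
token = context.attach(ctx)

# Baggage now travels with the context and is sent to subagents by inject()
# Retrieve in subagent, passing the context returned by extract()
session_id = baggage.get_baggage("session.id", ctx)

# When the request finishes, restore the previous context
context.detach(token)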

Choosing Your Observability Backend (Self-Hosted vs Managed)

The choice between self-hosted and managed observability backends for AI agents comes down to three factors: data residency requirements, engineering capacity for ops, and cost at scale. OTel in production nearly doubled year-over-year from 6% to 11% among enterprises in 2026, with 89% rating vendor compliance with GenAI conventions as critical. The good news: any backend that accepts OTLP works — you are not locked to any vendor at the SDK layer. The trade-off is operational overhead vs monthly SaaS spend. A managed backend like Grafana Cloud or Honeycomb costs roughly $20–$200/month for a medium-traffic AI agent deployment and requires zero ops work. A self-hosted Jaeger + VictoriaMetrics stack requires maintaining the infrastructure but gives you full control over data retention, no per-event pricing, and no data leaving your environment — critical for healthcare or financial services applications subject to HIPAA or SOC 2 requirements. Langfuse occupies a middle ground: open source and self-hostable, but with a managed cloud tier if you want LLM-native features without the ops overhead. The comparison table below shows GenAI-specific feature support across the major options so you can match backend capabilities to your observability requirements.

Comparison Table

Backend	Type	GenAI Support	Cost Model	Best For
Grafana Cloud	Managed	Native GenAI dashboards	Free tier + usage	Most teams starting out
Honeycomb	Managed	Full attribute querying	Per event	High-cardinality debugging
Langfuse	Managed + OSS	LLM-native, 21K+ GitHub stars	Free OSS / managed	LLM-first observability
Jaeger	Self-hosted	Standard OTel traces	Infrastructure cost	Dev/test, cost-sensitive
Grafana + Tempo	Self-hosted	Custom dashboards	Infrastructure cost	Full control, data residency
VictoriaMetrics	Self-hosted	Prometheus-compatible metrics	Infrastructure cost	Metrics-heavy workloads

Managed: Grafana Cloud

Grafana Cloud accepts OTLP traces, metrics, and logs from the same endpoint. Their AI/LLM dashboard templates include token usage panels, latency percentile histograms, and cost aggregation by agent. The free tier covers 50GB of logs and 10K traces/month — enough for a medium-traffic development environment.

Self-Hosted: Langfuse

Langfuse is the most popular open-source LLM observability platform (21,000+ GitHub stars by early 2026). It provides a purpose-built UI for LLM traces with session views, prompt management, and evaluation tooling that generic APM tools lack. Deploy with Docker Compose for single-node or Kubernetes for production:

git clone https://github.com/langfuse/langfuse
cd langfuse
docker compose up -d

Then point openlit at the Langfuse OTLP endpoint. Langfuse also maintains a Python SDK for direct integration if you prefer to skip the OTel SDK layer.

Self-Hosted: Jaeger + OpenTelemetry Collector

The minimal self-hosted stack for production:

# docker-compose.yml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"  # OTLP gRPC
      - "4318:4318"  # OTLP HTTP
  
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # UI

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

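processors:
  # Drop prompt/completion span events for GDPR compliance
  filter/genai_content:
    error_mode: ignore
    traces:
      spanevent:
        - 'name == "gen_ai.content.prompt"'
        - 'name == "gen_ai.content.completion"'

exporters:
  # Jaeger ingests OTLP natively; the dedicated jaeger exporter
  # has been removed from recent Collector releases
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [filter/genai_content]
      exporters: [otlp/jaeger]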

Production Deployment Checklist: Dev to Prod in One Guide

Moving AI agent observability from a local Jaeger instance to production requires addressing four concerns that do not exist in development: authentication, cardinality, sampling, and alerting. Each of these can either degrade your observability posture (overly aggressive sampling) or destabilize your backend (unbounded cardinality). The following checklist is what a production AI agent deployment looks like when observability is treated as a first-class requirement from day one, not bolted on after the first incident. Authentication is the most urgent: a misconfigured or unauthenticated OTLP exporter can fail quietly against a production backend, leaving you with zero traces and nothing obvious in application logs. Cardinality problems typically appear 2–4 weeks after launch, when someone adds a user-ID-based attribute and your metrics cardinality explodes. Sampling decisions made before you understand your traffic patterns are almost always wrong: start with 100% of traces in production for the first week, then tune down using tail-based sampling once you understand what “normal” looks like. The alerting section maps the essential metrics from the previous section to specific alert conditions that you can translate into rules in Grafana or your alerting tool of choice.

Authentication and Transport Security

Replace unauthenticated OTLP exporter with authenticated connection using API keys or mTLS
Rotate API keys for observability backends on the same schedule as other service credentials
Ensure OTLP exporter uses TLS with certificate verification (in production, do not set insecure=True on the exporter or insecure_skip_verify: true in the Collector)
Validate that capture_message_content=False is set in production (opt-in content capture only in dev/staging)

Cardinality Management

Never use unbounded values (user IDs, session IDs, full URLs) as span attribute keys or as metric attributes; keep them as span attribute values only
Cap agent.task attribute to 100 characters to keep span payloads and index size bounded
Use gen_ai.request.model (the standardized attribute) instead of a custom model attribute — this ensures consistent cardinality across frameworks
Review tool name attributes — if tool names are dynamically generated, normalize them to a fixed set
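One way to normalize dynamically generated tool names is to map them onto a fixed allow-list before setting gen_ai.tool.name. The sketch below is a minimal example under stated assumptions: the tool set, the suffix pattern, and the "other" bucket are all invented for illustration and should be adapted to how your tool names are actually generated.

```python
import re

# Hypothetical fixed tool set; anything else is bucketed as "other".
KNOWN_TOOLS = {"web_search", "sql_query", "file_read"}

# Assumed naming scheme: a run-specific suffix such as "_3f9a1c" or "-42"
SUFFIX_RE = re.compile(r"[-_][0-9a-f]{2,}$")

def normalize_tool_name(raw_name: str) -> str:
    """Map a dynamically generated tool name onto a bounded set of values."""
    name = raw_name.strip().lower()
    if name in KNOWN_TOOLS:
        return name
    base = SUFFIX_RE.sub("", name)
    return base if base in KNOWN_TOOLS else "other"

normalized = [
    normalize_tool_name("web_search_3f9a1c"),
    normalize_tool_name("SQL_Query-42"),
    normalize_tool_name("file_read"),
    normalize_tool_name("custom_tool_x"),
]
```

Checking the allow-list before stripping the suffix avoids mangling a legitimate tool name that happens to end in hex-looking characters.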

Sampling Strategy

For production AI agents, tail-based sampling is the right default: sample 100% of errored traces, 100% of traces exceeding your p95 latency threshold, and 5-10% of successful fast traces. Head-based sampling at 10% will randomly drop slow or errored traces, defeating the purpose.

# otel-collector sampling config
processors:
  tail_sampling:
    decision_wait: 10s  # Raise above your longest expected agent run so late spans count toward the decision
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces-policy
        type: latency
        latency: {threshold_ms: 5000}
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
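The decision logic behind those three policies can be sketched in a few lines of Python. This is a simplified model for reasoning about what gets kept, not the Collector's implementation: the trace dictionary shape is invented, and the Collector's probabilistic policy decides from the trace ID rather than from a random number generator.

```python
import random

def keep_trace(trace, latency_threshold_ms=5000, baseline_pct=10, rng=random):
    """Keep a finished trace if any policy matches: error, slow, or baseline sample."""
    if trace["status"] == "ERROR":
        return True
    if trace["duration_ms"] > latency_threshold_ms:
        return True
    # Baseline: keep roughly baseline_pct percent of healthy, fast traces
    return rng.random() * 100 < baseline_pct

errored = keep_trace({"status": "ERROR", "duration_ms": 300})
slow = keep_trace({"status": "OK", "duration_ms": 45_000})

# Estimate the keep rate for healthy fast traces with a seeded generator
rng = random.Random(0)
kept_fast = sum(
    keep_trace({"status": "OK", "duration_ms": 100}, rng=rng) for _ in range(10_000)
)
```

Every errored or slow trace survives, while only about a tenth of the uneventful ones do, which is the property head-based sampling cannot give you.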

Alerting Configuration

Set up these five alerts at minimum:

Alert	Condition	Severity
High agent loop iterations	`agent.iterations > 8` for >1% of runs	Warning
Tool call failure spike	Tool success rate drops below 90%	Critical
Token budget exceeded	`gen_ai.usage.input_tokens` > model limit × 0.9	Warning
End-to-end latency p99	Agent run duration p99 > 30s	Critical
Trace loss	No traces received from agent in 5 min	Critical
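As a hedged sketch of how three of these conditions might be evaluated over a window of data, the function below checks loop iterations, tool success rate, and p99 run duration. The input shapes are assumptions made for the example; in practice these checks would live in your alerting tool as queries against the metrics backend.

```python
import statistics

def evaluate_alerts(runs, tool_calls):
    """runs: list of {"iterations": int, "duration_s": float}; tool_calls: list of bool."""
    alerts = []

    # High agent loop iterations: more than 1% of runs exceed 8 iterations
    runaway_fraction = sum(1 for r in runs if r["iterations"] > 8) / len(runs)
    if runaway_fraction > 0.01:
        alerts.append(("High agent loop iterations", "Warning"))

    # Tool call failure spike: success rate below 90%
    if sum(tool_calls) / len(tool_calls) < 0.90:
        alerts.append(("Tool call failure spike", "Critical"))

    # End-to-end latency p99 above 30 seconds
    durations = [r["duration_s"] for r in runs]
    p99 = statistics.quantiles(durations, n=100, method="inclusive")[98]
    if p99 > 30:
        alerts.append(("End-to-end latency p99", "Critical"))

    return alerts

# Invented window: 97 healthy runs, 3 runaway runs, 95% tool success
runs = [{"iterations": 3, "duration_s": 5.0}] * 97 + [{"iterations": 12, "duration_s": 60.0}] * 3
tool_calls = [True] * 95 + [False] * 5
fired = evaluate_alerts(runs, tool_calls)
```

In this window the iteration and latency alerts fire while the tool failure alert stays quiet, because a 95% success rate is still above the 90% threshold.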

Instrumentation Coverage Verification

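# Add this to your CI pipeline to verify instrumentation is active.
# Import your application's telemetry setup module first; otherwise no
# TracerProvider has been registered yet and the assertion always fails.
import opentelemetry.trace as trace_api

def test_tracer_configured():
    tracer = trace_api.get_tracer("test")
    assert not isinstance(
        tracer, trace_api.ProxyTracer
    ), "TracerProvider not configured: spans will be no-ops"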

A no-op TracerProvider is the silent failure mode: your code runs, no errors appear, and no traces arrive. This test catches that in CI before it reaches production.

Environment Variable Configuration

# Production environment
OTEL_SERVICE_NAME=my-production-agent
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,service.version=1.2.3
OTEL_EXPORTER_OTLP_ENDPOINT=https://your-otlp-endpoint:4317
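# Quote the value and URL-encode the space so the shell and the SDK parse it correctly
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer%20${OTLP_API_KEY}"
# Keep every trace at the SDK and let the collector handle tail sampling
OTEL_TRACES_SAMPLER=parentbased_always_on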

# Dev override
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
OPENLIT_CAPTURE_MESSAGE_CONTENT=true

Keep all observability configuration in environment variables — never hardcode endpoints in instrumentation code.

FAQ

What are the OpenTelemetry GenAI semantic conventions and why do they matter?

OpenTelemetry GenAI semantic conventions are the standardized set of attribute names, span types, and metric definitions that give AI/LLM telemetry a consistent structure across every framework and vendor. They matter because without them, a LangChain trace uses different attribute names than an OpenAI trace, making aggregation and alerting across your stack impossible. In 2026, core client spans and token usage metrics are stable, meaning you can build production dashboards on them without breaking change risk.

How do I instrument an OpenAI or Anthropic client without changing my application code?

Use openlit.init() before your first API call. It patches the OpenAI and Anthropic client libraries automatically using monkey-patching, so every client.chat.completions.create() call generates a span with GenAI attributes without requiring any code changes to your existing agent logic. You only need to configure the TracerProvider once at application startup.

What is the difference between session-level and request-level observability for AI agents?

Request-level observability covers a single LLM call: model, tokens, latency. Session-level observability covers the entire user interaction: all agent runs, all tool calls, cost totals, and outcome. Session-level requires threading a session.id through every span and aggregating across multiple traces. OpenTelemetry Baggage is the standard mechanism for propagating session context across agent boundaries without adding it to every span attribute manually.

How do I avoid storing sensitive user data (PII) in my traces?

Set capture_message_content=False in openlit explicitly rather than relying on the library default for your installed version. GenAI conventions separate prompt/completion content into span events rather than attributes, so stripping them at the OpenTelemetry Collector level is straightforward: add a Collector processor that drops the gen_ai.content.prompt and gen_ai.content.completion span events. This gives you full token usage and latency observability without storing any message content.

Which observability backend should I use for AI agents in 2026?

For most teams: Grafana Cloud for its generous free tier, native OTLP support, and pre-built LLM dashboard templates. For LLM-specific features like prompt management and evaluation: Langfuse (open source, 21K+ GitHub stars). For maximum flexibility and self-hosting: Jaeger + OpenTelemetry Collector + VictoriaMetrics. All three work with the same OTel SDK instrumentation — you switch backends by changing an endpoint URL, not rewriting instrumentation code.

Arize Phoenix Guide: Open-Source LLM Observability for Developers (2026)

Sun, 17 May 2026 15:03:42 +0000

Arize Phoenix is a free, open-source LLM observability platform that gives developers full-stack visibility into LLM applications — tracing requests, evaluating outputs, and debugging RAG pipelines — without requiring a cloud subscription or vendor account. It runs locally in a Python process or scales to Docker and Kubernetes for production deployments.

What Is Arize Phoenix and Why It Matters in 2026

Arize Phoenix is an open-source observability platform built specifically for LLM applications, agents, and retrieval-augmented generation (RAG) pipelines. Unlike generic APM tools, Phoenix understands LLM-native concepts — spans, traces, embeddings, prompts, retrieved contexts, and model outputs — and surfaces them in a UI designed for AI engineers. As of 2026, Phoenix has surpassed 9,000 GitHub stars, making it one of the most-adopted open-source observability tools in the AI ecosystem. The platform is backed by Arize AI but released under a permissive open-source license, meaning you can run it entirely on your own infrastructure with no usage caps or feature gating.

The urgency behind Phoenix adoption is clear: the LLM observability market is growing from $1.97B in 2025 to $2.69B in 2026 at a 36.3% CAGR, and Gartner predicts that by 2028, observability will be embedded in 50% of GenAI deployments — up from just 15% today. Yet 57% of organizations already running AI agents in production rate observability as the lowest-quality part of their AI stack. Phoenix exists to close that gap for teams who can’t afford to ship LLM apps blind, and who want to own their trace data rather than send it to a SaaS vendor.

What Core Features Does Phoenix Offer?

Arize Phoenix ships four interconnected capabilities that cover the full LLM development lifecycle: tracing, evaluation, dataset management, and a prompt playground. Together they form a workflow loop: trace what your app is doing, evaluate whether outputs meet quality thresholds, curate failure cases into datasets, and iterate on prompts in the playground before deploying changes. This feedback loop is the key reason teams migrate from generic logging to Phoenix — instead of reading raw JSON logs, engineers see structured span trees, latency breakdowns per retrieval step, and LLM judge scores alongside the actual model outputs.

Tracing captures every span in an LLM workflow as an OpenTelemetry trace. A single user request to a RAG pipeline generates spans for the embedding call, vector DB retrieval, context concatenation, and final LLM generation — each with token counts, latency, and input/output payloads.

Evaluation runs 50+ research-backed metrics including hallucination detection, relevance, Q&A correctness, toxicity, and faithfulness. These can run in the Phoenix UI as one-off evals or in CI via the phoenix.evals Python API.

Dataset management lets you export traces — especially failure cases — into labeled datasets for fine-tuning or regression testing.

Prompt playground connects to your LLM provider APIs and lets you replay any captured trace against modified prompts to A/B test prompt changes against real historical inputs.

How Do You Install Phoenix in 5 Minutes?

Phoenix installs via pip and launches as a local web server that requires no external dependencies for basic usage. The minimum viable setup takes under five minutes and works in any Python 3.9+ environment, including notebooks, Docker containers, and CI runners.

pip install arize-phoenix arize-phoenix-otel openinference-instrumentation-openai

Then start the Phoenix server and point your app at it:

import phoenix as px

# Start Phoenix server (opens UI at http://localhost:6006)
session = px.launch_app()

# Configure OpenTelemetry to send traces to Phoenix
from phoenix.otel import register
tracer_provider = register(
    project_name="my-llm-app",
    endpoint="http://localhost:6006/v1/traces",
)

For Docker, a single command pulls and starts the full server:

docker run -p 6006:6006 -p 4317:4317 arizephoenix/phoenix:latest

Port 6006 serves the web UI. Port 4317 is the OpenTelemetry OTLP gRPC ingest endpoint. To persist traces across restarts, mount a volume at /mnt/data and set PHOENIX_WORKING_DIR=/mnt/data so Phoenix writes its database there.
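Before pointing an application at the container, it can help to confirm that both ports accept connections. The snippet below is a minimal sketch using only the Python standard library; the helper name port_open is hypothetical, and the host and ports simply mirror the docker run command above.

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False


# Ports taken from the docker run command: 6006 (web UI) and 4317 (OTLP gRPC)
for label, port in [("web UI", 6006), ("OTLP gRPC", 4317)]:
    state = "reachable" if port_open("localhost", port) else "not reachable"
    print(f"{label} on port {port}: {state}")
```

A successful TCP connection only shows that something is listening; it does not prove that the listener is Phoenix or that trace ingestion works end to end.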

Notebook Usage

In Jupyter or Colab environments, px.launch_app() renders an embedded iframe directly in the notebook cell output. No separate terminal or process management is required: Phoenix starts as a background thread within the kernel, which makes it well suited to exploratory analysis of LLM outputs.

How Does OpenTelemetry Auto-Instrumentation Work with Phoenix?

Phoenix uses OpenTelemetry (OTel) as its trace collection standard, which means it benefits from a growing ecosystem of vendor-neutral instrumentation libraries. Auto-instrumentation patches popular LLM SDKs at runtime, when you call the instrumentor’s instrument() method. You add two lines of code and Phoenix captures every API call automatically, with no manual span creation required.

OpenTelemetry instrumentation in Phoenix works through the openinference family of packages. These are OTel-compatible semantic conventions for LLM-specific data: input messages, output messages, token usage, model name, embedding vectors, retrieved documents, and tool calls. When you call OpenAIInstrumentor().instrument(), the instrumentor monkey-patches the OpenAI Python client so every client.chat.completions.create() call emits a span with the full request/response payload automatically attached.

from openinference.instrumentation.openai import OpenAIInstrumentor

OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# Now every OpenAI call is automatically traced
import openai
client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain observability in one sentence."}]
)
# Trace appears in Phoenix UI automatically
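To make the monkey-patching idea concrete, here is a toy sketch in plain Python. It is not how the openinference instrumentors are implemented internally; FakeChatClient, instrument, and captured_spans are hypothetical names, and a real instrumentor would export spans through OpenTelemetry rather than append to a list.

```python
import functools
import time


class FakeChatClient:
    """Hypothetical stand-in for an LLM SDK client."""

    def create(self, model: str, prompt: str) -> dict:
        return {"model": model, "text": prompt.upper()}


captured_spans = []  # stand-in for an OTel exporter


def instrument(cls, method_name: str) -> None:
    """Replace cls.method_name with a wrapper that records one span per call."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        start = time.perf_counter()
        result = original(self, *args, **kwargs)
        captured_spans.append({
            "name": f"{cls.__name__}.{method_name}",
            "input": kwargs,
            "output": result,
            "latency_s": time.perf_counter() - start,
        })
        return result

    setattr(cls, method_name, wrapper)


instrument(FakeChatClient, "create")

# Application code is unchanged, yet the call is now recorded.
reply = FakeChatClient().create(model="demo-model", prompt="hello world")
print(reply["text"], len(captured_spans))
```

The point of the pattern is that the patch is applied once, on the class, so every existing and future client instance is covered without touching call sites.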

Supported auto-instrumentation packages as of 2026:

Package	Instruments
`openinference-instrumentation-openai`	OpenAI Chat, Embeddings, Responses API
`openinference-instrumentation-anthropic`	Claude Messages API
`openinference-instrumentation-langchain`	LangChain chains, agents, tools
`openinference-instrumentation-llama-index`	LlamaIndex query engines, retrievers
`openinference-instrumentation-crewai`	CrewAI agent crews and tasks
`openinference-instrumentation-litellm`	LiteLLM proxy (any provider)

Custom Spans

For business logic that sits between LLM calls — pre-processing, validation, post-processing — you can add manual spans using the standard OTel tracer API:

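from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def preprocess(query: str) -> str:
    # Placeholder for your own cleaning logic
    return query.strip()

user_query = "  What is LLM observability?  "

# Everything inside the with block runs under the "validate-user-query" span
with tracer.start_as_current_span("validate-user-query") as span:
    span.set_attribute("query.length", len(user_query))
    cleaned = preprocess(user_query)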

These custom spans appear in the same trace tree as the auto-instrumented LLM spans in the Phoenix UI. Any LLM call made inside the with block is recorded as a child of the custom span, giving full end-to-end visibility including your non-LLM application code.

How Do You Trace RAG Pipelines with LlamaIndex and LangChain?

RAG pipeline tracing is Phoenix’s strongest differentiator versus general-purpose observability tools. A RAG pipeline involves at least four distinct operations (query embedding, vector retrieval, context stuffing, and generation), and a failure at any step produces subtly wrong output that is invisible without span-level visibility. Phoenix captures each step as a separate span and links them into a single trace tree, so you can tell whether a bad answer came from poor retrieval or poor generation.

In a typical LlamaIndex or LangChain RAG setup, a hallucinated answer can originate at any of three points: the wrong documents were retrieved (retrieval failure), the correct documents were retrieved but the LLM ignored them (faithfulness failure), or the question was ambiguous and the embedding model matched semantically unrelated chunks (embedding failure). Without traces, distinguishing these failure modes requires manual logging and print-statement debugging. With Phoenix, you see each span’s latency, input, and output in a hierarchical tree within seconds of the query completing.
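The triage logic described here can be sketched as a small decision function. This is an illustration only: classify_failure, the 0.5 relevance threshold, and the input shape are assumptions, and relevance scores alone cannot separate a retrieval failure from an embedding failure, so the sketch groups the two.

```python
def classify_failure(retrieved_relevance, answer_is_faithful, min_relevance=0.5):
    """Rough triage of one RAG trace.

    retrieved_relevance: relevance scores (0..1) for the retrieved chunks.
    answer_is_faithful: whether the answer is grounded in those chunks.
    """
    # No chunk clears the threshold: the problem is upstream of generation
    if not retrieved_relevance or max(retrieved_relevance) < min_relevance:
        return "retrieval_or_embedding_failure"
    # Good context was available but the answer ignored it
    if not answer_is_faithful:
        return "faithfulness_failure"
    return "ok"


print(classify_failure([0.1, 0.2], answer_is_faithful=False))
print(classify_failure([0.9, 0.3], answer_is_faithful=False))
print(classify_failure([0.9, 0.3], answer_is_faithful=True))
```

Checking retrieval first matters: an unfaithful answer built on irrelevant context is usually a retrieval problem, not a generation problem.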

LlamaIndex RAG Tracing

from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Load and index documents
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

# This query generates a full trace: embed → retrieve → generate
response = query_engine.query("What are the main risks of LLM hallucination?")

In Phoenix, this single query appears as a trace with child spans for:

embedding — the query vector computation (model, latency, token count)
retrieval — the top-k documents returned (document IDs, similarity scores)
llm — the generation call (prompt, completion, token usage, cost)
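Once a trace is broken into these child spans, a quick way to see where time goes is to compute each step’s share of total latency. The sketch below uses made-up span records; the dictionary keys and latency values are assumptions, not Phoenix’s export schema.

```python
from collections import defaultdict

# Hypothetical child spans of a single RAG trace
spans = [
    {"kind": "embedding", "latency_ms": 40},
    {"kind": "retrieval", "latency_ms": 120},
    {"kind": "llm", "latency_ms": 840},
]


def latency_share(spans):
    """Return each span kind's fraction of the summed span latency."""
    totals = defaultdict(float)
    for span in spans:
        totals[span["kind"]] += span["latency_ms"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {kind: round(ms / grand_total, 3) for kind, ms in totals.items()}


print(latency_share(spans))
```

With these sample numbers the generation call accounts for 84% of the time, which would point optimization effort at the LLM step rather than retrieval.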

LangChain RAG Tracing

from openinference.instrumentation.langchain import LangChainInstrumentor

LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

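# FAISS.from_texts expects a list of plain strings (replace with your own corpus)
texts = ["LLM observability means tracing prompts, retrievals, and completions."]
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_texts(texts, embeddings)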
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o"),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
)

result = qa_chain.invoke({"query": "What is LLM observability?"})

Phoenix captures the full LangChain chain execution, including each retriever invocation and LLM generation (and each tool call, when an agent uses tools), as nested spans.

How Do You Run LLM Evaluations in Phoenix?

Phoenix evaluations use LLM-as-a-judge to score traces against quality metrics — automatically and at scale. The phoenix.evals module provides pre-built eval templates backed by published research, so you don’t need to write your own judge prompts for common tasks like hallucination detection, relevance scoring, or Q&A correctness.

Running evals takes three steps: export traces from Phoenix, run the eval function, and ship scores back to Phoenix for visualization alongside the original traces.

import phoenix as px
from phoenix.evals import (
    HallucinationEvaluator,
    QAEvaluator,
    RelevanceEvaluator,
    run_evals,
)
from phoenix.evals import OpenAIModel

# Connect to running Phoenix instance
client = px.Client()

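# Export question/answer/context rows from a project.
# get_qa_with_reference returns the input, output, and reference columns the evaluators expect.
from phoenix.session.evaluation import get_qa_with_reference
traces_df = get_qa_with_reference(client, project_name="my-rag-app")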

# Initialize evaluators
eval_model = OpenAIModel(model="gpt-4o")
evaluators = [
    HallucinationEvaluator(eval_model),
    QAEvaluator(eval_model),
    RelevanceEvaluator(eval_model),
]

# Run evals (parallelized automatically)
eval_results = run_evals(
    dataframe=traces_df,
    evaluators=evaluators,
    provide_explanation=True,
)

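# Ship scores back to Phoenix. log_evaluations expects Evaluations objects,
# so wrap each result DataFrame (run_evals returns one per evaluator, in order).
from phoenix.trace import SpanEvaluations

eval_names = ["Hallucination", "QA Correctness", "Relevance"]
client.log_evaluations(
    *(SpanEvaluations(eval_name=name, dataframe=df)
      for name, df in zip(eval_names, eval_results))
)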

After running, each trace in the Phoenix UI shows inline eval scores: hallucination: 0.12, relevance: 0.94, qa_correctness: 1.0. You can filter and sort by any eval metric to find the worst-performing traces for debugging.
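The same "find the worst traces" step can be done offline on exported scores. This is a minimal sketch over plain dictionaries; the row shape, the span IDs, and the helper name worst_traces are assumptions for illustration.

```python
def worst_traces(rows, metric, n=2, higher_is_worse=True):
    """Return the n rows with the worst value for the given metric."""
    return sorted(rows, key=lambda row: row[metric], reverse=higher_is_worse)[:n]


# Hypothetical per-trace eval scores
rows = [
    {"span_id": "a", "hallucination": 0.12, "relevance": 0.94},
    {"span_id": "b", "hallucination": 0.91, "relevance": 0.20},
    {"span_id": "c", "hallucination": 0.40, "relevance": 0.71},
]

# For hallucination a high score is bad; for relevance a low score is bad
print([r["span_id"] for r in worst_traces(rows, "hallucination")])
print([r["span_id"] for r in worst_traces(rows, "relevance", higher_is_worse=False)])
```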

Available Evaluation Metrics

Phoenix ships 50+ evaluation metrics across five categories:

Category	Metrics
Retrieval quality	Relevance, NDCG, Precision@k, Recall@k
Generation quality	Hallucination, Faithfulness, Q&A Correctness
Safety	Toxicity, PII detection, Prompt injection
Code	Code correctness, Execution success rate
Custom	Template-based LLM judge for any criteria
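For readers unfamiliar with the retrieval metrics in the table, the standard textbook definitions are short enough to write out. The sketch below is not Phoenix’s implementation; the function names are hypothetical, and NDCG is normalized against the ideal ordering of the retrieved list only, which is a common simplification.

```python
import math


def precision_at_k(relevance, k):
    """Fraction of the top-k results that are relevant (relevance > 0)."""
    return sum(1 for r in relevance[:k] if r > 0) / k


def recall_at_k(relevance, k, total_relevant):
    """Fraction of all relevant documents that appear in the top k."""
    if total_relevant == 0:
        return 0.0
    return sum(1 for r in relevance[:k] if r > 0) / total_relevant


def dcg_at_k(relevance, k):
    """Discounted cumulative gain: later ranks contribute less."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevance[:k]))


def ndcg_at_k(relevance, k):
    """DCG divided by the DCG of the best possible ordering of this list."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0


# Relevance of retrieved documents in rank order (1 = relevant, 0 = not)
ranked = [1, 0, 1, 0, 0]
print(precision_at_k(ranked, 3))
print(recall_at_k(ranked, 3, total_relevant=4))
print(round(ndcg_at_k(ranked, 3), 4))
```

Precision and recall ignore rank order within the top k, while NDCG rewards placing relevant documents earlier, which is why the three are usually reported together.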

How Do You Self-Host Phoenix with Docker and Kubernetes?

Self-hosting Phoenix gives teams complete data sovereignty: traces never leave your infrastructure, which matters for regulated industries or any team with sensitive data flowing through their LLM apps. Phoenix supports three self-hosting paths: standalone Docker for development, Docker Compose for small teams, and a Kubernetes Helm chart for production-scale deployments.

The Docker Compose setup is the recommended starting point for teams moving from local development to a shared instance:

# docker-compose.yml
services:
  phoenix:
    image: arizephoenix/phoenix:latest
    ports:
      - "6006:6006"   # Web UI
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    volumes:
      - phoenix-data:/mnt/data
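    environment:
      - PHOENIX_WORKING_DIR=/mnt/data
      # Auth is off by default; enable it and replace the placeholder with a long random secret
      - PHOENIX_ENABLE_AUTH=true
      - PHOENIX_SECRET=your-secret-key
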
volumes:
  phoenix-data:

docker compose up -d

For Kubernetes, Arize provides an official Helm chart:

helm repo add arize-phoenix https://arize-ai.github.io/phoenix
helm install phoenix arize-phoenix/phoenix \
  --set persistence.enabled=true \
  --set persistence.size=50Gi \
  --set ingress.enabled=true \
  --set ingress.host=phoenix.yourdomain.com

Environment Variables for Production

Variable	Purpose
`PHOENIX_SECRET`	Secret used to sign authentication tokens (required when auth is enabled)
`PHOENIX_WORKING_DIR`	Persistent storage path for SQLite database
`PHOENIX_ENABLE_AUTH`	Turns authentication on (default: disabled)
`PHOENIX_SMTP_*`	Email configuration for alerts
`OTEL_EXPORTER_OTLP_ENDPOINT`	Override for custom OTLP collectors

Phoenix stores traces in SQLite by default, which handles millions of spans without external database dependencies. For high-throughput production workloads (10M+ spans/day), you can configure PostgreSQL as the backend database by setting the PHOENIX_SQL_DATABASE_URL environment variable to a PostgreSQL connection string.
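To put the 10M spans/day figure in perspective, it helps to convert it into a sustained write rate. The calculation below is simple arithmetic; the peak factor of 3 is an assumed illustration of bursty traffic, not a measured value.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_spans_per_second(spans_per_day: float) -> float:
    """Average ingest rate if spans arrived evenly over the day."""
    return spans_per_day / SECONDS_PER_DAY


def peak_spans_per_second(spans_per_day: float, peak_factor: float) -> float:
    """Ingest rate during a burst, assumed to be peak_factor times the average."""
    return average_spans_per_second(spans_per_day) * peak_factor


avg = average_spans_per_second(10_000_000)
peak = peak_spans_per_second(10_000_000, peak_factor=3.0)
print(f"average: {avg:.1f} spans/s, assumed peak: {peak:.1f} spans/s")
```

At roughly 116 spans per second on average, sustained write throughput becomes the main consideration when choosing between the default SQLite store and PostgreSQL.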

Arize Phoenix vs Langfuse vs LangSmith: Which Should You Choose?

Choosing between Phoenix, Langfuse, and LangSmith depends primarily on your stack, data sovereignty requirements, and evaluation depth needs. All three are viable for 2026 production deployments — the differences are in philosophy and depth rather than basic feature gaps.

Arize Phoenix wins when you need the deepest RAG evaluation capabilities, are running a mixed ML+LLM stack (since Phoenix integrates with traditional Arize model monitoring), or want 50+ pre-built eval metrics without writing judge prompts from scratch. Its OpenTelemetry-first design also makes it future-proof — your traces are portable to any OTel-compatible backend.

Langfuse wins for teams with strict data sovereignty requirements who want the simplest self-hosted setup under MIT license. Its pricing model is the most predictable at scale, and its API-first design integrates cleanly into non-Python stacks.

LangSmith wins mainly for teams deeply invested in the LangChain/LangGraph ecosystem. Its tight integration with LangGraph agent debugging is unmatched, but it’s a proprietary platform with limited self-hosting options and pricing that scales poorly past moderate usage.

Feature	Arize Phoenix	Langfuse	LangSmith
License	Apache 2.0	MIT	Proprietary
Self-hostable	Yes	Yes	Limited
Built-in eval metrics	50+	Custom only	~10 built-in
RAG evaluation depth	Best-in-class	Basic	Good
OpenTelemetry native	Yes	Yes	No
LangChain integration	Good	Good	Native
LlamaIndex integration	Native	Good	Basic
Agent tracing	Yes	Yes	Best (LangGraph)
Playground	Yes	Yes	Yes
ML model monitoring	Via Arize AX	No	No
GitHub stars (2026)	9,000+	8,000+	6,000+

When to Choose Each

Choose Phoenix if:

Your app uses LlamaIndex or a custom RAG pipeline
You need hallucination/faithfulness eval out of the box
You run both traditional ML models and LLMs and want unified monitoring
You may scale to Arize AX’s enterprise features later

Choose Langfuse if:

Data sovereignty is a hard requirement and you need the simplest self-hosted setup
Your team uses multiple languages (Ruby, Go, Java) — Langfuse has broader SDK coverage
You want predictable open-source pricing with no enterprise upsell pressure

Choose LangSmith if:

Your entire stack is LangChain/LangGraph
You need the tightest possible agent step-debugging experience
You’re comfortable with proprietary tooling and SaaS pricing

When Does the Arize AX Enterprise Upgrade Make Sense?

Arize AX is the commercial enterprise platform that sits above Phoenix, sharing the same tracing foundation but adding features that matter at organizational scale. Phoenix to AX is an upgrade path, not a migration — your existing OpenTelemetry instrumentation works unchanged, and Phoenix traces can be forwarded to AX without re-instrumenting your codebase.

AX adds capabilities that Phoenix does not ship: role-based access control (RBAC) for multi-team environments, SSO integration (SAML, OIDC), advanced anomaly detection with alerting, production monitoring dashboards with SLA-grade uptime guarantees, dedicated support SLAs, and compliance reporting for SOC 2 and HIPAA-regulated deployments.

The upgrade makes economic sense when: your team has grown past 10-15 engineers sharing a single Phoenix instance and RBAC becomes a pain point; your legal team requires audit trails and SOC 2 compliance evidence; you need PagerDuty/OpsGenie integration for production LLM quality alerts; or your data volume exceeds what a self-managed PostgreSQL backend can handle without dedicated infrastructure investment.

For most startups and small engineering teams, Phoenix’s open-source version handles millions of daily spans without operational overhead. AX is targeted at enterprises with dedicated ML platform teams and organizational compliance requirements.

FAQ

Q: Is Arize Phoenix completely free?

Yes. Arize Phoenix is released under the Apache 2.0 license with no feature gating. You can run it locally, on your own servers, or in your own cloud account with no usage limits, no required API keys, and no phone-home telemetry. The commercial upgrade is Arize AX, a separate product with enterprise features — Phoenix itself remains fully open source.

Q: Does Phoenix work with non-OpenAI models like Claude, Gemini, or open-source LLMs?

Yes. Phoenix supports any model through OpenTelemetry instrumentation. For Anthropic Claude, use openinference-instrumentation-anthropic. For local models via Ollama or vLLM, use openinference-instrumentation-litellm with LiteLLM as a proxy. For Google Gemini, use the LiteLLM integration or manual spans. The openinference semantic conventions are model-provider agnostic.

Q: How does Phoenix handle trace data storage and retention?

By default, Phoenix stores all traces in a local SQLite database at ~/.phoenix/ (or the PHOENIX_WORKING_DIR path in Docker). There are no built-in retention limits — traces accumulate until you delete them. In production Docker deployments, mount a persistent volume to /mnt/data. For large-scale production, configure PostgreSQL as the backend to handle higher write throughput and enable standard database backup/retention policies.

Q: Can Phoenix run in CI/CD pipelines for automated LLM quality gates?

Yes, and this is one of Phoenix’s strongest use cases. The phoenix.evals Python API runs independently of the Phoenix UI server — you can run evaluations in a CI job using run_evals(), check scores programmatically, and fail the pipeline if quality drops below threshold. Many teams run Phoenix evals as a pytest fixture or a standalone script that gates deployments when hallucination rate exceeds a threshold.
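A quality gate of this kind reduces to computing a rate and comparing it to a threshold. The sketch below shows only that gating logic in plain Python; the label strings, the 5% threshold, and the function names are assumptions, and in a real pipeline the labels would come from the eval results.

```python
def hallucination_rate(labels):
    """Share of evaluated responses labeled as hallucinated."""
    if not labels:
        raise ValueError("no eval labels to gate on")
    return sum(1 for label in labels if label == "hallucinated") / len(labels)


def gate(labels, max_rate=0.05):
    """Return (passed, rate); passed is False when the rate exceeds max_rate."""
    rate = hallucination_rate(labels)
    return rate <= max_rate, rate


def exit_code(labels, max_rate=0.05):
    """Map the gate result to a process exit code for a CI job."""
    passed, rate = gate(labels, max_rate)
    print(f"hallucination rate {rate:.1%} (limit {max_rate:.1%}): "
          f"{'PASS' if passed else 'FAIL'}")
    return 0 if passed else 1


# Hypothetical labels for 20 evaluated responses, 2 of them hallucinated
sample_labels = ["factual"] * 18 + ["hallucinated"] * 2
code = exit_code(sample_labels)
```

In a CI script, passing the returned code to sys.exit() is what makes the job fail; the sketch returns it instead so the logic stays easy to unit test.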

Q: What is the difference between Phoenix traces and traditional APM traces?

Traditional APM traces (Datadog, Jaeger, Zipkin) capture latency, error rates, and resource usage but have no understanding of LLM-specific semantics — they see an HTTP call to api.openai.com but can’t tell you what prompt was sent or whether the response was faithful to the retrieved context. Phoenix traces use OpenInference semantic conventions that embed LLM-specific data — input messages, output messages, retrieved documents, embedding vectors, token counts — directly into span attributes, making them queryable and evaluatable in LLM-specific ways.
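As an illustration of why LLM-specific span attributes matter, the sketch below estimates the cost of a call from its token counts. The attribute keys follow the OpenInference naming as I understand it; the model name and per-1K-token prices are invented for the example and should be replaced with your provider’s actual rates.

```python
# Hypothetical prices in USD per 1,000 tokens
PRICE_PER_1K = {
    "demo-model": {"prompt": 0.0025, "completion": 0.01},
}


def span_cost(attributes: dict) -> float:
    """Estimate the USD cost of one LLM span from its token-count attributes."""
    prices = PRICE_PER_1K[attributes["llm.model_name"]]
    prompt_cost = attributes["llm.token_count.prompt"] / 1000 * prices["prompt"]
    completion_cost = attributes["llm.token_count.completion"] / 1000 * prices["completion"]
    return prompt_cost + completion_cost


# Flat attribute dictionary, as span attributes are stored
llm_span = {
    "llm.model_name": "demo-model",
    "llm.token_count.prompt": 1200,
    "llm.token_count.completion": 300,
}

print(f"${span_cost(llm_span):.4f}")
```

A traditional APM span for the same call would carry the URL and latency but none of the fields this calculation needs.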