<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Observability on RockB</title><link>https://baeseokjae.github.io/tags/observability/</link><description>Recent content in Observability on RockB</description><image><title>RockB</title><url>https://baeseokjae.github.io/images/og-default.png</url><link>https://baeseokjae.github.io/images/og-default.png</link></image><generator>Hugo</generator><language>en-us</language><lastBuildDate>Tue, 12 May 2026 09:04:54 +0000</lastBuildDate><atom:link href="https://baeseokjae.github.io/tags/observability/index.xml" rel="self" type="application/rss+xml"/><item><title>AI Agent Observability 2026: Braintrust vs Arize Phoenix vs Langfuse Compared</title><link>https://baeseokjae.github.io/posts/ai-agent-observability-tools-2026/</link><pubDate>Tue, 12 May 2026 09:04:54 +0000</pubDate><guid>https://baeseokjae.github.io/posts/ai-agent-observability-tools-2026/</guid><description>Braintrust, Arize Phoenix, and Langfuse compared on tracing depth, evaluation capabilities, pricing, and self-hosting for production AI agents in 2026.</description><content:encoded><![CDATA[<p>The fastest-moving part of AI infrastructure in 2026 is observability — and for good reason. The LLM observability platform market hit $2.69B this year (up from $1.97B in 2025), growing at a 36.3% CAGR. Three platforms dominate production use: <strong>Braintrust</strong> (SaaS-only, $80M Series B, enterprise-grade CI/CD gates), <strong>Arize Phoenix</strong> (100% open-source, OpenTelemetry-native, 9,100+ GitHub stars), and <strong>Langfuse</strong> (MIT-licensed, ClickHouse-acquired, 19,000+ GitHub stars). Choosing the wrong one means either paying for features you won&rsquo;t use or hitting invisible ceilings when your agent fleet scales.</p>
<h2 id="why-ai-agent-observability-is-now-a-production-requirement">Why AI Agent Observability Is Now a Production Requirement</h2>
<p>AI agent observability is the practice of instrumenting, tracing, and evaluating multi-step AI workflows in production so that developers can diagnose failures, measure quality degradation, and prevent silent regressions. Unlike traditional application monitoring — which tracks HTTP response times and CPU usage — agent observability must capture the semantic content of LLM calls, tool invocations, retry chains, and intermediate reasoning steps. The challenge is real: the average AI trace span is ~50KB versus ~900 bytes in traditional observability, a 55x data density gap that causes legacy monitoring tools like Datadog and New Relic to fail silently or generate 100x higher costs when ingesting LLM traces naively.</p>
<p>The stakes became clear in 2026: 57% of organizations now run AI agents in production, yet observability remains the lowest-rated part of the AI stack. Gartner predicts that by 2028, 50% of GenAI deployments will include LLM observability, up from just 15% today. Teams that skip this infrastructure pay for it in customer-reported hallucinations, undiscovered prompt regressions after model updates, and compliance audits that fail because there&rsquo;s no trace of what an agent actually decided. Companies like Notion demonstrate the ROI directly: after adopting Braintrust, their team went from catching 3 AI quality issues per day to 30, a 10x improvement. For any team running production agents today, observability is no longer optional.</p>
<h2 id="the-three-contenders-what-braintrust-arize-phoenix-and-langfuse-actually-do">The Three Contenders: What Braintrust, Arize Phoenix, and Langfuse Actually Do</h2>
<p>Braintrust, Arize Phoenix, and Langfuse each represent a distinct approach to the same problem: making AI agent behavior legible, measurable, and improvable. Braintrust is a vertically integrated SaaS platform that packages tracing, evaluation, prompt management, and dataset curation into a single product with a proprietary database (Brainstore) optimized for LLM workloads. Arize Phoenix is an open-source library under the Elastic License 2.0, built natively on OpenTelemetry and designed to drop into any infrastructure stack without vendor lock-in. Langfuse is an MIT-licensed open-source platform acquired by ClickHouse in January 2026, now backed by a $15B database company whose core technology powers Langfuse&rsquo;s sub-millisecond query performance.</p>
<p>The surface-level difference is deployment model: SaaS-only (Braintrust) versus self-host-first (Arize Phoenix and Langfuse). But the more important differences lie in evaluation philosophy, integration depth, and pricing structure. Braintrust makes evaluation a first-class citizen integrated directly into CI/CD pipelines. Arize Phoenix brings 50+ research-backed metrics out of the box and prioritizes OTel standardization. Langfuse offers maximum flexibility through an open pipeline model and the broadest framework compatibility. Each platform is genuinely good at what it optimizes for — the question is what your team actually needs.</p>
<h2 id="architecture--deployment-saas-vs-open-source-vs-hybrid">Architecture &amp; Deployment: SaaS vs Open Source vs Hybrid</h2>
<p>Architecture and deployment model determine your long-term data residency, ops burden, and cost trajectory, which is why this is often the first decision to make before evaluating features. Braintrust is cloud-only with no self-hosting option; your traces and evaluation data live on Braintrust&rsquo;s infrastructure with SOC2 compliance and optional HIPAA-compliant VPC deployment for enterprise contracts. Arize Phoenix is 100% open-source under Elastic License 2.0 and free to self-host on any infrastructure (AWS, Azure, GCP, or on-premise) with zero feature gates. The commercial cloud offering (Arize AX) adds real-time alerting via PagerDuty/Slack, drift detection, and an AI debugging assistant named Alyx. Langfuse ships as MIT-licensed open source deployable via Docker Compose, though its self-hosted stack spans 5+ services, including PostgreSQL and ClickHouse, that your team must operate.</p>
<table>
  <thead>
      <tr>
          <th>Platform</th>
          <th>Deployment</th>
          <th>License</th>
          <th>Self-Host Complexity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Braintrust</td>
          <td>SaaS only</td>
          <td>Proprietary</td>
          <td>None (no self-host)</td>
      </tr>
      <tr>
          <td>Arize Phoenix</td>
          <td>OSS + Commercial cloud</td>
          <td>Elastic License 2.0</td>
          <td>Low (single container)</td>
      </tr>
      <tr>
          <td>Langfuse</td>
          <td>OSS + Managed cloud</td>
          <td>MIT</td>
          <td>Medium (5+ services)</td>
      </tr>
  </tbody>
</table>
<p>For teams with strict data residency requirements in regulated industries, Arize Phoenix&rsquo;s self-hosted deployment is the simplest path: a single Docker container with no licensing fees. For teams that want zero ops overhead and accept SaaS data handling, Braintrust wins. After the ClickHouse acquisition, Langfuse offers excellent cloud performance, but self-hosting it means owning a non-trivial infrastructure stack.</p>
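<p>To make the complexity gap concrete, here is a minimal sketch of the Phoenix self-host path using the open-source <code>arize-phoenix</code> Python package. It launches the bundled server in-process for local use; a production deployment runs the equivalent single container instead. Verify the API against the current Phoenix docs before relying on it.</p>
<pre><code class="language-python"># pip install arize-phoenix
import phoenix as px

# Start a local Phoenix server in-process and print the trace UI URL.
# (Production self-hosting runs the same thing as one Docker container.)
session = px.launch_app()
print(session.url)
</code></pre>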
<h2 id="tracing--agent-workflow-visibility-depth-of-multi-step-agent-support">Tracing &amp; Agent Workflow Visibility: Depth of Multi-Step Agent Support</h2>
<p>Tracing in agent observability means capturing not just LLM API calls but the complete execution graph: which tools fired, in what order, what each intermediate step returned, where retry loops occurred, and how long each node took. This is fundamentally harder than logging HTTP requests because agents branch, recurse, and run sub-agents. All three platforms handle single-chain LLM traces well — the differentiation shows in complex multi-agent orchestration. Braintrust auto-instruments 13+ frameworks including OpenAI Agents SDK, LangGraph, Mastra, Pydantic AI, LangChain, CrewAI, and the Vercel AI SDK, producing nested span hierarchies that visualize exactly which agent called which tool with what input/output pairs. The Loop AI assistant can auto-suggest evaluation criteria by analyzing production trace patterns, reducing the manual work of defining what &ldquo;good&rdquo; looks like.</p>
<p>Arize Phoenix provides auto-instrumentation for the same major frameworks through its OTel-native SDK, meaning trace data flows out using the OpenTelemetry Protocol (OTLP) standard and is readable by any compatible backend. This is its killer feature for enterprise teams that already have OTel infrastructure: Phoenix traces plug into existing observability stacks without vendor-specific agents. Langfuse&rsquo;s OpenTelemetry support covers Pydantic AI, smolagents, Strands Agents, and all major frameworks, with the ClickHouse v3 backend enabling sub-millisecond queries when filtering across millions of traces by metadata, session ID, or user.</p>
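<p>For a feel of what nested multi-step tracing looks like in code, here is a minimal sketch using Langfuse&rsquo;s <code>@observe</code> decorator. The import path assumes the v3 Python SDK (v2 used <code>langfuse.decorators</code>), the tool and agent functions are placeholders, and credentials are read from the standard <code>LANGFUSE_*</code> environment variables.</p>
<pre><code class="language-python"># pip install langfuse  (assumes LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY are set)
from langfuse import observe

@observe()  # child span: one per tool invocation
def search_tool(query):
    return "results for " + query  # stand-in for a real retrieval call

@observe()  # parent span: the full agent turn; nesting follows the call stack
def agent(question):
    context = search_tool(question)
    # an LLM call would go here; inputs and outputs are captured on the spans
    return "answer grounded in " + context

agent("What changed in the Q3 report?")
</code></pre>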
<h2 id="evaluation-capabilities-built-in-metrics-llm-as-a-judge-and-cicd-gates">Evaluation Capabilities: Built-In Metrics, LLM-as-a-Judge, and CI/CD Gates</h2>
<p>Evaluation is where the three platforms diverge most sharply. Braintrust&rsquo;s core thesis is that evaluation should be automated, version-controlled, and integrated directly into your software deployment process — not a manual review step after the fact. Its CI/CD deployment blocking feature automatically fails a build when eval quality falls below a defined threshold, treating AI quality degradation the same way unit test failures block code merges. This is the feature Notion credits for the 10x improvement in issue detection. The Loop assistant generates evaluation criteria from production traces, removing the cold-start problem of &ldquo;what should I even measure?&rdquo; Braintrust&rsquo;s proprietary Brainstore database delivers sub-1-second median query times on evaluation result sets, which matters when you&rsquo;re running thousands of evaluations against a prompt change.</p>
<p>Arize Phoenix ships 50+ research-backed built-in evaluation metrics covering faithfulness, relevance, safety, toxicity, and hallucination detection — the deepest out-of-the-box eval library of the three platforms. The 2026 Evaluator Hub adds commit-level versioning so evaluation logic is treated with the same rigor as application code. This is particularly valuable for regulated industries where you need to prove that your evaluation criteria haven&rsquo;t silently changed between compliance audits. Langfuse takes a flexible, pipeline-based approach: LLM-as-a-judge, user feedback collection, manual labeling, and custom evaluation pipelines via SDKs and APIs. It doesn&rsquo;t have the broadest built-in library (that&rsquo;s Phoenix) or the deepest CI/CD integration (that&rsquo;s Braintrust), but it gives maximum flexibility for teams building custom evaluation workflows.</p>
<table>
  <thead>
      <tr>
          <th>Capability</th>
          <th>Braintrust</th>
          <th>Arize Phoenix</th>
          <th>Langfuse</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Built-in eval metrics</td>
          <td>Custom + Loop auto-gen</td>
          <td>50+ research-backed</td>
          <td>LLM-as-a-judge + custom</td>
      </tr>
      <tr>
          <td>CI/CD deployment gates</td>
          <td>Yes (native)</td>
          <td>No</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Evaluator versioning</td>
          <td>Yes</td>
          <td>Yes (Evaluator Hub)</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>User feedback collection</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Custom eval pipelines</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes (open API)</td>
      </tr>
  </tbody>
</table>
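<p>For a sense of what the Braintrust side of this table looks like in practice, here is a minimal eval sketch using its Python SDK and an off-the-shelf <code>autoevals</code> scorer. The project name, dataset, and <code>my_agent</code> function are placeholders; a CI job runs a file like this via the <code>braintrust eval</code> CLI and gates the build on the resulting scores.</p>
<pre><code class="language-python"># pip install braintrust autoevals
# CI runs:  braintrust eval eval_agent.py
from braintrust import Eval
from autoevals import Factuality

def my_agent(input):
    # placeholder for the system under test (prompt + model + tools)
    return "Paris is the capital of France."

Eval(
    "my-agent-project",  # Braintrust project name (illustrative)
    data=lambda: [{"input": "What is the capital of France?", "expected": "Paris"}],
    task=my_agent,
    scores=[Factuality],  # LLM-as-a-judge scorer shipped with autoevals
)
</code></pre>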
<h2 id="pricing-comparison-real-cost-for-teams-at-different-scales">Pricing Comparison: Real Cost for Teams at Different Scales</h2>
<p>Pricing structures between these three platforms are designed for fundamentally different buyers. Braintrust&rsquo;s free tier is the most generous for getting started: 1 million trace spans plus 10,000 evaluation scores per month with unlimited users. The Pro plan at $249/month removes limits for small-to-mid teams. Arize Phoenix (the open-source library) is completely free to self-host with no feature restrictions — the commercial Arize AX cloud starts at $50/month for 50,000 spans and scales to roughly $50,000/year for enterprise contracts that include real-time alerting, SOC2/HIPAA compliance, and the Alyx AI debugging assistant. Langfuse offers a Hobby tier free at 50,000 observations/month for 2 users, a Core plan at $29/month for unlimited users, a Pro plan at $199/month with SOC2/ISO27001 compliance, and Enterprise plans starting at $2,499/month.</p>
<table>
  <thead>
      <tr>
          <th>Plan</th>
          <th>Braintrust</th>
          <th>Arize Phoenix/AX</th>
          <th>Langfuse</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Free tier</td>
          <td>1M spans + 10K evals, unlimited users</td>
          <td>25K spans/month (AX cloud) or unlimited self-host</td>
          <td>50K obs/month, 2 users</td>
      </tr>
      <tr>
          <td>Entry paid</td>
          <td>$249/month (Pro)</td>
          <td>$50/month (50K spans)</td>
          <td>$29/month (unlimited users)</td>
      </tr>
      <tr>
          <td>Compliance tier</td>
          <td>Enterprise VPC (custom)</td>
          <td>~$50K/year (AX Enterprise)</td>
          <td>$199/month (SOC2/ISO27001)</td>
      </tr>
      <tr>
          <td>Self-host</td>
          <td>Not available</td>
          <td>Free (Phoenix OSS)</td>
          <td>Free (MIT OSS)</td>
      </tr>
  </tbody>
</table>
<p>One gotcha with Langfuse&rsquo;s pricing: it&rsquo;s observation-based, and agent traces with many small spans can hit tier thresholds faster than expected. A multi-step agent that fires 20 tool calls generates 20+ observations per user interaction — teams running high-volume production agents should model this carefully before choosing the Hobby or Core tier.</p>
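<p>A quick back-of-envelope using the paragraph&rsquo;s own numbers shows how fast observation counts compound (the traffic figure is illustrative):</p>
<pre><code class="language-python"># Rough sizing against Langfuse's 50K-observation Hobby cap (illustrative traffic).
observations_per_interaction = 20   # one span per tool call in a 20-step agent turn
interactions_per_day = 1_000
monthly = observations_per_interaction * interactions_per_day * 30
print(monthly)                      # 600,000 -- 12x the 50K/month Hobby tier
</code></pre>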
<h2 id="enterprise-compliance--security-soc2-hipaa-and-data-residency">Enterprise Compliance &amp; Security: SOC2, HIPAA, and Data Residency</h2>
<p>Compliance requirements often narrow the decision before any feature comparison happens. Braintrust is SOC2 Type II compliant, with HIPAA-compliant VPC deployment available for enterprise contracts — meaning healthcare and fintech teams can keep traces in their own cloud account. The trade-off is that Braintrust&rsquo;s SaaS architecture means data transits their infrastructure regardless; VPC deployment just controls where it lands at rest. Arize AX (the commercial platform) holds SOC2, GDPR, and HIPAA certifications; the Phoenix open-source library has no compliance certifications because it&rsquo;s self-hosted infrastructure you control directly. Langfuse&rsquo;s Pro cloud tier ($199/month) includes SOC2 Type II and ISO 27001 — a notable combination that satisfies most enterprise procurement checklists without requiring the $2,499+/month Enterprise contract. Langfuse&rsquo;s MIT license also means enterprise legal teams face no licensing risk from the open-source component.</p>
<p>For regulated industries specifically, Arize Phoenix self-hosted eliminates data residency concerns entirely — your traces never leave your network. The cost is operational: your platform team owns uptime, upgrades, and backup. Teams choosing Langfuse self-hosted get the same data control at the cost of managing ClickHouse + PostgreSQL in production, which is non-trivial but well-documented.</p>
<h2 id="integration-coverage-which-frameworks-and-providers-are-supported">Integration Coverage: Which Frameworks and Providers Are Supported</h2>
<p>Integration breadth determines whether you can instrument your stack on day one or spend weeks writing custom instrumentation. Braintrust supports 13+ frameworks natively: OpenAI Agents SDK, LangGraph, Mastra, Pydantic AI, LangChain, CrewAI, Vercel AI SDK, and others via auto-instrumentation. Any LLM provider (Anthropic, OpenAI, Mistral, Cohere, Bedrock) works through the OTel-compatible SDK. Arize Phoenix&rsquo;s OTel-native design means any framework with an OpenTelemetry exporter integrates automatically; that covers LangChain, LlamaIndex, OpenAI Agents SDK, Claude Agent SDK, CrewAI, DSPy, LangGraph, Mastra, and the Vercel AI SDK. Python, TypeScript, and Java are all supported. Langfuse similarly covers all major frameworks through its OpenTelemetry support, with explicit integrations for Pydantic AI, smolagents, and Strands Agents alongside the standard LangChain/LlamaIndex ecosystem.</p>
<p>The practical difference: if your team is using a niche or proprietary framework, Arize Phoenix&rsquo;s full OTel compliance means you only need to add standard OTLP export to get traces flowing, with no Phoenix-specific SDK required. Braintrust&rsquo;s integrations are broad but SDK-specific, requiring API calls to be wrapped through its libraries; Langfuse&rsquo;s native integrations work the same way, though it also accepts standard OTel traces.</p>
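<p>Here is a minimal sketch of that OTel-only path for a custom framework, using the standard OpenTelemetry Python SDK. The endpoint is an assumption (Phoenix&rsquo;s default local collector); any OTLP-compatible backend, including Langfuse&rsquo;s OTel endpoint, can be substituted.</p>
<pre><code class="language-python"># pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-http
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Export spans over plain OTLP/HTTP: no vendor-specific SDK involved.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:6006/v1/traces"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("my-proprietary-framework")  # illustrative name
with tracer.start_as_current_span("agent.plan") as span:
    span.set_attribute("input.value", "user question here")
</code></pre>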
<h2 id="head-to-head-verdict-which-tool-wins-for-each-team-type">Head-to-Head Verdict: Which Tool Wins for Each Team Type</h2>
<p>No single platform wins across all use cases in 2026. The right choice maps directly to your team&rsquo;s architecture, compliance requirements, and engineering bandwidth.</p>
<p><strong>Choose Braintrust if:</strong> You&rsquo;re a product-led AI team (startups, scale-ups, B2B SaaS) that wants a fully managed solution with native CI/CD quality gates. The Notion use case is the template: engineering teams that want evaluation to be automatic, blocking, and integrated into existing deployment workflows without writing infrastructure. The $249/month Pro plan covers most teams that aren&rsquo;t at enterprise scale yet. The SaaS-only model becomes a problem only if legal or compliance prohibits cloud data handling.</p>
<p><strong>Choose Arize Phoenix if:</strong> You&rsquo;re a platform-engineering-led organization that has already standardized on OpenTelemetry and needs strict data residency or on-premise deployment. The 50+ built-in eval metrics are genuinely the best out-of-the-box evaluation library available. For enterprise teams at $50K+/year, Arize AX adds real-time alerting and AI debugging that no other platform matches on the commercial side.</p>
<p><strong>Choose Langfuse if:</strong> You want maximum open-source flexibility with the lowest cost entry point. The MIT license, ClickHouse-backed performance, and $29/month Core plan for unlimited users make it the default for cost-sensitive teams and open-source-native organizations. The ClickHouse acquisition adds long-term performance guarantees that weren&rsquo;t there a year ago.</p>
<h2 id="quick-decision-guide-pick-braintrust-arize-phoenix-or-langfuse-in-60-seconds">Quick Decision Guide: Pick Braintrust, Arize Phoenix, or Langfuse in 60 Seconds</h2>
<p>Picking an AI observability platform in 2026 comes down to three questions answered in sequence: (1) Can you use SaaS, or do you require self-hosting? (2) Do you need CI/CD deployment gates integrated into your build pipeline? (3) What&rsquo;s your monthly budget per team? If SaaS is acceptable and CI/CD quality gates are on your roadmap, start with Braintrust&rsquo;s free tier — it&rsquo;s the fastest path to automated evaluation with zero infrastructure overhead. If self-hosting is required or you&rsquo;re already running OpenTelemetry infrastructure, Arize Phoenix is the correct choice; deploy the OSS library in a day and upgrade to Arize AX only when you need managed alerting and compliance certifications. If cost is the primary constraint and you want MIT-licensed flexibility with production-grade ClickHouse performance, Langfuse Core at $29/month for unlimited users is genuinely hard to beat.</p>
<p>The market will consolidate further — Gartner&rsquo;s prediction of 50% GenAI deployment coverage by 2028 means the tools getting integrated into developer workflows now will become the default infrastructure layer. All three platforms are actively maintained, well-funded (directly or through acquisition), and improving rapidly. The switching cost grows over time as evaluation datasets, prompt versions, and trace history accumulate on a platform. Pick based on your current architecture and compliance reality, not hypothetical future requirements.</p>
<hr>
<h2 id="faq">FAQ</h2>
<p><strong>Q: What is AI agent observability, and why does it matter?</strong>
AI agent observability is the practice of instrumenting, tracing, and evaluating multi-step AI workflows to detect quality degradation, diagnose failures, and prevent regressions. It differs from traditional monitoring because AI traces capture semantic LLM outputs, not just latency and error rates. With 57% of organizations running production agents in 2026, observability is the difference between catching hallucinations proactively and discovering them from customer complaints.</p>
<p><strong>Q: Can I self-host Braintrust for free?</strong>
No. Braintrust is SaaS-only — there is no self-hosted option. All trace data is stored on Braintrust&rsquo;s infrastructure, with HIPAA-compliant VPC deployment available for enterprise contracts. If self-hosting is a requirement, Arize Phoenix (Elastic License 2.0) or Langfuse (MIT) are the alternatives.</p>
<p><strong>Q: What did ClickHouse acquiring Langfuse change for self-hosters?</strong>
For self-hosters, the acquisition changed nothing immediately — Langfuse remains MIT-licensed with no feature restrictions. The benefit is long-term: ClickHouse, whose database now powers Langfuse&rsquo;s v3 backend, has a $15B valuation and deep expertise in high-throughput analytical queries. The ClickHouse migration delivers sub-millisecond query performance on millions of traces, something the previous PostgreSQL-only backend couldn&rsquo;t match at scale.</p>
<p><strong>Q: Does Arize Phoenix work with the Anthropic Claude API?</strong>
Yes. Arize Phoenix provides auto-instrumentation for the Claude Agent SDK and the Anthropic API through its OpenTelemetry-native SDK. Because Phoenix is built on standard OTLP, any Anthropic client library that supports OTel export will work automatically without Phoenix-specific wrappers.</p>
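<p>A minimal sketch of that setup, assuming the <code>arize-phoenix-otel</code> and <code>openinference-instrumentation-anthropic</code> packages and a Phoenix instance on its default local port:</p>
<pre><code class="language-python"># pip install arize-phoenix-otel openinference-instrumentation-anthropic anthropic
from phoenix.otel import register
from openinference.instrumentation.anthropic import AnthropicInstrumentor

# Register an OTel tracer provider pointed at the local Phoenix collector,
# then auto-instrument the Anthropic client library.
tracer_provider = register(project_name="claude-agent")  # project name is illustrative
AnthropicInstrumentor().instrument(tracer_provider=tracer_provider)

# Every subsequent anthropic.Anthropic() client call is now traced automatically.
</code></pre>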
<p><strong>Q: How does Braintrust&rsquo;s CI/CD deployment blocking actually work?</strong>
Braintrust integrates evaluation runs directly into CI/CD pipelines (GitHub Actions, CircleCI, and others). You define evaluation thresholds — for example, &ldquo;faithfulness score must stay above 0.85&rdquo; — and run evaluations against your new prompt version as part of the build process. If scores fall below threshold, the pipeline fails and the deployment is blocked, the same way a failing unit test blocks a merge. This requires defining evaluation criteria and maintaining an evaluation dataset, which the Loop AI assistant can help auto-generate from production traces.</p>
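<p>For intuition, here is a framework-agnostic sketch of the gate pattern itself (Braintrust implements this natively, so the names below are illustrative rather than its actual API):</p>
<pre><code class="language-python"># Sketch of an eval gate in CI: score the new prompt version, fail the build
# below threshold. run_faithfulness_evals is a placeholder, not Braintrust's API.
import sys

THRESHOLD = 0.85

def run_faithfulness_evals():
    # score the candidate prompt against a pinned evaluation dataset
    return 0.82  # illustrative result

score = run_faithfulness_evals()
if score &lt; THRESHOLD:
    print(f"faithfulness {score:.2f} below {THRESHOLD}: blocking deploy")
    sys.exit(1)  # nonzero exit fails the CI job, which blocks the deployment
</code></pre>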
]]></content:encoded></item></channel></rss>