As AI applications move from prototype to production, the gap between what your LLM is doing and what you can actually observe grows dangerously wide. OpenObserve is an open-source, Apache 2.0-licensed observability platform built in Rust that unifies logs, metrics, and traces under a single roof — making it a compelling choice for teams who need full visibility into their AI stack without handing over their data or their budget. In this guide, you’ll get a complete walkthrough of OpenObserve’s LLM monitoring capabilities: from initial setup to cost dashboards, integrations, alerting, and a clear comparison against the major commercial alternatives.
Why LLM Observability Has Become a Business Priority in 2026
The LLM observability market is projected to reach $2.3 billion by 2027, growing at a 68% CAGR according to Lunary Market Research 2026 — and the underlying pressure driving that growth is impossible to ignore. A Gartner AI Adoption Survey from 2025 found that 67% of enterprises cite “lack of visibility into LLM costs and performance” as their top operational challenge with AI. Without proper monitoring, teams are flying blind: they don’t know which prompts are consuming the most tokens, why latency spikes on certain request types, or when a model update silently degrades output quality. The average LLM application in production makes roughly 15,000 API calls per day (Portkey Usage Statistics 2025), each of which represents a cost, a latency event, and a potential failure point. At that volume, a 5% error rate means 750 failed requests every single day. Observability is no longer a “nice to have” — it is the operational foundation that separates teams shipping reliable AI products from those drowning in support tickets and runaway cloud bills.
What Is OpenObserve and Why Does It Matter for AI Teams
OpenObserve is an open-source observability platform that, as of Q1 2026, has accumulated 8,500+ GitHub stars and contributions from 300+ contributors — a strong signal of community adoption and long-term maintenance health. Unlike purpose-built LLM tools that only capture model-level data, OpenObserve gives you a unified view: infrastructure logs, application traces, custom metrics, and LLM-specific signals all in one queryable store. The platform is built in Rust, which is the core reason it can handle 500,000 events per second per node while maintaining a dramatically smaller memory footprint than Java or Go alternatives. OpenObserve processes over 2 petabytes of observability data monthly across 1,200+ organizations according to its 2026 Transparency Report. Licensing is Apache 2.0, meaning you can self-host with no usage caps, no per-seat costs, and no vendor lock-in. For AI teams that generate enormous volumes of trace data from LLM chains, agents, and retrieval pipelines, that scale-at-cost advantage is decisive.
Installing and Configuring OpenObserve for LLM Monitoring
Getting OpenObserve running for LLM workloads takes under 10 minutes with Docker. The fastest path is a single container deployment pointed at a local or cloud object store for data persistence.
# Pull and start OpenObserve
docker run -d \
  --name openobserve \
  -p 5080:5080 \
  -e ZO_ROOT_USER_EMAIL="admin@yourcompany.com" \
  -e ZO_ROOT_USER_PASSWORD="YourSecurePassword123" \
  -v $HOME/openobserve-data:/data \
  public.ecr.aws/zinclabs/openobserve:latest
Once the container is up, navigate to http://localhost:5080 and log in with the credentials you set. For production deployments on Kubernetes, the official Helm chart is the recommended path:
helm repo add openobserve https://charts.openobserve.ai
helm repo update
helm install openobserve openobserve/openobserve \
  --namespace openobserve \
  --create-namespace \
  --set auth.rootUserEmail="admin@yourcompany.com" \
  --set auth.rootUserPassword="YourSecurePassword123" \
  --set storage.type=s3 \
  --set storage.s3.bucketName=your-observability-bucket
For LLM monitoring specifically, you’ll want to create a dedicated organization and stream in OpenObserve’s UI. Navigate to Streams, create a new log stream called llm_traces, and note your ingestion endpoint — typically http://localhost:5080/api/{org_id}/llm_traces/_json. This endpoint accepts JSON-formatted log payloads that you’ll populate from your application instrumentation.
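Before wiring up real instrumentation, it is worth smoke-testing the stream. A minimal sketch in Python, assuming the default organization and the root credentials from the Docker example above:

import time
import requests

# Smoke test: POST one JSON event to the llm_traces stream.
# Adjust the org ID, host, and credentials to match your deployment.
resp = requests.post(
    "http://localhost:5080/api/default/llm_traces/_json",
    json=[{
        "_timestamp": int(time.time() * 1_000_000),  # microseconds
        "event": "smoke_test",
        "model": "none",
    }],
    auth=("admin@yourcompany.com", "YourSecurePassword123"),
    timeout=5,
)
print(resp.status_code, resp.text)  # expect HTTP 200 with a per-record status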
Integrating OpenObserve with LangChain and LlamaIndex
Connecting LangChain to OpenObserve requires a custom callback handler that intercepts chain events and ships them as structured logs. The integration captures token counts, prompt content, completion content, latency, and any errors at each chain step.
import time
import requests
from langchain.callbacks.base import BaseCallbackHandler
from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage

OPENOBSERVE_URL = "http://localhost:5080/api/default/llm_traces/_json"
OPENOBSERVE_AUTH = ("admin@yourcompany.com", "YourSecurePassword123")

class OpenObserveCallbackHandler(BaseCallbackHandler):
    def __init__(self, stream: str = "llm_traces"):
        self.stream = stream
        self._start_times = {}

    def on_llm_start(self, serialized, prompts, **kwargs):
        # Record the wall-clock start for this run so latency can be
        # computed when the completion arrives.
        run_id = str(kwargs.get("run_id", ""))
        self._start_times[run_id] = time.time()

    def on_llm_end(self, response, **kwargs):
        run_id = str(kwargs.get("run_id", ""))
        latency_ms = (time.time() - self._start_times.pop(run_id, time.time())) * 1000
        # token_usage and model_name live in llm_output, which some
        # providers leave as None; usage is reported once per response.
        llm_output = response.llm_output or {}
        usage = llm_output.get("token_usage", {})
        payload = []
        for generation in response.generations:
            for gen in generation:
                info = gen.generation_info or {}
                payload.append({
                    "_timestamp": int(time.time() * 1_000_000),  # microseconds
                    "event": "llm_completion",
                    "model": llm_output.get("model_name", "unknown"),
                    "prompt_tokens": usage.get("prompt_tokens", 0),
                    "completion_tokens": usage.get("completion_tokens", 0),
                    "total_tokens": usage.get("total_tokens", 0),
                    "latency_ms": round(latency_ms, 2),
                    "finish_reason": info.get("finish_reason", "unknown"),
                    "content_preview": gen.text[:200],
                })
        if payload:
            # One POST per response keeps ingestion overhead low.
            requests.post(OPENOBSERVE_URL, json=payload, auth=OPENOBSERVE_AUTH, timeout=5)

    def on_llm_error(self, error, **kwargs):
        self._start_times.pop(str(kwargs.get("run_id", "")), None)
        payload = [{
            "_timestamp": int(time.time() * 1_000_000),
            "event": "llm_error",
            "error_type": type(error).__name__,
            "error_message": str(error),
        }]
        requests.post(OPENOBSERVE_URL, json=payload, auth=OPENOBSERVE_AUTH, timeout=5)

# Usage
handler = OpenObserveCallbackHandler()
llm = ChatOpenAI(model="gpt-4o", callbacks=[handler])
response = llm.invoke([HumanMessage(content="Summarize the GDPR in three bullet points.")])
For LlamaIndex, the integration follows the same pattern using LlamaIndex’s CallbackManager and CBEventType. Attach the callback manager at startup (via the global Settings object in recent LlamaIndex releases; older versions used ServiceContext), and every query, retrieval, and synthesis event flows automatically to OpenObserve with token counts and timing data attached.
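A minimal sketch of such a handler, assuming the BaseCallbackHandler interface from llama_index.core (method names have shifted across versions, so verify against yours):

import time
import requests
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager
from llama_index.core.callbacks.base_handler import BaseCallbackHandler

class OpenObserveLlamaHandler(BaseCallbackHandler):
    def __init__(self):
        super().__init__(event_starts_to_ignore=[], event_ends_to_ignore=[])
        self._starts = {}

    def on_event_start(self, event_type, payload=None, event_id="", parent_id="", **kwargs):
        self._starts[event_id] = time.time()
        return event_id

    def on_event_end(self, event_type, payload=None, event_id="", **kwargs):
        latency_ms = (time.time() - self._starts.pop(event_id, time.time())) * 1000
        # Token counts for LLM events can additionally be pulled from the
        # payload dict before shipping.
        requests.post(
            "http://localhost:5080/api/default/llm_traces/_json",
            json=[{
                "_timestamp": int(time.time() * 1_000_000),
                "event": event_type.value,  # e.g. "llm", "retrieve", "synthesize"
                "latency_ms": round(latency_ms, 2),
            }],
            auth=("admin@yourcompany.com", "YourSecurePassword123"),
            timeout=5,
        )

    def start_trace(self, trace_id=None):
        pass

    def end_trace(self, trace_id=None, trace_map=None):
        pass

# Register globally so every query, retrieval, and synthesis event is captured.
Settings.callback_manager = CallbackManager([OpenObserveLlamaHandler()])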
Monitoring LLM Costs, Token Usage, and Latency
Cost visibility is the single most impactful thing a monitoring layer can provide for LLM applications. Teams using dedicated LLM monitoring tools report a 40-60% reduction in unexpected API costs according to Helicone’s Customer Survey 2025 — and without visibility, 53% of AI teams spend more than 20 hours monthly just manually tracking LLM expenses (LangWatch Developer Survey 2026). OpenObserve makes cost tracking concrete by letting you compute spend directly from token logs using SQL-style queries against your llm_traces stream.
In OpenObserve’s Logs view, run this query to calculate daily GPT-4o spend:
SELECT
  DATE_TRUNC('day', TO_TIMESTAMP(_timestamp / 1000000)) AS day,
  SUM(prompt_tokens) AS total_prompt_tokens,
  SUM(completion_tokens) AS total_completion_tokens,
  -- gpt-4o list pricing: $2.50 per 1M input tokens, $10 per 1M output tokens
  SUM(prompt_tokens) * 0.0000025 + SUM(completion_tokens) * 0.00001 AS estimated_cost_usd
FROM "llm_traces"
WHERE model = 'gpt-4o'
  AND _timestamp >= NOW() - INTERVAL '30 days'
GROUP BY day
ORDER BY day DESC
For latency monitoring, create a dashboard panel using a histogram query:
SELECT
  histogram(_timestamp, '1 hour') AS time_bucket,
  AVG(latency_ms) AS avg_latency,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95_latency,
  PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99_latency,
  COUNT(*) AS request_count
FROM "llm_traces"
WHERE event = 'llm_completion'
GROUP BY time_bucket
ORDER BY time_bucket
These queries form the backbone of your cost and performance dashboard. Pin them to a dedicated “LLM Operations” dashboard in OpenObserve and set the refresh interval to 5 minutes for near-real-time awareness.
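The same queries can also be driven programmatically, for example from a daily reporting job. A sketch that runs the cost query through OpenObserve’s search API; the _search endpoint shape here is an assumption based on the current API docs, so verify it against your deployed version:

import time
import requests

# Daily gpt-4o spend over the last 24 hours, using the same pricing
# arithmetic as the dashboard query above.
SQL = """
SELECT SUM(prompt_tokens) * 0.0000025
     + SUM(completion_tokens) * 0.00001 AS estimated_cost_usd
FROM "llm_traces"
WHERE model = 'gpt-4o'
"""

now_us = int(time.time() * 1_000_000)
resp = requests.post(
    "http://localhost:5080/api/default/_search",
    json={"query": {
        "sql": SQL,
        "start_time": now_us - 24 * 3600 * 1_000_000,  # microseconds
        "end_time": now_us,
    }},
    auth=("admin@yourcompany.com", "YourSecurePassword123"),
    timeout=10,
)
print(resp.json().get("hits", []))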
Setting Up Alerts for AI Application Reliability
Proactive alerting is what separates a monitoring setup from a true observability practice. OpenObserve supports scheduled and real-time alerts that fire when query thresholds are breached, with notification delivery to Slack, PagerDuty, email, and any webhook target. For LLM applications, you need at minimum four alert types: error rate spikes, latency degradation, cost anomalies, and token quota proximity.
To configure an error rate alert, navigate to Alerts in the OpenObserve UI, create a new scheduled alert, and use this query:
SELECT
COUNT(*) FILTER (WHERE event = 'llm_error') * 100.0 / COUNT(*) AS error_rate_pct
FROM "llm_traces"
WHERE _timestamp >= NOW() - INTERVAL '15 minutes'
Set the condition to trigger when error_rate_pct > 5 and assign a severity of Critical. For cost anomalies, a daily budget alert that fires when estimated spend crosses 80% of your monthly allocation gives your team time to investigate before an incident becomes a budget overrun. OpenObserve alert templates support Jinja2-style variable substitution, so your Slack notifications can include the actual metric value, the threshold that was crossed, and a deep link back to the relevant dashboard panel — making it easy for on-call engineers to jump straight to the data without navigating through the UI.
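For the cost anomaly alert, the same pricing arithmetic from the cost query above can drive a scheduled check. A minimal sketch, assuming a hypothetical $500 monthly budget, 80% of which prorates to roughly $13.30 per day:

SELECT
  SUM(prompt_tokens) * 0.0000025
    + SUM(completion_tokens) * 0.00001 AS spend_last_24h_usd
FROM "llm_traces"
WHERE model = 'gpt-4o'
  AND _timestamp >= NOW() - INTERVAL '24 hours'

Trigger when spend_last_24h_usd exceeds 13.30, or adjust the threshold to your own budget allocation.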
OpenObserve vs. Helicone, LangSmith, Portkey, and Langfuse
Choosing between OpenObserve and purpose-built LLM observability platforms is fundamentally a question of what you’re optimizing for: breadth and control versus depth and LLM-specific features out of the box.
Helicone is a proxy-based LLM gateway that sits between your application and any LLM provider, capturing every request and response automatically with zero code changes. It has excellent cost dashboards and prompt management, and its core is open source (Apache 2.0) with a self-hosted deployment path. That said, most teams use the hosted gateway, and routing production LLM traffic through a third-party proxy remains a non-starter for many enterprise security teams.
LangSmith (from LangChain) provides deep tracing for LangChain-native applications, including step-by-step chain visualization and dataset management for evaluation. It’s the best tool for LangChain-heavy shops, but it has no story for infrastructure observability, and its pricing scales quickly with trace volume.
Portkey functions as an AI gateway with built-in load balancing, fallback routing, and observability. It’s strong on operational resilience features; its gateway component is open source, but the full observability platform is a commercial SaaS product.
Langfuse is the closest open-source competitor to OpenObserve for LLM-specific use cases. It offers session tracking, prompt versioning, and human annotation workflows that OpenObserve doesn’t have natively. For teams whose observability requirements are exclusively LLM-focused, Langfuse may be the better fit.
OpenObserve wins when your requirements include unified infrastructure and LLM observability, strict data residency requirements, high-volume ingest at low cost, or a Rust-based performance profile. It does not yet have native prompt versioning or annotation workflows, so teams needing those features may need to combine OpenObserve with a lightweight prompt management tool.
| Feature | OpenObserve | Helicone | LangSmith | Portkey | Langfuse |
|---|---|---|---|---|---|
| Open source | Yes (Apache 2.0) | Yes (Apache 2.0) | No | Gateway only | Yes (MIT) |
| Self-hosted | Yes | Yes | Enterprise only | Gateway only | Yes |
| Infra + LLM unified | Yes | No | No | No | No |
| Prompt versioning | No | Yes | Yes | Yes | Yes |
| Cost tracking | Yes (custom) | Yes (native) | Limited | Yes (native) | Yes (native) |
| Scale (events/sec/node) | 500,000 | N/A | N/A | N/A | ~10,000 |
Enterprise Features: RBAC, Multi-Tenancy, and Data Retention
Enterprise deployments demand controls that go beyond basic metric collection. OpenObserve ships with role-based access control (RBAC), multi-tenancy through organization isolation, and configurable data retention policies — all available in the open-source tier, not gated behind a commercial license. For AI teams operating in regulated industries, these features are frequently the deciding factor in platform selection. RBAC in OpenObserve lets you assign viewer, editor, and admin roles at the organization level, and create custom roles with fine-grained permissions scoped to specific streams or dashboards. A common pattern is granting finance stakeholders read-only access to cost dashboards while restricting prompt content to ML engineers only.
Multi-tenancy is handled through OpenObserve’s organization model: each organization gets its own data namespace, ingestion endpoints, and user management. This maps cleanly to multi-team AI platforms where different product teams share infrastructure but should not have visibility into each other’s LLM traces.
Data retention is configured per-stream via the stream settings UI. A typical LLM monitoring configuration might retain raw trace logs for 30 days (balancing storage cost against debugging needs), aggregate cost metrics for 12 months, and error logs for 90 days. OpenObserve’s partitioned storage model on S3-compatible object stores means retention is almost entirely a storage cost question rather than a licensing question — you pay for bytes stored, not for a retention tier.
Self-Hosted vs. SaaS Cost Analysis
Running OpenObserve self-hosted versus subscribing to a commercial SaaS LLM monitoring tool is a calculation that changes dramatically depending on your data volume. For a team ingesting 100GB of trace data per month (roughly the volume generated by an application making 15,000 API calls per day), the cost landscape is stark. Commercial platforms like LangSmith charge on a per-trace basis; at 15,000 traces per day, a rate on the order of $0.30 per 1,000 traces works out to $4.50 per day, or $135 per month, before any premium features. Helicone’s Team plan starts at $80 per month for up to 1 million requests, with costs scaling beyond that.
Self-hosting OpenObserve on a single AWS EC2 c6i.2xlarge instance (8 vCPU, 16GB RAM, $0.34/hour on-demand) plus S3 storage at $0.023/GB runs approximately $245 per month in compute plus $2.30 in storage for 100GB — totaling around $247/month. At first glance that appears more expensive, but this estimate covers the entire observability stack: LLM traces, application logs, infrastructure metrics, and distributed traces. Commercial LLM monitoring tools don’t replace your existing observability stack; you’d pay their fees on top of whatever you currently spend on Datadog, Grafana Cloud, or New Relic. Teams processing over 500GB/month of observability data — common for production AI applications with retrieval pipelines and agent loops — typically find OpenObserve costs 60-75% less than equivalent commercial coverage across the full stack. Reserved instance pricing on AWS reduces the compute cost further to approximately $155/month on a one-year commitment.
FAQ
OpenObserve is a capable platform for LLM monitoring, but teams evaluating it alongside LangSmith, Langfuse, and Portkey consistently encounter the same practical questions: can it replace a purpose-built LLM observability tool, what does self-hosted infrastructure actually cost, and which integration path is fastest for their specific stack? With 8,500+ GitHub stars and deployments processing over 2PB of observability data monthly across 1,200+ organizations, OpenObserve has reached production maturity; the key question is fit, not capability. And with the LLM observability market growing at a 68% CAGR toward a projected $2.3B by 2027, teams that get their monitoring stack right in 2026 are building on a foundation that will need to scale significantly. The answers below address the most common decision points for teams considering OpenObserve specifically for LLM application monitoring, based on real deployment patterns and the platform’s current feature set as of Q2 2026.
How does OpenObserve handle OpenAI SDK instrumentation directly, without LangChain?
You can instrument the OpenAI Python SDK directly using a wrapper function that captures request and response metadata before sending it to OpenObserve. Create a thin wrapper around openai.chat.completions.create that records _timestamp, model name, token counts from response.usage, latency in milliseconds, and any exception details. POST the resulting dict as a JSON array to your OpenObserve ingestion endpoint. This approach gives you identical visibility to the LangChain callback handler without the LangChain dependency, and it works equally well with the Anthropic, Mistral, and Cohere SDKs.
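A minimal sketch of that wrapper, assuming the openai Python SDK v1 client; the field names mirror the LangChain handler above so both paths share one llm_traces schema:

import time
import requests
from openai import OpenAI

client = OpenAI()

def logged_completion(**kwargs):
    # Thin wrapper around chat.completions.create that records usage,
    # latency, and errors, then ships one event to OpenObserve.
    start = time.time()
    record = {"_timestamp": int(time.time() * 1_000_000),
              "event": "llm_completion",
              "model": kwargs.get("model", "unknown")}
    try:
        response = client.chat.completions.create(**kwargs)
        record.update({
            "prompt_tokens": response.usage.prompt_tokens,
            "completion_tokens": response.usage.completion_tokens,
            "total_tokens": response.usage.total_tokens,
            "latency_ms": round((time.time() - start) * 1000, 2),
        })
        return response
    except Exception as exc:
        record.update({"event": "llm_error",
                       "error_type": type(exc).__name__,
                       "error_message": str(exc)})
        raise
    finally:
        # finally runs on both success and failure, so every call is logged.
        requests.post("http://localhost:5080/api/default/llm_traces/_json",
                      json=[record],
                      auth=("admin@yourcompany.com", "YourSecurePassword123"),
                      timeout=5)

# Usage
reply = logged_completion(model="gpt-4o",
                          messages=[{"role": "user", "content": "Hello"}])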
Can OpenObserve monitor LLM agents and multi-step pipelines, not just single API calls?
Yes. The key is using a shared session_id or trace_id field across all log entries belonging to the same agent run. When you start an agent execution, generate a UUID and pass it through every LLM call, tool call, and retrieval step as a structured field. In OpenObserve, you can then filter by that ID to reconstruct the full execution trace and see token consumption, latency, and errors at each step. This is more manual than LangSmith’s native chain visualization, but the data model is identical once you’re disciplined about propagating the trace ID.
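A sketch of the propagation pattern, with hypothetical step names and the endpoint constants from the LangChain example:

import time
import uuid
import requests

OPENOBSERVE_URL = "http://localhost:5080/api/default/llm_traces/_json"
OPENOBSERVE_AUTH = ("admin@yourcompany.com", "YourSecurePassword123")

# One UUID per agent run, attached to every event in that run.
trace_id = str(uuid.uuid4())

def log_step(step, **fields):
    # Hypothetical helper: each agent step emits one structured event
    # carrying the shared trace_id.
    payload = [{"_timestamp": int(time.time() * 1_000_000),
                "trace_id": trace_id, "step": step, **fields}]
    requests.post(OPENOBSERVE_URL, json=payload, auth=OPENOBSERVE_AUTH, timeout=5)

log_step("plan", model="gpt-4o", total_tokens=812)
log_step("tool_call", tool="web_search", latency_ms=431.7)
log_step("synthesize", model="gpt-4o", total_tokens=1290)

# Reconstruct the run in OpenObserve with:
#   SELECT * FROM "llm_traces" WHERE trace_id = '<uuid>' ORDER BY _timestamp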
What are the hardware requirements for running OpenObserve at production LLM monitoring scale?
OpenObserve’s Rust-based architecture is notably efficient. For a team generating up to 10GB of trace data per day, a single c6i.xlarge instance (4 vCPU, 8GB RAM) with S3 as the object store is sufficient. At 50GB/day, move to a c6i.4xlarge (16 vCPU, 32GB RAM) or a three-node cluster for high availability. OpenObserve’s stateless query nodes and stateful ingestor nodes can be scaled independently, which means you can add query capacity for analytics-heavy workloads without proportionally increasing ingest infrastructure. The 500,000 events/second per node benchmark from OpenObserve’s performance documentation gives you headroom well beyond what most single-application LLM monitoring use cases require.
How does OpenObserve compare to running the ELK stack for LLM monitoring?
Elasticsearch + Logstash + Kibana (ELK) is the legacy choice for log aggregation, but it carries significant operational overhead: Elasticsearch is memory-hungry, schema management is complex, and costs scale steeply with data volume. OpenObserve was explicitly designed to be a lower-complexity, lower-cost alternative. In OpenObserve’s own benchmarks, it uses roughly 140x less storage than Elasticsearch for equivalent datasets due to its columnar compression. For LLM monitoring specifically, OpenObserve’s SQL query interface is more familiar to data-oriented ML engineers than KQL (Kibana Query Language), which lowers the barrier to writing cost and performance queries. If you’re already running ELK and considering a migration, OpenObserve accepts OpenTelemetry and Fluentd payloads natively, so you can run both in parallel during a transition period.
Is OpenObserve suitable for monitoring fine-tuned or self-hosted LLMs, not just cloud API providers?
Absolutely. Since OpenObserve monitors at the application layer — capturing whatever structured data your code sends to it — it works identically for self-hosted models running on vLLM, Ollama, or TGI (Text Generation Inference) as it does for OpenAI or Anthropic API calls. For self-hosted models, you’ll instrument your inference server to emit token counts, TTFT (time to first token), total generation time, and GPU utilization metrics, then forward those to OpenObserve. OpenObserve can ingest Prometheus metrics from vLLM’s built-in /metrics endpoint directly, giving you both application-level trace data and infrastructure-level GPU and memory metrics in the same dashboards.
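As a sketch, a Prometheus configuration that scrapes vLLM and forwards to OpenObserve might look like the following; the remote-write path is an assumption drawn from OpenObserve’s Prometheus support, so confirm it against your version’s docs:

# prometheus.yml (sketch): scrape vLLM's /metrics and remote-write the
# samples to OpenObserve. Ports, paths, and credentials are assumptions;
# adjust to your deployment.
scrape_configs:
  - job_name: vllm
    static_configs:
      - targets: ["localhost:8000"]   # vLLM serves /metrics on its API port
remote_write:
  - url: http://localhost:5080/api/default/prometheus/api/v1/write
    basic_auth:
      username: admin@yourcompany.com
      password: YourSecurePassword123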
