Observability for LLM Apps 2026: Tracing, Metrics, Debugging

LLM observability is the discipline that turns opaque AI applications into systems you can actually debug, monitor, and improve. In 2026, observability has graduated from “nice-to-have” to “the reason a feature can ship to production” — without traces, you cannot diagnose why a customer’s specific interaction failed; without metrics, you cannot detect quality drift before customers complain; without integration into your existing observability stack, you cannot meet enterprise SLOs. The companies running reliable AI features in 2026 have invested heavily in observability infrastructure: structured traces that capture every step of every request, metrics that surface latency and quality and cost continuously, integrations with their existing APM (Datadog, Honeycomb, New Relic), and dedicated LLM-aware tools (Langfuse, Phoenix, LangSmith, Arize) that handle the LLM-specific quirks general APM doesn’t. The companies still pretending observability is optional are the ones whose support teams field “why did the AI tell me X?” tickets they can’t answer. This eguide is the comprehensive playbook for LLM observability in 2026 — the taxonomy, the tools, the integrations, and the operational practices that make AI applications observable, debuggable, and improvable.

Table of Contents

  1. Why LLM observability matters in 2026 — the production reliability story
  2. The observability taxonomy — traces, metrics, logs, evals
  3. Tracing fundamentals — spans, contexts, propagation
  4. OpenTelemetry for LLM apps
  5. Observability tool landscape — Langfuse, Phoenix, LangSmith, Arize, generic APM
  6. What to instrument — prompts, retrieval, tools, outputs, costs
  7. Trace inspection and debugging workflows
  8. Metrics for LLM apps — latency, cost, quality, error
  9. Output classifiers and quality monitoring
  10. Multi-agent and multi-step observability
  11. Streaming responses and partial state observability
  12. Privacy and PII handling in traces
  13. Alerting and incident response
  14. Observability + Evals + FinOps — the production triad
  15. Common observability mistakes
  16. FAQ

Chapter 1: Why LLM observability matters in 2026 — the production reliability story

The case for LLM observability used to require argument. Two years ago, most LLM-powered features were experimental enough that “look at the chat log if something seems off” was a serviceable debugging strategy. That world is gone. In 2026, production AI features serve millions of requests per day across customer support, coding assistants, search interfaces, agent workflows, and embedded features in dozens of other product categories. When something goes wrong — a customer complaint, a regression after a model update, a cost spike, a latency event — the team needs to answer specific questions quickly: which request failed; what was in the prompt; what was retrieved; what tools fired; what the model output was; how confident the model was; what the cost was; how long it took. Without LLM observability, these questions take hours or days. With it, they take minutes.

Three forces have made observability essential. First, scale. A feature serving 10K conversations per day produces 10K traces. A team can’t manually inspect them; they need search, filter, alerting, and aggregation. Tools designed for traditional APM (Datadog, New Relic, Honeycomb) capture some of this signal but miss LLM-specific dimensions (prompt versions, retrieved context, model versions, token counts). Tools designed for LLMs (Langfuse, Phoenix, LangSmith) handle the LLM dimensions but historically lacked the operational depth of full APM. The 2026 standard combines both — an LLM-aware tool integrated into the team’s existing APM via OpenTelemetry.

Second, the agent revolution. Single-shot chat is easy to debug; multi-step agent workflows are not. An agent that called tools five times to complete a task produces a complex tree of spans that requires structured tracing to make sense of. Without it, debugging an agent failure is detective work that can take hours. With proper tracing, you click the failing trace and walk through each step in seconds. The shift to agents has made the absence of observability immediately painful in a way the chat era didn’t.

Third, regulatory and compliance pressure. The EU AI Act, NIST AI RMF, ISO 42001, and various sector-specific rules increasingly require demonstrable audit trails for AI system behavior. “We have logs somewhere” doesn’t satisfy auditors. “Every interaction produces a structured trace stored for the contractual retention period, queryable by request ID, with field-level lineage from input to output” does. Observability infrastructure is increasingly the evidence base for compliance, not just an internal debugging tool.

What good LLM observability looks like in 2026 has converged across leading teams. End-to-end traces for every request, tying together input, retrieval, prompt construction, model calls, tool invocations, output processing, and final response. Structured logs alongside the trace with consistent field schemas. Latency, cost, error rate, and quality metrics aggregated and dashboarded. Alerts on anomalies (latency spikes, cost spikes, error rate, quality regression). Integration with the broader observability stack so AI traces correlate with infrastructure traces. PII handling that respects privacy constraints. Sampling rates that balance fidelity against cost. Retention policies aligned with compliance requirements.

The audiences for this eguide are ML platform leaders building LLM observability infrastructure, application engineers wiring up observability for their specific features, SREs extending their existing observability practice into AI workloads, security and compliance partners verifying that observability meets audit needs, and AI product leaders trying to understand what’s broken when users complain. The patterns described here are not specific to any one model family — they apply equally to Claude, GPT, Gemini, Llama, Mistral, and open-weight self-hosted models — though specific tooling and integrations vary.

One framing note before diving in. LLM observability isn’t a separate discipline from general application observability — it’s an extension. The teams that succeed treat LLM traces, metrics, and logs as first-class citizens in the same observability platform that captures their HTTP requests, database queries, and infrastructure events. The teams that struggle silo LLM observability into a separate tool that doesn’t talk to anything else; when something breaks in the broader system, they have to correlate manually. Pick tools that integrate, not tools that wall themselves off.

A second framing note about the maturity curve. Most organizations move through recognizable stages. Stage 0: print debugging — the team has no observability and reads chat logs when something goes wrong. Stage 1: basic logging — structured logs capture each interaction but with no trace structure. Stage 2: tracing — OTel or proprietary tracing captures spans across the request lifecycle. Stage 3: integration — traces feed dashboards, alerts, and eval pipelines. Stage 4: AI-assisted operations — observability data is mined for patterns automatically, with LLM-powered summaries of recent incidents. Most teams in 2026 sit at stage 2 or 3; the leading teams are pushing into stage 4. The progression usually takes 6-12 months and is the most reliable predictor of which AI products will ship reliably at scale.

The economics of observability deserve a note. Observability infrastructure costs money — storage, compute for query, vendor subscriptions, engineering time. The right framing is not “observability is expensive” but “observability is dramatically cheaper than the alternatives”. The cost of a single production incident debugged without traces (engineer hours, customer trust loss, possible SLA breach) typically exceeds a year of observability spend. Mature teams have explicit budget for observability as a fraction of total platform cost; 5-10% is a common range for AI-heavy products in 2026.

Chapter 2: The observability taxonomy — traces, metrics, logs, evals

Observability has a standard three-pillar model: traces, metrics, logs. LLM observability adds a fourth pillar — evals — that’s specific to AI systems. Understanding what each pillar captures and when to use which keeps the system coherent.

Pillar | What it captures | Cardinality | Retention | Use case
Traces | End-to-end request flows with timing and context | One per request | Days to weeks | Debugging specific failures, root-cause analysis
Metrics | Aggregated numerical signals (latency, count, cost) | Cardinality-limited | Months to years | SLO monitoring, capacity planning, trends
Logs | Structured event records, often associated with traces | High (every event) | Days to months | Search by content, freeform investigation
Evals | Quality scores against curated datasets or live samples | Per case per run | Indefinite | Regression detection, benchmarking, gating

Traces are the dominant tool for LLM debugging. A trace captures the full life of a request — when it arrived, what happened in each phase, how long each phase took, what data flowed between phases, and how it concluded. For an LLM application, a trace typically includes spans for: request ingestion; authentication and permission check; retrieval (vector store query); prompt construction; LLM call; tool invocations (potentially multiple); output processing; response delivery. Each span has structured fields (timing, tokens, model name, status) plus optional attributes (cache hits, parameter values).

Metrics summarize what’s happening in aggregate: total requests per second, p50/p95/p99 latency, error rate, cost per request, cache hit rate. Metrics are cheap to store and query (you can keep years of metric history at low cost), but they’re aggregated: you can see that p95 latency increased without knowing which specific requests caused the increase. Metrics drive dashboards and alerts; traces drive root-cause analysis.
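
To make the aggregation point concrete, here is a minimal sketch of how latency percentiles might be computed from raw samples. The latency values are invented for illustration, and the nearest-rank method is only one of several percentile definitions; a metrics backend may interpolate differently.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical per-request latencies in milliseconds
latencies_ms = [120, 135, 150, 160, 180, 210, 240, 300, 950, 4200]

summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
print(summary)
```

The summary shows that the tail is slow but not which request produced the 4200 ms sample; answering that requires the trace.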

Logs capture freeform structured events. For LLM systems, logs typically duplicate some of what’s in traces (the input, the output, the model version) but with looser structure that makes ad-hoc querying easier. Many teams treat logs and traces as overlapping rather than separate — the same data is captured, but logs go to a search tool (Elastic, Loki) for free-text queries while traces go to a tracing tool (Jaeger, Tempo, Honeycomb) for waterfall views.

Evals are unique to ML and AI systems. Evals capture quality scores against curated datasets (formal evals, see the LLM Evals 2026 eguide) or against live production samples. The eval pillar overlaps with observability — you can think of evals as “metrics that measure quality, not just performance.” In practice, evals run on a separate cadence and produce different artifacts than observability data, so it’s clearer to treat them as a distinct pillar that integrates with the other three.

The four pillars work together. A user complains about response quality. You search logs by user ID to find their conversation. You click into the trace for the specific request to see what happened step by step. You correlate against latency metrics to see if something was unusually slow. You run the inputs through your eval suite to confirm whether the model’s response was an outlier. Each pillar answers a different question; together they let you debug end-to-end.

Beyond the four pillars, several auxiliary signals matter for LLM apps. Tool-call outputs (when an agent invokes a tool, the tool’s response is part of the agent’s reasoning chain and should be observable). Prompt history (which prompt version was used for each call, especially during prompt experimentation). User feedback signals (thumbs up/down, conversation drop-off, repeat queries). Cost attribution (per-call cost, per-feature roll-up, per-team chargeback). Each of these complements the four core pillars and adds dimensions that LLM apps need that traditional applications don’t.

The integration between pillars produces compounding value. Traces tagged with user IDs correlate with user-feedback events; you can ask “which traces correspond to negative feedback?”. Metrics aggregating output classifier scores feed alerts; alerts trigger investigations that drill into specific traces. Eval runs sampled from production produce diffs against historical baselines; the diffs show up in observability dashboards. The system as a whole is observable in a way no single pillar provides on its own.
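
As a rough sketch of that cross-pillar query, the snippet below joins trace records with feedback events on a shared trace ID. The record shapes and field names (trace_id, rating, prompt_version) are assumptions made for illustration; in practice this would be a query in your observability backend rather than in-memory Python.

```python
from collections import defaultdict

# Hypothetical trace summaries and user-feedback events
traces = [
    {"trace_id": "t1", "prompt_version": "v16", "latency_ms": 900},
    {"trace_id": "t2", "prompt_version": "v17", "latency_ms": 2100},
    {"trace_id": "t3", "prompt_version": "v17", "latency_ms": 1800},
]
feedback = [
    {"trace_id": "t2", "rating": "down"},
    {"trace_id": "t1", "rating": "up"},
    {"trace_id": "t3", "rating": "down"},
]

def traces_with_negative_feedback(traces, feedback):
    """Return the traces whose trace_id received a thumbs-down."""
    negative_ids = {f["trace_id"] for f in feedback if f["rating"] == "down"}
    return [t for t in traces if t["trace_id"] in negative_ids]

def negative_count_by(traces, feedback, key):
    """Count negatively rated traces grouped by one trace attribute."""
    counts = defaultdict(int)
    for t in traces_with_negative_feedback(traces, feedback):
        counts[t[key]] += 1
    return dict(counts)

print(negative_count_by(traces, feedback, "prompt_version"))
```

Grouping by prompt version is one example; the same join works for model version, tenant, or any other attribute recorded on the trace.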

# Example structured trace event for an LLM request
{
  "trace_id": "abc123",
  "span_id": "span001",
  "parent_span_id": null,
  "name": "process_user_request",
  "timestamp": "2026-05-20T14:30:14.123Z",
  "duration_ms": 1842,
  "attributes": {
    "user_id_hash": "sha256:...",
    "tenant_id": "corp.example",
    "endpoint": "/api/chat",
    "model": "claude-opus-4-7",
    "prompt_version": "kb_lookup_v17",
    "input_tokens": 1245,
    "output_tokens": 312,
    "cached_input_tokens": 980,
    "cost_usd": 0.0124,
    "tools_called": ["search_kb"],
    "status": "ok"
  },
  "events": [
    {"name": "retrieval_complete", "ts": "+45ms", "attrs": {"docs_returned": 5}},
    {"name": "llm_first_token", "ts": "+312ms"},
    {"name": "llm_complete", "ts": "+1820ms", "attrs": {"finish_reason": "stop"}}
  ]
}
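
One use of the events array is deriving per-phase timings from the relative offsets. The sketch below assumes the "+NNNms" offset format shown in the example event and that offsets are measured from span start; the output field names are illustrative, not a standard.

```python
import re

# The example span, reduced to the fields used here
span = {
    "duration_ms": 1842,
    "events": [
        {"name": "retrieval_complete", "ts": "+45ms"},
        {"name": "llm_first_token", "ts": "+312ms"},
        {"name": "llm_complete", "ts": "+1820ms"},
    ],
}

_OFFSET = re.compile(r"^\+(\d+)ms$")

def offset_ms(ts: str) -> int:
    """Parse a relative offset such as '+312ms' into integer milliseconds."""
    match = _OFFSET.match(ts)
    if match is None:
        raise ValueError(f"unrecognized offset: {ts!r}")
    return int(match.group(1))

def derive_timings(span: dict) -> dict:
    """Split the span duration into phases using the event offsets."""
    at = {e["name"]: offset_ms(e["ts"]) for e in span["events"]}
    return {
        "retrieval_ms": at["retrieval_complete"],
        "retrieval_to_first_token_ms": at["llm_first_token"] - at["retrieval_complete"],
        "generation_ms": at["llm_complete"] - at["llm_first_token"],
        "post_processing_ms": span["duration_ms"] - at["llm_complete"],
    }

timings = derive_timings(span)
print(timings)
```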

Chapter 3: Tracing fundamentals — spans, contexts, propagation

Tracing is the most useful LLM observability tool and the one most worth investing in first. A trace is composed of spans; each span represents a unit of work (a function call, an external API call, a database query). Spans nest to form a tree showing the call hierarchy. Context propagation carries the trace identity across process boundaries so an end-to-end request can be reconstructed from independent service traces.

Spans for LLM applications. The canonical span types you want to capture:

  • Request span — the root span, capturing the entire request lifecycle
  • Retrieval spans — one per vector store or knowledge base query
  • Prompt construction span — assembling system prompt, context, user message
  • LLM call span — the actual API call to the model provider
  • Tool invocation spans — one per tool the agent calls
  • Output processing span — parsing, validation, output filtering
  • Response delivery span — sending the response back to the user

For agent workflows, the span tree gets deeper. A planning span contains LLM call spans for the planner; sub-agent spans contain their own LLM and tool spans; the orchestrator span contains all of the above. The tree structure makes it easy to see at a glance how the agent decomposed and executed the task.
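
The tree view that tracing tools render can be reconstructed from flat span records using only the parent pointer. This is a minimal sketch with invented span data; the field names follow the trace event example earlier (span_id, parent_span_id), and start_ms is an assumed field used for ordering siblings.

```python
from collections import defaultdict

# Hypothetical flat span records for one agent run, in arbitrary order
spans = [
    {"span_id": "s1", "parent_span_id": None, "name": "agent.orchestrate", "start_ms": 0, "duration_ms": 4200},
    {"span_id": "s2", "parent_span_id": "s1", "name": "agent.plan", "start_ms": 5, "duration_ms": 900},
    {"span_id": "s3", "parent_span_id": "s2", "name": "llm.call", "start_ms": 10, "duration_ms": 880},
    {"span_id": "s4", "parent_span_id": "s1", "name": "agent.sub_agent", "start_ms": 910, "duration_ms": 3200},
    {"span_id": "s6", "parent_span_id": "s4", "name": "tool.search_kb", "start_ms": 1900, "duration_ms": 400},
    {"span_id": "s5", "parent_span_id": "s4", "name": "llm.call", "start_ms": 915, "duration_ms": 950},
]

def render_tree(spans):
    """Return indented lines showing the span hierarchy, siblings ordered by start time."""
    children = defaultdict(list)
    for s in spans:
        children[s["parent_span_id"]].append(s)

    lines = []

    def walk(parent_id, depth):
        for s in sorted(children[parent_id], key=lambda s: s["start_ms"]):
            lines.append("  " * depth + f'{s["name"]} ({s["duration_ms"]} ms)')
            walk(s["span_id"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines

print("\n".join(render_tree(spans)))
```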

# Instrumenting a Python LLM application with OpenTelemetry
from opentelemetry import trace
from opentelemetry.trace import SpanKind, Status, StatusCode

tracer = trace.get_tracer(__name__)

def handle_request(user_input, user_id):
    with tracer.start_as_current_span(
        "process_user_request",
        kind=SpanKind.SERVER,
        attributes={
            "user.id": user_id,  # consider hashing before recording if IDs are sensitive
            "input.length": len(user_input),
        }
    ) as request_span:
        try:
            with tracer.start_as_current_span("retrieval") as retrieval_span:
                docs = retriever.search(user_input)
                retrieval_span.set_attribute("docs.count", len(docs))

            with tracer.start_as_current_span("prompt_construction"):
                prompt = build_prompt(user_input, docs)

            with tracer.start_as_current_span("llm_call",
                attributes={"llm.model": "claude-opus-4-7"}) as llm_span:
                response = model.call(prompt)
                llm_span.set_attribute("llm.input_tokens", response.usage.input_tokens)
                llm_span.set_attribute("llm.output_tokens", response.usage.output_tokens)

            with tracer.start_as_current_span("output_processing"):
                result = parse_and_validate(response)

            return result
        except Exception as e:
            request_span.set_status(Status(StatusCode.ERROR, str(e)))
            request_span.record_exception(e)
            raise

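Context propagation. When a request crosses service boundaries (your API → a separate vector service → your model proxy → the LLM provider), the trace context must travel with it so that spans created downstream are associated with the original request. OpenTelemetry’s instrumentation libraries carry it in the W3C Trace Context HTTP headers (traceparent, tracestate): instrument both sides of each call and the downstream spans attach to the original trace. This works only for services you run and instrument; a third-party LLM provider generally will not emit spans into your trace, so that call appears as a single client span.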

# Make outbound HTTP requests carry trace context
from opentelemetry.instrumentation.requests import RequestsInstrumentor
RequestsInstrumentor().instrument()

# Now any requests.get(...) call automatically propagates trace context
# Downstream service receives traceparent header and creates child spans

# For LLM SDKs that don't auto-instrument
import requests
from opentelemetry import propagate
def call_llm_via_proxy(prompt):
    headers = {}
    propagate.inject(headers)  # add traceparent header
    return requests.post("https://proxy.internal/llm",
        json={"prompt": prompt}, headers=headers)
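
For readers who want to see what travels in the traceparent header, here is a stdlib-only sketch of parsing and re-emitting it. It follows the W3C Trace Context version 00 layout (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags) and is a simplification: it rejects future versions with extra fields and ignores tracestate. In real code, the OpenTelemetry propagators do this for you; the function names here are hypothetical.

```python
import re
import secrets

# version - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header: str):
    """Return the parsed fields, or None if the header is malformed or invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "flags": flags,
        "sampled": bool(int(flags, 16) & 0x01),
    }

def child_traceparent(incoming: str) -> str:
    """Build the header for an outbound call: same trace ID, fresh span ID."""
    ctx = parse_traceparent(incoming)
    if ctx is None:
        # No usable context: start a new trace (marking it sampled is a choice made here)
        trace_id, flags = secrets.token_hex(16), "01"
    else:
        trace_id, flags = ctx["trace_id"], ctx["flags"]
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"

incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(parse_traceparent(incoming))
print(child_traceparent(incoming))
```

The trace ID stays constant across every hop while each service contributes its own span ID, which is what lets a backend stitch independent service spans into one trace.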

Span attributes vs events. Attributes are key-value pairs on the span itself (one-time, set when the span is created or before it closes). Events are timestamped occurrences within the span. Use attributes for metadata that describes the span (“this LLM call used model X with N input tokens”). Use events for things that happened during the span (“first token at +312ms”, “rate limited at +500ms, retrying”).

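Span status. Every span has a status (UNSET, OK, ERROR). Set ERROR with a meaningful message when a span fails; downstream observability tools filter and alert on error-status spans. Don’t conflate HTTP status codes with span status: a 404 from a lookup is “not found”, a meaningful response rather than a failure of your code. The OpenTelemetry HTTP conventions draw a similar line by span kind: a 4xx response leaves a server span’s status unset but marks the calling client span as ERROR, and a 5xx response is an error on both sides. On spans you create manually, you decide which responses count as failures.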
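
A small sketch of that mapping, written without the OpenTelemetry SDK so the rule is visible. It follows my reading of the OpenTelemetry HTTP semantic conventions (4xx is an error for client spans but not server spans; 5xx is an error for both); the expected parameter is a hypothetical addition for codes your application treats as normal outcomes.

```python
from enum import Enum

class SpanStatus(Enum):
    UNSET = "UNSET"
    OK = "OK"
    ERROR = "ERROR"

def span_status_for_http(status_code: int, span_kind: str, expected=frozenset()) -> SpanStatus:
    """Map an HTTP status code to a span status for a CLIENT or SERVER span."""
    if status_code in expected:
        # The application declared this code a normal outcome (e.g. 404 on a lookup)
        return SpanStatus.UNSET
    if status_code >= 500:
        return SpanStatus.ERROR
    if status_code >= 400:
        # The caller made a bad request: an error from the client's view, not the server's
        return SpanStatus.ERROR if span_kind == "CLIENT" else SpanStatus.UNSET
    return SpanStatus.UNSET

print(span_status_for_http(404, "SERVER"))
print(span_status_for_http(404, "CLIENT"))
print(span_status_for_http(404, "CLIENT", expected={404}))
```

Returning UNSET rather than OK for successful calls is deliberate: OK is meant to be set explicitly by the application, not inferred by instrumentation.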

Span links. Sometimes spans relate to each other in ways that aren’t strict parent-child. A retry span might link to the original failed span; a sub-agent invocation might link to a planning span without being a direct child. OTel supports span links for these cases — use them to capture relationships that the parent-child tree doesn’t express.

# Span links example
from opentelemetry import trace
tracer = trace.get_tracer(__name__)

# When retrying after a failure, link the new span to the failed one
def retry_with_link(original_span_context):
    with tracer.start_as_current_span(
        "llm.call.retry",
        links=[trace.Link(original_span_context)]
    ) as span:
        # ... retry logic
        pass
# The link is preserved in the trace; viewers show "this span retried after <original>"

Naming conventions. Spans need consistent naming so dashboards and queries work. Common patterns: verb.object (e.g., llm.call, retrieval.search); service.operation (e.g., chat.handle_request); hierarchical dotted notation (e.g., agent.tool.search). Pick one and stick to it. Inconsistent naming is the single most common reason observability becomes hard to query at scale.
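
One way to keep names consistent is to check them mechanically, for example in a unit test or CI step. The sketch below assumes the team picked lowercase dotted names with at least two segments; the regular expression and the function name are illustrative, so adapt them to whichever convention you choose.

```python
import re

# Lowercase segments separated by dots, at least two segments (e.g. "llm.call")
SPAN_NAME = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$")

def nonconforming(names):
    """Return the span names that do not follow the dotted convention."""
    return [n for n in names if not SPAN_NAME.match(n)]

observed = [
    "llm.call",
    "retrieval.search",
    "agent.tool.search",
    "chat.handle_request",
    "LLMCall",   # wrong case, no dot
    "llm_call",  # underscore instead of dot
]
print(nonconforming(observed))
```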

Chapter 4: OpenTelemetry for LLM apps

OpenTelemetry (OTel) is the standard for observability instrumentation across modern applications. By 2026 it’s the dominant choice for both general APM and LLM-specific observability. The advantage is vendor neutrality — you instrument once with OTel, then export to any compatible backend (Datadog, Honeycomb, Tempo, Langfuse, Phoenix, etc.) or to multiple backends simultaneously.

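OTel for LLM apps has specific semantic conventions that the community has been standardizing. The OpenInference and OpenLLMetry conventions (both built on OTel) define standard attribute names for LLM-specific data: the model name, input and output token counts, the system prompt, the completion, and so on. Exact attribute names differ between conventions and between versions of the same convention; OpenTelemetry’s own generative-AI semantic conventions, for example, use a gen_ai. prefix (gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens). Check which names your backend expects before hard-coding them. Using these conventions means your traces are interoperable with tools that understand them; using custom attribute names means each tool you adopt may need configuration to find the right fields.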

# OpenLLMetry instrumentation (Traceloop SDK)
# Install first from the shell: pip install traceloop-sdk

from traceloop.sdk import Traceloop
Traceloop.init()

# Now LLM SDK calls are automatically traced with semantic conventions
# - Anthropic, OpenAI, Cohere, Mistral, LangChain, LlamaIndex, etc.
# All produce spans with standardized attributes

# Manual span creation with OpenLLMetry conventions
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("llm_call") as span:
    span.set_attribute("llm.request.model", "claude-opus-4-7")
    span.set_attribute("llm.request.type", "chat")
    span.set_attribute("llm.prompts.0.role", "user")
    span.set_attribute("llm.prompts.0.content", user_message)
    response = model.call(...)
    span.set_attribute("llm.completions.0.role", "assistant")
    span.set_attribute("llm.completions.0.content", response.content)
    span.set_attribute("llm.usage.prompt_tokens", response.usage.input_tokens)
    span.set_attribute("llm.usage.completion_tokens", response.usage.output_tokens)

Exporting traces. OTel collectors receive traces from your application and forward them to one or more backends. The standard pattern: applications use the OTLP (OpenTelemetry Protocol) exporter to send traces to an OTel Collector; the collector applies sampling, filtering, and enrichment; the collector forwards to backends (your APM, your LLM-observability tool, your data warehouse).

# Application config (Python)
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
# Send spans over OTLP/HTTP to the collector (4318 is the default OTLP/HTTP port)
exporter = OTLPSpanExporter(endpoint="http://otel-collector:4318/v1/traces")
# Batch spans in memory and export them in the background
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# OTel collector config (otel-collector-config.yaml)
receivers:
  otlp:
    protocols:
      http: {}
      grpc: {}

processors:
  batch:
    timeout: 5s
  resource:
    attributes:
      - key: service.environment
        value: production
        action: upsert

exporters:
  otlphttp/datadog:
    endpoint: "https://otel.datadoghq.com"
    headers:
      DD-API-KEY: "${DD_API_KEY}"
  otlphttp/langfuse:
    endpoint: "https://us.cloud.langfuse.com/api/public/otel"
    headers:
      authorization: "Basic ${LANGFUSE_AUTH}"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [otlphttp/datadog, otlphttp/langfuse]

Sampling. At scale, capturing every trace is impractically expensive. The OTel Collector’s tail sampling processor lets you keep all “interesting” traces (errors, high latency, specific user/tenant) and sample the rest. For LLM applications, common sampling policies: 100% of error traces; 100% of high-latency traces (p95+); 10% of normal traces; 100% of traces for specific tenants under investigation.

# Tail sampling configuration in OTel Collector
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    expected_new_traces_per_sec: 100
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-policy
        type: latency
        latency: {threshold_ms: 5000}
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
      - name: priority-tenant-policy
        type: string_attribute
        string_attribute:
          key: tenant.id
          values: [vip-customer-1, vip-customer-2]
          invert_match: false

Sampling strategy is one of the most important LLM observability decisions. Too aggressive (1% of traces) and you’ll lack signal for debugging routine issues; too lenient (100% of all traces) and the storage bill dominates infrastructure cost. Most mature teams use tail sampling with the above shape — comprehensive on errors and high-latency, probabilistic on the rest — and tune the probabilistic rate based on volume and storage budget. For very high-volume systems (millions of requests/day), 1-5% probabilistic plus 100% errors is typical; for lower volumes, 20-50% probabilistic is fine.
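
The decision logic behind that policy shape can be sketched in a few lines. This is not the collector’s implementation, only an illustration of the idea that a trace is kept if any policy matches; the trace fields, thresholds, and tenant names are assumptions mirroring the configuration above.

```python
import random

def keep_trace(trace, rng, probabilistic_rate=0.10, latency_threshold_ms=5000,
               priority_tenants=frozenset({"vip-customer-1", "vip-customer-2"})):
    """Decide after the trace completes. Returns (keep, name of the policy that matched)."""
    if trace["status"] == "ERROR":
        return True, "errors-policy"
    if trace["duration_ms"] >= latency_threshold_ms:
        return True, "slow-policy"
    if trace["tenant_id"] in priority_tenants:
        return True, "priority-tenant-policy"
    # Everything else is kept with a fixed probability
    if rng.random() < probabilistic_rate:
        return True, "probabilistic-policy"
    return False, None

rng = random.Random(0)  # seeded only so the example is repeatable
completed = [
    {"status": "ERROR", "duration_ms": 800, "tenant_id": "acme"},
    {"status": "OK", "duration_ms": 7200, "tenant_id": "acme"},
    {"status": "OK", "duration_ms": 900, "tenant_id": "vip-customer-1"},
    {"status": "OK", "duration_ms": 900, "tenant_id": "acme"},
]
for t in completed:
    print(keep_trace(t, rng))
```

Recording which policy matched is worth the extra field: it tells you later whether a stored trace represents the general population or was kept because it was unusual.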

Head sampling vs tail sampling. Head sampling decides at request start whether to keep the trace; cheap but loses information on whether the trace ended up interesting. Tail sampling decides at trace completion based on full context; more expensive but produces better samples. For LLM apps where the “interesting” signal often only becomes apparent after the response (was the output classifier alarmed? did the user thumbs-down?), tail sampling is generally preferred.
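
For contrast, a head-sampling decision can be made from the trace ID alone. The sketch below is an illustrative scheme, not necessarily the algorithm any particular OpenTelemetry sampler uses: it treats the low 64 bits of the trace ID as a number and keeps the trace if it falls in the lowest fraction of the ID space.

```python
def head_sample(trace_id_hex: str, rate: float) -> bool:
    """Keep a trace if its ID falls in the lowest `rate` fraction of the 64-bit ID space."""
    bucket = int(trace_id_hex[-16:], 16)  # low 64 bits of the 128-bit trace ID
    return bucket < rate * 2**64

trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
# The decision depends only on the ID, so every service reaches the same answer
print(head_sample(trace_id, 0.10), head_sample(trace_id, 0.10))
```

Because the decision is a pure function of the trace ID, independent services agree without coordinating. The cost is the one described above: the function never sees status, latency, or feedback, so it cannot favor the traces that turn out to matter.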

Sample-rate observability. When sampling is in effect, you need to know what your sample rate is in order to extrapolate. Maintain a metric for traces sampled and traces seen so you can compute the rate; surface this in dashboards. Without it, “we processed 100 traces matching this condition” might mean 100 of 100 if not sampling or 100 of 10,000 if sampling at 1%.
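
The extrapolation itself is simple arithmetic once both counters exist. A minimal sketch, assuming you maintain counters for traces seen and traces kept for the probabilistic bucket (the counter names are hypothetical):

```python
def observed_sample_rate(traces_seen: int, traces_sampled: int) -> float:
    """Fraction of traces that were kept."""
    if traces_seen <= 0:
        raise ValueError("traces_seen must be positive")
    return traces_sampled / traces_seen

def estimate_total(matching_sampled: int, traces_seen: int, traces_sampled: int) -> float:
    """Scale a count observed in sampled data up to the full population."""
    if traces_sampled <= 0:
        raise ValueError("no sampled traces to extrapolate from")
    return matching_sampled * traces_seen / traces_sampled

# 1,000,000 traces seen, 10,000 kept (1%), 100 of the kept traces match a condition
print(observed_sample_rate(1_000_000, 10_000))
print(estimate_total(100, 1_000_000, 10_000))
```

Under tail sampling this scaling applies only to traces kept by the probabilistic policy. Traces kept because they errored or were slow were retained at 100%, so they should be counted as-is rather than multiplied.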

Chapter 5: Observability tool landscape — Langfuse, Phoenix, LangSmith, Arize, generic APM

The LLM observability tool market in 2026 has several established players plus general APM vendors that have added LLM features. Understanding the trade-offs lets you pick the right combination for your team.

Tool | Type | Strengths | Best for
Langfuse | LLM-native (OSS + Cloud) | Tracing + evals + datasets in one tool; OpenTelemetry support | Application teams wanting an integrated LLM-ops platform
Phoenix (Arize) | LLM-native (OSS + Cloud) | Strong RAG analytics, embedding visualization | RAG-heavy applications, embedding-quality work
LangSmith | LLM-native (LangChain-centric) | Tight integration with LangChain ecosystem | LangChain-based applications
Arize AI | ML/LLM observability platform | Production-grade scale, drift detection | Enterprise scale, ML platform teams
Datadog LLM Observability | General APM + LLM features | Integration with rest of Datadog stack | Teams already on Datadog
Honeycomb | General APM | Best-in-class trace exploration UX | Teams that value query flexibility
New Relic | General APM | Mature platform, broad integrations | Existing New Relic customers
Weights & Biases Weave | ML platform + LLM observability | Tied to W&B experiment tracking | Teams using W&B for ML training
OpenLLMetry + DIY backend | OSS instrumentation | Maximum flexibility | Teams with strong observability practice

The pragmatic recommendation for most teams in 2026: pick one LLM-native tool (Langfuse if you want OSS + integrated evals; Phoenix if RAG analytics matter; LangSmith if you’re deep in LangChain) AND export the same traces to your existing APM via OpenTelemetry. The LLM-native tool gives you LLM-specific features (eval integration, prompt management, automatic LLM detection); the APM correlates with your broader application traces.

Self-hosted vs cloud. Most LLM observability tools offer both self-hosted (open source or commercial license) and managed cloud options. Self-hosted gives you data sovereignty and is required by some compliance regimes; cloud is operationally simpler and includes features (managed scaling, integrated dashboards) that require setup work on self-hosted.

# Langfuse self-hosted via Docker Compose (shell)
git clone https://github.com/langfuse/langfuse.git
cd langfuse
docker compose up -d

# The Langfuse UI is now at http://localhost:3000
# Get API keys from the UI and configure your application:

# Python SDK (install first from the shell: pip install langfuse)

from langfuse import Langfuse
langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="http://localhost:3000"
)

# Or use the Langfuse OpenTelemetry endpoint with any OTel-instrumented app

Phoenix from Arize is the strongest tool for RAG observability — it ingests both your traces and your retrieved documents and surfaces patterns like “queries that retrieve poor documents” or “embeddings that are clustering oddly”. For RAG-heavy applications, Phoenix is worth running alongside whatever your primary tool is.

LangSmith deserves a deeper note for teams using LangChain. The integration is tight enough that LangSmith picks up tracing automatically from LangChain’s Runnable framework — no manual instrumentation required. Beyond tracing, LangSmith offers prompt management (versioning prompts as code-like artifacts), dataset management for evals, and a hosted eval runner. The lock-in concern is real (switching away from LangSmith means re-implementing prompt management and eval workflows), but for teams committed to LangChain, the productivity is genuinely high.

Arize AI (the company behind Phoenix) also offers Arize AX — a hosted ML observability platform aimed at enterprise customers. It scales further than Phoenix and adds features like data drift detection, embedding monitoring, and integration with model registries. The pricing reflects the enterprise positioning; for application teams, Phoenix (free, self-hosted) often suffices.

Weights & Biases Weave is the LLM observability product from W&B, the company best known for ML experiment tracking. The strength is integration with the broader W&B platform: if your team uses W&B for model training and experiment tracking, Weave gives you LLM observability in the same UI. The weakness is that it is newer than the dedicated LLM observability tools; the feature gap with Langfuse and Phoenix is closing but has not closed.

Datadog LLM Observability launched in 2024 and has matured rapidly. The strength is integration with the rest of Datadog (APM, log search, metrics, infrastructure monitoring all in one place). For teams already on Datadog, this is the path of least resistance. The weakness is that LLM-specific features (eval integration, prompt management) are less mature than dedicated tools; Datadog is still catching up on the LLM-native side.

New Relic, Dynatrace, and Splunk all have LLM observability stories — most are general APM with LLM-specific dashboards and integrations. For teams already invested in those platforms, extend rather than replace. For new setups, the dedicated LLM-native tools tend to be ahead on LLM-specific UX.

The DIY approach. Some teams build their own observability stack using OpenTelemetry SDK + Tempo (or Jaeger) for traces + Prometheus + Grafana for metrics + Elasticsearch (or Loki) for logs. This works but requires engineering investment to maintain. For most teams, vendor tools save enough time to be worth their cost.

# Phoenix in a notebook or local process
# Install first from the shell: pip install arize-phoenix

import phoenix as px
session = px.launch_app()
# Opens a local UI; traces sent to it appear in the dashboard

# Or self-host with the Docker image (shell)
docker run -p 6006:6006 -p 4317:4317 arizephoenix/phoenix:latest

Chapter 6: What to instrument — prompts, retrieval, tools, outputs, costs

Instrumentation decisions determine what you can debug. Under-instrument and you’ll be guessing when something breaks; over-instrument and you’ll drown in noise and pay too much for storage. The right baseline for production LLM apps in 2026 captures the following dimensions on every trace.

Request-level attributes. Request ID, user ID (hashed for privacy), tenant ID, endpoint, timestamp, environment (prod/staging/dev), service version. These let you correlate traces with broader system events.

Input attributes. The full user input (or a hash if privacy-sensitive). Input length, language, content type. Whether the input was from an authenticated session or anonymous.

Retrieval attributes. Query embedding (or its hash). Number of documents retrieved. IDs of retrieved documents. Retrieval scores. Source of each retrieved document. Time taken to retrieve. Whether retrieval used cache.

Prompt construction attributes. The full prompt sent to the model (or a hash). Prompt version (which template). System prompt version. Any dynamic content injected. Token count of the assembled prompt.

Model call attributes. Model name and version. Input tokens (uncached, cached). Output tokens. Cost in USD. Latency. First-token latency. Finish reason (stop, length, tool_use, etc.). Provider name. API region.

Tool invocation attributes. Tool name. Parameters sent (sanitized for secrets). Result returned. Latency. Whether the call succeeded. Error message if it failed.

Output attributes. The full output (or a hash). Output length. Parse success (if structured output expected). Output classifier scores (if running). Whether the response was returned to the user or replaced with a fallback.

# Comprehensive instrumentation example
from opentelemetry import trace
import hashlib

def hash_for_privacy(text):
    return hashlib.sha256(text.encode()).hexdigest()[:16]

tracer = trace.get_tracer(__name__)

def handle_chat(user_input, user_id, tenant_id):
    with tracer.start_as_current_span("chat.handle") as span:
        # Request attributes
        span.set_attributes({
            "user.id_hash": hash_for_privacy(user_id),
            "tenant.id": tenant_id,
            "input.length": len(user_input),
            "input.hash": hash_for_privacy(user_input),
            "service.version": SERVICE_VERSION,
        })

        # Retrieval
        with tracer.start_as_current_span("retrieval") as r:
            docs = retriever.search(user_input, k=5)
            r.set_attributes({
                "retrieval.docs.count": len(docs),
                "retrieval.docs.ids": [d.id for d in docs],
                "retrieval.scores": [d.score for d in docs],
            })

        # Prompt construction
        with tracer.start_as_current_span("prompt.build") as p:
            prompt = build_prompt(user_input, docs)
            p.set_attributes({
                "prompt.version": PROMPT_VERSION,
                "prompt.tokens": count_tokens(prompt),
                "prompt.hash": hash_for_privacy(prompt),
            })

        # Model call
        with tracer.start_as_current_span("llm.call") as l:
            l.set_attributes({"llm.model": "claude-opus-4-7"})
            response = model.call(prompt)
            l.set_attributes({
                "llm.input_tokens": response.usage.input_tokens,
                "llm.output_tokens": response.usage.output_tokens,
                "llm.cached_input_tokens": response.usage.cache_read_input_tokens,
                "llm.cost_usd": calculate_cost(response),
                "llm.finish_reason": response.stop_reason,
            })

        # Output
        with tracer.start_as_current_span("output.process") as o:
            result = process(response)
            o.set_attributes({
                "output.length": len(result.text),
                "output.hash": hash_for_privacy(result.text),
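                # Span attributes must be scalars or lists of scalars, so record one score
                "output.classifier_score": classify_output(user_input, result.text)["relevance"],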
            })

        return result

What NOT to capture (or capture carefully). Raw passwords, API keys, or session tokens — these should never appear in traces. Customer PII unless you have explicit consent and PII-aware storage. Sensitive document contents if the user has restricted access — capture the document ID and metadata, not the content. Confidential business data per your data classification policy.

Hash-based fingerprinting. For prompts, outputs, and other long strings, store a content hash as an attribute (deterministic, useful for grouping) and store the full content as a separate event or in a side table with access controls. The trace becomes searchable by hash without exposing the content; investigators with proper access can resolve the hash to the content when needed.

# Hash-based privacy pattern
import hashlib
def content_hash(text):
    return hashlib.sha256(text.encode()).hexdigest()

# In the trace
span.set_attribute("prompt.hash", content_hash(prompt))
span.set_attribute("output.hash", content_hash(output))

# In a separate PII-aware store
pii_store.insert(
    request_id=trace_id,
    prompt_hash=content_hash(prompt),
    prompt_full=prompt,        # restricted access
    output_hash=content_hash(output),
    output_full=output,         # restricted access
    expires_at=now() + retention_period
)

# Investigators query traces by hash, then resolve to content via pii_store
# Two-tier access: most engineers see hashes; PII-cleared engineers see content
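One caveat on the hashing pattern: a plain SHA-256 of a low-entropy value (a user ID, a short input) can be recovered by hashing candidate values and comparing. A keyed hash (HMAC) stays deterministic for grouping but cannot be recomputed without the secret. The following is a minimal sketch of that variant; `TRACE_HASH_KEY` and `keyed_hash` are hypothetical names, and key storage and rotation are left out.

```python
import hashlib
import hmac
import os

# Hypothetical: in a real deployment the key would come from a secret manager.
TRACE_HASH_KEY = os.environ.get("TRACE_HASH_KEY", "dev-only-key").encode()

def keyed_hash(text: str, key: bytes = TRACE_HASH_KEY) -> str:
    """Deterministic fingerprint that cannot be recomputed without the key."""
    return hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()

# Same input and key give the same fingerprint, so grouping by hash still works
stable = keyed_hash("alice") == keyed_hash("alice")
# A different key gives a different fingerprint for the same input
key_dependent = keyed_hash("alice", b"key-a") != keyed_hash("alice", b"key-b")
```

The trade-off is operational: rotating the key breaks correlation with older traces unless the old key is retained for lookups.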

Sampling of payload content. For high-volume systems where storing every prompt and output is too expensive, sample. Store full content for 5-10% of traces and rely on those for debugging unusual patterns; store only hash + metadata for the rest. A uniform random sample is statistically representative, but it will miss most rare failures, so also keep full content for the traces most likely to need investigation (errors, fallbacks, low classifier scores).
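A minimal sketch of such a sampling decision follows. It hashes the trace ID into a stable bucket so that every service handling the same trace reaches the same verdict. The function name, the 10% default, and the rule of always keeping errored traces are assumptions for illustration.

```python
import hashlib

PAYLOAD_SAMPLE_RATE = 0.10  # assumed: keep full content for about 10% of traces

def keep_full_payload(trace_id: str, had_error: bool = False,
                      sample_rate: float = PAYLOAD_SAMPLE_RATE) -> bool:
    """Decide whether to store full prompt/output content for this trace."""
    if had_error:
        return True  # assumption: failures always keep their content
    # Map the trace ID to a stable number in [0, 1)
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Because the decision depends only on the trace ID, retrieval, model, and output spans of one request are either all stored with content or all stored as hashes.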

Schema evolution. The fields you instrument will change over time as you learn what matters. Plan for backwards-compatible schema evolution: add new attribute names freely; rename only when necessary; never reuse a name with different semantics. Some observability tools handle schema migrations better than others; check before committing to a specific platform.

Chapter 7: Trace inspection and debugging workflows

Traces are most valuable when developers can quickly find the one they need. Workflows for trace inspection typically follow this pattern: a user reports an issue with a specific timestamp or session ID; an engineer searches traces by that identifier; the matching trace shows the waterfall of spans; the engineer drills into the relevant span to understand what happened; if the issue is a regression, they compare the trace to a similar successful trace from a previous version.

# Common trace search patterns

# By user ID (hashed with the same function used at instrumentation time)
user_traces = traces.search(attribute="user.id_hash", value=hash_for_privacy("alice"))

# By request ID (when you have it from a support ticket)
trace = traces.get_by_id("req_abc123")

# By error
errors = traces.search(filter="status.error = true", since="1h")

# By model version (to see all traces using a specific model)
recent = traces.search(filter="llm.model = 'claude-opus-4-7'", since="24h")

# By high latency
slow = traces.search(filter="duration_ms > 5000", since="1h")

The trace inspection UI matters. Tools like Langfuse, Phoenix, LangSmith, Honeycomb, and Datadog all offer trace views, but they vary in usability. Key features to look for: waterfall display showing parent-child span hierarchy; click-to-expand attributes; ability to compare two traces side by side; quick links from a trace to related traces (same user, same session, same prompt version); inline display of LLM prompts and responses with syntax highlighting.

Debugging workflow patterns. The most useful pattern for LLM apps: start at the customer’s complaint; find their trace; look at the full prompt and the full response; if the response was bad, examine retrieval (was the right context retrieved?); examine prompt construction (was the prompt assembled correctly?); examine model output (did the model produce reasonable output for that prompt?); examine output processing (did anything post-model alter the result?). Five minutes with a good trace usually identifies which layer broke; without traces, the same investigation can take hours.

# A useful pattern: "compare-and-explain" traces
# Given a failing trace and a similar successful one, diff them
# (assumes trace_store exposes each trace's span attributes flattened into .attrs)
def compare_traces(failing_id, working_id):
    fail = trace_store.get(failing_id)
    work = trace_store.get(working_id)
    fail_docs = set(fail.attrs["retrieval.docs.ids"])
    work_docs = set(work.attrs["retrieval.docs.ids"])
    diff = {
        "input_length": fail.attrs["input.length"] - work.attrs["input.length"],
        # Report both directions: docs only the failing trace saw, and docs it missed
        "docs_only_in_failing": fail_docs - work_docs,
        "docs_only_in_working": work_docs - fail_docs,
        "prompt_version": (fail.attrs["prompt.version"], work.attrs["prompt.version"]),
        "model": (fail.attrs["llm.model"], work.attrs["llm.model"]),
        "output_classifier_score": (fail.attrs["output.classifier_score"],
                                    work.attrs["output.classifier_score"]),
    }
    return diff
# The differences usually highlight what caused the failure

Saved queries and dashboards. After the first few debugging sessions, you’ll find yourself running the same queries repeatedly. Save them. Most observability tools support saved searches; build a team library of “useful queries” for common investigations. Common entries: “all errors in the last 24 hours”, “traces with latency > 10 seconds”, “traces from a specific user in the last week”, “traces using a specific prompt version”, “agent runs that hit the step-limit”.

Trace correlation across systems. When a customer reports an issue, you often need to correlate the AI trace with traces from non-AI parts of your system (the API gateway, the database, the auth service). Trace context propagation makes this possible — the same trace ID appears in every system the request touched. The investigation flow becomes: customer reports issue → find the trace by user ID → see the entire system’s view of that request, not just the AI parts.

AI-assisted debugging. The latest LLM observability tools include LLM-powered features to summarize traces, suggest likely root causes, and even propose fixes. Langfuse, Phoenix, and Honeycomb all offer some variant of this. Treat the suggestions as hypotheses, not conclusions — but they often save significant time for routine investigations.

# Example trace summary prompt for AI-assisted debugging
SUMMARY_PROMPT = """You are debugging an LLM application. Given this trace,
identify what went wrong and propose a fix.

Trace:
{trace_json}

Recent traces in the same prompt version (for comparison):
{baseline_traces}

Output:
- One-sentence summary of what happened
- Most likely root cause
- Suggested next investigation steps
- Recommended fix (if obvious)"""

Chapter 8: Metrics for LLM apps — latency, cost, quality, error

Metrics summarize what’s happening in aggregate. The minimum metric set for production LLM apps:

Latency metrics. p50, p95, p99 of total request latency. p50, p95, p99 of LLM call latency (excluding retrieval and processing). First-token latency for streaming responses. These metrics drive SLO monitoring and capacity planning.

Cost metrics. Total cost per minute/hour/day, broken down by model, feature, and team. Cost per request. Cache hit rate (affects effective cost). Tokens consumed per request.

Error metrics. Error rate. Errors by class (provider errors, parse failures, timeouts, tool errors). Retry rate. Error rate by model version.

Quality metrics. Output classifier scores aggregated. User feedback (thumbs up/down) rates. Refusal rate (how often the model refuses to respond). Format compliance rate (for structured outputs).

Volume metrics. Requests per second. Active users. Active sessions. Tokens per minute.

# Prometheus-style metric definitions (illustrative)
llm_requests_total{model, feature, status}                # counter
llm_request_duration_seconds{model, feature, status}     # histogram
llm_tokens_total{model, kind="input|output|cached"}       # counter
llm_cost_usd_total{model, team, feature}                  # counter
llm_errors_total{model, feature, error_class}             # counter
llm_quality_score{model, dimension, percentile}            # gauge
llm_cache_hit_ratio{model}                                 # gauge

# Example PromQL queries

# p95 latency by model (last 5 minutes)
histogram_quantile(0.95,
    sum(rate(llm_request_duration_seconds_bucket[5m]))
        by (model, le))

# Error rate by feature
sum(rate(llm_errors_total[5m])) by (feature)
    / sum(rate(llm_requests_total[5m])) by (feature)

# Cost burn rate per team, in USD per hour
# (rate() returns a per-second value, so scale by 3600)
sum(rate(llm_cost_usd_total[1h])) by (team) * 3600

Dashboards. The minimum dashboard set: an overview dashboard with the top-level metrics (latency, cost, errors, volume); per-feature dashboards drilling into specific applications; per-model dashboards for comparing model versions; SLO dashboards showing actual vs target performance.

SLOs (Service Level Objectives) for LLM apps differ from traditional APIs because quality matters alongside speed and availability. Useful SLOs: 95% of requests complete within 5 seconds; error rate < 1% over 30-day window; cache hit rate > 70%; quality classifier score > 0.85 on production samples. Each SLO needs a measurement plan, an alerting threshold, and an action when violated.
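To make the "error rate < 1% over 30-day window" objective concrete, it helps to express it as an error budget: the number of failures the window can absorb before the SLO is violated. The sketch below is illustrative arithmetic only; the function name and the request counts are assumptions.

```python
def error_budget_remaining(total_requests: int, failed_requests: int,
                           slo_error_rate: float = 0.01) -> float:
    """Fraction of the window's error budget still unspent (negative once exceeded)."""
    allowed_failures = total_requests * slo_error_rate
    if allowed_failures == 0:
        return 0.0
    return 1.0 - failed_requests / allowed_failures

# Assumed numbers: 2,000,000 requests so far in the window, 5,000 of them failed.
# The budget is 20,000 failures, so roughly three quarters of it remains.
remaining = error_budget_remaining(2_000_000, 5_000)
```

Alerting on how fast the remaining budget shrinks, rather than on the raw error rate, ties the alert threshold directly to the SLO.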

Per-feature vs aggregate metrics. Aggregate metrics tell you the overall system health; per-feature metrics tell you which specific feature is degrading. Both matter. A 5% increase in aggregate p95 latency could come from one bad feature pulling up the tail; you can only tell with per-feature breakdowns. Build dashboards that allow drilling from aggregate to per-feature to per-prompt-version.

# Metric tagging strategy
# Every metric should be tagged with at least:
# - service (which microservice / app)
# - feature (which user-facing feature)
# - model (which LLM)
# - prompt_version (which prompt template)
# - environment (prod, staging, dev)

# Avoid: tagging with user_id (cardinality explosion)
# Avoid: tagging with full input or output (cardinality explosion)
# Do: tag with category aggregations (user_tier=free|pro|enterprise)

Histograms vs gauges. Use histograms for distributions (latency, token counts, costs) — you need percentiles, not just averages. Use gauges for instantaneous state (current connection count, current queue depth). Use counters for accumulating totals (total tokens, total errors). Picking the wrong metric type leads to poor dashboards.

Cardinality control. Each unique combination of metric labels creates a separate time series. With high cardinality (millions of users × 10 features × 5 models × 50 prompt versions), the metric store becomes unusable. Limit cardinality by aggregating user-level dimensions and capturing fine-grained data in traces instead. A useful rule: total time series per metric should stay under ~1 million; if you cross that, your aggregation strategy needs adjustment.
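The arithmetic behind that warning is easy to check. The sketch below multiplies out the label cardinalities from the example, assuming exactly one million users, and compares tagging by `user_id` with tagging by a three-value `user_tier`. The helper name is hypothetical.

```python
from math import prod

def series_count(label_cardinalities: dict) -> int:
    """Worst-case number of time series: the product of distinct values per label."""
    return prod(label_cardinalities.values())

with_user_id = series_count(
    {"user_id": 1_000_000, "feature": 10, "model": 5, "prompt_version": 50})
with_user_tier = series_count(
    {"user_tier": 3, "feature": 10, "model": 5, "prompt_version": 50})
```

Under these assumptions the per-user labeling yields 2.5 billion potential series for a single metric, while the per-tier labeling yields 7,500.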

Real-time vs batch metric pipelines. Most observability tools support both real-time metrics (updated continuously) and batch metrics (aggregated periodically from traces or logs). Real-time is essential for alerting; batch is fine for trend dashboards. Don’t double-implement; pick the right pattern per metric.

Chapter 9: Output classifiers and quality monitoring

Quality monitoring for LLM apps requires more than just latency and error counts. The hardest question — “is the output good?” — needs continuous measurement, not just one-time evaluation. Output classifiers (small specialized models or rules that score model outputs) run on every production request and produce quality signals that feed into observability.

Common output classifier dimensions: relevance (does the response address the user’s question?); safety (does the response avoid harmful content?); format compliance (does structured output match the schema?); hallucination detection (does the response make claims not grounded in retrieved context?); refusal detection (did the model refuse to help when it should have, or vice versa?); confidence calibration (is the model’s expressed confidence appropriate?).

# Lightweight output classifier on every production response
import asyncio
import json
from typing import Dict

import anthropic  # the classifier model below is a Claude model, so use the Anthropic SDK

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
background_tasks = set()        # keeps references so pending tasks are not garbage-collected

def classify_output(user_input: str, response: str) -> Dict[str, float]:
    """Run a cheap classifier on production responses."""
    prompt = f"""Score the following AI response on these dimensions (0-1 each):
- relevance: how well does it address the user's question
- safety: avoids harmful, biased, or inappropriate content
- groundedness: claims are supported by the provided context
- format: response is well-structured

User: {user_input[:500]}
Response: {response[:1000]}

Return only JSON: {{"relevance": 0-1, "safety": 0-1, "groundedness": 0-1, "format": 0-1}}"""

    classifier_response = client.messages.create(
        model="claude-haiku-4-5",  # cheap, fast
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    # Raises json.JSONDecodeError if the model wraps the JSON in prose; handle upstream
    return json.loads(classifier_response.content[0].text)

# In your request handler (async, so a running event loop exists for create_task):
@trace_decorator
async def serve_request(user_input):
    response = generate_response(user_input)
    # Classify in the background (don't block the response)
    task = asyncio.create_task(record_classifier_score(user_input, response))
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return response

async def record_classifier_score(user_input, response):
    # classify_output makes a blocking network call, so run it off the event loop
    scores = await asyncio.to_thread(classify_output, user_input, response)
    # Per-request scores form a distribution, so record them as histograms
    metrics.histogram("llm.quality.relevance", scores["relevance"])
    metrics.histogram("llm.quality.safety", scores["safety"])
    metrics.histogram("llm.quality.groundedness", scores["groundedness"])
    # Tag with current model version for comparison
    metrics.histogram("llm.quality.score_by_model",
        sum(scores.values()) / len(scores),
        tags={"model": current_model_version})

Cost considerations. Running a classifier on every production request adds cost (a small LLM call per response). For high-volume apps, sample — classify 5-10% of responses rather than all of them. The sample rate must be high enough to detect quality drift in your time window of interest; statistical power calculations help pick the right rate.
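As a rough illustration of the power calculation, the sketch below estimates how many classified responses a monitoring window needs in order to detect a given drop in mean score. It assumes a two-sided z-test against a known baseline mean and an approximately known score standard deviation; the function name and the example numbers are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def samples_needed(score_std: float, min_detectable_drop: float,
                   alpha: float = 0.05, power: float = 0.80) -> int:
    """Classified responses needed to detect a shift in the mean score."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # false-alarm control
    z_power = NormalDist().inv_cdf(power)          # chance of catching a real drop
    return ceil(((z_alpha + z_power) * score_std / min_detectable_drop) ** 2)

# Assumed: scores have a standard deviation of 0.2, and a 0.05 drop should be caught
n = samples_needed(score_std=0.2, min_detectable_drop=0.05)
```

Dividing the result by the sample rate gives the traffic a window must contain. Halving the detectable drop roughly quadruples the required sample, which is why small regressions need either a higher sample rate or a longer window.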

Calibration. Output classifiers are themselves LLMs and can be wrong. Before relying on them, validate against human-labeled samples. If your classifier scores correlate < 0.6 with human judgment, the classifier needs improvement (better rubric, stronger model, better prompting) before its scores are useful for production monitoring.
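A minimal sketch of that validation step: compute the Pearson correlation between classifier scores and human labels on the same responses, then compare it with the 0.6 threshold. The scores below are made-up illustration data, and the function names are hypothetical.

```python
from math import sqrt

def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def classifier_is_usable(classifier_scores, human_scores, threshold=0.6):
    r = pearson(classifier_scores, human_scores)
    return r >= threshold, r

# Illustration data only: six responses scored by humans and by the classifier
human = [0.9, 0.2, 0.8, 0.4, 1.0, 0.1]
classifier = [0.8, 0.3, 0.9, 0.5, 0.9, 0.2]
usable, r = classifier_is_usable(classifier, human)
```

In practice the labeled set should be far larger than six items and should be refreshed periodically, since agreement measured on stale samples says little about current traffic.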

Drift detection on classifier scores. Over time, classifier score distributions can drift. Two reasons. First, the system under test changes (a new prompt version, a new model) and the classifier picks up real changes — this is desired. Second, the classifier itself drifts (the underlying model gets updated by the provider, the rubric becomes less appropriate for new use cases) — this is noise. Distinguish the two by periodically re-validating the classifier against human-labeled samples; if the classifier-human agreement drifts, the classifier needs attention.

Multiple classifiers as ensemble. A single classifier is noisy; an ensemble of three to five classifiers (different prompts or different rubrics) provides more reliable signal. Cost is higher but for production monitoring it’s usually worth it. The classifiers can be different models (Claude judging GPT, GPT judging Gemini) to reduce self-preference bias on judgments.

# Classifier-ensemble pattern
async def ensemble_classify(user_input, response):
    results = await asyncio.gather(
        classifier_a(user_input, response),  # rubric A
        classifier_b(user_input, response),  # rubric B
        classifier_c(user_input, response),  # rubric C (different model)
    )
    # Average score per dimension
    dimensions = results[0].keys()
    averaged = {dim: sum(r[dim] for r in results) / len(results) for dim in dimensions}
    return averaged

Quality monitoring action paths. When classifier scores degrade, what happens? Alerting is one action. Other useful patterns: automatic rollback to a previous prompt version if scores fall below threshold; automatic switch to a more conservative model; automatic notification to the responsible team. Don’t just monitor — define the response to degradation, so when alerts fire, the right action is clear.

Chapter 10: Multi-agent and multi-step observability

Single-shot chat is easy to observe; multi-step agent workflows are not. An agent that planned, called five tools, recursed into a sub-agent, and synthesized an answer produces a complex span tree. Without good observability, debugging such an agent is detective work; with good observability, the trace tells the story.

Key patterns for agent observability. Trace the planner separately from the executor — knowing what the agent intended to do vs what it actually did is essential. Capture the agent’s full plan as a span attribute or event. For each tool invocation, capture the tool name, parameters, and result; if the tool itself triggers downstream calls, instrument those too so the trace is end-to-end. For sub-agents, use the same span hierarchy — a sub-agent’s spans become children of the parent agent’s span.

# Agent workflow instrumentation
@tracer.start_as_current_span("agent.run")
def run_agent(task):
    span = trace.get_current_span()
    span.set_attribute("agent.task", task)

    # Planning phase
    with tracer.start_as_current_span("agent.plan") as plan_span:
        plan = planner.create_plan(task)
        plan_span.set_attribute("plan.steps", len(plan.steps))
        plan_span.set_attribute("plan.serialized", json.dumps(plan.steps))

    # Execution phase: each step gets its own span
    results = []
    for i, step in enumerate(plan.steps):
        # Keep the span name constant and put the index in an attribute;
        # per-index names ("agent.step.0", "agent.step.1", ...) fragment aggregation
        with tracer.start_as_current_span("agent.step") as step_span:
            step_span.set_attribute("step.index", i)
            step_span.set_attribute("step.action", step.action)
            step_span.set_attribute("step.input", str(step.params))

            if step.action == "tool":
                result = execute_tool(step.tool_name, step.params)
            elif step.action == "subagent":
                # Recursive agent call: its spans become children
                result = run_agent(step.subtask)
            elif step.action == "llm":
                result = call_llm(step.prompt)
            else:
                raise ValueError(f"unknown step action: {step.action!r}")

            step_span.set_attribute("step.result_summary", str(result)[:200])
            results.append(result)

    # Synthesis
    with tracer.start_as_current_span("agent.synthesize"):
        final = synthesizer.compose(results)

    return final

Cross-agent correlation. In multi-agent systems where agents communicate (via shared memory, message queues, or RPC), the trace context must propagate. OTel handles this if you instrument the communication channels; otherwise trace identity is lost at each handoff. Worth investing in early; retrofitting cross-agent tracing onto an existing system is painful.
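To show what actually has to cross an agent boundary, here is a standard-library sketch that builds and parses a W3C `traceparent` header by hand and carries it in a queue message. In a real system OpenTelemetry's propagation API would do this for you; the message shape and helper names below are assumptions.

```python
import re
import secrets

# traceparent format: version-traceid-parentid-flags, all lowercase hex
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(header: str):
    """Return the parent context carried by the header, or None if malformed."""
    match = TRACEPARENT_RE.match(header)
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    return {"trace_id": trace_id, "parent_span_id": parent_span_id,
            "sampled": bool(int(flags, 16) & 0x01)}

# Agent A attaches its context to the message it enqueues
trace_id = secrets.token_hex(16)   # 32 hex characters
span_id = secrets.token_hex(8)     # 16 hex characters
message = {"task": "summarize",
           "headers": {"traceparent": make_traceparent(trace_id, span_id)}}

# Agent B restores the context before starting its own spans
parent = parse_traceparent(message["headers"]["traceparent"])
```

If the header is dropped anywhere along the way, agent B starts a fresh trace and the handoff disappears from the waterfall, which is the failure the paragraph above describes.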

Agent failure modes specific to observability: infinite loops (track step count per agent; alert if exceeds threshold); cost runaway (track cumulative cost per task; alert if it exceeds budget); tool selection failures (track tool-call success rate; alert if degraded); planning quality (track whether plans get executed successfully).
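The first two failure modes lend themselves to a simple in-process guard that stops a run when it exceeds a step or cost limit. The class below is a sketch under assumed limits (25 steps, 2 USD); the names are hypothetical, and a real system would also emit a metric or span event before raising.

```python
class AgentBudgetExceeded(RuntimeError):
    """Raised when an agent run exceeds its step or cost limit."""

class AgentBudget:
    def __init__(self, max_steps: int = 25, max_cost_usd: float = 2.00):
        self.max_steps = max_steps
        self.max_cost_usd = max_cost_usd
        self.steps = 0
        self.cost_usd = 0.0

    def record_step(self, cost_usd: float) -> None:
        """Call once per agent step, after the step's cost is known."""
        self.steps += 1
        self.cost_usd += cost_usd
        if self.steps > self.max_steps:
            raise AgentBudgetExceeded(
                f"step limit exceeded: {self.steps} > {self.max_steps}")
        if self.cost_usd > self.max_cost_usd:
            raise AgentBudgetExceeded(
                f"cost limit exceeded: ${self.cost_usd:.2f} > ${self.max_cost_usd:.2f}")
```

One budget object per top-level task, passed down to sub-agents, keeps recursion from multiplying the limits.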

Agent-specific dashboards. Beyond general observability, agent operations benefit from agent-specific views. Common entries: distribution of step counts per task (where on the histogram is your average task?); distribution of total tokens per task; tool-call patterns (which tools are most-used? which fail most often?); plan-to-execution success rate; sub-agent depth (how deeply does the agent recurse on average?).

# Agent task summary metric
def emit_agent_summary(task_id, agent_run):
    metrics.histogram("agent.steps_per_task",
        len(agent_run.steps), tags={"agent_type": agent_run.type})
    metrics.histogram("agent.total_tokens",
        agent_run.total_tokens, tags={"agent_type": agent_run.type})
    metrics.histogram("agent.duration_seconds",
        agent_run.duration_seconds, tags={"agent_type": agent_run.type})
    metrics.counter("agent.tool_calls_total",
        len(agent_run.tool_calls), tags={"agent_type": agent_run.type})
    metrics.counter("agent.tasks_completed",
        1 if agent_run.completed else 0, tags={"agent_type": agent_run.type})

Replay capability. For complex agent workflows, the ability to replay a recorded trace is invaluable for debugging. Storing the trace plus all inputs (user input, retrieved context, tool outputs) means engineers can re-run the agent on the original inputs to reproduce the issue. Some observability tools support trace replay natively; for others, custom tooling reads the trace and replays through your agent framework.

Sub-agent visibility. When agents spawn sub-agents, each sub-agent has its own context and capabilities. The observability story needs to surface what each sub-agent saw and decided, not just the parent agent’s view. Pattern: trace the sub-agent as a child span with full attribute set; surface sub-agent traces as collapsible sub-trees in the trace UI.

Chapter 11: Streaming responses and partial state observability

Streaming LLM responses (sending tokens to the client as they’re generated) are great for user experience but complicate observability. A streamed response doesn’t have a single “this is the response” moment — it has a stream of partial states. Capturing the right signal requires explicit instrumentation.

What to capture for streaming. The most important metrics differ from non-streaming: first-token latency (time from request to first token sent to the client); total tokens generated; total stream duration; stream completion (did the stream end cleanly or break midway?); whether the stream was canceled by the client.

# Instrumenting a streaming LLM call
import time

async def stream_response(prompt):
    with tracer.start_as_current_span("llm.stream") as span:
        start = time.monotonic()
        first_token_time = None
        chunk_count = 0
        accumulated = []
        completed = False

        try:
            async for chunk in model.stream(prompt):
                if first_token_time is None:
                    first_token_time = time.monotonic() - start
                    span.set_attribute("stream.first_token_ms", first_token_time * 1000)
                    span.add_event("first_token")

                accumulated.append(chunk.text)
                chunk_count += 1  # counts chunks; a chunk may carry several tokens
                yield chunk
            completed = True
        finally:
            # Runs on normal completion, provider errors, and client cancellation,
            # so broken streams are recorded with stream.completed = False
            end = time.monotonic() - start
            span.set_attributes({
                "stream.duration_ms": end * 1000,
                "stream.tokens": chunk_count,
                "stream.completed": completed,
                "stream.output_hash": hash_for_privacy("".join(accumulated)),
            })

Server-Sent Events (SSE) and WebSocket streams. Both are common transports for LLM streaming. SSE is simpler and more widely supported. WebSocket allows bidirectional communication (useful for interactive agent UIs). Either way, instrument the stream’s lifecycle events: connection established, first token, periodic progress, connection closed, error if any.

Client-side observability. For browser-based AI applications, the client-side experience matters as much as the server. Browser-side instrumentation (via Web Vitals, Real User Monitoring) captures perceived latency that includes network and rendering time, not just server processing. Correlate client traces with server traces via shared trace IDs for end-to-end visibility.

Token-per-second tracking. Beyond first-token latency, sustained throughput matters for streaming UX. Track tokens-per-second as a streaming-specific metric; a stream that starts fast but slows mid-response feels worse than a consistently moderate stream. Aggregate by model and prompt-version to spot regressions in throughput.

# Track sustained streaming throughput
import time
async def stream_with_throughput_tracking(prompt):
    with tracer.start_as_current_span("llm.stream") as span:
        first_token_time = None
        last_token_time = None
        token_count = 0
        intervals = []  # gaps between consecutive chunks, in seconds

        async for chunk in model.stream(prompt):
            now = time.monotonic()
            if last_token_time is None:
                # Time to first token is tracked separately; keep it out of the gaps
                first_token_time = now
            else:
                intervals.append(now - last_token_time)
            last_token_time = now
            token_count += 1
            yield chunk

        # Compute throughput statistics over the generation phase only
        if intervals:
            avg_inter_token_ms = sum(intervals) * 1000 / len(intervals)
            p95_index = min(int(0.95 * len(intervals)), len(intervals) - 1)
            p95_inter_token_ms = sorted(intervals)[p95_index] * 1000
            generation_seconds = last_token_time - first_token_time
            tokens_per_second = (
                (token_count - 1) / generation_seconds if generation_seconds > 0 else 0.0
            )

            span.set_attributes({
                "stream.avg_inter_token_ms": avg_inter_token_ms,
                "stream.p95_inter_token_ms": p95_inter_token_ms,
                "stream.tokens_per_second": tokens_per_second,
            })

Cancellation handling. Users sometimes cancel a streaming request before completion (closing a browser tab, navigating away). Capture this: canceled streams have a different cost profile (you still pay for tokens generated up to cancellation) and different UX implications (the user may have gotten enough already, or may have given up). Tag canceled streams in observability so they don’t pollute success/failure metrics.

Chapter 12: Privacy and PII handling in traces

Observability data contains every interaction with your AI system — which means it contains every piece of customer data those interactions touched. Without explicit PII handling, your trace store becomes a privacy liability and a regulatory risk.

Three approaches to PII in traces. Hash sensitive values (hash the user input, store the hash; debugging is harder but PII never lands in traces). Redact selectively (run a PII detector on the input; replace detected PII with placeholders; preserve enough context for debugging). Restrict access (store full content but lock traces to a privileged role; most engineers see redacted versions; auditors with explicit authorization see full content).

# Selective PII redaction
import re

PII_PATTERNS = [
    (re.compile(r'\b[A-Z][a-z]+ [A-Z][a-z]+\b'), '[NAME]'),  # name (rough)
    (re.compile(r'\b[\w.-]+@[\w.-]+\.\w+\b'), '[EMAIL]'),
    (re.compile(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'), '[PHONE]'),
    (re.compile(r'\b(?:\d[ -]*?){13,16}\b'), '[CARD]'),
    (re.compile(r'\b\d{3}-\d{2}-\d{4}\b'), '[SSN]'),
]

def redact_pii(text):
    for pattern, replacement in PII_PATTERNS:
        text = pattern.sub(replacement, text)
    return text

# Use a real PII detection library for production
# - Microsoft Presidio is well-maintained
# - Faker for synthetic replacement
# - Custom NER models for domain-specific PII

# In your instrumentation
span.set_attribute("input.text", redact_pii(user_input))
span.set_attribute("input.hash", hash_for_privacy(user_input))  # for correlation

Consent and access control. Beyond redaction, traces need access controls. Not every engineer should be able to inspect every trace; sensitive features should restrict trace access to specific teams. Compliance regimes (HIPAA, GDPR, SOC 2) typically require role-based access to systems that contain PII.

Retention. PII in traces must be deleted according to your retention policy. Most LLM observability tools support configurable retention per project; align with your data classification. Common defaults: 30-90 days for debugging traces; 1-7 years for compliance-relevant audit logs (often shipped to a separate compliance log system).

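The GDPR right-to-erasure question. If a user requests deletion of their data, you may need to delete or redact their traces too. Traces tagged with a stable user identifier make this tractable: with a raw user ID you search directly, and with a deterministic hash (such as user.id_hash above) you recompute the hash from the user’s ID and search for that. Erasure becomes hard when traces carry only content hashes with no user key, or when the user hash was salted with a value that has since been rotated or discarded. Plan the data model with this in mind from the start.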
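Where traces carry a deterministic hash of the user ID, as in the Chapter 6 example, one workable erasure lookup is to recompute that hash from the requesting user's ID and filter on it. The sketch below uses an in-memory list of trace dictionaries as a stand-in for a trace store; `erase_user_traces` is a hypothetical helper.

```python
import hashlib

def hash_for_privacy(text: str) -> str:
    # Same truncated SHA-256 used when the traces were written
    return hashlib.sha256(text.encode()).hexdigest()[:16]

def erase_user_traces(traces: list, user_id: str) -> tuple:
    """Return (remaining_traces, number_erased) for one user's erasure request."""
    target = hash_for_privacy(user_id)
    kept = [t for t in traces if t.get("user.id_hash") != target]
    return kept, len(traces) - len(kept)

store = [
    {"trace_id": "t1", "user.id_hash": hash_for_privacy("alice")},
    {"trace_id": "t2", "user.id_hash": hash_for_privacy("bob")},
    {"trace_id": "t3", "user.id_hash": hash_for_privacy("alice")},
]
store, erased = erase_user_traces(store, "alice")
```

The same lookup has to run against any side store that holds full prompts and outputs, since deleting only the trace leaves the content behind.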

Per-region storage. For multi-region deployments serving customers across data-residency boundaries (EU users in EU, US users in US), traces should be stored in the appropriate region. Most observability tools support multi-region deployments; for self-hosted setups, separate trace stores per region with no cross-region replication is the typical pattern.

Encryption at rest and in transit. Standard for any system handling user data. Tools handle this transparently for cloud-hosted deployments; for self-hosted, ensure TLS for all OTel collector traffic and disk encryption for the trace store. Audit periodically that the configuration is still correct after upgrades.

Access logs for the observability system itself. Who accessed which traces? Why? In regulated environments, observability access is itself a logged event. Most observability tools support access audit logging; enable it from day one for compliance-relevant deployments.

# Audit log pattern for observability access
audit_log.record({
    "event": "trace.access",
    "user": current_user,
    "trace_id": trace_id,
    "access_type": "view",  # one of: "view", "export", "delete"
    "purpose": "ticket #1234 investigation",
    "timestamp": now(),
})

# Retain audit logs separately from the traces themselves
# Typically 7+ years; same retention as other compliance logs

Chapter 13: Alerting and incident response

Observability without alerting is reactive — you only know about problems when someone reports them. Alerts surface problems automatically. The challenge is alerting on the right things at the right thresholds without generating noise.

Useful alert categories for LLM apps. Availability alerts (LLM provider down; your service returning 5xx). Latency alerts (p95 exceeded threshold for sustained period). Error-rate alerts (errors above 1% in a 5-minute window). Cost alerts (spend rate exceeded projected budget). Quality alerts (output classifier scores degraded; user-feedback ratio worsened). Anomaly alerts (request volume spiked unusually; an unusual user is consuming an outlier share of resources).

# Sample alert definitions (Prometheus / Alertmanager YAML)
groups:
  - name: llm_app_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(llm_errors_total[5m]))
            / sum(rate(llm_requests_total[5m])) > 0.01
        for: 5m
        labels: {severity: page}
        annotations:
          summary: "LLM error rate above 1% for 5 minutes"

      - alert: P95LatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum(rate(llm_request_duration_seconds_bucket[5m])) by (le)) > 8
        for: 10m
        labels: {severity: warn}

      - alert: CostBurnRate
        expr: |
          # rate() is per-second, so scale by the seconds in a 30-day month
          sum(rate(llm_cost_usd_total[1h])) * 3600 * 24 * 30 > 50000
        labels: {severity: warn}
        annotations:
          summary: "Projected monthly LLM cost exceeds $50k"

      - alert: QualityRegression
        expr: |
          avg_over_time(llm_quality_score[1h]) < 0.75
        for: 30m
        labels: {severity: page}

Incident response for LLM issues. The structure is similar to general SRE incidents but with AI-specific runbooks. On-call playbooks should cover: rolling back to a previous model version; switching to a fallback provider; disabling a feature; degrading gracefully (returning a static “we’re experiencing issues” response instead of erroring); contacting the LLM provider’s support. Document each scenario and how to execute it; rehearse periodically.

Post-incident reviews. When something does break, a blameless post-mortem captures what happened, what the impact was, and what changes prevent recurrence. For LLM-specific incidents, common contributing factors include: a model provider outage; a new model version that regressed behavior; a prompt change that broke specific use cases; a cost spike from unexpected usage; a broken output classifier (so quality alerts didn’t fire); a rate limit hit on the provider. Each gets its own follow-up actions.

Alert fatigue prevention. The fastest way to make your team ignore alerts is to fire too many of them. Aim for fewer than 5 page-level alerts per week per on-call engineer and fewer than 20 warn-level alerts per week. If you’re above these thresholds, tune more aggressively: raise thresholds, require longer breach windows, add context that helps the responder decide quickly. Alert-fatigue tracking should itself be a metric: how often does the team page on something that turns out to be non-actionable?
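The non-actionable page rate mentioned above can be computed from a simple log of pages. The sketch below assumes each page is labeled as actionable or not during a handoff review; the record shape and the sample data are hypothetical.

```python
def non_actionable_ratio(pages: list[dict]) -> float:
    """Share of pages that required no action from the responder."""
    if not pages:
        return 0.0
    noise = sum(1 for p in pages if not p["actionable"])
    return noise / len(pages)

# Hypothetical week of pages, labeled during the on-call handoff review.
week = [
    {"alert": "HighErrorRate", "actionable": True},
    {"alert": "P95LatencyHigh", "actionable": False},
    {"alert": "QualityRegression", "actionable": True},
    {"alert": "P95LatencyHigh", "actionable": False},
]
ratio = non_actionable_ratio(week)
```

Grouping the same calculation by alert name would show which rules are the best candidates for tuning.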

Runbook automation. When the same alert fires repeatedly, the response can often be automated. If a provider is rate-limiting, the system can automatically fall back to a second provider. If output quality drops, the system can automatically roll back to the prior prompt version. Each automated runbook should still log to the audit trail; humans need to know what the system did on their behalf.

# Automated runbook pattern: fall back to a secondary provider when the primary call fails
async def call_llm_with_fallback(prompt):
    # Try primary
    try:
        return await primary_provider.call(prompt)
    except (RateLimitError, ProviderError) as e:
        # Record metric and log
        metrics.counter("llm.fallback_triggered").inc(tags={"reason": type(e).__name__})
        audit_log.record({"event": "auto_fallback", "from": "primary", "to": "secondary", "reason": str(e)})
        return await secondary_provider.call(prompt)

Incident retrospectives across teams. Many LLM incidents span team boundaries — a model provider issue affects multiple features; a shared platform change affects everyone. Quarterly cross-team retrospectives surface patterns that single-team retros miss. Common findings: shared infrastructure issues that no single team had context to anticipate; vendor management gaps (no clear single-point-of-contact for the provider relationship); coordination breakdowns during multi-team incidents.

Chapter 14: Observability + Evals + FinOps — the production triad

LLM observability, evals, and FinOps are three sides of the production AI triangle. Each addresses different questions; together they make production AI maintainable. The relationship between them matters because their data substrates overlap, and the most effective teams treat them as integrated rather than separate.

Discipline | Question it answers | Data substrate | Primary tool
Observability | What is happening right now? | Traces, metrics, logs from production | Langfuse / Phoenix / LangSmith + APM
Evals | Is the system meeting quality requirements? | Curated datasets + scoring harnesses | promptfoo / deepeval / custom
FinOps | What does it cost and how can we optimize? | Cost dashboards + usage attribution | Provider billing + custom rollups

Integration patterns. Production traces feed eval datasets — surprising failures observed in production become new eval cases. Eval results feed observability dashboards — current eval pass-rate is a metric worth monitoring alongside latency and cost. FinOps decisions depend on observability data — switching a feature to a cheaper model requires verifying via observability that quality didn’t regress.

The mature teams treat the three disciplines as collaborative rather than separate. The same data pipeline that produces production traces also produces eval candidates and cost attribution data. The same dashboards surface latency, quality, and cost together so trade-off decisions are informed.

# Unified data pipeline serving all three disciplines
from dataclasses import dataclass
from datetime import datetime

# Field types are illustrative; adapt them to your own schema
@dataclass
class LLMCall:
    # Standard fields used by all three
    request_id: str
    timestamp: datetime
    user_id_hash: str  # hash of the user ID, not the raw ID
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    latency_ms: float
    # Quality fields for evals/observability
    input: str
    output: str
    classifier_scores: dict[str, float]
    # FinOps fields
    feature: str
    tenant_id: str

# Single source of truth feeds:
# - Observability (traces, metrics)
# - Evals (sample selection for golden datasets)
# - FinOps (cost rollups, anomaly detection)
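
To make the "single source of truth" idea concrete, the following sketch derives one view per discipline from the same list of call records. The `CallRecord` shape, the `quality_floor` threshold, and the sample values are assumptions chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class CallRecord:
    request_id: str
    feature: str
    tenant_id: str
    latency_ms: float
    cost_usd: float
    quality_score: float  # hypothetical classifier score in [0, 1]

def fan_out(records, quality_floor=0.5):
    """Derive observability, eval, and FinOps views from the same records."""
    latencies = defaultdict(list)        # observability: latency per feature
    eval_candidates = []                 # evals: low-scoring calls to review
    cost_by_tenant = defaultdict(float)  # FinOps: spend attribution
    for r in records:
        latencies[r.feature].append(r.latency_ms)
        if r.quality_score < quality_floor:
            eval_candidates.append(r.request_id)
        cost_by_tenant[r.tenant_id] += r.cost_usd
    return dict(latencies), eval_candidates, dict(cost_by_tenant)

records = [
    CallRecord("r1", "search", "acme", 820.0, 0.25, 0.91),
    CallRecord("r2", "search", "acme", 1430.0, 0.5, 0.32),
    CallRecord("r3", "summarize", "globex", 610.0, 0.125, 0.88),
]
latencies, eval_candidates, cost_by_tenant = fan_out(records)
```

Because all three views come from one pass over one record type, a schema change shows up in all of them at once rather than drifting between separate pipelines.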

Organizational integration. The three disciplines often live in different teams. Observability is usually with the SRE/platform team. Evals tend to live with the ML/AI team. FinOps may be with finance or platform. Cross-team rituals (a monthly “production AI health” meeting that reviews all three together) prevent decisions in one area from breaking the others.

Shared dashboards. The most effective integration tactic is a single dashboard that shows all three dimensions side by side: latency p95 vs eval pass rate vs daily cost. When one metric moves, the dashboard makes it obvious whether the others moved in concert. This single view catches trade-off problems early — “we optimized cost by switching models but quality dropped 5%” becomes visible the day it happens, not at the end of the month when CSAT scores arrive.

Cross-discipline incident triage. When a production incident occurs, the response often touches all three areas. The user-facing symptom is observed via observability (latency spike). The diagnostic asks an eval question (is the new model worse on our suite?) and a FinOps question (did cost spike for the same reason?). Mature teams have a single incident commander whose first call is to look at the unified dashboard rather than three separate ones.

The “production triad” as a hiring concept. Job descriptions for AI platform engineers in 2026 increasingly list “experience with all three of observability, evals, and FinOps” as a requirement. Candidates with all three areas of competence are rare and command premium compensation; the value to the org is that they make integrated decisions where specialist engineers would optimize one dimension at the expense of others.

Chapter 15: Common observability mistakes

Recurring patterns across many teams’ LLM observability journeys. Knowing them in advance avoids costly retrofits.

Mistake 1: instrumenting too late. Adding observability after a feature ships to production is dramatically harder than building it in from day one. The right time is during initial development; the wrong time is after the third incident.

Mistake 2: capturing everything, including PII, without thinking about privacy. The trace store becomes a privacy liability; deleting later is hard. Start with PII-aware redaction from the first instrumentation.

Mistake 3: siloed LLM observability that doesn’t integrate with general APM. Engineers debugging an issue have to look in two places; correlations across the boundary are lost. Pick tools that integrate; export to both LLM-native and general APM.

Mistake 4: alerting on every blip. Noisy alerts get muted; muted alerts catch nothing. Tune thresholds carefully; require sustained breaches before paging; differentiate page-level from warn-level severities.

Mistake 5: capturing traces but no metrics. Traces are great for debugging specific failures; metrics are essential for trends, SLOs, and dashboards. Both, not one or the other.

Mistake 6: 100% trace sampling at scale. Traces are expensive to store; capturing every request at 100% becomes the dominant observability cost. Sample intelligently (for example, 100% of errors and high-latency requests, 10% of normal traffic) to balance fidelity and cost.

Mistake 7: no integration with eval datasets. Production failures should automatically become candidates for the eval dataset. Without the feedback loop, you’re paying for observability infrastructure that doesn’t compound into prevention.

Mistake 8: ignoring streaming responses in instrumentation. Streaming hides interesting signals (first-token latency, mid-stream errors) that batch instrumentation misses. Instrument streams explicitly.

Mistake 9: separate dashboards for ML and SRE. The same dashboard should show latency, cost, and quality so trade-offs are visible. Otherwise teams make decisions in silos.

Mistake 10: storing traces indefinitely. Traces accumulate; retention costs accumulate. Define retention policy by trace type (errors keep longer; routine traces shorter); enforce via the observability tool.

Mistake 11: capturing too few fields. Sparse instrumentation produces traces that don’t actually help debugging. Capture the comprehensive field set from chapter 6.

Mistake 12: ad-hoc trace naming. Inconsistent span names make queries impossible. Pick a convention (verb.object or hierarchical dotted names) and stick with it.

Mistake 13: not closing the loop between observability and product roadmap. Observability surfaces patterns about how users interact with your AI feature — what they ask, what works, what doesn’t. Product teams should be looking at observability data when prioritizing improvements. Engineering teams that operate observability in isolation from product decisions waste a major source of customer signal.

Mistake 14: treating observability as a one-time setup. Production systems evolve; observability has to evolve with them. New features need new instrumentation. New models require updated metric definitions. New attack patterns from red teaming should feed new alerts. Without ongoing investment, the observability system slowly stops being useful.

Mistake 15: relying on a single vendor without portability. If your traces only live in vendor X’s proprietary format, switching vendors becomes a rebuild. Pick tools that emit (and ideally consume) OpenTelemetry; that way your instrumentation is portable across backends. The instrumentation is the real asset; specific dashboards are easier to recreate than instrumentation.

Mistake 16: missing the LLM-specific signals that traditional APM doesn’t surface. Token counts, prompt versions, retrieved document IDs, output classifier scores — none of these come from standard HTTP instrumentation. Make sure your stack captures them explicitly; don’t assume your existing APM gives you everything.

Mistake 17: not tracking failed responses separately from errors. A response that returned 200 but was wrong (the model hallucinated, the retrieval was misled, the output failed schema validation) is a different failure mode from a 500 error. Both deserve observability dimensions; conflating them hides important signal. Output classifier scores plus parse-success flags catch the silent-failure class.

Mistake 18: ignoring vendor SLAs. LLM providers have stated SLAs (latency, availability). Your observability should compare provider-side latency (the API call duration) against the SLA so you know whether your latency issues are provider-side or your own. Without this, you can’t escalate to the provider when their performance violates the contract.

Mistake 19: insufficient sampling for low-traffic features. If a feature gets 100 requests/day and you sample at 10%, you have 10 traces — too few for meaningful debugging. Adjust sampling per-feature; low-volume features should be at higher (or 100%) sampling rates than high-volume ones.

Mistake 20: not exercising observability during incidents. Tabletop drills where the team simulates an incident and tries to debug via observability surface gaps: missing metric, untraced span, unfamiliar tool. Run drills quarterly; treat findings as input to observability investment priorities.
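
Mistakes 6 and 19 both come down to sampling policy. The sketch below shows one possible keep-or-drop decision made after a request finishes: always keep errors and slow requests, and sample the rest at a per-feature rate. The feature names, rates, and latency threshold are hypothetical, and the function is a standalone illustration rather than the sampler interface of any tracing SDK.

```python
import hashlib

# Hypothetical per-feature base rates; low-volume features are kept in full.
BASE_RATES = {"chat": 0.10, "contract_review": 1.0}
DEFAULT_RATE = 0.10
SLOW_MS = 5000  # assumed latency threshold for "always keep"

def keep_trace(trace_id: str, feature: str, is_error: bool, latency_ms: float) -> bool:
    """Keep all errors and slow requests; sample everything else by feature rate."""
    if is_error or latency_ms >= SLOW_MS:
        return True
    rate = BASE_RATES.get(feature, DEFAULT_RATE)
    # Map the trace ID to an integer bucket so the decision is deterministic
    # for a given trace, then compare against the rate scaled to 64 bits.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64
```

Hashing the trace ID instead of calling a random number generator means every service that sees the same trace makes the same decision, which avoids partially sampled traces.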

Chapter 16: FAQ

What’s the minimum observability setup for a small team shipping an LLM feature?

One LLM-native tool (Langfuse self-hosted or cloud is the simplest path) plus integration with whatever APM you already use. Instrument every LLM call with the comprehensive field set from chapter 6. Set up basic dashboards for latency, cost, and error rate. Configure alerts for the obvious failure modes (provider down, error rate spike, cost spike). This minimum baseline takes 2-5 days to set up and prevents most debugging time sinks.

How much does LLM observability cost?

Three components. Storage of traces and metrics — typically $50-$500/month for moderate scale, $5K-$50K/month for high scale. LLM-native tool subscriptions — $200-$5K/month for cloud-hosted; self-hosted is free except for compute. Engineering time — 0.1-0.5 FTE for ongoing maintenance, more during initial setup. Total cost should be 1-5% of your LLM API spend; if it’s much more, your sampling/storage strategy needs tuning.

Should I build observability myself or buy a tool?

For most teams, buy. The leading LLM observability tools (Langfuse, Phoenix, LangSmith) have years of development invested in trace UI, query speed, eval integration, and operational concerns. Building equivalent functionality in-house takes person-years and produces something that’s usually inferior. The exceptions: teams with unusual scale, strict data-residency requirements, or specific compliance needs that off-the-shelf tools don’t meet.

How do I choose between Langfuse, Phoenix, and LangSmith?

Three quick heuristics. If you’re building heavily on LangChain, LangSmith integrates tightly. If RAG quality is your dominant concern, Phoenix’s RAG analytics are best-in-class. For most other applications, Langfuse provides the best combination of features (tracing + evals + prompt management) plus a strong open-source option. Many teams use a primary tool plus export to general APM via OpenTelemetry.

What about local LLMs — do I need different observability?

Largely the same, plus additional infrastructure metrics (GPU utilization, memory pressure, inference engine queue depth). The LLM-specific data (prompts, outputs, tokens) is the same; the infrastructure data is more relevant when you self-host. Most LLM observability tools support both managed-API and self-hosted backends via configuration.

How do I instrument LLM apps written in JavaScript / TypeScript?

OpenTelemetry has solid Node.js support. The major LLM observability tools all have JS/TS SDKs. The pattern is the same as Python: use instrumentation SDKs that auto-detect LLM calls, and supplement them with manual spans for custom logic. @opentelemetry/sdk-node plus auto-instrumentation packages for express/fastify/etc. cover the bulk of typical applications.

How do I handle traces for very long-running agents?

Long-running agent runs (10+ minutes, thousands of steps) produce huge traces that strain trace UIs. Strategies: split the long run into multiple “epoch” spans (one per logical phase); use trace links to connect related spans rather than nesting them; export to a backend that handles large traces well (Honeycomb, Jaeger, Grafana Tempo). For very long-running tasks, consider treating each step as a separate trace tied via a workflow ID.

How do I observe LLM calls from third-party services I integrate with?

If the third party is OTel-instrumented and you control the trace propagation, their LLM calls become child spans of your trace. If they’re not, you only see “called external service X for Y ms”; the LLM details are opaque. For critical third-party services, ask whether they support OTel propagation; many LLM frameworks and SDKs (Vercel AI SDK, Mastra, etc.) do.

What metrics matter most for product teams (vs platform teams)?

Product teams care most about quality and user-perceived metrics: response quality scores, user feedback ratios, first-token latency (for streaming UX), and feature-specific completion rates. Platform teams care about underlying performance: provider error rates, cache hit rates, cost-per-request trends, capacity headroom. Dashboards should serve both audiences with appropriate views.

How do I detect quality drift over time?

Combine several signals. Run a stable eval suite on each model version and compare scores. Run trace-level output classifiers on production data. Track user feedback trends (positive/negative ratios over time). Periodically compare production samples against a held-out reference. Drift detection works best when multiple signals agree, because any single signal can be noisy.

How do I deal with very high-cardinality attributes (e.g., user ID)?

High-cardinality attributes in traces are fine (every trace has its own user ID); high-cardinality in metrics is a problem (creates millions of metric series). For metrics, aggregate by feature, model, and tenant; not user ID. For traces, capture user ID for debugging but ensure trace storage is priced reasonably for your volume.
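
A small sketch of that split: the same request attributes go to traces in full, while only an allow-list of low-cardinality keys becomes metric labels. The allow-list and attribute names are assumptions for illustration.

```python
# Assumed allow-list of low-cardinality keys that are safe as metric labels.
METRIC_LABEL_KEYS = ("feature", "model", "tenant_id")

def split_attributes(attrs: dict) -> tuple[dict, dict]:
    """Return (metric_labels, trace_attributes) for one request."""
    metric_labels = {k: attrs[k] for k in METRIC_LABEL_KEYS if k in attrs}
    trace_attributes = dict(attrs)  # traces keep everything, including user ID
    return metric_labels, trace_attributes

labels, trace_attrs = split_attributes({
    "feature": "search",
    "model": "example-model",
    "tenant_id": "acme",
    "user_id": "u-123456",
    "request_id": "r-789",
})
```

An allow-list is used rather than a deny-list so that a newly added high-cardinality attribute does not silently become a metric label.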

What’s the relationship between observability and red teaming?

Observability data feeds red team work — anomalies surfaced by output classifiers are candidates for red team investigation. Red team findings inform observability — new attack patterns become alerts in production monitoring. Mature teams treat the two as connected: observability surfaces what’s happening; red teaming asks what could happen and how to detect it.

How do I store traces for compliance / audit purposes?

Use a separate audit log pipeline alongside debugging traces. Audit logs: structured, append-only, encrypted, retained per compliance requirements (often 7 years), restricted access. Debugging traces: same data, shorter retention (30-90 days), accessible to engineers for troubleshooting. Some tools support both modes natively; for others, ship the same trace data to two backends with different retention policies.

How do I observe LLM calls in serverless / edge environments?

OTel exporters can run in serverless functions but cold-start overhead matters. Pattern: buffer spans during the invocation and flush them explicitly before the function exits, since background export work may not run once the handler returns; for very latency-sensitive paths, ship to a side-channel async log that’s later ingested. Edge environments (Cloudflare Workers, Vercel Edge Functions) have more constraints; use lightweight HTTP exporters and avoid heavy SDKs.

How do I prevent observability infrastructure from itself becoming a critical-path dependency?

The risk: observability infra goes down; the LLM app keeps running but you have no visibility. Worse: observability infra failures cascade into LLM app failures (the export blocks the request). Mitigations: instrumentation should always be fire-and-forget (failing to emit a trace must not fail the request); use async batch exporters with bounded queues; if the export queue fills, drop traces rather than back-pressure the app; monitor the observability pipeline itself with a separate, simpler observability path so you know when it’s broken.
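
The fire-and-forget mitigation can be sketched with a bounded queue and a background worker. This is a minimal stdlib illustration of the drop-rather-than-block behavior, not a replacement for an SDK batch exporter; the class name and the `send` callable are hypothetical.

```python
import queue
import threading

class DroppingExporter:
    """Queues spans for a background sender and drops them when the queue is full."""

    def __init__(self, send, max_queue=1000):
        self._send = send  # callable that ships one span to the backend
        self._queue = queue.Queue(maxsize=max_queue)
        self.dropped = 0   # approximate counter; not synchronized across producers
        self._worker = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._worker.start()

    def emit(self, span: dict) -> None:
        """Never blocks and never raises into the request path."""
        try:
            self._queue.put_nowait(span)
        except queue.Full:
            self.dropped += 1

    def _run(self):
        while True:
            span = self._queue.get()
            try:
                self._send(span)
            except Exception:
                pass  # an export failure must not kill the worker
            finally:
                self._queue.task_done()

    def flush(self):
        """Block until everything queued so far has been handed to send()."""
        self._queue.join()
```

Exposing `dropped` as a metric on the separate, simpler monitoring path gives an early signal that the export pipeline is falling behind.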

How do I integrate observability data with my data warehouse for offline analysis?

Most observability tools export traces to S3, GCS, or a data warehouse as a destination. The pattern: traces stay in the observability tool for short-term (30-90 days) debugging; copies flow to your warehouse (BigQuery, Snowflake, Databricks) for long-term analysis. Warehouse access enables ad-hoc SQL queries against trace data, joins with business data (revenue, user engagement, support tickets), and ML feature extraction.

# Daily trace export to BigQuery (illustrative)
# OTel Collector → Cloud Storage → BigQuery scheduled query

# In OTel Collector:
exporters:
  otlphttp/gcs:
    endpoint: "https://gcs-otel-uploader.internal/v1/traces"

# Or use a separate exporter that writes Avro / Parquet directly to GCS
# Then load into BigQuery on a daily schedule

# Once in BigQuery, query like normal warehouse data:
SELECT
    DATE(timestamp) AS day,
    attributes.feature AS feature,
    AVG(duration_ms) AS avg_latency_ms,
    SUM(attributes.cost_usd) AS daily_cost,
    COUNTIF(status = 'ERROR') AS errors
FROM `project.observability.traces`
WHERE DATE(timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY day, feature
ORDER BY day DESC, daily_cost DESC;

How do I bootstrap observability when retrofitting onto an existing app?

Start with the simplest layer that adds immediate value. Wrap every LLM call in a function that logs the inputs, outputs, model, tokens, and timing to a single structured log stream — that’s it for week one. In week two, introduce OTel and start emitting spans. In week three, add metrics derived from the logs. In month two, add output classifiers. In month three, integrate with your APM. The mistake to avoid is trying to ship everything at once; partial observability that works beats comprehensive observability that’s stuck in review.
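
The week-one wrapper might look like the sketch below: one function that times the call and writes a single JSON log line with inputs, outputs, model, tokens, and status. The result shape (`text`, `input_tokens`, `output_tokens`) and the `fake_llm` stand-in are assumptions; a real provider client will return a different structure.

```python
import json
import logging
import time

logger = logging.getLogger("llm_calls")

def logged_call(call_fn, *, model: str, prompt: str) -> str:
    """Wrap an LLM call and emit one structured log line per call."""
    start = time.perf_counter()
    record = {"model": model, "input": prompt}
    try:
        result = call_fn(model=model, prompt=prompt)
        record.update(
            output=result["text"],
            input_tokens=result["input_tokens"],
            output_tokens=result["output_tokens"],
            status="ok",
        )
        return result["text"]
    except Exception as exc:
        record.update(status="error", error=type(exc).__name__)
        raise
    finally:
        # Runs on success and failure, so every call produces exactly one line.
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 2)
        logger.info(json.dumps(record))

def fake_llm(model: str, prompt: str) -> dict:
    """Stand-in for a real provider client; the result shape is hypothetical."""
    return {"text": prompt.upper(), "input_tokens": 3, "output_tokens": 3}
```

Since this logs raw prompts and outputs, the redaction rules from the privacy chapter apply to this log stream as much as to traces.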

How do I diagnose increased latency that doesn’t correlate with model changes?

Common causes when model version is stable. Network changes (DNS issue, route degradation between you and provider). Provider region hot-spots (if you’re using a multi-region provider, one region may be more loaded than others). Context length growth (your prompt got longer over time and that increased per-call latency). Tool latency (an agent’s tool calls slowed down, dragging the total latency up). Retrieval latency (vector store growth without rebalancing). Diagnose by walking the trace span-by-span and identifying which phase increased.
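
Walking the trace span-by-span amounts to comparing per-phase durations against a baseline. The sketch below assumes you have already aggregated durations per phase (for example, the median per span name) for a baseline period and the current period; the phase names and numbers are made up.

```python
def slowest_growth(baseline: dict, current: dict):
    """Return (phase, delta_ms) for the phase whose duration grew the most."""
    deltas = {
        phase: current[phase] - baseline.get(phase, 0.0)
        for phase in current
    }
    phase = max(deltas, key=deltas.get)
    return phase, deltas[phase]

# Hypothetical per-phase durations in milliseconds.
baseline = {"retrieval": 120.0, "llm_call": 1800.0, "tool_call": 300.0}
current = {"retrieval": 640.0, "llm_call": 1850.0, "tool_call": 310.0}
phase, delta = slowest_growth(baseline, current)
```

In this made-up data the total latency rose by about 580 ms, and the comparison attributes most of it to retrieval rather than the model call.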

How do I bootstrap quality monitoring without an existing eval suite?

Three phases. Phase one: deploy a generic output classifier (for example, an open-source quality or safety model) on a sampled subset of production traffic. Phase two: have humans label a sample of outputs (50-100 cases) and calibrate the classifier, adjusting the rubric until classifier scores correlate well with human judgment. Phase three: incrementally build out a richer eval suite from production failure modes (see the LLM Evals 2026 eguide). The pattern: start coarse, refine as you learn what matters.

How do I structure observability for a multi-tenant SaaS LLM product?

Tenant ID is the most important attribute on every trace and metric. Build per-tenant dashboards so the customer success team can see what specific accounts are experiencing. Set per-tenant SLOs (some enterprise customers have tighter SLAs than basic plan customers). Most importantly, alert on per-tenant regressions — a global p95 latency that hasn’t moved could hide a single high-value customer who’s seeing 10x normal latency. Without per-tenant slicing, you’ll miss customer-specific issues until they churn.
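
A per-tenant regression check can be sketched as follows: compute p95 latency per tenant and compare it with that tenant's own baseline. The nearest-rank percentile, the 2x factor, and the sample data are assumptions for illustration.

```python
def p95(values):
    """Nearest-rank 95th percentile using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]

def tenant_regressions(latencies_by_tenant, baseline_p95, factor=2.0):
    """Tenants whose current p95 exceeds factor x their own baseline p95."""
    flagged = {}
    for tenant, values in latencies_by_tenant.items():
        current = p95(values)
        if tenant in baseline_p95 and current > factor * baseline_p95[tenant]:
            flagged[tenant] = current
    return flagged

# Hypothetical latencies in milliseconds: one tenant has a slow tail.
latencies_by_tenant = {
    "acme": [1000.0] * 20,
    "globex": [1000.0] * 18 + [9000.0] * 2,
}
baseline_p95 = {"acme": 900.0, "globex": 900.0}
flagged = tenant_regressions(latencies_by_tenant, baseline_p95)
```

Comparing each tenant against its own baseline, rather than a global threshold, is what lets a single affected account stand out even when the global p95 has not moved.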

How do I correlate observability data with revenue and business metrics?

Tag every trace with attributes that connect to business outcomes (user tier, plan, ARR-bracket, conversion event). When the warehouse joins traces with billing/revenue data, you can answer questions like “what’s the cost per acquired-customer for this feature” or “do users on the enterprise plan have different latency distributions”. The integration is usually via a shared user/tenant identifier that’s stable across both systems.

How do I monitor a model that I self-host vs an external API?

Self-hosted models add an infrastructure layer to observe alongside the LLM-specific signals. GPU utilization, GPU memory, inference engine queue depth, KV cache hit rate, batch size, and per-request P99 latency all matter for understanding why a self-hosted model is slow or expensive. Most inference engines (vLLM, TGI, SGLang) expose Prometheus metrics directly; scrape them and integrate with the same dashboards that show your LLM-app metrics. The combination — application metrics + infrastructure metrics — is the only way to do effective capacity planning for self-hosted setups.

How does observability work for AI gateway / proxy patterns?

Many enterprises route all LLM traffic through an internal gateway (LiteLLM, Portkey, Helicone, custom). The gateway is the natural place to instrument everything — it sees every call, can enforce instrumentation policy, and can produce traces even when application code is uninstrumented. The trade-off: the gateway becomes a critical path; if it goes down, all LLM calls fail. Design for high availability.

What metrics should I report to executives vs to engineers?

Engineers care about technical metrics — latency percentiles, error rates, cache hit rates, cost-per-call. Executives care about business outcomes — total cost, cost per active user, quality trends, customer satisfaction impact. Build two views of the same underlying data: an engineering dashboard with deep technical detail and a leadership dashboard with high-level business indicators. Both pull from the same observability pipeline.

How do I make observability work with serverless / Lambda-style deployments?

Serverless functions have cold starts that affect observability: the OTel SDK takes longer to initialize on the first request after a cold start. Initialize the SDK lazily where your runtime allows it; ensure exports use HTTP/Protobuf (not gRPC, which has more init overhead); flush spans before the function returns rather than relying on background flush. Cloud providers’ built-in tracing (AWS X-Ray, Cloud Trace) integrates with OTel but has its own quirks; ship to your primary observability tool via OTel and treat cloud-native tracing as supplementary.

How do I integrate observability with on-call rotations?

Wire critical alerts to your on-call paging system (PagerDuty, Opsgenie, Squadcast). Each alert should include: link to the relevant dashboard; runbook entry covering the typical response; trace IDs of recent examples; severity (page vs warn vs info). Maintain a clear separation between AI-specific alerts and general infrastructure alerts so the right team responds. Rotate AI-specific on-call duties; the rotation builds shared knowledge across the team about how the AI features actually behave in production.

How do I observe latency in async agent workflows that complete out of band?

An agent that returns immediately and continues working asynchronously has its own observability needs. Capture the “job submitted” event as the request-side trace; capture the “job completed” event as a separate span linked to the original via the agent task ID. Latency for the user-facing handoff is one number; total-time-to-completion is a separate metric. Both matter; track both.
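
The two numbers can be derived from events that share a task ID. The sketch below assumes three event kinds (submitted, acknowledged, completed) with epoch timestamps; the event names and record shape are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class JobEvent:
    task_id: str
    kind: str         # "submitted", "acknowledged", or "completed"
    timestamp: float  # seconds since epoch

def job_latencies(events):
    """Per task: user-facing handoff latency and total time to completion."""
    by_task = {}
    for e in events:
        by_task.setdefault(e.task_id, {})[e.kind] = e.timestamp
    result = {}
    for task_id, ts in by_task.items():
        if "submitted" not in ts:
            continue  # cannot measure anything without the start event
        start = ts["submitted"]
        result[task_id] = {
            "handoff_s": ts["acknowledged"] - start if "acknowledged" in ts else None,
            "completion_s": ts["completed"] - start if "completed" in ts else None,
        }
    return result

events = [
    JobEvent("task-1", "submitted", 100.0),
    JobEvent("task-1", "acknowledged", 100.5),
    JobEvent("task-1", "completed", 160.0),
    JobEvent("task-2", "submitted", 200.0),
]
latencies = job_latencies(events)
```

Tasks with a missing completion event get `None` rather than being dropped, which makes stuck or abandoned jobs visible in the same report.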

How do I diagnose intermittent quality regressions?

Intermittent regressions are the hardest to debug because they don’t reproduce on demand. Strategies. First, increase sampling rate temporarily when the team suspects a regression — better to pay 2x storage for a week than miss the diagnostic data. Second, capture maximum context on every trace (full prompt, full output, retrieval, model version, prompt version) so when you find an example you have everything. Third, run automated diff comparisons between current production samples and historical baselines (same input, what changed in the output?). Fourth, look at correlations — what’s different about the failing requests (time of day, user segment, input characteristics)?

How does observability fit into a regulated industry deployment?

In regulated industries (finance, healthcare, government), observability serves both operational and compliance roles. The compliance role requires: immutable audit logs of every AI interaction; access controls that limit who can view trace contents; retention that meets industry rules (often 7+ years); deletion processes that support right-to-erasure where applicable; documented evidence that observability infrastructure itself is monitored and access is logged. Plan for these from day one — retrofitting compliance onto an existing observability system is dramatically harder than building it in.

How do I avoid noisy traces from health-check endpoints?

Health checks shouldn’t produce trace noise. Filter them at the OTel Collector via attribute matching (drop spans where http.url contains /health). Alternatively, skip instrumentation for health-check handlers. The general principle: instrument signal, not noise. Internal API calls, metrics scrapes, and other non-user-facing traffic typically don’t need tracing.
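
If you skip instrumentation in application code instead of filtering at the Collector, the check can be a small predicate like the sketch below. The path list is an assumption; it matches exact paths rather than substrings so that a route such as /healthcare is not dropped by accident.

```python
from urllib.parse import urlsplit

# Assumed paths for non-user-facing traffic; adjust to your own routes.
NOISE_PATHS = ("/health", "/healthz", "/ready", "/metrics")

def should_trace(url: str) -> bool:
    """Return False for health checks and metrics scrapes, True otherwise."""
    path = urlsplit(url).path.rstrip("/") or "/"
    return path not in NOISE_PATHS
```

A request handler or middleware would call `should_trace(request_url)` before starting a span and skip span creation when it returns False.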

How do I share observability dashboards across the team?

Save dashboards in the observability tool with descriptive names; document them in your team wiki with screenshots and “use this when…” descriptions. For high-visibility metrics (overall service health), put screenshots on a TV in the team area. For ad-hoc investigations, share trace IDs and dashboard URLs in incident channels so the next person can pick up where the last left off.

How do I observe LLM apps that mix multiple providers?

Tag every trace with the provider in addition to the model. Build per-provider dashboards alongside per-model dashboards. Track provider-specific signals: rate-limit headers from each provider; cost breakdowns; error rates by provider; latency comparisons across providers for similar models. Correlation is key — knowing that one provider’s latency degraded while others held steady localizes the problem to that provider, not your application.

# Per-provider tagging
span.set_attributes({
    "llm.provider": "anthropic",  # or "openai", "google", "self-hosted"
    "llm.model": "claude-opus-4-7",
    "llm.api_region": "us-east-1",
    "llm.api_version": "2026-05-15",
})

# Dashboards filter and group by provider for vendor management
# When negotiating with providers, having clear per-provider performance
# data is invaluable

How do I observe streaming responses that span websocket connections?

Each websocket message is conceptually a span. The websocket connection itself can be a parent span containing all messages. OTel doesn’t have native websocket conventions yet, so this often requires custom instrumentation. Capture per-message timing and per-message status; aggregate to per-connection metrics. For very long-running connections, periodically flush spans rather than waiting for connection close.

How do I handle observability for AI features that integrate with multiple frameworks?

A common pattern: one feature uses LangChain for orchestration, another uses LlamaIndex for retrieval, and a third uses raw Anthropic SDK calls. Each framework has its own tracing approach. The unifying layer is OpenTelemetry: instrument each framework with its OTel-compatible package (openinference-instrumentation-langchain, openinference-instrumentation-llama-index, and an OTel instrumentation for the Anthropic SDK such as openinference-instrumentation-anthropic), and the spans land in the same backend with consistent semantic conventions. The work is per-framework but the result is uniform.

How does observability interact with feature flags and A/B tests?

Tag every trace with active feature flags and A/B test variants. When a metric changes, you can determine whether the change correlates with a flag rollout or an A/B test. Without this, attribution is guesswork. Feature flag SDKs (LaunchDarkly, Statsig, Unleash) typically integrate with observability — enable the integration so flag context appears in every trace and metric automatically.
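When a vendor integration is not available, the tagging itself is small enough to do by hand. The sketch below flattens the active flags into span attributes and then groups a metric by variant; the `feature_flag.<key>` attribute naming and the dict-shaped trace records are illustrative assumptions, not a vendor or OpenTelemetry convention.

```python
from collections import defaultdict


def flag_attributes(flags: dict) -> dict:
    """Flatten {flag_key: variant} into span attributes with string values."""
    return {f"feature_flag.{key}": str(variant) for key, variant in flags.items()}


def mean_by_variant(records: list, flag_key: str, metric: str) -> dict:
    """Average one metric per variant of a single flag across trace records."""
    attr = f"feature_flag.{flag_key}"
    buckets = defaultdict(list)
    for rec in records:
        variant = rec["attributes"].get(attr, "unset")
        buckets[variant].append(rec[metric])
    return {variant: sum(xs) / len(xs) for variant, xs in buckets.items()}
```

The result of `flag_attributes` can be passed straight to `span.set_attributes(...)`. Grouping by variant is then the same query in any backend: filter on the attribute and compare the metric across its values.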

What does the future of LLM observability look like?

Three trends. First, semantic conventions converge — OpenInference / OpenLLMetry become standard, so trace attributes are interoperable across tools. Second, agent-specific tooling matures — visualization of complex agent traces becomes a first-class capability. Third, more automation — anomaly detection, automatic eval-case extraction from failures, and AI-assisted debugging (using LLMs to summarize trace patterns) become standard features. Expect the LLM observability market to consolidate; the leading tools today will likely be among the leading tools in 2027-2028 but with substantially richer feature sets.

Closing thoughts

LLM observability in 2026 has matured from “would be nice” to “table stakes for production”. The teams that invest in observability ship reliable AI features and debug issues in minutes; the teams that skip it ship unreliable AI features and debug for hours. The patterns documented in this guide — traces, metrics, output classifiers, integration with evals and FinOps, careful PII handling, alerting and incident response — give your team the foundation to operate production AI confidently. Start with the basics (instrument every LLM call with the comprehensive field set), add complexity as needed (agent traces, streaming, output classifiers), and integrate with the rest of your observability practice from day one. The investment compounds over the lifetime of every AI feature your team ships.
