Observability

The ability to understand a system's internal state from its external outputs, achieved through the three pillars of metrics, logs, and traces working together to enable effective debugging and monitoring.

Observability goes beyond traditional monitoring. While monitoring tells you when something is broken (alerting on known failure modes), observability lets you investigate why something is broken and discover unknown failure modes. The three pillars are metrics (numerical time-series data like request rates and error counts), logs (structured event records), and traces (request paths through distributed services).
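To make the three pillars concrete, here is a minimal, self-contained sketch in plain Python of how a single request might emit all three signals: a counter metric, a structured log, and a trace id that ties the log back to the request. All names (`handle_request`, the `/checkout` path, the event fields) are illustrative, not taken from any particular platform.

```python
import json
import time
import uuid
from collections import Counter

# Metric: numerical time-series data; here a simple in-process request counter.
request_count = Counter()

def handle_request(path: str) -> dict:
    """Handle one request while emitting all three pillars."""
    trace_id = uuid.uuid4().hex  # Trace: an id that follows this request end to end
    start = time.monotonic()

    request_count[path] += 1     # Metric: increment the request-rate counter

    # ... real request work would happen here ...
    duration_ms = (time.monotonic() - start) * 1000

    # Log: a structured event record, tagged with the trace id so it can
    # later be correlated with the trace it belongs to.
    log_line = json.dumps({
        "event": "request_handled",
        "path": path,
        "trace_id": trace_id,
        "duration_ms": round(duration_ms, 2),
    })
    return {"trace_id": trace_id, "log": log_line}

result = handle_request("/checkout")
```

In a real system the counter would be scraped by a metrics backend, the log line shipped to a log store, and the trace id propagated to downstream services; the key idea is that all three share identifiers.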

Modern observability platforms like Datadog, Grafana Cloud, and Honeycomb correlate data across all three pillars. When a latency spike appears in metrics, you can drill down to the specific traces that were slow, then examine the logs from those requests to identify the root cause, all within a unified interface.
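The drill-down workflow described above can be sketched with toy in-memory data: filter traces for the slow ones on a route, then pull the logs that share each slow trace's id. The data and the `drill_down` helper are hypothetical; a real platform queries its own metric, trace, and log backends.

```python
import json

# Toy telemetry store (hypothetical data): traces with latency,
# and logs keyed by the trace_id they were emitted under.
traces = [
    {"trace_id": "a1", "route": "/api/search", "duration_ms": 95},
    {"trace_id": "b2", "route": "/api/search", "duration_ms": 2400},
    {"trace_id": "c3", "route": "/api/search", "duration_ms": 110},
]
logs = {
    "b2": [json.dumps({"level": "error", "msg": "db connection pool exhausted"})],
}

def drill_down(route: str, threshold_ms: float) -> list:
    """From a latency spike on a route, find slow traces and their logs."""
    slow = [t for t in traces
            if t["route"] == route and t["duration_ms"] > threshold_ms]
    # Correlate: join each slow trace to its logs via the shared trace_id.
    return [(t["trace_id"], logs.get(t["trace_id"], [])) for t in slow]

for trace_id, trace_logs in drill_down("/api/search", 1000):
    print(trace_id, trace_logs)
```

The join key is the trace id, which is exactly why structured logs should carry it: without it, the hop from "this trace was slow" to "these log lines explain why" requires guesswork.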

For AI systems, observability requires additional dimensions: model performance metrics (accuracy, hallucination rates), prompt/completion logging, token usage tracking, embedding quality metrics, and data drift detection. LLM-specific tools like LangSmith and Helicone provide these AI-native capabilities, complementing traditional infrastructure observability.
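A simple sketch of the prompt/completion logging and token-usage tracking dimensions, in plain Python: each LLM call is recorded as a structured event with its usage breakdown. The event schema and `log_llm_call` helper are assumptions for illustration, not the API of any of the tools named above.

```python
import time
import uuid

# In-memory event sink; a real setup would ship these to an observability backend.
llm_events = []

def log_llm_call(model: str, prompt: str, completion: str,
                 prompt_tokens: int, completion_tokens: int) -> dict:
    """Record one LLM call as a structured event with token usage."""
    event = {
        "id": uuid.uuid4().hex,
        "ts": time.time(),
        "model": model,
        "prompt": prompt,           # prompt/completion logging
        "completion": completion,
        "usage": {                  # token usage tracking
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }
    llm_events.append(event)
    return event

ev = log_llm_call("example-model", "Summarize the report.",
                  "The report covers Q3 revenue.", 42, 17)
```

Events like these become the raw material for the other AI-specific dimensions: aggregating `usage` yields cost metrics, and comparing prompt/completion distributions over time supports drift detection.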

Related Terms