Observability

The ability to understand a system's internal state from its external outputs, achieved through the three pillars of metrics, logs, and traces working together to enable effective debugging and monitoring.

Observability goes beyond traditional monitoring. While monitoring tells you when something is broken (alerting on known failure modes), observability lets you investigate why something is broken and discover unknown failure modes. The three pillars are metrics (numerical time-series data like request rates and error counts), logs (structured event records), and traces (request paths through distributed services).
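To make the three pillars concrete, here is a minimal, self-contained sketch in plain Python of how a single request might emit all three signals: a counter metric, a structured log, and a trace id that ties the log back to the request. All names (`handle_request`, the `/checkout` path, the event fields) are illustrative, not taken from any particular platform.

```python
import json
import time
import uuid
from collections import Counter

# Metric: numerical time-series data; here a simple in-process request counter.
request_count = Counter()

def handle_request(path: str) -> dict:
    """Handle one request while emitting all three pillars."""
    trace_id = uuid.uuid4().hex  # Trace: an id that follows this request end to end
    start = time.monotonic()

    request_count[path] += 1     # Metric: increment the request-rate counter

    # ... real request work would happen here ...
    duration_ms = (time.monotonic() - start) * 1000

    # Log: a structured event record, tagged with the trace id so it can
    # later be correlated with the trace it belongs to.
    log_line = json.dumps({
        "event": "request_handled",
        "path": path,
        "trace_id": trace_id,
        "duration_ms": round(duration_ms, 2),
    })
    return {"trace_id": trace_id, "log": log_line}

result = handle_request("/checkout")
```

In a real system the counter would be scraped by a metrics backend, the log line shipped to a log store, and the trace id propagated to downstream services; the key idea is that all three share identifiers.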

Modern observability platforms like Datadog, Grafana Cloud, and Honeycomb correlate data across all three pillars. When a latency spike appears in metrics, you can drill down to the specific traces that were slow, then examine the logs from those requests to identify the root cause, all within a unified interface.
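The drill-down workflow described above can be sketched with toy in-memory data: filter traces for the slow ones on a route, then pull the logs that share each slow trace's id. The data and the `drill_down` helper are hypothetical; a real platform queries its own metric, trace, and log backends.

```python
import json

# Toy telemetry store (hypothetical data): traces with latency,
# and logs keyed by the trace_id they were emitted under.
traces = [
    {"trace_id": "a1", "route": "/api/search", "duration_ms": 95},
    {"trace_id": "b2", "route": "/api/search", "duration_ms": 2400},
    {"trace_id": "c3", "route": "/api/search", "duration_ms": 110},
]
logs = {
    "b2": [json.dumps({"level": "error", "msg": "db connection pool exhausted"})],
}

def drill_down(route: str, threshold_ms: float) -> list:
    """From a latency spike on a route, find slow traces and their logs."""
    slow = [t for t in traces
            if t["route"] == route and t["duration_ms"] > threshold_ms]
    # Correlate: join each slow trace to its logs via the shared trace_id.
    return [(t["trace_id"], logs.get(t["trace_id"], [])) for t in slow]

for trace_id, trace_logs in drill_down("/api/search", 1000):
    print(trace_id, trace_logs)
```

The join key is the trace id, which is exactly why structured logs should carry it: without it, the hop from "this trace was slow" to "these log lines explain why" requires guesswork.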

For AI systems, observability requires additional dimensions: model performance metrics (accuracy, hallucination rates), prompt/completion logging, token usage tracking, embedding quality metrics, and data drift detection. LLM-specific tools like LangSmith and Helicone provide these AI-native capabilities, complementing traditional infrastructure observability.
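A simple sketch of the prompt/completion logging and token-usage tracking dimensions, in plain Python: each LLM call is recorded as a structured event with its usage breakdown. The event schema and `log_llm_call` helper are assumptions for illustration, not the API of any of the tools named above.

```python
import time
import uuid

# In-memory event sink; a real setup would ship these to an observability backend.
llm_events = []

def log_llm_call(model: str, prompt: str, completion: str,
                 prompt_tokens: int, completion_tokens: int) -> dict:
    """Record one LLM call as a structured event with token usage."""
    event = {
        "id": uuid.uuid4().hex,
        "ts": time.time(),
        "model": model,
        "prompt": prompt,           # prompt/completion logging
        "completion": completion,
        "usage": {                  # token usage tracking
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }
    llm_events.append(event)
    return event

ev = log_llm_call("example-model", "Summarize the report.",
                  "The report covers Q3 revenue.", 42, 17)
```

Events like these become the raw material for the other AI-specific dimensions: aggregating `usage` yields cost metrics, and comparing prompt/completion distributions over time supports drift detection.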

Related Terms