Error Budget
The acceptable amount of unreliability allowed for a service within a given period, calculated as one minus the Service Level Objective. Error budgets create a quantitative framework for balancing the competing priorities of reliability and feature velocity.
If a service has a 99.9% availability SLO, the error budget is 0.1% of total time, roughly 43 minutes per month. When the error budget is healthy, the team can take risks: deploy more aggressively, run bolder experiments, and accept some instability. When the error budget is depleted, the team shifts focus to reliability improvements until the budget recovers. This mechanism replaces subjective arguments about reliability with objective, data-driven decisions.
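The arithmetic above can be sketched in a few lines. This is an illustrative helper, not a standard library; the 30-day month is an assumption used to reproduce the "roughly 43 minutes" figure.

```python
def error_budget_minutes(slo: float, period_minutes: int = 30 * 24 * 60) -> float:
    """Error budget = (1 - SLO) * period length.

    slo is expressed as a fraction, e.g. 0.999 for 99.9% availability.
    The default period assumes a 30-day month (43,200 minutes).
    """
    return (1 - slo) * period_minutes

# A 99.9% SLO over a 30-day month leaves about 43.2 minutes of budget.
print(round(error_budget_minutes(0.999), 1))
```

Note that three nines over a month is tight: a single 45-minute outage exhausts the entire budget.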
For AI product teams, error budgets prevent both over-engineering and under-investing in reliability. Without an error budget, reliability-focused engineers might block every deployment for extensive testing, slowing iteration to a crawl. With an error budget, the team can quantify the cost of a risky model deployment: if it burns through the remaining budget, reliability work takes priority next. Growth teams should understand error budgets because aggressive experimentation consumes reliability budget through increased deployment frequency and the inherent risk of untested variations. When the error budget runs low, experiment velocity must decrease until stability is restored. This makes it important to design experiments that are safe to run in production.
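The gating policy described here can be made concrete. The sketch below is a minimal, hypothetical policy check, assuming a simple rule that risky experiments pause once less than 25% of the budget remains; real teams tune thresholds and often use multi-window burn-rate alerts instead.

```python
def budget_remaining(slo: float, observed_availability: float) -> float:
    """Fraction of the period's error budget still unspent.

    Both inputs are fractions, e.g. slo=0.999, observed_availability=0.9995.
    Returns 0.0 when the budget is fully consumed or overspent.
    """
    budget = 1 - slo                       # allowed unreliability
    spent = 1 - observed_availability      # unreliability incurred so far
    return max(0.0, 1 - spent / budget)

def can_run_risky_experiments(slo: float, observed: float,
                              threshold: float = 0.25) -> bool:
    """Illustrative gate: allow risky deploys only while enough budget remains."""
    return budget_remaining(slo, observed) > threshold

# Half the budget spent: experiments may continue.
print(can_run_risky_experiments(0.999, 0.9995))
# Budget overspent: shift to reliability work.
print(can_run_risky_experiments(0.999, 0.9985))
```

The point of encoding the rule is that the deploy/don't-deploy decision becomes mechanical rather than a negotiation.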
Related Terms
Content Delivery Network
A geographically distributed network of proxy servers that caches and delivers content from locations closest to end users. CDNs reduce latency, improve load times, and absorb traffic spikes by serving content from edge nodes rather than a single origin server.
Edge Computing
A distributed computing paradigm that processes data closer to the source of generation rather than in a centralized data center. Edge computing reduces latency, conserves bandwidth, and enables real-time processing for latency-sensitive applications.
Serverless Computing
A cloud execution model where the provider dynamically manages server allocation and scaling. Developers deploy functions or containers without provisioning infrastructure, paying only for actual compute time consumed rather than reserved capacity.
Function as a Service
A serverless computing category where developers deploy individual functions that execute in response to events. FaaS platforms like AWS Lambda, Google Cloud Functions, and Azure Functions handle all infrastructure management, scaling each function independently.
Platform as a Service
A cloud computing model that provides a complete development and deployment environment without managing underlying infrastructure. PaaS offerings like Heroku, Vercel, and Google App Engine handle servers, storage, networking, and runtime configuration.
Infrastructure as a Service
A cloud computing model that provides virtualized computing resources over the internet. IaaS offerings like AWS EC2, Google Compute Engine, and Azure Virtual Machines give teams full control over servers, storage, and networking without owning physical hardware.