Site Reliability Engineering
A discipline that applies software engineering principles to infrastructure and operations work. SRE, pioneered by Google, treats operations as a software problem, using automation, monitoring, and error budgets to maintain reliable systems at scale.
SRE bridges the traditional gap between development and operations by having engineers who write software to manage production systems. Key practices include defining Service Level Objectives that quantify reliability targets, using error budgets to balance reliability with feature velocity, automating repetitive operational tasks, and conducting blameless postmortems to learn from incidents.
For AI product teams, SRE practices are essential because AI systems have unique reliability challenges: model drift causes silent degradation, data pipeline failures produce stale features, and non-deterministic model behavior complicates incident diagnosis. SRE teams define SLOs for AI services that cover both availability and quality, for example: "the recommendation service must return results within 200 ms for 99.5% of requests, with relevance scores above a set threshold." Growth teams benefit from SRE practices because reliable infrastructure is the foundation of trustworthy experiment results. If production instability introduces noise into experiment metrics, it becomes impossible to detect real treatment effects, wasting both user traffic and engineering time.
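The error-budget idea above is simple arithmetic: the budget is the complement of the SLO target, and consuming it gates feature velocity. A minimal sketch, using hypothetical request counts:

```python
# Error-budget math for a 99.5% availability SLO (illustrative numbers).
SLO_TARGET = 0.995            # fraction of requests that must succeed
WINDOW_REQUESTS = 10_000_000  # requests served in the 30-day SLO window

# The error budget is the complement of the SLO target.
error_budget = 1 - SLO_TARGET
allowed_failures = int(WINDOW_REQUESTS * error_budget)

# Observed failures so far in the window (hypothetical telemetry value).
observed_failures = 32_000
budget_consumed = observed_failures / allowed_failures

print(f"Allowed failures this window: {allowed_failures}")  # 50000
print(f"Error budget consumed: {budget_consumed:.0%}")      # 64%
```

When the budget is nearly spent, an SRE team would typically slow or freeze risky releases until reliability recovers; while budget remains, teams are free to ship.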
Related Terms
Content Delivery Network
A geographically distributed network of proxy servers that caches and delivers content from locations closest to end users. CDNs reduce latency, improve load times, and absorb traffic spikes by serving content from edge nodes rather than a single origin server.
Edge Computing
A distributed computing paradigm that processes data closer to the source of generation rather than in a centralized data center. Edge computing reduces latency, conserves bandwidth, and enables real-time processing for latency-sensitive applications.
Serverless Computing
A cloud execution model where the provider dynamically manages server allocation and scaling. Developers deploy functions or containers without provisioning infrastructure, paying only for actual compute time consumed rather than reserved capacity.
Function as a Service
A serverless computing category where developers deploy individual functions that execute in response to events. FaaS platforms like AWS Lambda, Google Cloud Functions, and Azure Functions handle all infrastructure management, scaling each function independently.
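The unit of deployment in FaaS is a single function invoked per event. A minimal sketch in the AWS Lambda style, where the event shape shown is a hypothetical API Gateway proxy payload:

```python
import json

# AWS Lambda-style handler: the platform invokes this once per event and
# scales instances independently; no server provisioning is required.
def handler(event, context):
    # Parse a hypothetical JSON request body; default when absent.
    body = json.loads(event.get("body") or "{}")
    name = body.get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"Hello, {name}"}),
    }
```

Locally the same function is just callable code, e.g. `handler({"body": '{"name": "SRE"}'}, None)`, which is what makes FaaS units easy to unit-test before deployment.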
Platform as a Service
A cloud computing model that provides a complete development and deployment environment without managing underlying infrastructure. PaaS offerings like Heroku, Vercel, and Google App Engine handle servers, storage, networking, and runtime configuration.
Infrastructure as a Service
A cloud computing model that provides virtualized computing resources over the internet. IaaS offerings like AWS EC2, Google Compute Engine, and Azure Virtual Machines give teams full control over servers, storage, and networking without owning physical hardware.