Infrastructure & DevOps Glossary

Cloud infrastructure, CDNs, serverless, containers, caching, databases, and the operational foundations for scalable systems.

API Gateway

A server that acts as the single entry point for all API requests, handling routing, authentication, rate limiting, and request transformation. API gateways decouple client applications from the internal microservice topology and centralize cross-cutting concerns.
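The gateway's core loop can be sketched in a few lines. This is a hypothetical illustration, not any particular product's API: a route table plus two cross-cutting checks (auth and rate limiting) applied before forwarding. All names and the upstream URLs are made up.

```python
# Toy API gateway: match a route, enforce auth and a rate limit, then
# return the upstream service a real gateway would proxy to.

ROUTES = {
    "/users": "http://user-service.internal",
    "/orders": "http://order-service.internal",
}

VALID_TOKENS = {"secret-token"}   # stand-in for a real auth check
request_counts = {}               # per-client request counter
RATE_LIMIT = 100                  # requests allowed per window

def handle(path: str, token: str, client_id: str):
    """Return (status, upstream-or-error) for one incoming request."""
    if token not in VALID_TOKENS:
        return (401, "unauthorized")
    request_counts[client_id] = request_counts.get(client_id, 0) + 1
    if request_counts[client_id] > RATE_LIMIT:
        return (429, "rate limited")
    upstream = ROUTES.get(path)
    if upstream is None:
        return (404, "no route")
    return (200, upstream)        # the gateway would proxy the request here
```

Because these checks live in one place, individual services behind the gateway don't need to reimplement them.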

Application Performance Monitoring

The practice of measuring and analyzing application behavior from the end-user perspective, tracking response times, error rates, throughput, and transaction traces. APM tools provide visibility into code-level performance issues that infrastructure monitoring cannot detect.

Auto-Scaling

The automatic adjustment of compute resources based on real-time demand metrics. Auto-scaling adds instances when traffic increases and removes them when demand drops, maintaining performance during peaks while minimizing costs during quiet periods.
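The core of a scaling policy is a small calculation. The sketch below uses the same shape of formula as Kubernetes' Horizontal Pod Autoscaler (scale proportionally to how far the observed metric is from its target), with illustrative bounds:

```python
import math

def desired_instances(current: int, metric: float, target: float,
                      min_n: int = 1, max_n: int = 20) -> int:
    # Scale the fleet proportionally to metric pressure, clamped to
    # configured minimum and maximum sizes.
    desired = math.ceil(current * metric / target)
    return max(min_n, min(max_n, desired))
```

For example, 4 instances at 90% CPU against a 60% target yields 6 instances; the same fleet at 30% CPU shrinks to 2.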

Backup Strategy

A comprehensive plan for creating, storing, verifying, and restoring copies of data to protect against loss from hardware failure, software bugs, human error, or security breaches. An effective backup strategy defines backup frequency, retention periods, and storage locations.

Blue-Green Deployment

A deployment strategy that maintains two identical production environments, blue and green. One environment serves live traffic while the other receives the new deployment. Traffic is switched atomically once the new version is verified, enabling instant rollback.

Cache Invalidation

The process of removing or updating stale data from caches when the underlying source data changes. Cache invalidation is notoriously difficult because it requires knowing exactly when cached data becomes stale across distributed systems.
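The simplest correct approach is explicit invalidation on write, sketched below with plain dictionaries standing in for a database and a cache:

```python
# Invalidate-on-write: every write to the source of truth deletes the
# corresponding cache entry, so the next read repopulates it fresh.

db = {"user:1": {"name": "Ada"}}   # stand-in for the source of truth
cache = {}

def read(key):
    if key not in cache:
        cache[key] = db[key]        # cache miss: load from the database
    return cache[key]

def write(key, value):
    db[key] = value
    cache.pop(key, None)            # drop the stale entry immediately
```

The hard part this sketch hides is distribution: with many cache nodes and many writers, "delete the entry" becomes a coordination problem.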

Caching Strategy

A systematic approach to storing frequently accessed data in fast-access storage layers to reduce latency and backend load. Effective caching strategies define what to cache, where to cache it, how long to keep it, and when to invalidate stale entries.

Capacity Planning

The process of determining the computing resources needed to meet current and future demand while balancing performance, cost, and reliability. Capacity planning uses traffic projections, load testing, and resource utilization data to make informed infrastructure decisions.

Connection Pooling

A technique that maintains a pool of reusable database or network connections rather than creating and destroying connections for each request. Connection pooling reduces the overhead of connection establishment and improves response times for database-heavy applications.
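A toy pool makes the reuse visible. In this sketch `Connection` is a stand-in for a real database driver handle; the point is that ten requests share two connections rather than opening ten:

```python
import queue

class Connection:
    opened = 0
    def __init__(self):
        Connection.opened += 1      # count how many real connections exist

class Pool:
    def __init__(self, size: int):
        self._free = queue.Queue()
        for _ in range(size):       # open the connections up front
            self._free.put(Connection())

    def acquire(self) -> Connection:
        return self._free.get()     # blocks when the pool is exhausted

    def release(self, conn: Connection) -> None:
        self._free.put(conn)        # return the connection for reuse

pool = Pool(size=2)
for _ in range(10):                 # ten "requests" recycle two connections
    conn = pool.acquire()
    pool.release(conn)
```

Production pools (HikariCP, pgbouncer, SQLAlchemy's pool) add health checks, timeouts, and connection recycling on top of this basic shape.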

Container Orchestration

The automated management of containerized applications across a cluster of machines, handling deployment, scaling, networking, and health monitoring. Kubernetes is the dominant orchestration platform, providing declarative configuration for complex distributed systems.

Content Delivery Network

A geographically distributed network of proxy servers that caches and delivers content from locations closest to end users. CDNs reduce latency, improve load times, and absorb traffic spikes by serving content from edge nodes rather than a single origin server.

Cost Optimization

The ongoing practice of reducing infrastructure spending while maintaining required performance and reliability levels. Cost optimization involves right-sizing resources, leveraging pricing models, eliminating waste, and aligning spending with business value.

CQRS

Command Query Responsibility Segregation is an architectural pattern that separates read and write operations into distinct models. Write operations use command models optimized for validation and business logic, while read operations use query models optimized for data retrieval.
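A minimal sketch of the split, with an illustrative order-placement command, a denormalized read projection, and a query that never touches the write model:

```python
write_model = {}   # order_id -> order dict (validated command side)
read_model = {}    # order_id -> display string (query-optimized projection)

def place_order(order_id: str, total: float):
    if total <= 0:
        raise ValueError("total must be positive")   # command-side validation
    write_model[order_id] = {"total": total, "status": "placed"}
    _project(order_id)              # real systems often update this asynchronously

def _project(order_id: str):
    order = write_model[order_id]
    read_model[order_id] = (
        f"Order {order_id}: ${order['total']:.2f} ({order['status']})"
    )

def get_order_summary(order_id: str) -> str:
    return read_model[order_id]     # query side reads the projection only
```

Because the projection step is the only bridge between the two models, each side can use storage and schemas tuned to its own workload.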

Database Indexing

The creation of data structures that speed up data retrieval operations by providing efficient lookup paths to rows matching specific query conditions. Indexes trade additional storage space and slower write performance for dramatically faster read queries.
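The effect is easy to see with SQLite, which ships in Python's standard library. After creating an index on the queried column, `EXPLAIN QUERY PLAN` reports an index search instead of a full table scan (table and index names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [(f"user{i}@example.com",) for i in range(1000)])
conn.execute("CREATE INDEX idx_users_email ON users (email)")

# Ask the planner how it would execute an email lookup.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users WHERE email = ?",
    ("user500@example.com",),
).fetchall()
plan_text = " ".join(str(row) for row in plan)
```

The plan text names `idx_users_email`, confirming the lookup path; dropping the index would make the planner fall back to scanning all 1000 rows.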

Database Migration

The process of transforming a database schema or moving data between databases in a controlled, versioned manner. Migration tools track which changes have been applied, enabling reproducible database evolution across development, staging, and production environments.
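The bookkeeping can be sketched in a few lines: an ordered list of versioned migrations plus a table recording which versions have run, so the migrator is safe to run repeatedly. This is a simplified illustration of what tools like Flyway or Alembic do:

```python
import sqlite3

MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE users ADD COLUMN email TEXT"),
]

def migrate(conn: sqlite3.Connection) -> list:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER PRIMARY KEY)"
    )
    applied = {v for (v,) in conn.execute("SELECT version FROM schema_version")}
    ran = []
    for version, sql in MIGRATIONS:
        if version in applied:
            continue                # already applied on a previous run
        conn.execute(sql)
        conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
        ran.append(version)
    conn.commit()
    return ran

conn = sqlite3.connect(":memory:")
first_run = migrate(conn)           # applies versions 1 and 2
second_run = migrate(conn)          # idempotent: nothing left to apply
```

Because the version table travels with the database, every environment converges on the same schema no matter where it started.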

Database Replication

The process of copying data from one database server to one or more replicas to improve read performance, provide geographic distribution, and ensure data durability through redundancy. Replication can be synchronous or asynchronous.

Disaster Recovery

The set of policies, tools, and procedures designed to restore critical systems and data after a catastrophic failure. Disaster recovery planning defines Recovery Time Objectives and Recovery Point Objectives that determine acceptable downtime and data loss.

DNS

The Domain Name System translates human-readable domain names into IP addresses that computers use to route network traffic. DNS is a hierarchical, distributed naming system that underpins virtually all internet communication and is a critical factor in application performance.

Edge Computing

A distributed computing paradigm that processes data closer to the source of generation rather than in a centralized data center. Edge computing reduces latency, conserves bandwidth, and enables real-time processing for latency-sensitive applications.

Error Budget

The acceptable amount of unreliability allowed for a service within a given period, calculated as one minus the Service Level Objective. Error budgets create a quantitative framework for balancing the competing priorities of reliability and feature velocity.
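The arithmetic is simple: a 99.9% SLO over a 30-day window leaves a budget of 0.1% of that window, about 43 minutes of allowed downtime.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    # Budget = (1 - SLO) fraction of the window, expressed in minutes.
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes

budget = error_budget_minutes(0.999)   # ~43.2 minutes per 30-day window
```

When the budget is spent, teams typically pause risky releases until reliability recovers; when it is healthy, they can ship faster.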

Event Sourcing

An architectural pattern that stores the full history of state changes as an immutable sequence of events rather than only the current state. The current state is derived by replaying events, providing a complete audit trail and enabling temporal queries.
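A bank account is the classic illustration: the store holds only deposit and withdrawal events, and the balance is always derived by replay, never stored directly. A minimal sketch:

```python
events = []                          # append-only event log

def deposit(amount: float):
    events.append(("deposited", amount))

def withdraw(amount: float):
    events.append(("withdrew", amount))

def balance() -> float:
    total = 0.0
    for kind, amount in events:      # replay the full history
        total += amount if kind == "deposited" else -amount
    return total

deposit(100.0)
withdraw(30.0)
deposit(5.0)
```

Because the log is immutable, it doubles as an audit trail, and replaying a prefix of it answers "what was the balance last Tuesday?" for free. Real systems add snapshots so replay doesn't grow unboundedly.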

Function as a Service

A serverless computing category where developers deploy individual functions that execute in response to events. FaaS platforms like AWS Lambda, Google Cloud Functions, and Azure Functions handle all infrastructure management, scaling each function independently.

Horizontal Scaling

The practice of increasing capacity by adding more machines to a system rather than upgrading existing ones. Horizontal scaling distributes load across multiple instances, providing better fault tolerance and theoretically unlimited growth potential.

HTTP/2

A major revision of the HTTP protocol that improves performance through multiplexing, header compression, server push, and stream prioritization. HTTP/2 enables multiple concurrent requests over a single TCP connection, eliminating head-of-line blocking at the application layer.

HTTP/3

The latest version of the HTTP protocol that replaces TCP with QUIC as the transport layer. HTTP/3 eliminates TCP head-of-line blocking, reduces connection establishment latency, and provides built-in encryption for improved performance on unreliable networks.

Hybrid Cloud

An architecture that combines on-premises data center infrastructure with public cloud services, connected through networking and orchestration. Hybrid cloud allows organizations to keep sensitive data on-premises while leveraging cloud scalability for other workloads.

Infrastructure as a Service

A cloud computing model that provides virtualized computing resources over the internet. IaaS offerings like AWS EC2, Google Compute Engine, and Azure Virtual Machines give teams full control over servers, storage, and networking without owning physical hardware.

Infrastructure Monitoring

The practice of tracking the health, performance, and availability of computing resources including servers, networks, databases, and cloud services. Infrastructure monitoring provides the foundation for understanding whether the systems supporting applications are functioning correctly.

Load Balancing

The process of distributing incoming network traffic across multiple servers to ensure no single server becomes overwhelmed. Load balancers improve application availability, reliability, and responsiveness by spreading requests evenly across healthy backend instances.
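Round-robin is the simplest distribution algorithm. The sketch below deals requests out to backends in rotation and skips instances a health check has marked down (backend names are illustrative):

```python
import itertools

backends = ["app-1", "app-2", "app-3"]
healthy = {"app-1": True, "app-2": True, "app-3": True}
_cycle = itertools.cycle(backends)

def pick_backend() -> str:
    for _ in range(len(backends)):   # try each backend at most once
        candidate = next(_cycle)
        if healthy[candidate]:
            return candidate
    raise RuntimeError("no healthy backends")

first = [pick_backend() for _ in range(3)]
healthy["app-2"] = False             # health check marks app-2 down
after_failure = [pick_backend() for _ in range(2)]
```

Real load balancers layer smarter policies on this (least-connections, weighted, latency-aware), but the skip-unhealthy loop is the essential availability mechanism.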

Log Aggregation

The practice of collecting, centralizing, and indexing log data from multiple sources into a unified system for search, analysis, and visualization. Log aggregation tools like the ELK stack, Datadog, and Grafana Loki enable teams to troubleshoot issues across distributed systems.

Monitoring and Alerting

The practice of continuously observing system health through metrics, logs, and traces, and automatically notifying the team when predefined thresholds are breached. Effective monitoring provides real-time visibility into system behavior and enables rapid incident response.

Multi-Cloud

An architecture strategy that uses services from multiple cloud providers to avoid vendor lock-in, leverage best-of-breed capabilities, and improve resilience. Multi-cloud deployments distribute workloads across providers like AWS, Google Cloud, and Azure.

Multi-Region Deployment

An architecture pattern that deploys application instances across multiple geographic regions to reduce latency for global users, improve availability through geographic redundancy, and comply with data residency requirements.

Network Security

The practices and technologies that protect network infrastructure, data in transit, and connected systems from unauthorized access, misuse, and attacks. Network security encompasses firewalls, intrusion detection, access controls, encryption, and segmentation.

Object Storage

A storage architecture that manages data as discrete objects in a flat namespace rather than as files in a hierarchical directory. Object storage services like Amazon S3 provide virtually unlimited scalability, high durability, and cost-effective storage for large data volumes.

Platform as a Service

A cloud computing model that provides a complete development and deployment environment without managing underlying infrastructure. PaaS offerings like Heroku, Vercel, and Google App Engine handle servers, storage, networking, and runtime configuration.

QUIC

A transport protocol originally developed by Google that provides multiplexed connections over UDP with built-in TLS encryption. QUIC eliminates head-of-line blocking, supports connection migration across network changes, and reduces connection establishment latency.

Redis

An open-source, in-memory data structure store used as a cache, message broker, and database. Redis supports strings, hashes, lists, sets, sorted sets, and streams, providing sub-millisecond latency for read and write operations.

Reserved Instances

Cloud compute capacity purchased at discounted rates in exchange for a commitment to use specific instance types for one to three years. Reserved instances provide 30-75% savings over on-demand pricing for predictable, steady-state workloads.

Saga Pattern

A pattern for managing distributed transactions across multiple microservices by breaking them into a sequence of local transactions, each with a compensating action for rollback. Sagas maintain data consistency without requiring distributed locks.
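A sketch of an order saga makes the compensation flow concrete. The services and steps here are illustrative; the payment step is rigged to fail so the completed steps undo in reverse order:

```python
log = []

def reserve_inventory():  log.append("inventory reserved")
def release_inventory():  log.append("inventory released")
def charge_payment():     raise RuntimeError("card declined")
def refund_payment():     log.append("payment refunded")
def ship_order():         log.append("order shipped")

STEPS = [
    (reserve_inventory, release_inventory),
    (charge_payment, refund_payment),
    (ship_order, None),              # last step needs no compensation here
]

def run_saga() -> bool:
    done = []
    for action, compensate in STEPS:
        try:
            action()
            done.append(compensate)
        except Exception:
            for comp in reversed(done):   # undo completed steps in reverse
                if comp:
                    comp()
            return False
    return True

succeeded = run_saga()
```

Note that only the inventory reservation is compensated: the payment never completed, so there is nothing to refund, and shipping never ran.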

Serverless Computing

A cloud execution model where the provider dynamically manages server allocation and scaling. Developers deploy functions or containers without provisioning infrastructure, paying only for actual compute time consumed rather than reserved capacity.

Service Mesh

A dedicated infrastructure layer that handles service-to-service communication within a microservices architecture. It provides observability, traffic management, and security features like mutual TLS without requiring changes to application code.

Site Reliability Engineering

A discipline that applies software engineering principles to infrastructure and operations work. SRE, pioneered by Google, treats operations as a software problem, using automation, monitoring, and error budgets to maintain reliable systems at scale.

Spot Instances

Cloud compute instances available at steep discounts, typically 60-90% off on-demand pricing, in exchange for the possibility that the cloud provider can reclaim them with short notice when capacity is needed. Spot instances are ideal for fault-tolerant and flexible workloads.

TLS

Transport Layer Security is a cryptographic protocol that provides secure communication over networks by encrypting data in transit, authenticating server identity, and ensuring data integrity. TLS is the standard security layer for HTTPS, email, and API communication.

Toil Reduction

The systematic elimination of manual, repetitive, automatable operational work that scales linearly with service growth. Toil reduction is a core SRE practice that frees engineering time for high-value improvements by replacing recurring human labor with software.

Vertical Scaling

The practice of increasing capacity by adding more resources like CPU, memory, or GPU to an existing machine rather than adding more machines. Vertical scaling is simpler to implement but has physical limits and creates single points of failure.

WAF

A Web Application Firewall monitors, filters, and blocks HTTP traffic between the internet and a web application based on security rules. WAFs protect against common web attacks including SQL injection, cross-site scripting, and API abuse by inspecting request content.

Webhook

A mechanism for one application to send real-time notifications to another via HTTP POST requests when specific events occur. Unlike polling, webhooks push data as soon as events happen, enabling event-driven integrations between systems.
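Because webhook endpoints are publicly reachable, many providers sign each delivery with a shared secret so the receiver can verify authenticity. The sketch below shows the common HMAC-SHA256 scheme; the secret and payload are illustrative:

```python
import hashlib
import hmac
import json

SECRET = b"shhh-shared-secret"       # shared out-of-band with the provider

def sign(payload: bytes) -> str:
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()

def verify(payload: bytes, signature: str) -> bool:
    expected = sign(payload)
    # compare_digest is constant-time, resisting timing attacks
    return hmac.compare_digest(expected, signature)

body = json.dumps({"event": "order.paid", "id": 42}).encode()
good = verify(body, sign(body))      # authentic delivery
bad = verify(body, "deadbeef")       # forged or corrupted delivery
```

Verification must use the raw request bytes; re-serializing the parsed JSON can change whitespace or key order and break the signature.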

Zero-Downtime Deployment

A deployment strategy that updates production systems without any period of unavailability. Zero-downtime deployments use techniques like rolling updates, blue-green switching, or canary releases to transition traffic between versions seamlessly.