Load Balancing

The process of distributing incoming network traffic across multiple servers to ensure no single server becomes overwhelmed. Load balancers improve application availability, reliability, and responsiveness by spreading requests evenly across healthy backend instances.

Load balancers operate at different layers of the network stack. Layer 4 balancers distribute traffic based on IP address and port, while Layer 7 balancers can make routing decisions based on HTTP headers, URLs, and content type. Common algorithms include round-robin, least connections, weighted distribution, and IP hash. Health checks continuously verify backend availability, automatically removing unhealthy instances from the rotation.
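The least-connections algorithm and health-check rotation described above can be sketched in a few lines. This is a minimal illustration, not a real load balancer's API; the class and method names are invented for the example:

```python
class LeastConnectionsBalancer:
    """Sketch of least-connections selection over healthy backends."""

    def __init__(self, backends):
        # active connection count per backend
        self.connections = {b: 0 for b in backends}
        self.healthy = set(backends)

    def mark_unhealthy(self, backend):
        # a failed health check removes the instance from rotation
        self.healthy.discard(backend)

    def mark_healthy(self, backend):
        # a passing health check restores it
        self.healthy.add(backend)

    def acquire(self):
        # route to the healthy backend with the fewest active connections
        candidates = [b for b in self.connections if b in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy backends available")
        choice = min(candidates, key=lambda b: self.connections[b])
        self.connections[choice] += 1
        return choice

    def release(self, backend):
        # call when the request completes
        self.connections[backend] -= 1
```

Round-robin would simply cycle through `candidates` in order; least-connections instead accounts for requests still in flight, which matters when request durations vary widely.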

For AI product teams, load balancing is particularly important because model inference workloads can be computationally intensive and variable in duration. A complex query might take 10 times longer than a simple one, making naive round-robin distribution inefficient. Least-connections or weighted algorithms help distribute work more evenly across GPU-equipped inference servers. Growth teams should monitor load balancer metrics to identify when AI endpoints become bottlenecks during traffic spikes, since slow inference directly impacts user experience. Advanced load balancing strategies like request queuing with backpressure prevent cascading failures when AI services are temporarily overwhelmed.

Related Terms

Content Delivery Network

A geographically distributed network of proxy servers that caches and delivers content from locations closest to end users. CDNs reduce latency, improve load times, and absorb traffic spikes by serving content from edge nodes rather than a single origin server.

Edge Computing

A distributed computing paradigm that processes data closer to the source of generation rather than in a centralized data center. Edge computing reduces latency, conserves bandwidth, and enables real-time processing for latency-sensitive applications.

Serverless Computing

A cloud execution model where the provider dynamically manages server allocation and scaling. Developers deploy functions or containers without provisioning infrastructure, paying only for actual compute time consumed rather than reserved capacity.

Function as a Service

A serverless computing category where developers deploy individual functions that execute in response to events. FaaS platforms like AWS Lambda, Google Cloud Functions, and Azure Functions handle all infrastructure management, scaling each function independently.

Platform as a Service

A cloud computing model that provides a complete development and deployment environment without managing underlying infrastructure. PaaS offerings like Heroku, Vercel, and Google App Engine handle servers, storage, networking, and runtime configuration.

Infrastructure as a Service

A cloud computing model that provides virtualized computing resources over the internet. IaaS offerings like AWS EC2, Google Compute Engine, and Azure Virtual Machines give teams full control over servers, storage, and networking without owning physical hardware.