Auto-Scaling

The automatic adjustment of compute resources based on real-time demand metrics. Auto-scaling adds instances when traffic increases and removes them when demand drops, maintaining performance during peaks while minimizing costs during quiet periods.

Auto-scaling policies define when and how to scale based on metrics like CPU utilization, request queue depth, memory usage, or custom application metrics. Scaling can be reactive, responding to current conditions, or predictive, anticipating demand based on historical patterns. Effective auto-scaling requires correctly identifying the bottleneck metric and setting appropriate thresholds and cooldown periods to prevent oscillation.
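The interaction between thresholds and cooldown periods can be sketched in a few lines of Python. The names and numbers below are illustrative, not any cloud provider's API: a target-tracking policy resizes the fleet toward a target utilization, and a cooldown suppresses further changes long enough for the last one to take effect, which is what prevents oscillation.

```python
import math

def desired_capacity(current, utilization, *,
                     target=0.60, min_instances=2, max_instances=20):
    # Target tracking: pick the fleet size that would bring the
    # observed utilization back to the target level, clamped to bounds.
    needed = math.ceil(current * utilization / target)
    return max(min_instances, min(max_instances, needed))

class Scaler:
    """Applies the policy but enforces a cooldown between scaling
    actions so a noisy metric cannot make the fleet oscillate."""

    def __init__(self, capacity, cooldown_s=300):
        self.capacity = capacity
        self.cooldown_s = cooldown_s
        self.last_change = float("-inf")  # no scaling action yet

    def step(self, utilization, now):
        want = desired_capacity(self.capacity, utilization)
        if want != self.capacity and now - self.last_change >= self.cooldown_s:
            self.capacity = want
            self.last_change = now
        return self.capacity
```

In production systems the same structure usually appears as separate scale-out and scale-in cooldowns, since removing capacity can safely wait longer than adding it.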

For AI product teams, auto-scaling is essential but challenging because AI inference workloads have unique scaling characteristics. GPU instances take longer to provision than CPU instances, model loading adds startup time, and inference latency may not correlate linearly with CPU or memory utilization. Teams often use custom metrics like inference queue depth or p99 latency as scaling triggers.

Growth teams should understand auto-scaling behavior because growth campaigns, product launches, and viral moments can generate sudden traffic spikes that test scaling limits. Pre-warming inference capacity before predictable traffic events like marketing campaigns prevents degraded AI feature performance during the moments when user first impressions matter most.
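Pre-warming can be as simple as raising the auto-scaler's minimum-capacity floor during a scheduled window, so GPU instances are already provisioned and models already loaded before traffic arrives. The schedule and function below are a hypothetical sketch, not a provider's API:

```python
from datetime import datetime, timezone

# Hypothetical schedule: traffic events (e.g. marketing campaigns)
# mapped to a pre-warmed minimum instance count for their window.
SCHEDULED_FLOORS = [
    # (window start, window end, minimum instances)
    (datetime(2025, 6, 1, 14, 0, tzinfo=timezone.utc),
     datetime(2025, 6, 1, 20, 0, tzinfo=timezone.utc), 12),
]

def minimum_capacity(now, default_min=2):
    """Return the capacity floor in effect at `now`. During a
    scheduled event the floor is raised above the everyday minimum."""
    for start, end, floor in SCHEDULED_FLOORS:
        if start <= now < end:
            return max(default_min, floor)
    return default_min
```

The reactive policy then scales above this floor as usual; the floor only guarantees that the slow parts of provisioning have already happened when the spike lands.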

Related Terms

Content Delivery Network

A geographically distributed network of proxy servers that caches and delivers content from locations closest to end users. CDNs reduce latency, improve load times, and absorb traffic spikes by serving content from edge nodes rather than a single origin server.

Edge Computing

A distributed computing paradigm that processes data closer to the source of generation rather than in a centralized data center. Edge computing reduces latency, conserves bandwidth, and enables real-time processing for latency-sensitive applications.

Serverless Computing

A cloud execution model where the provider dynamically manages server allocation and scaling. Developers deploy functions or containers without provisioning infrastructure, paying only for actual compute time consumed rather than reserved capacity.

Function as a Service

A serverless computing category where developers deploy individual functions that execute in response to events. FaaS platforms like AWS Lambda, Google Cloud Functions, and Azure Functions handle all infrastructure management, scaling each function independently.
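As an illustration, a minimal Python function in the AWS Lambda handler style. The event shape here is an assumption for the example; the platform invokes the handler once per event and scales the number of concurrent executions itself:

```python
import json

def handler(event, context):
    """Minimal Lambda-style function: invoked per event, scaled by
    the platform, with no servers for the developer to manage."""
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}"}),
    }
```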

Platform as a Service

A cloud computing model that provides a complete development and deployment environment without managing underlying infrastructure. PaaS offerings like Heroku, Vercel, and Google App Engine handle servers, storage, networking, and runtime configuration.

Infrastructure as a Service

A cloud computing model that provides virtualized computing resources over the internet. IaaS offerings like AWS EC2, Google Compute Engine, and Azure Virtual Machines give teams full control over servers, storage, and networking without owning physical hardware.