Infrastructure Monitoring

The practice of tracking the health, performance, and availability of computing resources, including servers, networks, databases, and cloud services. Infrastructure monitoring provides the foundation for understanding whether the systems supporting applications are functioning correctly.

Infrastructure monitoring tracks low-level system metrics: CPU utilization, memory usage, disk I/O, network throughput, and process health. It also monitors cloud service health, container orchestration state, and network connectivity. Tools like Prometheus, Nagios, Datadog, and CloudWatch collect metrics from infrastructure components and trigger alerts when resources approach limits or exhibit anomalous behavior.
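The alerting behavior these tools share can be sketched as a simple threshold check over collected metrics. This is a minimal illustration, not the implementation of any particular tool; the metric names and limits are illustrative assumptions.

```python
# Minimal sketch of threshold-based alerting, in the spirit of tools like
# Prometheus alert rules or Nagios checks. Metric names and limits are
# illustrative assumptions, not a real tool's configuration.

WARN_LIMITS = {"cpu_percent": 80.0, "memory_percent": 85.0, "disk_percent": 90.0}

def check_thresholds(metrics, limits=WARN_LIMITS):
    """Return alert strings for every metric at or above its limit."""
    alerts = []
    for name, value in metrics.items():
        limit = limits.get(name)
        if limit is not None and value >= limit:
            alerts.append(f"ALERT {name}={value:.1f} (limit {limit:.1f})")
    return alerts

sample = {"cpu_percent": 92.5, "memory_percent": 60.0, "disk_percent": 91.0}
print(check_thresholds(sample))
```

In production systems the same idea is usually expressed declaratively (for example as Prometheus alerting rules) rather than in application code, so that thresholds can change without redeploying software.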

For AI product teams, infrastructure monitoring must include GPU utilization, VRAM usage, model loading times, and inference queue depths alongside standard metrics. Under-utilized GPUs represent wasted spend, while saturated GPUs indicate capacity constraints that will degrade user experience.

Growth teams indirectly depend on infrastructure monitoring because degraded infrastructure silently affects experiment results: if the recommendation service is slow due to CPU saturation during a test, the experiment measures the impact of latency rather than the feature itself. Infrastructure monitoring data also feeds capacity planning decisions, helping teams predict when scaling events are needed and budget accordingly for GPU and compute costs.
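The GPU reasoning above, where low utilization wastes spend and high utilization signals capacity limits, can be sketched as a simple classification over fleet readings. The thresholds, GPU names, and sample values here are illustrative assumptions.

```python
# Hypothetical sketch of GPU fleet triage: flag GPUs that waste spend
# (under-utilized) or signal capacity constraints (saturated).
# Thresholds and sample readings are illustrative assumptions.

def classify_gpu(util_percent, low=30.0, high=90.0):
    """Bucket a GPU utilization reading for capacity-planning review."""
    if util_percent < low:
        return "under-utilized"   # paying for idle capacity
    if util_percent > high:
        return "saturated"        # queueing and latency degradation likely
    return "healthy"

fleet = {"gpu-0": 12.0, "gpu-1": 71.0, "gpu-2": 97.5}
for name, util in fleet.items():
    print(name, classify_gpu(util))
```

In practice the utilization readings would come from a collector such as NVIDIA's DCGM exporter feeding a metrics backend, and the classification would be averaged over a window rather than applied to single samples.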

Related Terms

Content Delivery Network

A geographically distributed network of proxy servers that caches and delivers content from locations closest to end users. CDNs reduce latency, improve load times, and absorb traffic spikes by serving content from edge nodes rather than a single origin server.

Edge Computing

A distributed computing paradigm that processes data closer to the source of generation rather than in a centralized data center. Edge computing reduces latency, conserves bandwidth, and enables real-time processing for latency-sensitive applications.

Serverless Computing

A cloud execution model where the provider dynamically manages server allocation and scaling. Developers deploy functions or containers without provisioning infrastructure, paying only for actual compute time consumed rather than reserved capacity.

Function as a Service

A serverless computing category where developers deploy individual functions that execute in response to events. FaaS platforms like AWS Lambda, Google Cloud Functions, and Azure Functions handle all infrastructure management, scaling each function independently.

Platform as a Service

A cloud computing model that provides a complete development and deployment environment without managing underlying infrastructure. PaaS offerings like Heroku, Vercel, and Google App Engine handle servers, storage, networking, and runtime configuration.

Infrastructure as a Service

A cloud computing model that provides virtualized computing resources over the internet. IaaS offerings like AWS EC2, Google Compute Engine, and Azure Virtual Machines give teams full control over servers, storage, and networking without owning physical hardware.