Chaos Engineering

Chaos engineering proactively tests resilience by introducing controlled failures: killing random servers, injecting network latency, simulating database outages, or exhausting disk space. The goal is to discover vulnerabilities before they cause real incidents, and to verify that fallback mechanisms actually work under pressure.

Netflix pioneered the practice with Chaos Monkey, a tool that randomly terminates production instances, and later expanded it into the Simian Army, a broader suite of failure-injection tools. Modern tools such as Gremlin, Litmus, and AWS Fault Injection Simulator make chaos experiments accessible to any team. Experiments should start small (one instance, one availability zone) and expand as confidence grows.
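
One way to keep an experiment small is to cap how many targets it may touch before any fault is injected. The sketch below is illustrative only: the `Instance` record, the `pick_targets` helper, and the rule of always leaving at least one instance per zone untouched are assumptions made for this example, not the behavior of Chaos Monkey or any tool named above.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    # Hypothetical inventory record; a real fleet would come from your cloud API.
    instance_id: str
    zone: str


def pick_targets(instances, zone, max_targets=1, rng=None):
    """Pick at most `max_targets` instances, all from a single zone."""
    if rng is None:
        rng = random.Random()
    candidates = [i for i in instances if i.zone == zone]
    # Assumed safety rule: never select every instance in the zone,
    # so some capacity always survives the experiment.
    limit = min(max_targets, max(len(candidates) - 1, 0))
    return rng.sample(candidates, limit)


fleet = [
    Instance("i-1", "zone-a"),
    Instance("i-2", "zone-a"),
    Instance("i-3", "zone-a"),
    Instance("i-4", "zone-b"),
    Instance("i-5", "zone-b"),
]
first_round = pick_targets(fleet, "zone-a")  # one instance, one zone
```

Raising `max_targets` in later rounds widens the blast radius step by step as confidence grows.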

For AI systems, chaos engineering reveals critical failure modes: what happens when the LLM API is unavailable for 30 seconds, when vector database latency doubles, when the feature store returns stale data, or when a model returns malformed output. Running these experiments under controlled conditions lets you verify that your fallback paths work, rather than discovering they are broken during a real incident.
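
Two of these failure modes, an unavailable LLM API and malformed model output, can be exercised with a small fault-injecting wrapper. This is a minimal sketch under stated assumptions: `fake_llm_call` stands in for a real client, and `inject_faults`, `answer_with_fallback`, and `run_experiment` are hypothetical names invented for illustration, not part of any library.

```python
import json
import random


class LLMUnavailable(Exception):
    """Raised when the (simulated) LLM API cannot be reached."""


def fake_llm_call(prompt):
    # Hypothetical stand-in for a real LLM client; returns a JSON string.
    return json.dumps({"answer": f"echo: {prompt}"})


def inject_faults(call, outage_rate=0.0, malformed_rate=0.0, rng=None):
    """Wrap `call` so it sometimes raises or returns malformed output."""
    if rng is None:
        rng = random.Random()

    def wrapped(prompt):
        roll = rng.random()
        if roll < outage_rate:
            raise LLMUnavailable("injected outage")
        if roll < outage_rate + malformed_rate:
            return "{not valid json"  # deliberately unparseable
        return call(prompt)

    return wrapped


def answer_with_fallback(llm, prompt, fallback="Sorry, please try again later."):
    """The code under test: it should degrade gracefully on either fault."""
    try:
        return json.loads(llm(prompt))["answer"]
    except (LLMUnavailable, json.JSONDecodeError, KeyError):
        return fallback


def run_experiment(trials=1000, seed=0):
    """Run many calls with faults injected and count degraded responses."""
    llm = inject_faults(
        fake_llm_call,
        outage_rate=0.2,
        malformed_rate=0.2,
        rng=random.Random(seed),
    )
    results = [answer_with_fallback(llm, "ping") for _ in range(trials)]
    degraded = sum(r != "echo: ping" for r in results)
    return {"trials": trials, "degraded": degraded}
```

If `run_experiment()` completes without an uncaught exception, the fallback path handled every injected fault; the `degraded` count shows how often users would have seen the fallback message instead of a real answer.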

Related Terms

A/B Testing

Feature Flag

MLOps

Model Serving

Semantic Search

CI/CD (Continuous Integration / Continuous Deployment)