CUPED Variance Reduction

A statistical technique (Controlled-experiment Using Pre-Experiment Data) that reduces metric variance in online experiments by adjusting for pre-experiment user behavior, increasing statistical power by 20-50% without requiring larger sample sizes.

CUPED, introduced by Microsoft Research, is a variance reduction technique that leverages pre-experiment data to increase the precision of experiment estimates. The core idea is that much of the variance in user behavior during an experiment is predictable from the user's behavior before the experiment. By adjusting the outcome metric using a pre-experiment covariate, the residual variance is reduced, and the treatment effect can be detected with fewer users or in less time. For growth teams, CUPED is one of the highest-impact methodological improvements available because it effectively multiplies the statistical power of every experiment without requiring any additional traffic, enabling faster experimentation cycles and the ability to detect smaller effects.

The CUPED adjustment works as follows: for each user, compute a pre-experiment covariate X (typically the same metric measured in a pre-experiment window, e.g., last week's page views). The adjusted outcome is Y_cuped = Y - theta * (X - X_bar), where theta = Cov(Y, X) / Var(X) is the coefficient that minimizes the variance of the adjusted outcome. The treatment effect is then estimated using Y_cuped instead of Y. Because X is measured before randomization, it is independent of treatment assignment, so the adjustment does not introduce bias. The adjusted variance is Var(Y) * (1 - rho^2), where rho is the correlation between Y and X. If the pre-experiment metric correlates with the experiment metric at rho = 0.7, CUPED reduces variance by 49%, roughly doubling the effective sample size. The technique extends naturally to multiple covariates using multivariate regression adjustment: Y_cuped = Y - (X - X_bar) * theta, where X is a vector of pre-experiment covariates and theta is the corresponding OLS coefficient vector. Modern implementations at companies like Airbnb, Netflix, and Uber use machine learning models to predict user outcomes from rich pre-experiment features, achieving even larger variance reductions.
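The single-covariate adjustment can be sketched in a few lines of NumPy (simulated data; variable names are illustrative):

```python
import numpy as np

def cuped_adjust(y, x):
    """CUPED: Y_cuped = Y - theta * (X - mean(X)), with theta = Cov(Y, X) / Var(X),
    the coefficient that minimizes the variance of the adjusted outcome."""
    theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean()), theta

# Simulated data: this week's metric y correlates with last week's metric x.
rng = np.random.default_rng(0)
x = rng.normal(10.0, 3.0, size=10_000)           # pre-experiment covariate
y = 0.8 * x + rng.normal(0.0, 2.0, size=10_000)  # experiment outcome
y_adj, theta = cuped_adjust(y, x)
```

Because the adjustment subtracts a mean-zero quantity, the mean of the adjusted metric equals the mean of the raw metric; only the variance shrinks, by roughly rho squared.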

CUPED should be applied to every experiment where pre-experiment data is available, which is nearly always the case for logged-in users in digital products. The primary requirement is a pre-experiment covariate that is correlated with the outcome metric and measured before randomization. Using the same metric from a prior period (e.g., last week's purchases to adjust this week's purchases) is the simplest and often most effective approach. Common pitfalls include using covariates measured during or after randomization (which introduces bias), not accounting for users without pre-experiment data (new users need separate handling), and applying CUPED to ratio metrics without proper delta method adjustments. The technique is sometimes confused with ANCOVA (analysis of covariance); the two are equivalent in the two-group case, but CUPED specifically requires covariates measured before the experiment, which is what guarantees independence from treatment assignment.
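One way to handle the new-user pitfall above, sketched under the assumption that missing covariates are encoded as NaN: estimate theta only on users with pre-experiment data and impute the covariate mean for the rest, which leaves new users' outcomes exactly unadjusted. This is one common convention, not the only valid one.

```python
import numpy as np

def cuped_with_new_users(y, x):
    """CUPED when some users lack pre-experiment data (NaN in x).
    Fit theta on users with a covariate; impute the covariate mean for
    new users so their adjustment term theta * (x - x_bar) is zero."""
    has_x = ~np.isnan(x)
    x_bar = x[has_x].mean()
    theta = np.cov(y[has_x], x[has_x], ddof=1)[0, 1] / np.var(x[has_x], ddof=1)
    x_filled = np.where(has_x, x, x_bar)
    return y - theta * (x_filled - x_bar)

rng = np.random.default_rng(1)
x = rng.normal(10.0, 3.0, size=5_000)
y = 0.8 * x + rng.normal(0.0, 2.0, size=5_000)
x[:500] = np.nan                  # first 500 users are new: no pre-period data
y_adj = cuped_with_new_users(y, x)
```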

Advanced variance reduction techniques extend CUPED in several directions. CUPAC (Controlled-experiment Using Predictions As Covariates), introduced at DoorDash, trains a machine learning model on pre-experiment data to predict user outcomes, then uses these predictions as the covariate, often achieving larger variance reductions than simple CUPED because the model captures nonlinear relationships. Stratified CUPED applies the adjustment within strata to handle heterogeneous correlation structures. For sequential experiments, the CUPED adjustment must be applied carefully to maintain the validity of sequential stopping boundaries. Some platforms like Statsig and Eppo implement CUPED automatically, applying the pre-experiment adjustment to all metrics by default. The technique has become standard practice at top experimentation organizations and represents one of the few free lunches in statistics: more precise estimates with no additional data collection cost.
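A minimal CUPAC-style sketch: predict the outcome from multiple pre-experiment features, then plug the prediction into the standard CUPED formula as the covariate. Ordinary least squares stands in here for the ML model (production systems typically use gradient-boosted trees trained on data outside the experiment to avoid leakage); all names and data are illustrative.

```python
import numpy as np

def cupac_adjust(y, features):
    """CUPAC sketch: fit a predictor of y from pre-experiment features,
    then apply the standard CUPED adjustment with the prediction as the
    covariate. A least-squares fit stands in for the ML model."""
    X = np.column_stack([np.ones(len(y)), features])   # add intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    pred = X @ beta                                    # predicted outcome = covariate
    theta = np.cov(y, pred, ddof=1)[0, 1] / np.var(pred, ddof=1)
    return y - theta * (pred - pred.mean())

rng = np.random.default_rng(2)
features = rng.normal(size=(5_000, 3))                 # pre-experiment features
y = features @ np.array([1.0, -0.5, 2.0]) + rng.normal(0.0, 1.0, size=5_000)
y_adj = cupac_adjust(y, features)
```

The payoff over single-covariate CUPED is that the prediction pools signal from many pre-experiment features into one covariate, so the correlation rho with the outcome, and hence the variance reduction, is larger.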

Related Terms

Power Analysis

A statistical calculation performed before an experiment to determine the minimum sample size required to detect a meaningful effect with a specified probability, balancing the risk of false negatives against practical constraints like traffic and experiment duration.

Split Testing

The practice of randomly dividing users into two or more groups and exposing each group to a different version of a product experience to measure which version performs better on a target metric, commonly known as A/B testing.

Stratified Randomization

A randomization technique that first divides the user population into homogeneous subgroups (strata) based on important characteristics, then randomizes independently within each stratum to ensure treatment groups are balanced on known confounders and to improve statistical precision.

Multivariate Testing

An experimentation method that simultaneously tests multiple variables and their combinations to determine which combination of changes produces the best outcome, unlike A/B testing which typically varies a single element at a time.

Holdout Testing

An experimental design where a small percentage of users are permanently excluded from receiving a new feature or set of features, serving as a long-term control group to measure the cumulative impact of product changes over time.

Effect Size

A quantitative measure of the magnitude of a treatment's impact, expressed either as an absolute difference between groups or as a standardized metric like Cohen's d that allows comparison across different experiments and metrics.