Holdout Testing
An experimental design where a small percentage of users are permanently excluded from receiving a new feature or set of features, serving as a long-term control group to measure the cumulative impact of product changes over time.
Holdout testing addresses a fundamental challenge in experimentation programs: while individual A/B tests measure the incremental impact of single changes, teams rarely measure whether the sum of all changes over months or years actually delivers the expected cumulative benefit. A holdout group, typically 1-5% of users, remains on an older version of the product for an extended period, providing a stable baseline against which the aggregate impact of all shipped changes can be measured. For growth teams, holdout tests are essential for validating that the experimentation program as a whole is delivering value and that the cumulative effect of many small wins materializes in long-term metrics like retention and revenue.
Implementing a holdout test requires careful infrastructure planning. The holdout group must be defined at the randomization unit level (usually user ID) and persist across all experiments. When a new feature ships after winning an A/B test, the holdout group does not receive it. This means the experimentation platform must support layered assignment where holdout membership takes precedence over individual experiment assignments. Platforms like Statsig and Eppo provide built-in holdout group management. The analysis compares key business metrics between the holdout group and the rest of the user population over time, using the same statistical methods as standard A/B tests but with the holdout as the control. Because the holdout group is typically small, the analysis has lower statistical power for detecting small effects, but since the expected cumulative effect should be large, this is usually acceptable. Teams should track a comprehensive set of metrics including engagement, retention, revenue, and performance indicators.
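The layered-assignment logic described above can be sketched with deterministic salted hashing, the standard approach for stable, persistent bucketing. This is an illustrative sketch, not any specific platform's implementation; the salt value and the 1% holdout size are assumptions chosen for the example.

```python
import hashlib

HOLDOUT_SALT = "holdout-2024q1"  # hypothetical salt; rotate when the holdout is refreshed
HOLDOUT_BUCKETS = 10             # 10 of 1000 buckets -> a 1% holdout

def _bucket(user_id: str, salt: str, buckets: int = 1000) -> int:
    """Deterministically map a user to a bucket via a salted hash,
    so assignment is stable across sessions and devices."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % buckets

def in_holdout(user_id: str) -> bool:
    return _bucket(user_id, HOLDOUT_SALT) < HOLDOUT_BUCKETS

def assign_variant(user_id: str, experiment: str) -> str:
    """Holdout membership takes precedence over any experiment assignment."""
    if in_holdout(user_id):
        return "holdout"  # always receives the older product experience
    # Per-experiment salt keeps assignments independent across experiments.
    return "treatment" if _bucket(user_id, experiment) % 2 else "control"
```

Because the salt and user ID fully determine the bucket, a user's holdout status survives restarts and works across services without shared state; changing the salt at refresh time draws an entirely new random sample.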
Holdout tests should be established when an experimentation program matures to the point where dozens of experiments are shipping per quarter. The primary use case is auditing the experimentation program itself: if the holdout group performs similarly to the rest of the population despite dozens of winning experiments being shipped, it signals either that the individual experiment analyses are flawed (perhaps due to peeking or metric gaming) or that the effects are not durable. Common pitfalls include making the holdout group too large, which limits the number of users benefiting from improvements, or too small, which reduces analytical power. Another challenge is managing the user experience for the holdout group, which over time receives an increasingly degraded product. Teams must decide on a refresh cadence, typically every 6-12 months, where the current holdout is dissolved and a new random sample is selected.
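The too-large versus too-small tradeoff can be quantified with a standard two-proportion power calculation: for a given holdout fraction, what is the smallest absolute lift the comparison can reliably detect? The sketch below uses the normal approximation; the function name and example figures are illustrative, not from any particular platform.

```python
from statistics import NormalDist

def holdout_mde(total_users: int, holdout_frac: float, baseline_rate: float,
                alpha: float = 0.05, power: float = 0.8) -> float:
    """Approximate minimum detectable absolute lift for a holdout comparison,
    using the normal approximation for a two-proportion z-test with
    unequal group sizes."""
    n_holdout = total_users * holdout_frac
    n_rest = total_users * (1 - holdout_frac)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_beta = NormalDist().inv_cdf(power)            # desired power
    se = (baseline_rate * (1 - baseline_rate)
          * (1 / n_holdout + 1 / n_rest)) ** 0.5
    return (z_alpha + z_beta) * se

# Example: with 1M users and a 30% retention baseline, growing the holdout
# from 1% to 5% shrinks the detectable lift -- the core sizing tradeoff.
mde_1pct = holdout_mde(1_000_000, 0.01, 0.30)
mde_5pct = holdout_mde(1_000_000, 0.05, 0.30)
```

Since the standard error is dominated by the smaller group, doubling the holdout roughly cuts the detectable effect by a factor of √2, which is why very small holdouts only make sense when the expected cumulative effect is large.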
Advanced holdout strategies include maintaining multiple holdout groups at different refresh points to create a continuous measurement baseline, using propensity score weighting to adjust for any compositional drift that occurs if holdout users churn at different rates, and implementing partial holdouts that exclude users from specific categories of changes while receiving others. Some organizations use holdouts to validate machine learning model improvements specifically, keeping a group on the previous model version. The concept extends to advertising, where brand lift holdout studies withhold ads from a random group to measure the true incremental impact of advertising beyond organic behavior. Netflix famously uses holdout groups to validate that their personalization algorithms collectively deliver substantial engagement gains, publishing research showing the holdout approach catches cases where individual test results overstate long-term impact.
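The reweighting idea for compositional drift can be illustrated with a minimal post-stratification sketch: if holdout users churn at different rates, the surviving holdout is reweighted so its distribution over a covariate (tenure, segment, platform) matches the full population before comparing means. This is a simplified stand-in for full propensity score weighting, with hypothetical data shapes assumed for the example.

```python
from collections import Counter

def drift_adjusted_mean(holdout, population, key, metric):
    """Weighted mean of `metric` over holdout users, with each user weighted
    by (population share of their stratum) / (holdout share of their stratum)
    so the holdout's covariate mix matches the population's."""
    pop_counts = Counter(key(u) for u in population)
    hold_counts = Counter(key(u) for u in holdout)
    pop_total, hold_total = len(population), len(holdout)
    num = den = 0.0
    for u in holdout:
        stratum = key(u)
        weight = ((pop_counts[stratum] / pop_total)
                  / (hold_counts[stratum] / hold_total))
        num += weight * metric(u)
        den += weight
    return num / den
```

If, say, long-tenured users are over-represented in the holdout because newer holdout users churned faster, an unweighted comparison would confound tenure with treatment; the reweighted mean removes that compositional bias for any covariates included in the stratification.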
Related Terms
Split Testing
The practice of randomly dividing users into two or more groups and exposing each group to a different version of a product experience to measure which version performs better on a target metric, commonly known as A/B testing.
Long-Running Experiment
An experiment maintained for weeks, months, or even years beyond the standard analysis period to measure the long-term and cumulative effects of a treatment, capturing delayed impacts on retention, revenue, and user behavior that short-term experiments miss.
Guardrail Metric Testing
The practice of monitoring a set of critical business metrics during every experiment to detect unintended negative side effects, even when the primary experiment metric shows a positive result, ensuring that optimizing one metric does not degrade overall user experience or business health.
Multivariate Testing
An experimentation method that simultaneously tests multiple variables and their combinations to determine which combination of changes produces the best outcome, unlike A/B testing which typically varies a single element at a time.
Power Analysis
A statistical calculation performed before an experiment to determine the minimum sample size required to detect a meaningful effect with a specified probability, balancing the risk of false negatives against practical constraints like traffic and experiment duration.
Effect Size
A quantitative measure of the magnitude of a treatment's impact, expressed either as an absolute difference between groups or as a standardized metric like Cohen's d that allows comparison across different experiments and metrics.