Crossover Design

An experimental design where the same subjects receive both the treatment and control conditions in different time periods, with each subject serving as their own control, reducing variance from between-subject differences.

In a crossover design, users are exposed to both experimental conditions in sequence. A simple two-period crossover randomly assigns users to two groups: Group 1 receives treatment A then treatment B, while Group 2 receives treatment B then treatment A. By comparing each user's outcomes under both conditions, between-subject variability is eliminated from the treatment effect estimate, dramatically increasing statistical power. For growth teams, crossover designs are particularly valuable when user-to-user variability is high relative to the expected treatment effect, which is common in digital experimentation where user behavior varies widely. The within-subject comparison can reduce variance by 50-80% compared to a between-subject design, enabling smaller sample sizes or shorter experiment durations.
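The variance advantage of the within-subject comparison can be illustrated with a minimal simulation (all parameter values here are hypothetical; the between-subject standard deviation is deliberately made large relative to the effect):

```python
import random
import statistics

random.seed(42)

N = 2000            # subjects (hypothetical)
TRUE_EFFECT = 0.5   # simulated lift of treatment B over A
SUBJECT_SD = 3.0    # between-subject variability, large vs. the effect
NOISE_SD = 1.0      # within-subject measurement noise

# Every subject is observed under both conditions, as in a crossover.
subjects = [random.gauss(0.0, SUBJECT_SD) for _ in range(N)]
y_a = [s + random.gauss(0.0, NOISE_SD) for s in subjects]
y_b = [s + TRUE_EFFECT + random.gauss(0.0, NOISE_SD) for s in subjects]

# Crossover analysis: the subject effect cancels in each paired difference.
diffs = [b - a for a, b in zip(y_a, y_b)]
within_se = statistics.stdev(diffs) / N ** 0.5

# Analyzing the same outcomes as if they came from two independent groups
# keeps the subject variability in the standard error.
between_se = (statistics.variance(y_a) / N
              + statistics.variance(y_b) / N) ** 0.5

print(f"paired (crossover) estimate: {statistics.mean(diffs):.3f} +/- {within_se:.3f}")
print(f"unpaired standard error:     {between_se:.3f}")
```

With these made-up numbers the paired standard error comes out at roughly a third of the unpaired one, illustrating how the within-subject comparison removes between-subject variance from the estimate.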

The analysis of a crossover design accounts for period effects (outcomes may differ between the first and second period regardless of treatment) and sequence effects (the order of treatments may matter). The standard model is: Y_ijk = mu + pi_j + tau_k + lambda_{j-1} + s_i + epsilon_ijk, where pi_j is the period effect, tau_k is the treatment effect, lambda_{j-1} is the carryover effect of the treatment received in the previous period, s_i is the subject random effect, and epsilon_ijk is the residual error. The treatment effect is typically estimated using a paired analysis: for each subject, compute the difference in outcomes between the two treatment conditions, then test whether the mean difference is zero. The key advantage is that the subject random effect s_i cancels in the within-subject comparison, removing what is often the largest source of variance. Carryover is tested by examining whether the per-subject sum of outcomes across periods differs between the two sequence groups. If carryover is present, only first-period data should be used, which sacrifices the within-subject advantage.
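The period-adjusted estimator and the carryover check for a two-period crossover can be sketched as follows (a simulation with made-up effect sizes; no carryover is simulated, so the carryover gap should sit near zero):

```python
import random
import statistics

random.seed(7)

N = 500             # subjects per sequence (hypothetical)
TAU = 0.5           # simulated effect of treatment B over A
PERIOD = 0.3        # simulated period-2 effect (e.g. novelty wearing off)
SUBJECT_SD = 2.0    # between-subject variability
NOISE_SD = 1.0      # within-subject noise

def outcome(subject, treated, second_period):
    y = subject + random.gauss(0.0, NOISE_SD)
    if treated:
        y += TAU
    if second_period:
        y += PERIOD
    return y

def simulate(sequence):
    s = random.gauss(0.0, SUBJECT_SD)          # subject random effect s_i
    if sequence == "AB":
        return outcome(s, False, False), outcome(s, True, True)
    return outcome(s, True, False), outcome(s, False, True)

ab = [simulate("AB") for _ in range(N)]        # A in period 1, B in period 2
ba = [simulate("BA") for _ in range(N)]        # B in period 1, A in period 2

# Treatment effect: difference the periods within each subject, then contrast
# the two sequences. The subject and period effects both cancel.
d_ab = [p2 - p1 for p1, p2 in ab]              # mean ~ +TAU + PERIOD
d_ba = [p2 - p1 for p1, p2 in ba]              # mean ~ -TAU + PERIOD
tau_hat = (statistics.mean(d_ab) - statistics.mean(d_ba)) / 2

# Carryover check: per-subject sums across periods, compared between sequences.
carryover_gap = (statistics.mean([p1 + p2 for p1, p2 in ab])
                 - statistics.mean([p1 + p2 for p1, p2 in ba]))

print(f"treatment effect estimate: {tau_hat:.3f}")   # near the simulated TAU
print(f"carryover gap estimate:    {carryover_gap:.3f}")
```

Note that the within-subject differences subtract out s_i exactly, while halving the contrast of sequence means cancels the shared period effect, which is why both nuisance terms drop out of tau_hat.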

Crossover designs should be used when the treatment effect is expected to be temporary and reversible, when there is a sufficient washout period between conditions to prevent carryover, and when user-level variability is a dominant source of noise. In digital experimentation, crossover designs work well for testing UI changes, recommendation algorithm variants, or search ranking modifications where the effect on user behavior does not persist after the treatment is removed. Common pitfalls include carryover effects where the first treatment influences behavior during the second treatment period (a user who learns a new navigation pattern may retain that knowledge even when switched back), dropout during the second period (creating missing data that complicates analysis), and period-by-treatment interactions where the treatment effect genuinely differs between periods.

Advanced crossover designs include higher-order crossovers with more than two periods (e.g., ABB/BAA designs that allow estimation and testing of carryover), Latin square crossover designs that balance multiple treatments across periods and subjects, and modified crossover designs that incorporate washout periods between treatment conditions to mitigate carryover. For digital experimentation, N-of-1 trials (single-subject crossovers with multiple alternation periods) can provide personalized treatment effect estimates, supporting user-level personalization. Bayesian crossover analysis naturally handles the hierarchical structure of subjects within sequences and periods within subjects, providing individual-level treatment effect estimates with proper uncertainty quantification. The switchback testing design is closely related to crossover but operates at the market or system level rather than the individual user level.
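As a sketch of the Latin-square idea above, a simple cyclic construction puts every treatment exactly once in each sequence (row) and each period (column). This is a hypothetical illustration of the balancing property only; a plain cyclic square does not balance first-order carryover, which is what Williams designs add.

```python
def latin_square_schedule(treatments):
    """Cyclic Latin square: sequence (row) i receives treatment (i + j) mod k
    in period (column) j, so each treatment appears once per row and column."""
    k = len(treatments)
    return [[treatments[(row + period) % k] for period in range(k)]
            for row in range(k)]

for seq in latin_square_schedule(["A", "B", "C"]):
    print(" -> ".join(seq))
# A -> B -> C
# B -> C -> A
# C -> A -> B
```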

Related Terms

Switchback Testing

An experimental design that alternates between treatment and control conditions over time periods within the same unit (such as a geographic region or marketplace), used when user-level randomization is not feasible due to interference or operational constraints.

Latin Square Design

An experimental design that controls for two known sources of variation by arranging treatments in a grid where each treatment appears exactly once in each row and column, efficiently balancing nuisance factors without requiring a full factorial experiment.

Factorial Design

An experimental design that simultaneously tests all possible combinations of two or more factors, each with multiple levels, enabling the estimation of both individual factor effects and interaction effects between factors in a single experiment.

Multivariate Testing

An experimentation method that simultaneously tests multiple variables and their combinations to determine which combination of changes produces the best outcome, unlike A/B testing which typically varies a single element at a time.

Split Testing

The practice of randomly dividing users into two or more groups and exposing each group to a different version of a product experience to measure which version performs better on a target metric, commonly known as A/B testing.

Holdout Testing

An experimental design where a small percentage of users are permanently excluded from receiving a new feature or set of features, serving as a long-term control group to measure the cumulative impact of product changes over time.