Switchback Testing

An experimental design that alternates between treatment and control conditions over time periods within the same unit (such as a geographic region or marketplace), used when user-level randomization is not feasible due to interference or operational constraints.

Switchback testing is an experimental design built for situations where traditional user-level A/B testing fails because of interference between users. In a two-sided marketplace like Uber, DoorDash, or Airbnb, changing the matching algorithm for some riders but not others is problematic because both groups compete for the same drivers. If treatment riders get better matches, control riders get worse matches by displacement, violating the stable unit treatment value assumption (SUTVA) that underlies standard A/B testing. Switchback testing solves this by alternating the entire marketplace between treatment and control in time blocks, such as running the treatment algorithm for one hour, then switching to control for the next hour. The treatment effect is estimated by comparing outcome metrics during treatment periods versus control periods, with appropriate adjustments for time trends.
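The basic mechanics can be illustrated with a minimal simulation in Python (using NumPy). The schedule, metric values, and the +2.0 lift below are all hypothetical; the point is the structure: every hour-long block gets one marketplace-wide assignment, and the lift is estimated by comparing treatment hours against control hours.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical one-day schedule: 24 one-hour blocks, each randomly
# assigned to treatment (1) or control (0) for the entire marketplace.
hours = np.arange(24)
assignment = rng.integers(0, 2, size=24)

# Simulated hourly metric (e.g., a match rate) with a time-of-day
# pattern and an assumed true treatment lift of +2.0.
baseline = 50 + 5 * np.sin(2 * np.pi * hours / 24)
metric = baseline + 2.0 * assignment + rng.normal(0, 1, size=24)

# Naive estimator: difference in means across treatment vs. control hours.
lift = metric[assignment == 1].mean() - metric[assignment == 0].mean()
print(f"estimated lift: {lift:.2f}")
```

With only 24 blocks and a strong time-of-day pattern, this naive estimate is noisy, which is why real designs stratify the randomization and adjust for temporal structure, as described next.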

The methodology involves dividing time into blocks (e.g., hours, days) and randomly assigning each block to treatment or control. The randomization can be stratified by time-of-day and day-of-week to balance known temporal patterns. For geographic markets, different regions can be assigned independently, increasing the effective sample size. The analysis uses a difference-in-means estimator comparing treatment and control periods, with adjustments for temporal autocorrelation and trends. Because adjacent time periods are correlated (demand at 2pm is similar to demand at 3pm), standard error calculations must account for this serial correlation using methods like Newey-West standard errors or cluster-robust standard errors with clustering by time block. The block length involves a tradeoff: shorter blocks increase the number of switches and statistical precision but may not allow the full treatment effect to manifest if there are carryover effects, while longer blocks reduce the number of independent observations.
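The serial-correlation adjustment can be sketched concretely. The example below (a simulation with assumed parameters, not production code) fits the difference-in-means as an OLS regression of the metric on a treatment indicator, then computes a Newey-West (HAC) standard error with a Bartlett kernel so that the confidence interval accounts for AR(1)-style correlation between adjacent blocks; the lag length of 8 is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 200 hourly blocks; the noise follows an AR(1) process, so
# adjacent periods are serially correlated. Assumed true lift: +0.5.
n = 200
d = rng.integers(0, 2, size=n)          # random block assignment
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = 0.6 * eps[t - 1] + rng.normal()
y = 10 + 0.5 * d + eps

# OLS of y on [intercept, treatment indicator]; the slope is the
# difference in means between treatment and control periods.
X = np.column_stack([np.ones(n), d])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

# Newey-West (HAC) covariance with a Bartlett kernel and L lags, so the
# standard error reflects serial correlation across nearby blocks.
L = 8
u = X * resid[:, None]                  # per-period score contributions
S = u.T @ u
for lag in range(1, L + 1):
    w = 1 - lag / (L + 1)               # Bartlett weight
    G = u[lag:].T @ u[:-lag]
    S += w * (G + G.T)
XtX_inv = np.linalg.inv(X.T @ X)
cov = XtX_inv @ S @ XtX_inv             # sandwich estimator
se_treat = np.sqrt(cov[1, 1])
print(f"lift = {beta[1]:.2f} +/- {1.96 * se_treat:.2f}")
```

In practice this calculation is usually delegated to a library (e.g., an OLS fit with an HAC covariance option) rather than written by hand; the manual version is shown only to make the adjustment explicit.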

Switchback testing should be used when interference between units makes user-level randomization invalid, which is common in marketplaces, ridesharing, delivery, pricing, and any context with shared resources. Common pitfalls include carryover effects, where the treatment condition in one period affects outcomes in subsequent control periods (e.g., a pricing change that causes users to stockpile); time-of-day confounding, if treatment and control periods are not balanced across temporal patterns; and an insufficient number of switching periods for adequate statistical power. The minimum number of switch periods depends on the desired power and the within-period and between-period variance, but typically at least 50-100 periods are needed for reasonable precision. Alternatives include geo-randomization (assigning entire cities or regions) and synthetic control methods for cases where even a switchback design is infeasible.
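The dependence of power on the number of switch periods can be checked with a quick Monte Carlo simulation. The sketch below assumes independent periods and an illustrative effect size of 0.6 standard deviations; real switchbacks have serial correlation, which lowers power further, so these numbers are optimistic.

```python
import numpy as np

rng = np.random.default_rng(1)

def switchback_power(n_periods, lift=0.6, sigma=1.0, sims=1000):
    """Monte Carlo power of a difference-in-means switchback test,
    assuming independent periods (an optimistic simplification)."""
    hits = 0
    for _ in range(sims):
        d = rng.integers(0, 2, size=n_periods)
        n1, n0 = d.sum(), n_periods - d.sum()
        if n1 == 0 or n0 == 0:
            continue                     # degenerate draw: skip
        y = lift * d + rng.normal(0, sigma, size=n_periods)
        est = y[d == 1].mean() - y[d == 0].mean()
        se = sigma * np.sqrt(1 / n1 + 1 / n0)
        if abs(est / se) > 1.96:         # two-sided test at alpha = 0.05
            hits += 1
    return hits / sims

# Power grows markedly as the number of switch periods increases.
print(switchback_power(20), switchback_power(100))
```

Under these assumptions, 20 periods leaves the test badly underpowered while 100 periods gives acceptable power, consistent with the 50-100 period rule of thumb above.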

Advanced switchback designs include multi-arm switchbacks that test several treatments, crossover designs that expose the same unit to all conditions in a balanced sequence, and adaptive switchback designs that adjust the assignment probability based on accumulating evidence. Interference-aware analysis methods can partially address carryover effects by including lagged treatment indicators in the regression model. Recent research by Bojinov and Shephard at Harvard has developed formal frameworks for switchback experiments that account for both temporal interference and serial correlation, providing valid confidence intervals and hypothesis tests. At companies like Uber, Lyft, and DoorDash, switchback experiments have become a core part of the experimentation toolkit, with dedicated infrastructure for scheduling treatment periods, monitoring real-time metrics during switches, and analyzing results with appropriate temporal adjustments.
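The lagged-treatment-indicator approach mentioned above can be sketched as a regression on both the current and previous block's assignment. The data and coefficients below are simulated with assumed values (contemporaneous lift +1.0, one-period carryover +0.4) purely to show how the two effects are separated.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulate 300 blocks where the metric depends on both the current
# block's assignment and, via carryover, the previous block's.
n = 300
d = rng.integers(0, 2, size=n)
d_lag = np.r_[0, d[:-1]]                # previous block's assignment
y = 5 + 1.0 * d + 0.4 * d_lag + rng.normal(0, 1, size=n)

# Including the lagged indicator separates the contemporaneous effect
# (coefficient on d) from the one-period carryover (coefficient on d_lag).
X = np.column_stack([np.ones(n), d, d_lag])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"contemporaneous: {beta[1]:.2f}, carryover: {beta[2]:.2f}")
```

Longer carryover horizons can be handled the same way by adding further lags, at the cost of statistical power; this only partially addresses carryover, as the section notes.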

Related Terms

Cluster Randomization

An experimental design that randomly assigns groups (clusters) of users rather than individual users to treatment conditions, used when individual randomization is not feasible or when interference between users within the same cluster would violate independence assumptions.

Crossover Design

An experimental design where the same subjects receive both the treatment and control conditions in different time periods, with each subject serving as their own control, reducing variance from between-subject differences.

Marketplace Experiment

An experiment conducted in a two-sided or multi-sided marketplace where treatment effects can propagate between buyer and seller sides, requiring specialized experimental designs that account for cross-side interference and equilibrium effects.

Multivariate Testing

An experimentation method that simultaneously tests multiple variables and their combinations to determine which combination of changes produces the best outcome, unlike A/B testing which typically varies a single element at a time.

Split Testing

The practice of randomly dividing users into two or more groups and exposing each group to a different version of a product experience to measure which version performs better on a target metric, commonly known as A/B testing.

Holdout Testing

An experimental design where a small percentage of users are permanently excluded from receiving a new feature or set of features, serving as a long-term control group to measure the cumulative impact of product changes over time.