Stratified Randomization

A randomization technique that first divides the user population into homogeneous subgroups (strata) based on important characteristics, then randomizes independently within each stratum to ensure treatment groups are balanced on known confounders and to improve statistical precision.

Stratified randomization ensures that experiment groups are balanced on important user characteristics, reducing the risk of accidental imbalance that can bias results and increasing statistical precision. While simple random assignment produces balanced groups on average, any single experiment may have meaningful imbalance by chance, especially with small sample sizes. By stratifying on characteristics like platform (iOS vs. Android), geography (US vs. international), user tenure (new vs. existing), or subscription status (free vs. paid), the randomization guarantees exact or near-exact balance within each stratum. For growth teams, stratified randomization is particularly valuable when the stratification variables are strong predictors of the outcome metric, because it reduces within-group variance and increases statistical power.
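The contrast between simple and stratified assignment can be sketched with a small simulation. The population, the 60/40 mobile/desktop split, and the helper names below are illustrative assumptions, not part of any particular platform's API:

```python
import random

random.seed(7)

# Hypothetical population: 200 users, 60% mobile / 40% desktop.
users = [{"id": i, "platform": "mobile" if i < 120 else "desktop"}
         for i in range(200)]

def simple_assign(users):
    """Independent 50/50 coin flip per user; balance only holds on average."""
    return {u["id"]: random.choice(["treatment", "control"]) for u in users}

def stratified_assign(users, key):
    """Shuffle within each stratum, then split each stratum exactly in half."""
    assignment = {}
    strata = {}
    for u in users:
        strata.setdefault(u[key], []).append(u)
    for members in strata.values():
        random.shuffle(members)
        half = len(members) // 2
        for i, u in enumerate(members):
            assignment[u["id"]] = "treatment" if i < half else "control"
    return assignment

def mobile_share(users, assignment, group):
    """Fraction of a group's users who are on mobile."""
    members = [u for u in users if assignment[u["id"]] == group]
    return sum(u["platform"] == "mobile" for u in members) / len(members)
```

With simple assignment, the mobile share in each group drifts around 60% by chance; with stratified assignment it is exactly 60% in both groups, because each stratum is split exactly in half.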

The implementation of stratified randomization involves defining strata based on user characteristics, then performing independent random assignment within each stratum. For example, with two strata (mobile and desktop) and a 50/50 split, exactly 50% of mobile users are assigned to treatment and 50% to control, and independently, 50% of desktop users are assigned to each. This guarantees that the treatment and control groups have identical platform composition. The analysis should account for the stratification through stratified estimation: the overall treatment effect is a weighted average of within-stratum effects, weighted by stratum size. The stratified estimator has lower variance than the unstratified estimator when the stratum means differ, with variance reduction proportional to the between-stratum variation as a fraction of total variation. Experimentation platforms like Statsig and Eppo support stratified randomization natively, and most hash-based randomization systems can incorporate stratification through layered hashing.
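The stratified estimator described above (a weighted average of within-stratum effects, weighted by stratum size) can be computed directly. The outcome numbers below are made up for illustration:

```python
# Hypothetical per-user outcomes keyed by (stratum, group); values are illustrative.
data = {
    ("mobile", "treatment"):  [0.12, 0.15, 0.11, 0.14],
    ("mobile", "control"):    [0.10, 0.09, 0.11, 0.10],
    ("desktop", "treatment"): [0.30, 0.28, 0.33, 0.29],
    ("desktop", "control"):   [0.25, 0.27, 0.24, 0.26],
}

def mean(xs):
    return sum(xs) / len(xs)

def stratified_effect(data):
    """Overall treatment effect: within-stratum mean differences,
    weighted by each stratum's share of the total sample."""
    strata = {s for s, _ in data}
    total = sum(len(v) for v in data.values())
    effect = 0.0
    for s in strata:
        treat, ctrl = data[(s, "treatment")], data[(s, "control")]
        weight = (len(treat) + len(ctrl)) / total
        effect += weight * (mean(treat) - mean(ctrl))
    return effect
```

Here the mobile effect is 0.03 and the desktop effect is 0.045, and with equal-sized strata the overall estimate is their average, 0.0375.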

Stratified randomization should be used when there are known strong predictors of the outcome that could create accidental imbalance, when the experiment has a relatively small sample size where random chance could produce meaningful imbalance, and when certain strata are particularly important for the analysis (e.g., the team plans to analyze mobile and desktop separately). Common pitfalls include stratifying on too many variables (which creates many small strata and can make the randomization complex), stratifying on post-randomization variables (which is conceptually invalid, since their values are not known at assignment time and may themselves be affected by treatment), not accounting for the stratification in the analysis (which produces valid but less efficient estimates), and confusing stratification with blocking in factorial designs. The number of strata should be kept manageable, typically using 2-8 strata defined by 1-3 categorical variables.

Advanced stratification methods include covariate-adaptive randomization methods like Pocock-Simon minimization, which sequentially assigns users to maintain balance across multiple covariates simultaneously, even when the number of covariates makes full stratification impractical. Re-randomization methods check whether a proposed random assignment achieves acceptable covariate balance and re-draw if it does not, providing guaranteed balance at the cost of modifying the randomization distribution. For large-scale online experiments, the practical benefit of stratification diminishes because random assignment produces near-perfect balance with large samples, but stratified analysis (analyzing within strata and combining) still provides variance reduction benefits even without stratified randomization. The combination of stratified randomization with CUPED variance reduction provides compounding power improvements.
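The re-randomization idea (check the balance of a proposed assignment and redraw if it fails the criterion) can be sketched as follows. The covariate name, the mean-difference balance criterion, and the retry cap are all assumptions chosen for illustration:

```python
import random
import statistics

random.seed(42)

# Hypothetical users with one pre-experiment covariate (e.g., prior weekly sessions).
users = [{"id": i, "prior_sessions": random.gauss(5, 2)} for i in range(100)]

def rerandomize(users, covariate, tolerance, max_draws=1000):
    """Draw 50/50 assignments until the treatment/control means of the
    covariate differ by less than `tolerance`; redraw otherwise."""
    for _ in range(max_draws):
        ids = [u["id"] for u in users]
        random.shuffle(ids)
        treat = set(ids[: len(ids) // 2])
        t_mean = statistics.mean(u[covariate] for u in users if u["id"] in treat)
        c_mean = statistics.mean(u[covariate] for u in users if u["id"] not in treat)
        if abs(t_mean - c_mean) < tolerance:
            return treat
    raise RuntimeError("no acceptable assignment found; loosen the tolerance")

treatment_ids = rerandomize(users, "prior_sessions", tolerance=0.1)
```

Note that accepting only balanced draws changes the randomization distribution, so the analysis should use inference that accounts for the acceptance criterion (e.g., randomization tests restricted to accepted assignments).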

Related Terms

Randomization Unit

The entity (user, session, page view, device, cluster, or geographic region) at which random assignment to experiment variants occurs, determining the independence structure of the data and affecting both the validity and statistical power of the experiment.

CUPED Variance Reduction

A statistical technique (Controlled-experiment Using Pre-Experiment Data) that reduces metric variance in online experiments by adjusting for pre-experiment user behavior, increasing statistical power by 20-50% without requiring larger sample sizes.

Split Testing

The practice of randomly dividing users into two or more groups and exposing each group to a different version of a product experience to measure which version performs better on a target metric, commonly known as A/B testing.

Multivariate Testing

An experimentation method that simultaneously tests multiple variables and their combinations to determine which combination of changes produces the best outcome, unlike A/B testing which typically varies a single element at a time.

Holdout Testing

An experimental design where a small percentage of users are permanently excluded from receiving a new feature or set of features, serving as a long-term control group to measure the cumulative impact of product changes over time.

Power Analysis

A statistical calculation performed before an experiment to determine the minimum sample size required to detect a meaningful effect with a specified probability, balancing the risk of false negatives against practical constraints like traffic and experiment duration.