Cluster Randomization
An experimental design that randomly assigns groups (clusters) of users rather than individual users to treatment conditions, used when individual randomization is not feasible or when interference between users within the same cluster would violate independence assumptions.
Cluster randomization assigns entire groups of related users to the same treatment condition. Clusters can be geographic regions, companies, schools, social network communities, or any grouping where users interact with each other. For example, randomizing entire companies to test a collaboration feature ensures that all users within a company have the same experience, avoiding the confusing situation in which some team members have a feature that their colleagues lack. For growth teams, cluster randomization is essential for features involving user-to-user interactions (messaging, sharing, collaboration), marketplace dynamics (driver-rider matching), or network effects, where individual randomization would create interference between treatment and control users within the same cluster.
The key statistical consideration in cluster randomization is that the effective sample size is much smaller than the total number of users, because users within a cluster are correlated. The design effect, which quantifies this reduction, is DE = 1 + (m - 1) * ICC, where m is the average cluster size and ICC is the intracluster correlation coefficient (the proportion of total variance attributable to between-cluster differences). With an ICC of 0.05 and clusters of 100 users, the design effect is 1 + 99 * 0.05 = 5.95, meaning you need roughly 6 times as many users as an individually randomized experiment for the same power. The analysis must account for clustering using methods like generalized estimating equations (GEE), mixed-effects models, or cluster-robust standard errors. The simplest approach aggregates outcomes to the cluster level and performs the analysis on cluster-level means, treating clusters as the independent unit. Power analysis for cluster randomized experiments requires specifying both the number of clusters and the cluster size, using formulas that incorporate the ICC; the required number of clusters per arm is n_clusters = 2 * (z_{alpha/2} + z_beta)^2 * (sigma_b^2 + sigma_w^2 / m) / delta^2, where sigma_b^2 and sigma_w^2 are the between- and within-cluster variances and delta is the minimum detectable effect.
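The design effect and clusters-per-arm formula above can be sketched in a few lines of standard-library Python. The planning inputs below (total variance 1.0 split into sigma_b^2 = 0.05 and sigma_w^2 = 0.95, a minimum detectable effect of 0.1, and a total user count of 60,000) are hypothetical illustrations, not from the text; the 5.95 design effect matches the worked example in the paragraph.

```python
import math
from statistics import NormalDist

def design_effect(m, icc):
    """Variance inflation from randomizing clusters of average size m: 1 + (m - 1) * ICC."""
    return 1 + (m - 1) * icc

def clusters_per_arm(delta, sigma_b2, sigma_w2, m, alpha=0.05, power=0.8):
    """Clusters per arm: 2 * (z_{alpha/2} + z_beta)^2 * (sigma_b^2 + sigma_w^2 / m) / delta^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    k = 2 * (z_alpha + z_beta) ** 2 * (sigma_b2 + sigma_w2 / m) / delta ** 2
    return math.ceil(k)

# Worked example from the text: ICC = 0.05, clusters of 100 users.
de = design_effect(100, 0.05)   # 5.95
effective_n = 60_000 / de       # hypothetical: ~10,084 effective users from 60,000 actual

# Hypothetical planning inputs: sigma_b^2 = 0.05, sigma_w^2 = 0.95, delta = 0.1.
k = clusters_per_arm(delta=0.1, sigma_b2=0.05, sigma_w2=0.95, m=100)
```

Note how weakly cluster size helps once m is large: sigma_w^2 / m shrinks with more users per cluster, but sigma_b^2 does not, so adding clusters buys far more power than growing existing ones.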
Cluster randomization should be used when interference between users makes individual randomization invalid, when the treatment must be applied at the group level for operational reasons, or when contamination between groups would be unacceptable. Common pitfalls include underestimating the sample size requirement by ignoring the design effect, having too few clusters for reliable inference (at least 20-30 clusters per arm are recommended for cluster-robust standard errors), and failing to measure or account for cluster-level confounders. The loss of power from clustering can be partially offset by stratifying the randomization on important cluster-level characteristics, ensuring balance between treatment and control clusters. Teams should also consider whether partial clustering (some outcomes are clustered, others are individual-level) applies to their setting.
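The stratified randomization described above can be sketched as follows; the company names and "large"/"small" size tiers are hypothetical, and the even split within each stratum is what guarantees balance on the stratifying characteristic.

```python
import random

def stratified_cluster_assignment(cluster_strata, seed=42):
    """Randomize clusters to treatment/control within each stratum so the arms
    stay balanced on the stratifying characteristic (e.g., company size tier).
    cluster_strata maps cluster_id -> stratum label."""
    rng = random.Random(seed)
    by_stratum = {}
    for cid, stratum in cluster_strata.items():
        by_stratum.setdefault(stratum, []).append(cid)
    assignment = {}
    for stratum, cids in sorted(by_stratum.items()):
        cids = sorted(cids)  # stable order before shuffling, for reproducibility
        rng.shuffle(cids)
        half = len(cids) // 2
        for cid in cids[:half]:
            assignment[cid] = "treatment"
        for cid in cids[half:]:
            assignment[cid] = "control"
    return assignment

# Hypothetical example: 12 companies stratified by size tier.
strata = {f"co{i}": ("large" if i < 4 else "small") for i in range(12)}
arms = stratified_cluster_assignment(strata)
```

With few clusters, this within-stratum coin flip prevents the unlucky draw where, say, all the large companies land in treatment, which would confound the treatment effect with company size.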
Advanced cluster randomization designs include stepped-wedge designs, where all clusters eventually receive the treatment but the timing is randomized, providing additional temporal comparisons. Adaptive cluster randomization uses covariate-adaptive methods to improve balance when the number of clusters is small. For social network experiments, graph cluster randomization partitions the social graph into dense communities and randomizes at the community level, reducing interference while preserving more of the individual-level variation than geographic clustering. Platforms like PlanOut (from Meta) and internal tools at LinkedIn and Uber support cluster-randomized experiments with built-in variance estimation. The increasing importance of network effects in digital products means cluster randomization will only become more prevalent as traditional user-level A/B testing proves inadequate for evaluating social, collaborative, and marketplace features.
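A stepped-wedge rollout plan like the one described above can be sketched as a schedule generator; the region names and the even spread of crossover times across steps are illustrative assumptions, not a prescribed design.

```python
import random

def stepped_wedge_schedule(cluster_ids, n_periods, seed=7):
    """Randomize the crossover time for each cluster. Every cluster starts in
    control at period 0 and has switched to treatment by the final period;
    only the timing of the switch is randomized."""
    rng = random.Random(seed)
    ids = sorted(cluster_ids)
    rng.shuffle(ids)
    steps = n_periods - 1  # available crossover points
    schedule = {}
    for i, cid in enumerate(ids):
        cross = 1 + i % steps  # period at which this cluster switches to treatment
        schedule[cid] = ["C"] * cross + ["T"] * (n_periods - cross)
    return schedule

# Hypothetical example: 6 regions rolled out over 4 periods.
plan = stepped_wedge_schedule(["r1", "r2", "r3", "r4", "r5", "r6"], n_periods=4)
```

Each period contributes both treated and untreated clusters, so the design supports within-cluster (before vs. after crossover) and between-cluster comparisons at every time point, which is where the extra temporal information comes from.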
Related Terms
Switchback Testing
An experimental design that alternates between treatment and control conditions over time periods within the same unit (such as a geographic region or marketplace), used when user-level randomization is not feasible due to interference or operational constraints.
Randomization Unit
The entity (user, session, page view, device, cluster, or geographic region) at which random assignment to experiment variants occurs, determining the independence structure of the data and affecting both the validity and statistical power of the experiment.
Network Effect Experiment
An experiment designed to measure and optimize features that become more valuable as more users adopt them, addressing the unique challenges of testing network-dependent features where individual user value depends on the behavior and adoption of other users.
Multivariate Testing
An experimentation method that simultaneously tests multiple variables and their combinations to determine which combination of changes produces the best outcome, unlike A/B testing which typically varies a single element at a time.
Split Testing
The practice of randomly dividing users into two or more groups and exposing each group to a different version of a product experience to measure which version performs better on a target metric, commonly known as A/B testing.
Holdout Testing
An experimental design where a small percentage of users are permanently excluded from receiving a new feature or set of features, serving as a long-term control group to measure the cumulative impact of product changes over time.