Factorial Design

An experimental design that simultaneously tests all possible combinations of two or more factors, each with multiple levels, enabling the estimation of both individual factor effects and interaction effects between factors in a single experiment.

Factorial design extends simple A/B testing to study multiple factors simultaneously. In a 2x2 factorial design, two binary factors create four treatment cells: neither change, change A only, change B only, and both changes. This design efficiently estimates the main effect of each factor (averaged across levels of the other factors) and the interaction effect (whether the combination is more or less effective than the sum of individual effects). For growth teams, factorial designs are valuable when multiple changes are being considered simultaneously, such as testing both a new headline and a new layout. Rather than running two sequential A/B tests, a factorial design tests everything at once, saves time, and reveals interactions that sequential tests would miss entirely.
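The main and interaction effects described above can be computed directly from the four cell means. A minimal sketch, using made-up cell values for illustration:

```python
# Illustrative sketch: the four cells of a 2x2 factorial and the effect
# estimates computed from cell means. The cell values here are invented.
cell_means = {
    (0, 0): 10.0,  # neither change (control)
    (1, 0): 12.0,  # change A only
    (0, 1): 11.0,  # change B only
    (1, 1): 15.0,  # both changes
}

# Main effect of A: the A=1 minus A=0 difference, averaged over B's levels.
main_a = ((cell_means[(1, 0)] - cell_means[(0, 0)])
          + (cell_means[(1, 1)] - cell_means[(0, 1)])) / 2

# Main effect of B, averaged over A's levels.
main_b = ((cell_means[(0, 1)] - cell_means[(0, 0)])
          + (cell_means[(1, 1)] - cell_means[(1, 0)])) / 2

# Interaction: does A's effect change depending on B's level?
interaction = ((cell_means[(1, 1)] - cell_means[(0, 1)])
               - (cell_means[(1, 0)] - cell_means[(0, 0)]))

print(main_a, main_b, interaction)  # 3.0 2.0 2.0
```

With these numbers the interaction is positive: the two changes together lift the metric by more than the sum of their individual effects.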

The analysis of a factorial experiment uses a linear model: Y_ijk = mu + alpha_i + beta_j + (alpha*beta)_ij + epsilon_ijk, where alpha_i and beta_j are the main effects of the two factors and (alpha*beta)_ij is their interaction. In practice, this is implemented as a regression with indicator variables for each factor and their product. The main effect of factor A is estimated by comparing all cells with A=1 to all cells with A=0, regardless of factor B's level, giving each main effect estimate the full sample size rather than the per-cell size. This is the efficiency advantage of factorial designs: with N total users, each main effect is estimated with precision comparable to an N/2-per-group A/B test, not an N/4-per-group test. The interaction term tests whether the effect of A differs depending on the level of B. A significant interaction means the factors are not independent and their combined effect differs from the sum of individual effects. The sample size requirement for detecting interactions is roughly four times that for detecting main effects of the same magnitude, so factorial experiments should be powered based on whether interaction detection is a priority.
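The regression described above can be sketched with simulated data and plain least squares. The effect sizes and noise level below are assumptions chosen for illustration, not recommendations:

```python
import numpy as np

# Minimal sketch of the factorial regression: indicator variables for
# each factor plus their product, fit by ordinary least squares.
rng = np.random.default_rng(0)
n = 4000

# Randomize each unit independently into the two binary factors; this
# produces the four cells of the 2x2 design in roughly equal proportions.
a = rng.integers(0, 2, n)
b = rng.integers(0, 2, n)

# Simulated outcome: main effects of 2.0 and 1.0, interaction of 0.5
# (all assumed values), plus unit-level noise.
y = 10 + 2.0 * a + 1.0 * b + 0.5 * a * b + rng.normal(0, 1, n)

# Design matrix: intercept, indicator for A, indicator for B, product term.
X = np.column_stack([np.ones(n), a, b, a * b])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

intercept, effect_a, effect_b, interaction = coef
print(effect_a, effect_b, interaction)  # close to 2.0, 1.0, 0.5
```

Note that the interaction coefficient is the noisiest of the three estimates, which is the sample-size point made above: detecting an interaction of a given size needs roughly four times the traffic of a main effect of the same size.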

Factorial designs should be used when teams want to evaluate multiple independent changes without extending the experimentation timeline, when there is reason to believe factors may interact, or when an organization wants to maximize learning per unit of traffic. Common pitfalls include running factorial designs with too many factors, which creates an unmanageable number of cells (a 2^5 design has 32 cells); lacking the traffic to power interaction tests; and interpreting main effects without checking for interactions, which can be misleading when strong interactions exist. Fractional factorial designs address the cell count problem by strategically omitting certain combinations while still estimating main effects and low-order interactions, under the assumption that higher-order interactions are negligible. This is formalized in the sparsity of effects principle.
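A fractional factorial can be generated by deriving one factor's level from the others. As a sketch, a 2^(5-1) half fraction sets the fifth factor equal to the product of the first four (the defining relation E = ABCD), halving the cell count from 32 to 16:

```python
from itertools import product

# Sketch of a 2^(5-1) half fraction. Factor levels are coded -1/+1,
# the usual convention for 2^k designs. Instead of all 32 combinations
# of five binary factors, we run the 16 combinations of A..D and set
# E = A*B*C*D (the defining relation E = ABCD).
full = list(product([-1, 1], repeat=4))  # 16 runs over factors A..D
half_fraction = [(a, b, c, d, a * b * c * d) for a, b, c, d in full]

print(len(half_fraction))  # 16 runs instead of 32

# Every run in the fraction satisfies A*B*C*D*E = +1, which is what
# determines the aliasing structure of the design.
assert all(a * b * c * d * e == 1 for a, b, c, d, e in half_fraction)
```

This particular fraction aliases each main effect only with a four-factor interaction, which the sparsity of effects principle treats as negligible.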

Advanced factorial design concepts include resolution, which describes which effects are confounded (aliased) in a fractional design. A Resolution III design can estimate main effects but not separate them from two-factor interactions, while Resolution V can estimate all main effects and two-factor interactions clearly. For digital experimentation, the most practical designs are 2^k full factorials with k = 2 or 3 factors, and 2^(k-p) fractional factorials for k > 3. Some experimentation platforms like Statsig support factorial experiments through their layered experiment infrastructure, where each factor is a separate experiment layer and the platform handles the combinatorial assignment. Response surface methodology extends factorial designs to find optimal continuous parameter values, useful for tuning algorithm parameters like recommendation weights or notification frequency.
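Aliasing in a low-resolution design can be seen directly in the design matrix. In a 2^(3-1) Resolution III fraction that sets C = AB, the column used to estimate C's main effect is identical to the A-by-B interaction column, so the two effects cannot be separated:

```python
from itertools import product

# Sketch of confounding in a Resolution III design: the 2^(3-1)
# fraction generated by C = A*B has only four runs, and the column
# for factor C coincides with the A*B interaction column.
runs = [(a, b, a * b) for a, b in product([-1, 1], repeat=2)]

col_c = [c for _, _, c in runs]
col_ab = [a * b for a, b, _ in runs]
print(col_c == col_ab)  # True: C's main effect is aliased with AB
```

A Resolution V design avoids this by ensuring no main effect or two-factor interaction shares a column with any other main effect or two-factor interaction.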

Related Terms

Multivariate Testing

An experimentation method that simultaneously tests multiple variables and their combinations to determine which combination of changes produces the best outcome, unlike A/B testing which typically varies a single element at a time.

Split Testing

The practice of randomly dividing users into two or more groups and exposing each group to a different version of a product experience to measure which version performs better on a target metric, commonly known as A/B testing.

Latin Square Design

An experimental design that controls for two known sources of variation by arranging treatments in a grid where each treatment appears exactly once in each row and column, efficiently balancing nuisance factors without requiring a full factorial experiment.

Holdout Testing

An experimental design where a small percentage of users are permanently excluded from receiving a new feature or set of features, serving as a long-term control group to measure the cumulative impact of product changes over time.

Power Analysis

A statistical calculation performed before an experiment to determine the minimum sample size required to detect a meaningful effect with a specified probability, balancing the risk of false negatives against practical constraints like traffic and experiment duration.

Effect Size

A quantitative measure of the magnitude of a treatment's impact, expressed either as an absolute difference between groups or as a standardized metric like Cohen's d that allows comparison across different experiments and metrics.