Effect Size
A quantitative measure of the magnitude of a treatment's impact, expressed either as an absolute difference between groups or as a standardized metric like Cohen's d that allows comparison across different experiments and metrics.
Effect size measures how large a difference a treatment produces, separate from whether that difference is statistically significant. Statistical significance tells you whether an effect exists; effect size tells you whether it matters. A tiny effect can be statistically significant with a large enough sample, while a meaningful effect can fail to reach significance in an underpowered test. For growth and advertising teams, effect size is ultimately what determines business impact. A statistically significant 0.01% improvement in click-through rate may be real but commercially irrelevant, while a 5% improvement in conversion rate could be worth millions in annual revenue. Understanding effect sizes helps teams prioritize which experiments to run, set appropriate sample sizes, and evaluate whether winning experiments justify implementation costs.
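The significance-versus-magnitude distinction above can be demonstrated numerically. The sketch below is a minimal stdlib-only two-proportion z-test; the sample counts are hypothetical, chosen to show a tiny effect reaching significance at a huge sample size while a meaningful effect fails in an underpowered test:

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a two-proportion z-test (pooled variance)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability

# Tiny lift (10.00% -> 10.01%) with 200M users per arm: significant (p < 0.05)
p_tiny = two_proportion_p_value(20_000_000, 200_000_000, 20_020_000, 200_000_000)

# Meaningful lift (10% -> 12%) with only 500 users per arm: not significant
p_underpowered = two_proportion_p_value(50, 500, 60, 500)
```

The first result is statistically significant yet almost certainly commercially irrelevant; the second is the kind of effect worth chasing but needs a larger sample to detect.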
Effect size can be expressed in several ways depending on context. The absolute effect size is simply the difference in metric values between treatment and control, such as a 2 percentage point increase in conversion rate from 10% to 12%. The relative effect size expresses this as a percentage change from baseline: a 20% relative lift in the example above. Standardized effect sizes like Cohen's d divide the absolute difference by the pooled standard deviation, producing a unit-free measure that allows comparison across different metrics: Cohen's d = (mean_treatment - mean_control) / pooled_SD. Convention labels d = 0.2 as small, d = 0.5 as medium, and d = 0.8 as large, though these benchmarks were developed for behavioral science and may not apply directly to digital experimentation where even small standardized effects can have large business impact at scale. For proportion metrics, the effect size can also be expressed using the odds ratio or risk ratio. Experimentation platforms typically report both absolute and relative effect sizes along with confidence intervals.
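The three expressions above can be computed together for a proportion metric. This is a sketch using the 10% → 12% conversion example from the text, approximating the pooled SD from the Bernoulli variance p(1-p) under an assumed equal split between arms:

```python
import math

def effect_sizes(p_control, p_treatment):
    """Absolute lift, relative lift, and Cohen's d for a proportion metric."""
    absolute = p_treatment - p_control
    relative = absolute / p_control
    # Bernoulli variance p(1-p); pooled SD assumes equal group sizes
    pooled_sd = math.sqrt(
        (p_control * (1 - p_control) + p_treatment * (1 - p_treatment)) / 2
    )
    d = absolute / pooled_sd
    return absolute, relative, d

absolute, relative, d = effect_sizes(0.10, 0.12)
# absolute = 0.02 (2 percentage points), relative = 0.20 (20% lift), d ≈ 0.06
```

Note that a lift worth shipping in a conversion funnel yields a Cohen's d well below the "small" benchmark of 0.2, illustrating why behavioral-science conventions transfer poorly to digital experimentation.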
Teams should focus on the practical significance of effect sizes rather than relying solely on p-values. A useful framework is defining the minimum effect size of interest (MESOI) before the experiment: what is the smallest improvement that would justify shipping the change given its implementation and maintenance costs? This feeds directly into power analysis and helps teams avoid celebrating statistically significant but trivially small effects. Common pitfalls include confusing relative and absolute effect sizes in communication (a 50% relative increase from 0.2% to 0.3% sounds impressive but is tiny in absolute terms), ignoring the confidence interval around the point estimate (a point estimate of +5% with a 95% CI of [-1%, +11%] is much less actionable than +5% with CI [+3%, +7%]), and failing to consider effect size heterogeneity across user segments.
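The link from MESOI to power analysis can be made concrete with the standard sample-size formula for a two-sided two-proportion z-test. A stdlib-only sketch (the function name and the 2-point MESOI are illustrative, not prescriptive):

```python
import math
from statistics import NormalDist

def sample_size_per_arm(p_base, mesoi_abs, alpha=0.05, power=0.80):
    """Per-arm sample size to detect an absolute lift of mesoi_abs
    on baseline rate p_base with a two-sided z-test."""
    p_alt = p_base + mesoi_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mesoi_abs ** 2)

# Detecting a 2 pp MESOI on a 10% baseline needs roughly 3,800 users per arm
n = sample_size_per_arm(0.10, 0.02)
```

Halving the MESOI roughly quadruples the required sample, which is why defining the smallest effect worth shipping before the test is so consequential for experiment duration.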
Advanced considerations include using effect size distributions from an organization's experiment history to calibrate expectations. Research by Ron Kohavi and others shows that most A/B tests in industry produce small effects (median around 0-2% relative change), which has important implications for power analysis and experiment prioritization. Meta-analysis techniques aggregate effect sizes across related experiments to produce more precise estimates of a treatment category's typical impact. For Bayesian analysis, the effect size prior should reflect realistic expectations based on historical data rather than uninformative priors. Heterogeneous treatment effect analysis, using methods like causal forests, reveals how effect size varies across user segments, often finding that an overall small effect masks large positive effects for some users and negative effects for others.
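The meta-analysis aggregation mentioned above is often done with inverse-variance weighting under a fixed-effect model. A minimal sketch, with hypothetical effect estimates and standard errors from three related experiments:

```python
import math

def fixed_effect_meta(effects, std_errors):
    """Inverse-variance weighted pooled effect and its standard error."""
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled_se

# Hypothetical relative lifts from three experiments in one treatment category
pooled, pooled_se = fixed_effect_meta(
    effects=[0.01, 0.03, 0.02],
    std_errors=[0.010, 0.020, 0.015],
)
```

The pooled standard error is smaller than that of any individual experiment, which is the precision gain meta-analysis offers when estimating a treatment category's typical impact. A random-effects model would be more appropriate when the underlying effects genuinely differ across experiments.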
Related Terms
Power Analysis
A statistical calculation performed before an experiment to determine the minimum sample size required to detect a meaningful effect with a specified probability, balancing the risk of false negatives against practical constraints like traffic and experiment duration.
Minimum Detectable Effect
The smallest improvement in a metric that an experiment is designed to reliably detect with a given level of statistical power and significance, determining the practical sensitivity of the test.
Practical Significance
The assessment of whether a statistically significant experiment result represents a meaningful business impact that justifies the cost of implementation, maintenance, and complexity of shipping the change, distinct from mere statistical significance.
Confidence Interval
A range of values, derived from sample data, that is expected to contain the true population parameter with a specified probability, providing both an estimate of the treatment effect and the precision of that estimate.
Multivariate Testing
An experimentation method that simultaneously tests multiple variables and their combinations to determine which combination of changes produces the best outcome, unlike A/B testing which typically varies a single element at a time.
Split Testing
The practice of randomly dividing users into two or more groups and exposing each group to a different version of a product experience to measure which version performs better on a target metric, commonly known as A/B testing.