Type II Error
The error of failing to reject a false null hypothesis, also known as a false negative, where an experiment fails to detect a real treatment effect, concluding there is no difference when one actually exists.
A Type II error occurs when an experiment misses a genuine treatment effect, typically because the sample size is too small to distinguish the signal from noise. The probability of a Type II error is denoted beta; its complement, statistical power (1 - beta), is the probability of detecting a true effect of the specified size. With the conventional power target of 80%, there is still a 20% chance of missing a real effect of that size. For growth teams, Type II errors are arguably more costly than Type I errors in many contexts: they lead to abandoning effective changes, creating a bias toward the status quo and slowing the pace of product improvement. A team that consistently runs underpowered experiments will discard many genuinely beneficial ideas, conclude that experimentation does not work, and may abandon the experimentation practice altogether.
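The cost of underpowering can be made concrete with a small simulation. The sketch below (illustrative parameters: a real 0.3-standard-deviation effect, 100 users per group, a two-sided z-test at alpha = 0.05) counts how often an underpowered test fails to reach significance even though the effect is genuinely there:

```python
import math
import random

def z_test_significant(a, b, z_crit=1.96):
    """Two-sample z-test on means; True if |z| exceeds the two-sided
    critical value (p < 0.05 at z_crit = 1.96)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (mb - ma) / math.sqrt(va / na + vb / nb)
    return abs(z) > z_crit

random.seed(0)
runs, misses = 500, 0
for _ in range(runs):
    control = [random.gauss(0.0, 1.0) for _ in range(100)]
    # A real effect of +0.3 standard deviations exists in every run.
    treatment = [random.gauss(0.3, 1.0) for _ in range(100)]
    if not z_test_significant(control, treatment):
        misses += 1  # a Type II error: a true effect went undetected

# With these parameters the analytical Type II error rate is roughly 44%,
# i.e. this design has only about 56% power.
print(f"Type II error rate: {misses / runs:.0%}")
```

Nearly half of the genuinely effective "treatments" go undetected, which is exactly the failure mode the paragraph above describes.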
Type II errors are directly related to statistical power, which depends on four factors: the significance level alpha (a higher alpha increases power but also Type I error risk), the sample size (larger samples increase power), the true effect size (larger effects are easier to detect), and the variance of the metric (lower variance increases power). For a two-sided test comparing two equal groups, the relationship is captured in the power formula: power = P(reject H0 | H1 is true) = P(Z > z_alpha/2 - delta*sqrt(n)/(2*sigma)), where delta is the true effect size, sigma is the standard deviation of the metric, and n is the total sample size split evenly between the two groups. When an experiment fails to find a significant result, it does not prove the treatment has no effect; it only means the data were insufficient to detect an effect at the specified power level. The confidence interval around the null result is informative: a tight CI around zero suggests the true effect is genuinely small, while a wide CI means the test was simply inconclusive.
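The power formula above can be evaluated directly with the standard normal CDF. A minimal sketch (the function name and hard-coded critical value z_0.025 = 1.96 are illustrative choices, and n is the total sample size split evenly across two groups, as in the formula):

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

def power(delta, sigma, n, z_alpha=1.96):
    """Power = P(Z > z_{alpha/2} - delta*sqrt(n)/(2*sigma)) for a
    two-sided test with n total users split evenly between two groups."""
    return 1.0 - normal_cdf(z_alpha - delta * math.sqrt(n) / (2 * sigma))

# A 0.3-sigma effect with 200 total users yields only ~56% power,
# matching the intuition that small samples miss moderate effects.
print(round(power(0.3, 1.0, 200), 3))
# Quadrupling the sample raises power substantially.
print(round(power(0.3, 1.0, 800), 3))
```

Note how power responds to each input: doubling delta or halving sigma has the same effect on the argument of the CDF as quadrupling n, which is why variance reduction and more sensitive metrics (discussed below) are such effective substitutes for additional traffic.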
To minimize Type II errors, teams should perform a rigorous power analysis before every experiment, targeting 80% or higher power for the minimum effect size of practical interest. When traffic is limited, several strategies can increase power without more users. Variance reduction techniques like CUPED can reduce metric variance by 30-50% by controlling for pre-experiment covariates, effectively increasing the sample size. Using more sensitive metrics (e.g., revenue per user instead of conversion rate) can increase the signal-to-noise ratio. Restricting analysis to triggered users who actually encountered the change eliminates noise from unaffected users. Teams should also be cautious about interpreting null results: rather than declaring no effect, report the confidence interval and the minimum effect size the experiment could have detected. This practice prevents the common mistake of treating "not statistically significant" as "no effect" and discarding a change solely because an underpowered test failed to detect its benefit.
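The core of CUPED is a simple covariate adjustment. The sketch below (function names and the simulated data are illustrative; real pipelines use each user's pre-experiment value of the same metric as the covariate) subtracts the component of the experiment metric that is predictable from pre-experiment behavior, shrinking variance without changing the mean:

```python
import random

def cuped_adjust(y, x):
    """CUPED adjustment: y_cuped = y - theta * (x - mean(x)),
    with theta = cov(x, y) / var(x). The mean of y is preserved,
    but variance explained by the pre-experiment covariate x is removed."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    var = sum((xi - mx) ** 2 for xi in x) / (n - 1)
    theta = cov / var
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]

def variance(v):
    m = sum(v) / len(v)
    return sum((vi - m) ** 2 for vi in v) / (len(v) - 1)

random.seed(1)
# Simulated users: pre-period metric x, in-experiment metric y correlated with x.
x = [random.gauss(0.0, 1.0) for _ in range(5000)]
y = [xi + random.gauss(0.0, 1.0) for xi in x]

y_adj = cuped_adjust(y, x)
# The variance reduction equals the squared correlation between x and y;
# with this simulated data the adjusted variance is roughly halved.
print(round(variance(y), 2), round(variance(y_adj), 2))
```

Because power depends on delta/sigma, halving the variance here buys the same power gain as doubling the sample size, which is why CUPED is the first tool teams reach for when traffic is the binding constraint.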
Advanced considerations include the distinction between individual experiment power and portfolio-level power. An organization running many underpowered experiments at 50% power will miss half of all true effects, creating systematic underestimation of what experimentation can deliver. Some teams use meta-analysis of past experiments to estimate the typical effect size in their domain and calibrate power requirements accordingly. Sequential testing designs offer a potential reduction in expected sample size by allowing early stopping for both efficacy and futility: a futility boundary permits stopping early and accepting the null when the accumulating data strongly suggest the effect is negligibly small. Bayesian approaches frame the problem differently, computing the posterior probability that the effect exceeds a meaningful threshold, which directly addresses the decision-relevant question rather than the binary reject/fail-to-reject framework.
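The Bayesian quantity described above has a closed form under common simplifying assumptions. A minimal sketch (the function name is illustrative, and it assumes a flat prior with a normal approximation to the posterior, i.e. effect | data ~ Normal(d_hat, se^2)):

```python
import math

def prob_effect_exceeds(d_hat, se, threshold=0.0):
    """Posterior P(true effect > threshold), assuming a flat prior and a
    normal posterior approximation centered on the observed lift d_hat
    with standard error se."""
    z = (threshold - d_hat) / se
    return 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

# Observed lift of 0.02 with standard error 0.01: the probability that the
# true effect exceeds a "meaningful" threshold of 0.01.
print(round(prob_effect_exceeds(0.02, 0.01, threshold=0.01), 3))  # -> 0.841
```

A statement like "84% probability the effect exceeds the meaningful threshold" maps directly onto a ship/no-ship decision, whereas a fail-to-reject result leaves the team to guess whether the effect was absent or merely undetected.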
Related Terms
Type I Error
The error of incorrectly rejecting a true null hypothesis, also known as a false positive, where an experiment concludes that a treatment has an effect when in reality there is no true difference between treatment and control.
Power Analysis
A statistical calculation performed before an experiment to determine the minimum sample size required to detect a meaningful effect with a specified probability, balancing the risk of false negatives against practical constraints like traffic and experiment duration.
Effect Size
A quantitative measure of the magnitude of a treatment's impact, expressed either as an absolute difference between groups or as a standardized metric like Cohen's d that allows comparison across different experiments and metrics.
Minimum Detectable Effect
The smallest improvement in a metric that an experiment is designed to reliably detect with a given level of statistical power and significance, determining the practical sensitivity of the test.
Multivariate Testing
An experimentation method that simultaneously tests multiple variables and their combinations to determine which combination of changes produces the best outcome, unlike A/B testing, which typically varies a single element at a time.
Split Testing
The practice of randomly dividing users into two or more groups and exposing each group to a different version of a product experience to measure which version performs better on a target metric, commonly known as A/B testing.