
Type II Error

The error of failing to reject a false null hypothesis, also known as a false negative: the experiment fails to detect a real treatment effect and concludes there is no difference when one actually exists.

A Type II error occurs when an experiment misses a genuine treatment effect, typically because the sample size is too small to distinguish the signal from noise. The probability of a Type II error is denoted beta, and its complement, statistical power (1 - beta), is the probability of detecting a true effect of a given size. Even at the conventional power target of 80%, there is a 20% chance of missing a real effect of the specified size. For growth teams, Type II errors are arguably more costly than Type I errors in many contexts: they lead to abandoning effective changes, creating a bias toward the status quo and slowing the pace of product improvement. A team that consistently runs underpowered experiments will discard many genuinely beneficial ideas, conclude that experimentation does not work, and may abandon the experimentation practice altogether.
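The cost of underpowering can be made concrete by simulation. The sketch below (hypothetical numbers: a true lift of 0.2 standard deviations, 50 users per arm, a two-sided z-test with known unit variance) repeatedly runs an "experiment" on a real effect and counts how often it comes back non-significant, i.e. the empirical Type II error rate:

```python
import random, math
from statistics import NormalDist

random.seed(0)
nd = NormalDist()

def two_sample_z_p(a, b):
    # Two-sided z-test p-value, assuming known unit variance (sigma = 1)
    n = len(a)
    z = (sum(b) / n - sum(a) / n) / math.sqrt(2.0 / n)
    return 2 * (1 - nd.cdf(abs(z)))

n, delta, trials = 50, 0.2, 2000   # hypothetical: 50 users/arm, true lift 0.2 sd
misses = 0
for _ in range(trials):
    control = [random.gauss(0.0, 1.0) for _ in range(n)]
    treat   = [random.gauss(delta, 1.0) for _ in range(n)]
    if two_sample_z_p(control, treat) >= 0.05:
        misses += 1   # a real effect went undetected: a Type II error

beta_hat = misses / trials
print(f"empirical Type II error rate: {beta_hat:.2f}")  # roughly 0.83 analytically
```

At this sample size the test misses the real effect more than four times out of five, which is exactly the "discard genuinely beneficial ideas" failure mode described above.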

Type II errors are directly related to statistical power, which depends on four factors: the significance level alpha (a higher alpha increases power but also Type I error risk), the sample size (larger samples increase power), the true effect size (larger effects are easier to detect), and the variance of the metric (lower variance increases power). The relationship is captured in the normal-approximation power formula for a two-sided test: power = P(reject H0 | H1 is true) ≈ P(Z > z_alpha/2 - delta*sqrt(n)/(2*sigma)), where delta is the true effect size, sigma is the standard deviation, and n is the total sample size split evenly across two arms. When an experiment fails to find a significant result, it does not prove the treatment has no effect; it only means the data were insufficient to detect an effect at the specified power level. The confidence interval around the null result is informative: a tight CI around zero suggests the true effect is genuinely small, while a wide CI means the test was simply inconclusive.
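The power formula can also be inverted to answer the planning question "how many users do I need?". A minimal sketch, using the same convention as the formula above (n is the total sample size split evenly across two arms) and hypothetical inputs of delta = 0.2 and sigma = 1:

```python
from math import sqrt, ceil
from statistics import NormalDist

nd = NormalDist()

def power(delta, sigma, n, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test.
    n is the TOTAL sample size, split evenly across two arms."""
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    return 1 - nd.cdf(z_alpha - delta * sqrt(n) / (2 * sigma))

def required_n(delta, sigma, target_power=0.80, alpha=0.05):
    """Invert the formula: total n needed to reach target power."""
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_beta = nd.inv_cdf(target_power)
    return ceil((2 * sigma * (z_alpha + z_beta) / delta) ** 2)

n = required_n(delta=0.2, sigma=1.0)    # 785 users total for these inputs
print(n, round(power(0.2, 1.0, n), 3))  # power lands just above 0.80
```

This approximation ignores the negligible chance of rejecting in the wrong direction; dedicated power libraries refine it, but the arithmetic above matches the formula in the text.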

To minimize Type II errors, teams should perform a rigorous power analysis before every experiment, targeting 80% or higher power for the minimum effect size of practical interest. When traffic is limited, several strategies can increase power without more users. Variance reduction techniques like CUPED can reduce metric variance by 30-50% by controlling for pre-experiment covariates, which is equivalent to increasing the effective sample size. Using more sensitive metrics (e.g., revenue per user instead of conversion rate) can increase the signal-to-noise ratio. Restricting analysis to triggered users who actually encountered the change eliminates noise from unaffected users. Teams should also be cautious about interpreting null results: rather than declaring no effect, report the confidence interval and the minimum effect size the experiment could have detected. This practice prevents the common mistake of dismissing a change as ineffective because a single underpowered test showed no significant effect.
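The CUPED adjustment mentioned above is short enough to sketch in full. The idea: regress the in-experiment metric on its pre-experiment value and subtract the predicted component, leaving less variance for the test to fight through. The data below are simulated (a hypothetical metric whose pre- and post-period values correlate), not real experiment output:

```python
import random
from statistics import mean, variance

random.seed(1)

# Simulated users: in-experiment metric correlates with pre-experiment value
pre  = [random.gauss(100, 20) for _ in range(5000)]
post = [0.7 * p + random.gauss(30, 14) for p in pre]

def cuped_adjust(y, x):
    """CUPED: subtract the component of y explained by the
    pre-experiment covariate x (theta = cov(x, y) / var(x))."""
    mx, my = mean(x), mean(y)
    theta = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
            sum((xi - mx) ** 2 for xi in x)
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]

adjusted = cuped_adjust(post, pre)
reduction = 1 - variance(adjusted) / variance(post)
print(f"variance reduced by {reduction:.0%}")  # around 50% for this simulation
```

The achievable reduction in practice depends on how strongly the pre-period covariate predicts the in-experiment metric; the 30-50% range cited above assumes a reasonably sticky metric.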

Advanced considerations include the distinction between individual experiment power and portfolio-level power. An organization running many underpowered experiments at 50% power will miss half of all true effects, creating systematic underestimation of what experimentation can deliver. Some teams use meta-analysis of past experiments to estimate the typical effect size in their domain and calibrate power requirements accordingly. Sequential testing designs offer a potential reduction in expected sample size by allowing early stopping for both efficacy and futility, where a futility boundary allows declaring a Type II error early when the accumulating data strongly suggest the effect is negligibly small. Bayesian approaches frame the problem differently, computing the posterior probability that the effect exceeds a meaningful threshold, which directly addresses the decision-relevant question rather than the binary reject/fail-to-reject framework.
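The Bayesian framing in the last sentence can be sketched with a conjugate normal model. Everything numeric here is a hypothetical choice: a N(0, prior_sd^2) prior on the true lift (prior_sd = 0.05 stands in for business judgment), an observed lift of +1.5% with a 1% standard error, and a +0.5% threshold of practical meaningfulness:

```python
from math import sqrt
from statistics import NormalDist

def prob_effect_exceeds(obs_lift, se, threshold, prior_sd=0.05):
    """Posterior P(true effect > threshold) under a conjugate normal model:
    prior N(0, prior_sd^2) on the true lift, likelihood N(effect, se^2)
    for the observed lift. prior_sd is an assumed business prior."""
    post_var = 1 / (1 / prior_sd**2 + 1 / se**2)       # precision-weighted
    post_mean = post_var * (obs_lift / se**2)          # shrunk toward 0
    return 1 - NormalDist(post_mean, sqrt(post_var)).cdf(threshold)

p = prob_effect_exceeds(obs_lift=0.015, se=0.010, threshold=0.005)
print(f"P(lift > 0.5%) = {p:.2f}")  # → 0.83
```

Instead of a binary reject/fail-to-reject verdict, the output is the decision-relevant quantity directly: an 83% posterior probability that the lift clears the bar, which a team can weigh against the cost of shipping.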
