Multiple Comparison Correction
Statistical adjustments applied when testing multiple hypotheses simultaneously to control the overall probability of making at least one Type I error, preventing the inflation of false positive rates that occurs when many tests are conducted.
Multiple comparison correction addresses a fundamental statistical problem: when you test many hypotheses, the probability of at least one false positive grows rapidly even if each individual test is conducted at a low significance level. With 20 independent tests at alpha = 0.05, the probability of at least one false positive is 1 - (0.95)^20 ≈ 0.64, meaning you are more likely than not to find a spurious significant result. For growth teams, this problem is pervasive because experiments typically track dozens of metrics (primary, secondary, and guardrail), analyze multiple user segments, and may test several variants, creating hundreds of implicit hypothesis tests. Without correction, teams will regularly ship changes based on false positives found through what amounts to data mining.
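The arithmetic above is easy to verify. The function below is a minimal illustration, not library code; it assumes independent tests with all null hypotheses true:

```python
# Probability of at least one false positive across m independent tests,
# each run at significance level alpha, assuming every null hypothesis is true.
def familywise_error_rate(m: int, alpha: float = 0.05) -> float:
    return 1 - (1 - alpha) ** m

print(round(familywise_error_rate(1), 2))    # 0.05
print(round(familywise_error_rate(20), 2))   # 0.64
print(round(familywise_error_rate(100), 2))  # 0.99
```

At 100 tests the familywise error rate is essentially 1: some metric will look significant purely by chance.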
The most common correction methods fall into two categories: those controlling the familywise error rate (FWER) and those controlling the false discovery rate (FDR). The Bonferroni correction is the simplest FWER method: divide the significance threshold alpha by the number of tests m, so each test uses alpha/m. With 20 tests and alpha = 0.05, each test must achieve p < 0.0025 to be declared significant. While simple and valid, Bonferroni is conservative, especially when tests are correlated (as metrics often are). The Holm-Bonferroni method is a step-down procedure that is uniformly more powerful: order the p-values from smallest to largest, and compare each to alpha/(m-k+1) where k is its rank, stopping at the first non-rejection. For FDR control, the Benjamini-Hochberg procedure is standard: order the p-values, and find the largest k such that p(k) <= k*alpha/m, then reject all hypotheses with p-values at or below p(k). This controls the expected proportion of false discoveries among all discoveries, which is often more appropriate for exploratory analysis.
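All three procedures can be sketched in a few lines each. The implementations below are minimal illustrations (no tie handling or vectorization), not a substitute for a statistics library:

```python
def bonferroni(pvals, alpha=0.05):
    """FWER control: reject any hypothesis with p < alpha / m."""
    m = len(pvals)
    return [p < alpha / m for p in pvals]

def holm(pvals, alpha=0.05):
    """Step-down FWER control: compare the k-th smallest p-value to
    alpha / (m - k + 1), stopping at the first non-rejection."""
    m = len(pvals)
    reject = [False] * m
    for k, i in enumerate(sorted(range(m), key=lambda i: pvals[i]), start=1):
        if pvals[i] > alpha / (m - k + 1):
            break
        reject[i] = True
    return reject

def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up FDR control: find the largest k with p_(k) <= k * alpha / m,
    then reject every hypothesis with a p-value at or below p_(k)."""
    m = len(pvals)
    thr = 0.0
    for k, i in enumerate(sorted(range(m), key=lambda i: pvals[i]), start=1):
        if pvals[i] <= k * alpha / m:
            thr = pvals[i]
    return [p <= thr for p in pvals] if thr > 0 else [False] * m
```

On the p-values [0.001, 0.013, 0.022, 0.05, 0.34] at alpha = 0.05, Bonferroni and Holm reject only the first hypothesis, while Benjamini-Hochberg rejects the first three, illustrating the extra power gained by controlling FDR instead of FWER.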
Teams should decide upfront which correction method to use based on the analysis context. For the primary metric of an experiment, no correction is needed since there is only one test. For secondary and exploratory metrics, FDR control via Benjamini-Hochberg is usually appropriate since the cost of a false discovery is lower and the conservatism of FWER methods would hide real effects. For guardrail metrics where the cost of missing a degradation is high, FWER methods like Holm-Bonferroni are appropriate. Many teams sidestep the problem by designating a single primary metric that determines the ship decision and treating all other metrics as informational, not requiring correction. Common pitfalls include forgetting to count subgroup analyses as additional tests, applying no correction while examining dozens of metrics, and being so conservative with correction that no experiment ever reaches significance.
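The subgroup-counting pitfall is worth making concrete. The counts below are made-up illustrative numbers, but they show how quickly a routine experiment readout turns into over a hundred implicit tests:

```python
# Illustrative (made-up) counts for a typical experiment readout.
metrics = 12    # secondary and guardrail metrics tracked
segments = 5    # e.g. platform, region, new vs. returning users
variants = 2    # treatment arms, each compared against control

m = metrics * segments * variants
print(m)           # 120 implicit hypothesis tests
print(0.05 / m)    # Bonferroni per-test threshold: roughly 0.0004
```

A team that examines all 120 of these results at an uncorrected p < 0.05 should expect several false positives per experiment.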
Advanced approaches include graphical multiple comparison procedures that encode the logical relationships between hypotheses, allowing alpha to be recycled from rejected hypotheses to remaining ones in a structured way. For example, if the primary metric is significant, its alpha can be reallocated to secondary metrics, increasing their power. Online FDR control methods like LOND and LORD extend FDR control to the setting where hypotheses arrive sequentially over time, which is relevant for organizations analyzing a continuous stream of experiment results. Resampling-based methods like permutation testing and the bootstrap can provide exact or near-exact control without the conservatism of analytic corrections, especially when test statistics are correlated. Modern experimentation platforms like Statsig increasingly implement intelligent correction strategies automatically, applying different correction levels to primary, secondary, and exploratory metrics.
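Of these advanced approaches, the resampling idea is the easiest to sketch. The function below is a toy single-step max-statistic permutation adjustment in the spirit of Westfall-Young, using absolute mean differences as the test statistic; a real implementation would use studentized statistics and vectorized resampling:

```python
import random

def max_t_adjusted_pvalues(treatment, control, n_perm=2000, seed=0):
    """treatment, control: lists of per-user metric vectors.
    Returns one FWER-adjusted p-value per metric."""
    rng = random.Random(seed)
    n_metrics = len(treatment[0])

    def abs_mean_diffs(a, b):
        # Absolute difference in group means, per metric.
        return [abs(sum(u[j] for u in a) / len(a) - sum(u[j] for u in b) / len(b))
                for j in range(n_metrics)]

    observed = abs_mean_diffs(treatment, control)
    pooled = treatment + control
    n_t = len(treatment)
    exceed = [0] * n_metrics
    for _ in range(n_perm):
        rng.shuffle(pooled)  # relabel users under the null of no effect
        max_stat = max(abs_mean_diffs(pooled[:n_t], pooled[n_t:]))
        for j in range(n_metrics):
            if max_stat >= observed[j]:
                exceed[j] += 1
    # Comparing each observed statistic to the permutation distribution of the
    # MAXIMUM across metrics controls FWER while adapting to metric correlation.
    return [(e + 1) / (n_perm + 1) for e in exceed]
```

Because the null distribution is built from the data itself, correlated metrics are handled automatically, avoiding the conservatism of Bonferroni-style analytic corrections.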
Related Terms
Type I Error
The error of incorrectly rejecting a true null hypothesis, also known as a false positive, where an experiment concludes that a treatment has an effect when in reality there is no true difference between treatment and control.
False Discovery Rate
The expected proportion of false positives among all statistically significant results, offering a less conservative alternative to familywise error rate control that is more appropriate when many hypotheses are tested and some false discoveries are acceptable.
Peeking Problem
The statistical inflation of false positive rates that occurs when experimenters repeatedly check experiment results and stop the test as soon as statistical significance is observed, rather than waiting for the pre-determined sample size to be reached.
Multivariate Testing
An experimentation method that simultaneously tests multiple variables and their combinations to determine which combination of changes produces the best outcome, unlike A/B testing which typically varies a single element at a time.
Split Testing
The practice of randomly dividing users into two or more groups and exposing each group to a different version of a product experience to measure which version performs better on a target metric, commonly known as A/B testing.
Holdout Testing
An experimental design where a small percentage of users are permanently excluded from receiving a new feature or set of features, serving as a long-term control group to measure the cumulative impact of product changes over time.