Confidence Interval
A range of values, derived from sample data, constructed by a procedure that captures the true population parameter at a specified confidence level (e.g., in 95% of repeated experiments), providing both an estimate of the treatment effect and the precision of that estimate.
A confidence interval (CI) provides far more information than a simple point estimate or p-value by quantifying the uncertainty around a measured effect. A 95% confidence interval means that if the experiment were repeated many times, 95% of the computed intervals would contain the true effect. For growth teams, confidence intervals are essential for making informed ship/no-ship decisions because they communicate both the likely magnitude and the range of plausible values for a treatment effect. A point estimate of +3% conversion lift tells you the most likely outcome, but the confidence interval [+0.5%, +5.5%] tells you the best and worst realistic scenarios, enabling proper risk assessment and revenue forecasting.
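The repeated-sampling interpretation above can be checked directly by simulation. The sketch below (with assumed, illustrative parameters: true mean lift of 0.10, noise SD of 0.30, 500 users per sample) builds a 95% CI for each of many simulated experiments and counts how often the interval covers the true value; the empirical coverage should land near 95%.

```python
import random
import statistics

# Illustrative simulation of CI coverage: all parameters are assumptions.
random.seed(0)
TRUE_MEAN, SIGMA, N, Z = 0.10, 0.30, 500, 1.96

trials = 2000
covered = 0
for _ in range(trials):
    # Draw one simulated experiment's worth of per-user outcomes.
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / N ** 0.5
    lo, hi = mean - Z * se, mean + Z * se
    if lo <= TRUE_MEAN <= hi:
        covered += 1

# Should be close to 0.95: the guarantee is about the procedure, not
# about any one computed interval.
print(f"Empirical coverage: {covered / trials:.1%}")
```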
The standard confidence interval for a difference in means is calculated as (X_bar_treatment - X_bar_control) +/- Z_alpha/2 * SE, where SE is the standard error of the difference, computed as sqrt(s1^2/n1 + s2^2/n2). For proportions, the SE uses the proportion formula sqrt(p1(1-p1)/n1 + p2(1-p2)/n2). The width of the confidence interval is inversely proportional to the square root of the sample size, meaning that quadrupling the sample size halves the interval width. Experimentation platforms like Statsig, Optimizely, and Eppo display confidence intervals prominently in their dashboards. Many platforms also offer Bayesian credible intervals, which have a more intuitive interpretation: a 95% credible interval means there is a 95% probability that the true parameter lies within the interval, given the data and prior. This distinction matters because the frequentist CI's coverage guarantee applies to the procedure across hypothetical repetitions, not to the specific interval computed.
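The proportion formula above can be written as a small helper. This is a minimal sketch using the normal-approximation SE from the text; the function name and the 20,000-users-per-arm example numbers are illustrative assumptions, not from any particular platform.

```python
import math

def proportion_diff_ci(conv_t, n_t, conv_c, n_c, z=1.96):
    """95% CI for the difference in conversion rates (treatment - control),
    using SE = sqrt(p1(1-p1)/n1 + p2(1-p2)/n2) from the text."""
    p_t, p_c = conv_t / n_t, conv_c / n_c
    se = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    diff = p_t - p_c
    return diff - z * se, diff + z * se

# Hypothetical experiment: 10.5% vs 10.0% conversion, 20k users per arm.
lo, hi = proportion_diff_ci(2100, 20000, 2000, 20000)
print(f"Lift CI: [{lo:+.4f}, {hi:+.4f}]")
```

Note that with these sample sizes the interval still straddles zero, which is exactly the kind of nuance a bare p-value would hide.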
Teams should use confidence intervals as the primary output of experiment analysis rather than relying on binary significant/not-significant conclusions. A result can be not statistically significant but still highly informative: a 95% CI of [-0.5%, +4%] for conversion lift suggests the treatment is unlikely to be harmful and has a good chance of being beneficial, which might justify shipping. Conversely, a statistically significant result with a wide CI that includes trivially small effects may not warrant the implementation investment. Common pitfalls include misinterpreting the confidence level (it is not the probability that the true value is in this specific interval in the frequentist framework), ignoring the width of the interval when making decisions, and not adjusting confidence levels when examining multiple metrics or segments simultaneously. When running multiple comparisons, the family-wise confidence level is lower than the individual interval level.
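One common way to keep the family-wise level at 95% across multiple metrics, as the paragraph above suggests, is a Bonferroni adjustment: split alpha across the m intervals and widen each one accordingly. A minimal sketch (Bonferroni is one choice among several; platforms may use other corrections):

```python
from statistics import NormalDist

def familywise_z(alpha=0.05, m=1):
    """z multiplier giving (1 - alpha) family-wise coverage across m
    simultaneous intervals via a Bonferroni split of alpha."""
    return NormalDist().inv_cdf(1 - (alpha / m) / 2)

# How much wider each interval must be when 5 metrics are examined together:
print(f"single interval: z = {familywise_z(m=1):.3f}")  # ~1.960
print(f"5 metrics:       z = {familywise_z(m=5):.3f}")  # ~2.576
```

Each per-metric interval grows by roughly 30% in width here, which is the price of being able to make a joint claim about all five metrics at once.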
Advanced applications include using confidence intervals for equivalence testing, where the goal is to show that a new system performs within an acceptable range of the old one rather than showing superiority. If the entire CI falls within a pre-specified equivalence margin, the treatment is declared equivalent. This is valuable for infrastructure migrations, code refactors, and cost-reduction changes where the goal is to confirm no harm. Confidence intervals also enable more nuanced meta-analysis across experiments, where overlapping CIs from multiple tests can be combined using inverse-variance weighting to produce a pooled estimate with a narrower interval. For sequential experiments that allow early stopping, the confidence intervals must be adjusted using methods like alpha spending functions to maintain valid coverage despite the multiple looks at the data.
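The inverse-variance pooling mentioned above can be sketched as a fixed-effect meta-analysis: each experiment's estimate is weighted by 1/SE^2, and the pooled SE shrinks below any individual experiment's SE. The three experiments below are hypothetical numbers for illustration.

```python
import math

def pool_estimates(estimates, std_errors, z=1.96):
    """Fixed-effect inverse-variance pooling: weight each experiment's
    estimate by 1/SE^2; the pooled SE is 1/sqrt(sum of weights)."""
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = 1 / math.sqrt(sum(weights))
    return pooled, (pooled - z * pooled_se, pooled + z * pooled_se)

# Three hypothetical experiments measuring the same lift:
est, (lo, hi) = pool_estimates([0.020, 0.035, 0.028], [0.010, 0.012, 0.015])
print(f"Pooled lift: {est:.4f}, 95% CI [{lo:.4f}, {hi:.4f}]")
```

The pooled interval here is narrower than the CI from any single experiment, which is the point of combining them.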
Related Terms
Effect Size
A quantitative measure of the magnitude of a treatment's impact, expressed either as an absolute difference between groups or as a standardized metric like Cohen's d that allows comparison across different experiments and metrics.
Type I Error
The error of incorrectly rejecting a true null hypothesis, also known as a false positive, where an experiment concludes that a treatment has an effect when in reality there is no true difference between treatment and control.
Type II Error
The error of failing to reject a false null hypothesis, also known as a false negative, where an experiment fails to detect a real treatment effect, concluding there is no difference when one actually exists.
Multivariate Testing
An experimentation method that simultaneously tests multiple variables and their combinations to determine which combination of changes produces the best outcome, unlike A/B testing which typically varies a single element at a time.
Split Testing
The practice of randomly dividing users into two or more groups and exposing each group to a different version of a product experience to measure which version performs better on a target metric, commonly known as A/B testing.
Holdout Testing
An experimental design where a small percentage of users are permanently excluded from receiving a new feature or set of features, serving as a long-term control group to measure the cumulative impact of product changes over time.