Experimentation & Causal Inference Glossary
Statistical methods and frameworks for running rigorous experiments — A/B testing, multivariate testing, causal inference, bandit algorithms, and growth experimentation at scale.
Activation Experiment
An experiment specifically designed to increase the rate at which new users reach a product's activation milestone, the key early action that correlates with long-term retention, by testing changes to onboarding flows, first-run experiences, and value delivery.
Adaptive Experiment
An experiment design that modifies its parameters during execution based on accumulating data, including adjusting traffic allocation between variants, dropping underperforming arms, or modifying the sample size, while maintaining statistical validity through appropriate corrections.
Bayesian Optimization
A sequential decision-making framework that uses a probabilistic model of the objective function to efficiently search for the optimal configuration of parameters, balancing exploration of uncertain regions with exploitation of promising areas.
Causal Forest
A machine learning method based on random forests that estimates heterogeneous treatment effects, discovering how the impact of a treatment varies across different subgroups of users defined by their observable characteristics.
Cluster Randomization
An experimental design that randomly assigns groups (clusters) of users rather than individual users to treatment conditions, used when individual randomization is not feasible or when interference between users within the same cluster would violate independence assumptions.
Confidence Interval
A range of values, derived from sample data, constructed so that the procedure generating it captures the true population parameter at a specified frequency across repeated samples (e.g., 95% of the time), providing both an estimate of the treatment effect and the precision of that estimate.
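As an illustrative sketch (not tied to any particular experimentation platform), a 95% Wald interval for the difference in conversion rates between two arms, using hypothetical counts:

```python
import math

def diff_in_proportions_ci(conv_t, n_t, conv_c, n_c, z=1.96):
    """Wald 95% confidence interval for the difference in conversion
    rates; conv_* are conversion counts, n_* are sample sizes."""
    p_t, p_c = conv_t / n_t, conv_c / n_c
    se = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    diff = p_t - p_c
    return diff - z * se, diff + z * se

# Treatment converted 12.0% vs control 11.0% on 10,000 users each
lo, hi = diff_in_proportions_ci(1200, 10000, 1100, 10000)
```

If the interval excludes zero, the difference is statistically significant at roughly the 5% level; its width conveys the precision of the estimate.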
Contextual Bandit Experiment
An adaptive experiment that uses user context (features like demographics, behavior history, and session attributes) to personalize which treatment variant each user receives, learning a policy that maps user characteristics to optimal treatments in real time.
Crossover Design
An experimental design where the same subjects receive both the treatment and control conditions in different time periods, with each subject serving as their own control, reducing variance from between-subject differences.
CUPED Variance Reduction
A statistical technique (Controlled-experiment Using Pre-Experiment Data) that reduces metric variance in online experiments by adjusting for pre-experiment user behavior; variance reductions of 20-50% are commonly reported, increasing statistical power without requiring larger sample sizes.
Difference-in-Differences
A quasi-experimental statistical method that estimates a treatment effect by comparing the change in outcomes over time between a group that receives a treatment and a group that does not, removing biases from time-invariant differences between groups and common time trends.
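The estimator itself is a simple double difference; a sketch with hypothetical retention rates:

```python
def did_estimate(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Difference-in-differences: the treated group's change over time
    minus the control group's change, removing shared time trends."""
    return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)

# Hypothetical weekly retention before/after a policy change
effect = did_estimate(treat_pre=0.40, treat_post=0.48,
                      ctrl_pre=0.40, ctrl_post=0.43)  # ~0.05
```

The estimate is only credible under the parallel-trends assumption: absent treatment, both groups would have changed by the same amount.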
Effect Size
A quantitative measure of the magnitude of a treatment's impact, expressed either as an absolute difference between groups or as a standardized metric like Cohen's d that allows comparison across different experiments and metrics.
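Cohen's d with a pooled standard deviation, as a minimal sketch on made-up samples:

```python
import math
import statistics

def cohens_d(sample_a, sample_b):
    """Standardized mean difference using the pooled standard deviation,
    allowing sample sizes and variances to differ between groups."""
    na, nb = len(sample_a), len(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    pooled_sd = math.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    return (statistics.fmean(sample_a) - statistics.fmean(sample_b)) / pooled_sd

d = cohens_d([5, 6, 7, 8, 9], [3, 4, 5, 6, 7])  # ~1.26
```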
Epsilon-Greedy
A simple exploration-exploitation algorithm used in multi-armed bandit experiments that exploits the current best-performing variant with probability (1-epsilon) and explores by randomly selecting any variant with probability epsilon, where epsilon is typically a small value like 0.1.
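The selection rule fits in a few lines; a sketch assuming running estimates of each arm's mean reward are maintained elsewhere:

```python
import random

def epsilon_greedy_choice(estimated_means, epsilon=0.1, rng=random):
    """Return the index of the arm to play: explore uniformly at random
    with probability epsilon, otherwise exploit the current best arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimated_means))
    return max(range(len(estimated_means)), key=estimated_means.__getitem__)

arm = epsilon_greedy_choice([0.11, 0.14, 0.09], epsilon=0.1)
```

After observing the reward, the chosen arm's running mean is updated; with epsilon=0 the rule degenerates to pure exploitation.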
Experiment Analysis Plan
A pre-registered document specifying the hypothesis, primary and secondary metrics, statistical methods, sample size, analysis timeline, and decision criteria for an experiment, written before the experiment launches to prevent post-hoc rationalization and p-hacking.
Experiment Documentation
The systematic recording of experiment hypotheses, designs, configurations, results, and learnings in a structured, searchable format that preserves institutional knowledge and enables evidence-based decision-making across the organization.
Experiment Review Board
A cross-functional governance body that reviews experiment designs before launch and results before ship decisions, ensuring statistical rigor, alignment with organizational metrics, and prevention of common methodological errors.
Experiment Velocity
The rate at which an organization designs, launches, analyzes, and acts on experiments, typically measured as the number of experiments concluded per unit time, reflecting the speed of the organization's learning and iteration cycle.
Factorial Design
An experimental design that simultaneously tests all possible combinations of two or more factors, each with multiple levels, enabling the estimation of both individual factor effects and interaction effects between factors in a single experiment.
False Discovery Rate
The expected proportion of false positives among all statistically significant results, offering a less conservative alternative to familywise error rate control that is more appropriate when many hypotheses are tested and some false discoveries are acceptable.
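The standard way to control FDR is the Benjamini-Hochberg step-up procedure; a minimal sketch:

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return indices of hypotheses rejected at FDR level q.

    BH procedure: sort p-values ascending, find the largest rank k with
    p_(k) <= q * k / m, and reject the k smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= q * rank / m:
            k = rank
    return sorted(order[:k])

rejected = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20, 0.60])
```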
Feature Gating
The practice of controlling access to product features based on configurable rules, enabling gradual rollouts, targeted access, and experiments by dynamically determining which users see which features without code deployments.
Frequentist Testing
The classical statistical hypothesis testing framework used in most A/B tests, where decisions are based on p-values and confidence intervals derived from the sampling distribution of test statistics under the null hypothesis of no treatment effect.
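For conversion-rate metrics this typically reduces to a two-proportion z-test; a sketch using Python's statistics.NormalDist with hypothetical counts:

```python
from statistics import NormalDist

def two_proportion_p_value(conv_t, n_t, conv_c, n_c):
    """Two-sided p-value for a pooled two-proportion z-test;
    conv_* are conversion counts, n_* are sample sizes."""
    p_pool = (conv_t + conv_c) / (n_t + n_c)
    se = (p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c)) ** 0.5
    z = (conv_t / n_t - conv_c / n_c) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

p = two_proportion_p_value(1300, 10000, 1100, 10000)  # well below 0.05
```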
Growth Experimentation Framework
A structured organizational process for systematically generating, prioritizing, running, and learning from experiments across the entire user lifecycle, designed to maximize the rate of validated learning and compound the impact of product improvements.
Guardrail Metric Testing
The practice of monitoring a set of critical business metrics during every experiment to detect unintended negative side effects, even when the primary experiment metric shows a positive result, ensuring that optimizing one metric does not degrade overall user experience or business health.
Heterogeneous Treatment Effects
Variation in treatment effects across different subgroups of the population, where an intervention may have different impacts depending on user characteristics such as tenure, geography, device type, or behavioral patterns.
Holdout Testing
An experimental design where a small percentage of users are permanently excluded from receiving a new feature or set of features, serving as a long-term control group to measure the cumulative impact of product changes over time.
Intention-to-Treat
An analysis principle that evaluates experiment results based on the original random assignment of users to treatment groups, regardless of whether they actually received or engaged with the treatment, preserving the validity of randomization.
Interleaving Test
An experimentation method primarily used for ranking and recommendation systems where results from two algorithms are interleaved into a single list shown to each user, and user interactions with items from each algorithm determine which performs better.
Latin Square Design
An experimental design that controls for two known sources of variation by arranging treatments in a grid where each treatment appears exactly once in each row and column, efficiently balancing nuisance factors without requiring a full factorial experiment.
Long-Running Experiment
An experiment maintained for weeks, months, or even years beyond the standard analysis period to measure the long-term and cumulative effects of a treatment, capturing delayed impacts on retention, revenue, and user behavior that short-term experiments miss.
Marketplace Experiment
An experiment conducted in a two-sided or multi-sided marketplace where treatment effects can propagate between buyer and seller sides, requiring specialized experimental designs that account for cross-side interference and equilibrium effects.
Minimum Detectable Effect
The smallest treatment effect that an experiment is designed to detect with a specified level of statistical power, serving as the bridge between statistical capability and practical relevance in experiment planning.
Monetization Experiment
An experiment focused on increasing revenue per user through changes to pricing, upsell flows, premium feature presentation, upgrade prompts, and payment mechanics, measuring both immediate revenue impact and long-term customer lifetime value.
Multiple Comparison Correction
Statistical adjustments applied when testing multiple hypotheses simultaneously to control the overall probability of making at least one Type I error, preventing the inflation of false positive rates that occurs when many tests are conducted.
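The simplest such adjustment is the Bonferroni correction, which divides the significance threshold by the number of tests; a sketch:

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject hypothesis i iff p_i <= alpha / m, controlling the
    familywise error rate at alpha (conservative when m is large)."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

decisions = bonferroni_reject([0.001, 0.02, 0.04])  # only the first survives
```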
Multivariate Testing
An experimentation method that simultaneously tests multiple variables and their combinations to determine which combination of changes produces the best outcome, unlike A/B testing, which typically varies a single element at a time.
Network Effect Experiment
An experiment designed to measure and optimize features that become more valuable as more users adopt them, addressing the unique challenges of testing network-dependent features where individual user value depends on the behavior and adoption of other users.
Novelty Effect
A temporary change in user behavior caused by the newness of a feature or design change rather than its intrinsic value, where engagement metrics initially spike because users explore the new experience but then decay as the novelty wears off.
Onboarding Experiment
An experiment that tests changes to the new user onboarding flow, measuring the impact on activation rates, time-to-value, and early retention by modifying the sequence, content, and mechanics of the initial product experience.
Paywall Testing
Experiments that test the design, timing, placement, and configuration of paywall experiences where free users encounter the boundary between free and paid features, optimizing the balance between conversion to paid and engagement retention.
Peeking Problem
The statistical inflation of false positive rates that occurs when experimenters repeatedly check experiment results and stop the test as soon as statistical significance is observed, rather than waiting for the pre-determined sample size to be reached.
Per-Protocol Analysis
An analysis approach that evaluates experiment results based on which treatment users actually received rather than their original random assignment, providing an estimate of the treatment effect among compliant users but potentially introducing selection bias.
Percentage Rollout
A deployment strategy that gradually increases the percentage of users who receive a new feature from a small initial percentage to full deployment, monitoring key metrics at each stage to catch problems before they affect the entire user base.
Power Analysis
A statistical calculation performed before an experiment to determine the minimum sample size required to detect a meaningful effect with a specified probability, balancing the risk of false negatives against practical constraints like traffic and experiment duration.
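A sketch of the standard normal-approximation sample-size formula for a two-proportion test (the baseline rate and lift below are hypothetical):

```python
import math
from statistics import NormalDist

def n_per_group(p_base, mde_abs, alpha=0.05, power=0.80):
    """Approximate users needed per arm to detect an absolute lift of
    mde_abs over baseline rate p_base (two-sided test, normal approx.)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_treat = p_base + mde_abs
    variance = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)

n = n_per_group(0.10, 0.01)  # roughly 15k users per arm
```

Halving the minimum detectable effect roughly quadruples the required sample size, which is why the MDE dominates experiment-duration planning.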
Practical Significance
The assessment of whether a statistically significant experiment result represents a meaningful business impact that justifies the cost of implementation, maintenance, and complexity of shipping the change, distinct from mere statistical significance.
Pre-Post Analysis
A quasi-experimental method that compares metrics before and after a treatment is applied to the same group, using the pre-treatment period as a baseline to estimate the treatment effect when a randomized control group is not available.
Pricing Experiment
An experiment that tests different pricing structures, price points, packaging configurations, or billing models to optimize revenue, conversion rates, or a combination of monetization metrics while monitoring the impact on user satisfaction and retention.
Primacy Effect
A temporary depression in user performance or engagement when encountering a changed experience, caused by the disruption of established habits and mental models, which can make a genuinely beneficial treatment appear harmful in the short term.
Propensity Score Matching
A statistical method that reduces selection bias in observational studies by matching treated and untreated units that have similar probabilities (propensity scores) of receiving the treatment, creating a pseudo-randomized comparison.
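A greedy nearest-neighbor matching sketch; it assumes propensity scores have already been estimated (e.g., by a logistic regression on pre-treatment covariates), and the caliper value and unit IDs are illustrative:

```python
def greedy_caliper_match(treated_scores, control_scores, caliper=0.05):
    """Match each treated unit to the closest unused control unit whose
    propensity score lies within the caliper; inputs are
    {unit_id: propensity_score} mappings, output is (treated, control) pairs."""
    available = dict(control_scores)
    pairs = []
    # Process treated units in score order so close pairs are found first
    for t_id, t_score in sorted(treated_scores.items(), key=lambda kv: kv[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # each control unit is used at most once
    return pairs

pairs = greedy_caliper_match({"t1": 0.30, "t2": 0.70},
                             {"c1": 0.32, "c2": 0.68, "c3": 0.10})
```

Production implementations typically add diagnostics for covariate balance after matching; greedy one-to-one matching is the simplest of several matching strategies.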
Randomization Unit
The entity (user, session, page view, device, cluster, or geographic region) at which random assignment to experiment variants occurs, determining the independence structure of the data and affecting both the validity and statistical power of the experiment.
Referral Testing
Experiments that optimize referral and invitation programs by testing incentive structures, sharing mechanics, referral messaging, and the invitation experience to maximize the number and quality of referred users.
Regression Discontinuity
A quasi-experimental design that exploits a sharp cutoff in a continuous assignment variable to estimate causal effects, comparing units just above and just below the threshold where treatment assignment changes discontinuously.
Retention Experiment
An experiment aimed at increasing the percentage of users who continue using a product over time, testing interventions that strengthen habit formation, increase perceived value, reduce churn triggers, and deepen user engagement.
Sample Ratio Mismatch
A diagnostic check that detects whether the observed ratio of users in experiment groups matches the expected ratio from the randomization design, where a significant deviation signals a data quality problem that can invalidate experiment results.
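For a two-arm experiment the check is a one-degree-of-freedom chi-squared goodness-of-fit test; a sketch that uses the fact that a chi-squared variable with 1 df is the square of a standard normal:

```python
from statistics import NormalDist

def srm_p_value(n_treat, n_ctrl, expected_treat_share=0.5):
    """P-value of a chi-squared test (1 df) that the observed group
    sizes match the planned traffic split."""
    total = n_treat + n_ctrl
    exp_t = total * expected_treat_share
    exp_c = total - exp_t
    stat = (n_treat - exp_t) ** 2 / exp_t + (n_ctrl - exp_c) ** 2 / exp_c
    # chi2 with 1 df is the square of a standard normal variable
    return 2 * (1 - NormalDist().cdf(stat ** 0.5))

p = srm_p_value(50_000, 51_500)  # a large imbalance for a planned 50/50 split
```

Because this check runs on every experiment, teams commonly flag only very small p-values (e.g., below 0.001) to avoid false alarms.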
Split Testing
The practice of randomly dividing users into two or more groups and exposing each group to a different version of a product experience to measure which version performs better on a target metric, commonly known as A/B testing.
Stopping Rules
Pre-defined criteria that determine when an experiment should be concluded, including both the conditions for early termination due to clear results and the maximum duration or sample size at which a final analysis is performed.
Stratified Randomization
A randomization technique that first divides the user population into homogeneous subgroups (strata) based on important characteristics, then randomizes independently within each stratum to ensure treatment groups are balanced on known confounders and to improve statistical precision.
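A sketch of within-stratum assignment (the stratum function and user IDs are illustrative):

```python
import random
from collections import defaultdict

def stratified_assign(user_ids, stratum_of, arms=("control", "treatment"),
                      seed=42):
    """Group users by stratum, shuffle within each stratum, then deal
    members to arms round-robin so every stratum splits as evenly
    as possible across arms."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for uid in user_ids:
        strata[stratum_of(uid)].append(uid)
    assignment = {}
    for members in strata.values():
        rng.shuffle(members)
        for i, uid in enumerate(members):
            assignment[uid] = arms[i % len(arms)]
    return assignment

# 100 users in 4 strata of 25 each; every stratum is split 13/12
assignment = stratified_assign(range(100), stratum_of=lambda uid: uid % 4)
```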
Switchback Testing
An experimental design that alternates between treatment and control conditions over time periods within the same unit (such as a geographic region or marketplace), used when user-level randomization is not feasible due to interference or operational constraints.
Synthetic Control
A causal inference method that constructs a weighted combination of untreated units to create an artificial control group that closely matches the treated unit's pre-treatment characteristics and trajectory, enabling credible treatment effect estimation when only one or a few units are treated.
Triggered Analysis
An analysis technique that restricts experiment evaluation to users who actually encountered or were exposed to the experimental change, reducing noise from unaffected users while maintaining the validity of the randomization through careful implementation.
Type I Error
The error of incorrectly rejecting a true null hypothesis, also known as a false positive, where an experiment concludes that a treatment has an effect when in reality there is no true difference between treatment and control.
Type II Error
The error of failing to reject a false null hypothesis, also known as a false negative, where an experiment fails to detect a real treatment effect, concluding there is no difference when one actually exists.
Virality Testing
Experiments that measure and optimize the organic spread of a product through user actions, testing features and mechanics that naturally encourage sharing, collaboration, and exposure of the product to non-users without explicit referral incentives.