Propensity Score Matching

A statistical method that reduces selection bias in observational studies by matching treated and untreated units that have similar probabilities (propensity scores) of receiving the treatment, creating a pseudo-randomized comparison.

Propensity score matching (PSM) addresses the fundamental challenge of observational studies: when treatment is not randomly assigned, treated and untreated groups may differ systematically in ways that confound the treatment effect estimate. PSM works in two steps: first, estimate each unit's probability of receiving the treatment (the propensity score) based on observed covariates using logistic regression or machine learning; second, match each treated unit with one or more untreated units that have similar propensity scores, creating balanced comparison groups. For growth teams, PSM is useful for evaluating features that users self-select into (e.g., premium plans, optional onboarding flows, support contacts) where randomization would be inappropriate or impractical.
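The two-step procedure above can be sketched on a toy synthetic dataset. The data-generating process, the true effect of 2.0, and the use of scikit-learn's LogisticRegression and NearestNeighbors are illustrative assumptions, not a reference implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Synthetic observational data: treatment selection depends on covariates X,
# and X[:, 0] also drives the outcome, so a naive comparison is confounded.
n = 2000
X = rng.normal(size=(n, 3))
p_treat = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))
treated = rng.random(n) < p_treat
outcome = 2.0 * treated + X[:, 0] + rng.normal(size=n)  # true effect = 2.0

# Step 1: estimate propensity scores with logistic regression.
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# Step 2: 1-nearest-neighbor matching (with replacement) on the score.
nn = NearestNeighbors(n_neighbors=1).fit(ps[~treated].reshape(-1, 1))
_, idx = nn.kneighbors(ps[treated].reshape(-1, 1))
matched_control_outcomes = outcome[~treated][idx.ravel()]

# Average treatment effect on the treated (ATT) from the matched pairs;
# it should land near the true effect of 2.0.
att = (outcome[treated] - matched_control_outcomes).mean()
```

The naive treated-minus-control mean difference on this data overstates the effect, because units with high X[:, 0] are both more likely to be treated and have higher outcomes; matching on the score removes most of that gap.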

The propensity score e(X) = P(Treatment = 1 | X) is estimated by fitting a logistic regression or a more flexible model such as gradient-boosted trees on the observed covariates X. Matching then pairs treated and control units with similar propensity scores using methods like nearest-neighbor matching (each treated unit is matched to the control unit with the closest propensity score), caliper matching (matches are accepted only within a maximum propensity score distance), or kernel matching (all control units contribute, with weights inversely related to their propensity score distance). After matching, covariate balance is assessed: the standardized mean differences between treated and control groups on all covariates should be small (typically less than 0.1). The treatment effect is then estimated as the average difference in outcomes between matched treated and control units. Tools include the R packages MatchIt and WeightIt, Python's causalml and DoWhy libraries, and the Stata teffects commands.
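The balance check described above can be sketched as a small helper. The simulated 0.5-standard-deviation shift in the treated group's covariate is an illustrative assumption:

```python
import numpy as np

def standardized_mean_diff(x_treated, x_control):
    """Difference in means scaled by the pooled standard deviation.
    Values below ~0.1 are conventionally taken as adequate balance."""
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return (x_treated.mean() - x_control.mean()) / pooled_sd

rng = np.random.default_rng(1)
covariate_treated = rng.normal(0.5, 1.0, 500)  # shifted mean: imbalanced
covariate_control = rng.normal(0.0, 1.0, 500)

smd = standardized_mean_diff(covariate_treated, covariate_control)
```

In practice this statistic is computed for every covariate both before and after matching; an SMD that stays above 0.1 after matching signals that the matched sample is not comparable on that covariate.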

PSM should be used when you have rich covariate data that captures the factors driving treatment selection, and when you believe that conditioning on these covariates removes all confounding (the unconfoundedness or selection on observables assumption). This assumption is strong and untestable: if there are unobserved factors that affect both treatment selection and the outcome, PSM will still produce biased estimates. Common pitfalls include including post-treatment variables as covariates (which introduces bias), achieving poor covariate balance after matching (indicating the matching failed), discarding too many treated units that cannot find good matches, and placing excessive trust in the results without sensitivity analysis for unobserved confounding. Alternatives and complements include inverse probability weighting (using propensity scores as weights rather than for matching), doubly robust estimation (combining outcome modeling with propensity scoring), and instrumental variables (which handle unobserved confounding but require a valid instrument).
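As a point of comparison, the inverse probability weighting alternative mentioned above can be sketched on synthetic data. The data-generating process, the true effect of 1.5, and the 0.05/0.95 clipping bounds are illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 5000
X = rng.normal(size=(n, 2))
p = 1 / (1 + np.exp(-X[:, 0]))           # selection depends on X[:, 0]
t = rng.random(n) < p
y = 1.5 * t + X[:, 0] + rng.normal(size=n)  # true ATE = 1.5

# Estimate propensity scores, then clip to tame extreme weights,
# a common stabilization step for IPW.
ps = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]
ps = np.clip(ps, 0.05, 0.95)

# Horvitz-Thompson style IPW estimate of the average treatment effect:
# reweight each group by the inverse probability of its observed treatment.
ate = np.mean(t * y / ps) - np.mean((1 - t) * y / (1 - ps))
```

Unlike matching, IPW uses every unit rather than discarding unmatched ones, at the cost of sensitivity to extreme weights when propensity scores approach 0 or 1.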

Advanced PSM techniques include generalized propensity scores for continuous or multi-valued treatments, coarsened exact matching (CEM) that matches on discretized covariates without estimating a propensity model, and entropy balancing that directly solves for weights that achieve exact covariate balance on specified moments. Sensitivity analysis methods like Rosenbaum bounds quantify how strong unobserved confounding would need to be to overturn the estimated treatment effect, providing a principled way to assess the robustness of PSM findings. For digital experimentation, PSM is increasingly combined with machine learning: causal forests and meta-learners can estimate heterogeneous treatment effects from observational data using propensity scores as inputs, and doubly robust machine learning methods provide valid inference even when either the propensity model or the outcome model is misspecified.
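The idea behind coarsened exact matching can be sketched by discretizing a covariate into bins and comparing outcomes only within bins that contain both treated and control units; the bin edges, synthetic data, and true effect of 1.0 here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3000
x = rng.normal(size=n)
t = rng.random(n) < 1 / (1 + np.exp(-x))   # selection on x
y = 1.0 * t + x + rng.normal(size=n)       # true effect = 1.0

# Coarsen the covariate into bins; no propensity model is estimated.
bins = np.digitize(x, np.linspace(-2, 2, 9))

# Within-bin treated-vs-control contrasts, weighted by treated counts (ATT).
effects, weights = [], []
for b in np.unique(bins):
    in_b = bins == b
    y_t, y_c = y[in_b & t], y[in_b & ~t]
    if len(y_t) and len(y_c):              # keep only bins with both groups
        effects.append(y_t.mean() - y_c.mean())
        weights.append(len(y_t))
att = np.average(effects, weights=weights)
```

Bins containing only treated or only control units are dropped, which mirrors how CEM prunes unmatched strata; finer bins reduce within-bin confounding but discard more data.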

Related Terms

Causal Forest

A machine learning method based on random forests that estimates heterogeneous treatment effects, discovering how the impact of a treatment varies across different subgroups of users defined by their observable characteristics.

Heterogeneous Treatment Effects

Variation in treatment effects across different subgroups of the population, where an intervention may have different impacts depending on user characteristics such as tenure, geography, device type, or behavioral patterns.

Difference-in-Differences

A quasi-experimental statistical method that estimates a treatment effect by comparing the change in outcomes over time between a group that receives a treatment and a group that does not, removing biases from time-invariant differences between groups and common time trends.

Multivariate Testing

An experimentation method that simultaneously tests multiple variables and their combinations to determine which combination of changes produces the best outcome, unlike A/B testing which typically varies a single element at a time.

Split Testing

The practice of randomly dividing users into two or more groups and exposing each group to a different version of a product experience to measure which version performs better on a target metric, commonly known as A/B testing.

Holdout Testing

An experimental design where a small percentage of users are permanently excluded from receiving a new feature or set of features, serving as a long-term control group to measure the cumulative impact of product changes over time.