Sample Ratio Mismatch

A diagnostic check that detects whether the observed ratio of users in experiment groups matches the expected ratio from the randomization design, where a significant deviation signals a data quality problem that can invalidate experiment results.

Sample ratio mismatch (SRM) is one of the most important diagnostic checks in online experimentation, yet it is frequently overlooked. If an experiment is designed to split traffic 50/50 between control and treatment, the observed counts should be approximately equal, with some random variation. An SRM test uses a chi-squared goodness-of-fit test to determine whether the observed ratio deviates significantly from the expected ratio. A significant SRM (typically p < 0.001 given the large sample sizes involved) indicates that something in the experiment implementation is systematically biasing which users end up in which group, which violates the fundamental assumption of random assignment and can invalidate all causal conclusions. For growth teams, SRM is a canary in the coal mine: it does not tell you what went wrong, but it tells you that something did.
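A minimal two-group SRM check can be written with only the standard library, since for a chi-squared test with 1 degree of freedom the p-value equals erfc(sqrt(stat / 2)). This is a sketch, not a platform's implementation; the 0.001 threshold follows the convention mentioned above, and the counts are invented for illustration.

```python
import math

def srm_check(control_count, treatment_count,
              expected_control_ratio=0.5, alpha=0.001):
    """Return (p_value, srm_detected) for a two-group experiment."""
    total = control_count + treatment_count
    expected = [total * expected_control_ratio,
                total * (1 - expected_control_ratio)]
    # Chi-squared goodness-of-fit statistic over the two groups.
    stat = sum((o - e) ** 2 / e
               for o, e in zip([control_count, treatment_count], expected))
    # With 1 degree of freedom, the survival function is erfc(sqrt(x/2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return p_value, p_value < alpha

# A 50/50 design where treatment recorded ~1.5% fewer users than control:
p, flagged = srm_check(50_000, 48_500)  # flagged is True
```

Note the strict alpha of 0.001: with the sample sizes typical of online experiments, a looser threshold like 0.05 would flag too many healthy experiments.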

Common causes of SRM include bugs in the randomization code, differences in page load performance between variants causing differential bot filtering or user abandonment, redirect-based experiments where one variant's redirect fails more often, initialization timing differences where the assignment check happens at different points in the user flow, and browser or client-side caching that affects variant delivery inconsistently. For example, if the treatment variant loads 200ms slower than control, some treatment users may bounce before their visit is logged, creating an SRM where control has more recorded users. Another common cause is interaction between experiments: if experiment A's treatment causes some users to never reach experiment B's assignment point, experiment B will show an SRM among users exposed to experiment A's treatment. The SRM test statistic is computed as chi_squared = sum((observed_i - expected_i)^2 / expected_i) across all k groups, and compared to a chi-squared distribution with k-1 degrees of freedom.
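The statistic above translates directly into code for an arbitrary number of groups. The three-arm counts below are made up for illustration.

```python
def chi_squared_stat(observed, expected_ratios):
    """chi2 = sum((obs_i - exp_i)^2 / exp_i) over all groups."""
    total = sum(observed)
    expected = [total * r for r in expected_ratios]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Three-arm experiment with a 50/25/25 design:
stat = chi_squared_stat([10_000, 5_200, 4_800], [0.5, 0.25, 0.25])
# -> 16.0; compare against a chi-squared distribution with
# k - 1 = 2 degrees of freedom to obtain the p-value.
```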

Every experiment should include an automated SRM check that runs continuously and alerts when a mismatch is detected. The check should be performed early in the experiment (often within the first day) so that problematic experiments can be stopped before they waste more traffic. When an SRM is detected, the experiment results should be considered invalid until the root cause is identified and resolved. Common debugging steps include checking for differential logging between variants, examining bot traffic patterns, verifying that the randomization hash function produces uniform distribution, checking for interactions with other concurrent experiments, and examining whether variant-specific errors or timeouts could cause differential data loss. Teams should never try to fix an SRM by reweighting or adjusting the data; the goal is to identify and fix the underlying implementation bug.
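One of the debugging steps above, verifying that the randomization hash produces a uniform distribution, can be checked offline. This sketch assumes a common (but not universal) scheme where assignment hashes "experiment_id:user_id" with SHA-256 and takes the digest modulo the number of buckets; the experiment ID and user IDs are hypothetical.

```python
import hashlib

def assign_bucket(experiment_id, user_id, n_buckets=2):
    """Deterministically map a user to a bucket via SHA-256."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_buckets

# Simulate 100,000 users and confirm the split is close to 50/50.
counts = [0, 0]
for uid in range(100_000):
    counts[assign_bucket("exp_42", f"user-{uid}")] += 1
# Both counts should land near 50,000; a large deviation here would
# point at the hashing scheme itself rather than downstream logging.
```

Running the same check with the production hash function and real user IDs separates "the randomizer is biased" from "the randomizer is fine but data is lost after assignment."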

Advanced SRM analysis includes checking not just the overall ratio but also the ratio over time (a sudden shift may indicate a deployment or configuration change), across different platforms or geographies (a platform-specific SRM points to a client-side implementation issue), and across different user segments. Some experimentation platforms like Statsig and Eppo run SRM checks automatically and flag experiments with mismatches. The concept extends beyond simple two-group experiments: for multi-arm experiments, factorial designs, and layered experiment systems, SRM checks should verify all expected ratios including interaction cells. Research by Fabijan, Dmitriev, and others at Microsoft has documented that SRM affects a significant percentage of experiments at major tech companies and is often the first indicator of subtle bugs that would otherwise silently bias results for months.
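Segment-level checking can reuse the same two-group test within each platform, so a client-side issue surfaces in the segment where it lives. The segment counts below are invented; the p-value again uses erfc(sqrt(stat / 2)), valid for 1 degree of freedom.

```python
import math

def two_group_p(control, treatment):
    """p-value of a chi-squared test against an expected 50/50 split."""
    half = (control + treatment) / 2
    stat = (control - half) ** 2 / half + (treatment - half) ** 2 / half
    return math.erfc(math.sqrt(stat / 2))

segments = {                 # (control, treatment) counts per platform
    "web":     (40_000, 39_900),
    "ios":     (30_000, 29_950),
    "android": (30_000, 27_500),  # suspiciously low treatment count
}
flagged = {name: two_group_p(c, t) < 0.001
           for name, (c, t) in segments.items()}
# Only "android" is flagged, pointing at an Android-specific
# implementation bug rather than a problem with the randomizer.
```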

Related Terms

Split Testing

The practice of randomly dividing users into two or more groups and exposing each group to a different version of a product experience to measure which version performs better on a target metric, commonly known as A/B testing.

Randomization Unit

The entity (user, session, page view, device, cluster, or geographic region) at which random assignment to experiment variants occurs, determining the independence structure of the data and affecting both the validity and statistical power of the experiment.

Guardrail Metric Testing

The practice of monitoring a set of critical business metrics during every experiment to detect unintended negative side effects, even when the primary experiment metric shows a positive result, ensuring that optimizing one metric does not degrade overall user experience or business health.

Multivariate Testing

An experimentation method that simultaneously tests multiple variables and their combinations to determine which combination of changes produces the best outcome, unlike A/B testing which typically varies a single element at a time.

Holdout Testing

An experimental design where a small percentage of users are permanently excluded from receiving a new feature or set of features, serving as a long-term control group to measure the cumulative impact of product changes over time.

Power Analysis

A statistical calculation performed before an experiment to determine the minimum sample size required to detect a meaningful effect with a specified probability, balancing the risk of false negatives against practical constraints like traffic and experiment duration.