Interleaving Test
An experimentation method primarily used for ranking and recommendation systems where results from two algorithms are interleaved into a single list shown to each user, and user interactions with items from each algorithm determine which performs better.
Interleaving is a specialized experimental technique for evaluating ranking algorithms, search results, recommendation systems, and any system that produces ordered lists of items. Instead of showing half the users algorithm A's results and the other half algorithm B's results (traditional A/B testing), interleaving shows every user a single merged list containing items from both algorithms. User interactions such as clicks, purchases, or engagement are then attributed back to the originating algorithm to determine which contributed more preferred items. For growth teams working on search, recommendations, or content feeds, interleaving is dramatically more sensitive than traditional A/B testing, often requiring 10-100x fewer users to detect the same effect size.
The most common interleaving method is Team Draft Interleaving. For each user query, the two algorithms each produce a ranked list. The interleaved list is built by alternating between algorithms: a fair coin determines which algorithm contributes the first item, then they take turns, skipping items already included. Each item in the final list is tagged with its source algorithm, and when a user clicks an item, the click counts as a win for that item's algorithm. The test statistic is the proportion of users for whom algorithm A won more clicks than algorithm B, tested against 0.5 with a binomial test. Other methods include Balanced Interleaving, which constrains every prefix of the merged list to contain a near-equal number of top-ranked items from each algorithm, and Optimized Interleaving, which maximizes sensitivity by choosing the interleaving that best discriminates between algorithms. These methods are implemented in experimentation frameworks at Netflix, Spotify, and Microsoft Bing.
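The team-draft construction and click attribution described above can be sketched in a few lines of Python. This is an illustrative implementation under the "fewer-contributions team picks next, coin flip breaks ties" formulation; the function names and the "A"/"B" team labels are hypothetical, not any production framework's API:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=random):
    """Team Draft Interleaving of two ranked lists.

    Returns a list of (item, team) pairs; team "A" or "B" marks the
    algorithm credited if the item is clicked.
    """
    interleaved, used = [], set()
    count = {"A": 0, "B": 0}
    pools = {"A": ranking_a, "B": ranking_b}
    pos = {"A": 0, "B": 0}

    def next_unused(team):
        # Advance past items the other team already contributed.
        while pos[team] < len(pools[team]) and pools[team][pos[team]] in used:
            pos[team] += 1
        return pools[team][pos[team]] if pos[team] < len(pools[team]) else None

    while True:
        # The team with fewer contributions picks next; a coin breaks ties.
        if count["A"] < count["B"]:
            team = "A"
        elif count["B"] < count["A"]:
            team = "B"
        else:
            team = rng.choice(["A", "B"])
        item = next_unused(team)
        if item is None:
            # This team's list is exhausted; let the other team fill in.
            team = "B" if team == "A" else "A"
            item = next_unused(team)
            if item is None:
                break
        interleaved.append((item, team))
        used.add(item)
        count[team] += 1
    return interleaved

def credit_clicks(interleaved, clicked_items):
    """Count clicks per team for one user impression."""
    wins = {"A": 0, "B": 0}
    for item, team in interleaved:
        if item in clicked_items:
            wins[team] += 1
    return wins
```

Because ties in contribution counts are broken by a coin flip, the same pair of rankings can yield different interleavings across users, which is what keeps the per-user comparison fair in expectation.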
Interleaving should be used as a fast screening method for ranking algorithm changes when you need a quick directional signal before committing to a full A/B test. The dramatic sensitivity advantage comes from the within-user comparison: each user serves as their own control, eliminating between-user variance. However, interleaving has important limitations: it measures only preference (which algorithm users prefer), not absolute engagement (whether users engage more overall). An algorithm that shows more clickable but less relevant results might win interleaving tests while decreasing downstream satisfaction. Interleaving also cannot measure system-level effects such as changes in total user engagement, session length, or revenue, which require traditional A/B tests. The standard workflow is to use interleaving for fast screening of many candidate algorithms, then validate the winner with a full A/B test that measures business metrics.
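The per-user preference statistic reduces to a sign test: discard users whose clicks tied, then test whether the winner proportion differs from 0.5 with an exact two-sided binomial test. A minimal stdlib-only sketch, with the function name and the "A"/"B"/"tie" outcome encoding chosen for illustration:

```python
from math import comb

def interleaving_sign_test(user_wins):
    """Exact two-sided binomial (sign) test on per-user interleaving wins.

    user_wins: list of "A", "B", or "tie" outcomes, one per user.
    Ties are discarded, as in a standard sign test.
    Returns (proportion_a, p_value) against the null p = 0.5.
    """
    a = sum(1 for w in user_wins if w == "A")
    b = sum(1 for w in user_wins if w == "B")
    n = a + b
    if n == 0:
        return 0.5, 1.0
    k = max(a, b)
    # P(X >= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)
    return a / n, p_value
```

For example, 8 users preferring A against 2 preferring B gives a win proportion of 0.8 but a p-value around 0.11, a reminder that even a lopsided split needs a reasonable number of untied users to reach significance.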
Advanced interleaving methods include Multileaving, which extends interleaving to compare more than two algorithms simultaneously, and Pairwise Preference interleaving, which handles cases where user preferences are relative rather than absolute. For deep learning recommendation systems, interleaving tests need to account for the fact that the merged list may contain items that neither algorithm would have shown in isolation, potentially introducing artifacts. Position bias correction is also important: items higher in the interleaved list receive more clicks regardless of quality, so credit attribution should account for position. Recent work on counterfactual interleaving uses inverse propensity scoring to debias interleaving results when the interleaving policy differs from what would have been shown by either algorithm alone.
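A common form of the position-bias correction mentioned above weights each click by the inverse of an assumed examination probability for its rank, in the spirit of inverse propensity scoring. Everything in this sketch is illustrative: the function name is hypothetical, and the default 1/(rank+1) examination curve is a placeholder that in practice would be estimated from click logs:

```python
def debiased_click_credit(interleaved, clicked_positions, examine_prob=None):
    """Weight each click by the inverse of an assumed examination
    probability for its position (inverse-propensity-style correction).

    interleaved: list of (item, team) pairs from an interleaving policy.
    clicked_positions: set of 0-based ranks the user clicked.
    examine_prob: optional callable rank -> probability of examination;
        defaults to an illustrative 1/(rank+1) curve.
    """
    if examine_prob is None:
        examine_prob = lambda rank: 1.0 / (rank + 1)
    credit = {"A": 0.0, "B": 0.0}
    for rank, (item, team) in enumerate(interleaved):
        if rank in clicked_positions:
            # Clicks at poorly-examined positions get proportionally
            # more credit, offsetting the advantage of top slots.
            credit[team] += 1.0 / examine_prob(rank)
    return credit
```

Under this weighting, a click at rank 2 counts three times as much as a click at rank 0, offsetting the extra exposure that top positions receive regardless of quality.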
Related Terms
Split Testing
The practice of randomly dividing users into two or more groups and exposing each group to a different version of a product experience to measure which version performs better on a target metric, commonly known as A/B testing.
Bayesian Optimization
A sequential decision-making framework that uses a probabilistic model of the objective function to efficiently search for the optimal configuration of parameters, balancing exploration of uncertain regions with exploitation of promising areas.
Contextual Bandit Experiment
An adaptive experiment that uses user context (features like demographics, behavior history, and session attributes) to personalize which treatment variant each user receives, learning a policy that maps user characteristics to optimal treatments in real time.
Multivariate Testing
An experimentation method that simultaneously tests multiple variables and their combinations to determine which combination of changes produces the best outcome, unlike A/B testing which typically varies a single element at a time.
Holdout Testing
An experimental design where a small percentage of users are permanently excluded from receiving a new feature or set of features, serving as a long-term control group to measure the cumulative impact of product changes over time.
Power Analysis
A statistical calculation performed before an experiment to determine the minimum sample size required to detect a meaningful effect with a specified probability, balancing the risk of false negatives against practical constraints like traffic and experiment duration.