Contextual Bandit Experiment

An adaptive experiment that uses user context (features like demographics, behavior history, and session attributes) to personalize which treatment variant each user receives, learning a policy that maps user characteristics to optimal treatments in real time.

Contextual bandit experiments extend multi-armed bandits from finding the single best variant to finding the best variant for each type of user. Instead of learning that Variant B is best overall, a contextual bandit might learn that Variant A is best for mobile users under 30 and Variant B is best for desktop users over 40. The algorithm uses user context (features) to personalize the variant assignment, continuously updating its policy as it observes outcomes. For growth teams, contextual bandits represent the frontier of experimentation, bridging the gap between traditional A/B testing (which finds one winner for all) and full personalization (which requires knowing the optimal experience for each user). They enable real-time personalized optimization of landing pages, notification strategies, recommendation presentations, and pricing.

Contextual bandit algorithms maintain a model that predicts the expected reward of each variant given the user's context features. Popular algorithms include LinUCB (a linear model with upper confidence bound exploration), contextual Thompson Sampling (which samples from the posterior distribution of a Bayesian linear model), and neural contextual bandits (which use deep learning for the reward model). The workflow is:

1. A user arrives with context features X (device type, geography, session count, etc.).
2. The algorithm predicts the expected reward of each variant given X.
3. The algorithm selects a variant, balancing predicted reward against exploration needs.
4. The user receives the variant and their outcome is observed.
5. The model is updated with the new observation.

Platforms and libraries such as Statsig and Vowpal Wabbit, along with custom implementations at companies like Netflix and Spotify, support contextual bandit experiments. Offline evaluation of contextual bandit policies uses inverse propensity scoring to estimate how a new policy would have performed on historically logged data.
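The five-step workflow above can be sketched with a minimal LinUCB implementation. This is an illustrative toy, not production code: the class name, feature encoding, and exploration parameter `alpha` are assumptions for the example.

```python
import numpy as np

class LinUCB:
    """Minimal LinUCB sketch: one linear reward model per variant (arm)."""
    def __init__(self, n_arms, n_features, alpha=1.0):
        self.alpha = alpha  # exploration strength
        # Per-arm ridge-regression statistics: A = I + sum(x x^T), b = sum(r * x)
        self.A = [np.eye(n_features) for _ in range(n_arms)]
        self.b = [np.zeros(n_features) for _ in range(n_arms)]

    def select(self, x):
        """Steps 2-3: predict each arm's expected reward, add an uncertainty bonus."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                             # estimated reward weights
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)   # upper-confidence bonus
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        """Step 5: fold the observed outcome back into the chosen arm's model."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Step 1: a user arrives with context features, e.g. [is_mobile, session_count]
bandit = LinUCB(n_arms=2, n_features=2)
x = np.array([1.0, 3.0])
arm = bandit.select(x)       # steps 2-4: variant chosen and shown to the user
bandit.update(arm, x, 1.0)   # step 5: observed conversion updates the model
```

The uncertainty bonus shrinks for contexts an arm has seen often, so exploration concentrates automatically on user types whose response to a variant is still uncertain.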

Contextual bandit experiments should be used when there is strong reason to believe that treatment effects vary across user segments, when the number of variants is manageable (typically 2-10), and when the context features are available at decision time. The most common applications include personalizing which email subject line to send, which landing page to show, which product recommendation layout to use, and which notification frequency to set. Common pitfalls include using context features that are noisy or unavailable in real time, overfitting the reward model to spurious patterns in early data, failing to account for delayed rewards (for example, a user who converts three days after exposure), and treating the bandit policy as a black box without understanding what context-to-variant mapping it has learned.

Advanced contextual bandit methods include batched Thompson Sampling for environments where model updates happen periodically rather than after each user, off-policy evaluation techniques like doubly robust estimation that evaluate candidate policies using historical data, non-stationary contextual bandits that adapt to changing user behavior over time, and meta-learning approaches that transfer knowledge across similar experimentation contexts. The theoretical framework for contextual bandits connects to heterogeneous treatment effect estimation from causal inference: the optimal policy is equivalent to assigning each user to the treatment with the highest conditional average treatment effect given their context. For organizations building sophisticated personalization systems, contextual bandits provide the experimentation framework for continuously learning and deploying personalized experiences at scale, closing the loop between experimentation and production machine learning.
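Off-policy evaluation can be illustrated with the simplest estimator, inverse propensity scoring: reweight each logged outcome by the inverse probability that the logging policy chose the action, keeping only records where the candidate policy agrees. The logged dataset, propensities, and candidate policy below are hypothetical examples, not real data.

```python
import numpy as np

def ips_value(contexts, actions, rewards, propensities, policy):
    """Inverse propensity scoring: estimate the average reward a candidate
    policy would have earned on logged bandit data.  Each logged record is
    (context, action taken, observed reward, probability the logging
    policy assigned that action)."""
    matches = np.array([policy(x) == a for x, a in zip(contexts, actions)])
    return float(np.mean(matches * rewards / propensities))

# Hypothetical logs from a uniform-random 2-arm logging policy (propensity 0.5);
# the single context feature encodes whether the user is on mobile.
contexts = [np.array([1.0]), np.array([0.0]), np.array([1.0]), np.array([0.0])]
actions = [0, 1, 0, 0]
rewards = np.array([1.0, 0.0, 1.0, 0.0])
props = np.array([0.5, 0.5, 0.5, 0.5])

# Candidate policy: show variant 0 to mobile users, variant 1 otherwise
policy = lambda x: 0 if x[0] == 1.0 else 1
print(ips_value(contexts, actions, rewards, props, policy))  # → 1.0
```

Doubly robust estimation, mentioned above, extends this by adding a reward-model baseline so the estimate stays accurate when either the propensities or the reward model (but not both) are misspecified.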

Related Terms

Epsilon-Greedy

A simple exploration-exploitation algorithm used in multi-armed bandit experiments that exploits the current best-performing variant with probability (1-epsilon) and explores by randomly selecting any variant with probability epsilon, where epsilon is typically a small value like 0.1.

Adaptive Experiment

An experiment design that modifies its parameters during execution based on accumulating data, including adjusting traffic allocation between variants, dropping underperforming arms, or modifying the sample size, while maintaining statistical validity through appropriate corrections.

Bayesian Optimization

A sequential decision-making framework that uses a probabilistic model of the objective function to efficiently search for the optimal configuration of parameters, balancing exploration of uncertain regions with exploitation of promising areas.

Causal Forest

A machine learning method based on random forests that estimates heterogeneous treatment effects, discovering how the impact of a treatment varies across different subgroups of users defined by their observable characteristics.

Multivariate Testing

An experimentation method that simultaneously tests multiple variables and their combinations to determine which combination of changes produces the best outcome, unlike A/B testing which typically varies a single element at a time.

Split Testing

The practice of randomly dividing users into two or more groups and exposing each group to a different version of a product experience to measure which version performs better on a target metric, commonly known as A/B testing.