
Epsilon-Greedy

A simple exploration-exploitation algorithm used in multi-armed bandit experiments that exploits the current best-performing variant with probability (1-epsilon) and explores by randomly selecting any variant with probability epsilon, where epsilon is typically a small value like 0.1.

Epsilon-greedy is the simplest bandit algorithm for online experimentation, providing a straightforward approach to balancing exploration (learning about variants) with exploitation (showing users the best variant found so far). With probability (1-epsilon), the algorithm serves the variant with the highest observed reward rate; with probability epsilon, it randomly selects any variant, including potentially inferior ones. This ensures continuous exploration while primarily serving the best option. For growth teams, epsilon-greedy is often the entry point into adaptive experimentation because it is easy to understand, implement, and debug, making it suitable for teams new to bandit methods who want to move beyond fixed A/B testing.
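The selection rule can be sketched in a few lines. This is a minimal illustration, not a production implementation; the function name and the representation of reward estimates as a plain list are assumptions for the example.

```python
import random

def choose_variant(estimates, epsilon=0.1):
    """Pick a variant index using the epsilon-greedy rule.

    estimates: list of observed reward rates, one per variant.
    """
    if random.random() < epsilon:
        # Explore: select any variant uniformly at random
        return random.randrange(len(estimates))
    # Exploit: select the variant with the highest estimated reward rate
    return max(range(len(estimates)), key=lambda i: estimates[i])
```

Note that the explore branch can also pick the current best variant; epsilon is the probability of random selection, not the probability of serving an inferior arm.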

The implementation of epsilon-greedy is straightforward: maintain a running estimate of each variant's reward rate (e.g., conversion rate). For each incoming user, generate a random number between 0 and 1. If the number is less than epsilon, randomly select a variant with equal probability (explore). Otherwise, select the variant with the highest estimated reward rate (exploit). After the user's outcome is observed, update that variant's reward estimate. The choice of epsilon controls the exploration-exploitation tradeoff: higher epsilon means more exploration (slower convergence but more robust), while lower epsilon means more exploitation (faster focus on the best variant but risk of premature convergence). A common variant is epsilon-decreasing, where epsilon starts high and decays over time: epsilon_t = epsilon_0 / (1 + decay_rate * t). This ensures heavy exploration early when estimates are uncertain, transitioning to near-pure exploitation as confidence grows.
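The full loop described above can be sketched as a small class. This is an illustrative sketch under the assumptions stated in the text (running-mean reward estimates and the epsilon-decreasing schedule given above); the class and parameter names are invented for the example.

```python
import random

class EpsilonGreedy:
    """Epsilon-greedy with incremental reward estimates and optional decay."""

    def __init__(self, n_arms, epsilon0=0.1, decay_rate=0.0):
        self.counts = [0] * n_arms    # times each variant was served
        self.values = [0.0] * n_arms  # running mean reward per variant
        self.epsilon0 = epsilon0
        self.decay_rate = decay_rate
        self.t = 0                    # total decisions made so far

    def epsilon(self):
        # Epsilon-decreasing schedule: epsilon_t = epsilon_0 / (1 + decay_rate * t).
        # With decay_rate = 0 this reduces to a fixed epsilon.
        return self.epsilon0 / (1 + self.decay_rate * self.t)

    def select(self):
        """Choose a variant index for the next incoming user."""
        self.t += 1
        if random.random() < self.epsilon():
            return random.randrange(len(self.counts))  # explore
        # exploit: highest estimated reward rate
        return max(range(len(self.counts)), key=lambda i: self.values[i])

    def update(self, arm, reward):
        """Record the observed outcome (e.g., 1 = converted, 0 = not)."""
        self.counts[arm] += 1
        # Incremental mean: new_mean = old_mean + (reward - old_mean) / n
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

The incremental-mean update avoids storing per-user outcomes, which matters when the bandit runs inside a high-traffic serving path.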

Epsilon-greedy should be used when a simple, interpretable bandit algorithm is sufficient, when the number of arms is small (2-5), and when the exploration rate can be set based on domain knowledge. Common pitfalls include setting epsilon too low (insufficient exploration leading to premature lock-in on a locally optimal variant), not decaying epsilon over time (wasting traffic on unnecessary exploration after the best variant is clear), treating epsilon-greedy results as valid causal estimates (bandit allocation creates biased samples that require correction for inference), and using epsilon-greedy when a more sophisticated algorithm like Thompson Sampling would be substantially more efficient. For experiments where precise causal inference is needed, epsilon-greedy with a fixed allocation ratio (e.g., epsilon = 1, which is equivalent to uniform random assignment) may be more appropriate than a heavily exploitative setting.

Advanced epsilon-greedy variants include contextual epsilon-greedy, which conditions the exploitation decision on user features (choosing the best variant for each user type rather than the globally best variant), epsilon-first (pure exploration for a fixed initial period, then pure exploitation), and adaptive epsilon methods that adjust the exploration rate based on the confidence in current estimates. While epsilon-greedy with a fixed epsilon is theoretically suboptimal compared to Thompson Sampling and UCB (its regret grows linearly rather than logarithmically), the difference is often small in practice for the sample sizes and number of arms typical in digital experiments. For teams that value simplicity and interpretability, epsilon-greedy provides most of the benefit of bandit methods with minimal implementation complexity.
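Of these variants, epsilon-first is the simplest to sketch: a hard switch from uniform exploration to pure exploitation at a fixed decision count. The function name and the choice of passing reward estimates as a list are assumptions for illustration.

```python
import random

def epsilon_first_select(values, t, explore_horizon=1000):
    """Epsilon-first: explore uniformly for the first explore_horizon
    decisions, then exploit the best estimate for the rest of the run.

    values: list of estimated reward rates per variant.
    t: number of decisions made so far.
    """
    if t < explore_horizon:
        return random.randrange(len(values))  # pure exploration phase
    # pure exploitation phase: always the current best estimate
    return max(range(len(values)), key=lambda i: values[i])
```

The hard cutoff makes the exploration budget explicit and easy to reason about, at the cost of never revisiting arms whose early estimates were unlucky.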

Related Terms

Adaptive Experiment

An experiment design that modifies its parameters during execution based on accumulating data, including adjusting traffic allocation between variants, dropping underperforming arms, or modifying the sample size, while maintaining statistical validity through appropriate corrections.

Contextual Bandit Experiment

An adaptive experiment that uses user context (features like demographics, behavior history, and session attributes) to personalize which treatment variant each user receives, learning a policy that maps user characteristics to optimal treatments in real time.

Bayesian Optimization

A sequential decision-making framework that uses a probabilistic model of the objective function to efficiently search for the optimal configuration of parameters, balancing exploration of uncertain regions with exploitation of promising areas.

Multivariate Testing

An experimentation method that simultaneously tests multiple variables and their combinations to determine which combination of changes produces the best outcome, unlike A/B testing which typically varies a single element at a time.

Split Testing

The practice of randomly dividing users into two or more groups and exposing each group to a different version of a product experience to measure which version performs better on a target metric, commonly known as A/B testing.

Holdout Testing

An experimental design where a small percentage of users are permanently excluded from receiving a new feature or set of features, serving as a long-term control group to measure the cumulative impact of product changes over time.