
Research note

What are Bandits, and Why Does Thompson Sampling Work?

The multi-armed bandit is the cleanest mathematical model of learning under uncertainty. You repeatedly choose among actions with unknown rewards, and every choice plays a dual role: it earns payoff now, but it also teaches you something about what to do next. That tension between exploration and exploitation is what makes bandits interesting.

This note focuses on one particular line of ideas: the classic bandit problem, the contrast between UCB and Thompson Sampling, the information-theoretic framework of Russo and Van Roy, and what later work showed about where Thompson Sampling is strong and where it can fail in the frequentist setting.

The bandit problem

The standard mental picture is a row of slot machines. Each arm has an unknown mean reward. At each round you choose one arm, observe a reward, and update your beliefs. If you always knew the best arm, the problem would be trivial. The difficulty is that learning which arm is best requires trying arms that may not be best.

Performance is measured by regret, the reward lost relative to an oracle that always plays the optimal arm. If arm means are $\mu_a$ and the optimal mean is $\mu^*$, then

$$\mathrm{Regret}(T) = \sum_{t=1}^T \bigl(\mu^* - \mu_{A_t}\bigr).$$

Good algorithms have regret growing much slower than $T$, ideally on the order of $\sqrt{T}$ up to logarithmic factors. The per-round price of learning then vanishes over time: the algorithm increasingly exploits what it has learned.
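As a tiny illustration of the definition (with hypothetical arm means, not taken from any paper), a policy that ignores feedback and picks arms uniformly at random pays a constant expected price per round, so its regret grows linearly in $T$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-armed bandit with (unknown) mean rewards.
mu = np.array([0.3, 0.5, 0.7])
mu_star = mu.max()

# A policy that ignores feedback: uniformly random arm choices A_t.
actions = rng.integers(0, 3, size=1000)

# Regret(T) = sum_t (mu* - mu_{A_t}), the reward lost to the oracle.
# For uniform play this grows linearly, about 0.2 per round here.
regret = np.sum(mu_star - mu[actions])
```

Any algorithm with $\sqrt{T}$-type regret must instead concentrate its pulls on the best arm as evidence accumulates.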

UCB: optimism as a principle

A purely greedy strategy fails because early noise can trap you. If one arm gets lucky on its first few pulls, greed may keep choosing it forever. The classical repair is optimism in the face of uncertainty.

In Upper Confidence Bound methods, each arm is scored by an estimate plus an exploration bonus:

$$\text{score}_a(t) = \widehat{\mu}_a(t) + \text{bonus}_a(t).$$

Arms with uncertain estimates get larger bonuses, so the algorithm is systematically drawn toward options that could plausibly be better than they currently look. UCB is attractive because the logic is transparent and the guarantees are frequentist: they hold uniformly over all problem instances in a class, not just on average under a prior.[1]

The price is that UCB can feel somewhat rigid. You have to design the right bonus, and in richer models those bonuses can become technically delicate.
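As a sketch of the recipe (a minimal UCB1 in the style of Auer et al., run on hypothetical Bernoulli arms), the score is the empirical mean plus a bonus that shrinks as an arm accumulates pulls:

```python
import math
import random

def ucb1(arms, horizon):
    """Minimal UCB1 sketch. Each arms[a]() returns a reward in [0, 1]."""
    k = len(arms)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # cumulative reward per arm
    for t in range(1, horizon + 1):
        if t <= k:        # initialization: pull every arm once
            a = t - 1
        else:
            # score_a = empirical mean + sqrt(2 ln t / n_a):
            # rarely pulled arms get larger exploration bonuses.
            a = max(range(k),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(t) / counts[i]))
        r = arms[a]()
        counts[a] += 1
        sums[a] += r
    return counts

# Hypothetical Bernoulli arms with means 0.2 and 0.8.
arm_rng = random.Random(1)
counts = ucb1([lambda: float(arm_rng.random() < 0.2),
               lambda: float(arm_rng.random() < 0.8)],
              horizon=2000)
```

Over 2000 rounds the bad arm is still pulled occasionally (its bonus never reaches zero), but the vast majority of pulls go to the better arm.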

Thompson Sampling: exploration by posterior sampling

Thompson Sampling takes a different route. Instead of building an upper confidence bound, it maintains a posterior distribution over unknown parameters and samples one plausible model from that posterior. It then acts greedily in the sampled model.[2]

This gives a beautifully simple behavior. If the algorithm is nearly sure which arm is best, the posterior sample will almost always agree and it will exploit. If several arms still look plausible, they will each be sampled with noticeable probability and the algorithm will explore. Exploration is not added by hand; it emerges from uncertainty.

This is one reason Thompson Sampling is so appealing in practice. It is easy to describe, easy to implement in many Bayesian models, and often performs strikingly well empirically. But it raises a natural question: why should sampling from a posterior automatically produce good regret?
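A minimal sketch of this loop for Bernoulli arms with Beta(1, 1) priors (hypothetical arms; the conjugate Beta-Bernoulli pairing keeps the posterior update closed-form):

```python
import random

def thompson_bernoulli(arms, horizon, rng=None):
    """Beta-Bernoulli Thompson Sampling sketch.

    Each arms[a]() returns a 0/1 reward; every arm starts at Beta(1, 1).
    """
    rng = rng or random.Random(0)
    k = len(arms)
    alpha = [1] * k   # 1 + observed successes
    beta = [1] * k    # 1 + observed failures
    counts = [0] * k
    for _ in range(horizon):
        # Sample one plausible mean per arm from its posterior,
        # then act greedily in the sampled model.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(k)]
        a = max(range(k), key=lambda i: samples[i])
        r = arms[a]()
        alpha[a] += r
        beta[a] += 1 - r
        counts[a] += 1
    return counts

# Hypothetical Bernoulli arms with means 0.3 and 0.6.
arm_rng = random.Random(1)
counts = thompson_bernoulli([lambda: int(arm_rng.random() < 0.3),
                             lambda: int(arm_rng.random() < 0.6)],
                            horizon=2000)
```

Notice there is no explicit bonus term anywhere: while both posteriors are wide, either arm can win the sampling step, and exploration happens on its own.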

The Russo–Van Roy framework

The cleanest answer comes from the information-theoretic analysis of Russo and Van Roy.[3] Their key object is the information ratio, which compares what the algorithm loses at a round to what it learns from that round:

$$\Psi_t = \frac{(\text{expected instantaneous regret})^2}{\text{information gained about the best arm}}.$$

If this ratio is small, the algorithm is paying regret efficiently: whenever it sacrifices reward, it is buying information that helps future decisions. Russo and Van Roy showed that if $\Psi_t \le \Gamma$ uniformly, then the cumulative Bayesian regret satisfies

$$\mathbb{E}[\mathrm{Regret}(T)] \leq \sqrt{\Gamma \cdot T \cdot H(A^*)},$$

where $H(A^*)$ is the entropy of the optimal arm under the prior. This is an elegant formula. Regret depends on three things: the time horizon, how uncertain you initially were about the optimum, and how efficiently the algorithm converts regret into information.

The remarkable part is that Thompson Sampling naturally has a small information ratio in many settings. Intuitively, it chooses an action because that action looks optimal in a plausible world. If the action turns out to be suboptimal, that can only be because there was real uncertainty about what was best, and observing the reward helps resolve exactly that uncertainty. Regret and information gain are therefore tightly linked.

This framework unified a wide range of problems: ordinary multi-armed bandits, full-information problems, linear bandits, and semi-bandit settings. It also highlighted something conceptually important: not all uncertainty matters equally. What matters is not the entropy of the whole parameter, but the uncertainty about the optimal action.
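To get a feel for the bound's scale: in the classic $K$-armed case Russo and Van Roy bound the information ratio of Thompson Sampling by $K/2$, and a uniform prior over the optimal arm has entropy $\ln K$. A small sketch (the helper name is illustrative) plugs these defaults in:

```python
import math

def bayesian_regret_bound(k, horizon, gamma=None, entropy=None):
    """Evaluate sqrt(Gamma * T * H(A*)) from the Russo-Van Roy bound.

    Defaults assume the classic K-armed case (Gamma = K/2) and a
    uniform prior over the optimal arm (H(A*) = ln K).
    """
    gamma = k / 2 if gamma is None else gamma
    entropy = math.log(k) if entropy is None else entropy
    return math.sqrt(gamma * horizon * entropy)

# 10 arms, 10,000 rounds, uniform prior over the optimum:
# the bound is a few hundred, a small fraction of the horizon.
bound = bayesian_regret_bound(k=10, horizon=10_000)
```

Because the bound scales as $\sqrt{T}$, quadrupling the horizon only doubles it: the per-round regret guarantee shrinks as the algorithm learns.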

What the framework does not give you

The Russo–Van Roy theory is Bayesian. Its guarantees are expectations under a prior, and its complexity term is the entropy of the optimal action under that prior. That makes the theory insightful, but it is not the same as a worst-case frequentist guarantee.

In many applications, that distinction matters. A Bayesian bound may say that an algorithm is excellent on average under a prior, while a frequentist analysis asks a harsher question: if I fix an arbitrary problem instance in this class, how bad can the regret be? For Thompson Sampling, this turned out to be more subtle than the early Bayesian story suggested.

The issue is not that Thompson Sampling becomes a bad idea in general. The issue is that posterior sampling alone does not automatically guarantee the kind of optimism or coverage that frequentist analyses often need.

Frequentist limitations of Thompson Sampling

A sharp warning comes from work by Raymond Zhang and Richard Combes on combinatorial semi-bandits.[4] They show that standard Thompson Sampling can be suboptimal in high dimensions: its regret can scale exponentially in the ambient dimension on some instances, and its minimax regret can be almost linear in $T$.

Conceptually, this is striking. Thompson Sampling is often praised for being automatically exploratory, but these examples show that its exploration can still be too weak in the wrong geometry. In their constructions, the algorithm can become overly committed to an apparently good region and take exponentially long to discover the truly optimal decision. Even adding a fixed amount of forced exploration at the start does not fix the problem.

This does not contradict the Bayesian analysis. It clarifies its scope. The information-ratio framework explains why Thompson Sampling can be efficient under a prior; it does not imply that posterior sampling is always minimax-optimal in worst-case frequentist terms.

Frequentist limitation: Zhang, R. and Combes, R., “On the Suboptimality of Thompson Sampling in High Dimensions.” The paper shows that in some combinatorial semi-bandit problems, Thompson Sampling can have exponential dependence on the dimension, and that a fixed amount of forced exploration does not remove the issue.

Feel-Good Thompson Sampling

A different response comes from Tong Zhang’s work on Feel-Good Thompson Sampling for contextual bandits and reinforcement learning.[5] The starting point is closely related: in the frequentist setting, standard Thompson Sampling may simply not be aggressive enough in exploring actions that look optimistic but are not yet well supported by the data.

The proposed fix is to bias the sampling procedure toward models that look historically optimistic. Instead of sampling according to posterior plausibility alone, the algorithm adds a “feel-good” term that favors models predicting high achievable reward on the data seen so far. This makes the method more explicitly optimistic, while still retaining the posterior-sampling flavor of Thompson Sampling.

The interesting point is not only the algorithmic tweak, but the interpretation. Feel-Good Thompson Sampling suggests that the gap between Bayesian elegance and frequentist guarantees can be bridged by injecting a controlled form of optimism into posterior sampling. In Tong Zhang’s analysis, this leads to minimax-optimal frequentist regret bounds in finite-action contextual bandits, and extends to linearly embeddable contextual bandits as well.
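As a stylized sketch of this idea (not the paper’s exact construction; the names and the precise form of the optimism bonus here are illustrative), consider a finite family of candidate mean vectors and a posterior tilted by a term that rewards each model’s best achievable mean:

```python
import math
import random

def feel_good_sample(models, history, lam, rng):
    """Sample a model from a 'feel-good'-tilted posterior (stylized sketch).

    models: finite list of candidate mean vectors theta (entries in (0, 1)).
    history: list of (arm, reward) pairs with 0/1 rewards.
    Log-weight = Bernoulli log-likelihood of the history under theta
    plus lam * (rounds so far) * max_a theta[a], which tilts sampling
    toward optimistic models, roughly in the spirit of Feel-Good TS.
    """
    log_w = []
    for theta in models:
        loglik = sum(math.log(theta[a] if r else 1 - theta[a])
                     for a, r in history)
        log_w.append(loglik + lam * len(history) * max(theta))
    m = max(log_w)  # subtract the max for numerical stability
    weights = [math.exp(w - m) for w in log_w]
    return rng.choices(models, weights=weights)[0]

rng = random.Random(0)
models = [[0.2, 0.7], [0.6, 0.3], [0.5, 0.5]]   # hypothetical candidates
history = [(0, 1), (0, 1), (1, 0)]              # (arm, reward) pairs
theta = feel_good_sample(models, history, lam=0.1, rng=rng)
a = max(range(len(theta)), key=lambda i: theta[i])  # act greedily
```

With `lam = 0` this reduces to ordinary posterior sampling; larger `lam` makes optimistic models survive longer even when the data mildly disfavors them, which is the mechanism that restores exploration.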

Frequentist repair: Tong Zhang, “Feel-Good Thompson Sampling for Contextual Bandits and Reinforcement Learning.” The paper argues that standard Thompson Sampling is not sufficiently exploratory in some worst-case frequentist settings, and shows how an optimism-leaning modification recovers strong regret guarantees.

What to take away

UCB and Thompson Sampling solve the same exploration problem in two different ways. UCB adds optimism explicitly through bonuses. Thompson Sampling samples plausible worlds and lets uncertainty create exploration implicitly. The Russo–Van Roy framework explains beautifully why the second idea works so well in Bayesian terms: regret is acceptable when it buys information about the optimum.

But later work showed that this is not the end of the story. In the frequentist setting, Thompson Sampling can under-explore in structured high-dimensional problems. That is the core lesson of Raymond Zhang and Combes. Tong Zhang’s Feel-Good Thompson Sampling pushes in the opposite direction: keep the posterior-sampling intuition, but add enough optimism to recover stronger worst-case guarantees.

That, to me, is what makes the theory of bandits so appealing. Even an algorithm as simple and successful as Thompson Sampling still teaches us something subtle: uncertainty alone is often enough, but not always. The difference between “usually works” and “provably robust” is where much of the theory lives.

  1. For the classical UCB perspective, see Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). “Finite-time Analysis of the Multiarmed Bandit Problem.” Machine Learning, 47, 235–256.
  2. Thompson, W. R. (1933). “On the likelihood that one unknown probability exceeds another in view of the evidence of two samples.” Biometrika, 25(3–4), 285–294. A modern empirical revival is Chapelle, O. and Li, L. (2011). “An Empirical Evaluation of Thompson Sampling.” NeurIPS.
  3. Russo, D. and Van Roy, B. (2014/2016). “An Information-Theoretic Analysis of Thompson Sampling.” First circulated in 2014 and published in JMLR in 2016.
  4. Zhang, R. and Combes, R. (2021). “On the Suboptimality of Thompson Sampling in High Dimensions.”
  5. Zhang, T. (2021). “Feel-Good Thompson Sampling for Contextual Bandits and Reinforcement Learning.”