Research note
An Information-Theoretic Lens on Bandits and RL
My thesis is titled An Information-Theoretic Approach to Bandits and Reinforcement Learning. This post explains what that means: what bandits are, why information theory is the right language for studying them, what the key ideas in the framework are, and how my papers extend it to settings where the existing theory broke down.
The explore–exploit dilemma
Many real problems require making decisions repeatedly, under uncertainty, where each decision both produces a payoff and reveals information about the world. A recommendation system chooses which article to show a user, learning from whether they click. A drug trial allocates patients to treatments, updating its beliefs about efficacy as data arrives. A trading system decides which order to place, weighing the benefit of fast execution against learning about market impact.
All of these share a common structure: the agent must balance exploration, taking actions to learn more, against exploitation, taking actions that look best given what is already known. Explore too little and you settle on a suboptimal action without realising better ones exist. Explore too much and you waste time on actions you have already identified as inferior. This is the explore–exploit dilemma, and it is central to every sequential decision problem.
To study this dilemma formally, researchers use mathematical models. The simplest is the multi-armed bandit: an agent repeatedly chooses among a fixed set of actions (arms), each yielding a reward drawn from an unknown distribution. The agent’s goal is to maximise cumulative reward, or equivalently to minimise regret, the total reward lost by not always choosing the best arm. Despite its simplicity, the bandit model captures the essence of learning from evaluative feedback and serves as the theoretical foundation for more complex forms of sequential decision-making.
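As a concrete sketch of the model just described (the arm count and the Gaussian reward assumption here are illustrative choices, not from the thesis), a multi-armed bandit environment fits in a few lines:

```python
import random

class GaussianBandit:
    """K-armed bandit with unknown Gaussian mean rewards (unit variance)."""

    def __init__(self, means):
        self.means = means                # true (hidden) mean reward per arm
        self.best_mean = max(means)       # oracle benchmark used to define regret

    def pull(self, arm):
        """Play an arm and receive a noisy reward."""
        return random.gauss(self.means[arm], 1.0)

    def regret_of(self, arm):
        """Expected reward lost by playing `arm` instead of the best arm."""
        return self.best_mean - self.means[arm]

# A 3-armed instance: arm 2 is optimal, but the agent does not know that.
bandit = GaussianBandit([0.1, 0.5, 0.8])
```

The agent only ever sees the noisy outputs of `pull`; the true means, and hence the regret, are hidden from it.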
Reinforcement learning extends this to settings where actions influence future states, not just immediate rewards. Choosing when to change lanes affects not only what happens now but which traffic situations arise later. The explore–exploit dilemma deepens: you must also reason about the long-term value of the information you gather.
Measuring performance: regret
The standard measure of an algorithm’s performance in bandit and RL problems is regret: the difference between the cumulative reward of an oracle who always plays the optimal action and the cumulative reward the algorithm actually achieved. An algorithm that never learns has regret that grows linearly in the number of rounds. A good algorithm has regret that grows as $\sqrt{T}$, meaning it identifies the best action quickly and exploits it, with the cost of exploration amortised over time.
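The linear-versus-sublinear distinction is easy to see in simulation. The sketch below (the explore-then-commit baseline and all constants are illustrative, not from the thesis) compares a policy that never learns against one that front-loads its exploration:

```python
import random

def run(policy, means, T, seed=0):
    """Simulate a policy for T rounds and return its cumulative pseudo-regret."""
    rng = random.Random(seed)
    best = max(means)
    regret = 0.0
    history = [[] for _ in means]          # observed rewards per arm
    for t in range(T):
        arm = policy(t, history)
        regret += best - means[arm]        # pseudo-regret uses the true means
        history[arm].append(rng.gauss(means[arm], 1.0))
    return regret

def uniform(t, history):
    """Never learns: plays a uniformly random arm every round."""
    return random.randrange(len(history))

def explore_then_commit(t, history, explore_rounds=300):
    """Round-robin exploration, then commit to the empirically best arm."""
    if t < explore_rounds:
        return t % len(history)
    estimates = [sum(h) / len(h) for h in history]
    return estimates.index(max(estimates))

means = [0.1, 0.5, 0.8]
print(run(uniform, means, T=5000))              # grows linearly in T
print(run(explore_then_commit, means, T=5000))  # exploration cost is front-loaded
```

The random policy pays a constant expected regret every round, so its total grows linearly; the committing policy pays only during its exploration phase, after which its per-round regret is (with high probability) zero.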
The goal of bandit theory is to design algorithms that achieve sublinear regret, and to prove lower bounds that characterise how small regret can possibly be. Understanding the gap between what is achievable and what is optimal requires a rigorous language for measuring how difficult a problem is, and this is where information theory enters.
Information theory as a tool for learning
Information theory was originally developed by Claude Shannon in the late 1940s to study how much data can be communicated reliably over noisy channels. Its central objects (entropy, mutual information, and KL divergence) provide a rigorous language for measuring uncertainty and for quantifying how much is learned from an observation.
The connection to learning problems is natural. Before any interaction, the agent is uncertain about which arm is best. This uncertainty is captured by the entropy of the optimal arm under the prior distribution: the more uniformly plausible the arms are, the higher the entropy and the harder the problem. As the agent collects rewards, it gains information: the mutual information between the observed data and the identity of the optimal arm grows, and the agent's uncertainty decreases.
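The entropy of the optimal arm is easy to estimate by Monte Carlo. In this sketch (the Gaussian priors are an illustrative choice, not from the thesis), we sample worlds from the prior and count how often each arm turns out to be the best:

```python
import math, random

def optimal_arm_entropy(prior_sampler, n_samples=100_000, seed=0):
    """Estimate H(A*) by sampling arm means from the prior and counting
    how often each arm turns out to be optimal."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_samples):
        means = prior_sampler(rng)
        best = max(range(len(means)), key=lambda a: means[a])
        counts[best] = counts.get(best, 0) + 1
    probs = [c / n_samples for c in counts.values()]
    return -sum(p * math.log2(p) for p in probs)

K = 4

# Exchangeable prior: every arm is equally likely to be optimal,
# so H(A*) is close to its maximum of log2(K) bits.
flat_prior = lambda rng: [rng.gauss(0.0, 1.0) for _ in range(K)]
print(optimal_arm_entropy(flat_prior))    # close to log2(4) = 2 bits

# A prior that strongly favours arm 0 has much lower entropy: an easier problem.
peaked_prior = lambda rng: [rng.gauss(3.0, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(K - 1)]
print(optimal_arm_entropy(peaked_prior))
```

The two numbers make the "harder problem" claim quantitative: the flat prior starts near the maximum entropy, while the peaked prior starts much closer to zero.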
This framing makes precise the informal idea that exploration should be efficient: a good algorithm should not spend more regret than the information it gains justifies. Formalising this trade-off leads directly to the framework I build on in my thesis.
The Russo–Van Roy framework: the information ratio
The key framework for my thesis was introduced by Russo and Van Roy in 2016.¹ Their central observation is that one can relate the Bayesian regret of a learning algorithm directly to two quantities: how much uncertainty existed at the start (the prior entropy of the optimal arm), and how efficiently the algorithm converts each round of interaction into information about the optimal arm.
They formalise this through the information ratio $\Psi_t$, defined at each round as:
$$\Psi_t = \frac{(\text{expected instantaneous regret})^2}{\text{information gained about the optimal arm}}$$
The numerator is the cost of the current action, how suboptimal it is on average. The denominator is the benefit, how much the observation reduces uncertainty about which arm is best. If this ratio is bounded by some constant $\Gamma$, then the cumulative Bayesian regret over $T$ rounds satisfies:
$$\mathbb{E}[\mathrm{Regret}(T)] \leq \sqrt{\Gamma \cdot T \cdot H(A^*)}$$
where $H(A^*)$ is the prior entropy of the optimal arm. The bound is remarkably clean: regret grows as $\sqrt{T}$ (unavoidable), proportionally to how uncertain you were at the start (also unavoidable), and proportionally to $\sqrt{\Gamma}$, the inefficiency of the algorithm in converting regret into learning.
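The bound follows from a short chain of standard steps (glossing over conditioning details). Writing $r_t$ for the expected instantaneous regret and $I_t$ for the information gained about $A^*$ at round $t$, the definition of the information ratio gives $r_t = \sqrt{\Psi_t I_t}$, and:

$$
\mathbb{E}[\mathrm{Regret}(T)] = \sum_{t=1}^{T} r_t \;\leq\; \sqrt{\Gamma} \sum_{t=1}^{T} \sqrt{I_t} \;\leq\; \sqrt{\Gamma \, T \sum_{t=1}^{T} I_t} \;\leq\; \sqrt{\Gamma \cdot T \cdot H(A^*)}
$$

The second inequality is Cauchy–Schwarz, and the last holds because the total information gathered about $A^*$ over all rounds can never exceed the initial uncertainty $H(A^*)$, by the chain rule for mutual information.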
An algorithm with a small information ratio is one that never incurs regret without learning something. It is, in a precise sense, making good use of every observation.
Thompson Sampling: an algorithm with a naturally small information ratio
Once we have the information ratio as the key quantity, a natural question is: which algorithms have small information ratios? This is where Thompson Sampling enters the picture.
Thompson Sampling, originally proposed by William Thompson in 1933 and rediscovered empirically in 2011,² is conceptually simple. At each round, instead of point-estimating which arm is best, the agent maintains a full posterior distribution over the model parameters. It then acts by sampling a plausible world from this posterior and playing the action that would be optimal in that world.
This is neither greedy (always play the current best estimate) nor uniformly exploratory (try all arms equally). It is something more principled: exploration as a direct consequence of uncertainty. The agent explores precisely because it is uncertain, and it stops exploring as its posterior concentrates on the true parameter.
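The procedure above can be sketched in a few lines for Bernoulli rewards, where Beta priors make the posterior update exact (the two-armed test problem is an illustrative choice, not from the thesis):

```python
import random

class BernoulliTS:
    """Thompson Sampling for Bernoulli rewards with Beta(1, 1) priors."""

    def __init__(self, n_arms, seed=0):
        self.rng = random.Random(seed)
        self.alpha = [1.0] * n_arms   # posterior pseudo-counts of successes
        self.beta = [1.0] * n_arms    # posterior pseudo-counts of failures

    def select_arm(self):
        # Sample one plausible world (a mean per arm) from the posterior...
        samples = [self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        # ...and play the arm that is optimal in that sampled world.
        return samples.index(max(samples))

    def update(self, arm, reward):
        # Conjugate Beta-Bernoulli update keeps the posterior exact.
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward

# Toy problem: arm 1 pays off 60% of the time, arm 0 only 30%.
true_means = [0.3, 0.6]
env = random.Random(42)
agent = BernoulliTS(n_arms=2)
pulls = [0, 0]
for _ in range(2000):
    arm = agent.select_arm()
    reward = 1 if env.random() < true_means[arm] else 0
    agent.update(arm, reward)
    pulls[arm] += 1
print(pulls)   # the optimal arm dominates as the posterior concentrates
```

Early on, wide posteriors make both arms plausible and the agent explores; as evidence accumulates, the posterior for arm 1 concentrates above that of arm 0 and exploration dies out on its own, with no exploration schedule to tune.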
Russo and Van Roy showed that Thompson Sampling has a provably bounded information ratio. The argument is elegant: because Thompson Sampling samples its action from the same distribution as the optimal arm under the posterior, any regret it incurs reflects genuine uncertainty about which arm is best, and that uncertainty is precisely what gets resolved by observing the reward. The two sides of the ratio are fundamentally coupled, and neither can dominate the other by too much.
For a standard multi-armed bandit with $|\mathcal{A}|$ arms, the information ratio of Thompson Sampling is bounded by $2|\mathcal{A}|$, yielding a regret bound of $O(\sqrt{|\mathcal{A}|\, T \log |\mathcal{A}|})$. This recovers known optimal rates and provides a clean proof via the information-ratio machinery.
What the framework leaves open
The Russo–Van Roy framework is powerful and general, but in its original form it leaves important questions unanswered. My thesis addresses several of them.
The regret bound depends on the entropy of the optimal arm. For a finite set of arms this is at most $\log |\mathcal{A}|$, which is fine. But for continuous action spaces, say a ball in $\mathbb{R}^d$, this entropy is infinite. The original framework simply does not apply. Yet continuous action spaces arise naturally in linear bandits, where actions are feature vectors and rewards are linear functions of an unknown parameter. A different proof technique is needed.
The framework also does not immediately handle contextual bandits, where a context (user features, current state) is observed at each round and the optimal action depends on it. The identity of the optimal arm changes every round, so the information ratio as originally defined is not the right object to analyse.
And when rewards are not sub-Gaussian but follow a nonlinear model, such as the logistic model for binary outcomes, the information ratio can in principle depend badly on the model’s parameters. Previous analyses produced bounds that scaled exponentially with the logistic slope, which is both theoretically unsatisfying and empirically wrong.
My contributions: extending the framework
My papers address each of these limitations, extending the information-theoretic framework to richer and more structured settings.
Linear bandits with continuous action spaces (Papers A & B). When the action set is a compact subset of $\mathbb{R}^d$ and rewards are linear in the parameter, the prior entropy of the optimal arm is infinite. The fix is to work with a quantised approximation of the optimal arm at resolution $\varepsilon$, paying an additive approximation error of $\varepsilon T$ in regret. The information ratio of Thompson Sampling on this compressed problem is bounded by the problem dimension $d$, exploiting the fact that similar actions give similar rewards, so you only need to resolve uncertainty at scale $\varepsilon$. Optimising over $\varepsilon$ recovers the near-optimal rate $O(d\sqrt{T} \log T)$ and extends it to general continuous action spaces for the first time.
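Schematically (suppressing constants), the quantisation argument trades off two terms. Discretising the action set at resolution $\varepsilon$ leaves on the order of $(1/\varepsilon)^d$ candidate arms, so the entropy is finite again:

$$
\mathbb{E}[\mathrm{Regret}(T)] \;\lesssim\; \varepsilon T + \sqrt{\Gamma \cdot T \cdot H(A^*_\varepsilon)}, \qquad H(A^*_\varepsilon) \leq d \log(1/\varepsilon)
$$

With $\Gamma \lesssim d$ for Thompson Sampling on the quantised problem, choosing $\varepsilon$ on the order of $1/\sqrt{T}$ balances the approximation and learning terms and yields the $O(d\sqrt{T} \log T)$ rate.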
Contextual bandits with general rewards (Paper C). When a context is observed before each decision, the right object is the lifted information ratio, which conditions both the regret and the information gain on the observed context. This captures the fact that the optimal arm changes each round: what matters is how much the observation reduces uncertainty about the model parameter $\theta$, not directly about the optimal arm. Paper C extends this analysis to sub-Gaussian and bounded rewards, generalising a result of Neu et al. that covered only binary outcomes, and provides simplified proofs using classical information-theoretic identities.
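Concretely, one way to write the lifted ratio (following the description above, with $X_t$ the context, $A_t$ the action, and $Y_t$ the reward) is:

$$
\Psi^{\mathrm{lift}}_t = \frac{\left(\text{expected instantaneous regret}\right)^2}{I_t\!\left(\theta;\, (A_t, Y_t) \mid X_t\right)}
$$

Because $\theta$ determines the optimal action in every context, information gained about $\theta$ pays off across all future rounds, not just the current one.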
Logistic bandits: resolving a decade-old conjecture (Paper D). The logistic bandit is the natural model for binary rewards (click or no click, success or failure), with reward probability given by the logistic function of the arm–parameter inner product. All prior regret bounds scaled exponentially with the logistic slope $\sigma$, which is backwards: a larger $\sigma$ makes the problem easier, not harder. Dong & Van Roy had conjectured in 2018 that the Thompson Sampling information ratio should depend only on the dimension $d$, independently of $\sigma$. Paper D proves this conjecture. The key idea is to lower-bound the information gain using the variance of the reward probability rather than its expectation, and to show that the resulting ratio of variances is monotone in $\sigma$ and converges to a universal constant as $\sigma \to \infty$. The resulting regret bound is $O\!\left(\frac{d}{\kappa}\sqrt{T} \log\frac{\kappa T}{d}\right)$, with $\kappa$ the usual slope-dependent constant of the logistic model; the dependence on $\sigma$ is only logarithmic, and the bound is independent of the number of actions.
Beyond bandits: Bayesian reinforcement learning and offline learning (Papers E & F). The final part of the thesis asks whether the information-theoretic ideas developed for bandits extend to full reinforcement learning, where actions influence future states. Paper E establishes a framework for analysing the minimum Bayesian regret in Markov decision processes. It uses a data-processing inequality to show that any transformation of the agent's knowledge can only decrease achievable performance, which bounds the inherent difficulty of RL problems in terms of information-theoretic divergences. Paper F turns to the offline setting, where a fixed dataset of logged interactions must be used to select a policy; it derives parameter-free PAC-Bayesian bounds that achieve optimal rates, connecting directly to relative-entropy policy search methods such as TRPO and PPO.
The broader picture
What makes the information-theoretic approach appealing is that it gives clean, interpretable answers to the question of why some problems are hard. The regret bounds derived through information ratios identify precisely which features of a problem drive the sample complexity: the prior uncertainty (entropy), the structural relationship between actions and rewards (dimension, alignment), and the quality of the information gathered per round. This is not just about proving tight bounds; it is about understanding what makes learning efficient.
The framework also has practical implications. Algorithms like Information Directed Sampling are built by directly minimising the information ratio at each step, selecting actions that promise the greatest reduction in uncertainty per unit of regret. The theory provides both the justification and the objective for such algorithms.
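A minimal sketch of that idea (the two-action example and the grid search are illustrative choices, not from the IDS papers): Russo and Van Roy show the ratio-minimising distribution mixes between at most two actions, so given per-action regret and information-gain estimates, a search over pairs and mixing weights suffices.

```python
def ids_distribution(deltas, gains, grid=1000):
    """Information-Directed Sampling step: find the action distribution
    minimising (expected regret)^2 / (expected information gain).

    deltas[a]: estimated expected regret of action a
    gains[a]:  estimated information gain of action a

    Since the minimiser is supported on at most two actions, a grid
    search over pairs and mixing weights is enough for this sketch.
    """
    n = len(deltas)
    best_ratio, best_dist = float("inf"), None
    for a in range(n):
        for b in range(n):
            for k in range(grid + 1):
                p = k / grid
                regret = p * deltas[a] + (1 - p) * deltas[b]
                gain = p * gains[a] + (1 - p) * gains[b]
                if gain <= 0:
                    continue          # no information: ratio undefined
                ratio = regret * regret / gain
                if ratio < best_ratio:
                    dist = [0.0] * n
                    dist[a] += p
                    dist[b] += 1 - p
                    best_ratio, best_dist = ratio, dist
    return best_dist

# Arm 0 is nearly optimal but barely informative; arm 1 is costly but
# informative. IDS deliberately randomises between them.
deltas = [0.1, 0.5]
gains = [0.05, 1.0]
print(ids_distribution(deltas, gains))
```

Neither pure action minimises the ratio here: IDS mostly plays the low-regret arm but keeps a deliberate probability of the informative one, which is exactly the regret-for-information trade the theory prescribes.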
If you want to go deeper, the natural starting points are the Russo–Van Roy paper (JMLR 2016) and the tutorial on Thompson Sampling by Russo, Van Roy, Kazerouni, Osband, and Wen. For the extension to contextual settings, Neu et al. (NeurIPS 2022) is the key reference. And for the information-theoretic analysis of offline learning, the PAC-Bayesian literature starting from Catoni (2007) provides the foundation.
1. Russo, D. and Van Roy, B. (2016). “An information-theoretic analysis of Thompson sampling.” Journal of Machine Learning Research, 17(68), 2442–2471.
2. Thompson, W.R. (1933). “On the likelihood that one unknown probability exceeds another in view of the evidence of two samples.” Biometrika, 25(3–4), 285–294. The rediscovery paper is Chapelle, O. and Li, L. (2011). “An Empirical Evaluation of Thompson Sampling.” NeurIPS.