Research note
The Logistic Bandit Problem
This post gives the intuition behind my paper “An Information-Theoretic Analysis of Thompson Sampling for Logistic Bandit Problems”, published in TMLR (02/2026). The paper resolves an open question that had been explicitly stated as a conjecture by Dong & Van Roy in 2018, and which traces back to a challenge posed by McMahan & Streeter in 2012: can we prove that Thompson Sampling behaves well on logistic bandits without paying an exponential price in $\beta$?
The setup
In the logistic bandit, at each round you pick an action $a$ from some set and receive a binary reward with probability given by a logistic function of the inner product $\langle a, \theta \rangle$:
$$P(\text{reward} = 1) = \phi_\beta(\langle a, \theta \rangle) = \frac{e^{\beta \langle a,\, \theta \rangle}}{1 + e^{\beta \langle a,\, \theta \rangle}}$$
Following the notation in the paper, $\theta \in \mathcal{O}$ is the unknown parameter, $\beta > 0$ is the known logistic slope, and both the action space $\mathcal{A}$ and parameter space $\mathcal{O}$ lie in the $d$-dimensional unit ball. A geometric quantity that appears throughout the analysis is the minimax alignment constant
$$\alpha := \min_{\theta \in \mathcal{O}} \max_{a \in \mathcal{A}} \langle a, \theta \rangle,$$
which measures how well the action space aligns, in the worst case, with the parameter space. The goal is to learn which action is best while minimising cumulative regret.
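To make the setup concrete, here is a minimal simulation sketch of Thompson Sampling on a toy logistic bandit. Everything in it is an illustrative assumption rather than the paper's setting: the action and parameter sets are small random grids on the unit sphere, and the posterior is maintained exactly over the parameter grid.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x, beta):
    """Logistic link: P(reward = 1) = phi_beta(<a, theta>)."""
    return 1.0 / (1.0 + np.exp(-beta * x))

# Hypothetical toy instance: finite action and parameter grids on the unit sphere.
d, beta, T = 3, 5.0, 2000
actions = rng.normal(size=(20, d))
actions /= np.linalg.norm(actions, axis=1, keepdims=True)
thetas = rng.normal(size=(50, d))
thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)

true_theta = thetas[0]            # the environment's parameter
log_post = np.zeros(len(thetas))  # log-posterior over the grid (uniform prior)
best_mean = phi(actions @ true_theta, beta).max()

regret = 0.0
for t in range(T):
    # Thompson Sampling: draw theta from the posterior, act greedily on the draw.
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    theta_sample = thetas[rng.choice(len(thetas), p=post)]
    a = actions[np.argmax(actions @ theta_sample)]  # phi_beta is increasing
    # Environment feedback: Bernoulli reward with logistic mean.
    p = phi(a @ true_theta, beta)
    reward = rng.random() < p
    regret += best_mean - p
    # Exact Bayes update on the discrete grid.
    probs = phi(thetas @ a, beta)
    log_post += np.log(probs) if reward else np.log(1.0 - probs)

print(f"average per-round regret over {T} rounds: {regret / T:.4f}")
```

With a discrete grid the posterior update is exact, which keeps the sketch short; practical implementations approximate the posterior (e.g. with a Laplace approximation) instead.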
Logistic bandits are a natural model whenever the feedback is binary but depends on features. In a recommender system, an action can represent an item or a ranking policy, and the reward is whether the user clicks or not. In click-through-rate prediction and personalised advertising, the learner chooses which content to show and observes a yes/no response. Similar models also appear in spam email detection, where one wants to classify messages using contextual features, and in medical decision problems with binary outcomes such as success or failure of a treatment.
In all of these examples, the logistic link is a standard way of turning a score $\langle a, \theta \rangle$ into a probability. So logistic bandits are not an artificial variant of the linear bandit problem: they are one of the canonical models for sequential decision-making with binary rewards. That is exactly why it was so striking that the theory remained stuck for so long.
The slope problem
Prior bounds for this problem, whether for UCB or Thompson Sampling, either scaled exponentially with the slope $\beta$ or, when they avoided the exponential, paid instead in the cardinality of the action space. The best frequentist result of Faury et al. (2020) already had the right leading-order term $d\sqrt{T}\log T$, but it still carried a lower-order term growing like $e^\beta$. Previous Bayesian bounds were worse, with the exponential dependence appearing in the dominant term.
This is frustrating because empirically, larger $\beta$ tends to make the problem easier. As $\beta$ grows, the logistic curve becomes steeper around zero, so small differences in $\langle a, \theta \rangle$ produce clearer reward differences. But globally the same curve also develops exponentially flat regions near the tails, and that is where the bad theory was hiding.
Dong & Van Roy (2018) made this precise by conjecturing that the Thompson Sampling information ratio for logistic bandits should be bounded by a constant times d, independent of $\beta$ and of the number of actions. Their own analysis could not prove it, and direct adaptation of linear-bandit arguments produced bounds that explode exponentially in $\beta$. The conjecture sat open for years.
Where the exponential comes from
To understand the difficulty, recall how the information ratio is defined: it is the squared expected regret divided by the mutual information gained about the optimal action. Bounding it requires upper-bounding the squared regret and lower-bounding the information gain.
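For concreteness, in the standard form of Russo & Van Roy's analysis (with the conditioning on the history implicit in $\mathbb{E}_t$ and $I_t$), the information ratio at round $t$ reads

$$\Gamma_t \;:=\; \frac{\bigl(\mathbb{E}_t[\Delta_t]\bigr)^2}{I_t\bigl(A^\star;\,(A_t, Y_t)\bigr)},$$

where $\Delta_t$ is the instantaneous regret of the action $A_t$ played at round $t$, $Y_t$ is the observed reward, and $A^\star$ is the optimal action.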
The information gain comes from the reward signal, which is a Bernoulli random variable. To lower-bound how much a Bernoulli outcome tells you about the parameter, the standard approach links it to the variance of the reward probability: if the probability is uncertain (high variance), observing the outcome is informative.
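One standard inequality behind this variance-based route is Pinsker's inequality specialised to Bernoulli distributions, $\mathrm{KL}(\mathrm{Ber}(p)\,\|\,\mathrm{Ber}(q)) \geq 2(p-q)^2$: if the reward probability fluctuates, the observed outcomes carry information. (The paper's actual lower bound may be stated differently; this is just a quick numerical check of the underlying inequality.)

```python
import numpy as np

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), in nats."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Pinsker's inequality for Bernoulli: KL(p || q) >= 2 (p - q)^2.
grid = np.linspace(0.01, 0.99, 99)
P, Q = np.meshgrid(grid, grid)
gap = kl_bernoulli(P, Q) - 2.0 * (P - Q) ** 2
print(f"minimum of KL - 2(p-q)^2 over the grid: {gap.min():.3e}")  # nonnegative
```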
The key issue is that a direct translation of linear-bandit proofs compares reward gaps to inner-product gaps using global derivative bounds on $\phi_\beta$. In the numerator, one upper-bounds differences with the largest derivative $$L_2 = \sup_{x \in [-1,1]} |\phi_\beta'(x)| = \beta/4,$$ while in the denominator one lower-bounds them with the smallest derivative on the same interval, $$L_1 = \inf_{x \in [-1,1]} |\phi_\beta'(x)| = \frac{\beta e^\beta}{(1+e^\beta)^2} \asymp \beta e^{-\beta}.$$ The bad conditioning is therefore the ratio $L_2/L_1$, denoted $\kappa$ in Faury et al., and this ratio grows exponentially in $\beta$ precisely because the logistic curve has exponentially flat regions when $\beta$ is large.
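A few lines of arithmetic make the blow-up of $\kappa = L_2/L_1$ tangible (closed form: $\kappa = (1+e^\beta)^2 / (4e^\beta)$):

```python
import numpy as np

def kappa(beta):
    """Conditioning ratio L2/L1 of phi_beta' on [-1, 1]."""
    L2 = beta / 4.0                                    # largest derivative, at x = 0
    L1 = beta * np.exp(beta) / (1 + np.exp(beta))**2   # smallest derivative, at x = ±1
    return L2 / L1

for beta in [1.0, 5.0, 10.0, 20.0]:
    print(f"beta = {beta:5.1f}   kappa = {kappa(beta):.3e}")
```

Already at $\beta = 10$ the ratio is in the thousands, so any bound carrying a factor $\kappa$ is useless in the steep-slope regime.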
The two key ideas
The proof of the main result, an information ratio bounded by $d(4/\alpha)^2$, rests on two innovations.
Variance instead of expectation. Rather than bounding the squared expected regret directly through expected reward gaps, the analysis bounds it through the variance of the regret. The information gain is then lower-bounded by twice the variance of the reward probability. This puts both sides of the information ratio into a variance form and avoids the disastrous comparison between the largest and smallest derivatives of $\phi_\beta$.
Taking $\beta \to \infty$ as a uniform bound. Even after switching to the variance formulation, the relevant ratio of variances still depends on $\beta$ a priori. The second innovation is to show that this ratio is monotone increasing in $\beta$, so its supremum over all $\beta \geq 0$ is achieved in the limit $\beta \to \infty$, where the logistic function becomes a step function. In that limit, the ratio is bounded by the pure constant $4$, which is what makes the information-ratio bound uniform.
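The limiting object is easy to visualise numerically: away from $x = 0$, $\phi_\beta$ converges pointwise to the step function $\mathbf{1}\{x > 0\}$ as $\beta$ grows. (This snippet only illustrates the limit, not the monotonicity argument itself.)

```python
import numpy as np

def phi(x, beta):
    """Logistic link with slope beta."""
    return 1.0 / (1.0 + np.exp(-beta * x))

# As beta grows, phi_beta approaches the step function 1{x > 0}
# at any fixed x != 0, the limit used for the uniform bound.
x = np.array([-0.5, -0.1, 0.1, 0.5])
step = (x > 0).astype(float)
for beta in [1.0, 10.0, 100.0]:
    err = np.abs(phi(x, beta) - step).max()
    print(f"beta = {beta:6.1f}   max |phi - step| = {err:.2e}")
```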
The result
Combining the variance bound on the information ratio with a quantisation argument to handle continuous parameter spaces, the main theorem gives:
$$\mathbb{E}[\mathrm{Regret}(T)] \leq \frac{4d}{\alpha} \sqrt{T \cdot \log\!\left(3 + 6\sqrt{\frac{\beta T}{2d}}\right)}$$
The dependence on $\beta$ is only logarithmic, matching the intuition that larger $\beta$ makes the problem easier. The bound is independent of the cardinality of the action space, so it applies to continuous action sets. And it resolves the conjecture of Dong & Van Roy: the information ratio is indeed $O(d)$, with no fragility dimension needed.
The constant $\alpha = \min_{\theta \in \mathcal{O}} \max_{a \in \mathcal{A}} \langle a, \theta \rangle$ is the minimax alignment constant from the paper. It measures the worst-case alignment between the action and parameter spaces. When the action space contains the parameter space so that $\alpha = 1$, the bound simplifies to $O(d\sqrt{T\log(\beta T/d)})$, matching the linear-bandit minimax rate up to logarithmic factors.
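Evaluating the bound numerically shows just how mild the $\beta$ dependence is; multiplying $\beta$ by 100 barely moves the bound (the instance sizes below are arbitrary choices for illustration):

```python
import numpy as np

def regret_bound(T, d, beta, alpha=1.0):
    """The paper's regret bound (4d/alpha) * sqrt(T log(3 + 6 sqrt(beta T / 2d)))."""
    return (4 * d / alpha) * np.sqrt(T * np.log(3 + 6 * np.sqrt(beta * T / (2 * d))))

d, T = 5, 10_000
for beta in [1.0, 10.0, 100.0, 1000.0]:
    print(f"beta = {beta:7.1f}   bound = {regret_bound(T, d, beta):,.0f}")
```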
| Algorithm | Leading term | $\beta$ dependence |
|---|---|---|
| Thompson Sampling (Russo & Van Roy 2014) | $e^\beta \cdot d \cdot \sqrt{T} \log^3 T$ | Exponential |
| Logistic-UCB-2 (Faury et al. 2020) | $d\sqrt{T} \log T + e^\beta \cdot d^2 \log^2 T$ | Exponential (lower order) |
| Thompson Sampling (Neu et al. 2022) | $\sqrt{d} \cdot \sqrt{T} \cdot |\mathcal{A}| \cdot \sqrt{\log(\beta T)}$ | Logarithmic, scales with $|\mathcal{A}|$ |
| This paper | $\mathbf{(d/\alpha)\,\sqrt{T\log(\beta T/d)}}$ | Logarithmic, $|\mathcal{A}|$-free |
What this means in practice
The numerical experiments in the paper confirm the theory: as $\beta$ grows, the empirical regret of Thompson Sampling decreases, while all prior bounds increase exponentially and quickly become vacuous. The new bound tracks the empirical regret throughout.
More broadly, this result helps explain why practitioners have long found Thompson Sampling to work reliably across logistic bandit settings, including large or continuous action spaces, even in regimes where theory offered no guarantees. The theory was just behind the practice.
What’s next
The analysis technique, using variance rather than expectation to control the information ratio, is not specific to the logistic link function. The key properties exploited (smoothness of the link function, the Bregman divergence structure of exponential family KL divergences) are shared by the full class of generalised linear bandits. Extending the results there is a natural next step, as is connecting to the frequentist setting via the Optimistic Information Directed Sampling framework of Neu, Papini, and Schwartz (2024).
- The conjecture is stated as Conjecture 1 in: Dong, S. and Van Roy, B. “An information-theoretic analysis for Thompson sampling with many actions.” NeurIPS 2018.
- The open challenge appears in: McMahan, H.B. and Streeter, M. “Open problem: Better bounds for online logistic regression.” COLT 2012.