Unsolved

Model Selection for Contextual Bandits

Posed by Dylan J. Foster et al. (2020)

§ Problem Statement

Setup

Contextual bandit model selection. Fix a context space $\mathcal{X}$, a finite action set $\mathcal{A}=\{1,\dots,K\}$, and horizon $T$. On each round $t=1,\dots,T$, an environment (possibly adaptive but non-anticipating) selects a context $x_t\in\mathcal{X}$ and a loss vector $\ell_t\in[0,1]^K$; the learner observes $x_t$, chooses an action $a_t\in\mathcal{A}$ (possibly randomized based on the history), and observes only the scalar loss $\ell_t(a_t)$. A deterministic policy is a map $\pi:\mathcal{X}\to\mathcal{A}$. For a policy class $\Pi\subseteq(\mathcal{X}\to\mathcal{A})$, define the (expected) regret to $\Pi$ by

$$\mathrm{Reg}(\Pi):=\mathbb{E}\Big[\sum_{t=1}^T \ell_t(a_t)\Big]-\min_{\pi\in\Pi}\sum_{t=1}^T \ell_t(\pi(x_t)),$$

where the expectation is over the learner's internal randomization (and any randomness in the environment).
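As a concrete reading of this definition, the following minimal sketch (all names hypothetical, written for logged data rather than the online protocol) computes the empirical regret of a played action sequence against a finite policy class:

```python
import numpy as np

def empirical_regret(losses, actions, policies, contexts):
    """Empirical analogue of Reg(Pi): cumulative loss of the played
    actions minus the cumulative loss of the best fixed policy in Pi.

    losses   : (T, K) array with losses[t, a] = ell_t(a)
    actions  : length-T sequence of actions a_t played by the learner
    policies : list of policies, each a function mapping context -> action
    contexts : length-T sequence of contexts x_t
    """
    T = len(contexts)
    learner_loss = sum(losses[t, actions[t]] for t in range(T))
    best_policy_loss = min(
        sum(losses[t, pi(contexts[t])] for t in range(T))
        for pi in policies
    )
    return learner_loss - best_policy_loss
```

Note that the learner only ever observes `losses[t, actions[t]]`; the full loss matrix appears here only because regret is defined with respect to the unobserved counterfactual losses.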

Model-selection input: a nested sequence of finite policy classes $\Pi_1\subset\Pi_2\subset\cdots\subset\Pi_M$.

Unsolved Problem

Open Problem 1a. Does there exist a single contextual bandit algorithm such that for every $T,K,M$ and every such nested sequence, for all $m\in\{1,\dots,M\}$ simultaneously,

$$\mathrm{Reg}(\Pi_m)\le C\,\sqrt{K T\,(\log|\Pi_m|+\log m)}$$

for a universal constant $C$, up to standard polylogarithmic factors?

Open Problem 1b (adversarial, weaker targets): If the bound in Open Problem 1a is unattainable, is there an algorithm guaranteeing for all $m$ either (1) $\mathrm{Reg}(\Pi_m)\le \mathrm{poly}(K,M,\log\log|\Pi_M|)\cdot\sqrt{T\log|\Pi_m|}$, or (2) for some fixed $\alpha\in[1/2,1)$, $\mathrm{Reg}(\Pi_m)\le \mathrm{poly}(K,M,\log\log|\Pi_M|)\cdot T^{\alpha}(\log|\Pi_m|)^{1-\alpha}$; and if (2) is impossible, prove a lower bound ruling it out for every $\alpha\in[1/2,1)$.

Open Problem 2 (stochastic, realizable): Suppose instead $(x_t,\ell_t)_{t=1}^T$ are i.i.d. from an unknown distribution $\mathcal{D}$. Let $\mathcal{F}_1\subset\cdots\subset\mathcal{F}_M$ be a nested sequence of finite regression classes with $f:\mathcal{X}\times\mathcal{A}\to[0,1]$. Each $f$ induces a policy $\pi_f(x)\in\arg\min_{a\in\mathcal{A}} f(x,a)$ (ties broken by the smallest index). Assume realizability: there exist $m^*\in\{1,\dots,M\}$ and $f^*\in\mathcal{F}_{m^*}$ such that $\mathbb{E}[\ell(a)\mid x]=f^*(x,a)$ for all $x\in\mathcal{X}$ and $a\in\mathcal{A}$. Does there exist an algorithm and some $\alpha\in[1/2,1)$ such that, for every nested sequence, whenever realizability holds,

$$\mathbb{E}\Big[\sum_{t=1}^T \ell_t(a_t)-\sum_{t=1}^T \ell_t(\pi_{f^*}(x_t))\Big] \le \mathrm{poly}(K,M,\log\log|\mathcal{F}_M|)\cdot T^{\alpha}(\log|\mathcal{F}_{m^*}|)^{1-\alpha},$$

or else prove that no algorithm can achieve such a guarantee for any $\alpha\in[1/2,1)$?
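The induced policy $\pi_f$ above is simply the greedy policy of the regressor, with ties in the argmin broken toward the smallest action index. A minimal sketch of this construction (function names hypothetical):

```python
import numpy as np

def induced_policy(f, K):
    """Given a regressor f(x, a) -> predicted loss in [0, 1], return the
    greedy policy pi_f(x) = argmin_a f(x, a). Ties are broken toward the
    smallest action index, since np.argmin returns the first minimizer."""
    def pi_f(x):
        preds = np.array([f(x, a) for a in range(K)])
        return int(np.argmin(preds))
    return pi_f
```

Under realizability, $f^*$ predicts the true conditional mean loss, so $\pi_{f^*}$ is an optimal policy and the benchmark in the display above is the strongest natural comparator.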


§ Significance & Implications

The question is whether bandit feedback permits structural adaptivity over nested model/policy classes with only a small overhead (e.g., an added $\log m$ term) compared to the best-in-class regret rate. In full-information online learning, simultaneous regret guarantees over nested classes are achievable via exponential-weights-style methods, but contextual bandits must cope with importance weighting and high-variance loss estimates. Resolving these problems would clarify whether there is an inherent barrier to sharp model selection under partial feedback, or whether a single algorithm can be both minimax-optimal for large classes and automatically faster when the best comparator lies in a smaller class.

§ Known Partial Results

  • Foster et al. (2020): For a single finite policy class $\Pi$, standard adversarial contextual-bandit algorithms (e.g., Exp4) achieve regret on the order of $\sqrt{K T\log|\Pi|}$.

  • Foster et al. (2020): In full-information online learning (observing the entire loss vector each round), there are algorithms that simultaneously guarantee $\mathrm{Reg}(\Pi_m)=O\big(\sqrt{T(\log|\Pi_m|+\log m)}\big)$ for all $m$, showing that the desired model-selection overhead is possible without bandit feedback.

  • Foster et al. (2020): The COLT open-problem note argues that straightforward bandit adaptations of these ideas (e.g., using non-uniform priors over nested classes, Corral-style aggregation, or naive parameter tuning) do not directly yield the target tradeoff, and tend to produce weaker interpolation-type rates of the form $\tilde O\big(T^{\alpha}(\log|\Pi_m|)^{\beta}\big)$ with $\alpha+\beta>1$.

  • Foster et al. (2020): In the stochastic realizable setting, the note reports progress in specialized structured cases (e.g., nested linear classes under additional assumptions) using testing/estimation ideas based on approximating the optimal expected loss $L^*$ at sublinear rates.

  • Foster et al. (2020): The note also highlights reductions/consequences: a positive solution to Open Problem 1a would imply near-optimal switching-regret bounds for bandits by choosing $\Pi_m$ to be the class of action sequences with at most $m$ switches; conversely, strong lower bounds for model selection in contextual bandits would inform limits of certain importance-weighting reductions used to transfer full-information guarantees (e.g., second-order/quantile-style bounds) to bandit settings.
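One standard route to the full-information guarantee over nested classes is exponential weights run over the union of the classes, with a non-uniform prior that charges each policy in $\Pi_m$ mass roughly $1/(m^2|\Pi_m|)$, so the regret to $\Pi_m$ pays only an extra $O(\log m)$ on top of $\log|\Pi_m|$. A minimal sketch of this construction (all names hypothetical; full losses are assumed observed each round, which is exactly what bandit feedback removes):

```python
import numpy as np

def exponential_weights_nested(policy_classes, contexts, losses, eta):
    """Full-information exponential weights over nested policy classes.
    Each policy in Pi_m gets prior mass proportional to 1/(m^2 * |Pi_m|),
    so the regret to Pi_m carries only an extra log(m) beyond log|Pi_m|.

    policy_classes : list of lists of policies (functions context -> action)
    contexts       : length-T sequence of contexts
    losses         : (T, K) loss matrix, fully observed each round
    Returns the sequence of played actions (sampled from the weights).
    """
    rng = np.random.default_rng(0)
    policies, prior = [], []
    for m, Pi in enumerate(policy_classes, start=1):
        for pi in Pi:
            policies.append(pi)
            prior.append(1.0 / (m * m * len(Pi)))
    w = np.array(prior) / np.sum(prior)  # normalized prior weights
    played = []
    for t, x in enumerate(contexts):
        acts = np.array([pi(x) for pi in policies])
        p = w / w.sum()
        played.append(acts[rng.choice(len(policies), p=p)])
        # full information: every policy's loss is observed and penalized
        w = w * np.exp(-eta * losses[t, acts])
    return played
```

The update step is precisely where the bandit version breaks down: with only $\ell_t(a_t)$ observed, the losses of unplayed actions must be importance-weighted, and the resulting variance is what obstructs the sharp $\log|\Pi_m|+\log m$ tradeoff.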

§ References

[1]

Open Problem: Model Selection for Contextual Bandits

Dylan J. Foster, Akshay Krishnamurthy, Haipeng Luo (2020)

Conference on Learning Theory (COLT), PMLR 125

📍 Open-problem note in COLT proceedings.

[2]

The nonstochastic multiarmed bandit problem

Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, Robert E. Schapire (2002)

Machine Learning

📍 Introduces adversarial bandits with expert advice; Exp4-style guarantees underpin the baseline $\sqrt{K T\log|\Pi|}$ rate for finite policy classes.

[3]

Corralling a Band of Bandit Algorithms

Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, Robert E. Schapire (2017)

Conference on Learning Theory (COLT), PMLR

📍 Representative bandit model-aggregation method referenced in discussions of why naive aggregation may not yield sharp nested-class model selection in contextual bandits.
