Unsolved

Model Selection for Contextual Bandits

Posed by Dylan J. Foster et al. (2020)

§ Problem Statement

Setup

Contextual bandit model selection. Fix a context space $\mathcal{X}$, a finite action set $\mathcal{A}=\{1,\dots,K\}$, and horizon $T$. On each round $t=1,\dots,T$, an environment (possibly adaptive but non-anticipating) selects a context $x_t\in\mathcal{X}$ and a loss vector $\ell_t\in[0,1]^K$; the learner observes $x_t$, chooses an action $a_t\in\mathcal{A}$ (possibly randomized based on the history), and observes only the scalar loss $\ell_t(a_t)$. A deterministic policy is a map $\pi:\mathcal{X}\to\mathcal{A}$. For a policy class $\Pi\subseteq(\mathcal{X}\to\mathcal{A})$, define the (expected) regret to $\Pi$ by

$$\mathrm{Reg}(\Pi):=\mathbb{E}\Big[\sum_{t=1}^T \ell_t(a_t)\Big]-\min_{\pi\in\Pi}\sum_{t=1}^T \ell_t(\pi(x_t)),$$

where the expectation is over the learner's internal randomization (and any randomness in the environment).
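As a concrete reading of this definition, the following minimal sketch (all names hypothetical, written for logged data rather than the online protocol) computes the empirical regret of a played action sequence against a finite policy class:

```python
import numpy as np

def empirical_regret(losses, actions, policies, contexts):
    """Empirical analogue of Reg(Pi): cumulative loss of the played
    actions minus the cumulative loss of the best fixed policy in Pi.

    losses   : (T, K) array with losses[t, a] = ell_t(a)
    actions  : length-T sequence of actions a_t played by the learner
    policies : list of policies, each a function mapping context -> action
    contexts : length-T sequence of contexts x_t
    """
    T = len(contexts)
    learner_loss = sum(losses[t, actions[t]] for t in range(T))
    best_policy_loss = min(
        sum(losses[t, pi(contexts[t])] for t in range(T))
        for pi in policies
    )
    return learner_loss - best_policy_loss
```

Note that the learner only ever observes `losses[t, actions[t]]`; the full loss matrix appears here only because regret is defined with respect to the unobserved counterfactual losses.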

Model-selection input: a nested sequence of finite policy classes $\Pi_1\subset\Pi_2\subset\cdots\subset\Pi_M$.

Unsolved Problem

Open Problem 1a. Does there exist a single contextual bandit algorithm such that for every $T,K,M$ and every such nested sequence, for all $m\in\{1,\dots,M\}$ simultaneously,

$$\mathrm{Reg}(\Pi_m)\le C\,\sqrt{K T\,(\log|\Pi_m|+\log m)}$$

for a universal constant $C$, up to standard polylogarithmic factors?

Open Problem 1b (adversarial, weaker targets): If the bound in Open Problem 1a is unattainable, is there an algorithm guaranteeing for all $m$ either (1) $\mathrm{Reg}(\Pi_m)\le \mathrm{poly}(K,M,\log\log|\Pi_M|)\cdot\sqrt{T\log|\Pi_m|}$, or (2) for some fixed $\alpha\in[1/2,1)$, $\mathrm{Reg}(\Pi_m)\le \mathrm{poly}(K,M,\log\log|\Pi_M|)\cdot T^{\alpha}(\log|\Pi_m|)^{1-\alpha}$; and if (2) is impossible, prove a lower bound ruling it out for every $\alpha\in[1/2,1)$.

Open Problem 2 (stochastic, realizable): Suppose instead $(x_t,\ell_t)_{t=1}^T$ are i.i.d. from an unknown distribution $\mathcal{D}$. Let $\mathcal{F}_1\subset\cdots\subset\mathcal{F}_M$ be a nested sequence of finite regression classes with $f:\mathcal{X}\times\mathcal{A}\to[0,1]$. Each $f$ induces a policy $\pi_f(x)\in\arg\min_{a\in\mathcal{A}} f(x,a)$ (ties broken by the smallest index). Assume realizability: there exist $m^*\in\{1,\dots,M\}$ and $f^*\in\mathcal{F}_{m^*}$ such that $\mathbb{E}[\ell(a)\mid x]=f^*(x,a)$ for all $x\in\mathcal{X}$ and $a\in\mathcal{A}$. Does there exist an algorithm and some $\alpha\in[1/2,1)$ such that, for every nested sequence, whenever realizability holds,

$$\mathbb{E}\Big[\sum_{t=1}^T \ell_t(a_t)-\sum_{t=1}^T \ell_t(\pi_{f^*}(x_t))\Big] \le \mathrm{poly}(K,M,\log\log|\mathcal{F}_M|)\cdot T^{\alpha}(\log|\mathcal{F}_{m^*}|)^{1-\alpha},$$

or else prove that no algorithm can achieve such a guarantee for any $\alpha\in[1/2,1)$?
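The induced policy $\pi_f$ above is simply the greedy policy of the regressor, with ties in the argmin broken toward the smallest action index. A minimal sketch of this construction (function names hypothetical):

```python
import numpy as np

def induced_policy(f, K):
    """Given a regressor f(x, a) -> predicted loss in [0, 1], return the
    greedy policy pi_f(x) = argmin_a f(x, a). Ties are broken toward the
    smallest action index, since np.argmin returns the first minimizer."""
    def pi_f(x):
        preds = np.array([f(x, a) for a in range(K)])
        return int(np.argmin(preds))
    return pi_f
```

Under realizability, $f^*$ predicts the true conditional mean loss, so $\pi_{f^*}$ is an optimal policy and the benchmark in the display above is the strongest natural comparator.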


§ Significance & Implications

The question is whether bandit feedback permits structural adaptivity over nested model/policy classes with only a small overhead (e.g., an added $\log m$ term) compared to the best-in-class regret rate. In full-information online learning, simultaneous regret guarantees over nested classes are achievable via exponential-weights-style methods, but contextual bandits must cope with importance weighting and high-variance loss estimates. Resolving these problems would clarify whether there is an inherent barrier to sharp model selection under partial feedback, or whether a single algorithm can be both minimax-optimal for large classes and automatically faster when the best comparator lies in a smaller class.

§ Known Partial Results

  • Foster et al. (2020): For a single finite policy class $\Pi$, standard adversarial contextual-bandit algorithms (e.g., Exp4) achieve regret on the order of $\sqrt{K T\log|\Pi|}$.

  • Foster et al. (2020): In full-information online learning (observing the entire loss vector each round), there are algorithms that simultaneously guarantee $\mathrm{Reg}(\Pi_m)=O\big(\sqrt{T(\log|\Pi_m|+\log m)}\big)$ for all $m$, showing that the desired model-selection overhead is possible without bandit feedback.

  • Foster et al. (2020): The COLT open-problem note argues that straightforward bandit adaptations of these ideas (e.g., using non-uniform priors over nested classes, Corral-style aggregation, or naive parameter tuning) do not directly yield the target tradeoff, and tend to produce weaker interpolation-type rates of the form $\tilde O\big(T^{\alpha}(\log|\Pi_m|)^{\beta}\big)$ with $\alpha+\beta>1$.

  • Foster et al. (2020): In the stochastic realizable setting, the note reports progress in specialized structured cases (e.g., nested linear classes under additional assumptions) using testing/estimation ideas based on approximating the optimal expected loss $L^*$ at sublinear rates.

  • Foster et al. (2020): The note also highlights reductions/consequences: a positive solution to Open Problem 1a would imply near-optimal switching-regret bounds for bandits by choosing $\Pi_m$ to be the class of action sequences with at most $m$ switches; conversely, strong lower bounds for model selection in contextual bandits would inform limits of certain importance-weighting reductions used to transfer full-information guarantees (e.g., second-order/quantile-style bounds) to bandit settings.
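One standard route to the full-information guarantee over nested classes is exponential weights run over the union of the classes, with a non-uniform prior that charges each policy in $\Pi_m$ mass roughly $1/(m^2|\Pi_m|)$, so the regret to $\Pi_m$ pays only an extra $O(\log m)$ on top of $\log|\Pi_m|$. A minimal sketch of this construction (all names hypothetical; full losses are assumed observed each round, which is exactly what bandit feedback removes):

```python
import numpy as np

def exponential_weights_nested(policy_classes, contexts, losses, eta):
    """Full-information exponential weights over nested policy classes.
    Each policy in Pi_m gets prior mass proportional to 1/(m^2 * |Pi_m|),
    so the regret to Pi_m carries only an extra log(m) beyond log|Pi_m|.

    policy_classes : list of lists of policies (functions context -> action)
    contexts       : length-T sequence of contexts
    losses         : (T, K) loss matrix, fully observed each round
    Returns the sequence of played actions (sampled from the weights).
    """
    rng = np.random.default_rng(0)
    policies, prior = [], []
    for m, Pi in enumerate(policy_classes, start=1):
        for pi in Pi:
            policies.append(pi)
            prior.append(1.0 / (m * m * len(Pi)))
    w = np.array(prior) / np.sum(prior)  # normalized prior weights
    played = []
    for t, x in enumerate(contexts):
        acts = np.array([pi(x) for pi in policies])
        p = w / w.sum()
        played.append(acts[rng.choice(len(policies), p=p)])
        # full information: every policy's loss is observed and penalized
        w = w * np.exp(-eta * losses[t, acts])
    return played
```

The update step is precisely where the bandit version breaks down: with only $\ell_t(a_t)$ observed, the losses of unplayed actions must be importance-weighted, and the resulting variance is what obstructs the sharp $\log|\Pi_m|+\log m$ tradeoff.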

§ References

[1]

Open Problem: Model Selection for Contextual Bandits

Dylan J. Foster, Akshay Krishnamurthy, Haipeng Luo (2020)

Conference on Learning Theory (COLT), PMLR 125

📍 Open-problem note in COLT proceedings.

[2]

The nonstochastic multiarmed bandit problem

Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, Robert E. Schapire (2002)

Machine Learning

📍 Introduces adversarial bandits with expert advice; Exp4-style guarantees underpin the baseline $\sqrt{K T\log|\Pi|}$ rate for finite policy classes.

[3]

Corralling a Band of Bandit Algorithms

Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, Robert E. Schapire (2017)

Conference on Learning Theory (COLT), PMLR

📍 Representative bandit model-aggregation method referenced in discussions of why naive aggregation may not yield sharp nested-class model selection in contextual bandits.
