Model Selection for Contextual Bandits
Posed by Dylan J. Foster et al. (2020)
§ Problem Statement
Setup
Contextual bandit model selection. Fix a context space $\mathcal{X}$, a finite action set $[K] = \{1, \dots, K\}$, and horizon $T$. On each round $t = 1, \dots, T$, an environment (possibly adaptive but non-anticipating) selects a context $x_t \in \mathcal{X}$ and a loss vector $\ell_t \in [0,1]^K$; the learner observes $x_t$, chooses an action $a_t \in [K]$ (possibly randomized based on the history), and observes only the scalar loss $\ell_t(a_t)$. A deterministic policy is a map $\pi : \mathcal{X} \to [K]$. For a policy class $\Pi$, define the (expected) regret to $\Pi$ by
$\mathrm{Reg}_T(\Pi) = \mathbb{E}\big[\textstyle\sum_{t=1}^{T} \ell_t(a_t)\big] - \min_{\pi \in \Pi} \mathbb{E}\big[\textstyle\sum_{t=1}^{T} \ell_t(\pi(x_t))\big],$
where the expectation is over the learner's internal randomization (and any randomness in the environment).
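As a concrete illustration of this protocol (a toy sketch; the environment, policy class, and uniform-random learner below are hypothetical, not from the note), one can simulate the interaction loop and measure regret against a small finite class:

```python
import random

# Toy instance of the protocol: K = 3 actions, contexts in {0, 1}, an i.i.d.
# environment, and a uniform-random learner. All specifics are illustrative.
random.seed(0)
K, T = 3, 10_000

def environment(t):
    """Draw a context and a loss vector in [0, 1]^K."""
    x = random.randint(0, 1)
    # action a = x is slightly better in context x
    return x, [0.5 - 0.2 * (a == x) + 0.1 * random.random() for a in range(K)]

# A finite policy class Pi: each policy maps context {0, 1} to an action.
Pi = [(a0, a1) for a0 in range(K) for a1 in range(K)]

learner_loss = 0.0
policy_losses = {pi: 0.0 for pi in Pi}
for t in range(T):
    x_t, ell_t = environment(t)
    a_t = random.randrange(K)         # placeholder learner: uniform at random
    learner_loss += ell_t[a_t]        # only ell_t[a_t] is revealed to the learner
    for pi in Pi:                     # full vectors used offline, for evaluation only
        policy_losses[pi] += ell_t[pi[x_t]]

regret = learner_loss - min(policy_losses.values())
print(f"Reg_T(Pi) of the uniform learner: {regret:.1f}")
```

A nontrivial algorithm would drive this regret toward the $\sqrt{KT\log|\Pi|}$ scale rather than the linear-in-$T$ gap that uniform play incurs here.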
Model-selection input: a nested sequence of finite policy classes $\Pi_1 \subseteq \Pi_2 \subseteq \cdots$.
Unsolved Problem
Open Problem 1a. Does there exist a single contextual bandit algorithm such that for every $T$ and every such nested sequence, for all $m$ simultaneously,
$\mathrm{Reg}_T(\Pi_m) \le C \sqrt{K T \log|\Pi_m|}$
for a universal constant $C$, up to standard polylogarithmic factors?
Open Problem 1b (adversarial, weaker targets): If the bound in Problem 1a is unattainable, is there an algorithm guaranteeing for all $m$ either (1) $\mathrm{Reg}_T(\Pi_m) \le C\,\mathrm{poly}(K)\sqrt{T \log|\Pi_m|}$, or (2) for some fixed $\alpha \in (1/2, 1)$, $\mathrm{Reg}_T(\Pi_m) \le C\, T^{\alpha} (K \log|\Pi_m|)^{1-\alpha}$; and if (2) is impossible, prove a lower bound ruling it out for every $\alpha < 1$.
Open Problem 2 (stochastic, realizable): Suppose instead $(x_t, \ell_t)$ are i.i.d. from an unknown distribution $\mathcal{D}$. Let $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots$ be a nested sequence of finite regression classes with $\mathcal{F}_m \subseteq \{f : \mathcal{X} \times [K] \to [0,1]\}$. Each $f$ induces a policy $\pi_f(x) = \arg\min_{a \in [K]} f(x,a)$ (ties broken by the smallest index). Assume realizability: there exist $m^\star$ and $f^\star \in \mathcal{F}_{m^\star}$ such that $\mathbb{E}[\ell_t(a) \mid x_t = x] = f^\star(x,a)$ for all $x \in \mathcal{X}$ and $a \in [K]$. Does there exist an algorithm and some $\alpha \in [1/2, 1)$ such that, for every nested sequence, whenever realizability holds,
$\mathbb{E}\big[\textstyle\sum_{t=1}^{T} \ell_t(a_t) - \sum_{t=1}^{T} \ell_t(\pi_{f^\star}(x_t))\big] \le C\, T^{\alpha} \big(K \log|\mathcal{F}_{m^\star}|\big)^{1-\alpha},$
or else prove that no algorithm can achieve such a guarantee for any $\alpha < 1$?
§ Discussion
§ Significance & Implications
The question is whether bandit feedback permits structural adaptivity over nested model/policy classes with only a small overhead (e.g., an added lower-order term) compared to the best-in-class regret rate. In full-information online learning, simultaneous regret guarantees over nested classes are achievable via exponential-weights-style methods, but contextual bandits must cope with importance weighting and high-variance loss estimates. Resolving these problems would clarify whether there is an inherent barrier to sharp model selection under partial feedback, or whether a single algorithm can be both minimax-optimal for large classes and automatically faster when the best comparator lies in a smaller class.
§ Known Partial Results
Foster et al. (2020): For a single finite policy class $\Pi$, standard adversarial contextual-bandit algorithms (e.g., Exp4) achieve regret on the order of $\sqrt{K T \log|\Pi|}$.
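For concreteness, this baseline can be sketched as Exp4-style exponential weights over policies with importance-weighted loss estimates; the demo environment, tuple-free policy encoding, and learning-rate choice below are illustrative assumptions, not code from the note:

```python
import math, random

# A sketch of Exp4 (Auer et al., 2002) for one finite policy class Pi,
# the algorithm behind the sqrt(K T log|Pi|) baseline. Losses lie in [0,1];
# the environment below is a toy stand-in.

def exp4(policies, K, T, environment, eta):
    """Exponential weights over policies with importance-weighted loss estimates."""
    log_w = [0.0] * len(policies)
    total = 0.0
    for t in range(T):
        x_t, ell_t = environment(t)
        mx = max(log_w)
        w = [math.exp(v - mx) for v in log_w]
        W = sum(w)
        p = [0.0] * K                      # induced action distribution
        for wi, pi in zip(w, policies):
            p[pi(x_t)] += wi / W
        a_t = random.choices(range(K), weights=p)[0]
        total += ell_t[a_t]
        ell_hat = ell_t[a_t] / p[a_t]      # unbiased estimate of ell_t(a_t)
        log_w = [v - eta * ell_hat * (pi(x_t) == a_t)
                 for v, pi in zip(log_w, policies)]
    return total

# Demo on a toy i.i.d. environment with K = 3 actions and contexts {0, 1}.
random.seed(1)
K, T = 3, 5000
def env(t):
    x = random.randint(0, 1)
    return x, [0.5 - 0.3 * (a == x) + 0.1 * random.random() for a in range(K)]
Pi = [lambda x, a0=a0, a1=a1: (a0, a1)[x] for a0 in range(K) for a1 in range(K)]
eta = math.sqrt(math.log(len(Pi)) / (K * T))   # tuned rate ~ sqrt(log|Pi|/(KT))
loss = exp4(Pi, K, T, env, eta)
print(f"Exp4 loss over T={T}: {loss:.1f}")
```

With $\eta \approx \sqrt{\log|\Pi|/(KT)}$, the standard analysis gives expected regret $O(\sqrt{KT\log|\Pi|})$, matching the baseline rate above.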
Foster et al. (2020): In full-information online learning (observing the entire loss vector each round), there are algorithms that simultaneously guarantee $\mathrm{Reg}_T(\Pi_m) \le O(\sqrt{T \log|\Pi_m|})$ for all $m$, showing that the desired model-selection overhead is achievable without bandit feedback.
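One way such a simultaneous full-information guarantee can be realized (a sketch under assumptions; the note does not commit to this particular construction) is exponential weights over the union of the classes with a non-uniform prior, giving each policy first appearing in $\Pi_m$ prior mass proportional to $2^{-m}$, so the relative-entropy cost against any $\pi \in \Pi_m$ is $\log|\Pi_m| + O(m)$:

```python
import math, random

# Exponential weights over nested classes with prior ~ 2^{-m} on the policies
# first appearing in Pi_m (illustrative sketch; policies are tuples mapping
# context index -> action). Full loss vectors are observed every round.

def hedge_nested(nested_classes, T, environment, eta):
    experts, log_w, seen = [], [], set()
    for m, Pi_m in enumerate(nested_classes, start=1):
        new = [pi for pi in Pi_m if pi not in seen]
        if not new:
            continue
        for pi in new:
            experts.append(pi)
            log_w.append(-m * math.log(2) - math.log(len(new)))  # log prior mass
            seen.add(pi)
    total = 0.0
    for t in range(T):
        x_t, ell_t = environment(t)
        mx = max(log_w)
        w = [math.exp(v - mx) for v in log_w]
        W = sum(w)
        # play the weighted mixture; every entry of ell_t is observed
        total += sum(wi * ell_t[pi[x_t]] for wi, pi in zip(w, experts)) / W
        log_w = [v - eta * ell_t[pi[x_t]] for v, pi in zip(log_w, experts)]
    return total

# Demo: Pi_1 = constant policies, Pi_2 = all context-dependent policies.
random.seed(2)
K, T = 3, 3000
Pi1 = [(a, a) for a in range(K)]
Pi2 = [(a0, a1) for a0 in range(K) for a1 in range(K)]
def env(t):
    x = random.randint(0, 1)
    return x, [0.5 - 0.3 * (a == 0) + 0.1 * random.random() for a in range(K)]
loss = hedge_nested([Pi1, Pi2], T, env, eta=math.sqrt(math.log(len(Pi2)) / T))
print(f"Hedge-with-prior loss over T={T}: {loss:.1f}")
```

Because the best comparator here is the constant policy $(0,0) \in \Pi_1$, the mixture pays roughly $\sqrt{T(\log|\Pi_1| + 1)}$ over the best-in-class loss; the importance-weighted estimates required under bandit feedback are exactly what breaks this argument.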
Foster et al. (2020): The COLT open-problem note argues that straightforward bandit adaptations of these ideas (e.g., using non-uniform priors over nested classes, Corral-style aggregation, or naive parameter tuning) do not directly yield the target tradeoff, and tend to produce weaker interpolation-type rates of the form $T^{\alpha} (K \log|\Pi_m|)^{1-\alpha}$ with $\alpha > 1/2$.
Foster et al. (2020): In the stochastic realizable setting, the note reports progress in specialized structured cases (e.g., nested linear classes under additional assumptions) using testing/estimation ideas based on approximating the optimal expected loss at sublinear rates.
Foster et al. (2020): The note also highlights reductions/consequences: a positive solution to Open Problem 1a would imply near-optimal switching-regret bounds for bandits by choosing $\Pi_m$ to be the class of action sequences with at most $m$ switches; conversely, strong lower bounds for model selection in contextual bandits would inform limits of certain importance-weighting reductions used to transfer full-information guarantees (e.g., second-order/quantile-style bounds) to bandit settings.
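The switching-regret consequence can be checked numerically (a small counting sketch; the calculation is standard, not code from the note): with contexts $x_t = t$, the class $\Pi_m$ of action sequences with at most $m$ switches has $|\Pi_m| = \sum_{j \le m} \binom{T-1}{j} K (K-1)^j$, so $\log|\Pi_m| = O(m \log(KT))$ and the Problem-1a target $\sqrt{KT\log|\Pi_m|}$ becomes a $\sqrt{K T m \log(KT)}$ switching-regret bound:

```python
import math

# Count action sequences of length T with at most m switches: choose the
# switch positions, the first action, and each post-switch action.
def log_num_switching_policies(K, T, m):
    total = sum(math.comb(T - 1, j) * K * (K - 1) ** j for j in range(m + 1))
    return math.log(total)

K, T = 5, 10_000
for m in (1, 5, 20):
    lg = log_num_switching_policies(K, T, m)
    print(f"m={m:>2}: log|Pi_m| = {lg:7.1f}  vs  m*log(KT) = {m * math.log(K * T):7.1f}")
```

The printed values confirm that $\log|\Pi_m|$ grows on the order of $m \log(KT)$, so a simultaneous $\sqrt{KT\log|\Pi_m|}$ guarantee would adapt to the unknown number of switches.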
§ References
Open Problem: Model Selection for Contextual Bandits
Dylan J. Foster, Akshay Krishnamurthy, Haipeng Luo (2020)
Conference on Learning Theory (COLT), PMLR 125
📍 Open-problem note in COLT proceedings.
The nonstochastic multiarmed bandit problem
Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, Robert E. Schapire (2002)
Machine Learning
📍 Introduces adversarial bandits with expert advice; Exp4-style guarantees underpin the baseline $\sqrt{K T\log|\Pi|}$ rate for finite policy classes.
Corralling a Band of Bandit Algorithms
Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, Robert E. Schapire (2017)
Conference on Learning Theory (COLT), PMLR
📍 Representative bandit model-aggregation method referenced in discussions of why naive aggregation may not yield sharp nested-class model selection in contextual bandits.