Finite-Time Instance Dependent Optimality for Stochastic Online Learning with Feedback Graphs
Posed by Teodor Vanislavov Marinov et al. (2022)
§ Problem Statement
Setup
Consider stochastic online learning with a known directed feedback graph $G = (V, E)$ over $K$ actions, where every node has a self-loop. At each round $t = 1, \dots, T$, the environment draws a loss vector $\ell_t \in [0,1]^K$ i.i.d. from an unknown distribution $\mathcal{D}$. The learner selects an action $A_t \in V$, suffers loss $\ell_t(A_t)$, and observes the losses $\{\ell_t(j) : j \in N_{\mathrm{out}}(A_t)\}$, where $N_{\mathrm{out}}(i) = \{j : (i,j) \in E\}$ is the out-neighborhood of $i$.
Let $\mu_i = \mathbb{E}_{\ell \sim \mathcal{D}}[\ell(i)]$ and assume there is a unique optimal action $i^* = \arg\min_i \mu_i$. Define gaps $\Delta_i = \mu_i - \mu_{i^*}$, and for an algorithm (policy) $\pi$ define the expected regret
$$R_T(\pi) = \mathbb{E}\Big[\sum_{t=1}^{T} \big(\ell_t(A_t) - \ell_t(i^*)\big)\Big] = \sum_{i \neq i^*} \Delta_i\, \mathbb{E}[N_i(T)],$$
where $N_i(T)$ is the number of times action $i$ is played up to time $T$. Define the instance-wise finite-horizon optimum
$$R_T^* = \inf_{\pi} R_T(\pi).$$
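The interaction protocol above can be sketched in a few lines. All names here (`K`, `G`, `mu`, `play_round`) are illustrative choices, not notation from the original note; losses are taken to be Bernoulli for concreteness.

```python
import numpy as np

# Minimal sketch of the feedback-graph interaction model (illustrative only).
rng = np.random.default_rng(0)

K = 4                                    # number of actions (nodes)
# Out-neighborhoods N_out(i); every node has a self-loop.
G = {0: {0, 1}, 1: {1, 2}, 2: {2, 3}, 3: {3, 0}}
mu = np.array([0.2, 0.4, 0.5, 0.6])      # mean losses; action 0 is optimal

def play_round(action):
    """Draw an i.i.d. Bernoulli loss vector ell_t ~ D and return
    the suffered loss and the side-observed losses."""
    losses = (rng.random(K) < mu).astype(float)    # ell_t in {0,1}^K
    suffered = losses[action]                       # loss of the played action
    observed = {j: losses[j] for j in G[action]}    # losses in N_out(action)
    return suffered, observed

suffered, observed = play_round(1)
assert 1 in observed    # self-loop: the played action is always observed
```

Because every node has a self-loop, this model interpolates between bandit feedback (each `G[i] == {i}`) and full information (each `G[i]` equal to the whole vertex set).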
It is known that (under standard stochastic assumptions) there are instance-dependent algorithms whose regret matches the leading term in known information-theoretic lower bounds for feedback graphs as .
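One standard way such asymptotic instance-dependent benchmarks are expressed for feedback graphs is as the value of a linear program over exploration rates; the formulation below is a common sketch of this idea (notation is illustrative, not taken verbatim from the note):

```latex
% c_i is the asymptotic per-log T rate at which suboptimal action i is
% explored; each suboptimal action must be observed often enough to be
% statistically distinguished from the optimal one.
\[
  c^*(G,\mu) \;=\; \min_{c \ge 0} \;\sum_{i \ne i^*} c_i \, \Delta_i
  \quad \text{s.t.} \quad
  \sum_{j \,:\, i \in N_{\mathrm{out}}(j)} c_j
    \;\ge\; \frac{1}{\mathrm{kl}\big(\mu_i, \mu_{i^*}\big)}
  \qquad \forall\, i \ne i^*,
\]
\[
  \liminf_{T \to \infty} \frac{R_T(\pi)}{\log T} \;\ge\; c^*(G,\mu)
  \qquad \text{for every consistent policy } \pi .
\]
```

The graph enters through the constraint: playing any in-neighbor $j$ of $i$ counts toward observing $i$, so a single highly revealing node can cover many constraints at once.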
Unsolved Problem
- Finite-time characterization: Characterize $R_T^* = \inf_\pi R_T(\pi)$ (or, if one restricts to a natural class of "globally reasonable" algorithms, the corresponding restricted infimum) for finite $T$ in terms of (i) the instance (e.g., gaps $\Delta_i$ and distributional distinguishability quantities such as KL divergences between arm-loss distributions) and (ii) the graph structure $G$.
- When does the asymptotic rate control finite time?: Characterize the class of graphs $G$ for which, for every instance and for horizons $T$ in a non-asymptotic ("moderate") regime, the finite-time optimum $R_T^*$ is already controlled (up to lower-order terms) by the asymptotically optimal instance-dependent benchmark from existing lower bounds; and characterize graphs for which some instances and horizons necessarily exhibit a finite-time separation from that asymptotic benchmark.
§ Discussion
§ Significance & Implications
Feedback graphs formalize structured side-observations, interpolating between bandit feedback (only the played action observed) and full information (all actions observed). For feedback graphs, asymptotic instance-dependent theory identifies the leading growth rate dictated by information-theoretic lower bounds, but the COLT 2022 open-problem note highlights that this asymptotic benchmark does not uniquely pin down what "optimal" means at a fixed horizon and can be decoupled from finite-time optimality. A finite-time, instance-and-graph-specific characterization would clarify the correct performance target at horizon $T$, determine when asymptotically optimal algorithms are already near-optimal at realistic sample sizes, and isolate graph structures that intrinsically force additional exploration cost beyond the asymptotic leading term.
§ Known Partial Results
Marinov et al. (2022): In the classical stochastic multi-armed bandit setting (no side-observations beyond the played arm), both asymptotic and non-asymptotic instance-dependent regret bounds are available; for Gaussian rewards, bounds matching the instance-dependent lower bounds are known up to lower-order terms (as cited in the COLT 2022 open-problem abstract).
Marinov et al. (2022): For stochastic online learning with feedback graphs, instance-dependent algorithms are known that are asymptotically optimal in the sense of matching the leading term in known information-theoretic lower bounds for this model (as stated in the COLT 2022 open-problem abstract).
Marinov et al. (2022): The COLT 2022 open-problem note reports that, unlike in standard bandits, finite-time instance-dependent optimality is not uniquely determined by the asymptotic theory for feedback graphs and can be decoupled from the asymptotic rate, motivating a dedicated finite-time characterization.
Marinov et al. (2022): The COLT 2022 open-problem note poses two concrete directions: (i) characterize the finite-horizon instance-dependent optimal regret in this model, and (ii) determine which graphs do (and do not) admit finite-time regret controlled by the asymptotically optimal instance-dependent benchmark for reasonable horizons.
§ References
Teodor Vanislavov Marinov, Mehryar Mohri, Julian Zimmert (2022). Open Problem: Finite-Time Instance Dependent Optimality for Stochastic Online Learning with Feedback Graphs. Conference on Learning Theory (COLT), PMLR 178. Open-problem note in COLT proceedings.