Unsolved

Finite-Time Instance Dependent Optimality for Stochastic Online Learning with Feedback Graphs

Posed by Teodor Vanislavov Marinov et al. (2022)

§ Problem Statement

Setup

Consider stochastic online learning with a known directed feedback graph $G=([K],E)$ on $K\ge 2$ actions, where every node has a self-loop, i.e., $(i,i)\in E$. At each round $t=1,\dots,T$, the environment draws a loss vector $X_t=(X_{t,1},\dots,X_{t,K})\in[0,1]^K$ i.i.d. from an unknown distribution $\nu$. The learner selects an action $I_t\in[K]$, suffers loss $X_{t,I_t}$, and observes the losses $\{X_{t,j}: j\in\mathcal O(I_t)\}$, where $\mathcal O(i)=\{j\in[K]:(i,j)\in E\}$ is the out-neighborhood of $i$.
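To make the observation model concrete, here is a minimal simulation of one round of the protocol. The specific three-node graph, the uniform loss distribution, and the chosen action are all illustrative assumptions for this sketch, not part of the problem statement:

```python
import random

# Illustrative feedback graph on K = 3 actions, given by out-neighborhoods.
# Every node has a self-loop, as the problem requires; the remaining edges
# are an arbitrary choice for this sketch.
OUT = {1: {1, 2}, 2: {2}, 3: {3, 1}}

def play_round(chosen_action, rng):
    """Draw one i.i.d. loss vector in [0,1]^K, return the suffered loss
    and the side-observations dictated by the out-neighborhood."""
    K = len(OUT)
    losses = {i: rng.random() for i in range(1, K + 1)}   # X_t ~ nu (uniform here)
    suffered = losses[chosen_action]                       # X_{t, I_t}
    observed = {j: losses[j] for j in OUT[chosen_action]}  # {X_{t,j} : j in O(I_t)}
    return suffered, observed

rng = random.Random(0)
loss, obs = play_round(1, rng)
# Playing action 1 reveals the losses of actions 1 and 2, but not of action 3.
```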

Let $\mu_i=\mathbb E_{\nu}[X_{t,i}]$ and assume there is a unique optimal action $i^*\in\arg\min_i \mu_i$. Define gaps $\Delta_i=\mu_i-\mu_{i^*}\ge 0$, and for an algorithm (policy) $\pi$ define the expected regret

$$R_T(\pi;\nu,G)=\mathbb E\Big[\sum_{t=1}^T \big(X_{t,I_t}-X_{t,i^*}\big)\Big]=\sum_{i\ne i^*} \Delta_i\,\mathbb E[N_i(T)],$$

where $N_i(T)=\sum_{t=1}^T \mathbf 1\{I_t=i\}$. Define the instance-wise finite-horizon optimum

$$R_T^*(\nu,G)=\inf_{\pi} R_T(\pi;\nu,G).$$
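The decomposition $R_T = \sum_{i\ne i^*}\Delta_i\,\mathbb E[N_i(T)]$ can be sanity-checked in closed form for a policy whose play counts are known, e.g. uniform random play, where $\mathbb E[N_i(T)] = T/K$. The Bernoulli-style means below are an illustrative instance, not from the note:

```python
# Closed-form check of R_T = sum_i Delta_i * E[N_i(T)] for uniform random play,
# which gives E[N_i(T)] = T / K.  The means below are illustrative.
mu = [0.1, 0.3, 0.5]                  # mu_i = E[X_{t,i}]; unique best arm is arm 0
K, T = len(mu), 1000
mu_star = min(mu)
gaps = [m - mu_star for m in mu]      # Delta_i = mu_i - mu_{i*}

expected_plays = [T / K] * K                         # E[N_i(T)] under uniform play
regret_decomposed = sum(d * n for d, n in zip(gaps, expected_plays))
regret_direct = T * (sum(mu) / K - mu_star)          # per-round E[X_{t,I_t}] - mu_{i*}

# Both accountings agree (up to floating point): 1000 * (0.3 - 0.1) = 200.
```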

It is known that, under standard stochastic assumptions, there are instance-dependent algorithms whose regret matches the leading $\log T$ term in known information-theoretic lower bounds for feedback graphs as $T\to\infty$.
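For concreteness, the asymptotic benchmark referenced here is typically a graph-dependent linear program in the Graves–Lai style. A representative form is sketched below for unit-variance Gaussian losses; the normalization constants vary across the literature, so treat this as a sketch rather than the exact statement from the note:

```latex
% Sketch of the asymptotic lower bound for feedback graphs.  Here eta_j is the
% (log T)-normalized number of plays of action j; each suboptimal arm i must be
% observed often enough, via some in-neighbor, to be distinguished from i^*.
% Constants assume unit-variance Gaussian losses.
\liminf_{T\to\infty}\frac{R_T(\pi;\nu,G)}{\log T}
\;\ge\;
c^*(\nu,G)
\;=\;
\min_{\eta\in\mathbb{R}_{\ge 0}^{K}}
\sum_{i\ne i^*}\eta_i\,\Delta_i
\quad\text{s.t.}\quad
\sum_{j:\, i\in\mathcal{O}(j)}\eta_j \;\ge\; \frac{2}{\Delta_i^{2}}
\quad\text{for all } i\ne i^*.
```

The lower bound of this form holds for consistent (uniformly good) policies; the open problem asks what replaces $c^*(\nu,G)\log T$ at finite $T$.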

Unsolved Problem

  1. Finite-time characterization: Characterize $R_T^*(\nu,G)$ (or, if one restricts to a natural class of "globally reasonable" algorithms, the corresponding restricted infimum) for finite $T$ in terms of (i) the instance $\nu$ (e.g., the gaps $\{\Delta_i\}$ and distributional distinguishability quantities such as KL divergences between arm-loss distributions) and (ii) the graph structure $G$.

  2. When does the asymptotic rate control finite time? Characterize the class of graphs $G$ for which, for every instance $\nu$ and for horizons $T$ in a non-asymptotic ("moderate") regime, the finite-time optimum is already controlled, up to lower-order terms, by the asymptotically optimal instance-dependent $\log T$ benchmark from existing lower bounds; conversely, characterize the graphs for which some instances and horizons necessarily exhibit a finite-time separation from that asymptotic $\log T$ benchmark.
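To see how finite-time behavior can decouple from the asymptotic benchmark, consider an illustrative star graph; this concrete instance and its unit-variance-Gaussian-style constants are our own illustration, not taken from the note. The center arm observes every arm but carries a large gap, so the asymptotic linear program may route all exploration through the leaves, even though a learner that does not yet know the gaps can profit from a few plays of the revealing center:

```python
# Illustrative star graph: the center arm observes every arm; each leaf
# observes only itself.  We compare the per-log(T) exploration cost of
# satisfying Graves--Lai-style constraints (eta >= 2 / Delta_i^2, a
# unit-variance-Gaussian normalization assumed for illustration) using
# only leaf plays versus only center plays.  All numbers are made up.
delta_center = 1.0                    # large gap: the center is costly to play
leaf_gaps = [0.1, 0.1, 0.1, 0.1]      # small gaps: leaves are hard to distinguish

# Each leaf can be observed only via itself or via the center.
cost_leaves = sum(2.0 / d for d in leaf_gaps)                      # each leaf directly
cost_center = delta_center * max(2.0 / d ** 2 for d in leaf_gaps)  # center covers all

# The asymptotic LP prefers the leaves here (cost 80 vs. 200 per unit of log T),
# so an asymptotically optimal algorithm plays the revealing center only rarely.
# Yet in finite time the gaps are unknown, and a few center plays cheaply rule
# out grossly suboptimal arms -- the tension behind question 2.
```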


§ Significance & Implications

Feedback graphs formalize structured side-observations, interpolating between bandit feedback (only the played action observed) and full information (all actions observed). For feedback graphs, asymptotic instance-dependent theory identifies the leading $\log T$ growth rate dictated by information-theoretic lower bounds, but the COLT 2022 open-problem note highlights that this asymptotic benchmark does not uniquely pin down what "optimal" means at a fixed horizon and can be decoupled from finite-time optimality. A finite-time, instance-and-graph-specific characterization would clarify the correct performance target at horizon $T$, determine when asymptotically optimal algorithms are already near-optimal at realistic sample sizes, and isolate graph structures that intrinsically force additional exploration cost beyond the asymptotic leading term.

§ Known Partial Results

  • Marinov et al. (2022): In the classical stochastic multi-armed bandit setting (no side-observations beyond the played arm), both asymptotic and non-asymptotic instance-dependent regret bounds are available; for Gaussian rewards, upper bounds matching the instance-dependent lower bounds up to lower-order terms are known (as cited in the COLT 2022 open-problem abstract).

  • Marinov et al. (2022): For stochastic online learning with feedback graphs, instance-dependent algorithms are known that are asymptotically optimal in the sense of matching the leading $\log T$ term in known information-theoretic lower bounds for this model (as stated in the COLT 2022 open-problem abstract).

  • Marinov et al. (2022): The COLT 2022 open-problem note reports that, unlike in standard bandits, finite-time instance-dependent optimality is not uniquely determined by the asymptotic theory for feedback graphs and can be decoupled from the asymptotic rate, motivating a dedicated finite-time characterization.

  • Marinov et al. (2022): The COLT 2022 open-problem note poses two concrete directions: (i) characterize the finite-horizon instance-dependent optimal regret in this model, and (ii) determine which graphs do (and do not) admit finite-time regret controlled by the asymptotically optimal instance-dependent $\log T$ benchmark for reasonable horizons.

§ References

[1]

Open Problem: Finite-Time Instance Dependent Optimality for Stochastic Online Learning with Feedback Graphs

Teodor Vanislavov Marinov, Mehryar Mohri, Julian Zimmert (2022)

Conference on Learning Theory (COLT), PMLR 178

📍 Open-problem note in COLT proceedings.

