Unsolved

Do Good Algorithms Necessarily Query Bad Points?

Posed by Rong Ge et al. (2019)

§ Problem Statement

Setup

Let $\mathcal{W}\subseteq\mathbb{R}^d$ be a closed convex set. Consider stochastic convex optimization of

$$F(w)=\mathbb{E}_{\xi\sim D}[f(w;\xi)],$$

where $F$ is convex on $\mathcal{W}$ and attains its minimum at some $w^*\in\arg\min_{w\in\mathcal{W}}F(w)$. A stochastic first-order oracle, when queried at $w_t\in\mathcal{W}$, returns $g_t=\nabla f(w_t;\xi_t)$ such that (unbiasedness)

$$\mathbb{E}[g_t\mid \mathcal{F}_{t-1},w_t]=\nabla F(w_t),$$

where $\mathcal{F}_{t-1}=\sigma(\xi_1,\ldots,\xi_{t-1})$; the problem class $\mathcal{P}$ further specifies regularity/noise assumptions (e.g. a uniform second-moment bound $\mathbb{E}[\|g_t\|^2\mid \mathcal{F}_{t-1},w_t]\le G^2$ for all $t$ and $w_t\in\mathcal{W}$).
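The oracle model above can be sketched in a few lines of code. This is a minimal illustration only; `make_oracle` and the Gaussian toy instance are inventions for this note, not constructions from the source:

```python
import numpy as np

def make_oracle(xi_sampler, grad_f, rng=None):
    """Stochastic first-order oracle for F(w) = E_{xi~D}[f(w; xi)].

    grad_f(w, xi) is the gradient of f(.; xi) at w.  A fresh xi is drawn
    on every call, independently of w, so E[g | w] = grad F(w)
    (the unbiasedness condition above).
    """
    if rng is None:
        rng = np.random.default_rng()

    def oracle(w):
        xi = xi_sampler(rng)
        return grad_f(w, xi)

    return oracle

# Toy instance: f(w; xi) = 0.5 * ||w - xi||^2 with xi ~ N(mu, I),
# so F(w) = 0.5 * ||w - mu||^2 + const and grad F(w) = w - mu.
mu = np.array([1.0, -2.0])
oracle = make_oracle(lambda rng: mu + rng.standard_normal(2),
                     lambda w, xi: w - xi)
```

Averaging many oracle calls at a fixed $w$ recovers $\nabla F(w)$ up to Monte Carlo error, which is exactly the unbiasedness requirement.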

For $t\ge 1$, define the minimax expected excess-risk rate after $t$ oracle calls by

$$R_t^*:=\inf_{A}\ \sup_{(F,D)\in\mathcal{P}}\ \mathbb{E}\bigl[F(\hat w_t)-F(w^*)\bigr],$$

where the infimum ranges over all (possibly randomized) algorithms $A$ that make $t$ oracle queries and output $\hat w_t\in\mathcal{W}$ measurable with respect to $\mathcal{F}_t$.
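For concreteness (this instance is not part of the problem statement, but is the classical benchmark): when $\mathcal{W}$ has Euclidean diameter $D$ and the second-moment bound $G^2$ holds, the minimax rate is known to scale as

```latex
R_t^* \;\asymp\; \frac{G\,D}{\sqrt{t}},
```

with the upper bound attained by averaged SGD and the matching lower bound information-theoretic (Nemirovski–Yudin-style arguments). The open problem concerns whether the query points $w_t$ themselves, and not just the output $\hat w_t$, can track a rate of this order.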

Define a (stochastic) first-order algorithm to be non-adaptive if its query points are affine functions of past observed gradients with coefficients fixed before the run: there exist deterministic vectors $b_t\in\mathbb{R}^d$ and deterministic matrices $A_{t,j}\in\mathbb{R}^{d\times d}$ (allowed to depend on $t$ and on known problem parameters, but not on the realized oracle randomness) such that for all $t\ge 1$,

$$w_t \;=\; \Pi_{\mathcal{W}}\Bigl(b_t+\sum_{j=1}^{t-1}A_{t,j}g_j\Bigr),$$

where $\Pi_{\mathcal{W}}$ is Euclidean projection onto $\mathcal{W}$.
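The canonical example of a non-adaptive method is SGD with a pre-fixed step schedule: taking $b_t=w_1$ and $A_{t,j}=-\eta_j I$ recovers the unconstrained SGD recursion. A small sketch (function names are ours, chosen for illustration) checks this identity numerically:

```python
import numpy as np

def sgd_queries(oracle, w1, etas):
    """Plain SGD with a step schedule fixed before the run:
    w_{t+1} = w_t - eta_t * g_t  (no projection, i.e. W = R^d)."""
    ws, gs = [w1.copy()], []
    w = w1.copy()
    for eta in etas:
        g = oracle(w)
        gs.append(g)
        w = w - eta * g
        ws.append(w.copy())
    return ws, gs

def affine_queries(w1, etas, gs):
    """The same query points rewritten in the non-adaptive form
    w_t = b_t + sum_{j<t} A_{t,j} g_j, with b_t = w_1 and
    A_{t,j} = -eta_j * I fixed in advance of the run."""
    ws = [w1.copy()]
    for t in range(1, len(gs) + 1):
        ws.append(w1 + sum(-etas[j] * gs[j] for j in range(t)))
    return ws
```

Note that with a nontrivial $\mathcal{W}$ the definition applies the projection once, at query time; SGD that projects after every intermediate step need not be expressible in this affine form.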

Unsolved Problem

Problem (Ge et al., 2019). For a given class $\mathcal{P}$, is it information-theoretically impossible for a non-adaptive algorithm to have uniformly minimax-optimal query points at all sufficiently large times? Concretely, must it be the case that for every non-adaptive algorithm and every constant $C<\infty$ there exist arbitrarily large $t$ for which

$$\sup_{(F,D)\in\mathcal{P}}\ \mathbb{E}[F(w_t)-F(w^*)]\ >\ C\,R_t^*\,?$$

In its weakest form: is it necessary that

$$\limsup_{t\to\infty}\ \frac{\sup_{(F,D)\in\mathcal{P}}\mathbb{E}[F(w_t)-F(w^*)]}{R_t^*}\ >\ 1,$$

possibly with a gap that grows (e.g. polylogarithmically) with $t$ for some natural choices of $\mathcal{P}$?


§ Significance & Implications

Minimax rates $R_t^*$ are defined for an algorithm's output after $t$ oracle calls, but many stochastic-approximation methods are used in "anytime" or streaming settings where the current iterate $w_t$ itself is deployed. This problem isolates whether non-adaptivity (fixed, pre-scheduled linear use of past gradients) fundamentally prevents all query points from achieving minimax-order excess risk, even when some function of the history (e.g. an averaged iterate) is minimax-optimal. A positive answer would justify algorithmic designs that explicitly trade off statistical optimality against the reliability of intermediate iterates; a negative answer would imply that, at least information-theoretically, minimax-optimal performance need not require repeatedly querying provably bad points.

§ Known Partial Results

  • Ge et al. (2019): Classical stochastic-approximation results show that, for standard convex/noise classes $\mathcal{P}$, SGD with polynomially decaying step sizes combined with iterate averaging can achieve minimax-order excess risk (for the averaged output) under unbiased-gradient and moment conditions.

  • Ge et al. (2019): The COLT 2019 open problem asks whether such minimax-optimality of an output statistic can coexist with minimax-optimality of the raw query points $\{w_t\}$ for non-adaptive first-order methods, or whether some $w_t$ must remain separated from $R_t^*$ infinitely often in the worst case over $\mathcal{P}$.
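The phenomenon in the first bullet can be seen on a toy problem. This is a hypothetical instance invented for illustration (not the construction from the paper): SGD with a decaying schedule plus Polyak–Ruppert iterate averaging drives the averaged output close to $w^*$, even while individual iterates keep fluctuating at the noise scale:

```python
import numpy as np

def averaged_sgd(oracle, w1, etas):
    """SGD with Polyak–Ruppert iterate averaging: returns the last
    iterate and the running average of the iterates w_1, ..., w_T."""
    w = w1.copy()
    avg = np.zeros_like(w1)
    for t, eta in enumerate(etas, start=1):
        w = w - eta * oracle(w)
        avg = avg + (w - avg) / t   # running mean of the iterates so far
    return w, avg

# Toy instance: F(w) = 0.5 * ||w - mu||^2, gradient noise N(0, I),
# so the oracle returns (w - mu) + xi with xi standard Gaussian.
mu = np.array([1.0, -2.0])
rng = np.random.default_rng(0)
oracle = lambda w: (w - mu) + rng.standard_normal(2)
etas = [0.5 / np.sqrt(t) for t in range(1, 2001)]
w_last, w_avg = averaged_sgd(oracle, np.zeros(2), etas)
```

Here $w^*=\mu$; after 2000 steps the averaged iterate sits well inside the noise floor of any single query point, which is precisely the gap between "good output" and "good query points" that the open problem asks about.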

§ References

[1]

Open Problem: Do Good Algorithms Necessarily Query Bad Points?

Rong Ge, Prateek Jain, Sham M. Kakade, Rahul Kidambi, Dheeraj M. Nagaraj, Praneeth Netrapalli (2019)

Conference on Learning Theory (COLT), PMLR 99

Open-problem note in COLT proceedings.

