Unsolved

Anytime Convergence Rate of Gradient Descent

Posed by Guy Kornowski and Ohad Shamir (2024)

§ Problem Statement

Setup

Let $f: \mathbb{R}^d \to \mathbb{R}$ be a convex differentiable function with $L$-Lipschitz gradient ("$L$-smooth"), i.e., $\|\nabla f(x)-\nabla f(y)\| \le L\|x-y\|$ for all $x,y$. Assume $f$ attains its minimum, and fix an initialization $x_0 \in \mathbb{R}^d$, some minimizer $x^* \in \arg\min f$, and the optimal value $f^* = f(x^*)$. Consider vanilla gradient descent with a predetermined (oblivious) stepsize sequence $(\eta_t)_{t\ge 0}$ (possibly depending on $L$ but not on the stopping time $T$):

$$x_{t+1} = x_t - \eta_t \nabla f(x_t), \qquad t=0,1,2,\dots$$

The classical worst-case guarantee for suitable constant stepsizes gives an anytime bound of order $f(x_T)-f^* \le C\,L\|x_0-x^*\|^2/T$, holding for all $T\in\mathbb{N}$ and all $L$-smooth convex $f$, with a universal constant $C$.
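The setup above can be exercised numerically. The sketch below runs gradient descent with the oblivious constant stepsize $\eta = 1/L$ on an anisotropic quadratic (an illustrative choice of $L$-smooth convex function, not the worst case for the bound) and checks that the classical anytime $O(1/T)$ guarantee holds at every iterate; the constant $C=1$ used here is an assumption for the demo, not the sharp value.

```python
import numpy as np

# Illustrative L-smooth convex objective: f(x) = (1/2) x^T A x with
# A = diag(L, mu), so the gradient is Lipschitz with constant L.
L, mu = 4.0, 0.1
A = np.array([L, mu])
f = lambda x: 0.5 * np.dot(A * x, x)
grad = lambda x: A * x

eta = 1.0 / L                        # oblivious constant stepsize in (0, 2/L)
x0 = np.array([3.0, -2.0])           # arbitrary initialization
x_star = np.zeros(2)                 # minimizer of f, with f* = 0
R2 = np.dot(x0 - x_star, x0 - x_star)

x = x0.copy()
for T in range(1, 200):
    x = x - eta * grad(x)            # x is now the iterate x_T
    gap = f(x) - f(x_star)
    # anytime worst-case bound f(x_T) - f* <= C * L * R^2 / T, with C = 1
    assert gap <= L * R2 / T
print("anytime O(1/T) bound verified for T = 1..199")
```

On this particular quadratic the gap actually decays geometrically; the $1/T$ rate is only approached in the worst case over a family of smooth convex functions, which is why the assertion holds with room to spare.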

Unsolved Problem

Do stepsizes alone yield a strictly faster worst-case anytime rate on the actual iterate $x_T$? Equivalently, does there exist a stepsize sequence $(\eta_t)_{t\ge 0}$ and an exponent $\alpha>1$ such that for some universal constant $C<\infty$, for every dimension $d$, every $L$-smooth convex $f$ attaining its minimum, every initialization $x_0$, every choice of minimizer $x^*\in\arg\min f$, and every $T\in\mathbb{N}$,

$$f(x_T)-f^* \le C\,\frac{L\|x_0-x^*\|^2}{T^\alpha}\,?$$

More generally, what is the best rate function $r(T)$ for which there exists an oblivious stepsize schedule such that $f(x_T)-f^* \le C\,L\|x_0-x^*\|^2\,r(T)$ holds simultaneously for all $T$ and all $L$-smooth convex $f$?


§ Significance & Implications

This problem isolates the power and limits of "stepsize-only" design for the most basic first-order method. An affirmative answer would show that, without momentum, restarts, or changing the update rule, gradient descent can achieve a worst-case guarantee that decays faster than $1/T$ at every time $T$ using a single oblivious schedule. A negative answer would establish an intrinsic separation between plain gradient descent and accelerated methods in the anytime (per-iterate) sense, pinpointing that any apparent acceleration from clever stepsizes must fail at some times or for some smooth convex objectives.

§ Known Partial Results

  • Kornowski et al. (2024): With constant stepsize $\eta_t\equiv\eta\in(0,2/L)$, standard analysis yields an anytime worst-case bound of the form $f(x_T)-f^* \le C\,L\|x_0-x^*\|^2/T$ for all $T\in\mathbb{N}$.

  • Kornowski et al. (2024): There exist non-constant stepsize schedules that achieve an accelerated bound at selected, pre-specified horizons (e.g., along a sparse subsequence of times), but such horizon-specific acceleration does not by itself imply an anytime bound for the specific iterate $x_T$ at every $T$.

  • Kornowski et al. (2024): A horizon-specific guarantee can be converted into an anytime guarantee only for "best-so-far" performance (e.g., bounding $\min_{t\le T} f(x_t)-f^*$ by selecting a nearby special horizon), which is strictly weaker than bounding $f(x_T)-f^*$ for each $T$.

  • Kornowski et al. (2024): Any oblivious schedule that would improve the anytime worst-case rate beyond $O(1/T)$ uniformly over all $L$-smooth convex objectives (in the sense $f(x_T)-f^* = o(L\|x_0-x^*\|^2/T)$ for all $T$) must use unbounded stepsizes, i.e., $\limsup_{t\to\infty} \eta_t = \infty$.

  • Kornowski et al. (2024): Large individual steps can cause persistent large errors: in dimension $d=1$, if at some time $T$ one has $\eta_T\ge 2/L$ and also $\sum_{t=0}^{T-1}\eta_t\ge 1/L$, then there exists an $L$-smooth convex $f$ such that $f(x_T)-f^*\ge c\,L\|x_0-x^*\|^2$ for a universal constant $c>0$.

  • Kornowski et al. (2024): Consequently, if a schedule has infinitely many times where the ratio $\eta_T/\sum_{t<T}\eta_t$ is bounded away from $0$ while taking such "long" steps, then $f(x_T)-f^*$ can remain on the order of $L\|x_0-x^*\|^2$ at arbitrarily large times, ruling out any decaying anytime bound for that schedule.
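The single-step mechanism behind the last two results can be seen on a one-dimensional quadratic. The sketch below is only an illustration of why a "long" step $\eta \ge 2/L$ makes no progress on some smooth convex function; the paper's actual lower-bound construction is a hard function tailored to the entire schedule, not this quadratic.

```python
# On the L-smooth quadratic f(x) = (L/2) x^2 (minimizer x* = 0), a
# gradient step maps x -> (1 - eta * L) x.  Whenever eta >= 2/L we have
# |1 - eta * L| >= 1, so the step cannot shrink the distance |x - x*|.
L = 1.0
x = 1.0                              # initialization x0
f = lambda x: 0.5 * L * x * x

for eta in [2.0 / L, 2.5 / L, 3.0 / L]:
    x_next = x - eta * L * x         # gradient step with a "long" stepsize
    assert abs(x_next) >= abs(x)     # no progress: |x_next| >= |x|
print("steps with eta >= 2/L never shrink |x - x*| on this quadratic")
```

For $\eta$ strictly larger than $2/L$ the iterate actually moves farther from the minimizer, which is the seed of the "persistent large error" phenomenon quantified in the bullets above.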

§ References

[1] Guy Kornowski, Ohad Shamir (2024). Open Problem: Anytime Convergence Rate of Gradient Descent. Conference on Learning Theory (COLT), PMLR 247. Open-problem note in the COLT proceedings.

