Unsolved

Can Single-Shuffle SGD be Better than Reshuffling SGD and GD?

Posed by Chulhee Yun et al. (2021)

§ Problem Statement

Setup

Let n2n\ge 2, K1K\ge 1, and d1d\ge 1. Let SnS_n denote the set of all permutations of {1,,n}\{1,\dots,n\}. For real symmetric d×dd\times d matrices X,YX,Y, write XYX\preceq Y for the Loewner order (i.e., YXY-X is positive semidefinite), and let 2\|\cdot\|_2 be the spectral (operator) norm. For a sequence of matrices (B1,,Bn)(B_1,\dots,B_n), use the convention

i=1nBi:=BnB1.\prod_{i=1}^n B_i := B_n\cdots B_1.

Given symmetric matrices A1,,AnA_1,\dots,A_n, define for each σSn\sigma\in S_n the without-replacement product

Pσ:=i=1nAσ(i).P_\sigma := \prod_{i=1}^n A_{\sigma(i)}.

Define the three averaged quantities (all powers are matrix powers)

WSS:=1n!σSn(Pσ)K,WRS:=(1n!σSnPσ)K,WGD:=(1ni=1nAi)nK.W_{\mathrm{SS}} := \frac{1}{n!}\sum_{\sigma\in S_n} \left(P_\sigma\right)^K,\qquad W_{\mathrm{RS}} := \left(\frac{1}{n!}\sum_{\sigma\in S_n} P_\sigma\right)^K,\qquad W_{\mathrm{GD}} := \left(\frac{1}{n}\sum_{i=1}^n A_i\right)^{nK}.

Unsolved Problem

Is it true that for every n2n\ge 2 and K1K\ge 1 there exists a constant ηn,K(0,1]\eta_{n,K}\in(0,1] (depending only on nn and KK, not on dd or on the specific matrices) such that, whenever

(1ηn,K)IAiIfor all i{1,,n},(1-\eta_{n,K})I \preceq A_i \preceq I\quad\text{for all } i\in\{1,\dots,n\},

one has

WSS2WRS2WGD2 ?\|W_{\mathrm{SS}}\|_2 \le \|W_{\mathrm{RS}}\|_2 \le \|W_{\mathrm{GD}}\|_2\ ?

Equivalently: under uniform near-identity well-conditioning, does single-shuffle (expectation outside the KK-th power) never yield a larger spectral norm than random reshuffling (expectation inside the KK-th power), and is random reshuffling never worse than the full-gradient proxy WGDW_{\mathrm{GD}}, with constants uniform over dimension and instances?

§ Discussion

Loading discussion…

§ Significance & Implications

The conjecture asks for a dimension-free, instance-uniform ordering between three concrete noncommutative averages of epoch-wise update matrices. In finite-sum optimization analyses where one epoch corresponds to multiplying near-identity factors (e.g., Ai=IαMiA_i=I-\alpha M_i with small step size α\alpha), bounds on WSS2,WRS2,WGD2\|W_{\mathrm{SS}}\|_2,\|W_{\mathrm{RS}}\|_2,\|W_{\mathrm{GD}}\|_2 translate into quantitative comparisons of expected contraction under single-shuffle, per-epoch reshuffling, and a full-gradient baseline for quadratic/linear models. Establishing existence of ηn,K\eta_{n,K} depending only on (n,K)(n,K) would provide a clean matrix-inequality tool that separates the algorithmic difference "where the expectation is taken" (before vs after the KK-th power) and links without-replacement sampling effects to strengthened noncommutative AM-GM-type inequalities in a regime (uniformly well-conditioned, near-identity factors) tailored to SGD step-size constraints.

§ Known Partial Results

  • Yun et al. (2021): The problem is motivated by extensions of the Recht-Re noncommutative AM-GM conjecture, augmented to include a term capturing the single-shuffle (shuffle-once) sampling scheme that is not covered by the original conjecture.

  • Yun et al. (2021): Existing counterexamples to the most general PSD-matrix versions of related AM-GM-type conjectures rely on rank-deficient (singular) matrices; this motivates restricting attention to positive definite matrices with uniformly bounded condition number, modeled here by (1η)IAiI(1-\eta)I\preceq A_i\preceq I.

  • Yun et al. (2021): The near-identity constraint (1η)IAiI(1-\eta)I\preceq A_i\preceq I matches common SGD linearization forms Ai=IαMiA_i=I-\alpha M_i under small step sizes, and the conjecture seeks an ηn,K\eta_{n,K} that is uniform over all such instances and all dimensions.

  • Yun et al. (2021): In the commuting (simultaneously diagonalizable) case, all products PσP_\sigma coincide, so WSS=WRSW_{\mathrm{SS}}=W_{\mathrm{RS}}, and the comparison to WGDW_{\mathrm{GD}} reduces to scalar inequalities.

  • Yun et al. (2021): As reported in the open-problem note, one can prove WSS2WRS2\|W_{\mathrm{SS}}\|_2\le \|W_{\mathrm{RS}}\|_2 for sufficiently small step sizes in the structured form Ai=IαMiA_i=I-\alpha M_i, but the allowable smallness may depend on the specific instance; the open problem asks for a bound depending only on (n,K)(n,K).

  • Yun et al. (2021): Suggested by special cases discussed in the note, a scaling like ηn,K=O(1/(nK))\eta_{n,K}=O(1/(nK)) is plausible, but no dimension-free, instance-uniform choice is currently established in general.

§ References

[1]

Open Problem: Can Single-Shuffle SGD be Better than Reshuffling SGD and GD?

Chulhee Yun, Suvrit Sra, Ali Jadbabaie (2021)

Conference on Learning Theory (COLT), PMLR 134

📍 Open-problem note in COLT proceedings.

[2]

Open Problem: Can Single-Shuffle SGD be Better than Reshuffling SGD and GD? (PDF)

Chulhee Yun, Suvrit Sra, Ali Jadbabaie (2021)

Conference on Learning Theory (COLT), PMLR 134

📍 Proceedings PDF.

§ Tags