Can Single-Shuffle SGD be Better than Reshuffling SGD and GD?
Posed by Chulhee Yun et al. (2021)
§ Problem Statement
Setup
Let , , and . Let denote the set of all permutations of . For real symmetric matrices , write for the Loewner order (i.e., is positive semidefinite), and let be the spectral (operator) norm. For a sequence of matrices , use the convention
Given symmetric matrices , define for each the without-replacement product
Define the three averaged quantities (all powers are matrix powers)
Unsolved Problem
Is it true that for every and there exists a constant (depending only on and , not on or on the specific matrices) such that, whenever
one has
Equivalently: under uniform near-identity well-conditioning, does single-shuffle (expectation outside the -th power) never yield a larger spectral norm than random reshuffling (expectation inside the -th power), and is random reshuffling never worse than the full-gradient proxy , with constants uniform over dimension and instances?
§ Discussion
§ Significance & Implications
The conjecture asks for a dimension-free, instance-uniform ordering between three concrete noncommutative averages of epoch-wise update matrices. In finite-sum optimization analyses where one epoch corresponds to multiplying near-identity factors (e.g., with small step size ), bounds on translate into quantitative comparisons of expected contraction under single-shuffle, per-epoch reshuffling, and a full-gradient baseline for quadratic/linear models. Establishing existence of depending only on would provide a clean matrix-inequality tool that separates the algorithmic difference "where the expectation is taken" (before vs after the -th power) and links without-replacement sampling effects to strengthened noncommutative AM-GM-type inequalities in a regime (uniformly well-conditioned, near-identity factors) tailored to SGD step-size constraints.
§ Known Partial Results
Yun et al. (2021): The problem is motivated by extensions of the Recht-Re noncommutative AM-GM conjecture, augmented to include a term capturing the single-shuffle (shuffle-once) sampling scheme that is not covered by the original conjecture.
Yun et al. (2021): Existing counterexamples to the most general PSD-matrix versions of related AM-GM-type conjectures rely on rank-deficient (singular) matrices; this motivates restricting attention to positive definite matrices with uniformly bounded condition number, modeled here by .
Yun et al. (2021): The near-identity constraint matches common SGD linearization forms under small step sizes, and the conjecture seeks an that is uniform over all such instances and all dimensions.
Yun et al. (2021): In the commuting (simultaneously diagonalizable) case, all products coincide, so , and the comparison to reduces to scalar inequalities.
Yun et al. (2021): As reported in the open-problem note, one can prove for sufficiently small step sizes in the structured form , but the allowable smallness may depend on the specific instance; the open problem asks for a bound depending only on .
Yun et al. (2021): Suggested by special cases discussed in the note, a scaling like is plausible, but no dimension-free, instance-uniform choice is currently established in general.
§ References
Open Problem: Can Single-Shuffle SGD be Better than Reshuffling SGD and GD?
Chulhee Yun, Suvrit Sra, Ali Jadbabaie (2021)
Conference on Learning Theory (COLT), PMLR 134
📍 Open-problem note in COLT proceedings.
Open Problem: Can Single-Shuffle SGD be Better than Reshuffling SGD and GD? (PDF)
Chulhee Yun, Suvrit Sra, Ali Jadbabaie (2021)
Conference on Learning Theory (COLT), PMLR 134
📍 Proceedings PDF.