Unsolved

Overparameterized optimal subsample size for infinite-ensemble subagging

Sourced from the work of Takuya Koriyama, Pratik Patil, Jin-Hong Du, Kai Tan, Pierre C. Bellec

§ Problem Statement

Setup

Assume the homogeneous subagging setup studied in Koriyama et al.: i.i.d. data $(x_i, y_i)_{i=1}^n$ in proportional asymptotics ($n, p \to \infty$ with $p/n \to \gamma \in (0, \infty)$), with Gaussian design $x_i \sim N(0, I_p/p)$, linear signal-plus-noise response $y_i = x_i^\top \theta + z_i$, and finite second moments for signal and noise. Base learners are regularized M-estimators with convex differentiable loss and convex penalty, trained on uniform subsamples of size $k$, and the full-ensemble estimator is the $M \to \infty$ limit (conditional subsample expectation).

Let $\mathcal{R}_{n,p}(\lambda, k)$ denote the squared prediction risk of the full-ensemble estimator, and let $k^\star_{n,p}(\lambda) \in \arg\min_{1 \le k \le n} \mathcal{R}_{n,p}(\lambda, k)$.

Unsolved Problem

In the vanishing-regularization regime $\lambda = \lambda_{n,p} \to 0^+$, establish whether

$$\limsup_{n,p\to\infty} \frac{k^\star_{n,p}(\lambda_{n,p})}{\min\{n,p\}} \le 1$$

holds under the full generality of the M-estimation framework. Existing results and evidence fall into three groups: (i) proved or analytically derived behavior in specific tractable cases (notably ridgeless/squared-loss settings), (ii) empirical or numerical evidence in other cases (including lasso-type settings), and (iii) the unresolved unified conjecture across general losses and penalties.

For sequences with $p < n$, the limsup formulation to test is correspondingly

$$\limsup_{n,p\to\infty,\; p<n} \frac{k^\star_{n,p}(\lambda_{n,p})}{p} \le 1,$$

rather than a pointwise eventual inequality.
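As a finite-sample illustration of the quantity being conjectured about (not the paper's exact asymptotic formulas), one can approximate the $M \to \infty$ ensemble with a moderate number of ridge base learners at a tiny penalty, serving as a proxy for the $\lambda \to 0^+$ regime, and scan a grid of subsample sizes $k$ for the empirical minimizer. All sizes below ($n$, $p$, $M$, the noise level, the $k$-grid) are illustrative choices, not values from the source.

```python
# Monte Carlo sketch: estimate the subagging risk R(lambda, k) with ridge
# base learners at a tiny penalty, then locate the empirical minimizer k*
# over a grid and compare it to min(n, p). Purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, p, M, lam = 500, 250, 50, 1e-4           # gamma = p/n = 0.5 (p < n case)
theta = rng.normal(size=p) / np.sqrt(p)      # signal with ||theta||^2 ~ 1
X = rng.normal(size=(n, p)) / np.sqrt(p)     # rows x_i ~ N(0, I_p / p)
y = X @ theta + 0.5 * rng.normal(size=n)     # linear signal plus noise

def subagged_risk(k):
    """Average M ridge fits on random subsamples of size k; return risk."""
    theta_bar = np.zeros(p)
    for _ in range(M):
        idx = rng.choice(n, size=k, replace=False)
        Xs, ys = X[idx], y[idx]
        # tiny lam as a proxy for the vanishing-regularization regime
        theta_bar += np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ ys)
    theta_bar /= M
    # prediction risk under x ~ N(0, I_p/p): E[(x^T (theta_bar - theta))^2]
    return np.sum((theta_bar - theta) ** 2) / p

ks = np.arange(50, n + 1, 50)
risks = np.array([subagged_risk(k) for k in ks])
k_star = ks[risks.argmin()]
print(f"empirical k* = {k_star},  k*/min(n,p) = {k_star / min(n, p):.2f}")
```

A single run at one $(n, p)$ pair cannot verify a limsup statement, of course; the sketch only shows how the ratio $k^\star / \min\{n, p\}$ would be tracked along a sequence of problem sizes.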


§ Significance & Implications

This would formalize when implicit regularization from subagging alone is sufficient to control prediction error without explicit penalization. A proof would clarify phase transitions in optimal subsampling and provide principled guidance for choosing $k$ in modern high-dimensional regimes. See Koriyama et al. for the current asymptotic formulas and evidence.

§ Known Partial Results

  • Koriyama et al. (2025): Koriyama et al. provide precise asymptotic risk characterizations for subagged regularized M-estimators. In specialized tractable regimes (notably ridgeless/squared-loss settings), the formulas support overparameterized-optimal-$k$ behavior; for broader settings (including lassoless/lasso-type cases), the paper presents supportive numerical evidence but not a single theorem covering all losses and penalties under vanishing regularization.

§ References

[1]

Precise Asymptotics of Bagging Regularized M-estimators

Takuya Koriyama, Pratik Patil, Jin-Hong Du, Kai Tan, Pierre C. Bellec (2025)

Annals of Statistics (to appear)

📍 Section 5.2 ("Optimal subsample size"), first paragraph and Figure 5 discussion of $k^\star$ shifting toward the overparameterized regime for vanishing explicit regularization (arXiv v3 dated 2025-09-27; canonical citation uses base arXiv id).

Primary source; IMS Annals of Statistics future-papers listing and arXiv preprint (latest public revision v3).
