Unsolved

Rigorous tempered-overfitting guarantees for MDL ReLU interpolators under label noise

Sourced from the work of Sourav Chatterjee, Timothy Sudijono

§ Problem Statement

Setup

Fix an input space $\mathcal X\subseteq \mathbb R^d$ and let $(X_i,Y_i)_{i=1}^n$ be i.i.d. samples from a distribution $P$ on $\mathcal X\times\{0,1\}$. Assume there exists a binary target function $f^\star:\mathcal X\to\{0,1\}$ such that

$$Y = f^\star(X)\oplus \xi,$$

where $\oplus$ is XOR and, conditionally on $X=x$, $\xi\sim \mathrm{Bernoulli}(\eta(x))$ for a measurable $\eta:\mathcal X\to[0,\bar\eta]$ with fixed $\bar\eta<\tfrac12$. Then $\mathbb P(Y\neq f^\star(X)\mid X=x)=\eta(x)$ and

$$R^\star=\inf_{g:\mathcal X\to\{0,1\}} \mathbb P(g(X)\neq Y)=\mathbb E[\eta(X)].$$
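As a concrete illustration of this data model, here is a minimal simulation; the particular $f^\star$, $\eta$, and dimension below are hypothetical choices for the example, not part of the problem statement:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eta_bar = 100_000, 5, 0.2

# Hypothetical target f*: a halfspace indicator on R^d.
def f_star(x):
    return (x[:, 0] > 0).astype(int)

# Hypothetical noise rate eta(x) taking values in [0, eta_bar], eta_bar < 1/2.
def eta(x):
    return eta_bar / (1.0 + np.abs(x[:, 1]))

X = rng.standard_normal((n, d))
xi = rng.random(n) < eta(X)          # xi ~ Bernoulli(eta(x)) given X = x
Y = f_star(X) ^ xi                   # Y = f*(X) XOR xi

# Bayes risk R* = E[eta(X)]; f* itself attains it, so its empirical
# error should match R* up to sampling fluctuation.
bayes_risk = eta(X).mean()
err_f_star = (f_star(X) != Y).mean()
print(bayes_risk, err_f_star)
```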

Assume $f^\star$ has finite description complexity in a fixed prefix-free language $\mathcal L$:

$$K_{\mathcal L}(f^\star):=\min\{|p| : p\in\mathcal L,\ p\text{ outputs }f^\star\}<\infty.$$
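The prefix-free requirement on $\mathcal L$ is what makes $K_{\mathcal L}$ a well-behaved complexity measure: any prefix-free code obeys Kraft's inequality $\sum_p 2^{-|p|}\le 1$. A toy check (the codewords below are illustrative):

```python
# A binary code is prefix-free if no codeword is a prefix of another;
# Kraft's inequality then bounds sum over codewords of 2^{-|p|} by 1.
code = ["0", "10", "110", "111"]

def is_prefix_free(words):
    return not any(a != b and b.startswith(a) for a in words for b in words)

kraft_sum = sum(2.0 ** -len(w) for w in code)
print(is_prefix_free(code), kraft_sum)   # True 1.0
```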

Let $\mathcal F_{\mathrm{ReLU}}$ be the class of binary classifiers representable by finite feedforward ReLU networks (real-valued output thresholded at $1/2$), each carrying a prefix-free code length $\mathrm{DL}(f)\in\mathbb N$. Define $\hat f_n$ via penalized ERM with approximate minimization: given $\lambda_n>0$ and a tolerance $\alpha_n\ge 0$, choose a measurable $\hat f_n\in\mathcal F_{\mathrm{ReLU}}$ such that

$$\hat R_n(\hat f_n)+\lambda_n\frac{\mathrm{DL}(\hat f_n)}{n} \le \inf_{f\in\mathcal F_{\mathrm{ReLU}}}\left\{\hat R_n(f)+\lambda_n\frac{\mathrm{DL}(f)}{n}\right\}+\alpha_n, \qquad \hat R_n(f)=\frac{1}{n}\sum_{i=1}^n\mathbf 1\{f(X_i)\neq Y_i\}.
$$

This setup follows Chatterjee & Sudijono (2024).
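For intuition, the selection rule above can be sketched over a finite candidate pool. Everything below is an illustrative assumption (the candidate set, the assigned code lengths, and $\lambda$), not the paper's construction; it only shows how the complexity penalty lets a simple correct rule beat more complex interpolating rules under label noise:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000
X = rng.standard_normal((n, 2))
f_true = (X[:, 0] > 0).astype(int)
Y = f_true ^ (rng.random(n) < 0.1)        # labels with 10% flip noise

# Hypothetical candidates with assigned prefix-free code lengths DL(f).
candidates = [
    ("true_threshold",  lambda x: (x[:, 0] > 0).astype(int), 20),
    ("wrong_threshold", lambda x: (x[:, 1] > 0).astype(int), 20),
    ("complex_rule",    lambda x: (x[:, 0] * x[:, 1] > 0).astype(int), 500),
    ("constant_zero",   lambda x: np.zeros(len(x), dtype=int), 5),
]

lam = 1.0                                  # lambda_n
scores = {
    name: (f(X) != Y).mean() + lam * dl / n   # \hat R_n(f) + lambda_n DL(f)/n
    for name, f, dl in candidates
}
best = min(scores, key=scores.get)
print(best)   # the low-complexity true rule wins despite nonzero empirical risk
```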

Unsolved Problem

Under explicit assumptions linking program complexity, ReLU approximation/realizability, coding choices, and the sequences $(\lambda_n,\alpha_n)$, prove a nonasymptotic high-probability excess-risk bound

$$\mathbb P\!\left(R(\hat f_n)-R^\star\le \varepsilon\bigl(n,\delta,K_{\mathcal L}(f^\star),\bar\eta,\text{model/coding parameters}\bigr)\right)\ge 1-\delta,$$

with an explicit $\varepsilon(\cdot)$ satisfying $\varepsilon\to0$ as $n\to\infty$ (for fixed $\delta$ and fixed complexity/noise parameters). Motivation: such a theorem would rigorously quantify whether and when MDL regularization tempers overfitting in noisy-label ReLU interpolation, extending the source paper's noiseless guarantees.

§ Significance & Implications

Chatterjee & Sudijono (2024) prove strong generalization for noiseless low-complexity data and only sketch noisy extensions heuristically. A rigorous theorem in the noisy regime would settle whether MDL-style selection avoids catastrophic memorization, and would quantify when overfitting is tempered under label noise.

§ Known Partial Results

  • Chatterjee & Sudijono (2024): Theorem 5.1 gives a high-probability generalization guarantee in the noiseless low-complexity setting, and Section 7 discusses noisy-label extensions only as future work; a fully proved noisy-label theorem remains open.

§ References

[1]

Neural Networks Generalize on Low Complexity Data

Sourav Chatterjee, Timothy Sudijono (2024)

arXiv preprint

📍 Theorem 5.1 (main noiseless high-probability generalization guarantee) and Section 7 (Discussion), paragraph after Theorem 5.1 noting noisy-observation extensions as future work rather than a proved noisy-label theorem.

Primary source is the 2024 arXiv preprint; Annals of Statistics acceptance/to-appear status is separate (reported in 2025) and not itself a noisy-regime theorem.
