Status: Unsolved · Major

Tight PAC-Bayes Bounds for Deep Neural Networks

§ Problem Statement

Setup

Let $\mathcal{X}$ be an input space, let $\mathcal{Y}=\{1,\dots,C\}$ be a finite label set, and let $\mathcal{D}$ be an unknown distribution on $\mathcal{X}\times\mathcal{Y}$. A training sample is $S=((x_1,y_1),\dots,(x_n,y_n))\sim \mathcal{D}^n$ with $n$ i.i.d. examples. Fix a deep neural-network architecture with parameter vector $\theta\in\mathbb{R}^p$ (typically $p\gg n$), and let $h_\theta:\mathcal{X}\to\mathcal{Y}$ be the induced classifier. For a bounded loss $\ell:\mathcal{Y}\times\mathcal{Y}\to[0,1]$ (in particular $\ell=\mathbf{1}\{\hat y\neq y\}$), define the population risk and empirical risk

$$L_{\mathcal D}(h_\theta)=\mathbb E_{(x,y)\sim\mathcal D}\big[\ell(h_\theta(x),y)\big],\qquad \hat L_S(h_\theta)=\frac1n\sum_{i=1}^n \ell(h_\theta(x_i),y_i).$$

A stochastic neural network (Gibbs predictor) is specified by a posterior distribution $Q$ on $\mathbb{R}^p$; prediction uses $\theta\sim Q$. Its risks are $\mathbb E_{\theta\sim Q}[L_{\mathcal D}(h_\theta)]$ and $\mathbb E_{\theta\sim Q}[\hat L_S(h_\theta)]$. Let $P$ be a prior distribution on $\mathbb{R}^p$ that is independent of $S$, and let $\mathrm{KL}(Q\|P)$ denote the Kullback-Leibler divergence.
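For posteriors one can sample from, the Gibbs empirical risk is typically estimated by Monte Carlo over parameter draws. A minimal sketch for a diagonal-Gaussian posterior (the `predict` callback and the toy data are illustrative placeholders, not part of the problem statement):

```python
import numpy as np

def gibbs_empirical_risk(predict, mean, std, X, y, num_samples=10, rng=None):
    """Monte Carlo estimate of E_{theta~Q}[hat L_S(h_theta)] for a Gaussian
    posterior Q = N(mean, diag(std**2)).  `predict(theta, X)` stands in for
    the network's forward pass and returns hard labels."""
    rng = np.random.default_rng(rng)
    total = 0.0
    for _ in range(num_samples):
        theta = rng.normal(mean, std)              # draw theta ~ Q
        total += np.mean(predict(theta, X) != y)   # 0-1 empirical risk
    return total / num_samples

# Toy usage: a 1-D "network" h_theta(x) = 1{x > theta_0} on separable data.
X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0, 0, 1, 1])
predict = lambda theta, X: (X > theta[0]).astype(int)
print(gibbs_empirical_risk(predict, np.zeros(1), 0.1 * np.ones(1), X, y,
                           num_samples=100, rng=0))
```

The averaging over draws is what makes the classifier "stochastic": the certified quantity is the expected risk under $Q$, not the risk of any single weight vector.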

PAC-Bayes bounds are rooted in late-1990s learning theory, with key milestones including 1997 PAC-style Bayes analyses, McAllester's COLT results in 1998/1999, and later consolidations such as McAllester (2003); see Guedj (2019) for a modern overview.

Unsolved Problem

Determine whether there exists a PAC-Bayes framework (a choice of admissible priors/posteriors and a bound/estimation procedure) such that, for modern large-scale deep networks and datasets, one can compute in polynomial time a certified upper bound $B(S,\delta)$ satisfying, with probability at least $1-\delta$ over $S\sim\mathcal D^n$:

$$\mathbb E_{\theta\sim Q}[L_{\mathcal D}(h_\theta)] \le \mathbb E_{\theta\sim Q}[\hat L_S(h_\theta)] +\sqrt{\frac{\mathrm{KL}(Q\|P)+\ln c(n,\delta)}{2n}} \le B(S,\delta),$$

for some explicit $c(n,\delta)$ with $\ln c(n,\delta)$ of logarithmic order in $n$ and $1/\delta$, and simultaneously meeting all three requirements:

  1. Non-vacuity at practical scale: for standard trained architectures (including ImageNet-scale settings), $B(S,\delta)<1$ for the 0-1 loss (strictly better than the trivial upper bound of 1).

  2. Computational tractability: $B(S,\delta)$ (including optimization over/selection of $Q$ and evaluation of the KL and empirical-risk terms) is computable to certified numerical accuracy in time polynomial in the natural problem parameters (at least $n$, $p$, and $\log(1/\delta)$).

  3. Predictive tightness across hyperparameters: over a family of training/hyperparameter choices $\Lambda$, the map $\lambda\mapsto B_\lambda(S,\delta)$ is strongly positively associated with the true test error $\lambda\mapsto \mathbb E_{\theta\sim Q_\lambda}[L_{\mathcal D}(h_\theta)]$ (for example, high rank correlation), so the bound is not only non-vacuous but also informative about relative generalization performance.
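Once the empirical Gibbs risk and the KL term are known, the McAllester-style bound is a one-line computation. The sketch below uses the illustrative choice $c(n,\delta)=2\sqrt{n}/\delta$ (so $\ln c$ is logarithmic in $n$ and $1/\delta$); the input numbers are hypothetical, not from any cited experiment.

```python
import math

def mcallester_bound(emp_risk, kl, n, delta):
    """Upper bound: emp_risk + sqrt((KL + ln c(n, delta)) / (2n)),
    with the illustrative choice c(n, delta) = 2 * sqrt(n) / delta."""
    slack = math.sqrt((kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n))
    return min(1.0, emp_risk + slack)   # the 0-1 risk never exceeds 1

# Hypothetical numbers: n = 55,000 examples, KL = 5,000 nats, delta = 0.05.
print(round(mcallester_bound(0.03, 5000.0, 55000, 0.05), 3))  # 0.243 < 1
```

Note how the certificate degrades: a KL term comparable to $n$ makes the slack order 1, which is exactly why non-vacuity at scale (requirement 1) is hard.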

See Dziugaite & Roy (2017) for further context.
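Requirement 3 can be assessed empirically by rank-correlating bound values against measured test errors across hyperparameter settings. A minimal dependency-free Spearman correlation sketch (all numbers below are hypothetical):

```python
def ranks(values):
    """1-based ranks of a list of numbers, averaging tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                       # extend over a run of ties
        avg = (i + j) / 2.0 + 1.0        # average rank for the tied run
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Hypothetical bounds and test errors over 5 hyperparameter settings:
bounds = [0.25, 0.40, 0.18, 0.55, 0.30]
test_errors = [0.04, 0.09, 0.02, 0.15, 0.06]
print(spearman(bounds, test_errors))  # 1.0: bound order matches error order
```

A correlation near 1 over a realistic family $\Lambda$ is what "predictive tightness" would mean operationally; a loose but perfectly ordered bound can still support model selection.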


§ Significance & Implications

Understanding why deep neural networks generalize despite massive overparameterization is one of the central mysteries of modern machine learning theory. Classical bounds (VC dimension, Rademacher complexity) are typically vacuous for practical networks. PAC-Bayes and compression-based analyses provide some of the strongest available guarantees, but obtaining broadly tight, computationally practical, and consistently informative bounds for modern large-scale pipelines remains open.

§ Known Partial Results

  • Dziugaite & Roy (2017): first non-vacuous PAC-Bayes bound for small neural networks (MNIST).

  • Pérez-Ortiz et al. (2021): tighter PAC-Bayes risk certificates for probabilistic neural networks on MNIST/CIFAR-10 settings (not an ImageNet/ResNet non-vacuity result).

  • Arora et al. (2018): stronger generalization bounds for deep nets via a compression approach.

  • Zhou et al. (2019): explicit non-vacuous ImageNet-scale guarantees via a PAC-Bayesian compression approach.

  • Dziugaite & Roy (2017), Section 7: for many non-compression PAC-Bayes pipelines on modern large models, guarantees are still often loose or architecture-dependent, so broadly tight practical bounds remain open.
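In Gaussian-posterior pipelines of the kind used in these works, the KL term in the bound has a closed form, which is what makes the certificate computable at all. A minimal sketch for diagonal Gaussians (function name and toy dimensions are illustrative):

```python
import numpy as np

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(Q || P) for Q = N(mu_q, diag(sigma_q^2)) and
    P = N(mu_p, diag(sigma_p^2)):
    0.5 * sum( var_q/var_p + (mu_q-mu_p)^2/var_p - 1 + ln(var_p/var_q) )."""
    var_q, var_p = sigma_q ** 2, sigma_p ** 2
    return 0.5 * float(np.sum(var_q / var_p
                              + (mu_q - mu_p) ** 2 / var_p
                              - 1.0
                              + np.log(var_p / var_q)))

# KL of a posterior against itself is zero; shifting the mean increases it.
mu = np.zeros(10)
sigma = 0.1 * np.ones(10)
print(kl_diag_gaussians(mu, sigma, mu, sigma))            # 0.0
print(kl_diag_gaussians(mu + 0.1, sigma, mu, sigma))      # 5.0
```

With $p$ in the millions this sum is still linear-time in $p$, so the KL term itself is not the computational obstacle; the hard part is choosing $Q$ (and a data-independent $P$) so that the KL stays small relative to $n$.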

§ References

[1]

Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data

Gintare Karolina Dziugaite, Daniel M. Roy (2017)

UAI

📍 Section 7 (Conclusion and Future Work), discussion of bound-tightening directions and scalability limits.

[2]

PAC-Bayesian stochastic model selection

David McAllester (2003)

Machine Learning

📍 Core PAC-Bayesian stochastic model-selection bounds (the main theorem statements in the body of the paper).

[3]

A Primer on PAC-Bayesian Learning

Benjamin Guedj (2019)

📍 Historical overview and final 'open problems/perspectives' discussion on tightness and scalability.

[4]

Some PAC-Bayesian Theorems

David A. McAllester (1998)

COLT

📍 Foundational COLT-era PAC-Bayesian theorem statements.

[5]

PAC-Bayesian Model Averaging

David A. McAllester (1999)

COLT

📍 Model-averaging PAC-Bayesian bound development in the main theorem/results sections.

[6]

Stronger generalization bounds for deep nets via a compression approach

Sanjeev Arora, Rong Ge, Behnam Neyshabur, Yi Zhang (2018)

ICML

📍 Compression-based bound refinements yielding tighter guarantees than earlier compression analyses.

[7]

Non-vacuous Generalization Bounds at the ImageNet Scale: a PAC-Bayesian Compression Approach

Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P. Adams, Peter Orbanz (2019)

ICLR

📍 Main theorem plus ImageNet experiments reporting explicit non-vacuous compression-based PAC-Bayes guarantees.
