Tight PAC-Bayes Bounds for Deep Neural Networks
§ Problem Statement
Setup
Let $\mathcal{X}$ be an input space, let $\mathcal{Y}$ be a finite label set, and let $\mathcal{D}$ be an unknown distribution on $\mathcal{X} \times \mathcal{Y}$. A training sample is $S = ((x_1, y_1), \dots, (x_n, y_n))$ with $n$ i.i.d. examples drawn from $\mathcal{D}$. Fix a deep neural-network architecture with parameter vector $w \in \mathbb{R}^d$ (typically $d \gg n$), and let $h_w : \mathcal{X} \to \mathcal{Y}$ be the induced classifier. For a bounded loss $\ell : \mathcal{Y} \times \mathcal{Y} \to [0,1]$ (in particular the 0-1 loss $\ell(\hat{y}, y) = \mathbf{1}[\hat{y} \neq y]$), define the population risk and empirical risk
$$L(w) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[\ell(h_w(x), y)\big], \qquad \hat{L}_S(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(h_w(x_i), y_i).$$
A stochastic neural network (Gibbs predictor) is specified by a posterior distribution $Q$ on $\mathbb{R}^d$; prediction uses $w \sim Q$. Its risks are $L(Q) = \mathbb{E}_{w \sim Q}[L(w)]$ and $\hat{L}_S(Q) = \mathbb{E}_{w \sim Q}[\hat{L}_S(w)]$. Let $P$ be a prior distribution on $\mathbb{R}^d$ that is independent of $S$, and let $\mathrm{KL}(Q \,\|\, P)$ denote the Kullback-Leibler divergence.
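Because the Gibbs predictor draws fresh parameters at prediction time, its empirical risk is typically estimated by Monte Carlo over posterior draws. A minimal sketch, assuming a user-supplied posterior sampler and forward pass (`sample_w` and `classify` are illustrative placeholders, not from any specific paper):

```python
import random

def gibbs_empirical_risk(sample_w, classify, data, n_draws=100, seed=0):
    """Monte Carlo estimate of the Gibbs predictor's empirical 0-1 risk:
    draw parameters from the posterior repeatedly and average the
    misclassification rate on the training sample."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        w = sample_w(rng)  # one posterior draw, w ~ Q
        errors = sum(classify(w, x) != y for x, y in data)
        total += errors / len(data)
    return total / n_draws
```

With a deterministic sampler this reduces to the ordinary empirical risk of a single classifier.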
PAC-Bayes bounds are rooted in late-1990s learning theory, with key milestones including 1997 PAC-style Bayes analyses, McAllester's COLT results in 1998/1999, and later consolidations such as McAllester (2003); see Guedj (2019) for a modern overview.
Unsolved Problem
Determine whether there exists a PAC-Bayes framework (a choice of admissible priors/posteriors and a bound/estimation procedure) such that, for modern large-scale deep networks and datasets, one can compute in polynomial time a certified upper bound $B(S, \delta)$ satisfying, with probability at least $1 - \delta$ over $S$:
$$L(Q) \;\le\; B(S, \delta) \;=\; \hat{L}_S(Q) + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + C(n, \delta)}{2n}}$$
for some explicit $C(n, \delta)$ of logarithmic order in $n$ and $1/\delta$, and simultaneously meeting all three requirements:
- Non-vacuity at practical scale: for standard trained architectures (including ImageNet-scale settings), $B(S, \delta) < 1$ for 0-1 loss (strictly better than the trivial upper bound $1$).
- Computational tractability: $B(S, \delta)$ (including optimization over or selection of $Q$ and $P$, and evaluation of the KL and empirical-risk terms) is computable to certified numerical accuracy in time polynomial in natural problem parameters (at least $n$, $d$, and $1/\delta$).
- Predictive tightness across hyperparameters: over a family of training/hyperparameter choices $\{\theta_j\}$, the map $\theta \mapsto B_\theta(S, \delta)$ is strongly positively associated with true test error (for example, high rank correlation), so the bound is not only non-vacuous but informative about relative generalization performance.
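The third requirement can be made concrete with a rank statistic: compute the bound and the observed test error for each hyperparameter setting and check that the two rankings agree. A minimal Spearman-correlation sketch (pure Python, no tie handling; illustrative only):

```python
import math

def spearman_rho(bounds, test_errors):
    """Spearman rank correlation between certified bound values and
    observed test errors across hyperparameter settings; a value near
    +1 means the bound ranks settings the way the true error does.
    Assumes distinct values (ties are not averaged)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(bounds), ranks(test_errors)
    n = len(bounds)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)
```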
See Dziugaite & Roy (2017) for further context.
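For intuition about what computing such a certificate involves, here is a sketch of a classical McAllester-style bound, $L(Q) \le \hat{L}_S(Q) + \sqrt{(\mathrm{KL}(Q\|P) + \ln(2\sqrt{n}/\delta))/(2n)}$, instantiated with diagonal-Gaussian $Q$ and $P$ so that the KL term has a closed form. This is one standard instantiation, not the hypothetical framework the problem asks for:

```python
import math

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL(Q || P) between diagonal Gaussians, summed over coordinates."""
    return 0.5 * sum(
        math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )

def pac_bayes_bound(emp_risk, kl, n, delta):
    """McAllester-style certificate: with probability >= 1 - delta,
    L(Q) <= emp_risk + sqrt((kl + ln(2*sqrt(n)/delta)) / (2n))."""
    complexity = (kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n)
    return emp_risk + math.sqrt(complexity)
```

Non-vacuity (the first requirement) then amounts to checking that the returned value is below 1; note how the complexity term shrinks as $n$ grows.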
§ Discussion
§ Significance & Implications
Understanding why deep neural networks generalize despite massive overparameterization is one of the central mysteries of modern machine learning theory. Classical bounds (VC dimension, Rademacher complexity) are typically vacuous for practical networks. PAC-Bayes and compression-based analyses provide some of the strongest available guarantees, but obtaining broadly tight, computationally practical, and consistently informative bounds for modern large-scale pipelines remains open.
§ Known Partial Results
Dziugaite & Roy (2017): first non-vacuous PAC-Bayes bound for small neural networks (MNIST).
Pérez-Ortiz et al. (2021): tighter PAC-Bayes risk certificates for probabilistic neural networks on MNIST/CIFAR-10 settings (not an ImageNet/ResNet non-vacuity result).
Arora et al. (2018): compression-based generalization bounds for deep nets, tighter than earlier analyses, though the guarantee applies to the compressed network rather than the original.
Zhou et al. (2019): explicit non-vacuous ImageNet-scale guarantees via a PAC-Bayesian compression approach.
Dziugaite et al. (2017): For many non-compression PAC-Bayes pipelines on modern large models, guarantees are still often loose or architecture-dependent, so broadly tight practical bounds remain open.
§ References
Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data
Gintare Karolina Dziugaite, Daniel M. Roy (2017)
UAI
📍 Section 7 (Conclusion and Future Work), discussion of bound-tightening directions and scalability limits.
PAC-Bayesian stochastic model selection
David McAllester (2003)
Machine Learning
📍 Core PAC-Bayesian stochastic model-selection bounds (the main theorem statements in the body of the paper).
A Primer on PAC-Bayesian Learning
Benjamin Guedj (2019)
📍 Historical overview and final 'open problems/perspectives' discussion on tightness and scalability.
Some PAC-Bayesian Theorems
David A. McAllester (1998)
COLT
📍 Foundational COLT-era PAC-Bayesian theorem statements.
PAC-Bayesian Model Averaging
David A. McAllester (1999)
COLT
📍 Model-averaging PAC-Bayesian bound development in the main theorem/results sections.
Stronger generalization bounds for deep nets via a compression approach
Sanjeev Arora, Rong Ge, Behnam Neyshabur, Yi Zhang (2018)
ICML
📍 Compression-based bound refinements yielding tighter guarantees than earlier compression analyses.
Non-vacuous Generalization Bounds at the ImageNet Scale: a PAC-Bayesian Compression Approach
Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P. Adams, Peter Orbanz (2019)
ICLR
📍 Main theorem plus ImageNet experiments reporting explicit non-vacuous compression-based PAC-Bayes guarantees.