Status: Unsolved · Major

Tight PAC-Bayes Bounds for Deep Neural Networks

§ Problem Statement

Setup

Let $\mathcal{X}$ be an input space, let $\mathcal{Y}=\{1,\dots,C\}$ be a finite label set, and let $\mathcal{D}$ be an unknown distribution on $\mathcal{X}\times\mathcal{Y}$. A training sample is $S=((x_1,y_1),\dots,(x_n,y_n))\sim \mathcal{D}^n$ with $n$ i.i.d. examples. Fix a deep neural-network architecture with parameter vector $\theta\in\mathbb{R}^p$ (typically $p\gg n$), and let $h_\theta:\mathcal{X}\to\mathcal{Y}$ be the induced classifier. For a bounded loss $\ell:\mathcal{Y}\times\mathcal{Y}\to[0,1]$ (in particular $\ell=\mathbf{1}\{\hat y\neq y\}$), define the population risk and empirical risk

$$L_{\mathcal D}(h_\theta)=\mathbb E_{(x,y)\sim\mathcal D}\big[\ell(h_\theta(x),y)\big],\qquad \hat L_S(h_\theta)=\frac1n\sum_{i=1}^n \ell(h_\theta(x_i),y_i).$$

A stochastic neural network (Gibbs predictor) is specified by a posterior distribution $Q$ on $\mathbb{R}^p$; prediction uses $\theta\sim Q$. Its risks are $\mathbb E_{\theta\sim Q}[L_{\mathcal D}(h_\theta)]$ and $\mathbb E_{\theta\sim Q}[\hat L_S(h_\theta)]$. Let $P$ be a prior distribution on $\mathbb{R}^p$ that is independent of $S$, and let $\mathrm{KL}(Q\|P)$ denote the Kullback-Leibler divergence.
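For posteriors one can sample from, the Gibbs empirical risk is typically estimated by Monte Carlo over parameter draws. A minimal sketch for a diagonal-Gaussian posterior (the `predict` callback and the toy data are illustrative placeholders, not part of the problem statement):

```python
import numpy as np

def gibbs_empirical_risk(predict, mean, std, X, y, num_samples=10, rng=None):
    """Monte Carlo estimate of E_{theta~Q}[hat L_S(h_theta)] for a Gaussian
    posterior Q = N(mean, diag(std**2)).  `predict(theta, X)` stands in for
    the network's forward pass and returns hard labels."""
    rng = np.random.default_rng(rng)
    total = 0.0
    for _ in range(num_samples):
        theta = rng.normal(mean, std)              # draw theta ~ Q
        total += np.mean(predict(theta, X) != y)   # 0-1 empirical risk
    return total / num_samples

# Toy usage: a 1-D "network" h_theta(x) = 1{x > theta_0} on separable data.
X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0, 0, 1, 1])
predict = lambda theta, X: (X > theta[0]).astype(int)
print(gibbs_empirical_risk(predict, np.zeros(1), 0.1 * np.ones(1), X, y,
                           num_samples=100, rng=0))
```

The averaging over draws is what makes the classifier "stochastic": the certified quantity is the expected risk under $Q$, not the risk of any single weight vector.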

PAC-Bayes bounds are rooted in late-1990s learning theory, with key milestones including 1997 PAC-style Bayes analyses, McAllester's COLT results in 1998/1999, and later consolidations such as McAllester (2003); see Guedj (2019) for a modern overview.

Unsolved Problem

Determine whether there exists a PAC-Bayes framework (a choice of admissible priors/posteriors and a bound/estimation procedure) such that, for modern large-scale deep networks and datasets, one can compute in polynomial time a certified upper bound $B(S,\delta)$ satisfying, with probability at least $1-\delta$ over $S\sim\mathcal D^n$:

$$\mathbb E_{\theta\sim Q}[L_{\mathcal D}(h_\theta)] \le \mathbb E_{\theta\sim Q}[\hat L_S(h_\theta)] +\sqrt{\frac{\mathrm{KL}(Q\|P)+\ln c(n,\delta)}{2n}} \le B(S,\delta),$$

for some explicit $c(n,\delta)$ with $\ln c(n,\delta)$ of logarithmic order in $n$ and $1/\delta$, and simultaneously meeting all three requirements:

  1. Non-vacuity at practical scale: for standard trained architectures (including ImageNet-scale settings), $B(S,\delta)<1$ for the 0-1 loss (strictly better than the trivial upper bound of 1).

  2. Computational tractability: $B(S,\delta)$ (including optimization over/selection of $Q$ and evaluation of the KL and empirical-risk terms) is computable to certified numerical accuracy in time polynomial in the natural problem parameters (at least $n$, $p$, and $\log(1/\delta)$).

  3. Predictive tightness across hyperparameters: over a family of training/hyperparameter choices $\Lambda$, the map $\lambda\mapsto B_\lambda(S,\delta)$ is strongly positively associated with the true test error $\lambda\mapsto \mathbb E_{\theta\sim Q_\lambda}[L_{\mathcal D}(h_\theta)]$ (for example, high rank correlation), so the bound is not only non-vacuous but also informative about relative generalization performance.
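Once the empirical Gibbs risk and the KL term are known, the McAllester-style bound is a one-line computation. The sketch below uses the illustrative choice $c(n,\delta)=2\sqrt{n}/\delta$ (so $\ln c$ is logarithmic in $n$ and $1/\delta$); the input numbers are hypothetical, not from any cited experiment.

```python
import math

def mcallester_bound(emp_risk, kl, n, delta):
    """Upper bound: emp_risk + sqrt((KL + ln c(n, delta)) / (2n)),
    with the illustrative choice c(n, delta) = 2 * sqrt(n) / delta."""
    slack = math.sqrt((kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n))
    return min(1.0, emp_risk + slack)   # the 0-1 risk never exceeds 1

# Hypothetical numbers: n = 55,000 examples, KL = 5,000 nats, delta = 0.05.
print(round(mcallester_bound(0.03, 5000.0, 55000, 0.05), 3))  # 0.243 < 1
```

Note how the certificate degrades: a KL term comparable to $n$ makes the slack order 1, which is exactly why non-vacuity at scale (requirement 1) is hard.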

See Dziugaite & Roy (2017) for further context.
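Requirement 3 can be assessed empirically by rank-correlating bound values against measured test errors across hyperparameter settings. A minimal dependency-free Spearman correlation sketch (all numbers below are hypothetical):

```python
def ranks(values):
    """1-based ranks of a list of numbers, averaging tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                       # extend over a run of ties
        avg = (i + j) / 2.0 + 1.0        # average rank for the tied run
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Hypothetical bounds and test errors over 5 hyperparameter settings:
bounds = [0.25, 0.40, 0.18, 0.55, 0.30]
test_errors = [0.04, 0.09, 0.02, 0.15, 0.06]
print(spearman(bounds, test_errors))  # 1.0: bound order matches error order
```

A correlation near 1 over a realistic family $\Lambda$ is what "predictive tightness" would mean operationally; a loose but perfectly ordered bound can still support model selection.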


§ Significance & Implications

Understanding why deep neural networks generalize despite massive overparameterization is one of the central mysteries of modern machine learning theory. Classical bounds (VC dimension, Rademacher complexity) are typically vacuous for practical networks. PAC-Bayes and compression-based analyses provide some of the strongest available guarantees, but obtaining broadly tight, computationally practical, and consistently informative bounds for modern large-scale pipelines remains open.

§ Known Partial Results

  • Dziugaite & Roy (2017): first non-vacuous PAC-Bayes bound for small neural networks (MNIST).

  • Pérez-Ortiz et al. (2021): tighter PAC-Bayes risk certificates for probabilistic neural networks on MNIST/CIFAR-10 settings (not an ImageNet/ResNet non-vacuity result).

  • Arora et al. (2018): stronger generalization bounds for deep nets via a compression approach.

  • Zhou et al. (2019): explicit non-vacuous ImageNet-scale guarantees via a PAC-Bayesian compression approach.

  • Dziugaite & Roy (2017), Section 7: for many non-compression PAC-Bayes pipelines on modern large models, guarantees are still often loose or architecture-dependent, so broadly tight practical bounds remain open.
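In Gaussian-posterior pipelines of the kind used in these works, the KL term in the bound has a closed form, which is what makes the certificate computable at all. A minimal sketch for diagonal Gaussians (function name and toy dimensions are illustrative):

```python
import numpy as np

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(Q || P) for Q = N(mu_q, diag(sigma_q^2)) and
    P = N(mu_p, diag(sigma_p^2)):
    0.5 * sum( var_q/var_p + (mu_q-mu_p)^2/var_p - 1 + ln(var_p/var_q) )."""
    var_q, var_p = sigma_q ** 2, sigma_p ** 2
    return 0.5 * float(np.sum(var_q / var_p
                              + (mu_q - mu_p) ** 2 / var_p
                              - 1.0
                              + np.log(var_p / var_q)))

# KL of a posterior against itself is zero; shifting the mean increases it.
mu = np.zeros(10)
sigma = 0.1 * np.ones(10)
print(kl_diag_gaussians(mu, sigma, mu, sigma))            # 0.0
print(kl_diag_gaussians(mu + 0.1, sigma, mu, sigma))      # 5.0
```

With $p$ in the millions this sum is still linear-time in $p$, so the KL term itself is not the computational obstacle; the hard part is choosing $Q$ (and a data-independent $P$) so that the KL stays small relative to $n$.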

§ References

[1]

Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data

Gintare Karolina Dziugaite, Daniel M. Roy (2017)

UAI

📍 Section 7 (Conclusion and Future Work), discussion of bound-tightening directions and scalability limits.

[2]

PAC-Bayesian stochastic model selection

David McAllester (2003)

Machine Learning

📍 Core PAC-Bayesian stochastic model-selection bounds (the main theorem statements in the body of the paper).

[3]

A Primer on PAC-Bayesian Learning

Benjamin Guedj (2019)

📍 Historical overview and final 'open problems/perspectives' discussion on tightness and scalability.

[4]

Some PAC-Bayesian Theorems

David A. McAllester (1998)

COLT

📍 Foundational COLT-era PAC-Bayesian theorem statements.

[5]

PAC-Bayesian Model Averaging

David A. McAllester (1999)

COLT

📍 Model-averaging PAC-Bayesian bound development in the main theorem/results sections.

[6]

Stronger generalization bounds for deep nets via a compression approach

Sanjeev Arora, Rong Ge, Behnam Neyshabur, Yi Zhang (2018)

ICML

📍 Compression-based bound refinements yielding tighter guarantees than earlier compression analyses.

[7]

Non-vacuous Generalization Bounds at the ImageNet Scale: a PAC-Bayesian Compression Approach

Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P. Adams, Peter Orbanz (2019)

ICLR

📍 Main theorem plus ImageNet experiments reporting explicit non-vacuous compression-based PAC-Bayes guarantees.
