Unsolved

The Sample Complexity of Multi-Distribution Learning for VC Classes

Posed by Pranjal Awasthi et al. (2023)

§ Problem Statement

Setup

Let $X$ be a domain and let $H \subseteq \{0,1\}^X$ be a hypothesis class with VC dimension $\mathrm{VC}(H) = d$. Fix parameters $k \in \mathbb{N}$, $\epsilon \in (0,1/2)$, and $\delta \in (0,1)$. In the realizable multi-distribution PAC setting, an instance consists of $k$ unknown distributions $D_1,\dots,D_k$ over $X$ and an unknown target $h^* \in H$. For each $i \in \{1,\dots,k\}$, the learner may draw i.i.d. labeled examples $(x, h^*(x))$ with $x \sim D_i$, choosing an (adaptive) sample allocation $(n_1,\dots,n_k)$ with total sample size $m = \sum_{i=1}^k n_i$. The goal is to output, with probability at least $1-\delta$ over the sampled data, a hypothesis $\hat h \in H$ such that for every $i$,

$$\Pr_{x\sim D_i}[\hat h(x) \neq h^*(x)] \le \epsilon.$$

Define $m^*(d,k,\epsilon,\delta)$ to be the minimum $m$ for which there exists a (possibly adaptive) learner that succeeds for every domain $X$, every class $H$ with $\mathrm{VC}(H) = d$, and every choice of $D_1,\dots,D_k$ and $h^* \in H$.
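To make the protocol concrete, here is a small illustrative simulation (not part of the source): a finite domain, the class of threshold functions on it (VC dimension 1), and a naive learner that splits its budget evenly across the $k$ distributions and returns any hypothesis consistent with all labeled samples, as the realizable setting permits. All names and parameter choices below are hypothetical.

```python
import random

DOMAIN = list(range(20))
# Threshold class h_t(x) = 1 iff x >= t; VC dimension 1.
HYPOTHESES = [lambda x, t=t: int(x >= t) for t in range(21)]

def sample(dist, n, target, rng):
    """Draw n labeled examples (x, target(x)) with x ~ dist (a list of weights)."""
    xs = rng.choices(DOMAIN, weights=dist, k=n)
    return [(x, target(x)) for x in xs]

def erm_learner(dists, target, budget, rng):
    """Uniform allocation plus consistency: realizable ERM over the class."""
    per = budget // len(dists)
    data = [ex for d in dists for ex in sample(d, per, target, rng)]
    for h in HYPOTHESES:
        if all(h(x) == y for x, y in data):
            return h
    return None  # unreachable in the realizable case: h* is always consistent

def max_error(h, dists, target):
    """Worst-case error of h over the k distributions, computed exactly."""
    return max(
        sum(w for x, w in zip(DOMAIN, d) if h(x) != target(x)) / sum(d)
        for d in dists
    )

rng = random.Random(0)
target = HYPOTHESES[10]                      # h*(x) = 1 iff x >= 10
dists = [[1] * 20, [1] * 10 + [0] * 10]      # k = 2 overlapping distributions
h_hat = erm_learner(dists, target, budget=200, rng=rng)
print(max_error(h_hat, dists, target))
```

The point of the open problem is how the budget must scale with $d$, $k$, and $\epsilon$ so that the printed maximum error falls below $\epsilon$ for every instance, not just this easy one; a uniform allocation as above is in general wasteful when the distributions overlap.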

Unsolved Problem

Determine the correct asymptotic dependence of $m^*(d,k,\epsilon,\delta)$ on $d$, $k$, and $\epsilon$ (and logarithmically on $1/\delta$), closing the gap between the best known upper bound (for constant $\delta$)

$$m^* = O\!\left(\epsilon^{-2} \ln(k)\,(d + k) + \min\{\epsilon^{-1} d k,\ \epsilon^{-4} \ln(k)\, d\}\right)$$

and the best known lower bound

$$m^* = \Omega\!\left(\epsilon^{-2}(d + k \ln(k))\right).$$
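A quick numerical sketch can show how the two branches of the $\min\{\cdot\}$ in the upper bound trade off (assumption: constants and the $\log(1/\delta)$ factor are ignored, and the parameter values are arbitrary illustrations). The $\min$ switches branches roughly where $\epsilon^3 \approx \ln(k)/k$, so the $\epsilon^{-4}\ln(k)\,d$ branch only wins when $k$ is large relative to $1/\epsilon$:

```python
import math

def upper_bound(d, k, eps):
    """Stated upper bound, with constants and log(1/delta) dropped."""
    lead = eps**-2 * math.log(k) * (d + k)
    return lead + min(eps**-1 * d * k, eps**-4 * math.log(k) * d)

def lower_bound(d, k, eps):
    """Stated lower bound, with constants dropped."""
    return eps**-2 * (d + k * math.log(k))

d = 10
for k, eps in ((100, 0.01), (10**6, 0.1)):
    branch = ("eps^-1 d k"
              if eps**-1 * d * k <= eps**-4 * math.log(k) * d
              else "eps^-4 ln(k) d")
    print(f"k={k}, eps={eps}: min branch = {branch}, UB/LB = "
          f"{upper_bound(d, k, eps) / lower_bound(d, k, eps):.1f}")
```

In the first regime ($k=100$, $\epsilon=0.01$) the $\epsilon^{-1}dk$ branch is smaller; in the second ($k=10^6$, $\epsilon=0.1$) the $\epsilon^{-4}\ln(k)\,d$ branch takes over. The open problem asks whether either branch is necessary at all.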


§ Significance & Implications

This problem isolates the information-theoretic cost of producing a single hypothesis that achieves error at most $\epsilon$ simultaneously on each of $k$ distributions, when learning a VC class of dimension $d$ under realizability. A tight characterization of $m^*(d,k,\epsilon,\delta)$ would determine whether the terms in the current upper bound beyond $\epsilon^{-2}(d + k\ln k)$, namely the $\min\{\epsilon^{-1}dk,\ \epsilon^{-4}\ln(k)\,d\}$ term and the extra $\ln(k)$ factor multiplying $d$ in the leading term, are unavoidable in the worst case or can be removed by improved algorithms or analyses.

§ Known Partial Results

  • Awasthi et al. (2023): (Upper bound, constant $\delta$.) For VC dimension $d$ and $k$ distributions, the COLT 2023 open-problem note states an upper bound of $O\!\left(\epsilon^{-2} \ln(k)\,(d + k) + \min\{\epsilon^{-1} d k,\ \epsilon^{-4} \ln(k)\, d\}\right)$.

  • Awasthi et al. (2023): (Lower bound.) The same source states a lower bound of $\Omega\!\left(\epsilon^{-2}(d + k \ln(k))\right)$.

  • Awasthi et al. (2023): (Gap.) The bounds agree in their shared $\epsilon^{-2}\,k\ln(k)$ contribution, but the upper bound carries an extra $\ln(k)$ factor on $d$ as well as the additional $\min\{\epsilon^{-1}dk,\ \epsilon^{-4}\ln(k)\,d\}$ term, so they differ in their dependence on $d$, $k$, and $\epsilon$.

  • Awasthi et al. (2023): (Methodological hurdle.) The note discusses obstacles that appear fundamental for attempts to tighten these bounds via game-dynamics-based approaches in statistical learning.

§ References

[1]

Open Problem: The Sample Complexity of Multi-Distribution Learning for VC Classes

Pranjal Awasthi, Nika Haghtalab, Eric Zhao (2023)

Conference on Learning Theory (COLT), PMLR 195

📍 Open-problem note in COLT proceedings.
