Unsolved

The Sample Complexity of Multi-Distribution Learning for VC Classes

Posed by Pranjal Awasthi et al. (2023)

§ Problem Statement

Setup

Let $X$ be a domain and let $H \subseteq \{0,1\}^X$ be a hypothesis class with VC dimension $\mathrm{VC}(H) = d$. Fix parameters $k \in \mathbb{N}$, $\epsilon \in (0,1/2)$, and $\delta \in (0,1)$. In the realizable multi-distribution PAC setting, an instance consists of $k$ unknown distributions $D_1,\dots,D_k$ over $X$ and an unknown target $h^* \in H$. For each $i \in \{1,\dots,k\}$, the learner may draw i.i.d. labeled examples $(x, h^*(x))$ with $x \sim D_i$, choosing an (adaptive) sample allocation $(n_1,\dots,n_k)$ with total sample size $m = \sum_{i=1}^k n_i$. The goal is to output, with probability at least $1-\delta$ over the sampled data, a hypothesis $\hat h \in H$ such that for every $i$,

$$\Pr_{x\sim D_i}[\hat h(x) \neq h^*(x)] \le \epsilon.$$

Define $m^*(d,k,\epsilon,\delta)$ to be the minimum $m$ for which there exists a (possibly adaptive) learner that succeeds for every domain $X$, every class $H$ with $\mathrm{VC}(H) = d$, and every choice of $D_1,\dots,D_k$ and $h^* \in H$.
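To make the protocol concrete, here is a small illustrative simulation (not part of the source): a finite domain, the class of threshold functions on it (VC dimension 1), and a naive learner that splits its budget evenly across the $k$ distributions and returns any hypothesis consistent with all labeled samples, as the realizable setting permits. All names and parameter choices below are hypothetical.

```python
import random

DOMAIN = list(range(20))
# Threshold class h_t(x) = 1 iff x >= t; VC dimension 1.
HYPOTHESES = [lambda x, t=t: int(x >= t) for t in range(21)]

def sample(dist, n, target, rng):
    """Draw n labeled examples (x, target(x)) with x ~ dist (a list of weights)."""
    xs = rng.choices(DOMAIN, weights=dist, k=n)
    return [(x, target(x)) for x in xs]

def erm_learner(dists, target, budget, rng):
    """Uniform allocation plus consistency: realizable ERM over the class."""
    per = budget // len(dists)
    data = [ex for d in dists for ex in sample(d, per, target, rng)]
    for h in HYPOTHESES:
        if all(h(x) == y for x, y in data):
            return h
    return None  # unreachable in the realizable case: h* is always consistent

def max_error(h, dists, target):
    """Worst-case error of h over the k distributions, computed exactly."""
    return max(
        sum(w for x, w in zip(DOMAIN, d) if h(x) != target(x)) / sum(d)
        for d in dists
    )

rng = random.Random(0)
target = HYPOTHESES[10]                      # h*(x) = 1 iff x >= 10
dists = [[1] * 20, [1] * 10 + [0] * 10]      # k = 2 overlapping distributions
h_hat = erm_learner(dists, target, budget=200, rng=rng)
print(max_error(h_hat, dists, target))
```

The point of the open problem is how the budget must scale with $d$, $k$, and $\epsilon$ so that the printed maximum error falls below $\epsilon$ for every instance, not just this easy one; a uniform allocation as above is in general wasteful when the distributions overlap.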

Unsolved Problem

Determine the correct asymptotic dependence of $m^*(d,k,\epsilon,\delta)$ on $d$, $k$, and $\epsilon$ (and logarithmically on $1/\delta$), closing the gap between the best known upper bound (for constant $\delta$)

$$m^* = O\!\left(\epsilon^{-2} \ln(k)\,(d + k) + \min\{\epsilon^{-1} d k,\ \epsilon^{-4} \ln(k)\, d\}\right)$$

and the best known lower bound

$$m^* = \Omega\!\left(\epsilon^{-2}(d + k \ln(k))\right).$$
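A quick numerical sketch can show how the two branches of the $\min\{\cdot\}$ in the upper bound trade off (assumption: constants and the $\log(1/\delta)$ factor are ignored, and the parameter values are arbitrary illustrations). The $\min$ switches branches roughly where $\epsilon^3 \approx \ln(k)/k$, so the $\epsilon^{-4}\ln(k)\,d$ branch only wins when $k$ is large relative to $1/\epsilon$:

```python
import math

def upper_bound(d, k, eps):
    """Stated upper bound, with constants and log(1/delta) dropped."""
    lead = eps**-2 * math.log(k) * (d + k)
    return lead + min(eps**-1 * d * k, eps**-4 * math.log(k) * d)

def lower_bound(d, k, eps):
    """Stated lower bound, with constants dropped."""
    return eps**-2 * (d + k * math.log(k))

d = 10
for k, eps in ((100, 0.01), (10**6, 0.1)):
    branch = ("eps^-1 d k"
              if eps**-1 * d * k <= eps**-4 * math.log(k) * d
              else "eps^-4 ln(k) d")
    print(f"k={k}, eps={eps}: min branch = {branch}, UB/LB = "
          f"{upper_bound(d, k, eps) / lower_bound(d, k, eps):.1f}")
```

In the first regime ($k=100$, $\epsilon=0.01$) the $\epsilon^{-1}dk$ branch is smaller; in the second ($k=10^6$, $\epsilon=0.1$) the $\epsilon^{-4}\ln(k)\,d$ branch takes over. The open problem asks whether either branch is necessary at all.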


§ Significance & Implications

This problem isolates the information-theoretic cost of producing a single hypothesis that achieves error at most $\epsilon$ simultaneously on each of $k$ distributions, when learning a VC class of dimension $d$ under realizability. A tight characterization of $m^*(d,k,\epsilon,\delta)$ would determine whether the terms in the current upper bound beyond $\epsilon^{-2}(d + k\ln k)$, namely the $\min\{\epsilon^{-1}dk,\ \epsilon^{-4}\ln(k)\,d\}$ term and the extra $\ln(k)$ factor multiplying $d$ in the leading term, are unavoidable in the worst case or can be removed by improved algorithms or analyses.

§ Known Partial Results

  • Awasthi et al. (2023): (Upper bound, constant $\delta$.) For VC dimension $d$ and $k$ distributions, the COLT 2023 open-problem note states an upper bound of $O\!\left(\epsilon^{-2} \ln(k)\,(d + k) + \min\{\epsilon^{-1} d k,\ \epsilon^{-4} \ln(k)\, d\}\right)$.

  • Awasthi et al. (2023): (Lower bound.) The same source states a lower bound of $\Omega\!\left(\epsilon^{-2}(d + k \ln(k))\right)$.

  • Awasthi et al. (2023): (Gap.) The bounds agree in their shared $\epsilon^{-2}\,k\ln(k)$ contribution, but the upper bound carries an extra $\ln(k)$ factor on $d$ as well as the additional $\min\{\epsilon^{-1}dk,\ \epsilon^{-4}\ln(k)\,d\}$ term, so they differ in their dependence on $d$, $k$, and $\epsilon$.

  • Awasthi et al. (2023): (Methodological hurdle.) The note discusses obstacles that appear fundamental for attempts to tighten these bounds via game-dynamics-based approaches in statistical learning.

§ References

[1]

Open Problem: The Sample Complexity of Multi-Distribution Learning for VC Classes

Pranjal Awasthi, Nika Haghtalab, Eric Zhao (2023)

Conference on Learning Theory (COLT), PMLR 195

📍 Open-problem note in COLT proceedings.
