Unsolved

How fast can a multiclass test set be overfit?

Posed by Vitaly Feldman et al. (2019)

§ Problem Statement

Setup

Let $m \ge 2$ and $Y = [m] = \{1,\dots,m\}$. Let $X$ be any domain. Let $P$ be any distribution over $X \times Y$ such that $y$ is independent of $x$ and uniform on $Y$ (equivalently, for every classifier $f : X \to Y$, $\mathrm{acc}_P(f) := \Pr_{(x,y)\sim P}[f(x)=y] = 1/m$). For $n \ge 1$, draw a test set $S = ((x_1,y_1),\dots,(x_n,y_n)) \sim P^n$.

For any (possibly randomized) classifier $f : X \to Y$, define the empirical accuracy on $S$ by

$$\mathrm{acc}_S(f) := \frac{1}{n}\sum_{i=1}^n \mathbf{1}[f(x_i) = y_i].$$
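The label-independence assumption means no classifier chosen before seeing $S$ can beat chance on average, which a quick simulation can confirm. A minimal sketch; the domain (small integers), the classifier, and all parameters are arbitrary illustrative choices:

```python
import random

def draw_test_set(n, m, seed=0):
    """Draw S ~ P^n: x is arbitrary (here an integer in 0..9),
    y is uniform on {0, ..., m-1} and independent of x."""
    rng = random.Random(seed)
    return [(rng.randrange(10), rng.randrange(m)) for _ in range(n)]

def acc(f, S):
    """Empirical accuracy acc_S(f) of classifier f on test set S."""
    return sum(f(x) == y for x, y in S) / len(S)

m = 4
S = draw_test_set(100_000, m, seed=1)
fixed_f = lambda x: x % m   # any fixed classifier, chosen before seeing S
a = acc(fixed_f, S)
print(a)                    # close to 1/m = 0.25
```

Overfitting, in this setup, is precisely the gap between such an a-priori baseline of $1/m$ and what an adaptive analyst can reach.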

An analyst gets adaptive access to an exact accuracy oracle for $S$ for $k$ rounds: in round $t = 1,\dots,k$, the analyst submits a classifier $f_t$ (possibly depending on all previous queries and returned accuracies) and observes the exact value $\mathrm{acc}_S(f_t)$. After $k$ queries, the analyst outputs a final classifier $f$.
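The oracle interface can be made concrete as follows. To see why *exact* answers are powerful, the sketch also includes a hypothetical exhaustive attack (my construction, not one from the paper) that recovers every label: changing the prediction on a single point moves $\mathrm{acc}_S$ by exactly $\pm 1/n$, so $1 + n(m-1)$ queries suffice for perfect accuracy. This budget is far beyond the $k \le n/m$ regime of the open problem; it only illustrates the access model.

```python
class AccuracyOracle:
    """Answers acc_S(f) exactly and counts the queries made."""
    def __init__(self, S):
        self.S = S
        self.queries = 0
    def query(self, f):
        self.queries += 1
        return sum(f(x) == y for x, y in self.S) / len(self.S)

def recover_labels(oracle, xs, m):
    """Recover the label of every distinct test point with exact queries.

    Flipping one point's prediction from class 0 to class c raises the
    accuracy by exactly 1/n iff that point's true label is c."""
    base = oracle.query(lambda x: 0)        # predict class 0 everywhere
    labels = {}
    for xi in xs:
        labels[xi] = 0
        for c in range(1, m):
            probe = lambda x, xi=xi, c=c: c if x == xi else 0
            if oracle.query(probe) > base:  # accuracy rose => label is c
                labels[xi] = c
                break
    return lambda x: labels[x]

n, m = 50, 3
S = [(i, i % m) for i in range(n)]          # distinct points, arbitrary labels
oracle = AccuracyOracle(S)
f = recover_labels(oracle, [x for x, _ in S], m)
final_acc = sum(f(x) == y for x, y in S) / n
print(final_acc)        # 1.0: every label recovered
print(oracle.queries)   # at most 1 + n*(m-1)
```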

Unsolved Problem

In the regime $k \le n/m$, does there exist a (possibly inefficient) analyst strategy such that for every such distribution $P$,

$$\mathbb{E}[\mathrm{acc}_S(f)] \ge \frac{1}{m} + \tilde{\Omega}\!\left(\sqrt{\frac{k}{nm}}\right),$$

where the expectation is over $S \sim P^n$ and the analyst's internal randomness, and $\tilde{\Omega}(\cdot)$ hides factors polylogarithmic in $n$, $m$, and $k$? Equivalently (under the stated assumption on $P$), can one achieve expected overfitting bias $\mathbb{E}[\mathrm{acc}_S(f) - \mathrm{acc}_P(f)] \ge \tilde{\Omega}(\sqrt{k/(nm)})$ using only $k$ exact accuracy measurements?
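For a sense of scale, plugging illustrative numbers into the conjectured rate (ignoring the hidden polylogarithmic factor): with $n = 10{,}000$, $m = 10$, and $k = 100$, the target bias is $\sqrt{k/(nm)} \approx 0.032$, a factor of $\sqrt{m} \approx 3.2$ smaller than the classic binary-style rate $\sqrt{k/n} = 0.1$.

```python
import math

# Illustrative parameters only; any n, m, k with k <= n/m are in scope.
n, m, k = 10_000, 10, 100
assert k <= n / m                          # the regime of the open problem

multiclass_rate = math.sqrt(k / (n * m))   # conjectured achievable bias (up to polylogs)
binary_rate = math.sqrt(k / n)             # the m = 2 scaling

print(multiclass_rate)                 # ~0.0316
print(binary_rate)                     # 0.1
print(binary_rate / multiclass_rate)   # sqrt(m) ~ 3.16
```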


§ Significance & Implications

This problem isolates the information-theoretic power of repeated adaptive feedback of exact multiclass test accuracy. It asks whether multiclass problems provide a linear-in-$m$ robustness gain against accuracy-only overfitting (bias scaling like $\tilde{\Theta}(\sqrt{k/(nm)})$), or whether a substantially weaker dependence on $m$ is inherent to this access model. Resolving the gap sharpens worst-case guidance on how long multiclass benchmarks and leaderboards can be reused when participants receive only aggregate accuracy feedback, and it delineates the limits of adaptive-data-analysis phenomena in a practically common evaluation interface.

§ Known Partial Results

  • Feldman, Frostig, Hardt (2019; COLT open-problem note, Thm. 1.1, informal): there exists a distribution $P$ over $m$ classes on which any algorithm making at most $k$ adaptive accuracy queries is, with high probability, limited in how far $\mathrm{acc}_S(f)$ can exceed $1/m$, with the bias bound shrinking as the number of classes grows.

  • Feldman, Frostig, Hardt (2019; COLT open-problem note, Thm. 1.2): for sufficiently large $n$ and $n \ge k \ge k_{\min} = O(m\log m)$, there is a computationally efficient point-wise attack that, on any fixed dataset $S$, outputs $f$ with empirical accuracy bounded away from $1/m$, though with a weaker dependence on $m$ than the conjectured rate.

  • Feldman, Frostig, Hardt (2019; COLT open-problem note): the point-wise attack above is proved optimal within a broad class of point-wise attacks (those that predict each test point independently given the query transcript), leaving open whether non-point-wise strategies can achieve the $\tilde{\Omega}(\sqrt{k/(nm)})$ dependence on $m$.

  • Feldman, Frostig, Hardt (2019; referenced in the COLT note): in a stronger access model where the attacker also sees the unlabeled test points $\{x_i\}_{i=1}^n$ (but not the labels), for $k = \Omega(m\log m)$ there is an attack that on any $S$ outputs $f$ with a larger advantage over $1/m$ than accuracy-only point-wise attacks attain.
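To make the flavor of point-wise attacks concrete, here is a simulation sketch (an illustrative reconstruction, not the attack from the note): submit $k$ uniformly random labelings, then score each candidate label at each point by correlating that point's query labels with the centered exact accuracies, and predict the argmax pointwise. Each prediction depends only on the transcript restricted to that point, which is what makes the attack point-wise. The query budget here ($k = n$) is deliberately larger than the problem's $k \le n/m$ regime, just to make the mechanism visible.

```python
import random

def pointwise_attack(S, k, m, rng):
    """Sketch of a point-wise attack: k random-labeling queries, then
    per-point scoring against the centered exact accuracies."""
    n = len(S)
    queries = [[rng.randrange(m) for _ in range(n)] for _ in range(k)]
    # Exact oracle answers acc_S(f_t) for each query f_t.
    accs = [sum(q[i] == S[i][1] for i in range(n)) / n for q in queries]
    correct = 0
    for i in range(n):
        # score(c): sum of centered accuracies over queries labeling point i as c
        scores = [sum(a - 1.0 / m for q, a in zip(queries, accs) if q[i] == c)
                  for c in range(m)]
        pred = scores.index(max(scores))   # uses only point i's transcript
        correct += (pred == S[i][1])
    return correct / n

n, m, k = 100, 3, 100
rng = random.Random(0)
accs_per_trial = [pointwise_attack([(i, rng.randrange(m)) for i in range(n)], k, m, rng)
                  for _ in range(30)]
mean_acc = sum(accs_per_trial) / len(accs_per_trial)
print(mean_acc)   # noticeably above the 1/m = 1/3 chance baseline
```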

§ References

[1]

Open Problem: How fast can a multiclass test set be overfit?

Vitaly Feldman, Roy Frostig, Moritz Hardt (2019)

Conference on Learning Theory (COLT), PMLR 99

📍 Open-problem note in COLT proceedings.

[2]

The advantages of multiple classes for reducing overfitting from test set reuse

Vitaly Feldman, Roy Frostig, Moritz Hardt (2019)

International Conference on Machine Learning (ICML), PMLR 97

📍 Related work and partial progress context.
