How fast can a multiclass test set be overfit?
Posed by Vitaly Feldman et al. (2019)
§ Problem Statement
Setup
Let $k \ge 2$ and $n \in \mathbb{N}$, and write $[k] = \{1, \dots, k\}$. Let $X$ be any domain. Let $D$ be any distribution over $X \times [k]$ such that the label $y$ is independent of $x$ and uniform on $[k]$ (equivalently, for every classifier $f : X \to [k]$, $\mathrm{acc}_D(f) := \Pr_{(x,y) \sim D}[f(x) = y] = 1/k$). Draw a test set $S = ((x_1, y_1), \dots, (x_n, y_n)) \sim D^n$.
For any (possibly randomized) classifier $f : X \to [k]$, define its empirical accuracy on $S$ by
$$\mathrm{acc}_S(f) \;=\; \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[f(x_i) = y_i].$$
An analyst gets adaptive access to an exact accuracy oracle for $S$ for $q$ rounds: in round $t \in \{1, \dots, q\}$, the analyst submits a classifier $f_t$ (possibly depending on all previous queries and returned accuracies) and observes the exact value $\mathrm{acc}_S(f_t)$. After $q$ queries, the analyst outputs a final classifier $f^*$.
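Because the labels are independent of the features and uniform, the whole interaction can be simulated with label vectors alone: a classifier is fully described by its predictions on the hidden test points. A minimal sketch of the oracle protocol (parameter values are illustrative, not from the note):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, q = 10, 1000, 100           # classes, test-set size, query budget (illustrative)
y = rng.integers(0, k, size=n)    # hidden test labels, uniform on [k]

def acc_oracle(preds: np.ndarray) -> float:
    """Exact empirical accuracy acc_S(f) = (1/n) * #{i : f(x_i) = y_i}."""
    return float(np.mean(preds == y))

# Non-adaptive baseline: q independent uniformly random queries.
# Each has expected accuracy exactly 1/k, with fluctuations of
# order sqrt((1/k)(1 - 1/k)/n) around it.
accs = [acc_oracle(rng.integers(0, k, size=n)) for _ in range(q)]
print(round(float(np.mean(accs)), 3))   # concentrates near 1/k = 0.1
```

The open problem is about how much better an *adaptive* sequence of queries can do than this baseline, as a function of $q$, $n$, and $k$.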
Unsolved Problem
In the regime $q \le n$, does there exist a (possibly inefficient) analyst strategy such that for every such distribution $D$,
$$\mathbb{E}[\mathrm{acc}_S(f^*)] \;\ge\; \frac{1}{k} + \tilde{\Omega}\!\left(\sqrt{\frac{q}{n}}\right),$$
where the expectation is over $S \sim D^n$ and the analyst's internal randomness, and $\tilde{\Omega}$ hides factors polylogarithmic in $n$, $k$, and $q$? Equivalently (under the stated assumption on $D$), can one achieve expected overfitting bias $\mathbb{E}[\mathrm{acc}_S(f^*)] - \frac{1}{k} = \tilde{\Omega}(\sqrt{q/n})$ using only exact accuracy measurements?
§ Discussion
§ Significance & Implications
This problem isolates the information-theoretic power of repeated adaptive feedback of exact multiclass test accuracy. It asks whether multiclass problems provide a linear-in-$k$ robustness gain against accuracy-only overfitting (bias scaling like $\sqrt{q/(kn)}$, so that inflating reported accuracy by a fixed margin costs a factor of $k$ more queries than in the binary case), or whether a substantially weaker dependence on $k$ is inherent in this access model. Resolving the gap sharpens worst-case guidance for how long multiclass benchmarks and leaderboards can be reused when participants receive only aggregate accuracy feedback, and it delineates the limits of adaptive-data-analysis phenomena in a practically common evaluation interface.
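To see what is at stake quantitatively: under the two candidate rates, the number of queries needed to inflate reported accuracy by a fixed margin $\varepsilon$ differs by a factor of $k$. A back-of-the-envelope comparison (hypothetical benchmark sizes; constants and log factors suppressed):

```python
# Queries needed to reach overfitting bias eps under each candidate rate:
#   bias ~ sqrt(q/n)      =>  q ≈ n * eps^2      (binary-style rate)
#   bias ~ sqrt(q/(k*n))  =>  q ≈ k * n * eps^2  (multiclass point-wise rate)
k, n, eps = 1000, 50_000, 0.05    # hypothetical 1000-class benchmark, 50k test points
q_binary_rate = n * eps**2
q_multiclass_rate = k * n * eps**2
print(round(q_binary_rate), round(q_multiclass_rate))   # 125 vs 125000 queries
```

If the multiclass rate is tight for arbitrary attacks, a 1000-class test set tolerates roughly a thousand times as many accuracy queries as a binary one before the same worst-case bias becomes possible.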
§ Known Partial Results
Feldman, Frostig & Hardt (2019; COLT open-problem note, Thm. 1.1, informal): there exists a distribution $D$ of the above form over $k$ classes such that any algorithm making at most $q$ adaptive accuracy queries satisfies, with high probability, $\mathrm{acc}_S(f^*) \le \frac{1}{k} + \tilde{O}\big(\sqrt{q/n}\big)$.
Feldman, Frostig & Hardt (2019; COLT open-problem note, Thm. 1.2): for sufficiently large $n$ and $q$, there is a computationally efficient point-wise attack that, on any fixed dataset $S$, outputs $f^*$ with $\mathrm{acc}_S(f^*) \ge \frac{1}{k} + \tilde{\Omega}\big(\sqrt{q/(kn)}\big)$.
Feldman, Frostig & Hardt (2019; COLT open-problem note): the point-wise attack above is proved optimal within a broad class of point-wise attacks (those that predict each test point independently given the query transcript), leaving open whether non-point-wise strategies can improve the dependence on $k$.
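The flavor of a point-wise attack can be reproduced in simulation. The sketch below is a crude weighted-plurality variant (a simplification for illustration, not the optimized attack from the paper): it queries $q$ uniformly random classifiers, weights each by its observed accuracy deviation from $1/k$, and predicts each point's label by a weighted vote. Each test point is decided independently given the query transcript, so it is point-wise in the above sense.

```python
import numpy as np

rng = np.random.default_rng(1)
k, n, q = 5, 400, 400             # illustrative sizes, not from the paper
y = rng.integers(0, k, size=n)    # hidden test labels

def acc_oracle(preds):
    return float(np.mean(preds == y))   # exact empirical accuracy

# Step 1: query q iid uniformly random classifiers, record exact accuracies,
# and weight each query by its deviation from chance level 1/k.
queries = rng.integers(0, k, size=(q, n))
weights = np.array([acc_oracle(f) for f in queries]) - 1.0 / k

# Step 2: point-wise aggregation. For each point i and label c, score(c) is the
# total weight of the queries that predicted c at i; predict the argmax label.
scores = np.zeros((n, k))
for f, w in zip(queries, weights):
    scores[np.arange(n), f] += w
f_star = scores.argmax(axis=1)

final_acc = acc_oracle(f_star)
print(round(final_acc, 3), "vs chance", 1.0 / k)   # noticeably above 1/k
```

Even this naive aggregation overfits well above chance; the open question is how far the best such gains can be pushed, and whether non-point-wise strategies can beat the $\sqrt{q/(kn)}$ scaling.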
Feldman, Frostig & Hardt (2019; referenced in the COLT note): in a stronger access model where the attacker also has access to the unlabeled test points $x_1, \dots, x_n$ (but not the labels), for $q \le n$ there is an attack that on any $S$ outputs $f^*$ with $\mathrm{acc}_S(f^*) \ge \frac{1}{k} + \tilde{\Omega}\big(\sqrt{q/n}\big)$, i.e., the binary-case rate is attainable regardless of $k$.
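A toy version of why seeing the unlabeled test points changes the game (a naive label-recovery scheme for intuition only, not the attack referenced in the note): once the attacker can address individual test points, exact accuracies act as noiseless counters, and comparing a probe that changes a single prediction against a baseline reveals that point's label with at most $k - 1$ extra queries.

```python
import numpy as np

rng = np.random.default_rng(2)
k, n = 10, 50
y = rng.integers(0, k, size=n)          # hidden labels

def acc_oracle(preds):
    return float(np.mean(preds == y))   # exact empirical accuracy

base = np.zeros(n, dtype=int)           # baseline: predict class 0 everywhere
base_acc = acc_oracle(base)
queries_used = 1

recovered = {}
budget = 200                            # total query budget q (illustrative)
for i in range(n):
    if queries_used + (k - 1) > budget:
        break
    label = 0                           # if no probe raises the accuracy, y_i = 0
    for c in range(1, k):
        probe = base.copy()
        probe[i] = c                    # change the prediction at point i only
        queries_used += 1
        if acc_oracle(probe) - base_acc > 0:   # jump of +1/n  =>  y_i = c
            label = c
            break
    recovered[i] = label

# Final classifier: recovered labels where known, baseline elsewhere.
f_star = base.copy()
for i, c in recovered.items():
    f_star[i] = c
print(len(recovered), "labels recovered; final acc:", acc_oracle(f_star))
```

This crude scheme only recovers about $q/(k-1)$ labels, far from the cited rate, but it isolates the key resource: targeting individual known points turns aggregate accuracy into per-point information.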
§ References
Open Problem: How fast can a multiclass test set be overfit?
Vitaly Feldman, Roy Frostig, Moritz Hardt (2019)
Conference on Learning Theory (COLT), PMLR 99
📍 Open-problem note in COLT proceedings.
The advantages of multiple classes for reducing overfitting from test set reuse
Vitaly Feldman, Roy Frostig, Moritz Hardt (2019)
International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research 97
📍 Related work and partial progress context.