How fast can a multiclass test set be overfit?
Posed by Vitaly Feldman et al. (2019)
§ Problem Statement
Setup
Let $k \ge 2$ and $n \in \mathbb{N}$, and write $[k] = \{1, \dots, k\}$. Let $X$ be any domain. Let $D$ be any distribution over $X \times [k]$ such that the label $y$ is independent of $x$ and uniform on $[k]$ (equivalently, for every classifier $f : X \to [k]$, $\mathrm{acc}_D(f) := \Pr_{(x,y) \sim D}[f(x) = y] = 1/k$). Draw a test set $S = ((x_1, y_1), \dots, (x_n, y_n)) \sim D^n$.
For any (possibly randomized) classifier $f : X \to [k]$, define its empirical accuracy on $S$ by
$$\mathrm{acc}_S(f) \;=\; \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[f(x_i) = y_i].$$
An analyst gets adaptive access to an exact accuracy oracle for $S$ for $q$ rounds: in round $t \in \{1, \dots, q\}$, the analyst submits a classifier $f_t$ (possibly depending on all previous queries and returned accuracies) and observes the exact value $\mathrm{acc}_S(f_t)$. After $q$ queries, the analyst outputs a final classifier $f^*$.
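Because the labels are independent of the features and uniform, the whole interaction can be simulated with label vectors alone: a classifier is fully described by its predictions on the hidden test points. A minimal sketch of the oracle protocol (parameter values are illustrative, not from the note):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, q = 10, 1000, 100           # classes, test-set size, query budget (illustrative)
y = rng.integers(0, k, size=n)    # hidden test labels, uniform on [k]

def acc_oracle(preds: np.ndarray) -> float:
    """Exact empirical accuracy acc_S(f) = (1/n) * #{i : f(x_i) = y_i}."""
    return float(np.mean(preds == y))

# Non-adaptive baseline: q independent uniformly random queries.
# Each has expected accuracy exactly 1/k, with fluctuations of
# order sqrt((1/k)(1 - 1/k)/n) around it.
accs = [acc_oracle(rng.integers(0, k, size=n)) for _ in range(q)]
print(round(float(np.mean(accs)), 3))   # concentrates near 1/k = 0.1
```

The open problem is about how much better an *adaptive* sequence of queries can do than this baseline, as a function of $q$, $n$, and $k$.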
Unsolved Problem
In the regime $q \le n$, does there exist a (possibly inefficient) analyst strategy such that for every such distribution $D$,
$$\mathbb{E}[\mathrm{acc}_S(f^*)] \;\ge\; \frac{1}{k} + \tilde{\Omega}\!\left(\sqrt{\frac{q}{n}}\right),$$
where the expectation is over $S \sim D^n$ and the analyst's internal randomness, and $\tilde{\Omega}$ hides factors polylogarithmic in $n$, $k$, and $q$? Equivalently (under the stated assumption on $D$), can one achieve expected overfitting bias $\mathbb{E}[\mathrm{acc}_S(f^*)] - \frac{1}{k} = \tilde{\Omega}(\sqrt{q/n})$ using only exact accuracy measurements?
§ Discussion
§ Significance & Implications
This problem isolates the information-theoretic power of repeated adaptive feedback of exact multiclass test accuracy. It asks whether multiclass problems provide a linear-in-$k$ robustness gain against accuracy-only overfitting (bias scaling like $\sqrt{q/(kn)}$, so that inflating reported accuracy by a fixed margin costs a factor of $k$ more queries than in the binary case), or whether a substantially weaker dependence on $k$ is inherent in this access model. Resolving the gap sharpens worst-case guidance for how long multiclass benchmarks and leaderboards can be reused when participants receive only aggregate accuracy feedback, and it delineates the limits of adaptive-data-analysis phenomena in a practically common evaluation interface.
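To see what is at stake quantitatively: under the two candidate rates, the number of queries needed to inflate reported accuracy by a fixed margin $\varepsilon$ differs by a factor of $k$. A back-of-the-envelope comparison (hypothetical benchmark sizes; constants and log factors suppressed):

```python
# Queries needed to reach overfitting bias eps under each candidate rate:
#   bias ~ sqrt(q/n)      =>  q ≈ n * eps^2      (binary-style rate)
#   bias ~ sqrt(q/(k*n))  =>  q ≈ k * n * eps^2  (multiclass point-wise rate)
k, n, eps = 1000, 50_000, 0.05    # hypothetical 1000-class benchmark, 50k test points
q_binary_rate = n * eps**2
q_multiclass_rate = k * n * eps**2
print(round(q_binary_rate), round(q_multiclass_rate))   # 125 vs 125000 queries
```

If the multiclass rate is tight for arbitrary attacks, a 1000-class test set tolerates roughly a thousand times as many accuracy queries as a binary one before the same worst-case bias becomes possible.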
§ Known Partial Results
Feldman, Frostig & Hardt (2019; COLT open-problem note, Thm. 1.1, informal): there exists a distribution $D$ of the above form over $k$ classes such that any algorithm making at most $q$ adaptive accuracy queries satisfies, with high probability, $\mathrm{acc}_S(f^*) \le \frac{1}{k} + \tilde{O}\big(\sqrt{q/n}\big)$.
Feldman, Frostig & Hardt (2019; COLT open-problem note, Thm. 1.2): for sufficiently large $n$ and $q$, there is a computationally efficient point-wise attack that, on any fixed dataset $S$, outputs $f^*$ with $\mathrm{acc}_S(f^*) \ge \frac{1}{k} + \tilde{\Omega}\big(\sqrt{q/(kn)}\big)$.
Feldman, Frostig & Hardt (2019; COLT open-problem note): the point-wise attack above is proved optimal within a broad class of point-wise attacks (those that predict each test point independently given the query transcript), leaving open whether non-point-wise strategies can improve the dependence on $k$.
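The flavor of a point-wise attack can be reproduced in simulation. The sketch below is a crude weighted-plurality variant (a simplification for illustration, not the optimized attack from the paper): it queries $q$ uniformly random classifiers, weights each by its observed accuracy deviation from $1/k$, and predicts each point's label by a weighted vote. Each test point is decided independently given the query transcript, so it is point-wise in the above sense.

```python
import numpy as np

rng = np.random.default_rng(1)
k, n, q = 5, 400, 400             # illustrative sizes, not from the paper
y = rng.integers(0, k, size=n)    # hidden test labels

def acc_oracle(preds):
    return float(np.mean(preds == y))   # exact empirical accuracy

# Step 1: query q iid uniformly random classifiers, record exact accuracies,
# and weight each query by its deviation from chance level 1/k.
queries = rng.integers(0, k, size=(q, n))
weights = np.array([acc_oracle(f) for f in queries]) - 1.0 / k

# Step 2: point-wise aggregation. For each point i and label c, score(c) is the
# total weight of the queries that predicted c at i; predict the argmax label.
scores = np.zeros((n, k))
for f, w in zip(queries, weights):
    scores[np.arange(n), f] += w
f_star = scores.argmax(axis=1)

final_acc = acc_oracle(f_star)
print(round(final_acc, 3), "vs chance", 1.0 / k)   # noticeably above 1/k
```

Even this naive aggregation overfits well above chance; the open question is how far the best such gains can be pushed, and whether non-point-wise strategies can beat the $\sqrt{q/(kn)}$ scaling.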
Feldman, Frostig & Hardt (2019; referenced in the COLT note): in a stronger access model where the attacker also has access to the unlabeled test points $x_1, \dots, x_n$ (but not the labels), for $q \le n$ there is an attack that on any $S$ outputs $f^*$ with $\mathrm{acc}_S(f^*) \ge \frac{1}{k} + \tilde{\Omega}\big(\sqrt{q/n}\big)$, i.e., the binary-case rate is attainable regardless of $k$.
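A toy version of why seeing the unlabeled test points changes the game (a naive label-recovery scheme for intuition only, not the attack referenced in the note): once the attacker can address individual test points, exact accuracies act as noiseless counters, and comparing a probe that changes a single prediction against a baseline reveals that point's label with at most $k - 1$ extra queries.

```python
import numpy as np

rng = np.random.default_rng(2)
k, n = 10, 50
y = rng.integers(0, k, size=n)          # hidden labels

def acc_oracle(preds):
    return float(np.mean(preds == y))   # exact empirical accuracy

base = np.zeros(n, dtype=int)           # baseline: predict class 0 everywhere
base_acc = acc_oracle(base)
queries_used = 1

recovered = {}
budget = 200                            # total query budget q (illustrative)
for i in range(n):
    if queries_used + (k - 1) > budget:
        break
    label = 0                           # if no probe raises the accuracy, y_i = 0
    for c in range(1, k):
        probe = base.copy()
        probe[i] = c                    # change the prediction at point i only
        queries_used += 1
        if acc_oracle(probe) - base_acc > 0:   # jump of +1/n  =>  y_i = c
            label = c
            break
    recovered[i] = label

# Final classifier: recovered labels where known, baseline elsewhere.
f_star = base.copy()
for i, c in recovered.items():
    f_star[i] = c
print(len(recovered), "labels recovered; final acc:", acc_oracle(f_star))
```

This crude scheme only recovers about $q/(k-1)$ labels, far from the cited rate, but it isolates the key resource: targeting individual known points turns aggregate accuracy into per-point information.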
§ References
Open Problem: How fast can a multiclass test set be overfit?
Vitaly Feldman, Roy Frostig, Moritz Hardt (2019)
Conference on Learning Theory (COLT), PMLR 99
📍 Open-problem note in COLT proceedings.
The advantages of multiple classes for reducing overfitting from test set reuse
Vitaly Feldman, Roy Frostig, Moritz Hardt (2019)
International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research 97
📍 Related work and partial progress context.