§ Problem Statement
Setup
Let $\mathcal{X}$ be an instance space and let $P$ be an unknown distribution on $\mathcal{X} \times \mathbb{R}$. Draw a pooled dataset $S = ((X_1, Y_1), \dots, (X_n, Y_n)) \sim P^n$ i.i.d. A (possibly randomized) selection rule observes the entire labeled pool and outputs an index set $I \subseteq [n]$ with $|I| = m$ (with $m \le n$); write $S_I$ for the selected sub-sample. Fix a learning rule $\mathcal{A}$ that maps any labeled sample to a predictor $h : \mathcal{X} \to \mathbb{R}$. Evaluate predictors by squared loss and population risk $L_P(h) = \mathbb{E}_{(X,Y) \sim P}\big[(h(X) - Y)^2\big]$.
Fix a hypothesis class $\mathcal{H}$ and a family of distributions $\mathcal{P}$. Define the minimax excess risk (over $\mathcal{P}$) achievable by first selecting $m$ points from an i.i.d. pool of size $n$ and then running the fixed learner $\mathcal{A}$:

$$\mathcal{E}_{\mathcal{A}}(m, n) \;=\; \inf_{\text{selection rules}} \; \sup_{P \in \mathcal{P}} \; \mathbb{E}\Big[ L_P\big(\mathcal{A}(S_I)\big) \;-\; \inf_{h \in \mathcal{H}} L_P(h) \Big],$$

where the expectation is over $S \sim P^n$ and any internal randomness of $\mathcal{A}$ (and of the selection rule, if randomized).
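As a hedged illustration of the select-then-train protocol (the function names, the toy distribution, and the Monte Carlo setup below are ours, not from the note), the pipeline and an empirical estimate of the excess risk can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

def select_random(pool_y, m, rng):
    """Baseline label-blind selection rule: m indices uniformly without replacement."""
    return rng.choice(len(pool_y), size=m, replace=False)

def learn_mean(y_sel):
    """Fixed learner for mean estimation: empirical mean of the selected labels."""
    mu = y_sel.mean()
    return lambda x: mu  # constant predictor

def excess_risk_mc(select, learn, m, n, trials=2000):
    """Monte Carlo estimate of E[L_P(A(S_I))] - inf_h L_P(h) for a toy P
    with trivial X and Y ~ N(0, 1): the best constant predictor is 0,
    so the excess risk of the constant mu is exactly mu**2."""
    errs = []
    for _ in range(trials):
        y = rng.normal(0.0, 1.0, size=n)  # pooled labels
        idx = select(y, m, rng)           # selection rule sees the whole pool
        h = learn(y[idx])                 # fixed learner runs on the sub-sample only
        errs.append(h(None) ** 2)
    return float(np.mean(errs))
```

For uniform selection this recovers the familiar $\sigma^2/m$ behavior (here $\approx 1/m$); the open question is whether and when pool-aware selection rules can do better while the learner stays fixed.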
Unsolved Problem
For natural fixed regression learning rules $\mathcal{A}$, determine tight (up to constants and/or logarithmic factors) upper and lower bounds on $\mathcal{E}_{\mathcal{A}}(m, n)$ as a function of $m$ and $n$ in basic regression settings, and characterize when selecting $m \ll n$ points can achieve excess risk comparable to using all $n$ points.
Two concrete testbeds emphasized in the COLT open-problem note are:
- Mean estimation: $\mathcal{X}$ is a singleton, $\mathcal{H}$ is the class of constant predictors, and $\mathcal{A}$ returns the empirical mean on the selected labels, $\hat{\mu} = \frac{1}{m} \sum_{i \in I} Y_i$. For a moment-bounded family such as $\{P : \mathrm{Var}(Y) \le \sigma^2\}$, determine the optimal rate of $\mathcal{E}_{\mathcal{A}}(m, n)$ as a function of $m$ and $n$.
- Linear regression: $\mathcal{X} = \mathbb{R}^d$, $\mathcal{H} = \{x \mapsto \langle w, x \rangle : w \in \mathbb{R}^d\}$, and $\mathcal{A}$ is a standard fixed rule such as ordinary least squares (squared-loss ERM) applied to $S_I$. For a moment-bounded family such as distributions with suitably bounded low-order moments, determine the optimal dependence of $\mathcal{E}_{\mathcal{A}}(m, n)$ on $m$, $n$, and $d$, and identify regimes where $m \ll n$ suffices to match (up to constants/log factors) the excess risk achievable from the full pool.
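For the linear-regression testbed, a minimal simulation (our own sketch; the selection rule, sizes, and Gaussian design are illustrative assumptions, not choices made in the note) compares OLS trained on a uniformly selected sub-sample against OLS on the full pool:

```python
import numpy as np

rng = np.random.default_rng(1)

def ols(X, y):
    """Ordinary least squares via lstsq (handles rank deficiency)."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

d, n, m = 5, 2000, 100
w_star = np.ones(d)                          # ground-truth coefficients
X = rng.normal(size=(n, d))                  # isotropic Gaussian design
y = X @ w_star + rng.normal(size=n)          # unit-variance noise

idx = rng.choice(n, size=m, replace=False)   # uniform (label-blind) selection
w_sub = ols(X[idx], y[idx])                  # fixed learner on the sub-sample
w_full = ols(X, y)                           # fixed learner on the full pool

# Under the isotropic Gaussian design, the excess squared-loss risk of the
# predictor x -> <w, x> equals ||w - w_star||^2.
err_sub = float(np.sum((w_sub - w_star) ** 2))
err_full = float(np.sum((w_full - w_star) ** 2))
```

Uniform selection pays roughly $d/m$ excess risk versus $d/n$ from the full pool; the open problem asks how far a pool-aware selection rule can close this gap while the learner remains plain OLS.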
§ Discussion
§ Significance & Implications
This problem isolates the algorithm-dependent limits of regression data curation: when the training rule $\mathcal{A}$ is fixed in advance, how much of the population-risk performance (measured by excess squared loss relative to $\mathcal{H}$) can be preserved by selecting only $m$ labeled examples from an available i.i.d. pool of size $n$? Tight minimax characterizations of $\mathcal{E}_{\mathcal{A}}(m, n)$ would precisely quantify when subset selection can be statistically near-lossless versus when it necessarily incurs a penalty, and would make explicit how this depends on the selection budget $m$, pool size $n$, and (for linear regression) dimension $d$ under concrete moment conditions.
§ Known Partial Results
Hanneke et al. (2025): The COLT 2025 open-problem note (Hanneke, Moran, Shlimovich, Yehudayoff, 2025) formulates the general data-selection question for regression with a fixed learning rule $\mathcal{A}$ and a selection budget $m$ from a pool of size $n$, evaluated by population squared loss/excess risk.
Hanneke et al. (2025): The note highlights mean estimation and linear regression as basic settings where one seeks tight (up to constants/log factors) upper and lower bounds on the best achievable excess risk when training only on the selected sub-sample.
§ References
Open Problem: Data Selection for Regression Tasks
Steve Hanneke, Shay Moran, Alexander Shlimovich, Amir Yehudayoff (2025)
Conference on Learning Theory (COLT), PMLR 291
📍 Open-problem note in COLT proceedings.