Unsolved

Data Selection for Regression Tasks

Posed by Steve Hanneke et al. (2025)

§ Problem Statement

Setup

Let $\mathcal{Z}=\mathcal{X}\times\mathbb{R}$ and let $P$ be an unknown distribution on $\mathcal{Z}$. Draw a pooled dataset $S=((x_i,y_i))_{i=1}^N\sim P^N$ i.i.d. A (possibly randomized) selection rule $\mathrm{Sel}$ observes the entire labeled pool $S$ and outputs an index set $I=\mathrm{Sel}(S)\subseteq\{1,\ldots,N\}$ with $|I|=n$ (where $1\le n\le N$); write $S_I=((x_i,y_i))_{i\in I}$ for the selected sub-sample. Fix a learning rule $\mathcal{A}$ that maps any labeled sample to a predictor $\hat f=\mathcal{A}(\cdot):\mathcal{X}\to\mathbb{R}$. Evaluate predictors by squared loss and population risk

$$\ell(f,(x,y))=(f(x)-y)^2,\qquad L_P(f)=\mathbb{E}_{(X,Y)\sim P}\big[(f(X)-Y)^2\big].$$

Fix a hypothesis class $\mathcal{F}\subseteq\{f:\mathcal{X}\to\mathbb{R}\}$ and a family of distributions $\mathcal{P}$. Define the minimax excess risk (over $\mathcal{P}$) achievable by first selecting $n$ points from an i.i.d. pool of size $N$ and then running the fixed learner $\mathcal{A}$:

$$\mathfrak{R}_{\mathcal{A}}(n,N;\mathcal{P}) := \sup_{P\in\mathcal{P}}\ \inf_{\mathrm{Sel}}\ \mathbb{E}\Big[ L_P\big(\mathcal{A}(S_{\mathrm{Sel}(S)})\big) - \inf_{f\in\mathcal{F}} L_P(f) \Big],$$

where the expectation is over $S\sim P^N$ and any internal randomness of $\mathrm{Sel}$ (and of $\mathcal{A}$, if randomized).
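To make the select-then-train pipeline concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption, not part of the problem statement: the distribution ($\mathcal{X}$ a singleton, $Y\sim\mathrm{Uniform}[-1,1]$, so $\mathbb{E}[Y^2]\le 1$), the baseline selection rule (a uniformly random size-$n$ index set), and the fixed learner (the empirical mean of the selected labels, as in the mean-estimation testbed below).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pool(N):
    # Hypothetical choice of P: X is a singleton, Y ~ Uniform[-1, 1],
    # which satisfies the moment bound E[Y^2] <= 1.
    return rng.uniform(-1.0, 1.0, size=N)

def select_random(y, n):
    # Baseline (non-informed) selection rule Sel: a uniformly random
    # index set I of size n, ignoring the labels entirely.
    return rng.choice(len(y), size=n, replace=False)

def learner_mean(y_selected):
    # Fixed learning rule A: the empirical mean of the selected labels.
    return y_selected.mean()

def run(select, N, n):
    # One draw of the pipeline: pool S ~ P^N, indices I = Sel(S),
    # then A applied to the sub-sample S_I.
    y = sample_pool(N)
    I = select(y, n)
    assert len(I) == n
    return learner_mean(y[I])
```

With this baseline rule the selected points are themselves i.i.d. from $P$, so the estimator behaves exactly like an $n$-sample empirical mean; the open problem asks how much better an informed rule, which sees all of $S$, can do.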

Unsolved Problem

For natural fixed regression learning rules $\mathcal{A}$, determine tight (up to constants and/or logarithmic factors) upper and lower bounds on $\mathfrak{R}_{\mathcal{A}}(n,N;\mathcal{P})$ as a function of $n$ and $N$ in basic regression settings, and characterize when selecting $n\ll N$ points can achieve excess risk comparable to using all $N$ points.

Two concrete testbeds emphasized in the COLT open-problem note are:

  1. Mean estimation: $\mathcal{X}$ is a singleton, $\mathcal{F}=\{f_\mu: f_\mu(x)\equiv \mu\}$, and $\mathcal{A}$ returns the empirical mean of the selected labels, $\hat\mu=\frac{1}{n}\sum_{i\in I} y_i$. For a moment-bounded family such as $\mathcal{P}=\{P: \mathbb{E}[Y^2]\le 1\}$, determine the optimal rate of
$$\sup_{P\in\mathcal{P}}\ \inf_{\mathrm{Sel}}\ \mathbb{E}\big[(\hat\mu-\mathbb{E}Y)^2\big]$$
as a function of $n$ and $N$.

  2. Linear regression: $\mathcal{X}\subseteq\mathbb{R}^d$, $\mathcal{F}=\{x\mapsto\langle w,x\rangle: w\in\mathbb{R}^d\}$, and $\mathcal{A}$ is a standard fixed rule such as ordinary least squares (squared-loss ERM) applied to $S_I$. For a moment-bounded family such as $\mathcal{P}=\{P: \mathbb{E}[\|X\|_2^2]\le 1,\ \mathbb{E}[Y^2]\le 1\}$, determine the optimal dependence of $\mathfrak{R}_{\mathcal{A}}(n,N;\mathcal{P})$ on $n$, $N$, and $d$, and identify regimes where $n\ll N$ suffices to match (up to constants/log factors) the excess risk achievable from the full pool.
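The mean-estimation testbed already exhibits the phenomenon the problem asks about. The sketch below compares two selection rules under an illustrative distribution ($Y\sim\mathrm{Uniform}[-1,1]$, true mean $0$): uniformly random selection, and a simple informed heuristic that picks the $n$ labels closest to the pool mean. The heuristic is my own illustration, not a rule proposed in the note; for this distribution its error tracks the full-pool mean (roughly a $1/N$ rate) while random selection pays the usual $1/n$ rate, showing how $n\ll N$ selected points can nearly match the whole pool.

```python
import numpy as np

rng = np.random.default_rng(1)

def select_random(y, n):
    # Non-informed baseline: a uniformly random size-n index set.
    return rng.choice(len(y), size=n, replace=False)

def select_near_pool_mean(y, n):
    # Informed heuristic (illustrative, not from the note): the selector
    # sees the whole labeled pool, so it can pick the n labels whose
    # values are closest to the pool mean y.mean().
    return np.argsort(np.abs(y - y.mean()))[:n]

def mse(select, trials=200, N=5_000, n=10):
    # Monte Carlo estimate of E[(mu_hat - E[Y])^2]; here E[Y] = 0.
    errs = []
    for _ in range(trials):
        y = rng.uniform(-1.0, 1.0, size=N)
        errs.append(y[select(y, n)].mean() ** 2)
    return float(np.mean(errs))
```

Running `mse(select_near_pool_mean)` versus `mse(select_random)` shows the informed rule's squared error is orders of magnitude smaller at the same budget $n$; the open problem asks for the optimal such rule and its minimax rate.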


§ Significance & Implications

This problem isolates the algorithm-dependent limits of regression data curation: when the training rule $\mathcal{A}$ is fixed in advance, how much of the population-risk performance (measured by excess squared loss relative to $\inf_{f\in\mathcal{F}} L_P(f)$) can be preserved by selecting only $n$ labeled examples from an available i.i.d. pool of size $N$? Tight minimax characterizations of $\mathfrak{R}_{\mathcal{A}}(n,N;\mathcal{P})$ would quantify precisely when subset selection can be statistically near-lossless and when it necessarily incurs a penalty, and would make explicit how this depends on the selection budget $n$, the pool size $N$, and (for linear regression) the dimension $d$ under concrete moment conditions.

§ Known Partial Results

  • Hanneke et al. (2025): The COLT 2025 open-problem note (Hanneke, Moran, Shlimovich, Yehudayoff, 2025) formulates the general data-selection question for regression with a fixed learning rule $\mathcal{A}$ and a selection budget of $n$ points from a pool of size $N$, evaluated by population squared loss / excess risk.

  • Hanneke et al. (2025): The note highlights mean estimation and linear regression as basic settings in which one seeks tight (up to constants/log factors) upper and lower bounds on the best achievable excess risk when $\mathcal{A}$ is trained only on the selected sub-sample.

§ References

[1]

Open Problem: Data Selection for Regression Tasks

Steve Hanneke, Shay Moran, Alexander Shlimovich, Amir Yehudayoff (2025)

Conference on Learning Theory (COLT), PMLR 291

📍 Open-problem note in COLT proceedings.

