Unsolved

Learning sparse linear concepts by priming the features

Posed by Manfred K. Warmuth et al. (2023)

§ Problem Statement

Setup

Online linear regression with squared loss in dimension $n$. For rounds $t=1,2,\dots,T$: the learner observes $x_t\in\mathbb{R}^n$, predicts $\hat y_t=\langle w_t,x_t\rangle$, then observes $y_t\in\mathbb{R}$ and incurs loss $\ell_t(w_t)=(\hat y_t-y_t)^2$. Assume coordinatewise bounded instances $\|x_t\|_\infty\le X$ and bounded outcomes $|y_t|\le Y$ for all $t$.

Let $X_{t-1}\in\mathbb{R}^{(t-1)\times n}$ be the design matrix with rows $x_1^\top,\dots,x_{t-1}^\top$, and let $y_{t-1}=(y_1,\dots,y_{t-1})^\top$. For a matrix $A$, let $A^\dagger$ denote the Moore–Penrose pseudoinverse, and for $p\in\mathbb{R}^n$ let $\mathrm{diag}(p)$ be the diagonal matrix with $p$ on the diagonal.

A one-stage feature-priming least-squares predictor is specified by a closed-form rule that maps past data to a priming vector $p_{t-1}=p(X_{t-1},y_{t-1})\in\mathbb{R}^n$, forms the primed design $X'_{t-1}:=X_{t-1}\,\mathrm{diag}(p_{t-1})$, computes primed least-squares coefficients $v_{t-1}:=(X'_{t-1})^\dagger y_{t-1}$, and sets

$$w_t := \mathrm{diag}(p_{t-1})\, v_{t-1} = \mathrm{diag}(p_{t-1})\,\bigl(X_{t-1}\,\mathrm{diag}(p_{t-1})\bigr)^\dagger y_{t-1}.$$
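In code, the predictor above is a short pipeline: prime, refit on the rescaled design, map back. A minimal NumPy sketch, assuming the priming rule is passed in as a function (the name `primed_least_squares` and the all-ones example rule below are illustrative, not from the note):

```python
import numpy as np

def primed_least_squares(X_past, y_past, priming_rule):
    """One-stage feature-priming predictor:
    w_t = diag(p) (X_past diag(p))^+ y_past,
    where p = priming_rule(X_past, y_past) is a closed-form map of past data.
    """
    p = priming_rule(X_past, y_past)        # priming vector p_{t-1}
    X_primed = X_past * p                   # X diag(p): rescale column i by p_i
    v = np.linalg.pinv(X_primed) @ y_past   # primed least-squares coefficients
    return p * v                            # diag(p) v; predict <w_t, x_t>
```

With the trivial all-ones priming rule the primed design equals the original, so the output reduces to the minimum-norm least-squares fit $X_{t-1}^\dagger y_{t-1}$.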

The COLT 2023 note highlights concrete closed-form priming rules motivated by sparsity, including (i) per-coordinate 1D least squares $p_{t-1,i}:=\langle X_{t-1}(:,i),y_{t-1}\rangle/\|X_{t-1}(:,i)\|_2^2$ when $\|X_{t-1}(:,i)\|_2\neq 0$ (and e.g. $0$ otherwise), (ii) correlation-style scores between column $i$ and $y_{t-1}$, and (iii) the two-pass rule $p_{t-1}:=w^{\mathrm{LLS}}_{t-1}$, where $w^{\mathrm{LLS}}_{t-1}:=X_{t-1}^\dagger y_{t-1}$ is the minimum-norm least-squares fit on the unprimed features.
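These three rules can be written down directly. A hedged NumPy sketch follows; note that the correlation rule (ii) is implemented here as cosine similarity of each column with the targets, which is one natural choice of normalization rather than the note's exact definition:

```python
import numpy as np

def prime_1d_ls(X, y):
    """Rule (i): per-coordinate 1D least squares,
    p_i = <X[:, i], y> / ||X[:, i]||^2, with p_i = 0 for all-zero columns."""
    norms = np.sum(X ** 2, axis=0)
    p = np.zeros(X.shape[1])
    nz = norms > 0
    p[nz] = (X.T @ y)[nz] / norms[nz]
    return p

def prime_correlation(X, y):
    """Rule (ii), one variant: cosine similarity between column i and y."""
    col_norms = np.linalg.norm(X, axis=0)
    y_norm = np.linalg.norm(y)
    p = np.zeros(X.shape[1])
    nz = col_norms > 0
    if y_norm > 0:
        p[nz] = (X.T @ y)[nz] / (col_norms[nz] * y_norm)
    return p

def prime_lls(X, y):
    """Rule (iii): two-pass rule, p = X^+ y (minimum-norm least-squares fit)."""
    return np.linalg.pinv(X) @ y
```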

Unsolved Problem

For at least one of these priming rules, does the resulting online algorithm satisfy a regret bound with only logarithmic dependence on $n$ against sparse (or $\ell_1$-bounded) linear comparators? Namely, is there a bound of the form

$$\mathrm{Regret}_T := \sum_{t=1}^T \bigl(\langle w_t,x_t\rangle-y_t\bigr)^2 \;-\; \min_{\|w\|_1\le W_1}\sum_{t=1}^T \bigl(\langle w,x_t\rangle-y_t\bigr)^2 \;\le\; \mathrm{poly}(W_1,X,Y)\cdot \log n$$

(or, more finely, $\le \mathrm{poly}(W_1,X,Y)\cdot k\log(n/k)$ when competing with $k$-sparse comparators), uniformly over all sequences with $\|x_t\|_\infty\le X$ and $|y_t|\le Y$? The question remains open even in the noise-free realizable setting $y_t=\langle w^*,x_t\rangle$ with sparse $w^*$.
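In the noise-free realizable setting the comparator term vanishes, so $\mathrm{Regret}_T$ is simply the learner's cumulative loss. The following is a minimal simulation sketch of the online protocol with the two-pass priming rule (iii) on synthetic sparse data; the dimensions, horizon, and seed are arbitrary choices, not from the note:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, k = 50, 40, 2          # dimension, horizon, sparsity (illustrative)

# Noise-free realizable data: y_t = <w*, x_t> with a k-sparse w*, so the
# comparator loss is 0 and Regret_T equals the cumulative loss below.
w_star = np.zeros(n)
w_star[:k] = 1.0
X_seq = rng.uniform(-1.0, 1.0, size=(T, n))
y_seq = X_seq @ w_star

cum_loss = 0.0
for t in range(T):
    X_past, y_past = X_seq[:t], y_seq[:t]
    if t == 0:
        w = np.zeros(n)                          # no data yet: predict 0
    else:
        p = np.linalg.pinv(X_past) @ y_past      # two-pass priming (rule iii)
        v = np.linalg.pinv(X_past * p) @ y_past  # least squares on primed design
        w = p * v
    cum_loss += (w @ X_seq[t] - y_seq[t]) ** 2   # squared loss at round t

print(f"cumulative loss over T={T} rounds: {cum_loss:.4f}")
```

Such experiments illustrate the quantity in question, but of course say nothing about a worst-case bound over all sequences.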


§ Significance & Implications

For squared-loss online prediction, multiplicative-update methods can exploit sparsity to obtain regret bounds whose dimension dependence is logarithmic in $n$ under $\ell_1/\ell_\infty$-type conditions, but their updates are not closed-form least squares. In contrast, least-squares-style (Euclidean/second-order) updates are closed-form but typically do not yield sparsity-adaptive dimension dependence. Feature priming is a concrete, computationally simple way to bias a second least-squares fit toward a small set of coordinates using only past data. Proving (or refuting) a worst-case $\log n$-type regret guarantee for such priming would clarify whether closed-form least-squares predictors can match sparsity-adaptive guarantees without switching to multiplicative updates, and would delineate the limits of coordinate-sensitive preconditioning schemes in online regression.

§ Known Partial Results

  • Warmuth et al. (2023): The COLT 2023 note observes that sparse linear problems are learned well by online multiplicative updates, motivating the search for closed-form least-squares-style alternatives.

  • Warmuth et al. (2023): The proposed priming family is fully specified by closed-form functions of past data followed by pseudoinverse least squares on a diagonally rescaled (coordinate-primed) design.

  • Warmuth et al. (2023): Experiments reported in the note indicate that several priming rules (notably $p_{t-1}=X_{t-1}^\dagger y_{t-1}$ and variants such as per-feature 1D least squares or correlation-based scores) can identify sparse relevant features faster than vanilla least squares while behaving similarly to least squares on dense targets.

  • Warmuth et al. (2023): No worst-case online regret bound is currently given for these priming-based closed-form updates, and the existence of sparsity-adaptive (e.g., $\log n$ or $k\log(n/k)$) regret bounds remains open.
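The experimental observation in the first bullet is easy to probe on synthetic data. The sketch below compares vanilla minimum-norm least squares with one round of two-pass priming (rule iii) on an underdetermined sparse instance; all sizes and the seed are illustrative, and whether priming wins can depend on the instance:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 100, 30, 3   # n features, m < n samples, k-sparse target (illustrative)

w_star = np.zeros(n)
w_star[:k] = 1.0
X = rng.uniform(-1.0, 1.0, size=(m, n))
y = X @ w_star                                   # noise-free sparse target

# Vanilla minimum-norm least squares vs. one round of two-pass priming.
w_lls = np.linalg.pinv(X) @ y
w_primed = w_lls * (np.linalg.pinv(X * w_lls) @ y)

# Out-of-sample squared error against the true sparse target.
X_test = rng.uniform(-1.0, 1.0, size=(1000, n))
err_lls = np.mean((X_test @ (w_lls - w_star)) ** 2)
err_primed = np.mean((X_test @ (w_primed - w_star)) ** 2)
print(f"LLS test error: {err_lls:.4f}   primed test error: {err_primed:.4f}")
```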

§ References

[1]

Open Problem: Learning sparse linear concepts by priming the features

Manfred K. Warmuth, Ehsan Amid (2023)

Conference on Learning Theory (COLT), PMLR 195

📍 Open-problem note in COLT proceedings.

