Unsolved

Improper learning of mixtures of Gaussians

Posed by Elad Hazan and Roi Livni (2018)

§ Problem Statement

Setup

Fix integers $d\ge 1$ and $k\ge 2$, an accuracy parameter $\epsilon\in(0,1)$, and a radius $R>0$. Let $X:=\{x\in\mathbb{R}^d:\|x\|_2\le R\}$, and let $D$ be an unknown distribution supported on $X$ (in particular, the motivating case is when $D$ is an overcomplete mixture of $k$ Gaussians with $k>d$).

A (deterministic) $b$-bit pointwise compression scheme is a pair $(f,\rho)$ with encoder $f:X\to\{0,1\}^b$ and decoder $\rho:\{0,1\}^b\to\mathbb{R}^d$. Its squared reconstruction error on $D$ is

$$\mathcal{R}_D(f,\rho):=\mathbb{E}_{x\sim D}\,\|\rho(f(x))-x\|_2^2.$$

Define the (spherical) $k$-center quantization class $\mathcal{F}_k$ as follows: for any codebook (centers) $\mu_1,\dots,\mu_k\in X$, the encoder outputs an index $i(x)\in\arg\min_{i\in[k]}\|x-\mu_i\|_2^2$ (ties broken arbitrarily), represented using $b:=\lceil\log_2 k\rceil$ bits, and the decoder outputs $\mu_{i(x)}$. The optimal $k$-center distortion is

$$\mathrm{OPT}_k(D):=\inf_{(f,\rho)\in\mathcal{F}_k}\mathcal{R}_D(f,\rho)=\inf_{\mu_1,\dots,\mu_k\in X}\mathbb{E}_{x\sim D}\Big[\min_{i\in[k]}\|x-\mu_i\|_2^2\Big].$$
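A codebook of $k$ centers fully determines a scheme in $\mathcal{F}_k$: the encoder emits the index of the nearest center and the decoder looks that center up. A minimal NumPy sketch (function names are illustrative, not from the note):

```python
import numpy as np

def make_k_center_scheme(centers):
    """Build the (f, rho) pair induced by a codebook mu_1..mu_k.

    centers: (k, d) array of codewords in X. The encoder maps x to
    the index of its nearest center, which takes ceil(log2 k) bits;
    the decoder returns the center at that index.
    """
    centers = np.asarray(centers, dtype=float)

    def f(x):  # encoder: index of the nearest center (ties -> lowest index)
        return int(np.argmin(np.sum((centers - x) ** 2, axis=1)))

    def rho(i):  # decoder: look up the codeword
        return centers[i]

    return f, rho

def empirical_distortion(samples, f, rho):
    """Monte Carlo estimate of R_D(f, rho) from i.i.d. samples of D."""
    return float(np.mean([np.sum((rho(f(x)) - x) ** 2) for x in samples]))
```

Taking the infimum of `empirical_distortion` over codebooks recovers the empirical version of the $\mathrm{OPT}_k$ objective above.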

An (improper) compression-based learner is an algorithm $A$ that, given $m$ i.i.d. samples from $D$, outputs a (not necessarily $k$-center) compression scheme $(g,\rho_g)$ with code length $b'=b'(k,\epsilon)$ such that, with probability at least $2/3$ over the samples,

$$\mathcal{R}_D(g,\rho_g)\le \mathrm{OPT}_k(D)+\epsilon.$$
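For intuition, the natural proper baseline for this guarantee runs Lloyd's $k$-means iterations on the sample and outputs the induced $k$-center scheme. This is only a heuristic: it reaches a local optimum of the empirical objective, and the exact finite-sample problem is NP-hard in general. A hedged sketch:

```python
import numpy as np

def lloyd_kmeans(samples, k, iters=50, seed=0):
    """Heuristic proper learner: Lloyd's iterations on the sample.

    Returns a (k, d) codebook; the induced nearest-center scheme
    uses ceil(log2 k) bits per point. Only a local optimum of the
    empirical k-means objective is guaranteed.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(samples, dtype=float)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest current center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # move each center to the mean of its cluster (keep old if empty)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers
```

The open problem asks whether an *improper* learner, free to output any encoder/decoder pair, can beat such proper baselines in the overcomplete regime while keeping the code length polylogarithmic.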

Unsolved Problem

Problem 2018. In the overcomplete regime $k>d$, does there exist a computationally efficient (polynomial-time in $d,k,1/\epsilon$ and the sample size) improper learner achieving the above guarantee while using only

$$b'(k,\epsilon)=\mathrm{poly}(\log k,\log(1/\epsilon))$$

bits per point (i.e., with no polynomial dependence on $d$ in the code length)? Alternatively, can one prove that no such efficient learner exists under this worst-case, compression-based notion of learning?

§ Significance & Implications

This problem asks whether allowing improper reconstructions (arbitrary decoders/encoders learned from data) can yield a worst-case additive-$\epsilon$ competitor to the optimal $k$-center distortion while keeping the per-point description length near $\log k$ (up to polylog factors), even when $k>d$. A positive result would formalize an efficient, sample-based route to near-optimal quantization error for distributions such as overcomplete Gaussian mixtures without committing to a generative/proper mixture model. A negative result would isolate a concrete barrier showing that, even after relaxing to improper compression schemes, achieving $\mathrm{OPT}_k(D)+\epsilon$ with only polylogarithmic bits is computationally out of reach in the overcomplete setting.

§ Known Partial Results

  • Hazan et al. (2018): The COLT 2018 open-problem note motivates the question from unsupervised learning of (overcomplete) mixtures of Gaussians, and explicitly switches to a worst-case, compression-based objective to allow improper learners.

  • Hazan et al. (2018): The benchmark objective $\mathrm{OPT}_k(D)=\inf_{\mu_1,\dots,\mu_k}\mathbb{E}[\min_i\|x-\mu_i\|_2^2]$ is the population $k$-means/$k$-center squared-distortion objective; even the corresponding finite-sample optimization problem is NP-hard in general.

  • Hazan et al. (2018): If the code-length constraint is dropped, trivial worst-case additive-$\epsilon$ reconstruction is possible by quantizing all of $X$ to an $\epsilon$-net, at the cost of $b'\approx d\log(R/\epsilon)$ bits per point (far larger than $\mathrm{poly}(\log k,\log(1/\epsilon))$ when $d$ is large).

  • Hazan et al. (2018): A different improper relaxation that reconstructs points from low-dimensional linear structure leads to PCA-type algorithms, but representing per-point coefficients to achieve error $\epsilon$ requires code length scaling on the order of the intrinsic dimension times $\log(R/\epsilon)$, which does not match the desired polylogarithmic dependence on $k$.

  • Hazan et al. (2018): The note points to bi-criteria approximations for $k$-means (e.g., allowing more than $k$ centers) and to convex/spectral approaches in the broader comparative/worst-case unsupervised-learning literature as potential starting points, but does not resolve whether polylog-bit improper learning is achievable for $k>d$.
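To make the code-length gap in the baselines above concrete, the following sketch counts the bits used by a uniform-grid $\epsilon$-net on the radius-$R$ ball and compares them with the polylogarithmic target. The accounting and constants are illustrative, not taken from the note:

```python
import math

def grid_net_bits(d, R, eps):
    """Bits for a uniform-grid eps-net on the radius-R ball in R^d.

    Quantizing each coordinate to cells of half-width sqrt(eps/d)
    bounds the squared reconstruction error by d * (eps/d) = eps,
    at roughly log2(R / sqrt(eps/d)) bits per coordinate.
    (Illustrative accounting; constants are not optimized.)
    """
    per_coord = max(1, math.ceil(math.log2(2 * R / (2 * math.sqrt(eps / d)))))
    return d * per_coord

def polylog_target_bits(k, eps):
    """The desired budget poly(log k, log 1/eps); here just the product."""
    return math.ceil(math.log2(k)) * math.ceil(math.log2(1 / eps))
```

For instance, with $d=10^4$, $k=2\cdot 10^4$, $R=1$, and $\epsilon=0.01$, the grid net spends on the order of $10^5$ bits per point while the target budget is in the low hundreds, which is exactly the gap the open problem asks to close.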

§ References

[1]

Open problem: Improper learning of mixtures of Gaussians
Elad Hazan, Roi Livni (2018)
Conference on Learning Theory (COLT), PMLR 75 (open-problem note).
