Unsolved

Tight Characterization of Instance-Optimal Identity Testing

Posed by Clément Canonne (2024)

§ Problem Statement

Setup

Let q be a known discrete probability distribution over a countable domain Ω, given to the tester via a succinct description, and let ε ∈ (0, 1]. The tester is given i.i.d. samples from an unknown distribution p over Ω and must, with success probability at least 2/3 (over its internal randomness and the samples), distinguish between

H₀: p = q   and   H₁: TV(p, q) > ε,

where total variation distance is TV(p, q) := (1/2) Σ_{x∈Ω} |p(x) − q(x)|. Define the instance-optimal sample complexity n*(q, ε) as the minimum integer n for which there exists such a (possibly randomized) tester that uses at most n samples from p.
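The total variation distance above is a straightforward sum over the domain. The following is a minimal sketch for finite distributions represented as dictionaries; the distributions and names here are illustrative, not part of the problem statement (in the testing model, p is of course only accessible through samples).

```python
# Sketch: total variation distance TV(p, q) = (1/2) * sum_x |p(x) - q(x)|
# for finite distributions given as {outcome: probability} dicts.
# Hypothetical example data; in the testing problem p is unknown.

def tv_distance(p, q):
    """Sum |p(x) - q(x)| over the union of the two supports, halved."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

q = {"a": 0.5, "b": 0.3, "c": 0.2}   # known reference distribution
p = {"a": 0.4, "b": 0.3, "c": 0.3}   # a particular alternative

print(tv_distance(p, q))  # 0.5 * (0.1 + 0.0 + 0.1) = 0.1 (up to float rounding)
```

With this p, a tester run at any ε < 0.1 must reject H₀, while at ε ≥ 0.1 the pair falls into neither hypothesis and the tester may answer arbitrarily.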

Unsolved Problem

Characterize n*(q, ε) up to universal constant factors for all discrete q and all ε ∈ (0, 1], in terms of a simple, explicit instance-dependent functional of the reference distribution and accuracy: identify a functional Φ(q, ε) and absolute constants c, C > 0 such that

c · Φ(q, ε) ≤ n*(q, ε) ≤ C · Φ(q, ε)

for all q and ε. Equivalently, what is the correct instance-dependent quantity that governs the optimal sample complexity of identity testing against a fixed reference distribution q at distance parameter ε?


§ Significance & Implications

Worst-case identity-testing bounds optimize over the hardest reference distributions, but can substantially overestimate the samples needed for a given, fixed q. A tight instance-optimal characterization would (i) precisely separate which parts of q drive the statistical difficulty at accuracy ε, (ii) pin down the best achievable adaptivity (a single tester whose sample usage matches n*(q, ε) up to constants on every instance), and (iii) provide a sharp benchmark for lower bounds and algorithm design across qualitatively different regimes (e.g., near-uniform versus heavy-tailed or highly non-uniform references).

§ Known Partial Results

  • Canonne (2024): introduced the instance-optimal identity-testing formulation and gave upper and lower bounds on n*(q, ε) expressed via an ℓ_{2/3}-type quantity applied to a suitable truncation of the reference distribution q.

  • In this line of work, the ℓ_{2/3} quasi-norm is typically written as ‖v‖_{2/3} := (Σᵢ |vᵢ|^{2/3})^{3/2} for a nonnegative vector v.

  • The resulting upper and lower bounds do not, in general, match within constant factors: for some choices of q, the multiplicative gap between the best known upper and lower bounds can be arbitrarily large.

  • Consequently, identifying the correct functional Φ(q, ε) that tightly characterizes n*(q, ε) up to universal constant factors remains open.
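The ℓ_{2/3} quasi-norm from the list above is easy to compute directly. Below is a minimal sketch (the function name is mine, and the truncation step applied to q in the actual bounds is deliberately omitted); as a sanity check, for the uniform distribution over k elements the quasi-norm evaluates to k^{1/2}, consistent with the familiar √k worst-case scaling of uniformity testing.

```python
# Sketch: the l_{2/3} quasi-norm ||v||_{2/3} = (sum_i |v_i|^{2/3})^{3/2}
# for a nonnegative vector v. Function name is illustrative; the bounds
# discussed above apply this to a suitable truncation of q, not to q itself.

def l23_quasinorm(v):
    """Compute (sum_i v_i^{2/3})^{3/2} for nonnegative entries v_i."""
    return sum(x ** (2.0 / 3.0) for x in v) ** 1.5

# Uniform over k elements: each term is (1/k)^{2/3}, the sum is k^{1/3},
# and raising to the 3/2 power gives k^{1/2}.
k = 100
print(l23_quasinorm([1.0 / k] * k))  # ~= 10.0, i.e. sqrt(100)
```

Note that ‖·‖_{2/3} is only a quasi-norm: the triangle inequality fails for exponents below 1, which is one reason truncation choices matter in the bounds.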

§ References

[1]

Clément Canonne (2024). Open Problem: Tight Characterization of Instance-Optimal Identity Testing. Conference on Learning Theory (COLT), PMLR 247.

📍 Open-problem note in COLT proceedings.

