log(n) factor in "Local Glivenko-Cantelli"

§ Problem Statement

Setup

Let $p=(p_j)_{j\in\mathbb{N}}$ be a nonincreasing sequence with $0\le p_j\le 1/2$ for all $j$ and $p_j\to 0$ as $j\to\infty$. For each $n\in\mathbb{N}$, let $(Y_j)_{j\in\mathbb{N}}$ be independent random variables with $Y_j\sim\mathrm{Binomial}(n,p_j)$. Define the centered empirical errors $\bar Y_j:=Y_j/n-p_j$ and

$$\Delta_n(p):=\mathbb{E}\,\sup_{j\in\mathbb{N}}\,|\bar Y_j|.$$
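To get a feel for this quantity, it can be approximated by simulation. The following is a minimal sketch, not part of the original problem statement: it truncates $p$ to a finite prefix (so the result estimates a lower bound on $\Delta_n(p)$, since the supremum runs over fewer coordinates), and the function name `estimate_delta`, the trial count, and the example sequence $p_j=1/(j+1)$ are all illustrative assumptions.

```python
import random

def estimate_delta(p, n, trials=200, seed=0):
    """Monte Carlo estimate of E max_j |Y_j/n - p_j| over the finite prefix p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        worst = 0.0
        for pj in p:
            # Y_j ~ Binomial(n, p_j), drawn as a sum of n Bernoulli(p_j) trials.
            y = sum(rng.random() < pj for _ in range(n))
            worst = max(worst, abs(y / n - pj))
        total += worst
    return total / trials

# Hypothetical example sequence: p_j = 1/(j+1) for j = 1..50 (nonincreasing, p_1 = 1/2).
p_prefix = [1.0 / (j + 1) for j in range(1, 51)]
est = estimate_delta(p_prefix, n=100)
print(f"estimated truncated Delta_n(p) for n=100: {est:.4f}")
```

Drawing each binomial as a sum of Bernoulli trials keeps the sketch dependency-free; for large $n$ or long prefixes a vectorized sampler would be the more practical choice.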

Using natural logarithms, define the complexity parameters

$$S(p):=\sup_{j\in\mathbb{N}} p_j\,\log(j+1),\qquad T(p):=\sup_{j\in\mathbb{N}}\frac{\log(j+1)}{\log(1/p_j)}.$$
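As a concrete illustration of these two parameters, the sketch below evaluates both suprema over a finite prefix of a sequence. The truncation is an assumption of the sketch (a finite prefix can only under-estimate each supremum), as are the helper name `complexity_parameters` and the restriction to strictly positive $p_j$, which avoids the degenerate term $\log(1/0)$.

```python
import math

def complexity_parameters(p):
    """Return (S, T) over the finite prefix p = [p_1, ..., p_d], with 0 < p_j <= 1/2."""
    S = T = 0.0
    for j, pj in enumerate(p, start=1):
        if not 0.0 < pj <= 0.5:
            raise ValueError("this sketch assumes 0 < p_j <= 1/2")
        log_term = math.log(j + 1)          # natural logarithm, as in the definition
        S = max(S, pj * log_term)
        T = max(T, log_term / math.log(1.0 / pj))
    return S, T

# Hypothetical example sequence: p_j = 1/(j+1) for j = 1..1000.
p_prefix = [1.0 / (j + 1) for j in range(1, 1001)]
S, T = complexity_parameters(p_prefix)
print(S, T)
```

For this example sequence every ratio $\log(j+1)/\log(1/p_j)$ equals 1, so $T(p)=1$, while $p_j\log(j+1)$ is maximized at $j=2$, giving $S(p)=\log(3)/3\approx 0.366$.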

Cohen and Kontorovich (COLT 2023) proved that for all $n\ge e^3$,

$$\Delta_n(p)\le c\left(\sqrt{\frac{S(p)}{n}}+\frac{T(p)\log n}{n}\right)$$

for an absolute constant $c>0$.

Unsolved Problem

Is the factor $\log n$ necessary? Does there exist an absolute constant $C>0$ such that for all such sequences $p$ and all $n\ge e^3$,

$$\Delta_n(p)\le C\left(\sqrt{\frac{S(p)}{n}}+\frac{T(p)}{n}\right)?$$
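To see how much is at stake, the sketch below evaluates the right-hand sides of the proved bound and of the conjectured bound side by side. The absolute constants are unknown here, so both are set to the placeholder value 1, and the values of $S$, $T$, and $n$ are hypothetical numbers chosen so that the $T(p)$ term dominates; they are not derived from a specific sequence.

```python
import math

def known_bound(S, T, n, c=1.0):
    """Right-hand side of the proved bound; c=1.0 is a placeholder constant."""
    if n < math.exp(3):
        raise ValueError("the bound is stated for n >= e^3")
    return c * (math.sqrt(S / n) + T * math.log(n) / n)

def conjectured_bound(S, T, n, C=1.0):
    """Right-hand side of the conjectured bound; C=1.0 is a placeholder constant."""
    if n < math.exp(3):
        raise ValueError("the bound is stated for n >= e^3")
    return C * (math.sqrt(S / n) + T / n)

# Hypothetical illustrative values in a regime where the T(p) term dominates.
S, T, n = 1e-6, 1.0, 1000
ratio = known_bound(S, T, n) / conjectured_bound(S, T, n)
print(f"known / conjectured = {ratio:.3f}, log n = {math.log(n):.3f}")
```

With equal constants the ratio of the two expressions always lies between 1 and $\log n$, and it approaches $\log n$ as the $\sqrt{S(p)/n}$ term becomes negligible.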

§ Significance & Implications

This problem asks for the sharp (up to absolute constants) finite-sample behavior of the expected sup-norm estimation error when simultaneously estimating countably many Bernoulli means with nonuniform, decaying marginals. The best known general upper bound matches the known lower-bound scalings in $S(p)$ and $T(p)$ except for a multiplicative $\log n$ in the $T(p)/n$ term. Removing that factor would give an upper bound that matches both asymptotic lower bounds up to absolute constants, and it would sharpen sample-size guarantees in regimes where the $T(p)$ term dominates the error.

§ Known Partial Results

Cohen and Kontorovich (2023): (Characterization of consistency) $\Delta_n(p)\to 0$ as $n\to\infty$ if and only if $T(p)<\infty$.
Cohen and Kontorovich (2023): (Finite-sample upper bound) For all $n\ge e^3$, $\Delta_n(p)\le c\left(\sqrt{S(p)/n}+T(p)\log n/n\right)$ for an absolute constant $c>0$.
Cohen and Kontorovich (2023): (Asymptotic lower bounds) $\liminf_{n\to\infty}\sqrt{n}\,\Delta_n(p)\ge c\,\sqrt{S(p)}$ and $\liminf_{n\to\infty} n\,\Delta_n(p)\ge c\,T(p)$ for a universal constant $c>0$, showing that the dependence on $S(p)$ and $T(p)$ is asymptotically necessary up to absolute constants.
Cohen and Kontorovich (2023): (Statistical interpretation) The binomial formulation corresponds to the coordinate-wise empirical mean error for product Bernoulli measures under $\ell_\infty$: for i.i.d. samples $X_1,\dots,X_n\in\{0,1\}^d$ with coordinate means $p(1),\dots,p(d)$, one has $\mathbb{E}\,\|\hat p-p\|_\infty=\mathbb{E}\max_{j\in[d]}|\hat p(j)-p(j)|$, and after sorting the coordinates by decreasing $p(j)$ this reduces to the stated setting.
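The statistical interpretation in the last item can be sketched in code: the column sums of the sample matrix play the role of the binomial counts $Y_j$, and reordering the coordinates does not change the sup-norm error. The helper names and the small example below are illustrative assumptions, not taken from the cited work.

```python
import random

def sort_by_decreasing_mean(samples, p):
    """Reorder coordinates so that the means are nonincreasing."""
    order = sorted(range(len(p)), key=lambda j: p[j], reverse=True)
    return [[x[j] for j in order] for x in samples], [p[j] for j in order]

def sup_norm_error(samples, p):
    """Return (max_j |p_hat(j) - p(j)|, counts), where counts[j] plays the role of Y_j."""
    n = len(samples)
    counts = [sum(x[j] for x in samples) for j in range(len(p))]
    err = max(abs(counts[j] / n - p[j]) for j in range(len(p)))
    return err, counts

# Hypothetical example: d = 3 coordinates given in arbitrary order, n = 200 samples.
rng = random.Random(0)
p = [0.1, 0.5, 0.25]
samples = [[int(rng.random() < pj) for pj in p] for _ in range(200)]

sorted_samples, sorted_p = sort_by_decreasing_mean(samples, p)
err_sorted, counts = sup_norm_error(sorted_samples, sorted_p)
err_unsorted, _ = sup_norm_error(samples, p)
print(err_sorted, counts)
```

Because the maximum is invariant under permutations of the coordinates, `err_sorted` and `err_unsorted` agree; the sort only puts the means in the nonincreasing order the problem statement assumes.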

§ References

[1]

Open problem: log(n) factor in "Local Glivenko-Cantelli"

Doron Cohen, Aryeh Kontorovich (2023)

Conference on Learning Theory (COLT), PMLR 195

📍 Open-problem note in COLT proceedings.

[2]

Open problem: log(n) factor in "Local Glivenko-Cantelli" (PDF)

Doron Cohen, Aryeh Kontorovich (2023)

Conference on Learning Theory (COLT), PMLR 195

📍 Proceedings PDF.

§ Tags

colt-open-problem learning-theory