Unsolved

The Dependence of Sample Complexity Lower Bounds on Planning Horizon

Posed by Nan Jiang et al. (2018)

§ Problem Statement

Setup

Let $H\in\mathbb{N}$. Consider episodic reinforcement learning in an unknown, finite-horizon Markov decision process (MDP) $M=(\mathcal{S},\mathcal{A},H,\{P_t\}_{t=1}^H,\{r_t\}_{t=1}^H,\rho)$, where $\rho$ is an initial-state distribution over $\mathcal{S}$, each reward function satisfies $r_t(s,a)\in[0,1]$, and $P_t(\cdot\mid s,a)$ is the stage-$t$ transition kernel. A (possibly nonstationary) policy is $\pi=(\pi_1,\ldots,\pi_H)$, and its value is

$$V_M^\pi(\rho)=\mathbb{E}\Big[\sum_{t=1}^H r_t(S_t,A_t)\Big],$$

with $S_1\sim\rho$, $A_t\sim\pi_t(\cdot\mid S_t)$, and $S_{t+1}\sim P_t(\cdot\mid S_t,A_t)$. An algorithm interacts with $M$ for $n$ episodes (observing the resulting trajectories) and outputs a policy $\hat\pi$.
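The value $V_M^\pi(\rho)$ defined above can be estimated by Monte-Carlo rollouts. A minimal sketch, assuming a hypothetical tabular encoding of $P_t$, $r_t$, $\rho$, and a deterministic policy (the encoding and function are illustrative, not from the note):

```python
import random

def rollout_value(P, r, rho, policy, H, n_episodes, seed=0):
    """Monte-Carlo estimate of V_M^pi(rho) = E[sum_{t=1}^H r_t(S_t, A_t)].

    P[t][s][a]   -- list of next-state probabilities at stage t
    r[t][s][a]   -- reward in [0, 1]
    rho          -- initial-state distribution over states 0..S-1
    policy[t][s] -- action taken at stage t in state s (deterministic)
    """
    rng = random.Random(seed)
    states = list(range(len(rho)))
    total = 0.0
    for _ in range(n_episodes):
        s = rng.choices(states, weights=rho)[0]      # S_1 ~ rho
        for t in range(H):
            a = policy[t][s]                          # A_t = pi_t(S_t)
            total += r[t][s][a]
            s = rng.choices(states, weights=P[t][s][a])[0]  # S_{t+1} ~ P_t
    return total / n_episodes
```

On a fully deterministic instance (point-mass $\rho$, transitions, and policy) the estimate coincides with the exact value regardless of `n_episodes`.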

Fix $(\varepsilon,\delta)\in(0,1)\times(0,1)$. For a class of MDPs $\mathcal{M}$, define the episodic PAC sample complexity as the smallest $n$ for which there exists an algorithm such that for all $M\in\mathcal{M}$,

$$\Pr\big[V_M^{\pi^*}(\rho)-V_M^{\hat\pi}(\rho)\le \varepsilon\big]\ge 1-\delta,$$

where $\pi^*$ is an optimal policy for $M$.
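Even the simpler task of evaluating one fixed policy already shows one (possibly superficial) way $H$ enters: episodic returns lie in $[0,H]$, so a two-sided Hoeffding bound gives $n \ge \frac{H^2}{2\varepsilon^2}\ln\frac{2}{\delta}$ episodes for an $\varepsilon$-accurate value estimate. A back-of-envelope sketch of this calculation (my illustration, not from the note):

```python
import math

def hoeffding_episodes(H, eps, delta):
    """Episodes n so that the empirical mean of n i.i.d. episodic returns,
    each lying in [0, H], is within eps of the true value with probability
    at least 1 - delta (two-sided Hoeffding bound)."""
    return math.ceil((H ** 2 / (2 * eps ** 2)) * math.log(2 / delta))
```

The $H^2$ factor here comes purely from the scale of the returns; the open problem asks whether a genuine $H$-dependence persists once such normalization effects are controlled for.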

Unsolved Problem

Identify a suitable, natural family of MDP classes $\{\mathcal{M}_H\}_{H\ge 1}$ and prove an information-theoretic lower bound on PAC sample complexity with an unavoidable, nontrivial dependence on the planning horizon $H$ once the strong technical assumptions used in recent horizon-light PAC-RL upper bounds are removed. Concretely: can one construct $\mathcal{M}_H$ so that, after controlling other measures of problem size/complexity so they do not grow rapidly with $H$ (e.g., keeping $|\mathcal{S}|$, $|\mathcal{A}|$, or an appropriate function-approximation complexity measure fixed or only mildly growing), every PAC algorithm still requires at least $n\ge f(H,\varepsilon,\delta)$ episodes with $f$ increasing in $H$ (e.g., polynomially or faster), and such $H$-dependence cannot be eliminated without reinstating those strong assumptions?
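Compactly, the requested lower bound can be phrased as follows (my paraphrase of the paragraph above, not the note's wording): for some $f$ increasing in $H$,

```latex
n < f(H,\varepsilon,\delta)
\;\Longrightarrow\;
\sup_{M\in\mathcal{M}_H}
\Pr\big[\, V_M^{\pi^*}(\rho) - V_M^{\hat\pi}(\rho) > \varepsilon \,\big] > \delta ,
```

that is, no algorithm given fewer than $f(H,\varepsilon,\delta)$ episodes can meet the PAC guarantee simultaneously on every $M\in\mathcal{M}_H$.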


§ Significance & Implications

Long-horizon RL is widely viewed as difficult due to delayed consequences and multi-step uncertainty, yet several modern PAC-RL results exhibit only weak ("superficial") horizon dependence under additional technical assumptions. A rigorous lower bound that forces sample complexity to increase with $H$ for assumption-light MDP classes would clarify whether these horizon-light guarantees rely on restrictive modeling choices, and would delineate when long-horizon learning is statistically harder even before computational considerations.

§ Known Partial Results

  • Jiang et al. (2018): Jiang and Agarwal (COLT 2018) highlight a tension between the perceived difficulty of long planning horizons and recent PAC-RL upper bounds whose horizon dependence appears only superficial, and they pose the question of whether stronger $H$-dependence emerges when key assumptions are removed.

  • Jiang et al. (2018): The note emphasizes that a meaningful horizon-dependent lower bound should not obtain $H$-dependence merely by encoding longer horizons via larger state spaces; instead, the construction should isolate difficulty attributable to long-horizon interaction itself.

  • Jiang et al. (2018): The note sketches desired characteristics for hard instances, informally pointing to compounding multi-step uncertainty and delayed effects as mechanisms by which distinguishing near-optimal from suboptimal behavior could require more data as $H$ grows.
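The delayed-effect mechanism in the last bullet can be illustrated with a toy "combination lock" (a standard intuition-building device, not the family the note asks for): reward is observed only when all $H$ binary actions match a hidden code, so uniform exploration sees its first informative signal after roughly $2^H$ episodes on average.

```python
import random

def episodes_until_signal(H, seed=0, max_episodes=10**6):
    """Toy 'combination lock': a hidden binary code of length H; an episode
    yields reward 1 only if all H uniformly random actions match the code.
    Returns the number of episodes until the first nonzero reward."""
    rng = random.Random(seed)
    code = [rng.randrange(2) for _ in range(H)]
    for episode in range(1, max_episodes + 1):
        if all(rng.randrange(2) == c for c in code):
            return episode
    return max_episodes

def mean_episodes(H, trials=200):
    """Average over independent hidden codes; grows roughly like 2**H."""
    return sum(episodes_until_signal(H, seed=i) for i in range(trials)) / trials
```

Note the caveat from the second bullet still applies: this toy gains its hardness from an action space (or unrolled state space) that scales with $H$, which is exactly the kind of dependence the open problem asks to rule out.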

§ References

[1]

Open Problem: The Dependence of Sample Complexity Lower Bounds on Planning Horizon

Nan Jiang, Alekh Agarwal (2018)

Conference on Learning Theory (COLT), PMLR 75

📍 Open-problem note in COLT proceedings.

