The Dependence of Sample Complexity Lower Bounds on Planning Horizon
Posed by Nan Jiang and Alekh Agarwal (2018)
§ Problem Statement
Setup
Let $H \in \mathbb{N}$ be the planning horizon. Consider episodic reinforcement learning in an unknown, finite-horizon Markov decision process (MDP) $M = (S, A, H, P, R, \mu)$, where $\mu$ is an initial-state distribution over $S$, each reward function $R_h : S \times A \to [0,1]$ satisfies $\sum_{h=1}^{H} R_h(s_h, a_h) \le 1$ along every trajectory, and $P_h$ is the stage-$h$ transition kernel. A (possibly nonstationary) policy is $\pi = (\pi_1, \ldots, \pi_H)$ with $\pi_h : S \to A$, and its value is
$$V^{\pi}(M) = \mathbb{E}\left[\sum_{h=1}^{H} R_h(s_h, a_h)\right]$$
with $s_1 \sim \mu$, $a_h = \pi_h(s_h)$, and $s_{h+1} \sim P_h(\cdot \mid s_h, a_h)$. An algorithm interacts with $M$ for $n$ episodes (observing the resulting trajectories) and outputs a policy $\hat{\pi}$.
Fix $\epsilon, \delta \in (0,1)$. For a class of MDPs $\mathcal{M}$, define the episodic PAC sample complexity $n(\mathcal{M}, \epsilon, \delta)$ as the smallest $n$ for which there exists an algorithm such that for all $M \in \mathcal{M}$,
$$\Pr\left[V^{\pi^*}(M) - V^{\hat{\pi}}(M) \le \epsilon\right] \ge 1 - \delta,$$
where $\pi^*$ is an optimal policy for $M$.
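The episodic interaction protocol above can be sketched in code. The toy MDP below is purely hypothetical (two states, two actions, deterministic stage dynamics, $H = 5$; none of it comes from the note); it shows a policy's value being estimated from interaction episodes alone, exactly the access model a PAC learner has:

```python
# Toy episodic MDP, for illustration only: states {0, 1}, actions {0, 1},
# horizon H = 5, deterministic dynamics (a real P_h may be stochastic).
H = 5

def step(h, s, a):
    """Hypothetical stage-h transition kernel P_h and reward R_h."""
    s_next = (s + a) % 2               # action 1 toggles the state
    r = (1.0 / H) if s == 1 else 0.0   # reward is earned only in state 1
    return s_next, r

def run_episode(policy):
    """One episode: s_1 = 0 (a point-mass mu), then a_h = pi_h(s_h)."""
    s, total = 0, 0.0
    for h in range(1, H + 1):
        a = policy(h, s)
        s, r = step(h, s, a)
        total += r
    return total

def estimate_value(policy, n_episodes=1000):
    """Monte-Carlo estimate of V^pi from n interaction episodes."""
    return sum(run_episode(policy) for _ in range(n_episodes)) / n_episodes
```

For instance, the always-toggle policy `lambda h, s: 1` earns value 0.4, the never-toggle policy earns 0, and a policy that toggles into state 1 and stays earns 0.8; a PAC algorithm must identify near-optimal behavior from such episodes without knowing `step`.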
Unsolved Problem
Identify a suitable, natural family of MDP classes $\{\mathcal{M}_H\}_{H \ge 1}$ and prove an information-theoretic lower bound on PAC sample complexity that has an unavoidable, nontrivial dependence on the planning horizon $H$ once the strong technical assumptions used in recent horizon-light PAC-RL upper bounds are removed. Concretely, can one construct $\{\mathcal{M}_H\}$ so that, after controlling other measures of problem size/complexity so they do not grow rapidly with $H$ (e.g., keeping $|S|$ and $|A|$, or an appropriate function-approximation complexity measure, fixed or only mildly growing), every PAC algorithm still requires at least $f(H)$ episodes with $f$ increasing in $H$ (e.g., polynomially or faster), and such $H$-dependence cannot be eliminated without reinstating those strong assumptions?
§ Discussion
§ Significance & Implications
Long-horizon RL is widely viewed as difficult due to delayed consequences and multi-step uncertainty, yet several modern PAC-RL results exhibit only weak ("superficial") horizon dependence under additional technical assumptions. A rigorous lower bound that forces sample complexity to increase with $H$ for assumption-light MDP classes would clarify whether these horizon-light guarantees rely on restrictive modeling choices, and would delineate when long-horizon learning is statistically harder even before computational considerations.
§ Known Partial Results
Jiang and Agarwal (2018): Jiang and Agarwal (COLT 2018) highlight a tension between the perceived difficulty of long planning horizons and recent PAC-RL upper bounds whose horizon dependence appears only superficial, and they pose the question of whether stronger $H$-dependence emerges when key assumptions are removed.
Jiang and Agarwal (2018): The note emphasizes that a meaningful horizon-dependent lower bound should not obtain $H$-dependence merely by encoding longer horizons via larger state spaces; instead, the construction should isolate difficulty attributable to long-horizon interaction itself.
Jiang and Agarwal (2018): The note sketches desired characteristics for hard instances, informally pointing to compounding multi-step uncertainty and delayed effects as mechanisms by which distinguishing near-optimal from suboptimal behavior could require more data as $H$ grows.
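One classical mechanism producing such delayed effects is a "combination lock," a standard hard-instance style from the lower-bound literature (shown here only as an illustration; it is not a construction from the note): reward appears at the final stage only if a secret action was played at every one of the $H$ preceding stages, so undirected exploration sees the reward with probability $2^{-H}$ per episode.

```python
import random

# Hypothetical "combination lock" MDP: the agent stays "alive" only by
# playing the secret action code[h] at every stage h; one wrong action
# is unrecoverable, and reward 1 arrives only at the final stage if
# still alive -- a maximally delayed consequence of early decisions.
def lock_episode(code, policy):
    alive = True
    for h, correct in enumerate(code):
        if policy(h) != correct:
            alive = False  # a single mistake forfeits the episode
    return 1.0 if alive else 0.0

def uniform_success_prob(H, trials=200000, seed=1):
    """Empirical chance that uniform random exploration sees the reward."""
    rng = random.Random(seed)
    code = [rng.randrange(2) for _ in range(H)]  # secret action sequence
    hits = sum(
        lock_episode(code, lambda h: rng.randrange(2)) for _ in range(trials)
    )
    return hits / trials  # concentrates near (1/2)**H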
§ References
Open Problem: The Dependence of Sample Complexity Lower Bounds on Planning Horizon
Nan Jiang, Alekh Agarwal (2018)
Conference on Learning Theory (COLT), PMLR 75
📍 Open-problem note in COLT proceedings.