Order Optimal Regret Bounds for Kernel-Based Reinforcement Learning
Posed by Sattar Vakili (2024)
§ Problem Statement
Setup
Consider an episodic finite-horizon Markov decision process (MDP) with horizon $H$, (possibly infinite) state space $\mathcal{S}$, finite action space $\mathcal{A}$, and unknown time-dependent reward and transition operators $r_h$ and $P_h$ for $h \in [H]$. Let $\mathcal{Z} = \mathcal{S} \times \mathcal{A}$ and let $k : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ be a positive semidefinite kernel with reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$ and norm $\|\cdot\|_{\mathcal{H}_k}$. Assume the following kernel-based function approximation structure:
- (Rewards in RKHS) For each $h \in [H]$, $r_h \in \mathcal{H}_k$ and $\|r_h\|_{\mathcal{H}_k} \le B_r$ for some constant $B_r < \infty$.
- (Kernelized transitions via test functions) Fix a normed function class $\mathcal{F}$ of bounded real-valued functions on $\mathcal{S}$ with norm $\|\cdot\|_{\mathcal{F}}$. For each $h \in [H]$ and each $f \in \mathcal{F}$, define
  $$[P_h f](s, a) = \int_{\mathcal{S}} f(s')\, P_h(ds' \mid s, a), \qquad (s, a) \in \mathcal{Z}.$$
  Assume $[P_h f] \in \mathcal{H}_k$ for all $f \in \mathcal{F}$, and that there exists $B_P < \infty$ such that for all $h \in [H]$ and all $f \in \mathcal{F}$ with $\|f\|_{\mathcal{F}} \le 1$, one has $\|[P_h f]\|_{\mathcal{H}_k} \le B_P$.
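These two assumptions are what allow RKHS machinery to survive the Bellman recursion. As a sketch (using the constants $B_r$ and $B_P$ introduced above): if the optimal value function at step $h+1$ lies in $\mathcal{F}$, then the optimal action-value function at step $h$ is itself an RKHS element of controlled norm,
$$Q_h^*(s, a) = r_h(s, a) + [P_h V_{h+1}^*](s, a) \quad \Longrightarrow \quad \|Q_h^*\|_{\mathcal{H}_k} \le B_r + B_P\, \|V_{h+1}^*\|_{\mathcal{F}} \quad \text{whenever } V_{h+1}^* \in \mathcal{F},$$
so kernel regression and kernel-based confidence sets can be applied to backed-up value estimates at every step of the recursion.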
An agent interacts with the MDP for $N$ episodes. In episode $n$, it chooses a (possibly history-dependent) policy $\pi^n$ and observes a trajectory $(s_1^n, a_1^n, \dots, s_H^n, a_H^n)$. Define the expected regret after $N$ episodes by
$$\mathcal{R}(N) = \mathbb{E}\left[\sum_{n=1}^{N} \left( V_1^*(s_1^n) - V_1^{\pi^n}(s_1^n) \right)\right],$$
where $V_1^*$ is the optimal value function and $V_1^{\pi^n}$ is the value function of $\pi^n$ under the true MDP.
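Here $V_h^{\pi}$ and $V_h^*$ denote the standard episodic value functions,
$$V_h^{\pi}(s) = \mathbb{E}^{\pi}\Big[\sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}) \,\Big|\, s_h = s\Big], \qquad V_h^*(s) = \sup_{\pi} V_h^{\pi}(s).$$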
Unsolved Problem
Characterize the minimax expected regret
$$\mathcal{R}^*(N) = \inf_{\text{algorithms}} \;\; \sup_{\text{MDPs satisfying the assumptions above}} \; \mathcal{R}(N)$$
as a function of $N$, $H$, $B_r$, $B_P$, and a kernel complexity measure $\Gamma_k(N)$ associated with $k$ over $\mathcal{Z}$ (e.g., an effective dimension or an information-gain quantity). In particular, provide matching upper and lower bounds up to at most polylogarithmic factors, and determine whether the best known kernel-based RL upper bounds have the correct dependence on $N$, $H$, and the kernel complexity term (or improve the upper bounds and/or the lower bounds to close any gap).
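One common instantiation of the complexity measure in the kernel-bandit literature (not fixed by the note itself, so the choice here should be read as an assumption) is the maximal information gain, $\Gamma_k(N) = \max_{z_1, \dots, z_N \in \mathcal{Z}} \tfrac{1}{2} \log\det(I + \sigma^{-2} K_N)$, where $K_N$ is the kernel matrix of the queried points and $\sigma^2$ a noise variance. A minimal Python sketch, assuming an RBF kernel on a toy two-dimensional state-action embedding, estimates this quantity on random designs (a lower bound on the maximum):

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0):
    """Kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 * lengthscale^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * lengthscale ** 2))

def information_gain(K, noise_var=1.0):
    """0.5 * logdet(I + K / noise_var): the information gain of one design."""
    _, logdet = np.linalg.slogdet(np.eye(K.shape[0]) + K / noise_var)
    return 0.5 * logdet

rng = np.random.default_rng(0)
for n in (100, 200, 400, 800):
    Z = rng.uniform(-1.0, 1.0, size=(n, 2))  # toy state-action points in [-1, 1]^2
    print(n, round(information_gain(rbf_kernel(Z)), 2))
```

For very smooth kernels such as the RBF, $\Gamma_k(N)$ grows only polylogarithmically in $N$, while for Matérn kernels it grows polynomially in $N$; this is precisely why the kernel family, and not just $N$ and $H$, must appear in any sharp minimax rate.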
§ Discussion
§ Significance & Implications
This problem asks for sharp sample-efficiency limits for exploration and control when rewards and transition-related quantities admit RKHS structure on state-action pairs. A resolved minimax rate would identify which kernel complexity measure (and which dependence on the horizon $H$) fundamentally governs achievable regret in multi-step decision making, beyond what is known for prediction or bandits, where uncertainty does not propagate through a Bellman recursion. Such a characterization would also provide a clean nonparametric benchmark that strictly extends linear-structure MDP theory while remaining analyzable with RKHS tools.
§ Known Partial Results
Vakili (2024): For tabular episodic MDPs and for linear-structure MDP models, there are well-established minimax-style regret benchmarks (including explicit dependence on horizon and model dimension), which serve as reference points for what order-optimality should mean in simpler classes.
Vakili (2024): For kernelized nonlinear prediction and contextual bandits, regret bounds commonly scale as $\tilde{O}\big(\sqrt{\Gamma_k(N)\, N}\big)$ for a kernel complexity term $\Gamma_k(N)$ (often phrased via effective dimension or information gain), motivating the conjecture that an analogous complexity term should control kernel-based RL rates as well.
Vakili (2024): Existing analyses for kernel-based RL under RKHS/kernelized-MDP assumptions achieve sublinear regret, but the best available bounds are not known to match lower bounds in their dependence on the horizon $H$ and/or the kernel complexity term.
Vakili (2024): A central difficulty relative to bandits is error propagation through the Bellman recursion: transition uncertainty affects multi-step value estimates, so confidence sets must control both the function approximation error and its compounding across time steps (see the decomposition sketched after this list).
Vakili (2024): The COLT 2024 open-problem note frames closing the remaining upper/lower-bound gap (including the correct role of kernel complexity and horizon) as unresolved.
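To make the compounding concrete, the following is a standard optimism-based decomposition (a sketch under typical assumptions, not a claim about the note's specific analysis). If $Q_h^n$ are optimistic action-value estimates, $\pi^n$ acts greedily with respect to them, and $V_h^n(s) = \max_a Q_h^n(s, a)$, then the per-episode regret telescopes through the horizon:
$$\mathbb{E}\big[V_1^*(s_1^n) - V_1^{\pi^n}(s_1^n)\big] \;\le\; \sum_{h=1}^{H} \mathbb{E}^{\pi^n}\Big[\big(Q_h^n - r_h - P_h V_{h+1}^n\big)(s_h, a_h)\Big].$$
Each summand is a kernel-regression error evaluated at the visited state-action pair, and all $H$ of them must be controlled simultaneously; this is where loose confidence widths compound into extra factors of $H$ and of the kernel complexity term.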
§ References
Sattar Vakili (2024). Open Problem: Order Optimal Regret Bounds for Kernel-Based Reinforcement Learning. Conference on Learning Theory (COLT), PMLR vol. 247. Open-problem note in the COLT proceedings.