Unsolved

Order Optimal Regret Bounds for Kernel-Based Reinforcement Learning

Posed by Sattar Vakili (2024)

§ Problem Statement

Setup

Consider an episodic finite-horizon Markov decision process (MDP) with horizon $H$, (possibly infinite) state space $\mathcal S$, finite action space $\mathcal A$, and unknown time-dependent reward and transition operators $r_h:\mathcal S\times\mathcal A\to[0,1]$ and $P_h(\cdot\mid s,a)$ for $h\in\{1,\dots,H\}$. Let $\mathcal X=\mathcal S\times\mathcal A$ and let $k:\mathcal X\times\mathcal X\to\mathbb R$ be a positive semidefinite kernel with reproducing kernel Hilbert space (RKHS) $\mathcal H_k$ and norm $\|\cdot\|_{\mathcal H_k}$. Assume the following kernel-based function approximation structure:

  1. (Rewards in RKHS) For each $h$, $r_h\in\mathcal H_k$ and $\|r_h\|_{\mathcal H_k}\le B_r$.

  2. (Kernelized transitions via test functions) Fix a normed function class $\mathcal G$ of bounded real-valued functions on $\mathcal S$ with norm $\|\cdot\|_{\mathcal G}$. For each $h$ and each $f\in\mathcal G$, define

$$(P_h f)(s,a) := \mathbb E\big[f(S_{h+1})\mid S_h=s,\ A_h=a\big].$$

Assume $P_h f\in\mathcal H_k$ for all $f\in\mathcal G$, and that there exists $B_p$ such that for all $h$ and all $f$ with $\|f\|_{\mathcal G}\le 1$, one has $\|P_h f\|_{\mathcal H_k}\le B_p$.
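As an illustration of the structure in item 2 (not part of the original note), the quantity $(P_h f)(s,a)$ can be estimated from one-step transition data by kernel ridge regression, which is the workhorse behind most kernel-based RL confidence sets. A minimal sketch, assuming a Gaussian kernel and synthetic data; the names `rbf_kernel` and `krr_transition_estimate` are illustrative, not from the note:

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and Y."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * lengthscale ** 2))

def krr_transition_estimate(X_train, y_train, X_query, lam=0.1):
    """Kernel ridge regression estimate of (P_h f)(s, a).

    X_train : (n, d) observed state-action pairs (s, a) at step h
    y_train : (n,)  observed values f(S_{h+1}) following each pair
    X_query : (m, d) state-action pairs at which to predict
    lam     : ridge regularizer
    """
    K = rbf_kernel(X_train, X_train)
    alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)
    return rbf_kernel(X_query, X_train) @ alpha

# Toy data: 1-D state and action, smooth conditional mean, small noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 2))                       # (s, a) pairs
y = np.sin(X[:, 0] + X[:, 1]) + 0.01 * rng.standard_normal(50)
pred = krr_transition_estimate(X, y, X[:5])
print(pred.shape)  # one estimate of (P_h f)(s, a) per query pair
```

The standard RKHS argument then bounds $|(P_h f)(s,a) - \widehat{(P_h f)}(s,a)|$ uniformly via $\|P_h f\|_{\mathcal H_k}\le B_p$, which is where the confidence widths in kernel-based RL algorithms come from.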

An agent interacts with the MDP for $T$ episodes. In episode $t$, it chooses a (possibly history-dependent) policy $\pi_t$ and observes a trajectory $(S_{t,1},A_{t,1},R_{t,1},\dots,S_{t,H},A_{t,H},R_{t,H},S_{t,H+1})$. Define the regret after $T$ episodes by

$$\mathrm{Regret}(T) := \sum_{t=1}^T \Big( V_1^*(S_{t,1}) - V_1^{\pi_t}(S_{t,1}) \Big),$$

where $V_1^*$ is the optimal value function and $V_1^{\pi_t}$ is the value function of $\pi_t$ under the true MDP.
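For reference, the value functions appearing in the regret are given by the standard Bellman recursion (writing $\pi=(\pi_1,\dots,\pi_H)$):

$$V_{H+1}^{\pi}(s) := 0, \qquad Q_h^{\pi}(s,a) := r_h(s,a) + (P_h V_{h+1}^{\pi})(s,a), \qquad V_h^{\pi}(s) := Q_h^{\pi}\big(s, \pi_h(s)\big),$$

with the optimal value functions defined analogously via $V_h^*(s) := \max_{a\in\mathcal A} Q_h^*(s,a)$. Note that assumption 2 controls exactly the terms $P_h V_{h+1}$ appearing in this recursion, provided the value functions lie in (a rescaling of) the test class $\mathcal G$.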

Unsolved Problem

Characterize the minimax expected regret

$$\inf_{\text{algorithms}}\ \sup_{\text{MDPs satisfying (1)--(2)}}\ \mathbb E\big[\mathrm{Regret}(T)\big]$$

as a function of $T$, $H$, $B_r$, $B_p$, and a kernel complexity measure associated with $k$ over $\mathcal X$ (e.g., an effective dimension or an information-gain quantity). In particular, provide matching upper and lower bounds up to at most polylogarithmic factors, and determine whether the best known kernel-based RL upper bounds have the correct dependence on $T$, $H$, and the kernel complexity term, or improve the upper and/or lower bounds to close any gap.


§ Significance & Implications

This problem asks for sharp sample-efficiency limits for exploration and control when rewards and transition-related quantities admit RKHS structure on state-action pairs. A resolved minimax rate would identify which kernel complexity measure (and which dependence on the horizon $H$) fundamentally governs achievable regret in multi-step decision making, beyond what is known for prediction or bandits, where uncertainty does not propagate through the Bellman recursion. Such a characterization would also provide a clean nonparametric benchmark that strictly extends linear-structure MDP theory while remaining analyzable with RKHS tools.

§ Known Partial Results

  • Vakili (2024): For tabular episodic MDPs and for linear-structure MDP models, there are well-established minimax-style regret benchmarks (including explicit dependence on horizon and model dimension), which serve as reference points for what order-optimality should mean in simpler classes.

  • Vakili (2024): For kernelized nonlinear prediction and contextual bandits, regret bounds commonly scale as $\tilde O(\sqrt{T\,\Gamma_T})$ for a kernel complexity term $\Gamma_T$ (often phrased via effective dimension or information gain), motivating the conjecture that an analogous complexity term should control kernel-based RL rates as well.

  • Vakili (2024): Existing analyses for kernel-based RL under RKHS/kernelized-MDP assumptions achieve sublinear regret, but the best available bounds are not known to match lower bounds in their dependence on $H$ and/or the kernel complexity term.

  • Vakili (2024): A central difficulty relative to bandits is error propagation through Bellman recursion: transition uncertainty affects multi-step value estimates, so confidence sets must control both function approximation error and its compounding across time steps.

  • Vakili (2024): The COLT 2024 open-problem note frames closing the remaining upper/lower-bound gap (including the correct role of kernel complexity and horizon) as unresolved.

§ References

[1]

Open Problem: Order Optimal Regret Bounds for Kernel-Based Reinforcement Learning

Sattar Vakili (2024)

Conference on Learning Theory (COLT), PMLR 247

Open-problem note in the COLT proceedings.

