Unsolved

Risk of Ruin in Multiarmed Bandits

Posed by Filipo S. Perotto et al. (2019)

§ Problem Statement

Setup

Define a stochastic $K$-armed bandit with arms $i \in \{1, \dots, K\}$. Pulling arm $i$ at round $t$ yields a reward $X_{i,t} \sim \nu_i$, where $\nu_i$ is an unknown distribution on $\mathbb{R}$; rewards may be positive or negative. A (non-anticipating) policy selects an arm $A_t$ each round based on past actions and observed rewards. The agent maintains a budget process $(B_t)_{t=0}^{T}$ with initial budget $B_0 = b_0 > 0$ and update

\[ B_t = B_{t-1} + X_{A_t,t} \quad (t \ge 1). \]

For a finite horizon $T$, define the (path-dependent) ruin event

\[ \mathsf{R}_T := \{\exists\, t \in \{1, \dots, T\} : B_t < 0\}, \]

and denote its probability under the policy and instance $(\nu_1, \dots, \nu_K)$ by $\mathbb{P}(\mathsf{R}_T)$. A survival multiarmed bandit (S-MAB) problem asks to learn an exploration/exploitation strategy while controlling ruin risk: the objective is inherently multi-criteria, trading off reward (e.g., expected cumulative reward $\mathbb{E}[\sum_{t=1}^{T} X_{A_t,t}]$) against safety as quantified by $\mathbb{P}(\mathsf{R}_T)$.
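The budget process and the hitting event above can be simulated directly. The following Monte Carlo sketch estimates $\mathbb{P}(\mathsf{R}_T)$ for fixed single-arm policies; the two-armed Gaussian instance and all parameter values are illustrative assumptions, not part of the original note:

```python
import random

def simulate_ruin(policy, arms, b0, T, n_runs=2000, seed=0):
    """Monte Carlo estimate of P(R_T): the probability that the budget
    process B_t = B_{t-1} + X_{A_t,t} ever drops below 0 within horizon T."""
    rng = random.Random(seed)
    ruins = 0
    for _ in range(n_runs):
        budget = b0
        for t in range(T):
            i = policy(t)            # the policy picks arm A_t
            budget += arms[i](rng)   # draw reward X_{i,t} ~ nu_i
            if budget < 0:           # the hitting event R_T
                ruins += 1
                break
    return ruins / n_runs

# Hypothetical instance: a low-variance "safe" arm vs. a high-variance "risky" arm.
arms = [
    lambda rng: rng.gauss(0.05, 0.1),  # safe: small positive drift
    lambda rng: rng.gauss(0.10, 2.0),  # risky: larger drift, much larger variance
]
p_safe = simulate_ruin(lambda t: 0, arms, b0=5.0, T=500)
p_risky = simulate_ruin(lambda t: 1, arms, b0=5.0, T=500)
```

Even though the risky arm has the higher mean, its ruin probability from a small initial budget is far larger under these parameters, which is exactly the trade-off S-MAB must formalize.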

Unsolved Problem

(a) Regret definition. Provide a formal regret notion for S-MAB that captures the reward–survival trade-off, including (i) a comparator class (e.g., policies that satisfy a specified ruin-risk target) and (ii) a loss/objective (e.g., constrained reward shortfall, Lagrangian trade-off, or Pareto regret) that is compatible with the path-dependent event $\mathsf{R}_T$.

(b) Reductions. Determine whether, and under what conditions, an S-MAB can be reduced to a risk-averse MAB (RA-MAB) or a budgeted MAB (B-MAB) formulation so that existing theoretical guarantees transfer, without losing the essential feature that safety is a hitting-probability constraint on $(B_t)$.

(c) Optimal strategy and guarantees. Characterize strategies that are optimal for the chosen regret/trade-off criterion in (a), and establish instance-dependent and/or worst-case performance guarantees that jointly address learning (exploration) and control of $\mathbb{P}(\mathsf{R}_T)$.
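One way to make the loss in (a) concrete is a Lagrangian scalarization: expected cumulative reward penalized by the ruin probability. A simulation sketch, again under a hypothetical Gaussian instance; the penalty weight `lam` and the fixed policies are illustrative assumptions, not part of the original note:

```python
import random

def lagrangian_value(policy, arms, b0, T, lam, n_runs=2000, seed=0):
    """Estimate E[sum_t X_{A_t,t}] - lam * P(R_T) by simulation.
    Larger lam encodes stronger aversion to the ruin event R_T."""
    rng = random.Random(seed)
    total_reward, ruins = 0.0, 0
    for _ in range(n_runs):
        budget, ruined = b0, False
        for t in range(T):
            x = arms[policy(t)](rng)
            total_reward += x
            budget += x
            if budget < 0:
                ruined = True  # ruin is a hitting event: once B_t < 0, R_T holds
        ruins += ruined
    return total_reward / n_runs - lam * (ruins / n_runs)

# Hypothetical safe/risky Gaussian instance.
arms = [
    lambda rng: rng.gauss(0.05, 0.1),
    lambda rng: rng.gauss(0.10, 2.0),
]
v_safe = lagrangian_value(lambda t: 0, arms, b0=5.0, T=200, lam=50.0)
v_risky = lagrangian_value(lambda t: 1, arms, b0=5.0, T=200, lam=50.0)
```

A constrained formulation (maximize reward subject to $\mathbb{P}(\mathsf{R}_T) \le \delta$) or a Pareto-front comparison could be sketched the same way; the Lagrangian form is just the simplest scalarization candidate.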


§ Significance & Implications

S-MAB makes safety a first-class, path-dependent criterion: a single unfavorable sequence of observed rewards can trigger irreversible failure via the hitting event $\mathsf{R}_T$. This differs from objectives based only on expectations or terminal/budget constraints because controlling $\mathbb{P}(\mathsf{R}_T)$ depends on the entire trajectory of $(B_t)$ and interacts directly with exploration. A precise regret/trade-off formalism and matching algorithms would clarify what guarantees are achievable (and when) for sequential learning problems where exploration can itself create catastrophic risk.

§ Known Partial Results

  • Perotto et al. (2019): introduce and formalize survival multiarmed bandits (S-MAB), where an evolving budget with both positive and negative rewards is coupled to an explicit ruin event defined by hitting a negative budget.

  • Perotto et al. (2019): The note motivates S-MAB as distinct from (and potentially related to) budgeted MAB and risk-averse MAB formulations because the safety objective is the probability of a path-dependent ruin event.

  • Perotto et al. (2019): The source explicitly identifies three open directions: (a) defining an appropriate regret notion for the multiobjective reward/survival setting, (b) establishing reductions to RA-MAB or B-MAB that would transfer known guarantees, and (c) developing strategies and analytical tools tailored to controlling $\mathbb{P}(\mathsf{R}_T)$ while learning.
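To make direction (c) tangible, here is a minimal budget-aware heuristic; this is purely a sketch, not a method from the note. It plays a UCB1-style exploratory choice while the budget is comfortable and switches to a variance-penalized (conservative) arm choice when the budget falls below a threshold. All names and parameter values are illustrative assumptions:

```python
import math
import random

def survival_ucb(arms, b0, T, threshold, rng):
    """Illustrative S-MAB heuristic: UCB1 exploration while B_t is comfortable;
    a conservative, variance-penalized choice once B_t < threshold.
    Returns the final budget and the budget trajectory."""
    K = len(arms)
    counts, sums, sq_sums = [0] * K, [0.0] * K, [0.0] * K
    budget, history = b0, []
    for t in range(T):
        if t < K:
            i = t  # initialization: pull each arm once
        elif budget < threshold:
            # conservative mode: empirical mean minus one empirical std
            def low_risk_score(j):
                m = sums[j] / counts[j]
                var = max(sq_sums[j] / counts[j] - m * m, 0.0)
                return m - math.sqrt(var)
            i = max(range(K), key=low_risk_score)
        else:
            # exploration mode: standard UCB1 index
            i = max(range(K), key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2.0 * math.log(t + 1) / counts[j]))
        x = arms[i](rng)
        counts[i] += 1
        sums[i] += x
        sq_sums[i] += x * x
        budget += x
        history.append(budget)
        if budget < 0:
            break  # ruin: the hitting event R_T has occurred
    return budget, history

# Hypothetical safe/risky Gaussian instance, as an illustration only.
arms = [
    lambda rng: rng.gauss(0.05, 0.1),
    lambda rng: rng.gauss(0.10, 2.0),
]
final_budget, trajectory = survival_ucb(arms, b0=5.0, T=500,
                                        threshold=2.0, rng=random.Random(0))
```

Whether any such switching rule can be optimal for a formal S-MAB regret, and with what guarantees, is precisely what the open directions above ask.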

§ References

[1]

Open Problem: Risk of Ruin in Multiarmed Bandits

Filipo S. Perotto, Mathieu Bourgais, Bruno C. Silva, Laurent Vercouter (2019)

Conference on Learning Theory (COLT), PMLR 99

📍 Open-problem note in COLT proceedings.

