Risk of Ruin in Multiarmed Bandits
Posed by Filipo S. Perotto et al. (2019)
§ Problem Statement
Setup
Define a stochastic $K$-armed bandit with arms $a \in \{1, \dots, K\}$. Pulling arm $a_t$ at round $t$ yields a reward $r_t \sim \nu_{a_t}$, where $\nu_a$ is an unknown distribution on $\mathbb{R}$; rewards may be positive or negative. A (non-anticipating) policy $\pi$ selects an arm each round based on past actions and observed rewards. The agent maintains a budget process $(B_t)_{t \ge 0}$ with initial budget $B_0 = b_0 > 0$ and update $B_t = B_{t-1} + r_t$.
For a finite horizon $T$, define the (path-dependent) ruin event $\mathcal{R}_T := \{\exists\, t \le T : B_t < 0\}$,
and denote its probability under the policy $\pi$ and instance $\nu = (\nu_1, \dots, \nu_K)$ by $P_{\text{ruin}}(\pi, \nu) := \mathbb{P}[\mathcal{R}_T]$. A survival multiarmed bandit (S-MAB) problem asks the agent to learn an exploration/exploitation strategy while controlling ruin risk: the objective is inherently multi-criteria, trading off reward (e.g., expected cumulative reward $\mathbb{E}[\sum_{t=1}^T r_t]$) against safety as quantified by $P_{\text{ruin}}(\pi, \nu)$.
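The dynamics above are easy to simulate. The sketch below is a minimal Monte Carlo estimate of the ruin probability for a fixed, non-learning policy on a hypothetical instance (two arms paying $\pm 1$ with Bernoulli success probabilities `means[a]` — an illustrative choice, not the paper's specification):

```python
import random

def simulate_smab(means, b0, horizon, policy, rng):
    """One S-MAB episode; returns (cumulative reward, ruined flag).

    Illustrative instance: pulling arm a pays +1 with probability
    means[a] and -1 otherwise (an assumption made for this sketch).
    """
    budget = b0
    total = 0
    for t in range(horizon):
        a = policy(t)
        r = 1 if rng.random() < means[a] else -1
        budget += r
        total += r
        if budget < 0:  # ruin event: the budget goes negative
            return total, True
    return total, False

def estimate_ruin_probability(means, b0, horizon, policy, n_runs=20000, seed=0):
    """Monte Carlo estimate of P_ruin for a fixed (non-adaptive) policy."""
    rng = random.Random(seed)
    ruined_count = 0
    for _ in range(n_runs):
        _, ruined = simulate_smab(means, b0, horizon, policy, rng)
        ruined_count += ruined
    return ruined_count / n_runs

# Zero-drift instance: alternating between a favorable and an
# unfavorable arm gives the budget no expected growth.
means = [0.6, 0.4]

def alternate(t):
    return t % 2

p_small = estimate_ruin_probability(means, b0=2, horizon=200, policy=alternate)
p_large = estimate_ruin_probability(means, b0=20, horizon=200, policy=alternate)
```

With zero drift, ruin is driven entirely by the fluctuations of the budget path, so the estimated ruin probability drops sharply as the initial budget grows — the path dependence that distinguishes S-MAB from expectation-only objectives.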
Unsolved Problem
(a) Regret definition. Provide a formal regret notion for S-MAB that captures the reward–survival trade-off, including (i) a comparator class (e.g., policies that satisfy a specified ruin-risk target) and (ii) a loss/objective (e.g., constrained reward shortfall, Lagrangian trade-off, or Pareto regret) that is compatible with the path-dependent event $\mathcal{R}_T$. (b) Reductions. Determine whether, and under what conditions, an S-MAB can be reduced to a risk-averse MAB (RA-MAB) or a budgeted MAB (B-MAB) formulation so that existing theoretical guarantees transfer, without losing the essential feature that safety is a hitting-probability constraint on the budget process $(B_t)$. (c) Optimal strategy and guarantees. Characterize strategies that are optimal for the chosen regret/trade-off criterion in (a), and establish instance-dependent and/or worst-case performance guarantees that jointly address learning (exploration) and control of $P_{\text{ruin}}$.
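For concreteness, one candidate way to instantiate part (a) — offered purely as an illustrative sketch, with $\Pi_\delta$ our notation rather than the paper's: write $P_{\mathrm{ruin}}(\pi,\nu)$ for the ruin probability of policy $\pi$ on instance $\nu$ and $r_t$ for the round-$t$ reward, fix a risk target $\delta \in (0,1)$, take the comparator class to be the $\delta$-safe policies, and measure regret as reward shortfall against the best such policy:

```latex
% Candidate constrained regret for S-MAB (illustrative notation).
\Pi_\delta := \bigl\{\pi : P_{\mathrm{ruin}}(\pi,\nu) \le \delta\bigr\},
\qquad
\mathrm{Reg}_T(\pi) :=
  \sup_{\pi' \in \Pi_\delta}
  \mathbb{E}\Bigl[\textstyle\sum_{t=1}^{T} r_t(\pi')\Bigr]
  - \mathbb{E}\Bigl[\textstyle\sum_{t=1}^{T} r_t(\pi)\Bigr],
\quad \text{with the learner itself required to lie in } \Pi_\delta.
```

A Lagrangian alternative would absorb the constraint into the objective, e.g. maximizing $\mathbb{E}[\sum_t r_t(\pi)] - \lambda\, P_{\mathrm{ruin}}(\pi,\nu)$ for a trade-off weight $\lambda > 0$; whether either form admits meaningful lower bounds and matching algorithms is exactly what the open problem asks.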
§ Discussion
§ Significance & Implications
S-MAB makes safety a first-class, path-dependent criterion: a single unfavorable sequence of observed rewards can trigger irreversible failure via the hitting event $\mathcal{R}_T$. This differs from objectives based only on expectations or terminal/budget constraints because controlling $P_{\text{ruin}}$ depends on the entire trajectory of $(B_t)$ and interacts directly with exploration. A precise regret/trade-off formalism and matching algorithms would clarify what guarantees are achievable (and when) for sequential learning problems where exploration can itself create catastrophic risk.
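The single-arm special case makes this concrete via the classical gambler's ruin. Suppose one arm pays $+1$ with probability $p > 1/2$ and $-1$ otherwise, the initial budget is an integer $b_0 > 0$, the horizon is unbounded, and ruin means the budget first reaches $-1$. The standard gambler's-ruin formula then gives

```latex
% Ruin probability of a positive-drift +/-1 walk started at b_0,
% absorbed at -1 (classical gambler's ruin, p > 1/2).
\mathbb{P}[\text{ruin}] \;=\; \left(\frac{1-p}{p}\right)^{b_0 + 1}.
```

So even a strictly favorable arm carries a nonzero, budget-exponential ruin probability, and a learner that must also sample worse arms during exploration can only do worse — which is why $P_{\text{ruin}}$ cannot be controlled through expectations alone.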
§ Known Partial Results
Perotto et al. (2019): The note introduces and formalizes survival multiarmed bandits (S-MAB), where an evolving budget with both positive and negative rewards is coupled to an explicit ruin event defined by hitting a negative budget.
Perotto et al. (2019): The note motivates S-MAB as distinct from (and potentially related to) budgeted MAB and risk-averse MAB formulations because the safety objective is the probability of a path-dependent ruin event.
Perotto et al. (2019): The source explicitly identifies three open directions: (a) defining an appropriate regret notion for the multiobjective reward/survival setting, (b) establishing reductions to RA-MAB or B-MAB that would transfer known guarantees, and (c) developing strategies and analytical tools tailored to controlling $P_{\text{ruin}}$ while learning.
§ References
Open Problem: Risk of Ruin in Multiarmed Bandits
Filipo S. Perotto, Mathieu Bourgais, Bruno C. Silva, Laurent Vercouter (2019)
Conference on Learning Theory (COLT), PMLR 99
📍 Open-problem note in COLT proceedings.