LEARNING UNDER STRUCTURE AND UNCERTAINTY: ALGORITHMS FOR BANDIT AND ONLINE DECISION MAKING / A. Rumi ; tutor: N. Cesa-Bianchi ; co-supervisor: F. Vitale ; coordinator: R. Sassi. Dipartimento di Informatica Giovanni Degli Antoni, 2025. 38th cycle, academic year 2025/2026.
LEARNING UNDER STRUCTURE AND UNCERTAINTY: ALGORITHMS FOR BANDIT AND ONLINE DECISION MAKING
A. Rumi
2025
Abstract
This thesis investigates the design and analysis of algorithms for online decision-making under uncertainty, with a particular focus on the multi-armed bandit and online learning frameworks. At the core of these settings lies a simple yet powerful interaction protocol: at each round the learner receives a question (information available prior to a decision), provides an answer (an action or prediction), and observes a feedback signal that may be partial, noisy, or structured. This iterative loop captures a wide range of sequential decision problems and serves as the unifying perspective for the contributions of this thesis. We develop this perspective across four fundamental problems in online learning.

First, we address settings in which the learner receives no additional information prior to making a decision (i.e., the question is implicit) and must adapt to a nonstationary environment under bandit feedback. We introduce a parameter-free algorithm achieving optimal dynamic regret guarantees without prior knowledge of the environment's variability, resolving a long-standing open problem in the field.

Next, we enrich the answer phase by allowing the learner to abstain. We propose an efficient algorithm that leverages this option to obtain cumulative reward guarantees that can outperform those of standard methods while matching their worst-case performance. We then extend this framework to the adversarial contextual bandit problem with abstention.

Third, we investigate problems with combinatorial answers, where the learner selects a subset of actions and the reward is defined by a sum-max function, a rich subclass of monotone submodular functions. We design an algorithm for this bandit problem and prove that it achieves a significant improvement in regret over standard algorithms for monotone submodular functions.

Finally, we consider the Stochastic Shortest Path problem, where the additional structure is dictated by an underlying Markov Decision Process. We analyze the impact of sparse cost feedback, demonstrating that standard methods fail to exploit it, and design a novel family of regularizers that achieves an optimal, sparsity-adaptive regret bound. Complemented by a matching lower bound, this result provides a complete characterization of the benefits of sparsity in this setting.

The contributions in this thesis demonstrate that, by designing principled algorithms that recognize and exploit structure within the question-answer-feedback loop, we can achieve strong, adaptive guarantees for learning under uncertainty.
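To make the interaction protocol concrete, the following is a minimal, self-contained sketch of the question-answer-feedback loop on a toy stationary bandit instance. The epsilon-greedy rule and all names here are illustrative assumptions for exposition, not algorithms or code from the thesis.

```python
import random

# Toy instance of the question-answer-feedback loop from the abstract.
# The environment and the epsilon-greedy learner are illustrative only,
# not methods from the thesis.

T = 1000
means = [0.3, 0.5, 0.7]          # unknown Bernoulli reward means, one per arm
counts = [0] * len(means)        # number of pulls per arm
estimates = [0.0] * len(means)   # empirical mean reward per arm
total_reward = 0.0

for t in range(T):
    # Question: implicit here; no side information is revealed to the learner.
    # Answer: epsilon-greedy choice of an arm.
    if random.random() < 0.1:
        arm = random.randrange(len(means))
    else:
        arm = max(range(len(means)), key=lambda i: estimates[i])
    # Feedback: bandit feedback, i.e. only the chosen arm's reward is observed.
    reward = 1.0 if random.random() < means[arm] else 0.0
    total_reward += reward
    # Update: incremental empirical-mean update for the chosen arm.
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print(f"average reward over {T} rounds: {total_reward / T:.3f}")
```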
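For the first contribution, "dynamic regret" admits the following standard formalization, stated here as background; the exact comparator and variability measure used in the thesis may differ.

```latex
% Dynamic regret of arms a_1, ..., a_T against the per-round best arm,
% where \mu_t(a) is the (possibly changing) mean reward of arm a at round t.
% One standard definition, given as background; the thesis's exact
% comparator and variability measure may differ.
R_T \;=\; \sum_{t=1}^{T} \max_{a} \mu_t(a) \;-\; \sum_{t=1}^{T} \mu_t(a_t)
```

Optimal bounds in this setting typically scale with a measure of nonstationarity, such as the number of distribution changes or the total variation of the mean rewards over time.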
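For the combinatorial contribution, one common form of sum-max set function, of facility-location type, is sketched below; this is an assumed formulation given for illustration, and the thesis's precise definition may differ in its details.

```latex
% One common (assumed, illustrative) form of sum-max set function:
% the value of a subset S of actions, given nonnegative weights w_{ij}.
f(S) \;=\; \sum_{j=1}^{m} \max_{i \in S} w_{ij}, \qquad w_{ij} \ge 0.
% Such functions are monotone (adding an action can only raise each inner
% max) and submodular (marginal gains shrink as S grows).
```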
| File | Size | Format |
|---|---|---|
| phd_unimi_R13858.pdf (Doctoral thesis; open access; post-print, accepted manuscript, i.e. the version accepted by the publisher; Creative Commons license) | 3.56 MB | Adobe PDF |