Motivated by the fact that humans like some level of unpredictability or novelty, and might therefore get quickly bored when interacting with a stationary policy, we introduce a novel non-stationary bandit problem, where the expected reward of an arm is fully determined by the time elapsed since the arm last took part in a switch of actions. Our model generalizes previous notions of delay-dependent rewards, and also relaxes most assumptions on the reward function. This enables the modeling of phenomena such as progressive satiation and periodic behaviours. Building upon the Combinatorial Semi-Bandits (CSB) framework, we design an algorithm and prove a bound on its regret with respect to the optimal non-stationary policy (which is NP-hard to compute). Similarly to previous works, our regret analysis is based on defining and solving an appropriate trade-off between approximation and estimation. Preliminary experiments confirm the superiority of our algorithm over both the oracle greedy approach and a vanilla CSB solver.

A Last Switch Dependent Analysis of Satiation and Seasonality in Bandits / P. Laforgue, G. Clerici, N. Cesa-Bianchi, R. Gilad-Bachrach (PROCEEDINGS OF MACHINE LEARNING RESEARCH). - In: International Conference on Artificial Intelligence and Statistics / [a cura di] G. Camps-Valls, F.J. R. Ruiz, I. Valera. - [s.l] : MICROTOME PUBLISHING, 2022. - pp. 971-990 (( convegno AISTATS nel 2022.

A Last Switch Dependent Analysis of Satiation and Seasonality in Bandits

P. Laforgue
Primo
;
G. Clerici
Secondo
;
N. Cesa-Bianchi
Penultimo
;
2022

Abstract

Motivated by the fact that humans like some level of unpredictability or novelty, and might therefore get quickly bored when interacting with a stationary policy, we introduce a novel non-stationary bandit problem, where the expected reward of an arm is fully determined by the time elapsed since the arm last took part in a switch of actions. Our model generalizes previous notions of delay-dependent rewards, and also relaxes most assumptions on the reward function. This enables the modeling of phenomena such as progressive satiation and periodic behaviours. Building upon the Combinatorial Semi-Bandits (CSB) framework, we design an algorithm and prove a bound on its regret with respect to the optimal non-stationary policy (which is NP-hard to compute). Similarly to previous works, our regret analysis is based on defining and solving an appropriate trade-off between approximation and estimation. Preliminary experiments confirm the superiority of our algorithm over both the oracle greedy approach and a vanilla CSB solver.
Settore INF/01 - Informatica
2022
https://proceedings.mlr.press/v151/laforgue22a/laforgue22a.pdf
Book Part (author)
File in questo prodotto:
File Dimensione Formato  
laforgue22a.pdf

accesso aperto

Tipologia: Publisher's version/PDF
Dimensione 4.19 MB
Formato Adobe PDF
4.19 MB Adobe PDF Visualizza/Apri
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/2434/922783
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 3
  • ???jsp.display-item.citation.isi??? 0
social impact