
An Algorithm for Stochastic and Adversarial Bandits with Switching Costs / C. Rouyer, Y. Seldin, N. Cesa-Bianchi (PROCEEDINGS OF MACHINE LEARNING RESEARCH). - In: International Conference on Machine Learning / [edited by] M. Meila, T. Zhang. - [s.l.] : PMLR, 2021. - pp. 9127-9135 (( Conference: International Conference on Machine Learning.

An Algorithm for Stochastic and Adversarial Bandits with Switching Costs

N. Cesa-Bianchi
2021

Abstract

We propose an algorithm for stochastic and adversarial multiarmed bandits with switching costs, where the algorithm pays a price $\lambda$ every time it switches the arm being played. Our algorithm is based on an adaptation of the Tsallis-INF algorithm of Zimmert and Seldin (2021) and requires no prior knowledge of the regime or time horizon. In the oblivious adversarial setting it achieves the minimax optimal regret bound of $O\big((\lambda K)^{1/3} T^{2/3} + \sqrt{KT}\big)$, where $T$ is the time horizon and $K$ is the number of arms. In the stochastically constrained adversarial regime, which includes the stochastic regime as a special case, it achieves a regret bound of $O\big(\big((\lambda K)^{2/3} T^{1/3} + \ln T\big) \sum_{i \neq i^*} \Delta_i^{-1}\big)$, where $\Delta_i$ are the suboptimality gaps and $i^*$ is the unique optimal arm. In the special case of $\lambda = 0$ (no switching costs), both bounds are minimax optimal within constants. We also explore variants of the problem where the switching cost is allowed to change over time. We provide an experimental evaluation showing the competitiveness of our algorithm with the relevant baselines in the stochastic, stochastically constrained adversarial, and adversarial regimes with a fixed switching cost.
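For orientation only, the Python sketch below implements the base 1/2-Tsallis-INF sampling rule of Zimmert and Seldin (2021), which the paper adapts, and runs it on a toy problem where a cost λ is charged whenever the played arm changes. It is not the paper's algorithm: the switching-control mechanism that yields the bounds above is not reproduced, and the learning-rate schedule $\eta_t = 2/\sqrt{t}$, the helper names, and the toy instance are assumptions made for illustration.

```python
# Illustrative sketch only (not the paper's algorithm): base Tsallis-INF sampling
# of Zimmert and Seldin (2021), played on a bandit instance where a cost lam is
# paid each time the chosen arm changes. The paper's adaptation, which controls
# how often the arm switches, is NOT reproduced here.
import numpy as np

rng = np.random.default_rng(0)

def tsallis_inf_probs(cum_loss_est, eta, newton_iters=50):
    """1/2-Tsallis-INF distribution: p_i = 4 / (eta * (Lhat_i - x))^2, with the
    normalization x found by Newton's method so that the p_i sum to one."""
    x = np.min(cum_loss_est) - 2.0 / eta  # start where the weights sum to >= 1
    for _ in range(newton_iters):
        w = 4.0 / (eta * (cum_loss_est - x)) ** 2
        x -= (w.sum() - 1.0) / (eta * np.sum(w ** 1.5))
    w = 4.0 / (eta * (cum_loss_est - x)) ** 2
    return w / w.sum()

def run_bandit(losses, lam=0.1):
    """Play a T x K matrix of losses in [0, 1]; total cost adds lam per switch."""
    T, K = losses.shape
    Lhat = np.zeros(K)          # importance-weighted cumulative loss estimates
    prev_arm, total_cost, switches = None, 0.0, 0
    for t in range(1, T + 1):
        p = tsallis_inf_probs(Lhat, eta=2.0 / np.sqrt(t))  # assumed schedule
        arm = rng.choice(K, p=p)
        loss = losses[t - 1, arm]
        if prev_arm is not None and arm != prev_arm:
            total_cost += lam   # switching cost
            switches += 1
        total_cost += loss
        Lhat[arm] += loss / p[arm]  # importance-weighted loss estimate
        prev_arm = arm
    return total_cost, switches

# Toy stochastic instance: arm 0 is best, the other arms have gap 0.2.
T, K = 5000, 4
means = np.array([0.3, 0.5, 0.5, 0.5])
losses = rng.binomial(1, means, size=(T, K)).astype(float)
print(run_bandit(losses, lam=0.1))
```

Running the sketch shows how the term λ times the number of switches enters the total cost that the regret bounds quoted in the abstract control; the unmodified baseline above typically switches often, which is exactly what the paper's adaptation is designed to limit.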
Academic discipline: INF/01 - Informatica (Computer Science)
   European Learning and Intelligent Systems Excellence (ELISE)
   ELISE
   EUROPEAN COMMISSION
   H2020
   951847

   Algorithms, Games, and Digital Markets (ALGADIMAR)
   ALGADIMAR
   MINISTERO DELL'ISTRUZIONE E DEL MERITO
   2017R9FHSR_006
2021
http://proceedings.mlr.press/v139/rouyer21a/rouyer21a.pdf
Book Part (author)
Files in this record:
File: rouyer21a.pdf
Access: open access
Type: Publisher's version/PDF
Format: Adobe PDF
Size: 429.45 kB

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/857465
Citations
  • PMC: ND
  • Scopus: 4
  • Web of Science (ISI): 0