Nonstochastic Bandits with Composite Anonymous Feedback / N. Cesa-Bianchi, C. Gentile, Y. Mansour. - In: Conference On Learning Theory / edited by S. Bubeck, V. Perchet, P. Rigollet. - [s.l.]: PMLR, 2018. - pp. 750-773. - (Proceedings of Machine Learning Research; 75). Paper presented at the 31st Conference on Learning Theory (COLT), Stockholm, 2018.

Nonstochastic Bandits with Composite Anonymous Feedback

N. Cesa-Bianchi
2018

Abstract

We investigate a nonstochastic bandit setting in which the loss of an action is not immediately charged to the player, but rather spread over at most d consecutive steps in an adversarial way. This implies that the instantaneous loss observed by the player at the end of each round is a sum of as many as d loss components of previously played actions. Hence, unlike the standard bandit setting with delayed feedback, here the player cannot observe the individual delayed losses, but only their sum. Our main contribution is a general reduction transforming a standard bandit algorithm into one that can operate in this harder setting. We also show how the regret of the transformed algorithm can be bounded in terms of the regret of the original algorithm. Our reduction cannot be improved in general: we prove a lower bound on the regret of any bandit algorithm in this setting that matches (up to log factors) the upper bound obtained via our reduction. Finally, we show how our reduction can be extended to more complex bandit settings, such as combinatorial linear bandits and online bandit convex optimization.
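
To make the feedback model concrete, the Python sketch below simulates composite anonymous feedback together with a blocking-style wrapper around an Exp3-style base learner: the action chosen by the base learner is replayed for a whole block of 2d rounds, so that from round d-1 of the block onward every observed sum contains components of that action only, and the block-average of those sums can be fed back to the base learner. This is an illustrative toy in the spirit of the abstract's general reduction, not the paper's actual construction: the Exp3 base, the block length 2d, the uniform random splitting of each loss (standing in for the adversary), and the names exp3_factory and run_composite are all assumptions of this sketch.

import math
import random

rng = random.Random(0)

def exp3_factory(K, eta):
    # Minimal Exp3-style base learner (a sketch): returns (draw, update) closures.
    w = [1.0] * K
    def draw():
        total = sum(w)
        p = [wi / total for wi in w]
        r, acc = rng.random(), 0.0
        for a, pa in enumerate(p):
            acc += pa
            if r <= acc:
                return a, p
        return K - 1, p
    def update(a, p, loss):
        # Importance-weighted loss estimate, multiplicative-weights update.
        w[a] *= math.exp(-eta * loss / p[a])
        m = max(w)
        for i in range(K):
            w[i] /= m  # renormalize to avoid numerical underflow
    return draw, update

def run_composite(K=3, d=4, T=2000):
    draw, update = exp3_factory(K, eta=0.05)
    pending = [0.0] * (T + d)  # pending[r] = sum of loss components charged at round r
    block = 2 * d              # hypothetical block length for the wrapper
    t = 0
    while t + block <= T:
        a, p = draw()          # one base-learner decision, replayed for a whole block
        fed = []
        for s in range(block):
            # Each play of action a incurs a total loss in [0, 1] ...
            total_loss = rng.random() * (a + 1) / K
            # ... split over the next d rounds (uniform random split here,
            # a toy stand-in for the paper's adversarial spreading).
            cuts = sorted(rng.random() for _ in range(d - 1))
            parts = [b - c for c, b in zip([0.0] + cuts, cuts + [1.0])]
            for off, frac in enumerate(parts):
                pending[t + s + off] += frac * total_loss
            observed = pending[t + s]  # the only feedback: an anonymous sum
            if s >= d - 1:
                # From round d-1 of the block on, the observed sum mixes
                # components of action a only, so it is attributable.
                fed.append(observed)
        update(a, p, sum(fed) / len(fed))  # feed the block-average observed loss
        t += block
    return sum(pending[:T]) / T  # average per-round loss actually charged

print("average per-round loss:", run_composite())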
Nonstochastic bandits; composite losses; delayed feedback; bandit convex optimization
Field INF/01 - Computer Science (Settore INF/01 - Informatica)
2018
http://proceedings.mlr.press/v75/cesa-bianchi18a.html
Book Part (author)
Files in this record:
cesa-bianchi18b.pdf - Publisher's version/PDF, Adobe PDF, 404.69 kB (restricted access)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/596747
Citations
  • PubMed Central: not available
  • Scopus: 26
  • Web of Science: not available