A Unified Analysis of Nonstochastic Delayed Feedback for Combinatorial Semi-Bandits, Linear Bandits, and MDPs / D. van der Hoeven, L. Zierahn, T. Lancewicki, A. Rosenberg, N. Cesa-Bianchi. - In: Proceedings of the Thirty-Sixth Conference on Learning Theory / edited by G. Neu, L. Rosasco. - PMLR (Proceedings of Machine Learning Research), 2023. - pp. 1285-1321. Presented at the 36th Annual Conference on Learning Theory, held in Bangalore in 2023.

A Unified Analysis of Nonstochastic Delayed Feedback for Combinatorial Semi-Bandits, Linear Bandits, and MDPs

L. Zierahn; N. Cesa-Bianchi
2023

Abstract

We derive a new analysis of Follow The Regularized Leader (FTRL) for online learning with delayed bandit feedback. By separating the cost of delayed feedback from that of bandit feedback, our analysis allows us to obtain new results in three important settings. On the one hand, we derive the first optimal (up to logarithmic factors) regret bounds for combinatorial semi-bandits with delay and adversarial Markov decision processes with delay (and known transition functions). On the other hand, we use our analysis to derive an efficient algorithm for linear bandits with delay achieving near-optimal regret bounds. Our novel regret decomposition shows that FTRL remains stable across multiple rounds under mild assumptions on the Hessian of the regularizer.
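As a hedged illustration of the setting described in the abstract (not the authors' algorithm), the sketch below runs a generic FTRL update with delayed bandit feedback on K arms. The negative-entropy regularizer (under which the FTRL minimization has the closed-form exponential-weights solution), the fixed learning rate eta, the importance-weighted loss estimator, and the function name ftrl_delayed_bandit are all illustrative assumptions.

```python
import numpy as np

def ftrl_delayed_bandit(K, T, delays, loss_fn, eta=0.1, seed=0):
    """Illustrative FTRL sketch with delayed bandit feedback (not the paper's method).

    K: number of arms; T: horizon; delays[s] >= 1: rounds until the feedback
    of round s becomes available; loss_fn(t, arm): adversarial loss in [0, 1].
    """
    rng = np.random.default_rng(seed)
    cum_est = np.zeros(K)   # cumulative loss estimates whose feedback has arrived
    pending = {}            # round -> (arm played, its probability, observed loss)
    total_loss = 0.0
    for t in range(T):
        # Incorporate feedback whose delay has elapsed by the start of round t.
        for s in [s for s in pending if s + delays[s] <= t]:
            a, pa, l = pending.pop(s)
            cum_est[a] += l / pa  # importance-weighted estimate of the loss vector
        # FTRL step: argmin_p <cum_est, p> + (1/eta) * sum_i p_i log p_i over the
        # simplex, whose closed-form solution is p proportional to exp(-eta * cum_est).
        w = np.exp(-eta * (cum_est - cum_est.min()))  # shift for numerical stability
        p = w / w.sum()
        arm = rng.choice(K, p=p)
        loss = loss_fn(t, arm)
        total_loss += loss
        pending[t] = (arm, p[arm], loss)
    return total_loss
```

For example, ftrl_delayed_bandit(K=5, T=1000, delays=[10] * 1000, loss_fn=lambda t, a: (t + a) % 2) runs the sketch with a uniform delay of 10 rounds.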
Online learning; bandit feedback; delayed feedback
Settore INF/01 - Informatica (Computer Science)
   Algorithms, Games, and Digital Markets (ALGADIMAR)
   ALGADIMAR
   MINISTERO DELL'ISTRUZIONE E DEL MERITO
   2017R9FHSR_006

   European Learning and Intelligent Systems Excellence (ELISE)
   ELISE
   EUROPEAN COMMISSION
   H2020
   951847
2023
https://proceedings.mlr.press/v195/hoeven23a.html
Book Part (author)
Files in this record:
hoeven23a.pdf
Access: open access
Type: Publisher's version/PDF
Size: 473.54 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1024140
Citations
  • Scopus: 0