IRIS Institutional Research Information System - AIR Archivio Istituzionale della Ricerca

Online Learning with Off-Policy Feedback / G. Gabbianelli, G. Neu, M. Papini (Proceedings of Machine Learning Research). - In: International Conference on Algorithmic Learning Theory / [edited by] S. Agrawal, F. Orabona. - [s.l.] : PMLR, 2023. - pp. 620-641 (34th International Conference on Algorithmic Learning Theory, Singapore).

Online Learning with Off-Policy Feedback

Gabbianelli G.; Neu G.; Papini M. (last author)

2023

Abstract

We study the problem of online learning in adversarial bandit problems under a partial observability model called off-policy feedback. In this sequential decision-making problem, the learner cannot directly observe its own rewards, but instead sees the rewards obtained by another, unknown policy run in parallel (the behavior policy). Instead of the standard exploration-exploitation dilemma, the learner faces a different challenge in this setting: because the limited observations are outside of its control, the learner may not be able to estimate the value of each policy equally well. To address this issue, we propose a set of algorithms that guarantee regret bounds that scale with a natural notion of mismatch between any comparator policy and the behavior policy, achieving improved performance against comparators that are well-covered by the observations. We also provide an extension to the setting of adversarial linear contextual bandits, and verify the theoretical guarantees via a set of experiments. Our key algorithmic idea is adapting the notion of pessimistic reward estimators that has recently become popular in the context of off-policy reinforcement learning.
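As an illustration of the pessimistic-estimator idea mentioned in the abstract, the following is a minimal sketch and not the authors' algorithm. It assumes, as a simplification, that the behavior policy's action probabilities are known to the learner and that rewards lie in [0, 1]. The function names, the step size `eta`, and the bias parameter `gamma` are hypothetical choices made for this example. The estimator divides the observed reward by the behavior probability plus `gamma`, so its expectation never exceeds the true reward and it is most conservative for actions the behavior policy rarely plays.

```python
import math
import random


def pessimistic_estimate(reward, observed_action, behavior_probs, gamma):
    """Importance-weighted reward estimate, biased downward by gamma > 0.

    Only the action played by the behavior policy gets a nonzero estimate.
    Adding gamma to the denominator shrinks the estimate, most strongly for
    actions the behavior policy rarely plays (an assumed form of pessimism).
    """
    estimates = [0.0] * len(behavior_probs)
    estimates[observed_action] = reward / (behavior_probs[observed_action] + gamma)
    return estimates


def run_off_policy_exp_weights(reward_rows, behavior_probs, eta, gamma, seed=0):
    """Exponential weights fed only with rewards of the behavior policy's actions.

    Returns the learner's policy in the final round.
    """
    rng = random.Random(seed)
    n_actions = len(behavior_probs)
    cumulative = [0.0] * n_actions  # running sums of estimated rewards
    policy = [1.0 / n_actions] * n_actions
    for rewards in reward_rows:
        # Learner's policy: softmax of cumulative estimates (max-shifted for stability).
        top = max(cumulative)
        weights = [math.exp(eta * (c - top)) for c in cumulative]
        total = sum(weights)
        policy = [w / total for w in weights]
        # The behavior policy, not the learner, picks the observed action.
        observed = rng.choices(range(n_actions), weights=behavior_probs)[0]
        estimate = pessimistic_estimate(rewards[observed], observed, behavior_probs, gamma)
        cumulative = [c + e for c, e in zip(cumulative, estimate)]
    return policy


# Toy run: action 0 always pays 0.9, action 1 always pays 0.1,
# and the (assumed known) behavior policy plays both uniformly.
toy_rewards = [[0.9, 0.1] for _ in range(500)]
final_policy = run_off_policy_exp_weights(toy_rewards, [0.5, 0.5], eta=0.05, gamma=0.05)
```

In this toy run the learner never chooses which reward it observes, yet its policy concentrates on the better action because the behavior policy covers both actions well. With a poorly covered action, the `gamma` term would keep that action's estimate small.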

Keywords: online learning; off-policy; partial monitoring; bandit problems

Scientific-disciplinary sectors of the contribution (valid from 9 May 2024): Sector IINF-05/A - Information Processing Systems; Sector INFO-01/A - Computer Science

Publication date: 2023

URL: https://proceedings.mlr.press/v201/gabbianelli23a.html

Type: Book Part (author)

Appears in types: 03 - Contribution in a volume

Files in this item:

File	Size	Format
gabbianelli23a.pdf (open access; type: Publisher's version/PDF; license: Creative Commons)	275.12 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1226143
