Online Learning with Off-Policy Feedback in Adversarial MDPs / F. Bacchiocchi, F. Stradi, M. Papini, A. Metelli, N. Gatti. In: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI 2024), Jeju Island / edited by K. Larson. [S.l.]: International Joint Conferences on Artificial Intelligence Organization, 2024, pp. 3697-3705. ISBN 978-1-956792-04-1. DOI: 10.24963/ijcai.2024/409.
Online Learning with Off-Policy Feedback in Adversarial MDPs
M. Papini;
2024
Abstract
In this paper, we face the challenge of online learning in adversarial Markov decision processes with off-policy feedback. In this setting, the learner chooses a policy but, unlike in the traditional on-policy setting, the environment is explored by means of a different, fixed, and possibly unknown policy (called the colleague's policy). Off-policy feedback raises an issue absent from traditional settings: the learner is charged with the regret of its chosen policy, yet it observes only the rewards gained by the colleague's policy. First, we present a lower bound for the proposed setting, showing that the optimal dependency of the sublinear regret is on the dissimilarity between the optimal policy in hindsight and the colleague's policy. Then, we propose novel algorithms that, by employing pessimistic estimators, commonly adopted in the offline reinforcement learning literature, ensure sublinear regret bounds depending on the desired dissimilarity, even when the colleague's policy is unknown.

| File | Size | Format | License |
|---|---|---|---|
| 0409.pdf (open access, Publisher's version/PDF) | 227.53 kB | Adobe PDF | Creative Commons |
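The core estimation idea from the abstract, evaluating the learner's policy from data generated by the colleague's policy and subtracting a pessimistic confidence width, can be sketched in a toy setting. This is an illustrative assumption-laden example (a single-state MDP, plain importance weighting, and an ad hoc width based on the largest importance ratio), not the algorithm from the paper; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy single-state MDP with 2 actions. The learner must evaluate its own
# policy using only trajectories generated by the fixed "colleague" policy.
pi_colleague = np.array([0.7, 0.3])   # behavior policy (here assumed known)
pi_learner = np.array([0.2, 0.8])     # policy the learner is charged for
true_rewards = np.array([0.4, 0.9])   # Bernoulli means, unknown to learner

T = 5000
actions = rng.choice(2, size=T, p=pi_colleague)
rewards = rng.binomial(1, true_rewards[actions])

# Importance-weighted estimate of the learner's expected reward.
w = pi_learner[actions] / pi_colleague[actions]
iw_estimate = np.mean(w * rewards)

# Pessimistic estimate: subtract a confidence width that grows with the
# dissimilarity between the two policies (largest importance ratio).
width = np.max(pi_learner / pi_colleague) * np.sqrt(np.log(1 / 0.05) / (2 * T))
pessimistic = iw_estimate - width

print(f"IW estimate: {iw_estimate:.3f}, pessimistic: {pessimistic:.3f}")
```

The pessimistic correction lower-bounds the learner's value with high probability, which is what makes such estimators attractive when the data-collecting policy differs from the evaluated one.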
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.