IRIS Institutional Research Information System - AIR Archivio Istituzionale della Ricerca

Optimistic Policy Optimization via Multiple Importance Sampling / M. Papini, A.M. Metelli, L. Lupo, M. Restelli. - In: Proceedings of the 36th International Conference on Machine Learning. - [s.l.] : PMLR, 2019. - (Proceedings of Machine Learning Research). - pp. 4989-4999. Presented at the International Conference on Machine Learning, Long Beach (California, USA), June 9-15, 2019.

Optimistic Policy Optimization via Multiple Importance Sampling

Papini, M.; Metelli, Alberto Maria; Lupo, Lorenzo; Restelli, Marcello

2019

Abstract

Policy Search (PS) is an effective approach to Reinforcement Learning (RL) for solving control tasks with continuous state-action spaces. In this paper, we address the exploration-exploitation trade-off in PS by proposing an approach based on Optimism in the Face of Uncertainty. We cast the PS problem as a suitable Multi Armed Bandit (MAB) problem, defined over the policy parameter space, and we propose a class of algorithms that effectively exploit the problem structure, by leveraging Multiple Importance Sampling to perform an off-policy estimation of the expected return. We show that the regret of the proposed approach is bounded by $\widetilde{O}(\sqrt{T})$ for both discrete and continuous parameter spaces. Finally, we evaluate our algorithms on tasks of varying difficulty, comparing them with existing MAB and RL algorithms.

Scientific-disciplinary sectors of the contribution (valid from 9 May 2024)
	Sector INFO-01/A - Computer Science
	Sector IINF-05/A - Information Processing Systems

Publication date
	2019

URL
	https://proceedings.mlr.press/v97/papini19a.html

Type
	Book Part (author)

Appears in types:
	03 - Contribution in volume

Files in this item:

File	Access	Type	License	Size	Format
papini19a.pdf	Open access	Publisher's version/PDF	Creative Commons	516.62 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1225937
