
Balancing Learning Speed and Stability in Policy Gradient via Adaptive Exploration / M. Papini, A. Battistello, M. Restelli. - In: AISTATS 2020, Proceedings of Machine Learning Research / edited by Silvia Chiappa and Roberto Calandra. - PMLR, 2020. - pp. 1188-1199. 23rd International Conference on Artificial Intelligence and Statistics, August 26-28, 2020 (online).

Balancing Learning Speed and Stability in Policy Gradient via Adaptive Exploration

M. Papini; A. Battistello; M. Restelli
2020

Abstract

In many Reinforcement Learning (RL) applications, the goal is to find an optimal deterministic policy. However, most RL algorithms require the policy to be stochastic in order to avoid instabilities and perform a sufficient amount of exploration. Adjusting the level of stochasticity during the learning process is non-trivial, as it is difficult to assess whether the costs of random exploration will be repaid in the long run, and to contain the risk of instability. We study this problem in the context of policy gradients (PG) with Gaussian policies. Using tools from the safe PG literature, we design a surrogate objective for the policy variance that captures the effects this parameter has on the learning speed and on the quality of the final solution. Furthermore, we provide a way to optimize this objective which guarantees a stable improvement of the original performance measure. We evaluate the proposed methods on simulated continuous control tasks.
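To make the setting concrete, the following is a minimal sketch (not the paper's algorithm) of policy gradient with a Gaussian policy in which the exploration level is itself a learned parameter: both the mean and the log-standard-deviation are updated by ascending a score-function gradient estimate. The one-step 1-D task, the toy reward, and all names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(a):
    # Toy reward, maximized by the deterministic action a = 2.0 (an assumption
    # for this example; the paper evaluates on continuous control tasks).
    return -(a - 2.0) ** 2

mu, log_sigma = 0.0, 0.0   # Gaussian policy N(mu, sigma^2); sigma = exp(log_sigma)
alpha, batch = 0.05, 500   # step size and batch size

for _ in range(200):
    sigma = np.exp(log_sigma)
    a = rng.normal(mu, sigma, size=batch)   # sample actions (explore)
    r = reward(a)
    b = r.mean()                            # baseline to reduce gradient variance
    # Score-function (likelihood-ratio) gradients for a Gaussian policy:
    #   d log pi / d mu        = (a - mu) / sigma^2
    #   d log pi / d log_sigma = (a - mu)^2 / sigma^2 - 1
    g_mu = np.mean((r - b) * (a - mu) / sigma**2)
    g_ls = np.mean((r - b) * ((a - mu) ** 2 / sigma**2 - 1.0))
    mu += alpha * g_mu
    log_sigma += alpha * g_ls

print(mu, np.exp(log_sigma))
```

In this toy run the mean drifts toward the optimal action while the learned standard deviation shrinks, illustrating the trade-off the paper studies: a larger variance speeds up early exploration but must decrease for the policy to approach a good deterministic solution, and the paper's contribution is a surrogate objective that adjusts it with stability guarantees rather than by naive gradient ascent as done here.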
Settore IINF-05/A - Information processing systems
Settore INFO-01/A - Computer science
http://proceedings.mlr.press/v108/papini20a.html
Book Part (author)
Files for this item:
papini20a.pdf - Publisher's version/PDF, open access, Creative Commons license, 1.03 MB, Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1226139
Citations
  • PMC: n/a
  • Scopus: 15
  • Web of Science: 10
  • OpenAlex: 9