Smoothing policies and safe policy gradients

Papini, M.; Pirotta, M.; Restelli, M.

doi:10.1007/s10994-022-06232-6

Policy gradient (PG) algorithms are among the best candidates for the much-anticipated applications of reinforcement learning to real-world control tasks, such as robotics. However, the trial-and-error nature of these methods poses safety issues whenever the learning process itself must be performed on a physical system or involves any form of human-computer interaction. In this paper, we address a specific safety formulation, where both goals and dangers are encoded in a scalar reward signal and the learning agent is constrained to never worsen its performance, measured as the expected sum of rewards. By studying actor-only PG from a stochastic optimization perspective, we establish improvement guarantees for a wide class of parametric policies, generalizing existing results on Gaussian policies. This, together with novel upper bounds on the variance of PG estimators, allows us to identify meta-parameter schedules that guarantee monotonic improvement with high probability. The two key meta-parameters are the step size of the parameter updates and the batch size of the gradient estimates. Through a joint, adaptive selection of these meta-parameters, we obtain a PG algorithm with monotonic improvement guarantees.
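The interplay between step size and batch size described in the abstract can be illustrated with a generic smooth-optimization argument. The following is a minimal sketch, not the schedule derived in the paper: it assumes the objective is L-smooth with a known constant (`smoothness`), that the per-sample variance of the gradient estimator is bounded by a known `sigma2`, and it uses a plain Chebyshev-style confidence radius. All function names are hypothetical.

```python
import math


def confidence_radius(sigma2: float, batch_size: int, delta: float) -> float:
    """Chebyshev-style radius: with probability at least 1 - delta, the norm
    of the error of a mean of `batch_size` i.i.d. gradient samples is at most
    this value, ASSUMING the per-sample variance is bounded by sigma2."""
    return math.sqrt(sigma2 / (batch_size * delta))


def min_batch_size(sigma2: float, delta: float, eps: float) -> int:
    """Batch size that brings the confidence radius down to (about) eps."""
    return math.ceil(sigma2 / (delta * eps ** 2))


def safe_step_size(grad_norm: float, eps: float, smoothness: float) -> float:
    """Step size maximizing the lower bound
        alpha * g * (g - eps) - 0.5 * L * alpha**2 * g**2
    on the improvement of an L-smooth objective, where g is the norm of the
    estimated gradient and eps bounds the estimation error.
    Returns 0 when the estimate cannot be told apart from noise (g <= eps)."""
    if grad_norm <= eps:
        return 0.0
    return (grad_norm - eps) / (smoothness * grad_norm)


def guaranteed_improvement(grad_norm: float, eps: float, smoothness: float) -> float:
    """Value of the lower bound at the step size chosen above."""
    return max(grad_norm - eps, 0.0) ** 2 / (2.0 * smoothness)
```

Under these assumptions, a larger batch shrinks the confidence radius, which in turn permits a larger step size and a larger guaranteed improvement; when the estimated gradient norm does not exceed the radius, the sketch takes no step at all.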

Smoothing policies and safe policy gradients / M. Papini, M. Pirotta, M. Restelli. - In: MACHINE LEARNING. - ISSN 0885-6125. - 111:11(2022 Nov), pp. 4081-4137. [10.1007/s10994-022-06232-6]

Keywords: Monotonic improvement; Policy gradient; Reinforcement learning; Safe learning
Scientific-disciplinary sectors of the article (valid from 9 May 2024): Sector INFO-01/A - Computer Science; Sector IINF-05/A - Information Processing Systems
Project title: Provably Efficient Algorithms for Large-Scale Reinforcement Learning
Project acronym: SCALER
Funder: European Commission
Funding programme: Horizon 2020 Framework Programme
Contract no.: 950180
Publication date: Nov 2022
Ahead-of-print or print date: 20 Oct 2022
Journal in ANCE: MACHINE LEARNING
DOI: https://dx.doi.org/10.1007/s10994-022-06232-6
Type: Article (author)
Appears in types: 01 - Journal article

Files in this item:

File	Size	Format
unpaywall-bitstream--1152295025.pdf (open access; type: Publisher's version/PDF; license: Creative Commons)	4.63 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1225915

IRIS Institutional Research Information System - AIR Archivio Istituzionale della Ricerca