Stochastic variance-reduced policy gradient / M. Papini, D. Binaghi, G. Canonaco, M. Pirotta, M. Restelli. In: Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, July 10-15, 2018. Proceedings of Machine Learning Research. International Machine Learning Society (IMLS). ISBN 9781510867963. pp. 6422-6431.

Stochastic variance-reduced policy gradient

M. Papini (first author)
2018

Abstract

In this paper, we propose a novel reinforcement-learning algorithm consisting of a stochastic variance-reduced version of policy gradient for solving Markov Decision Processes (MDPs). Stochastic variance-reduced gradient (SVRG) methods have proven to be very successful in supervised learning. However, their adaptation to policy gradient is not straightforward and needs to account for I) a non-concave objective function; II) approximations in the full gradient computation; and III) a non-stationary sampling process. The result is SVRPG, a stochastic variance-reduced policy gradient algorithm that leverages importance weights to preserve the unbiasedness of the gradient estimate. Under standard assumptions on the MDP, we provide convergence guarantees for SVRPG with a convergence rate that is linear under increasing batch sizes. Finally, we suggest practical variants of SVRPG, and we empirically evaluate them on continuous MDPs.
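To make the update described above concrete, here is a minimal Python sketch of one SVRPG epoch: a snapshot full-gradient estimate followed by importance-weighted, variance-reduced inner ascent steps. The helpers `sample_trajectories`, `grad_estimate`, and `importance_weight`, and all hyperparameter values, are hypothetical placeholders, not the authors' implementation; see the paper linked below for the exact estimator and guarantees.

```python
import numpy as np

# Assumed (hypothetical) helpers, not from the authors' code:
#   sample_trajectories(theta, n)            -> n trajectories drawn from pi_theta
#   grad_estimate(tau, theta)                -> per-trajectory policy-gradient estimate (e.g., GPOMDP)
#   importance_weight(tau, theta_from, theta_to) -> p(tau | theta_from) / p(tau | theta_to)

def svrpg_epoch(theta_snap, sample_trajectories, grad_estimate, importance_weight,
                snapshot_size=100, batch_size=10, inner_steps=5, lr=1e-2):
    """One SVRPG epoch: snapshot gradient plus variance-reduced inner updates."""
    # Snapshot: the "full" gradient at theta_snap is itself a Monte Carlo
    # estimate from a large batch (approximation II in the abstract).
    snap_trajs = sample_trajectories(theta_snap, snapshot_size)
    mu = np.mean([grad_estimate(tau, theta_snap) for tau in snap_trajs], axis=0)

    theta = np.array(theta_snap, dtype=float)
    for _ in range(inner_steps):
        # Mini-batch drawn from the *current* policy: the sampling process is
        # non-stationary (point III in the abstract).
        batch = sample_trajectories(theta, batch_size)
        directions = []
        for tau in batch:
            # The importance weight corrects for tau being sampled from
            # pi_theta rather than pi_theta_snap, preserving unbiasedness.
            w = importance_weight(tau, theta_snap, theta)
            directions.append(grad_estimate(tau, theta) - w * grad_estimate(tau, theta_snap))
        v = np.mean(directions, axis=0) + mu
        theta = theta + lr * v  # gradient *ascent* on the expected return
    return theta
```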
Sector INFO-01/A - Computer Science
Sector IINF-05/A - Information Processing Systems
2018
https://proceedings.mlr.press/v80/papini18a.html
Book Part (author)
Files in this record:

File: papini18a.pdf (open access)
Type: Publisher's version/PDF
License: Creative Commons
Size: 513.95 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1225935
Citations
  • PMC: n/a
  • Scopus: 41
  • Web of Science: 89
  • OpenAlex: 16