A. Montenegro, M. Mussi, M. Papini, A.M. Metelli. Convergence Analysis of Policy Gradient Methods with Dynamic Stochasticity. In: International Conference on Machine Learning (ICML), Vancouver, 2025. Edited by A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, J. Zhu. Proceedings of Machine Learning Research, PMLR, pp. 44652-44698.
Convergence Analysis of Policy Gradient Methods with Dynamic Stochasticity
M. Papini (penultimate author)
2025
Abstract
Policy gradient (PG) methods are effective reinforcement learning (RL) approaches, particularly for continuous problems. While they optimize stochastic (hyper)policies via action- or parameter-space exploration, real-world applications often require deterministic policies. Existing PG convergence guarantees to deterministic policies assume a fixed stochasticity in the (hyper)policy, tuned according to the desired final suboptimality, whereas practitioners commonly use a dynamic stochasticity level. This work provides the theoretical foundations for this practice. We introduce PES, a phase-based method that reduces stochasticity via a deterministic schedule while running PG subroutines with fixed stochasticity in each phase. Under gradient domination assumptions, PES achieves last-iterate convergence to the optimal deterministic policy with a sample complexity of order $\tilde{O}(\epsilon^{-5})$. Additionally, we analyze the common practice, termed SL-PG, of jointly learning the stochasticity (via an appropriate parameterization) and the (hyper)policy parameters. We show that SL-PG also ensures last-iterate convergence, with a rate of $\tilde{O}(\epsilon^{-3})$, but only to the optimal stochastic (hyper)policy, and under stronger assumptions than PES.
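
To make the phase-based scheme concrete, here is a minimal sketch of how such a method could look. The phase structure (a PG subroutine with fixed stochasticity per phase, followed by a deterministic shrinking of the stochasticity) is taken from the abstract; everything else — the Gaussian hyperpolicy with parameter-space exploration, the REINFORCE-style estimator, the halving schedule, and the names `pg_subroutine`, `pes`, and `sample_return` — is an illustrative assumption, not the paper's actual algorithm.

```python
import numpy as np

def pg_subroutine(theta, sigma, n_iters, sample_return, lr=0.01, batch=16):
    """PG subroutine run at a FIXED stochasticity level sigma.

    Parameter-space exploration with an assumed Gaussian hyperpolicy
    N(theta, sigma^2 I): perturb theta, observe a return, and ascend a
    score-function (REINFORCE-style) gradient estimate. `sample_return`
    maps a deterministic parameter vector to one sampled return
    (assumed environment interface).
    """
    for _ in range(n_iters):
        grad = np.zeros_like(theta)
        for _ in range(batch):
            eps = np.random.randn(*theta.shape)
            ret = sample_return(theta + sigma * eps)
            # ret times grad_theta log N(theta + sigma*eps; theta, sigma^2 I)
            grad += ret * eps / sigma
        theta = theta + lr * grad / batch
    return theta

def pes(theta0, sigma0, n_phases, iters_per_phase, sample_return):
    """Phase k runs the PG subroutine with sigma_k held fixed; a
    deterministic schedule then shrinks sigma between phases."""
    theta, sigma = np.asarray(theta0, dtype=float), float(sigma0)
    for _ in range(n_phases):
        theta = pg_subroutine(theta, sigma, iters_per_phase, sample_return)
        sigma *= 0.5  # assumed halving schedule
    return theta  # parameters of the final deterministic policy
```

Note the design point this sketch mirrors: within each phase the subroutine sees a constant sigma, matching the abstract's "PG subroutines with fixed stochasticity in each phase", while the dynamics of the stochasticity live entirely in the outer schedule.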
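For contrast, a similarly hedged sketch of the jointly-learned variant: here the stochasticity is parameterized as sigma = exp(rho) so that plain gradient ascent keeps it positive. The exponential parameterization and the estimator are assumptions for illustration — the abstract only says the stochasticity is learned "via an appropriate parameterization" alongside the (hyper)policy parameters.

```python
import numpy as np

def sl_pg(theta0, rho0, n_iters, sample_return, lr=0.01, batch=16):
    """Jointly ascend (theta, rho), with stochasticity sigma = exp(rho).

    Same assumed Gaussian hyperpolicy and environment interface as the
    PES sketch above; both are illustrative assumptions.
    """
    theta, rho = np.asarray(theta0, dtype=float), float(rho0)
    for _ in range(n_iters):
        sigma = np.exp(rho)
        g_theta, g_rho = np.zeros_like(theta), 0.0
        for _ in range(batch):
            eps = np.random.randn(*theta.shape)
            ret = sample_return(theta + sigma * eps)
            g_theta += ret * eps / sigma
            # d/d rho of log N(theta + sigma*eps; theta, sigma^2 I)
            # reduces to sum(eps^2 - 1) after the chain rule through exp(rho)
            g_rho += ret * float(np.sum(eps ** 2 - 1.0))
        theta = theta + lr * g_theta / batch
        rho = rho + lr * g_rho / batch
    # Per the abstract, this converges to the optimal *stochastic*
    # (hyper)policy, so the learned sigma is returned alongside theta.
    return theta, np.exp(rho)
```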
| File | Type | License | Size | Format |
|---|---|---|---|---|
| 13207_Convergence_Analysis_of_ (3).pdf (open access) | Publisher's version/PDF | Creative Commons | 1.29 MB | Adobe PDF |