Convergence Analysis of Policy Gradient Methods with Dynamic Stochasticity / A. Montenegro, M. Mussi, M. Papini, A.M. Metelli (Proceedings of Machine Learning Research). - In: International Conference on Machine Learning / edited by A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, J. Zhu. - [s.l.] : PMLR, 2025. - pp. 44652-44698. (Presented at the International Conference on Machine Learning, Vancouver, 2025.)

Convergence Analysis of Policy Gradient Methods with Dynamic Stochasticity

M. Papini (penultimate author)
2025

Abstract

Policy gradient (PG) methods are effective reinforcement learning (RL) approaches, particularly for continuous problems. While they optimize stochastic (hyper)policies via action- or parameter-space exploration, real-world applications often require deterministic policies. Existing PG convergence guarantees to deterministic policies assume a fixed stochasticity in the (hyper)policy, tuned according to the desired final suboptimality, whereas practitioners commonly use a dynamic stochasticity level. This work provides the theoretical foundations for this practice. We introduce PES, a phase-based method that reduces stochasticity via a deterministic schedule while running PG subroutines with fixed stochasticity in each phase. Under gradient domination assumptions, PES achieves last-iterate convergence to the optimal deterministic policy with a sample complexity of order Õ(ε⁻⁵). Additionally, we analyze the common practice, termed SL-PG, of jointly learning the stochasticity (via an appropriate parameterization) and the (hyper)policy parameters. We show that SL-PG also ensures last-iterate convergence, at a rate of Õ(ε⁻³), but only to the optimal stochastic (hyper)policy and under stronger assumptions than PES.
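
The abstract describes PES as a phase-based scheme: a deterministic schedule shrinks the (hyper)policy stochasticity across phases, while a standard PG subroutine runs with that stochasticity held fixed within each phase. The Python sketch below illustrates this structure only, under assumptions not stated in the abstract: the function names, the halving schedule, and the per-phase iteration budget are illustrative placeholders, not the paper's exact algorithm or constants.

def pg_subroutine(theta, sigma, num_iters, step_size, grad_estimator):
    """Plain stochastic policy-gradient ascent with the stochasticity level sigma held fixed."""
    for _ in range(num_iters):
        grad = grad_estimator(theta, sigma)  # sampled estimate of the gradient of J(theta; sigma)
        theta = theta + step_size * grad
    return theta

def pes(theta_0, sigma_0, num_phases, iters_per_phase, step_size, grad_estimator):
    """Phase-based sketch: stochasticity follows a deterministic schedule across phases."""
    theta, sigma = theta_0, sigma_0
    for k in range(num_phases):
        # Within a phase, sigma is constant and only theta is updated by the PG subroutine.
        theta = pg_subroutine(theta, sigma, iters_per_phase, step_size, grad_estimator)
        # Deterministic schedule shrinking the stochasticity between phases
        # (a halving schedule is assumed here purely for illustration).
        sigma = sigma_0 / 2 ** (k + 1)
    return theta  # parameters of the final, nearly deterministic (hyper)policy

By contrast, the SL-PG practice analyzed in the paper learns the stochasticity jointly with the (hyper)policy parameters through an appropriate parameterization (for instance, updating log sigma by gradient ascent alongside theta), which the abstract shows converges to the optimal stochastic (hyper)policy rather than the deterministic one.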
Reinforcement Learning; Policy Gradients; Convergence; Deterministic Policies
Settore IINF-05/A - Information Processing Systems
Settore INFO-01/A - Computer Science
2025
https://proceedings.mlr.press/v267/montenegro25a.html
Book Part (author)
Files in this record:
File: 13207_Convergence_Analysis_of_ (3).pdf
Access: open access
Type: Publisher's version/PDF
License: Creative Commons
Size: 1.29 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1226137
Citations
  • Scopus: 0
  • OpenAlex: not available