A. Montenegro, M. Mussi, M. Papini, A.M. Metelli. Convergence Analysis of Policy Gradient Methods with Dynamic Stochasticity. In: International Conference on Machine Learning (ICML), Vancouver, 2025. Edited by A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, J. Zhu. Proceedings of Machine Learning Research, PMLR, pp. 44652-44698.
Convergence Analysis of Policy Gradient Methods with Dynamic Stochasticity
M. Papini (penultimate author)
2025
Abstract
Policy gradient (PG) methods are effective reinforcement learning (RL) approaches, particularly for continuous problems. While they optimize stochastic (hyper)policies via action- or parameter-space exploration, real-world applications often require deterministic policies. Existing PG convergence guarantees to deterministic policies assume a fixed stochasticity in the (hyper)policy, tuned according to the desired final suboptimality, whereas practitioners commonly use a dynamic stochasticity level. This work provides the theoretical foundations for this practice. We introduce PES, a phase-based method that reduces stochasticity via a deterministic schedule while running PG subroutines with fixed stochasticity in each phase. Under gradient domination assumptions, PES achieves last-iterate convergence to the optimal deterministic policy with a sample complexity of order $\tilde{O}(\epsilon^{-5})$. Additionally, we analyze the common practice, termed SL-PG, of jointly learning the stochasticity (via an appropriate parameterization) and the (hyper)policy parameters. We show that SL-PG also ensures last-iterate convergence, with a rate of $\tilde{O}(\epsilon^{-3})$, but only to the optimal stochastic (hyper)policy, and under stronger assumptions than PES.
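
To make the phase-based scheme concrete, here is a minimal sketch of how such a method could look. The phase structure (a PG subroutine with fixed stochasticity per phase, followed by a deterministic shrinking of the stochasticity) is taken from the abstract; everything else — the Gaussian hyperpolicy with parameter-space exploration, the REINFORCE-style estimator, the halving schedule, and the names `pg_subroutine`, `pes`, and `sample_return` — is an illustrative assumption, not the paper's actual algorithm.

```python
import numpy as np

def pg_subroutine(theta, sigma, n_iters, sample_return, lr=0.01, batch=16):
    """PG subroutine run at a FIXED stochasticity level sigma.

    Parameter-space exploration with an assumed Gaussian hyperpolicy
    N(theta, sigma^2 I): perturb theta, observe a return, and ascend a
    score-function (REINFORCE-style) gradient estimate. `sample_return`
    maps a deterministic parameter vector to one sampled return
    (assumed environment interface).
    """
    for _ in range(n_iters):
        grad = np.zeros_like(theta)
        for _ in range(batch):
            eps = np.random.randn(*theta.shape)
            ret = sample_return(theta + sigma * eps)
            # ret times grad_theta log N(theta + sigma*eps; theta, sigma^2 I)
            grad += ret * eps / sigma
        theta = theta + lr * grad / batch
    return theta

def pes(theta0, sigma0, n_phases, iters_per_phase, sample_return):
    """Phase k runs the PG subroutine with sigma_k held fixed; a
    deterministic schedule then shrinks sigma between phases."""
    theta, sigma = np.asarray(theta0, dtype=float), float(sigma0)
    for _ in range(n_phases):
        theta = pg_subroutine(theta, sigma, iters_per_phase, sample_return)
        sigma *= 0.5  # assumed halving schedule
    return theta  # parameters of the final deterministic policy
```

Note the design point this sketch mirrors: within each phase the subroutine sees a constant sigma, matching the abstract's "PG subroutines with fixed stochasticity in each phase", while the dynamics of the stochasticity live entirely in the outer schedule.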
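For contrast, a similarly hedged sketch of the jointly-learned variant: here the stochasticity is parameterized as sigma = exp(rho) so that plain gradient ascent keeps it positive. The exponential parameterization and the estimator are assumptions for illustration — the abstract only says the stochasticity is learned "via an appropriate parameterization" alongside the (hyper)policy parameters.

```python
import numpy as np

def sl_pg(theta0, rho0, n_iters, sample_return, lr=0.01, batch=16):
    """Jointly ascend (theta, rho), with stochasticity sigma = exp(rho).

    Same assumed Gaussian hyperpolicy and environment interface as the
    PES sketch above; both are illustrative assumptions.
    """
    theta, rho = np.asarray(theta0, dtype=float), float(rho0)
    for _ in range(n_iters):
        sigma = np.exp(rho)
        g_theta, g_rho = np.zeros_like(theta), 0.0
        for _ in range(batch):
            eps = np.random.randn(*theta.shape)
            ret = sample_return(theta + sigma * eps)
            g_theta += ret * eps / sigma
            # d/d rho of log N(theta + sigma*eps; theta, sigma^2 I)
            # reduces to sum(eps^2 - 1) after the chain rule through exp(rho)
            g_rho += ret * float(np.sum(eps ** 2 - 1.0))
        theta = theta + lr * g_theta / batch
        rho = rho + lr * g_rho / batch
    # Per the abstract, this converges to the optimal *stochastic*
    # (hyper)policy, so the learned sigma is returned alongside theta.
    return theta, np.exp(rho)
```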
| File | Type | License | Size | Format |
|---|---|---|---|---|
| 13207_Convergence_Analysis_of_ (3).pdf (open access) | Publisher's version/PDF | Creative Commons | 1.29 MB | Adobe PDF |