Policy gradient (PG) algorithms are among the best candidates for the much-anticipated applications of reinforcement learning to real-world control tasks, such as robotics. However, the trial-and-error nature of these methods poses safety issues whenever the learning process itself must be performed on a physical system or involves any form of human-computer interaction. In this paper, we address a specific safety formulation, where both goals and dangers are encoded in a scalar reward signal and the learning agent is constrained to never worsen its performance, measured as the expected sum of rewards. By studying actor-only PG from a stochastic optimization perspective, we establish improvement guarantees for a wide class of parametric policies, generalizing existing results on Gaussian policies. This, together with novel upper bounds on the variance of PG estimators, allows us to identify meta-parameter schedules that guarantee monotonic improvement with high probability. The two key meta-parameters are the step size of the parameter updates and the batch size of the gradient estimates. Through a joint, adaptive selection of these meta-parameters, we obtain a PG algorithm with monotonic improvement guarantees.

Smoothing policies and safe policy gradients / M. Papini, M. Pirotta, M. Restelli. - In: MACHINE LEARNING. - ISSN 0885-6125. - 111:11(2022 Nov), pp. 4081-4137. [10.1007/s10994-022-06232-6]

Smoothing policies and safe policy gradients

M. Papini
Primo
;
2022

Abstract

Policy gradient (PG) algorithms are among the best candidates for the much-anticipated applications of reinforcement learning to real-world control tasks, such as robotics. However, the trial-and-error nature of these methods poses safety issues whenever the learning process itself must be performed on a physical system or involves any form of human-computer interaction. In this paper, we address a specific safety formulation, where both goals and dangers are encoded in a scalar reward signal and the learning agent is constrained to never worsen its performance, measured as the expected sum of rewards. By studying actor-only PG from a stochastic optimization perspective, we establish improvement guarantees for a wide class of parametric policies, generalizing existing results on Gaussian policies. This, together with novel upper bounds on the variance of PG estimators, allows us to identify meta-parameter schedules that guarantee monotonic improvement with high probability. The two key meta-parameters are the step size of the parameter updates and the batch size of the gradient estimates. Through a joint, adaptive selection of these meta-parameters, we obtain a PG algorithm with monotonic improvement guarantees.
Monotonic improvement; Policy gradient; Reinforcement learning; Safe learning
Settore INFO-01/A - Informatica
Settore IINF-05/A - Sistemi di elaborazione delle informazioni
   Provably Efficient Algorithms for Large-Scale Reinforcement Learning
   SCALER
   European Commission
   Horizon 2020 Framework Programme
   950180
nov-2022
20-ott-2022
Article (author)
File in questo prodotto:
File Dimensione Formato  
unpaywall-bitstream--1152295025.pdf

accesso aperto

Tipologia: Publisher's version/PDF
Licenza: Creative commons
Dimensione 4.63 MB
Formato Adobe PDF
4.63 MB Adobe PDF Visualizza/Apri
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/2434/1225915
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 17
  • ???jsp.display-item.citation.isi??? 9
  • OpenAlex 20
social impact