Policy optimization via importance sampling / A. M. Metelli, M. Papini, F. Faccio, M. Restelli. In: Advances in Neural Information Processing Systems 31 (NeurIPS 2018), edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett. Neural Information Processing Systems Foundation, 2018, pp. 5442-5454. Presented at the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montreal, 2018.

Policy optimization via importance sampling

M. Papini (second author); F. Faccio (penultimate author)
2018

Abstract

Policy optimization is an effective reinforcement learning approach for solving continuous control tasks. Recent achievements have shown that alternating online and offline optimization is a successful choice for efficient trajectory reuse. However, deciding when to stop optimizing and collect new trajectories is non-trivial, as it requires accounting for the variance of the objective function estimate. In this paper, we propose a novel, model-free, policy search algorithm, POIS, applicable in both action-based and parameter-based settings. We first derive a high-confidence bound for importance sampling estimation; then we define a surrogate objective function, which is optimized offline whenever a new batch of trajectories is collected. Finally, the algorithm is tested on a selection of continuous control tasks, with both linear and deep policies, and compared with state-of-the-art policy optimization methods.
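The abstract's central mechanism, estimating the expected return of a candidate policy from trajectories collected under the current one and penalizing that estimate by a term that grows with the dispersion of the importance weights, can be sketched in a few lines. The Python sketch below is illustrative only: the function names, the penalty form (the empirical second moment of the weights, a proxy for the exponentiated 2-Renyi divergence), and the penalty_coef parameter are assumptions, not the paper's exact bound or algorithm.

```python
import numpy as np

def is_estimate(returns, logp_target, logp_behavior):
    """Importance-sampling estimate of the target policy's expected return,
    using trajectories collected under the behavioral policy."""
    # Per-trajectory importance weights w_i = p_target(tau_i) / p_behavior(tau_i)
    w = np.exp(logp_target - logp_behavior)
    return np.mean(w * returns)

def surrogate(returns, logp_target, logp_behavior, penalty_coef=1.0):
    """IS estimate minus an assumed variance penalty: larger weight dispersion
    (policies further apart) lowers the surrogate, discouraging over-optimizing
    offline on a stale batch of trajectories."""
    w = np.exp(logp_target - logp_behavior)
    n = len(returns)
    d2_hat = np.mean(w ** 2)  # empirical divergence proxy (second moment of weights)
    return np.mean(w * returns) - penalty_coef * np.sqrt(d2_hat / n)
```

In the alternating scheme the abstract describes, one would repeatedly ascend such a surrogate offline (e.g., by gradient steps on the target policy's parameters) and collect a fresh batch of trajectories only once the penalized improvement stalls.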
Field INFO-01/A - Computer Science
Field IINF-05/A - Information Processing Systems
2018
Book Part (author)
Files in this record:
NeurIPS-2018-policy-optimization-via-importance-sampling-Paper.pdf (Publisher's version/PDF, Adobe PDF, 675.3 kB; restricted access; no license)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1225936
Citations
  • PMC: not available
  • Scopus: 74
  • Web of Science: 29
  • OpenAlex: 23