Policy optimization via importance sampling / A.M. Metelli, M. Papini, F. Faccio, M. Restelli. In: Advances in Neural Information Processing Systems 31 (NeurIPS 2018), edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett. Neural Information Processing Systems Foundation, 2018, pp. 5442-5454. Presented at the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montreal, 2018.
Policy optimization via importance sampling
A.M. Metelli; M. Papini; F. Faccio; M. Restelli
2018
Abstract
Policy optimization is an effective reinforcement learning approach for solving continuous control tasks. Recent achievements have shown that alternating online and offline optimization is a successful choice for efficient trajectory reuse. However, deciding when to stop optimizing and collect new trajectories is non-trivial, as it requires accounting for the variance of the objective function estimate. In this paper, we propose a novel, model-free, policy search algorithm, POIS, applicable in both action-based and parameter-based settings. We first derive a high-confidence bound for importance sampling estimation; then we define a surrogate objective function, which is optimized offline whenever a new batch of trajectories is collected. Finally, the algorithm is tested on a selection of continuous control tasks, with both linear and deep policies, and compared with state-of-the-art policy optimization methods.
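As a rough illustration of the idea sketched in the abstract, the snippet below implements a generic variance-penalized importance-sampling estimate of a target policy's expected return, using a Cantelli-style confidence penalty on the empirical variance. This is a hedged sketch, not the paper's method: POIS's actual surrogate penalizes the estimate through the Rényi divergence between target and behavior policies, and all names here (`is_lower_bound`, `logp_target`, `logp_behavior`) are illustrative.

```python
import numpy as np

def is_lower_bound(returns, logp_target, logp_behavior, delta=0.2):
    """Illustrative variance-penalized importance-sampling (IS) estimate
    of a target policy's expected return from off-policy trajectories.

    NOTE: generic sketch only. The POIS paper bounds the estimation error
    via the Renyi divergence between target and behavior policies, not
    via this empirical-variance penalty.

    returns       : shape (N,), total return of each sampled trajectory
    logp_target   : shape (N,), trajectory log-probability under the
                    candidate (target) policy
    logp_behavior : shape (N,), trajectory log-probability under the
                    policy that collected the data
    delta         : confidence level of the heuristic penalty term
    """
    n = len(returns)
    weights = np.exp(logp_target - logp_behavior)   # per-trajectory IS weights
    values = weights * returns
    estimate = np.mean(values)                      # unbiased IS estimate
    variance = np.var(values, ddof=1)               # empirical variance
    # Cantelli-style high-confidence lower bound: with probability at
    # least 1 - delta (if the true variance were known), the expected
    # return is above the estimate minus this penalty, which shrinks
    # as the batch size N grows.
    penalty = np.sqrt((1.0 - delta) / delta * variance / n)
    return estimate - penalty
```

In the alternating scheme the abstract describes, such a surrogate would be maximized offline in the target policy's parameters, and a new batch of trajectories would be collected once the penalty term outweighs any further estimated improvement.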
| File | Type | License | Size | Format | Access |
|---|---|---|---|---|---|
| NeurIPS-2018-policy-optimization-via-importance-sampling-Paper.pdf | Publisher's version/PDF | No license | 675.3 kB | Adobe PDF | Restricted (copy available on request) |