IRIS Institutional Research Information System - AIR Archivio Istituzionale della Ricerca

No-Regret Reinforcement Learning in Smooth MDPs / D. Maran, A.M. Metelli, M. Papini, M. Restelli (PROCEEDINGS OF MACHINE LEARNING RESEARCH). - In: International Conference on Machine Learning / [edited by] R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, F. Berkenkamp. - [s.l] : PMLR, 2024. - pp. 1-30 (41st International Conference on Machine Learning, Vienna, 2024).

No-Regret Reinforcement Learning in Smooth MDPs

Davide Maran; Alberto Maria Metelli; M. Papini; Marcello Restelli

2024

Abstract

Obtaining no-regret guarantees for reinforcement learning (RL) in problems with continuous state and/or action spaces is still one of the major open challenges in the field. Recently, a variety of solutions have been proposed, but outside of very specific settings, the general problem remains unsolved. In this paper, we introduce a novel structural assumption on Markov decision processes (MDPs), namely ν-smoothness, that generalizes most of the settings proposed so far (e.g., linear MDPs and Lipschitz MDPs). To face this challenging scenario, we propose two algorithms for regret minimization in ν-smooth MDPs. Both algorithms build upon the idea of constructing an MDP representation through an orthogonal feature map based on Legendre polynomials. The first algorithm, Legendre-Eleanor, achieves the no-regret property under weaker assumptions but is computationally inefficient, whereas the second one, Legendre-LSVI, runs in polynomial time, although for a smaller class of problems. After analyzing their regret properties, we compare our results with state-of-the-art ones from RL theory, showing that our algorithms achieve the best guarantees.
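
As a rough illustration of the "orthogonal feature map based on Legendre polynomials" mentioned in the abstract, the sketch below evaluates the first few Legendre polynomials on a one-dimensional input in [-1, 1] and numerically checks their orthogonality. This is only a minimal sketch under stated assumptions: the function names `legendre_features` and `gram_matrix` are hypothetical, the input is assumed to be a scalar already rescaled to [-1, 1], and the exact construction used by Legendre-Eleanor and Legendre-LSVI (multivariate inputs, normalization, choice of degree) is not described in this record.

```python
import numpy as np
from numpy.polynomial import legendre


def legendre_features(x, degree):
    """Evaluate the Legendre polynomials P_0..P_degree at points x in [-1, 1].

    Hypothetical helper: returns an array of shape (len(x), degree + 1).
    """
    x = np.asarray(x, dtype=float)
    return legendre.legvander(x, degree)


def gram_matrix(degree, n_nodes=64):
    """Gram matrix of the features under the Lebesgue measure on [-1, 1].

    Computed with Gauss-Legendre quadrature, which is exact here because
    the integrands are polynomials of degree at most 2 * degree.
    """
    nodes, weights = legendre.leggauss(n_nodes)
    phi = legendre_features(nodes, degree)
    return phi.T @ (weights[:, None] * phi)


if __name__ == "__main__":
    # Off-diagonal entries vanish (orthogonality); the diagonal holds the
    # squared norms 2 / (2n + 1) of the unnormalized polynomials.
    print(np.round(gram_matrix(4), 6))
```

Dividing the n-th feature by sqrt(2 / (2n + 1)) would make the map orthonormal; whether the paper normalizes in this way is an assumption not confirmed by the abstract.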


Scientific-disciplinary sectors of the contribution (valid from 9 May 2024)
	Sector IINF-05/A - Information Processing Systems
	Sector INFO-01/A - Computer Science
Publication date
	2024
URL
	https://proceedings.mlr.press/v235/maran24a.html
Type
	Book Part (author)
Appears in types:
	03 - Contribution in volume

Files in this item:

File	Size	Format
No-Regret Reinforcement Learning in Smooth MDPs.pdf (open access; type: Publisher's version/PDF; license: Creative Commons)	515.07 kB	Adobe PDF


Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1226146
