Online Learning with Off-Policy Feedback in Adversarial MDPs / F. Bacchiocchi, F. Stradi, M. Papini, A. Metelli, N. Gatti. In: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI 2024), Jeju Island / edited by K. Larson. [S.l.]: International Joint Conferences on Artificial Intelligence Organization, 2024, pp. 3697-3705. ISBN 978-1-956792-04-1. DOI: 10.24963/ijcai.2024/409.
Online Learning with Off-Policy Feedback in Adversarial MDPs
M. Papini;
2024
Abstract
In this paper, we face the challenge of online learning in adversarial Markov decision processes with off-policy feedback. In this setting, the learner chooses a policy but, unlike in the traditional on-policy setting, the environment is explored by means of a different, fixed, and possibly unknown policy (called the colleague's policy). Off-policy feedback raises an issue absent from traditional settings: the learner is charged with the regret of its chosen policy, yet it observes only the rewards gained by the colleague's policy. First, we present a lower bound for the proposed setting, showing that the optimal dependency of the sublinear regret is on the dissimilarity between the optimal policy in hindsight and the colleague's policy. Then, we propose novel algorithms that, by employing pessimistic estimators, commonly adopted in the offline reinforcement learning literature, ensure sublinear regret bounds depending on the desired dissimilarity, even when the colleague's policy is unknown.

| File | Size | Format | License |
|---|---|---|---|
| 0409.pdf (open access, Publisher's version/PDF) | 227.53 kB | Adobe PDF | Creative Commons |
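The core estimation idea from the abstract, evaluating the learner's policy from data generated by the colleague's policy and subtracting a pessimistic confidence width, can be sketched in a toy setting. This is an illustrative assumption-laden example (a single-state MDP, plain importance weighting, and an ad hoc width based on the largest importance ratio), not the algorithm from the paper; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy single-state MDP with 2 actions. The learner must evaluate its own
# policy using only trajectories generated by the fixed "colleague" policy.
pi_colleague = np.array([0.7, 0.3])   # behavior policy (here assumed known)
pi_learner = np.array([0.2, 0.8])     # policy the learner is charged for
true_rewards = np.array([0.4, 0.9])   # Bernoulli means, unknown to learner

T = 5000
actions = rng.choice(2, size=T, p=pi_colleague)
rewards = rng.binomial(1, true_rewards[actions])

# Importance-weighted estimate of the learner's expected reward.
w = pi_learner[actions] / pi_colleague[actions]
iw_estimate = np.mean(w * rewards)

# Pessimistic estimate: subtract a confidence width that grows with the
# dissimilarity between the two policies (largest importance ratio).
width = np.max(pi_learner / pi_colleague) * np.sqrt(np.log(1 / 0.05) / (2 * T))
pessimistic = iw_estimate - width

print(f"IW estimate: {iw_estimate:.3f}, pessimistic: {pessimistic:.3f}")
```

The pessimistic correction lower-bounds the learner's value with high probability, which is what makes such estimators attractive when the data-collecting policy differs from the evaluated one.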
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.