Scalable Representation Learning in Linear Contextual Bandits with Constant Regret Guarantees / A. Tirinzoni, M. Papini, A. Touati, A. Lazaric, M. Pirotta. In: Advances in Neural Information Processing Systems, 36th Conference on Neural Information Processing Systems (NeurIPS 2022, New Orleans), edited by S. Koyejo, S. Mohamed. Neural Information Processing Systems Foundation, 2022. ISBN 9781713871088, pp. 2307-2319.
Scalable Representation Learning in Linear Contextual Bandits with Constant Regret Guarantees
M. Papini (second author)
2022
Abstract
We study the problem of representation learning in stochastic contextual linear bandits. While the primary concern in this domain is usually to find realizable representations (i.e., those that allow predicting the reward function at any context-action pair exactly), it has been recently shown that representations with certain spectral properties (called HLS) may be more effective for the exploration-exploitation task, enabling LinUCB to achieve constant (i.e., horizon-independent) regret. In this paper, we propose BANDITSRL, a representation learning algorithm that combines a novel constrained optimization problem to learn a realizable representation with good spectral properties with a generalized likelihood ratio test to exploit the recovered representation and avoid excessive exploration. We prove that BANDITSRL can be paired with any no-regret algorithm and achieve constant regret whenever an HLS representation is available. Furthermore, BANDITSRL can be easily combined with deep neural networks and we show how regularizing towards HLS representations is beneficial in standard benchmarks.

| File | Size | Format |
|---|---|---|
| NeurIPS-2022-scalable-representation-learning-in-linear-contextual-bandits-with-constant-regret-guarantees-Paper-Conference.pdf | 1.14 MB | Adobe PDF |

Open access. Type: post-print / accepted manuscript (version accepted by the publisher). License: Creative Commons.
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.




