Nonstochastic Contextual Combinatorial Bandits / L. Zierahn, D. van der Hoeven, N. Cesa-Bianchi, G. Neu (Proceedings of Machine Learning Research). - In: International Conference on Artificial Intelligence and Statistics / edited by F. Ruiz, J. Dy, J.-W. van de Meent. - [s.l.] : PMLR, 2023. - pp. 8771-8813. Conference: International Conference on Artificial Intelligence and Statistics, held in Valencia in 2023.

Nonstochastic Contextual Combinatorial Bandits

L. Zierahn (first author); D. van der Hoeven (second author); N. Cesa-Bianchi (penultimate author)
2023

Abstract

We study a contextual version of online combinatorial optimisation with full and semi-bandit feedback. In this sequential decision-making problem, an online learner has to select an action from a combinatorial decision space after seeing a vector-valued context in each round. As a result of its action, the learner incurs a loss that is a bilinear function of the context vector and the vector representation of the chosen action. We consider two natural versions of the problem: semi-bandit, where the losses are revealed for each component appearing in the learner's combinatorial action, and full-bandit, where only the total loss is observed. We design computationally efficient algorithms based on a new loss estimator that takes advantage of the special structure of the problem, and prove regret bounds of order $\sqrt{T}$ with respect to the time horizon. The bounds exhibit polynomial scaling in the relevant problem parameters, which is shown to be nearly optimal. The theoretical results are complemented by a set of experiments on simulated data.
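To make the setting described in the abstract concrete, the sketch below simulates the interaction protocol: in each round the learner observes a context vector, picks a combinatorial action represented as a binary incidence vector, and suffers a bilinear loss of the form $a_t^\top \Theta_t x_t$, with either per-component (semi-bandit) or total-only (full-bandit) feedback. This is a minimal illustration under assumed toy dimensions, a hypothetical loss matrix $\Theta_t$, and a uniform-random placeholder policy; it is not the authors' algorithm or loss estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, m = 100, 5, 8  # rounds, context dimension, number of components (toy values)
# Toy combinatorial decision space: a handful of binary incidence vectors.
action_set = [rng.integers(0, 2, size=m) for _ in range(10)]

total_loss = 0.0
for t in range(T):
    x_t = rng.uniform(-1.0, 1.0, size=d)          # context revealed to the learner
    Theta_t = rng.uniform(0.0, 1.0, size=(m, d))  # unknown loss matrix chosen by the environment
    component_losses = Theta_t @ x_t              # each component's loss is linear in the context

    # Placeholder policy: a real learner would use past feedback to pick a_t.
    a_t = action_set[rng.integers(len(action_set))]

    # Bilinear loss of the chosen combinatorial action: a_t^T Theta_t x_t.
    loss = a_t @ component_losses
    total_loss += loss

    # Semi-bandit feedback: losses of the chosen components only.
    semi_bandit_feedback = {i: component_losses[i] for i in np.flatnonzero(a_t)}
    # Full-bandit feedback: only the scalar total loss is observed.
    full_bandit_feedback = loss

print("average per-round loss of the random policy:", total_loss / T)
```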
Settore INF/01 - Informatica (Computer Science)
Funding:
  Algorithms, Games, and Digital Markets (ALGADIMAR), Ministero dell'Istruzione e del Merito, grant 2017R9FHSR_006
  European Learning and Intelligent Systems Excellence (ELISE), European Commission, grant 951847
2023
https://proceedings.mlr.press/v206/zierahn23a.html
Book Part (author)
Files in this record:
File: zierahn23a.pdf
Access: open access
Type: Publisher's version/PDF
Size: 1.03 MB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/969059
Citations: Scopus 0