IRIS Institutional Research Information System - AIR Archivio Istituzionale della Ricerca

We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than d hops to arrive, where d is a delay parameter. We introduce Exp3-Coop, a cooperative version of the Exp3 algorithm and prove that with K actions and N agents the average per-agent regret after T rounds is at most of order root(d + 1 + K/N alpha(<= d)) (T ln K), where alpha(<= d) is the independence number of the d-th power of the communication graph G. We then show that for any connected graph, for d = root K the regret bound is K-1/4 root T, strictly better than the minimax regret root KT for noncooperating agents. More informed choices of d lead to bounds which are arbitrarily close to the full information minimax regret root T ln K when G is dense. When G has sparse components, we show that a variant of Exp3-Coop, allowing agents to choose their parameters according to their centrality in G, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay

Delay and Cooperation in Nonstochastic Bandits / N. Cesa-Bianchi, C. Gentile, Y. Mansour. - In: JOURNAL OF MACHINE LEARNING RESEARCH. - ISSN 1533-7928. - 20:17(2019), pp. 1.1-1.38.

Delay and Cooperation in Nonstochastic Bandits

N. Cesa-Bianchi^Primo;C. Gentile;Y. Mansour

2019

Abstract

We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than d hops to arrive, where d is a delay parameter. We introduce Exp3-Coop, a cooperative version of the Exp3 algorithm and prove that with K actions and N agents the average per-agent regret after T rounds is at most of order root(d + 1 + K/N alpha(<= d)) (T ln K), where alpha(<= d) is the independence number of the d-th power of the communication graph G. We then show that for any connected graph, for d = root K the regret bound is K-1/4 root T, strictly better than the minimax regret root KT for noncooperating agents. More informed choices of d lead to bounds which are arbitrarily close to the full information minimax regret root T ln K when G is dense. When G has sparse components, we show that a variant of Exp3-Coop, allowing agents to choose their parameters according to their centrality in G, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay

Scheda breve

Scheda completa

Scheda completa (DC)

	Parole chiave
	
			Multi-armed bandits; distributed learning; cooperative multi-agent systems; regret minimization; LOCAL communication
		
	Settori scientifico-disciplinari dell'articolo
	
			Settore INF/01 - Informatica
		
	Data di pubblicazione
	
			2019
		
	Rivista in ANCE
	
			JOURNAL OF MACHINE LEARNING RESEARCH
		
	URL
	
			http://jmlr.org/papers/v20/17-631.html
		
	Tipologia
	
			Article (author)
		
	Appare nelle tipologie:
	
			01 - Articolo su periodico

File in questo prodotto:

File	Dimensione	Formato
17-631.pdf accesso aperto Tipologia: Publisher's version/PDF Dimensione 492.86 kB Formato Adobe PDF Visualizza/Apri	492.86 kB	Adobe PDF	Visualizza/Apri

Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/2434/623168

Citazioni

ND

4

28

social impact