Publication Details
Overview
 
 
Eugenio Bargiacchi, Bargiacchi, Eugenio, Avalos Martinez de Escobar, Raphaël, Verstraeten, Timothy, Libin, Pieter, , Diederik M. Roijers
 

Proc. of the Adaptive and Learning Agents Workshop

Contribution To Journal

Abstract 

In this paper, we provide PAC bounds for best-arm identification in multi-agent multi-armed bandits (MAMABs), via an algorithm we call multi-agent RMax (MARMax). In a MAMAB, the reward structure is expressed as a coordination graph, i.e., the total team reward is a sum over local reward functions that are conditioned on the actions of a subset of the agents. Assuming that these local rewards functions are bounded, we derive a PAC(delta, epsilon) learning algorithm that achieves an accuracy of epsilon relative to the maximum team reward, with a number of required samples at most linear in the size of the largest reward reward function with probability 1 -delta. Furthermore, we show that in practice MARMax performs much better than this theoretical upper bound on the sample complexity by an order of magnitude. We further improve on the empirical result by tempering the optimism used in MARMax, resulting in a new algorithm that we call MAVMax. We show empirically that MAVMax further cuts down the number of required samples by around 30% w.r.t. MARMax

Reference 
 
 
Link