Eugenio Bargiacchi, Raphaël Avalos Martinez de Escobar, Timothy Verstraeten, Pieter Libin, Ann Nowé, Diederik M. Roijers

Proc. of the Adaptive and Learning Agents Workshop

In this paper, we provide PAC bounds for best-arm identification in multi-agent multi-armed bandits (MAMABs), via an algorithm we call multi-agent RMax (MARMax). In a MAMAB, the reward structure is expressed as a coordination graph, i.e., the total team reward is a sum over local reward functions that are conditioned on the actions of a subset of the agents. Assuming that these local reward functions are bounded, we derive a PAC(delta, epsilon) learning algorithm that achieves an accuracy of epsilon relative to the maximum team reward, with a number of required samples at most linear in the size of the largest local reward function, with probability 1 - delta. Furthermore, we show that in practice MARMax outperforms this theoretical upper bound on the sample complexity by an order of magnitude. We further improve on this empirical result by tempering the optimism used in MARMax, resulting in a new algorithm that we call MAVMax. We show empirically that MAVMax further cuts down the number of required samples by around 30% w.r.t. MARMax.
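The coordination-graph reward structure described in the abstract can be illustrated with a minimal sketch (not the paper's implementation; all names are hypothetical): the team reward is a sum of local reward functions, each of which depends only on the actions of the agents in its scope.

```python
# Illustrative sketch of a coordination-graph reward (hypothetical names):
# the total team reward is a sum of local reward functions, each conditioned
# only on the actions of a subset of the agents (its "scope").
from typing import Callable, Sequence, Tuple

def team_reward(
    joint_action: Sequence[int],
    local_rewards: Sequence[Tuple[Tuple[int, ...], Callable[..., float]]],
) -> float:
    """Sum each local reward function over the actions of the agents in its scope."""
    total = 0.0
    for scope, f in local_rewards:
        total += f(*(joint_action[i] for i in scope))
    return total

# Example: 3 agents, two local factors over agent subsets (0, 1) and (1, 2).
factors = [
    ((0, 1), lambda a0, a1: 1.0 if a0 == a1 else 0.0),  # coordination bonus
    ((1, 2), lambda a1, a2: 0.5 * (a1 + a2)),           # additive local payoff
]
print(team_reward([1, 1, 0], factors))  # 1.0 + 0.5 = 1.5
```

Because each factor only touches a small subset of agents, the sample complexity of learning such bounded local rewards scales with the size of the largest local reward function rather than with the exponentially large joint action space, which is the structure MARMax exploits.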

Bargiacchi, E, Avalos, R, Verstraeten, T, Libin, P, Nowé, A & Roijers, DM 2022, 'Multi-agent RMax for Multi-Agent Multi-Armed Bandits', Proc. of the Adaptive and Learning Agents Workshop, pp. 1-6. <https://ala2022.github.io/papers/ALA2022_paper_38.pdf>

Bargiacchi, E., Avalos, R., Verstraeten, T., Libin, P., Nowé, A., & Roijers, D. M. (2022). Multi-agent RMax for Multi-Agent Multi-Armed Bandits. Proc. of the Adaptive and Learning Agents Workshop, 1-6. https://ala2022.github.io/papers/ALA2022_paper_38.pdf

@inproceedings{baf3359d94924f3ba75b24bfdd697cef,
  title = "Multi-agent RMax for Multi-Agent Multi-Armed Bandits",
  abstract = "In this paper, we provide PAC bounds for best-arm identification in multi-agent multi-armed bandits (MAMABs), via an algorithm we call multi-agent RMax (MARMax). In a MAMAB, the reward structure is expressed as a coordination graph, i.e., the total team reward is a sum over local reward functions that are conditioned on the actions of a subset of the agents. Assuming that these local reward functions are bounded, we derive a PAC(delta, epsilon) learning algorithm that achieves an accuracy of epsilon relative to the maximum team reward, with a number of required samples at most linear in the size of the largest local reward function, with probability 1 - delta. Furthermore, we show that in practice MARMax outperforms this theoretical upper bound on the sample complexity by an order of magnitude. We further improve on this empirical result by tempering the optimism used in MARMax, resulting in a new algorithm that we call MAVMax. We show empirically that MAVMax further cuts down the number of required samples by around 30% w.r.t. MARMax.",
  author = "Eugenio Bargiacchi and Rapha{\"e}l Avalos and Timothy Verstraeten and Pieter Libin and Ann Now{\'e} and Roijers, {Diederik M.}",
  year = "2022",
  language = "English",
  pages = "1--6",
  booktitle = "Proc. of the Adaptive and Learning Agents Workshop",
  url = "https://ala2022.github.io/papers/ALA2022_paper_38.pdf",
}