Thompson Sampling for Factored Multi-Agent Bandits
Host Publication: The 19th International Conference on Autonomous Agents and Multi-Agent Systems
Authors: T. Verstraeten, E. Bargiacchi, P. Libin, D. M Roijers and A. Nowé
Publication Date: May. 2020
Number of Pages: 2
Multi-agent decision making is prevalent in many real-world applications, such as wind farm control , traffic light control  and warehouse commissioning . In these settings, agents need to cooperate to maximize a shared team reward . Coordination in multi-agent settings is challenging, due to the combinatorial increase in terms of the number of agents. Therefore, it is computationally intractable to consider all agents' actions jointly. Fortunately, in many real-world settings, an agent is only directly influenced by a small subset of neighboring agents. In this case, the team reward can be factorized over the groups of agents that influence each other. Such a sparse factorization must be exploited to keep multi-agent decision problems tractable. In this work, we consider learning to coordinate in multi-agent multi-armed bandit problems (Section 1), and propose the Multi-Agent Thompson Sampling (MATS) algorithm (Section 2), which exploits sparse interactions in multi-agent systems. We assume that the groups of interacting agents are known beforehand, which is often the case in real-world applications. Our method uses the exploration-exploitation mechanism of Thompson sampling (TS). TS has been shown to achieve high empirical performance . Moreover, TS is a Bayesian method, which allows for the specification of prior knowledge through belief distributions. We argue that this is an important property to have in many practical applications, such as influenza mitigation [7, 8]. We compare MATS against Sparse Cooperative Q-Learning (SCQL) and Multi-Agent Upper Confidence Exploration (MAUCE) on two synthetic settings, i.e., Bernoulli 0101-Chain and Gem Mining (Section 3). MATS improves upon the state of the art with respect to accuracy, learning speed and computational speed.