Value-based reinforcement-learning algorithms provide state-of-the-art results in model-free discrete-action settings, and tend to outperform actor-critic algorithms. We argue that actor-critic algorithms are limited by their need for an on-policy critic. We propose Bootstrapped Dual Policy Iteration (BDPI), a novel model-free reinforcement-learning algorithm for continuous states and discrete actions, with an actor and several off-policy critics. Off-policy critics are compatible with experience replay, ensuring high sample-efficiency, without the need for off-policy corrections. The actor, by slowly imitating the average greedy policy of the critics, leads to high-quality and state-specific exploration, which we compare to Thompson sampling. Because the actor and critics are fully decoupled, BDPI is remarkably stable, and unusually robust to its hyper-parameters. BDPI is significantly more sample-efficient than Bootstrapped DQN, PPO, and ACKTR, on discrete, continuous and pixel-based tasks.
Steckelmacher, D, Plisnier, H, Roijers, D & Nowe, A 2020, Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics. in U Brefeld, E Fromont, A Hotho, A Knobbe, M Maathuis & C Robardet (eds), Lecture Notes in Artificial Intelligence: Machine Learning and Knowledge Discovery in Databases (ECML-PKDD proceedings), volume III. vol. 11908, 48, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 11908 LNAI, Springer, pp. 19-34, European Conference on Machine Learning 2019, Wurzburg, Bavaria, Germany, 16/09/19. https://doi.org/10.1007/978-3-030-46133-1_2
Steckelmacher, D., Plisnier, H., Roijers, D., & Nowe, A. (2020). Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics. In U. Brefeld, E. Fromont, A. Hotho, A. Knobbe, M. Maathuis, & C. Robardet (Eds.), Lecture Notes in Artificial Intelligence: Machine Learning and Knowledge Discovery in Databases (ECML-PKDD proceedings), volume III (Vol. 11908, pp. 19-34). Article 48 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 11908 LNAI). Springer. https://doi.org/10.1007/978-3-030-46133-1_2
@inproceedings{0a60087b9a5a483c8b2eea0a7fd027ac,
title = "Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics",
abstract = "Value-based reinforcement-learning algorithms provide state-of-the-art results in model-free discrete-action settings, and tend to outperform actor-critic algorithms. We argue that actor-critic algorithms are limited by their need for an on-policy critic. We propose Bootstrapped Dual Policy Iteration (BDPI), a novel model-free reinforcement-learning algorithm for continuous states and discrete actions, with an actor and several off-policy critics. Off-policy critics are compatible with experience replay, ensuring high sample-efficiency, without the need for off-policy corrections. The actor, by slowly imitating the average greedy policy of the critics, leads to high-quality and state-specific exploration, which we compare to Thompson sampling. Because the actor and critics are fully decoupled, BDPI is remarkably stable, and unusually robust to its hyper-parameters. BDPI is significantly more sample-efficient than Bootstrapped DQN, PPO, and ACKTR, on discrete, continuous and pixel-based tasks.",
author = "Denis Steckelmacher and Helene Plisnier and Diederik Roijers and Ann Nowe",
note = "European Conference on Machine Learning 2019, ECMLPKDD; Conference date: 16-09-2019 through 20-09-2019",
year = "2020",
doi = "10.1007/978-3-030-46133-1_2",
language = "English",
isbn = "978-3-030-46132-4",
volume = "11908",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer",
pages = "19--34",
editor = "Ulf Brefeld and Elisa Fromont and Andreas Hotho and Arno Knobbe and Marloes Maathuis and C{\'e}line Robardet",
booktitle = "Lecture Notes in Artificial Intelligence: Machine Learning and Knowledge Discovery in Databases (ECML-PKDD proceedings), volume III",
url = "https://ecmlpkdd2019.org/",
}