Mathieu Reymond, Conor F. Hayes, Denis Steckelmacher, Diederik M. Roijers, Ann Nowe
We propose a novel multi-objective reinforcement learning algorithm that successfully learns the optimal policy even for non-linear utility functions. Non-linear utility functions pose a challenge for state-of-the-art approaches, both in terms of learning efficiency and of the solution concept. A key insight is that a critic which learns a multivariate distribution over the returns, combined with the accumulated rewards, allows us to directly optimize the utility function, even if it is non-linear. This vastly increases the range of problems that can be solved compared to single-objective methods or multi-objective methods that require linear utility functions, while avoiding the need to learn the full Pareto front. We demonstrate our method on multiple multi-objective benchmarks, and show that it learns effectively where baseline approaches fail.
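The core idea in the abstract — combine the rewards accumulated so far with a learned multivariate distribution over future returns, then evaluate the non-linear utility on the total — can be illustrated with a minimal Monte-Carlo sketch. This is not the paper's implementation: the function name `expected_utility`, the Gaussian stand-in for the critic's return distribution, and the example utility are all illustrative assumptions.

```python
import numpy as np

def expected_utility(acc_reward, mean, cov, utility, n_samples=10_000, rng=None):
    """Monte-Carlo estimate of E[u(R_acc + Z)], where Z ~ N(mean, cov)
    stands in for the critic's multivariate distribution over future returns.
    This scalar is the quantity a policy-gradient update would maximize."""
    rng = np.random.default_rng(0) if rng is None else rng
    future = rng.multivariate_normal(mean, cov, size=n_samples)  # sampled future returns
    totals = acc_reward + future           # accrued reward + sampled future returns
    return utility(totals).mean()          # expected utility of the total return

# A non-linear (multiplicative) utility over two objectives, as an example;
# a linear scalarization could not represent this preference.
u = lambda r: r[:, 0] * r[:, 1]

score = expected_utility(
    acc_reward=np.array([1.0, 2.0]),   # rewards accumulated so far
    mean=np.array([0.5, 0.5]),         # critic's predicted mean future return
    cov=0.1 * np.eye(2),               # critic's predicted return covariance
    utility=u,
)
```

Because `u` is non-linear, the expectation of the utility differs from the utility of the expected return, which is why conditioning on the accumulated reward matters for this class of problems.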
Reymond, M, Hayes, CF, Steckelmacher, D, Roijers, DM & Nowe, A 2023, 'Actor-critic multi-objective reinforcement learning for non-linear utility functions', Autonomous Agents and Multi-Agent Systems, vol. 37, no. 2, 23. https://doi.org/10.1007/s10458-023-09604-x
Reymond, M., Hayes, C. F., Steckelmacher, D., Roijers, D. M., & Nowe, A. (2023). Actor-critic multi-objective reinforcement learning for non-linear utility functions. Autonomous Agents and Multi-Agent Systems, 37(2), Article 23. https://doi.org/10.1007/s10458-023-09604-x
@article{f93c40c2c2c64d038f870f6efda010a7,
title = "Actor-critic multi-objective reinforcement learning for non-linear utility functions",
abstract = "We propose a novel multi-objective reinforcement learning algorithm that successfully learns the optimal policy even for non-linear utility functions. Non-linear utility functions pose a challenge for state-of-the-art approaches, both in terms of learning efficiency and of the solution concept. A key insight is that a critic which learns a multivariate distribution over the returns, combined with the accumulated rewards, allows us to directly optimize the utility function, even if it is non-linear. This vastly increases the range of problems that can be solved compared to single-objective methods or multi-objective methods that require linear utility functions, while avoiding the need to learn the full Pareto front. We demonstrate our method on multiple multi-objective benchmarks, and show that it learns effectively where baseline approaches fail.",
author = "Mathieu Reymond and Hayes, {Conor F.} and Denis Steckelmacher and Roijers, {Diederik M.} and Ann Nowe",
note = "Funding Information: Conor F. Hayes is funded by the University of Galway Hardiman Scholarship. This research was supported by funding from the Flemish Government under the ``Onderzoeksprogramma Artifici{\"e}le Intelligentie (AI) Vlaanderen'' program. Publisher Copyright: {\textcopyright} 2023, Springer Science+Business Media, LLC, part of Springer Nature.",
year = "2023",
month = apr,
day = "23",
doi = "10.1007/s10458-023-09604-x",
language = "English",
volume = "37",
journal = "Autonomous Agents and Multi-Agent Systems",
issn = "1387-2532",
publisher = "Springer Netherlands",
number = "2",
}