We introduce a principled method for performing zero-shot transfer in reinforcement learning (RL) by exploiting approximate models of the environment. Zero-shot transfer in RL has been investigated by leveraging methods rooted in generalized policy improvement (GPI) and successor features (SFs). Although computationally efficient, these methods are model-free: they analyze a library of policies---each solving a particular task---and identify which action the agent should take. We investigate the more general setting where, in addition to a library of policies, the agent has access to an approximate environment model. Even though model-based RL algorithms can identify near-optimal policies, they are typically computationally intensive. We introduce h-GPI, a multi-step extension of GPI that interpolates between these extremes---standard model-free GPI and fully model-based planning---as a function of a parameter, h, regulating the amount of time the agent has to reason. We prove that h-GPI's performance lower bound is strictly better than GPI's, and show that h-GPI generally outperforms GPI as h increases. Furthermore, we prove that as h increases, h-GPI's performance becomes arbitrarily less susceptible to sub-optimality in the agent's policy library. Finally, we introduce novel bounds characterizing the gains achievable by h-GPI as a function of approximation errors in both the agent's policy library and its (possibly learned) model. These bounds strictly generalize those known in the literature. We evaluate h-GPI on challenging tabular and continuous-state problems under value function approximation and show that it consistently outperforms GPI and state-of-the-art competing methods under various levels of approximation errors.
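The interpolation the abstract describes has a simple operational form: plan h steps ahead with the approximate model, and score the resulting leaf states with standard GPI over the policy library. The sketch below is a minimal illustration of that idea, not the authors' implementation; it assumes a deterministic approximate model, a discrete action space, and hypothetical callables `model(s, a)` (returning a predicted reward and next state) and `q_library` (a list of per-policy Q-functions, each mapping a state to a vector of action values). With h = 0 it reduces to ordinary model-free GPI.

```python
import numpy as np

def h_gpi_action(s, model, q_library, h, n_actions, gamma=0.99):
    """Illustrative h-GPI action selection (a sketch, not the paper's code).

    q_library: list of Q-functions; q(s) returns a length-n_actions array.
    model(s, a): hypothetical approximate model returning (reward, next_state);
    assumed deterministic here for simplicity.
    h = 0 recovers standard model-free GPI; larger h relies more on the model.
    """
    def gpi_value(state):
        # Leaf evaluation: best action value achievable by any library policy.
        return max(q(state).max() for q in q_library)

    def lookahead(state, depth):
        if depth == 0:
            return gpi_value(state)
        # Expand every action under the approximate model; back up the max.
        returns = []
        for a in range(n_actions):
            r, s_next = model(state, a)
            returns.append(r + gamma * lookahead(s_next, depth - 1))
        return max(returns)

    if h == 0:
        # Standard GPI: act greedily w.r.t. the max over library Q-functions.
        q_vals = np.max([q(s) for q in q_library], axis=0)
        return int(np.argmax(q_vals))

    # h-step model-based lookahead with GPI values at the leaves.
    root_scores = []
    for a in range(n_actions):
        r, s_next = model(s, a)
        root_scores.append(r + gamma * lookahead(s_next, h - 1))
    return int(np.argmax(root_scores))
```

Under these assumptions, increasing h trades per-decision compute for lookahead under the (possibly inaccurate) model, matching the abstract's framing of h-GPI as interpolating between model-free GPI (h = 0) and fully model-based planning (large h).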
Nunes Alegre, L, Bazzan, A, Nowe, A & Castro da Silva, B 2023, Multi-Step Generalized Policy Improvement by Leveraging Approximate Models. in Advances in Neural Information Processing Systems 36 (NeurIPS 2023). vol. 36, Curran Associates, Inc., pp. 38181-38205, Advances in Neural Information Processing Systems, New Orleans, United States, 10/12/23. <https://proceedings.neurips.cc/paper_files/paper/2023/file/77c7faab15002432ba1151e8d5cc389a-Paper-Conference.pdf>
Nunes Alegre, L., Bazzan, A., Nowe, A., & Castro da Silva, B. (2023). Multi-Step Generalized Policy Improvement by Leveraging Approximate Models. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023) (Vol. 36, pp. 38181-38205). Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2023/file/77c7faab15002432ba1151e8d5cc389a-Paper-Conference.pdf
@inproceedings{e9d92e28fcc444a3a7edc51e6c8e2fe6,
title = "Multi-Step Generalized Policy Improvement by Leveraging Approximate Models",
abstract = "We introduce a principled method for performing zero-shot transfer in reinforcement learning (RL) by exploiting approximate models of the environment. Zero-shot transfer in RL has been investigated by leveraging methods rooted in generalized policy improvement (GPI) and successor features (SFs). Although computationally efficient, these methods are model-free: they analyze a library of policies---each solving a particular task---and identify which action the agent should take. We investigate the more general setting where, in addition to a library of policies, the agent has access to an approximate environment model. Even though model-based RL algorithms can identify near-optimal policies, they are typically computationally intensive. We introduce h-GPI, a multi-step extension of GPI that interpolates between these extremes---standard model-free GPI and fully model-based planning---as a function of a parameter, h, regulating the amount of time the agent has to reason. We prove that h-GPI's performance lower bound is strictly better than GPI's, and show that h-GPI generally outperforms GPI as h increases. Furthermore, we prove that as h increases, h-GPI's performance becomes arbitrarily less susceptible to sub-optimality in the agent's policy library. Finally, we introduce novel bounds characterizing the gains achievable by h-GPI as a function of approximation errors in both the agent's policy library and its (possibly learned) model. These bounds strictly generalize those known in the literature. We evaluate h-GPI on challenging tabular and continuous-state problems under value function approximation and show that it consistently outperforms GPI and state-of-the-art competing methods under various levels of approximation errors.",
keywords = "Reinforcement Learning",
author = "{Nunes Alegre}, Lucas and Ana Bazzan and Ann Nowe and {Castro da Silva}, Bruno",
year = "2023",
language = "English",
volume = "36",
pages = "38181--38205",
booktitle = "Advances in Neural Information Processing Systems 36 (NeurIPS 2023)",
publisher = "Curran Associates, Inc.",
note = "Advances in Neural Information Processing Systems: https://neurips.cc/virtual/2023/index.html, NeurIPS 2023; Conference date: 10-12-2023 through 16-12-2023",
}