Publication Details
Steckelmacher, Denis



Reinforcement Learning allows an artificial agent to learn how to perform a task, given only sensory inputs and rewards to be maximized. Reinforcement Learning has been successfully applied in several industrial settings, such as energy production and management, transportation, networks, banking, and health care. However, a challenge that prevents its widespread deployment in many real-world settings is the need for the agent to learn. When no simulator for a task is available, the agent must learn on the physical system or the user-facing software. In this case, any mistake made by the agent while learning, due to its still incomplete knowledge of the task or its inherent need to explore, may have costly effects. In this thesis, we consider real-world tasks that may benefit from Reinforcement Learning, and for which no simulator is available. We focus on sample-efficiency, which measures how few interactions with the environment the agent needs in order to learn the task. We believe that with high sample-efficiency (fast learning), few errors are made, leading to an acceptable cost of training the agent. We follow a research line that starts with theoretical insights and algorithms, and aims towards practical applications on physical platforms. We review several families of Reinforcement Learning algorithms, then introduce our first contribution, the model-free actor-critic Bootstrapped Dual Policy Iteration algorithm (BDPI). Thanks to its use of off-policy critics and an explicit actor, BDPI achieves high-quality exploration and sample-efficiency. Our second theoretical contribution is a formalism, based on the Options framework, that allows an agent to learn partially-observable tasks (with a discrete memory) in a sample-efficient way.
Finally, we focus on a real-world robot, a motorized wheelchair controlled with a joystick, and propose extensions to BDPI that enable it to (safely) learn a complex navigation task directly on the wheelchair, without a model or pre-training, in about one hour of wall-clock time. We carefully evaluate our algorithms on numerous simulated tasks and on two robotic platforms, one of which is the wheelchair mentioned above. In all our experiments, our algorithms outperform state-of-the-art approaches. Our results demonstrate the applicability of Reinforcement Learning to real-world settings where sample-efficiency is critical. In addition to our scientific results, we present numerous implementation details and general introductions to algorithms. We hope that our original algorithms, and the way we present this thesis, will allow a wide range of organizations and companies to deploy Reinforcement Learning in challenging settings, leading to increased value.