Abstract

Offline reinforcement learning learns agent behaviors from a fixed dataset, without interacting with the environment. However, depending on the quality of the offline dataset, such pre-trained agents may have limited performance and may need further online fine-tuning through interaction with the environment. During online fine-tuning, the performance of the pre-trained agent may collapse quickly due to the sudden distribution shift from offline to online data. Constraints enforced by offline RL methods, such as a behavior cloning loss, prevent this collapse to an extent, but they also significantly slow down online fine-tuning by forcing the agent to stay close to the behavior policy. We propose to adaptively weight the behavior cloning loss during online fine-tuning based on the agent's performance and training stability. Moreover, we use a randomized ensemble of Q functions, which allows a large number of learning updates and thereby further increases the sample efficiency of online fine-tuning. Experiments show that the proposed method yields state-of-the-art offline-to-online reinforcement learning performance on the popular D4RL benchmark.
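The two ideas in the abstract, an adaptively weighted behavior cloning term and a randomized Q ensemble, can be illustrated with a minimal sketch. The abstract does not give the update rules, so everything below is an assumption made for illustration: the function names, the gains `k_p` and `k_d`, the clipping range, and the particular rule that raises the weight when returns fall and lowers it when returns reach a target. It is not the paper's implementation.

```python
import random


def actor_loss(q_value, policy_action, dataset_action, bc_weight):
    """Actor objective: maximize Q while penalizing distance to the dataset action."""
    # Mean squared error between the policy's action and the dataset action.
    bc_loss = sum((p - d) ** 2 for p, d in zip(policy_action, dataset_action))
    bc_loss /= len(policy_action)
    return -q_value + bc_weight * bc_loss


def update_bc_weight(bc_weight, episode_return, prev_return, target_return,
                     k_p=0.1, k_d=0.1, lo=0.0, hi=1.0):
    """Hypothetical adaptation rule (an assumption, not taken from the paper).

    The weight rises when the return is below the target or is dropping
    (unstable training) and falls when the agent performs well and improves.
    """
    scale = max(abs(target_return), 1e-8)
    performance_gap = (episode_return - target_return) / scale  # < 0 below target
    trend = (episode_return - prev_return) / scale              # < 0 when dropping
    new_weight = bc_weight - k_p * performance_gap - k_d * trend
    return min(hi, max(lo, new_weight))


def ensemble_target(next_q_values, reward, discount, subset_size=2, rng=random):
    """Bootstrapped target using the minimum over a random subset of the ensemble."""
    subset = rng.sample(next_q_values, subset_size)
    return reward + discount * min(subset)
```

In this sketch, a drop in episodic return pushes `bc_weight` up, pulling the policy back toward the dataset actions, while sustained good performance relaxes the constraint so that online fine-tuning can proceed faster. Taking the minimum over a random subset of the ensemble is one common way to limit overestimation when many updates are performed per environment step.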
Original language: English
Title of host publication: Proceedings of the European Symposium on Artificial Neural Networks, 2022
Publisher: European Symposium on Artificial Neural Networks (ESANN)
Number of pages: 6
ISBN (Electronic): 9782875870841
Publication status: Published - 2022
MoE publication type: A4 Article in a conference publication
Event: European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning - Bruges, Belgium
Duration: 5 Oct 2022 to 7 Oct 2022
Conference number: 30

Conference

Conference: European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning
Abbreviated title: ESANN
Country/Territory: Belgium
City: Bruges
Period: 05/10/2022 to 07/10/2022

Keywords

  • reinforcement learning
  • offline-to-online reinforcement learning

Title: Adaptive behavior cloning regularization for stable offline-to-online reinforcement learning
