Two-Step Offline Preference-Based Reinforcement Learning with Constrained Actions

Abstract: Offline preference-based reinforcement learning (PBRL) has seen great success in industrial applications such as chatbots. A two-step learning framework, in which a reward modeling step is followed by a reinforcement learning step, has been widely adopted for this problem. However, such a method faces challenges from the risk of reward hacking and the complexity of reinforcement learning. Our insight is that both challenges stem from state-actions not supported in the dataset: the learned reward is unreliable on such state-actions, and they increase the complexity of the reinforcement learning problem in the second step. Based on this insight, we develop a novel two-step learning method called PRC: preference-based reinforcement learning with constrained actions. The high-level idea is to restrict the reinforcement learning agent to a constrained action space that excludes out-of-distribution state-actions. We empirically verify that our method achieves high learning efficiency on various datasets in robotic control environments.
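The two-step pipeline the abstract describes can be sketched concretely. The toy sketch below (Python/NumPy) is illustrative only and is not the paper's PRC algorithm: the linear Bradley-Terry reward model, the synthetic preference labels, and the nearest-neighbor rule that restricts candidate actions to those seen in the dataset are all assumptions made for this example. It only shows the general pattern of fitting a reward model from preference comparisons and then improving the policy while querying that model only on in-support state-actions.

```python
# Minimal sketch of a two-step offline preference-based RL pipeline.
# The dataset, reward model, and constraint mechanism are illustrative
# assumptions for this example, not the paper's PRC algorithm.
import numpy as np

rng = np.random.default_rng(0)

# Offline dataset: (state, action) pairs collected by some behavior policy.
STATE_DIM, ACTION_DIM, N = 4, 2, 512
states = rng.normal(size=(N, STATE_DIM))
actions = rng.normal(size=(N, ACTION_DIM))

def features(s, a):
    # Joint state-action features for a linear reward model.
    return np.concatenate([s, a], axis=-1)

# --- Step 1: reward modeling from pairwise preferences (Bradley-Terry style).
w = np.zeros(STATE_DIM + ACTION_DIM)          # reward model parameters

def reward(s, a):
    return features(s, a) @ w

# Toy preference labels generated from a hidden "true" reward (illustration only).
w_true = rng.normal(size=STATE_DIM + ACTION_DIM)
idx_a, idx_b = rng.integers(0, N, size=(2, 2000))
prefs = (features(states[idx_a], actions[idx_a]) @ w_true
         > features(states[idx_b], actions[idx_b]) @ w_true).astype(float)

lr = 0.1
for _ in range(200):
    # Logistic (Bradley-Terry) fit: P(a preferred over b) = sigmoid(r(a) - r(b)).
    diff = features(states[idx_a], actions[idx_a]) - features(states[idx_b], actions[idx_b])
    p = 1.0 / (1.0 + np.exp(-diff @ w))
    grad = diff.T @ (prefs - p) / len(prefs)
    w += lr * grad

# --- Step 2: policy improvement over a constrained action space.
# Constraint: for a query state, only score actions observed at nearby
# dataset states, so the learned reward is never queried on
# out-of-distribution state-actions.
def constrained_greedy_action(s, k=16):
    dists = np.linalg.norm(states - s, axis=1)
    candidates = actions[np.argsort(dists)[:k]]   # in-support candidate actions
    scores = np.array([reward(s, a) for a in candidates])
    return candidates[np.argmax(scores)]

print(constrained_greedy_action(rng.normal(size=STATE_DIM)))
```

In this sketch the constraint is a simple nearest-neighbor filter over the dataset; the point is only that both the reward-hacking risk and the difficulty of the second-step RL problem shrink when the learner is never asked to evaluate state-actions outside the data support.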

Submission history

From: Yinglun Xu
[v1] Sat, 30 Dec 2023 21:37:18 UTC (7,056 KB)
[v2] Wed, 23 Oct 2024 19:38:34 UTC (7,533 KB)
[v3] Fri, 25 Oct 2024 17:31:50 UTC (7,533 KB)


