Contributions to Offline Reinforcement Learning: Mitigating Distribution Shift and Enhancing Goal-Conditioned Learning

  • Jing ZHANG

Student thesis: Doctoral thesis

Abstract

Reinforcement learning (RL) has achieved strong performance in manipulation and goal-conditioned tasks. However, its reliance on costly environment interactions limits broader applicability. Offline RL mitigates this issue by learning policies from pre-collected datasets, thereby extending the reach of RL. Nonetheless, the absence of online interaction introduces critical challenges, most notably distribution shift and extrapolation errors during policy improvement, which hinder the effectiveness of standard RL methods. This thesis provides a concise overview of these fundamental challenges and analyzes the key factors that must be addressed. We propose solutions to two central problems: (1) accurately estimating the behavior policy’s density and (2) managing uncertainty in the Q-value function. Precise density estimation is essential for controlling distribution shift between the behavior policy and the learned policy. To this end, we propose a flow-based GAN in which the generator models the behavior policy’s density explicitly, enabling direct support-based control of distributional consistency. To mitigate extrapolation errors in the Q-function, we emphasize reliable uncertainty estimation. Our approach samples from the behavior policy’s Q-value distribution, learned via an efficient, high-fidelity consistency model. Beyond general offline RL, we also address challenges specific to goal-conditioned tasks. From a probabilistic graphical model perspective, we argue that performance failures often stem from the inability of terminal-only rewards to propagate effectively to earlier states in long-horizon tasks. To address this, we propose a reward stimulation method that injects signals at key waypoints along the trajectory. This preserves reward propagation without requiring explicit waypoint prediction, which is especially challenging in high-dimensional, continuous control environments.
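As an illustrative sketch only (not the thesis's implementation), the support-based distribution-consistency idea can be read as restricting policy improvement to actions that the learned behavior density judges to be in-support. All names here (`behavior_log_density`, `q_value`, the toy densities and threshold) are hypothetical stand-ins; in the thesis the density would come from the flow-based GAN generator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in: behavior data was concentrated around action 0.5.
def behavior_log_density(state, actions):
    return -0.5 * ((actions - 0.5) / 0.1) ** 2

# Hypothetical stand-in: a learned Q that (erroneously) prefers
# out-of-support actions near 1.0, mimicking extrapolation error.
def q_value(state, actions):
    return -np.abs(actions - 1.0)

def select_action(state, num_candidates=256, log_density_threshold=-2.0):
    """Support-constrained greedy selection: keep only candidate actions whose
    behavior log-density exceeds a threshold, then maximize Q over that set."""
    candidates = rng.uniform(-1.0, 1.0, size=num_candidates)
    on_support = candidates[behavior_log_density(state, candidates) > log_density_threshold]
    if on_support.size == 0:  # fall back to the densest candidate
        on_support = candidates[[np.argmax(behavior_log_density(state, candidates))]]
    return on_support[np.argmax(q_value(state, on_support))]

a = select_action(state=None)
```

With this toy threshold, the support is roughly the interval (0.3, 0.7), so the selected action sits near 0.7 rather than at the out-of-support Q-maximizer near 1.0.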
The proposed methods are supported by both theoretical analysis and empirical validation.
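A minimal sketch of the reward-stimulation idea on a toy 1-D trajectory. The thesis avoids explicit waypoint prediction; this illustration assumes known waypoints purely to show the injected intermediate signal, and the function name, radius, and bonus value are hypothetical.

```python
import numpy as np

def stimulate_rewards(states, rewards, waypoints, radius=0.5, bonus=1.0):
    """Add a bonus whenever a state passes within `radius` of a waypoint,
    so reward signal exists at intermediate steps instead of only at the
    terminal state of a long-horizon trajectory."""
    states = np.asarray(states, dtype=float)
    rewards = np.asarray(rewards, dtype=float).copy()
    for w in np.asarray(waypoints, dtype=float):
        near = np.linalg.norm(states - w, axis=-1) < radius
        rewards[near] += bonus
    return rewards

# Terminal-only reward, stimulated at a mid-trajectory waypoint.
states = np.array([[0.0], [1.0], [2.0], [3.0]])
rewards = np.array([0.0, 0.0, 0.0, 10.0])
shaped = stimulate_rewards(states, rewards, waypoints=[[2.0]])  # → [0, 0, 1, 10]
```

The injected bonus at the third state gives value estimates a nonzero target well before the terminal step, which is the propagation effect the abstract describes.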

Date of Award: 2025
Original language: English
Awarding Institution:
  • The Hong Kong University of Science and Technology
Supervisors: Molong DUAN & Wenjia WANG
