MSRL: Distributed Reinforcement Learning with Dataflow Fragments

Huanzhou Zhu, Weifeng Chen, Yaodong Yang, Bo Zhao, Yijie Chen, Gang Chen, Liang Shi, Peter Pietzuch, Lei Chen

Research output: Chapter in Book/Conference Proceeding/Report > Conference Paper published in a book > peer-review

Abstract

A wide range of reinforcement learning (RL) algorithms have been proposed, in which agents learn from interactions with a simulated environment. Executing such RL training loops is computationally expensive, but current RL systems fail to support the training loops of different RL algorithms efficiently on GPU clusters: they either hard-code algorithm-specific strategies for parallelization and distribution; or they accelerate only parts of the computation on GPUs (e.g., DNN policy updates). We observe that current systems lack an abstraction that decouples the definition of an RL algorithm from its strategy for distributed execution. We describe MSRL, a distributed RL training system that uses the new abstraction of a fragmented dataflow graph (FDG) to execute RL algorithms in a flexible way. An FDG is a heterogeneous dataflow representation of an RL algorithm, which maps functions from the RL training loop to independent parallel dataflow fragments. Fragments account for the diverse nature of RL algorithms: each fragment can execute on a different device using its own low-level dataflow implementation, e.g., an operator graph of a DNN engine, a CUDA GPU kernel, or a multi-threaded CPU process. At deployment time, a distribution policy governs how fragments are mapped to devices, without changes to the algorithm implementation. Our experiments show that MSRL exposes trade-offs between different execution strategies, while surpassing the performance of existing RL systems.
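The separation the abstract describes, between the definition of an algorithm and its distribution policy, can be illustrated with a minimal Python sketch. This is not MSRL's API: the `Fragment` class, the `deploy` and `run_once` helpers, the toy `act`/`step_env`/`learn` functions, and the device labels are all hypothetical names chosen for illustration. The sketch only shows that the same list of fragments can be paired with different placement tables without touching the algorithm code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Fragment:
    """Hypothetical stand-in for one dataflow fragment of a training loop."""
    name: str
    fn: Callable[[dict], dict]
    backend: str  # e.g. "dnn_graph", "cuda_kernel", "cpu_process" (labels only)


# Toy functions standing in for parts of an RL training loop.
def act(state: dict) -> dict:
    return {**state, "action": state["obs"] * 2}


def step_env(state: dict) -> dict:
    return {**state, "reward": float(state["action"] > 0)}


def learn(state: dict) -> dict:
    return {**state, "loss": 1.0 - state["reward"]}


# The algorithm definition: an ordered list of fragments, with no device names.
TRAINING_LOOP: List[Fragment] = [
    Fragment("actor", act, "dnn_graph"),
    Fragment("environment", step_env, "cpu_process"),
    Fragment("learner", learn, "dnn_graph"),
]

# Two illustrative distribution policies: fragment name -> device label.
SINGLE_GPU: Dict[str, str] = {"actor": "gpu:0", "environment": "cpu:0", "learner": "gpu:0"}
MULTI_GPU: Dict[str, str] = {"actor": "gpu:1", "environment": "cpu:0", "learner": "gpu:0"}


def deploy(fragments: List[Fragment], policy: Dict[str, str]) -> List[Tuple[str, Fragment]]:
    """Pair each fragment with the device the policy assigns to it."""
    missing = [f.name for f in fragments if f.name not in policy]
    if missing:
        raise KeyError(f"policy has no device for fragments: {missing}")
    return [(policy[f.name], f) for f in fragments]


def run_once(placement: List[Tuple[str, Fragment]], state: dict) -> dict:
    """Run the fragments in order; a real system would dispatch to each device."""
    for _device, fragment in placement:
        state = fragment.fn(state)
    return state


if __name__ == "__main__":
    for policy in (SINGLE_GPU, MULTI_GPU):
        placement = deploy(TRAINING_LOOP, policy)
        print([(device, f.name) for device, f in placement])
        print(run_once(placement, {"obs": 3}))
```

In this toy version the devices are only labels and every fragment runs in the same process, so both policies produce the same result. The point is structural: changing the placement means swapping a table, not editing `act`, `step_env`, or `learn`.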

Original language: English
Title of host publication: Proceedings of the 2023 USENIX Annual Technical Conference, ATC 2023
Publisher: USENIX Association
Pages: 977-993
Number of pages: 17
ISBN (Electronic): 9781939133359
Publication status: Published - 2023
Event: 2023 USENIX Annual Technical Conference, ATC 2023 - Boston, United States
Duration: 10 Jul 2023 to 12 Jul 2023

Publication series

Name: Proceedings of the 2023 USENIX Annual Technical Conference, ATC 2023

Conference

Conference: 2023 USENIX Annual Technical Conference, ATC 2023
Country/Territory: United States
City: Boston
Period: 10/07/23 to 12/07/23

Bibliographical note

Publisher Copyright:
© 2023 by The USENIX Association All Rights Reserved.
