TY - JOUR
T1 - Load Balancing with Multi-Level Signals for Lossless Datacenter Networks
AU - Hu, Jinbin
AU - Zeng, Chaoliang
AU - Wang, Zilong
AU - Zhang, Junxue
AU - Guo, Kun
AU - Xu, Hong
AU - Huang, Jiawei
AU - Chen, Kai
N1 - Publisher Copyright: © 1993-2012 IEEE.
PY - 2024/6/1
Y1 - 2024/6/1
AB - Various datacenter network (DCN) load balancing schemes have been proposed in the past decade. Unfortunately, most of these solutions were designed for lossy DCNs and do not work well in Priority Flow Control (PFC) enabled lossless DCNs, primarily because the individual congestion signals they use, e.g., link load, queue length, Round Trip Time (RTT) and Explicit Congestion Notification (ECN), may not reflect hop-by-hop PFC pausing accurately or in time. This paper first reveals these problems via extensive experiments and then, based on the insights learned, presents Proteus, a PFC-aware load balancing scheme that is resilient to PFC pausing by exploiting a combination of multi-level congestion signals. At its heart, Proteus leverages RTT-level signals (i.e., RTT and link utilization) to detect path status for the initial routing decision, and exploits a sub-RTT-level signal (i.e., cumulative sojourn time) to reflect instantaneous PFC pausing and make timely rerouting choices based on the idea of better-late-than-never. We have implemented Proteus on a hardware programmable switch. Our testbed experiments and large-scale simulations show that Proteus effectively handles PFC pausing under realistic workloads, improving average flow completion time (FCT) by up to 35%, 31%, 28% and 22%, and 99th percentile FCT by up to 46%, 42%, 34% and 29%, compared with CONGA, DRILL, Hermes and MP-RDMA, respectively.
KW - Datacenter
KW - load balancing
KW - lossless networks
UR - https://www.webofscience.com/wos/woscc/full-record/WOS:001178984000001
UR - https://openalex.org/W4392024151
UR - https://www.scopus.com/pages/publications/85186075472
U2 - 10.1109/TNET.2024.3366336
DO - 10.1109/TNET.2024.3366336
M3 - Journal Article
SN - 1063-6692
VL - 32
SP - 2736
EP - 2748
JO - IEEE/ACM Transactions on Networking
JF - IEEE/ACM Transactions on Networking
IS - 3
ER -