Rack-scale multi-domain optical networks for high-performance computing systems

  • Peng YANG

Student thesis: Doctoral thesis

Abstract

Rack-scale computing systems are expected to meet the computation and energy requirements of big data and emerging large-scale applications. They need to efficiently coordinate both on-chip and off-chip resources across hundreds of multi-core processors and memory/storage. The intra-chip and inter-chip communication networks are critical to coordination efficiency and overall system performance. Optical interconnects are a promising way to address these challenges because they outperform electrical interconnects in bandwidth, latency, and energy consumption. To this end, in this dissertation we investigate several design concerns of optical networks, including intra/inter-chip optical network architecture, path reservation, and control in multi-domain circuit switching. To take advantage of optical interconnects for both intra-chip and inter-chip communication and to close the performance gap between on-chip and off-chip networks, we propose a unified intra/inter-chip optical network for multi-chip/socket systems on a motherboard, called SUPERB. SUPERB achieves a clear performance improvement over a traditional electrical mesh design. We also propose an intra/inter-chip optical network architecture for rack-scale computing systems (RSON) that provides low-latency, high-bandwidth interconnect services. Inter-chip communication flows and circuit-switching control in optical networks can cause severe performance degradation if not properly designed, especially when multiple domains are involved in the communication. We propose a forward propagation strategy that runs the path reservation process in parallel with the application-level inter-chip connection setup for the underlying optical network fabric, which optimizes the connection setup and path reservation procedure. We also propose a preemptive chain feedback (PCF) scheme to minimize multi-domain path reservation overheads.
The PCF scheme preemptively allocates network resources with the help of a multi-cell reservation window and quickly releases unused paths through a feedback mechanism. This increases network resource utilization while minimizing path reservation overheads. The proposed architectures and techniques are holistically evaluated with a cycle-accurate full-system simulator driven by statistical application models. Experimental results show that RSON achieves up to 5.4X higher performance at the same energy consumption than state-of-the-art InfiniBand-interconnected 64-server-node rack systems. Moreover, under synthetic traffic, PCF further improves RSON network throughput by 80% over the baseline handshake control scheme while maintaining good scalability. Realistic benchmark results show that the PCF scheme reduces energy consumption per unit performance by 60% on average compared to the handshake scheme for emerging dense 256-server-node systems.
Date of Award: 2018
Original language: English
Awarding Institution
  • The Hong Kong University of Science and Technology
