Abstract
The increasing scale of modern machine learning (ML) models has created immense demand for computational resources. ML workloads are highly distributed and parallelized, with intricate communication patterns, and existing computing abstractions and mechanisms struggle to manage and execute them efficiently. This thesis addresses these challenges by designing a comprehensive system to optimize ML workloads, built upon SING, an infrastructure for GPU clusters. SING provides efficient resource allocation and scheduling in multi-tenant environments, serving as a foundation for enhanced scalability, usability, and resource utilization.
Building on this foundation, three complementary systems are introduced: GREEN, a carbon-efficient scheduler that aligns ML workloads with low-carbon energy periods, reducing environmental impact while maintaining performance; Sequoia, a compiler framework that optimizes distributed data processing for ML applications by simplifying development and minimizing communication overhead; and G3, a scalable system for graph neural network training that introduces hybrid parallelism, locality-aware partitioning, and multi-level pipelines to enable efficient processing of billion-edge graphs.
Together, these contributions improve the performance, sustainability, and scalability of ML systems, addressing critical infrastructure gaps and advancing the state of the art in ML system design. This work provides the foundation for meeting the demands of next-generation AI applications in research and industry.
| Date of Award | 2025 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Kai CHEN (Supervisor) |