2024-25 Spring - COMP4901Y - Large-Scale Machine Learning for Foundation Models

Course

Description

In recent years, foundation models have fundamentally advanced the state of the art in artificial intelligence. As a result, the computation involved in training and inference for foundation models is arguably one of the most important workflows running on modern computer systems. This course examines how to deploy such workflows efficiently from a systems perspective. Specifically, we will i) explain how a modern machine learning system (i.e., PyTorch) works; ii) analyze the performance bottlenecks of machine learning computation on modern hardware (e.g., Nvidia GPUs); iii) discuss four main parallel strategies in foundation model training (data, pipeline, tensor model, and optimizer parallelism); and iv) cover the real-world deployment of foundation models, including efficient inference and fine-tuning.
Course period: 1/02/25 to 30/06/25
Course level: UG
Course format: Lecture