This course introduces recent advances in multimodal machine learning, with a focus on vision-language research. Major topics include multimodal translation, multimodal reasoning, multimodal alignment, multimodal information extraction, and recent deep learning techniques used in multimodal research (such as graph convolutional networks, the Transformer architecture, deep reinforcement learning, and causal inference). The course consists primarily of instructor presentations, student presentations, in-class discussions, and a course project.