Visual scene understanding is a fundamental field that underpins advanced applications such as self-driving, robotics, and AR/VR. This course focuses on deep learning-based visual scene understanding techniques from both 2D and 3D perspectives. The 2D part covers image and scene classification, semantic segmentation, and object detection and tracking. The 3D part covers how 3D scene understanding can be learned from 2D images, point clouds, or multi-modal data, with topics such as scene depth estimation, camera pose prediction, 3D scene reconstruction, and visual SLAM. Representative deep scene understanding architectures and frameworks in supervised, self-supervised, and open-world learning settings will also be introduced.