Recovering 3D structures from 2D images and videos

  • Likang WANG

Student thesis: Doctoral thesis

Abstract

As three-dimensional beings, we perceive and interact with our world through vision and touch. Yet converting these three-dimensional perceptions into captured data is an immensely complex endeavor. Cameras can mimic our visual system, but their typical output is two-dimensional; this limited perspective leaves our understanding of the world fragmentary, like the blind men in the parable trying to grasp the shape of an elephant. Researchers have devoted decades to lifting two-dimensional images into three dimensions, yet achieving a satisfactory balance of quality and efficiency in this reconstruction remains remarkably difficult. To address this challenge, we first explore the limits of reconstruction quality. Specifically, we propose a novel coarse-to-fine strategy for scene reconstruction. The process begins by estimating a preliminary spatial location for each pixel in the image. We then introduce a self-supervised method that estimates the deviation between these tentative predictions and the ground truth. This estimate directs our effort toward the regions that need it most, enabling a thorough refinement. As a result, our approach achieves a considerable improvement in reconstruction quality within given time and memory budgets.

We subsequently explore how to obtain high-quality reconstructions while also meeting real-time inference requirements, and propose two novel solutions. First, while pursuing the best possible quality of three-dimensional scene reconstruction, we also seek to minimize total reconstruction time. To this end, we advocate a feature fusion method capable of simultaneously extracting and preserving both low-frequency and high-frequency information across video frames. This method yields significant improvements on large planar regions and in fine details without imposing any additional computational burden. Moreover, exploiting the sparsity of three-dimensional space, we propose an accurate and efficient loss-correction strategy that enables more complete scene recovery. Second, our goal shifts to recovering detail as accurately as possible while keeping inference latency as low as possible. Here we exploit semantic consistency across frames, which supports a rapid preliminary filtering of points in three-dimensional space, followed by a rigorous assessment of only a subset of spatial regions. This procedure not only ensures fast updates but also delivers markedly better detail quality. Although our novel methods advance the recovery of three-dimensional structure from two-dimensional images and videos, they are not without limitations; we therefore also discuss the shortcomings of the current work and suggest directions for future research.
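The error-guided coarse-to-fine idea described above can be sketched in miniature. This is an illustrative toy, not the thesis's actual method: the local perturbation search and the oracle error proxy (`error_fn`) are assumptions standing in for the learned, self-supervised deviation estimator.

```python
def refine_depth(coarse_depth, error_fn, budget, step=0.05):
    """Refine only the pixels whose estimated error is largest.

    coarse_depth : list of per-pixel depth guesses (a coarse reconstruction)
    error_fn     : callable giving an error proxy per pixel (hypothetical;
                   stands in for a self-supervised deviation estimate)
    budget       : how many pixels we can afford to refine
    """
    # Rank pixels by the error proxy, worst first.
    scored = sorted(range(len(coarse_depth)),
                    key=lambda i: error_fn(i, coarse_depth[i]),
                    reverse=True)
    refined = list(coarse_depth)
    for i in scored[:budget]:
        # Local search: try small perturbations and keep the best candidate.
        best = refined[i]
        best_err = error_fn(i, best)
        for delta in (-step, step):
            cand = refined[i] + delta
            if error_fn(i, cand) < best_err:
                best, best_err = cand, error_fn(i, cand)
        refined[i] = best
    return refined

# Toy data: true depths and a coarse guess; the "error" here is distance
# to the truth, which a real system would have to estimate, not observe.
true_depth = [1.0, 2.0, 3.0, 4.0]
coarse = [1.0, 2.4, 3.0, 3.6]
err = lambda i, d: abs(d - true_depth[i])
out = refine_depth(coarse, err, budget=2)
# out ≈ [1.0, 2.35, 3.0, 3.65]: only the two worst pixels were touched.
```

The point of the sketch is the budgeting: refinement cost is spent only where the deviation estimate says it is needed, which is what keeps quality gains within fixed time constraints.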
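The semantic-consistency filtering step can likewise be sketched as a cheap majority-vote pre-filter over 3D points before any expensive per-region assessment. Everything here (the data layout, the threshold, the label names) is a made-up illustration, not the thesis's implementation.

```python
from collections import Counter

def filter_points(point_labels, min_agreement):
    """Keep only 3D points whose semantic label agrees across enough frames.

    point_labels  : dict mapping point id -> list of labels observed per frame
    min_agreement : fraction of frames that must vote for the majority label
    """
    survivors = {}
    for pid, labels in point_labels.items():
        label, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) >= min_agreement:
            # Passes the cheap filter; only these points would be passed on
            # to the rigorous (expensive) per-region assessment.
            survivors[pid] = label
    return survivors

# Toy observations of four points across three frames (labels are invented).
obs = {
    0: ["wall", "wall", "wall"],     # consistent        -> kept
    1: ["chair", "chair", "table"],  # 2/3 agree         -> kept at 0.6
    2: ["floor", "wall", "chair"],   # inconsistent      -> dropped
    3: ["table", "table", "table"],  # consistent        -> kept
}
kept = filter_points(obs, min_agreement=0.6)
# kept == {0: "wall", 1: "chair", 3: "table"}
```

Discarding inconsistent points early is what allows the subsequent detailed evaluation to run over only a subset of space, keeping updates fast.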
Date of Award: 2024
Original language: English
Awarding Institution
  • The Hong Kong University of Science and Technology
Supervisor: Lei CHEN
