Identifying robust and accurate visual correspondences across images, also known as image matching, has been a long-standing topic in computer vision research. In particular, image matching serves as a fundamental step in reconstructing real-world geometry from multi-view photos, and has received widespread attention from a wide variety of industrial applications, including the metaverse, AR/VR, and autonomous driving. Traditionally, image matching involves a series of discrete steps and hand-crafted algorithms. Although proven effective in general cases, their manually designed features and matching strategies are often insufficient to cope with challenging scenarios such as low-texture regions, large perspective changes, and low overlap rates.

In this thesis, we are dedicated to improving the accuracy frontier and robustness of image matching algorithms, particularly through the use of deep learning techniques. We first propose a graph neural network (GNN), which inherits the traditional keypoint-based matching scheme, to regularize matching cost by reasoning about visual similarity and matching consensus. Specifically, to avoid exhaustive interaction among image keypoints, we leverage a small set of pre-selected, relatively reliable matches, referred to as seed matches, to guide the matching of the whole keypoint set. By integrating seed matches with a series of efficient attentive operations, we show that even a very limited set of seeds provides strong cues to assist the matching of other keypoints. Through comprehensive experiments, we demonstrate that our approach achieves competitive performance compared with state-of-the-art GNN-based matchers while maintaining modest computational costs.

Moving beyond keypoint-based matching, we then present an end-to-end Transformer-based matcher that works directly on raw image pairs and skips the step of keypoint detection.
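The seed-guided idea above can be illustrated with a minimal sketch. The function names and the plain dot-product scoring below are hypothetical simplifications, not the thesis implementation: each keypoint attends only to the small seed subset, reducing interaction cost from O(N²) to O(N·S) for N keypoints and S seeds.

```python
import math

def dot(a, b):
    """Plain dot product between two feature vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def seeded_attention(features, seed_indices):
    """Hypothetical seeded attention sketch: every keypoint feature
    aggregates information only from the pre-selected seed features,
    weighted by softmax-normalized similarity (O(N*S), not O(N^2))."""
    seeds = [features[i] for i in seed_indices]
    out = []
    for f in features:
        weights = softmax([dot(f, s) for s in seeds])
        # weighted sum of seed features, dimension by dimension
        agg = [sum(w * s[d] for w, s in zip(weights, seeds))
               for d in range(len(f))]
        out.append(agg)
    return out
```

In practice the thesis uses learned attentive operations rather than raw dot products, but the cost structure is the same: the seed set acts as a compact relay through which all keypoints exchange matching evidence.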
To tackle the quadratic complexity caused by the dense attention of the vanilla Transformer, we propose a global-local attention framework that ensures both global long-range interaction and local fine-level interaction. Specifically, instead of fixing the local attention span at a constant size, we adjust it according to learned matching uncertainty, which balances matching coverage and interaction granularity in an adaptive way. Through comprehensive evaluation, we show that the designed attention framework significantly improves the quality of the obtained matches and boosts the accuracy of camera pose estimation. In particular, we outperform counterparts that also adopt efficient Transformer designs by a large margin.

Finally, taking one step further from our previous work, we propose a geometry-aware deformable attention to enhance local attention in Transformer-based matchers. To better model the ubiquitous local deformations caused by viewpoint changes, we estimate a patch-wise parametric deformation field from intermediate matching results, which is used to shape the local attention pattern. Through this design, we embed deformation priors into the matching process in a principled and intuitive manner. Experiments show that our design considerably improves the effectiveness of the global-local attention framework and produces high-quality visual correspondences for geometry estimation tasks.

With intensive investigation and innovation, we aspire to further advance the performance of image matching for geometric estimation tasks and empower a wider range of 2D and 3D applications.
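The uncertainty-adaptive local span can be sketched as follows. This is an illustrative 1-D toy, not the thesis implementation; the mapping, the span bounds, and both function names are assumptions made for the example. An uncertain region receives a wider attention window (more coverage), while a confident region receives a narrower one (finer granularity):

```python
def adaptive_span(uncertainty, min_span=3, max_span=15):
    """Map a matching-uncertainty score in [0, 1] to a local attention
    window size (hypothetical linear scheme). High uncertainty widens
    the span for coverage; low uncertainty narrows it for granularity."""
    span = min_span + uncertainty * (max_span - min_span)
    # round to the nearest odd size so the window stays centred on the query
    span = int(round(span))
    return span if span % 2 == 1 else span + 1

def local_window(center, span, length):
    """Indices of a 1-D local attention window, clipped to the sequence."""
    half = span // 2
    lo = max(0, center - half)
    hi = min(length, center + half + 1)
    return list(range(lo, hi))
```

In a 2-D matcher the same logic would apply per patch, with the learned uncertainty coming from intermediate matching confidence rather than being supplied directly.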
| Date of Award | 2023 |
|---|---|
| Original language | English |
| Awarding Institution | The Hong Kong University of Science and Technology |
| Supervisor | Long QUAN (Supervisor) |
Learning visual correspondences for geometry recovery
CHEN, H. (Author). 2023
Student thesis: Doctoral thesis