Deep learning and convolutional neural networks have revolutionized computer vision and become a dominant tool in many applications, such as image classification, semantic segmentation, object recognition, and image retrieval. Their strength lies in the ability to learn an efficient representation of images that makes a subsequent learning task easier.

This thesis presents deep learning approaches for a number of closely related fundamental computer vision problems: image matching, image-based localization, ego-motion estimation, and scene understanding. For image matching, the thesis studies two methods that utilize a Siamese network architecture to learn both patch-level and image-level descriptors whose similarity can be measured by Euclidean distance. Next, it introduces a coarse-to-fine CNN-based approach for dense pixel correspondence estimation that leverages the advantages of optical flow methods and extends them to the case of a wide baseline between two images. The method demonstrates good generalization performance and is applicable to image matching as well as to image alignment and relative camera pose estimation.

One of the contributions of the thesis is a novel approach for recovering the absolute camera pose from ego-motion. In contrast to existing CNN-based localization algorithms, the proposed method can be applied directly to scenes that were not available at the training stage and does not require scene-specific training of the network, thus improving scalability. The thesis also shows that a Siamese architecture can be successfully utilized for relative camera pose estimation, achieving better performance than traditional image descriptors in challenging scenarios.

Lastly, the thesis demonstrates how advances in visual geometry can help to learn depth, camera ego-motion, and optical flow efficiently for the task of scene understanding.
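The descriptor-matching principle behind the Siamese contributions above can be sketched with a minimal NumPy example. This is an illustration only, not the networks from the thesis: the random linear "encoder", the descriptor size, and the contrastive margin are all placeholders standing in for learned convolutional branches.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder "shared encoder": a fixed random linear projection standing in
# for the convolutional branches of a Siamese network. Both patches pass
# through the SAME weights, which is what makes the architecture Siamese.
W = rng.standard_normal((64, 32 * 32))

def encode(patch):
    """Map a 32x32 patch to an L2-normalized 64-D descriptor."""
    d = W @ patch.ravel()
    return d / np.linalg.norm(d)

def contrastive_loss(desc_a, desc_b, same, margin=1.0):
    """Pull matching descriptors together, push non-matching ones apart."""
    dist = np.linalg.norm(desc_a - desc_b)
    if same:
        return dist ** 2
    return max(0.0, margin - dist) ** 2

patch = rng.standard_normal((32, 32))
noisy = patch + 0.01 * rng.standard_normal((32, 32))  # a matching patch
other = rng.standard_normal((32, 32))                 # a non-matching patch

d0, d1, d2 = encode(patch), encode(noisy), encode(other)
print(np.linalg.norm(d0 - d1))  # small: matching patches lie close in descriptor space
print(np.linalg.norm(d0 - d2))  # noticeably larger: non-matching patches lie far apart
```

Once training has shaped the embedding this way, nearest-neighbor search under plain Euclidean distance suffices for matching, which is what makes such descriptors convenient for retrieval and pose estimation pipelines.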
For scene understanding in particular, the thesis introduces a method that leverages temporally consistent geometric priors between the frames of a monocular video sequence to jointly estimate camera ego-motion and depth maps in a self-supervised manner.
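The self-supervision signal behind joint depth and ego-motion learning is view synthesis: a pixel, its predicted depth, and the predicted camera motion determine where that pixel reprojects in the next frame, and the photometric difference between the two locations drives training. A minimal NumPy sketch of that reprojection step follows; the intrinsics and the relative pose are made-up values, not taken from the thesis.

```python
import numpy as np

# Hypothetical pinhole intrinsics (focal length and principal point).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

def reproject(u, v, depth, R, t):
    """Reproject pixel (u, v) with the given depth from frame 1 into frame 2.

    A self-supervised loss compares the image intensity at (u, v) in
    frame 1 with the intensity at the returned location in frame 2;
    consistent depth and ego-motion predictions make them agree.
    """
    # Back-project to a 3-D point in frame 1's camera coordinates.
    p_cam1 = depth * np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Apply the relative camera motion (ego-motion) to get frame 2's coordinates.
    p_cam2 = R @ p_cam1 + t
    # Project back onto frame 2's image plane.
    uvw = K @ p_cam2
    return uvw[:2] / uvw[2]

# Made-up ego-motion: pure translation along the optical axis.
R = np.eye(3)
t = np.array([0.0, 0.0, 0.5])

print(reproject(400.0, 300.0, depth=10.0, R=R, t=t))
```

Differentiable warping of the whole image with predicted per-pixel depth and a predicted 6-DoF pose turns this geometric relation into a training loss, so no ground-truth depth or motion labels are needed.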
|Translated title||Deep Learning Methods for Image Matching and Camera Relocalization|
|Status||Published - 2020|
|Ministry of Education publication type||G5 Doctoral dissertation (article)|