Learning Latent Image Representations with Prior Knowledge

Yuxin Hou

Research output: Thesis > Doctoral Thesis > Collection of Articles


Deep learning has become a dominant tool in many computer vision applications because it is highly effective at extracting low-dimensional latent representations from images. However, although prior knowledge is already available for many applications, most existing methods learn image representations from large-scale training data in a black-box manner, which limits interpretability and controllability. This thesis explores approaches that integrate different types of prior knowledge into deep neural networks. Instead of learning image representations from scratch, leveraging prior knowledge in the latent space can softly regularize training and yield more controllable representations.

The models presented in the thesis mainly address three problems:

(i) How to encode epipolar geometry in deep learning architectures for multi-view stereo. The key to multi-view stereo is finding matching correspondences across images. In this thesis, a learning-based method inspired by the classical plane sweep algorithm is studied. The method aims to improve correspondence matching in two ways: obtaining better correspondence candidates with a novel plane sampling strategy, and learning multiplane representations instead of using hand-crafted cost metrics.

(ii) How to capture the correlations of input data in the latent space. The thesis explores several methods that introduce Gaussian processes in the latent space to encode view priors. Depending on how much information about the relative motion between frames is available, a hierarchy of three covariance functions is presented as Gaussian process priors, and correlated latent representations are obtained via latent nonparametric fusion. Experimental results show that the correlated representations lead to more temporally consistent predictions for depth estimation, and that they can also be applied to generative models to synthesize images from new views.

(iii) How to use known factors of variation to learn disentangled representations. Equivariant representations are studied for novel view synthesis, and factorized representations for interactive fashion retrieval.

In summary, this thesis presents three types of solutions that use prior domain knowledge to learn more powerful image representations. For depth estimation, the presented methods integrate multi-view geometry into the deep neural network. For image sequences, the correlated representations obtained from inter-frame reasoning yield more consistent and stable predictions. The disentangled representations provide explicit, flexible control over specific known factors of variation.
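
As a rough illustration of the correlated-latent idea in (ii), the following minimal Python sketch fuses per-frame latent codes with a Gaussian process posterior mean. It is not the thesis's implementation: the squared-exponential kernel over frame time stamps, the function names `rbf_kernel` and `fuse_latents`, and all parameter values are assumptions chosen for illustration, whereas the thesis describes a hierarchy of three covariance functions tied to the relative motion between frames.

```python
import numpy as np

def rbf_kernel(t, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between frame time stamps (assumed kernel)."""
    d = t[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def fuse_latents(Z, t, noise_var=0.1, lengthscale=1.0):
    """GP posterior mean of latent codes; Z has shape (n_frames, latent_dim).

    Each latent dimension is treated as an independent GP over frames,
    and the per-frame encoder outputs are treated as noisy observations.
    """
    K = rbf_kernel(t, lengthscale)
    # Solve (K + noise_var * I) A = Z, then the posterior mean is K @ A.
    A = np.linalg.solve(K + noise_var * np.eye(len(t)), Z)
    return K @ A

# Hypothetical data: 8 frames, 4-dimensional latent codes.
rng = np.random.default_rng(0)
t = np.arange(8, dtype=float)
Z = rng.normal(size=(8, 4))
Z_fused = fuse_latents(Z, t, noise_var=0.5, lengthscale=2.0)
```

With a very short lengthscale the kernel approaches the identity and the frames are treated as independent; a longer lengthscale shares more information between neighbouring frames, which is the mechanism behind smoother, more consistent per-frame predictions.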
Translated title of the contribution: Learning Latent Image Representations with Prior Knowledge
Original language: English
Qualification: Doctor's degree
Awarding Institution
  • Aalto University
  • Kannala, Juho, Supervising Professor
  • Solin, Arno, Supervising Professor
Print ISBNs: 978-952-64-1072-2
Electronic ISBNs: 978-952-64-1073-9
Publication status: Published - 2022
MoE publication type: G5 Doctoral dissertation (article)


  • deep learning
  • machine learning
  • computer vision
  • multi view stereo
  • novel view synthesis
  • Gaussian processes

