Deep Generative Neural Network Models for Capturing Complex Patterns in Visual Data

Ari Heljakka

Research output: Thesis › Doctoral Thesis › Collection of Articles


Deep learning methods underlie much of the recent rapid progress in computer vision. These approaches, however, tend to require costly labeled data. Task-specific models such as classifiers are not intended for learning maximally general internal representations. Furthermore, these models cannot simulate the data-generating process to synthesize new samples, nor modify input samples. Unsupervised deep generative models have the potential to avoid these problems. However, the two dominant families of generative models, Generative Adversarial Networks (GAN) and Variational Autoencoders (VAE), each come with their characteristic problems. GAN-based models are architecturally relatively complex, with a disposable discriminator network but, usually, no encoder to accept inputs. Also, GAN training is often unstable and prone to ignoring parts of the training distribution ("mode collapse" or "mode dropping"). VAEs, on the other hand, tend to overestimate the variance in some regions of the distribution, resulting in blurry generated images.

This work introduces and evaluates models and techniques that considerably reduce the problems above and generate sharp image outputs with a simple autoencoder architecture. This is achieved by virtue of two overarching principles. First, a suitable combination of techniques from GAN models is integrated into the recently introduced VAE-like Adversarial Generator-Encoder. Second, the recursive nature of the networks is leveraged in several ways.

The Automodulator represents a new category of autoencoders characterized by the use of the latent representation for modulating the statistics of the decoder layers. The network can take multiple images as inputs and generate a fused synthetic sample, with some scales of the output driven by one input and the other scales by another, allowing instantaneous 'style mixing' and other new applications.
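The modulation idea above can be illustrated with a minimal NumPy sketch. This is not the thesis implementation; the function names (`modulate`, `decode`), the AdaIN-style instance normalization, and the toy decoder (random seed tensor, ReLU in place of convolutions) are illustrative assumptions. It shows how a latent code sets the per-channel scale and shift of each decoder layer, and how feeding different latents to different layers yields a style-mixed output:

```python
import numpy as np

def modulate(features, latent, w_scale, w_shift):
    """Instance-normalize a (C, H, W) feature map, then scale/shift it
    with per-channel statistics predicted from the latent code."""
    mu = features.mean(axis=(1, 2), keepdims=True)
    sigma = features.std(axis=(1, 2), keepdims=True) + 1e-8
    normalized = (features - mu) / sigma
    scale = (latent @ w_scale)[:, None, None]  # (C, 1, 1) per-channel scale
    shift = (latent @ w_shift)[:, None, None]  # (C, 1, 1) per-channel shift
    return scale * normalized + shift

def decode(x0, latents_per_layer, weights):
    """Toy decoder: each layer is modulated by its own latent code,
    so coarse layers can follow one input image and fine layers another."""
    x = x0
    for latent, (w_scale, w_shift) in zip(latents_per_layer, weights):
        x = modulate(x, latent, w_scale, w_shift)
        x = np.maximum(x, 0.0)  # stand-in nonlinearity for the conv block
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 4, 4))          # fixed seed tensor
z_a = rng.standard_normal(16)                # latent of input image A
z_b = rng.standard_normal(16)                # latent of input image B
weights = [(rng.standard_normal((16, 8)), rng.standard_normal((16, 8)))
           for _ in range(4)]

pure = decode(x0, [z_a, z_a, z_a, z_a], weights)   # all scales from A
mixed = decode(x0, [z_a, z_a, z_b, z_b], weights)  # fine scales from B
```

Here `mixed` differs from `pure` only through the latents routed to the last two layers, which is the mechanism behind instantaneous style mixing.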
Finally, with a Gaussian process framework, the image encoder-decoder setup is extended from single images to image sequences, including video and camera runs. To this end, auxiliary image metadata is leveraged in the form of a non-parametric prior in the latent space of a generative model. This allows one, for instance, to smooth and freely interpolate the image sequence. In doing so, an elegant connection is provided between Gaussian processes and computer vision methods, suggesting far-reaching implications in combining the two. This work provides several examples in which the adversarial training principle, without its typical manifestation in a GAN-like network architecture, is sufficient for high-fidelity image manipulation and synthesis. Hence, this often overlooked distinction appears increasingly significant.
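The sequence-smoothing idea can be sketched as Gaussian process regression over encoded latent codes, indexed by auxiliary metadata such as frame timestamps. This is a simplified illustration, not the thesis model: the RBF kernel, the independent-per-dimension treatment, and the function names are assumptions. Querying the GP posterior mean at arbitrary times gives smoothed or interpolated latents, which a decoder would then turn back into frames:

```python
import numpy as np

def rbf(t1, t2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between two sets of 1-D inputs."""
    d = t1[:, None] - t2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_interpolate(t_obs, z_obs, t_query, noise=1e-2):
    """GP posterior mean over latent codes: one independent GP per
    latent dimension, conditioned on the encoded frames (t_obs, z_obs)."""
    K = rbf(t_obs, t_obs) + noise * np.eye(len(t_obs))
    K_star = rbf(t_query, t_obs)
    return K_star @ np.linalg.solve(K, z_obs)  # (n_query, latent_dim)

rng = np.random.default_rng(1)
t_obs = np.array([0.0, 1.0, 2.0, 3.0])       # frame timestamps (metadata)
z_obs = rng.standard_normal((4, 16))         # latent codes of the frames

# Near-noiseless posterior reproduces the observed codes...
z_hat = gp_interpolate(t_obs, z_obs, t_obs, noise=1e-6)
# ...while a dense query grid yields a smooth latent trajectory
# that a decoder could render as an interpolated video.
z_dense = gp_interpolate(t_obs, z_obs, np.linspace(0.0, 3.0, 31))
```

The kernel hyperparameters (lengthscale, noise) control how aggressively the sequence is smoothed versus how faithfully the observed frames are reproduced.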
Translated title of the contribution: Deep Generative Neural Network Models for Capturing Complex Patterns in Visual Data
Original language: English
Qualification: Doctor's degree
Awarding Institution
  • Aalto University
  • Kannala, Juho, Supervising Professor
  • Solin, Arno, Supervising Professor
Print ISBNs: 978-952-64-0211-6
Electronic ISBNs: 978-952-64-0212-3
Publication status: Published - 2020
MoE publication type: G5 Doctoral dissertation (article)


  • deep learning
  • machine learning
  • deep autoencoders
  • generative models
  • Gaussian processes
  • image-to-image translation
  • automodulators

