Back in October, Aida and I released a Deep Learning-based Twitter music bot, called “LnH: The Band”, that is capable of composing new music on demand in a few genres when you simply tweet at it. It has so far composed more than 700 new songs.

I have been interested for many years in the intersection of art, creativity, and technology, and recent advances in Deep Learning are enabling rapid progress in bridging those disciplines. Using algorithms to impersonate or assist an artist and create artefacts is nothing new; it is currently studied under fields such as Computational Creativity and Creative Coding.

This is a two-episode article in which I discuss how recent advances in supervised learning can be used for generative purposes (this article) and how one can use these models to create music (next episode). I will try to keep the article as high-level as possible; however, some of the concepts are best understood with simple mathematical notation.

Machine Learning, Deep Learning, and Generative Models

Recent advances in Machine Learning, and particularly, Deep Learning have resulted in algorithms and architectures that are able to model complex structured data types such as images, sounds, and text.

These advancements have mainly focused on supervised learning algorithms, which try to learn a statistical model of the posterior probability $$p(\mathbf{y} \vert \mathbf{x})$$ mapping an input sample $$\mathbf{x}$$ to an output sample $$\mathbf{y}$$. You can imagine $$\mathbf{x}$$ to be an image and $$\mathbf{y}$$ to be the kind of object in the image (e.g. a cat).

The probability written as $$p(\mathbf{y} \vert \mathbf{x})$$ tells us how much the model believes that there is a cat given an input image compared to all possibilities it knows about (e.g. other animals). Algorithms which try to model this probability map directly are often referred to as Discriminative Models or Predictive Models.

Generative Models on the other hand try to learn a related function called the joint probability $$p(\mathbf{y} , \mathbf{x})$$. You could read this as how much the model believes that $$\mathbf{x}$$ is an image and there is a cat $$\mathbf{y}$$ in it at the same time.

These two probabilities are of course related and can be written as $$p(\mathbf{y} , \mathbf{x}) = p(\mathbf{x}) \, p(\mathbf{y} \vert \mathbf{x})$$, with $$p(\mathbf{x})$$ being how likely it is that the input $$\mathbf{x}$$ is an image. The probability $$p(\mathbf{x})$$ is usually called a density function in the literature.
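To make this relationship concrete, here is a tiny worked example in Python; the numbers are entirely made up for illustration.

```python
# Toy example: relating the joint, the conditional, and the density.
# The numbers below are made up purely for illustration.

p_x = 0.8                      # p(x): how likely it is that x is a (valid) image
p_y_given_x = 0.3              # p(y|x): the model's belief that the image contains a cat

p_joint = p_x * p_y_given_x    # p(y, x) = p(x) * p(y|x)
print(p_joint)                 # 0.24: belief that "x is an image AND it contains a cat"
```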

The main reason to call these algorithms generative relates to the fact that the model has access to the probability of both input and output at the same time. Using this, one can, for example, generate images of animals by sampling animal kinds $$\mathbf{y}$$ and new images $$\mathbf{x}$$ from $$p(\mathbf{y} , \mathbf{x})$$.

One can take another step further and learn only the density function $$p(\mathbf{x})$$, which depends only on the input space. These algorithms are considered Unsupervised Generative Models (as there is no access to data on the kind of input). They are still generative, as one can again sample from the distribution captured by the model.
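As a minimal sketch of what “modelling $$p(\mathbf{x})$$ and sampling from it” can look like, here is a toy example that uses a simple Gaussian as the density model, a deliberately crude stand-in for the deep models discussed below:

```python
import numpy as np

# Pretend "dataset": 1-D inputs drawn from some unknown distribution.
data = np.random.normal(loc=5.0, scale=2.0, size=1000)

# A very simple unsupervised "generative model": fit a Gaussian density p(x)
# by estimating its mean and standard deviation from the data.
mu, sigma = data.mean(), data.std()

# Generation: sample new points from the learned density.
new_samples = np.random.normal(loc=mu, scale=sigma, size=5)
print(new_samples)
```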

The remainder of this article will focus on these unsupervised generative models, and I will use generative and unsupervised generative interchangeably.

An appropriate application of these generative models to spaces that are considered of artistic value (such as music, painting, …) would result in algorithms that are capable of generating new artefacts on-demand. Next, we will look into a few classes of these models.

From prediction to generation

As mentioned above, discriminative modelling has been at the forefront of the recent success and progress in the field of machine learning. These models are capable of making predictions conditioned on a given input; however, on their own they are not directly able to generate new samples.

The basic idea behind much of the recent progress in generative modelling is simply to convert the generation problem into a prediction one and use the repertoire of deep learning algorithms to solve it. Modern deep learning algorithms are capable of modelling very complex mappings and offer the flexibility of defining problems in terms of computational graphs that can be optimised by variants of the back-propagation algorithm on fast hardware such as GPUs.

The following sections will review the three major categories of these approaches.

Auto-Encoder (AE) models

The simplest form of converting a generative problem to a discriminative one would be to learn a direct mapping from the input space to itself. Using the previous example of images, suppose we wanted to learn an identity map that for each image $$\mathbf{x}$$ would ideally predict exactly the same image, i.e. $$\mathbf{x} = f(\mathbf{x})$$ for $$f$$ being the predictive model.

On its own such a model would not be of any use but, as we will see, by using a specific architecture with certain constraints we can create a generative model.

The basic idea here is that we can create a model composed of two components: an encoder model $$q_e(\mathbf{h} \vert  \mathbf{x})$$ that maps the input to another space, often referred to as hidden or latent space represented by $$\mathbf{h}$$, and a decoder model $$q_d(\mathbf{x} \vert  \mathbf{h})$$ that learns the inverse mapping from the latent to input space.

These two components can be connected together to create an end-to-end trainable model, and it is often the case that we impose a constraint on the latent space $$\mathbf{h}$$. The most common constraint is to create a bottleneck, such that $$\mathbf{h}$$ has a lower dimension than $$\mathbf{x}$$, forcing the model to learn a lower-dimensional representation of the input space (as can be seen in the following figure, courtesy of the deeplearning4j project; the left-hand side shows the encoder network and the right-hand side the decoder).

This way, the encoder can be seen as a compression algorithm and the decoder as a decompression or reconstruction algorithm. In practice, both the encoder and the decoder are deep neural networks of varying architectures (e.g. MLPs, ConvNets, RNNs, AttentionNets) chosen to get the desired outcome.
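To make the encoder/decoder idea concrete, here is a minimal auto-encoder sketch in PyTorch; the layer sizes are placeholders chosen for illustration, and a real image model would typically use convolutional layers instead of plain linear ones:

```python
import torch
import torch.nn as nn

# Minimal auto-encoder sketch: 784-dimensional inputs (e.g. flattened 28x28 images)
# squeezed through a 32-dimensional bottleneck h. All sizes are illustrative only.
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 32))
decoder = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 784), nn.Sigmoid())

def reconstruct(x):
    h = encoder(x)     # q_e: map the input to the latent space
    return decoder(h)  # q_d: map the latent code back to the input space

# Training would minimise a reconstruction loss, e.g. nn.MSELoss(),
# between x and reconstruct(x).
```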

Once such a model is learnt, we can unplug the decoder from the encoder and use them independently. For example, in order to generate a new sample, one could first produce a point in the latent space (by, let’s say, combining the latent vectors of two inputs, or by sampling from the latent space directly) and then present it to the decoder to create a new sample in the output space.
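For example, reusing the hypothetical encoder and decoder from the sketch above (and assuming they have already been trained), one could mix the latent codes of two inputs and decode the result into a new sample:

```python
import torch

# Assume x1 and x2 are two inputs (784-dimensional tensors) and that the
# encoder/decoder from the sketch above have already been trained.
with torch.no_grad():
    h1, h2 = encoder(x1), encoder(x2)
    h_mix = 0.5 * (h1 + h2)   # a point half-way between the two latent codes
    x_new = decoder(h_mix)    # decode it into a brand-new sample
```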

To see these kinds of models in action, I would suggest having a look at the online demo of Digit Fantasies by a Deep Generative Model. You can play with changing the latent space and generating new images of handwritten digits (an example is displayed below).

Other examples are the following two approaches to generating natural images: DRAW on the left and a more recent version of Variational Auto-Encoders on the right.

Generative Adversarial (GAN) models

As we saw with the architecture of Auto-Encoders, one can imagine the general concept of creating modular networks that relate to each other in a specific way; training such models end-to-end can help us learn latent spaces that lead to the generation of new samples.

Another version of this concept is the Generative Adversarial Models framework, where we have a generator model $$q_g(\mathbf{x} \vert \mathbf{h})$$ mapping a low-dimensional latent space $$\mathbf{h}$$ (often modelled as noise sampled from a simple distribution) to the input space $$\mathbf{x}$$. One can interpret this as playing a similar role to the decoder in AEs. So far, not much new here!

The trick is now to introduce a discriminative model $$p_d(\mathbf{y} \vert \mathbf{x})$$ that tries to assign to an input instance $$\mathbf{x}$$ a yes/no binary answer $$\mathbf{y}$$: was the input generated by the generator model, or was it a genuine sample from the dataset we are training on?

Let’s use the same image example we used previously. Imagine the generator model creates a new image and we also have a real image from our dataset. If our generator is good, the discriminator model will not be able to distinguish between the two images easily. However, if our generator is poor, it will be very easy to tell which one is fake and which one is real.

When these two models are coupled, one can train them end-to-end (often in a stage-wise fashion) by ensuring that the generator gets better over time at fooling the discriminator, while the discriminator is trained on the harder and harder problem of detecting fakes. Ideally, we want to end up with a generator model whose outputs are indistinguishable, from the discriminator model’s point of view, from the real data we used for training.
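Here is a minimal sketch of that alternating training procedure in PyTorch; the architectures, sizes, and hyper-parameters are placeholders I chose for illustration, not a recipe taken from the papers referenced here:

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 784  # illustrative sizes only

# Generator q_g(x|h): latent noise -> fake sample.
# Discriminator p_d(y|x): sample -> real/fake logit.
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    batch = real_batch.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # 1) Train the discriminator: real samples should score 1, generated samples 0.
    fake = G(torch.randn(batch, latent_dim)).detach()
    d_loss = bce(D(real_batch), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Train the generator: try to make the discriminator output 1 on generated samples.
    fake = G(torch.randn(batch, latent_dim))
    g_loss = bce(D(fake), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```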

During the initial stages of training, the discriminator can easily tell samples coming from the dataset apart from the synthetic ones produced by the generator, which is just starting to learn. However, as the generator gets better at modelling the dataset, we start seeing more and more generated samples that look similar to the real ones. An example of this can be seen in the following image, which depicts the generated images of a GAN model learning over time (courtesy of OpenAI).

Recent versions of these models have focused on improving the stability of training, using special architectures more suitable for image generation such as DCGAN and LapGAN, adding class information to the input space to generate images from a specific class (CGAN), unsupervised discovery of latent codes for interpretable semantic attributes of a dataset (InfoGAN), and combining AEs with GANs in Adversarial Auto-Encoders.

You can check this demo for a simple GAN training simulation and this demo for a variant of VAEs+GANs for abstract image generation (example image below courtesy of Otoro.net).

Sequence models

If the data we are trying to model is characterised as a sequence over some dimensions (for example 1-d for time, 2-d for space, 3-d for spatio-temporal), then we can use special algorithms called Sequence Models. These models are able to learn probabilities of the form $$p(\mathbf{y} \vert \mathbf{x}_{n}, \dots, \mathbf{x}_{1})$$ where $$i$$ is an index signifying the location in the sequence and $$\mathbf{x}_{i}$$ is the i-th input sample.

An example of this would be written text: each word is a sequence of characters, each sentence is a sequence of words, each paragraph is a sequence of sentences, and so on. Our output $$\mathbf{y}$$ could be, for example, whether the sentence carries a positive or a negative sentiment.

Using a similar trick from AEs, one can decide to replace $$\mathbf{y}$$ with the next item in the sequence, i.e. $$\mathbf{y} = \mathbf{x}_{n+1}$$, allowing the model to learn $$p(\mathbf{x}_{n+1} \vert \mathbf{x}_{n}, \dots, \mathbf{x}_{1})$$.

In other words, with these models we can use the past history of a sequence of input data to make a prediction of what is likely to follow. Also note that one can use the chain rule of probability to estimate the probability of the overall sequence via a recursive operation as $$p(\mathbf{x}_{n+1}, \mathbf{x}_{n}, \dots, \mathbf{x}_{1}) = p(\mathbf{x}_{1}) \prod_{i=1}^{n} p(\mathbf{x}_{i+1} \vert \mathbf{x}_{i}, \dots, \mathbf{x}_{1})$$. Remember that, from our definition of unsupervised generative models, this probability expresses how much the model believes the sequence to be a real one (i.e. coming from the same distribution as the dataset).
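As a toy illustration of this chain rule, the following sketch scores a sequence under a hypothetical `next_prob(prefix, symbol)` function, which is assumed to return the model’s conditional probability $$p(\mathbf{x}_{i+1} \vert \mathbf{x}_{i}, \dots, \mathbf{x}_{1})$$ (any trained sequence model could back it):

```python
import math

def sequence_log_prob(sequence, next_prob):
    """Score a whole sequence with the chain rule.

    `next_prob(prefix, symbol)` is a hypothetical callable returning the model's
    conditional probability of `symbol` given the `prefix` seen so far.
    """
    log_p = 0.0
    for i in range(1, len(sequence)):
        log_p += math.log(next_prob(sequence[:i], sequence[i]))
    return log_p  # log-probability of the sequence given its first symbol;
                  # add log p(x_1) for the full joint probability
```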

A special branch of neural networks called Recurrent Neural Networks is especially suited to these tasks, as such networks are able to keep a summary of the past inputs (often called the (hidden) state $$\mathbf{h}_{n}$$) in memory and simplify the model to a two-stage operation:

  • Given a new input from the sequence $$\mathbf{x}_{n}$$ and the old state $$\mathbf{h}_{n - 1}$$ compute a new state $$\mathbf{h}_{n}$$ by the encoder function $$q_e(\mathbf{h}_{n} \vert \mathbf{h}_{n - 1},\mathbf{x}_{n})$$.
  • Use the new state to compute how likely it is that the next input in the sequence is $$\mathbf{x}_{n + 1}$$ by the decoder function $$p_d(\mathbf{x}_{n + 1} \vert \mathbf{h}_{n})$$.

As you can see, there is a massive overlap between these generative sequence models and the AEs we discussed previously.
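A minimal sketch of this two-stage loop in PyTorch, using a built-in GRU cell for the encoder and a linear-plus-softmax read-out for the decoder (the vocabulary and hidden sizes are placeholders):

```python
import torch
import torch.nn as nn

vocab_size, hidden_size = 65, 128             # illustrative sizes only

embed = nn.Embedding(vocab_size, hidden_size)
cell = nn.GRUCell(hidden_size, hidden_size)   # encoder q_e: (h_{n-1}, x_n) -> h_n
readout = nn.Linear(hidden_size, vocab_size)  # decoder p_d: h_n -> distribution over x_{n+1}

def step(x_n, h_prev):
    h_n = cell(embed(x_n), h_prev)                 # 1) update the state
    probs = torch.softmax(readout(h_n), dim=-1)    # 2) p(x_{n+1} | h_n)
    return probs, h_n
```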

A very popular example of sequence modelling is its application to NLP or text modelling. The generation procedure is often a recursive process (a sketch of it follows the list below):

  • Choose a symbol (e.g. character) from the decoder using the probability map $$p_d(\mathbf{x}_{n+1} \vert \mathbf{h}_{n})$$
  • Append that symbol to the list of generated sequence
  • Use the new symbol as the input for the next step of the algorithm to update the state
  • Repeat until a stop event is generated.
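Continuing the RNN sketch from the previous section, a hypothetical generation loop could look like this (`start_symbol` and `stop_symbol` are assumed to be integer ids in the vocabulary):

```python
def generate(start_symbol, stop_symbol, max_len=200):
    # Reuses step() and hidden_size from the sketch above.
    h = torch.zeros(1, hidden_size)   # empty state to begin with
    x = torch.tensor([start_symbol])
    out = []
    for _ in range(max_len):
        probs, h = step(x, h)                            # update the state, get p(x_{n+1} | h_n)
        x = torch.multinomial(probs, num_samples=1)[0]   # sample the next symbol
        if x.item() == stop_symbol:                      # stop event
            break
        out.append(x.item())
    return out
```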

The following diagram (adapted from here) shows an example of this process, where the start symbol leads to the generation of the character “h”, which then gets used as input to generate “e”, and so on until the model generates a stop symbol.

Depending on the type of problem (similar to AEs and GANs), researchers have come up with specialisations either in models for the encoder (such as LSTMs, GRUs, ConvRNNs, Bidirectional RNNs, [Recursive Tree Models](http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf), Hierarchical AEs, Attention Networks, and spatial inputs with PixelRNN or PixelCNN) or in models for the decoder (such as simple MLPs, LSTMs for Sequence to Sequence Learning, or Mixture Density Models). Note that while most sequence-to-sequence models (such as those used in Seq2Seq and Neural Language Translation) use output spaces that differ from their input space, they still fit nicely as extensions of generative models derived from a single space.

To check a few demos on how these models work, you can visit this demo on text generation and this one on hand-writing image generation.

References and further readings:

  • Hochreiter and Schmidhuber (1997), Long Short-Term Memory
  • Graves (2013), Generating Sequences With Recurrent Neural Networks
  • Socher, Perelygin, Wu, Chuang, Manning, Ng, and Potts (2013), Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
  • Chung, Gulcehre, Cho, and Bengio (2014), Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
  • Pinheiro and Collobert (2014), Recurrent Convolutional Neural Networks for Scene Labeling
  • Mnih, Heess, Graves, and Kavukcuoglu (2014), Recurrent Models of Visual Attention
  • Sutskever, Vinyals, and Le (2014), Sequence to Sequence Learning with Neural Networks
  • Berglund, Raiko, Honkala, Kärkkäinen, Vetek, and Karhunen (2015), Bidirectional Recurrent Neural Networks as Generative Models – Reconstructing Gaps in Time Series
  • Li, Luong, and Jurafsky (2015), A Hierarchical Neural Autoencoder for Paragraphs and Documents
  • Karpathy (2015), The Unreasonable Effectiveness of Recurrent Neural Networks
  • Oord, Kalchbrenner, and Kavukcuoglu (2016), Pixel Recurrent Neural Networks
  • Choi, Kim, and Kim (2016), List of Papers on Recurrent Neural Networks
  • FastML (2015), Deep nets generating stuff
  • Olah (2016), Attention and Augmented Recurrent Neural Networks

Final words:

So far we have seen a brief overview of recent developments in generative models and their ability to create new samples and artefacts. The most interesting observation is that, using a few simple tricks, one can cast a generative modelling problem as a prediction problem, which opens up many possibilities in terms of types of input/output spaces, architectures, and training algorithms.

Furthermore, while many of these algorithms differ significantly in their implementation details and objectives, they share a striking similarity reflected in their dual architecture, in which two competing (GANs), inverting (AEs), or completing (Sequence Models) networks work in tandem to create a generative model.

Another interesting aspect of these models is that they are often agnostic with respect to the data domain they are trained on. Researchers have applied them to a diverse range of domains such as images, videos, text (including poems, lyrics, books, and news), audio, and music. But there is really nothing holding these models back from being applied to any other domain: as long as you can describe your input data either symbolically or numerically, you can try these models.

In the following article, we will look into music specifically and explore generative models used in that space. See you soon!

PS: I’ve purposefully omitted algorithms for Neural Style Transfer as they are currently limited to images and do not fit into the generative models description as nicely as the three frameworks mentioned above do.