VAE: Variational Autoencoder

Here are some digits. First of all, let me ask you: “Can you recognize them?”

Figure 1

Maybe not all of them: some digits are blurred, while others are quite ambiguous. Now let me show you another set of digits.

Figure 2

It looks like someone wrote many “6”s: all of them are very similar, but not identical.

The second question I want to ask is: “Were these digits written by a human or by a machine?” If you have read other similar posts, you may already know that the answer is the second one; otherwise, I wouldn’t have asked the question.

More precisely, this is the output of a deep generative network called a Variational Autoencoder. Models of this kind have often been used to draw images and to achieve state-of-the-art results in semi-supervised learning, as well as for interpolation and anomaly detection. Their distinctive feature is that they can generate new data that has not been seen before; in other words, they are generative models. This is because their output is drawn from a learned approximation of the probability distribution of the input data, so a generated sample is similar to the input data, but not the same. Of course, we can control the degree of similarity by tweaking some parameters.

Variational Autoencoder

Variational Autoencoders (VAEs) are deep generative models in which we have some data X distributed according to an unknown distribution Preal(x). The task of a VAE is to learn a distribution Pgen(x) that is as similar as possible to Preal(x). Learning Pgen(x) directly is intractable, but it may be easier to choose some distribution P(z) and model P(x|z) instead. Thus, a VAE learns Pgen(x) by first learning an encoded representation of x (encode), which we will call z and which is encouraged to follow a normal distribution P(z). Afterward, it uses z to generate a new sample x‘ (decode), drawn from the distribution P(x|z), which will be similar to the original x. The encoding and decoding steps are each modeled by a neural network, called the encoder and the decoder respectively.

Simplified representation of VAE


The encoded representation of the input data is nothing but the data itself, represented with fewer features than the original. This process is also known as dimensionality reduction, the same process performed by PCA, which we discussed in a previous post. The main difference is that instead of learning to encode a sample x directly into z, a VAE learns to produce two parameters, a mean and a standard deviation, and then samples z from a normal distribution with those parameters.


This process is performed by a deep neural network, known as the variational network or encoder, whose goal is to learn the mean u and the standard deviation σ of the normal distribution over the latent variable z, which we will write as q(z|x). We will refer to the parameters of the variational network as ϕ.


The second part of the network is the decoder, also known as the generative model. It is a deep neural network as well. It learns the distribution of x given z, written p(x|z), and its goal is to generate a sample x‘ drawn from p(x|z) such that x‘ is as similar as possible to x. We will refer to its parameters as θ.


To sum up, to achieve its task the model has to learn two probability distributions: qϕ(z|x) and pθ(x|z). Now, let’s jump into the details and see how a VAE is trained and why it works.

Training phase

As with many other probabilistic models, we are going to maximize the log likelihood of the data under our model:

max_\theta \, log \, p_\theta(x)

However, the model we defined has not only the observation (x) but also the latent representation (z). This makes it hard to compute pθ(x), which we call the marginal likelihood of x, because we only know the joint likelihood of the model and would have to integrate it over every possible z:

log \, p_\theta(x)=log \int p_\theta(x,z)dz=log \int p_\theta(x|z)p(z)dz

Since directly optimizing log pθ(x) is infeasible, we choose to optimize a lower bound of it. The lower bound is constructed as:

log \, p_\theta(x)\geq log \, p_\theta(x)-KL(q_\phi(z|x)||p_\theta(z|x)) (1)

Where qϕ(z|x) is a user-specified distribution over z (called the variational posterior) that is chosen to approximate the true posterior pθ(z|x), and KL(qϕ(z|x)∥pθ(z|x)) is the Kullback–Leibler divergence between qϕ(z|x) and pθ(z|x). In general, the Kullback–Leibler divergence between two distributions, q(z) and p(z), is defined as:

KL(q(z)||p(z))=\int q(z) \, log \, \frac{q(z)}{p(z)}dz=E_{q(z)}[log \, q(z)-log \, p(z)]

It measures how much the distribution q(z) differs from p(z). Moreover:

  • KL(q(z)||p(z)) >= 0 for any q and p.
  • KL(q(z)||p(z)) = 0 if and only if q(z)=p(z) for all possible values of z, that is, when p and q are the same distribution.
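The two properties above can be checked numerically. The following is a minimal sketch that assumes a discrete latent variable, so the divergence becomes the sum of q(z)(log q(z) − log p(z)) over the values of z; the function name kl_divergence and the example probabilities are illustrative choices, not part of the original post.

```python
import numpy as np

def kl_divergence(q, p):
    """KL(q || p) for two discrete distributions given as probability vectors."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0  # terms with q(z) = 0 contribute nothing to the sum
    return float(np.sum(q[mask] * (np.log(q[mask]) - np.log(p[mask]))))

# Two made-up distributions over three values of z
q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])

kl_qp = kl_divergence(q, p)  # positive, because q and p differ
kl_pq = kl_divergence(p, q)  # also positive, but a different number
kl_qq = kl_divergence(q, q)  # exactly zero

print(kl_qp, kl_pq, kl_qq)
```

Note that KL(q||p) and KL(p||q) differ: the divergence is not symmetric, which is why the order of the arguments matters in the derivation that follows.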

Going back to inequality (1), the lower bound equals the marginal log likelihood if and only if qϕ(z|x)=pθ(z|x), that is, when the Kullback–Leibler divergence between them is zero. Thus, to make the bound as tight as possible, we have to find the variational parameters ϕ* that minimize this divergence:

\phi^*=argmin_\phi \: KL(q_\phi(z|x)||p_\theta(z|x))

We can rewrite KL(qϕ(z|x)∥pθ(z|x)) in the following form by applying the properties of logarithms to the definition of the Kullback–Leibler divergence:

KL(q_\phi(z|x)||p_\theta(z|x))=E_{q_\phi(z|x)}[log \, q_\phi(z|x)-log \, p_\theta(z|x)]

Remembering that p(z|x)=p(x,z)/p(x), and that log pθ(x) does not depend on z, we have that:

KL(q_\phi(z|x)||p_\theta(z|x))=E_{q_\phi(z|x)}[log \, q_\phi(z|x)-log \, p_\theta(x,z)]+log \, p_\theta(x)

Substituting the above formula back into (1), we get:

log \, p_\theta(x)\geq log \, p_\theta(x)+E_{q_\phi(z|x)}[log \, p_\theta(x,z)-log \, q_\phi(z|x)]-log \, p_\theta(x)

The log pθ(x) terms on the right-hand side cancel, so we no longer need pθ(x), which can’t be calculated, and we obtain the following inequality:

log \, p_\theta(x)\geq \, E_{q_\phi(z|x)}[log \, p_\theta(x,z)-log \, q_\phi(z|x)]

Where E[log pθ(x,z)] − E[log qϕ(z|x)] is called the ELBO (Evidence Lower Bound). The ELBO equals the log of the evidence, log pθ(x), minus the KL divergence KL(qϕ(z|x)∥pθ(z|x)); since log pθ(x) is constant with respect to qϕ(z|x), maximizing the ELBO over ϕ is equivalent to minimizing that divergence.
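To see the bound at work, it helps to use a model small enough that log p(x) can be computed exactly. The sketch below assumes a hypothetical toy model with a latent variable z that takes only three values and a single fixed observation x; all the probabilities are made-up numbers chosen for illustration.

```python
import numpy as np

# Hypothetical toy model: z takes 3 values, x is one fixed observation.
p_z = np.array([0.5, 0.3, 0.2])          # prior p(z)
p_x_given_z = np.array([0.1, 0.6, 0.3])  # likelihood p(x|z) of the observed x

p_xz = p_x_given_z * p_z                 # joint p(x,z)
p_x = p_xz.sum()                         # marginal p(x), tractable only because z is tiny
posterior = p_xz / p_x                   # true posterior p(z|x)

def elbo(q):
    """E_q[log p(x,z) - log q(z)] for a discrete q over z."""
    return float(np.sum(q * (np.log(p_xz) - np.log(q))))

def kl(q, p):
    """KL(q || p) for discrete distributions with strictly positive entries."""
    return float(np.sum(q * (np.log(q) - np.log(p))))

q = np.array([0.2, 0.5, 0.3])            # an arbitrary variational distribution
log_px = float(np.log(p_x))
gap = log_px - elbo(q)                   # how far the bound is from log p(x)

print(log_px, elbo(q), gap, kl(q, posterior))
```

The gap between log p(x) and the ELBO matches KL(q||p(z|x)) up to floating-point error, and it vanishes when q is set to the true posterior.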

Now, again exploiting the usual formula p(x,z)=p(x|z)p(z), we can rewrite the ELBO in the following way:

ELBO=E_{q_\phi(z|x)}[log \, p_\theta(x|z)+log \, p_\theta(z)-log \, q_\phi(z|x)]

Which is equal to:

L(\theta,\phi)=ELBO=E_{q_\phi(z|x)}[log \, p_\theta(x|z)]-KL(q_\phi(z|x)||p_\theta(z))
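The KL term in this expression has a closed form when both distributions are Gaussian. The sketch below assumes that the prior pθ(z) is a standard normal N(0, I) and that qϕ(z|x) is a normal distribution with independent dimensions; the post only says "normal distribution", so the standard-normal prior is an assumption. The function name kl_to_standard_normal is illustrative, and the Monte Carlo estimate is included only as a sanity check of the formula.

```python
import numpy as np

def kl_to_standard_normal(u, sigma):
    """KL( N(u, diag(sigma^2)) || N(0, I) ), summed over the latent dimensions."""
    return 0.5 * np.sum(u**2 + sigma**2 - 2.0 * np.log(sigma) - 1.0, axis=-1)

# Made-up encoder outputs for a 2-dimensional latent space
u = np.array([0.5, -1.0])
sigma = np.array([0.8, 1.5])
closed_form = float(kl_to_standard_normal(u, sigma))

# Monte Carlo estimate of E_q[log q(z) - log p(z)] as a cross-check
rng = np.random.default_rng(0)
z = u + sigma * rng.standard_normal((200_000, 2))
log_q = -0.5 * np.sum(((z - u) / sigma) ** 2 + 2.0 * np.log(sigma) + np.log(2.0 * np.pi), axis=-1)
log_p = -0.5 * np.sum(z**2 + np.log(2.0 * np.pi), axis=-1)
monte_carlo = float(np.mean(log_q - log_p))

print(closed_form, monte_carlo)
```

Having a closed form means that only the reconstruction term of the ELBO has to be estimated by sampling.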

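Finally, maximizing L(θ,ϕ) with respect to ϕ pushes qϕ(z|x) toward the true posterior, so that KL(qϕ(z|x)||pθ(z|x)) gets close to zero. Thus, a VAE learns an inference model qϕ(z|x) that approximates the intractable posterior pθ(z|x) by optimizing the variational lower bound. The first term, Eqϕ(z|x)[log pθ(x|z)], is the reconstruction term: it rewards decoded samples that are close to x. The second term, KL(qϕ(z|x)||pθ(z)), keeps the variational posterior close to the prior.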

In a variational autoencoder, the variational posterior qϕ(z|x) is parameterized by a neural network g (the encoder), which accepts an input x and outputs the mean and standard deviation of a normal distribution over z:

u_z(x,\phi), \, \sigma_z(x,\phi)=g(x)


In the same way, pθ(x|z) is parameterized by another neural network f (the decoder), which receives an input z drawn from the normal distribution learnt by g and outputs x‘, a reconstruction of x drawn from the probability distribution learnt by f.
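To make the roles of g and f concrete, here is a minimal NumPy sketch of a forward pass that returns a single-sample estimate of L(θ,ϕ). Several things are assumptions for illustration and do not come from the post: the layer sizes, the one-hidden-layer architecture, the untrained random weights, a Bernoulli likelihood over binary pixels, and a standard normal prior (which gives the closed-form KL term used here).

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, h_dim, z_dim = 784, 64, 2  # hypothetical sizes (e.g. 28x28 images)

def init(n_in, n_out):
    """Random weights and zero biases; stand-ins for trained parameters."""
    return rng.normal(0.0, 0.05, (n_in, n_out)), np.zeros(n_out)

# phi: encoder parameters, theta: decoder parameters
phi = {"h": init(x_dim, h_dim), "u": init(h_dim, z_dim), "log_sigma": init(h_dim, z_dim)}
theta = {"h": init(z_dim, h_dim), "out": init(h_dim, x_dim)}

def dense(a, layer):
    W, b = layer
    return a @ W + b

def g(x):
    """Encoder: maps x to the mean u and standard deviation sigma of q(z|x)."""
    h = np.tanh(dense(x, phi["h"]))
    return dense(h, phi["u"]), np.exp(dense(h, phi["log_sigma"]))  # exp keeps sigma > 0

def f(z):
    """Decoder: maps z to per-pixel Bernoulli probabilities for x'."""
    h = np.tanh(dense(z, theta["h"]))
    return 1.0 / (1.0 + np.exp(-dense(h, theta["out"])))

def elbo_estimate(x):
    u, sigma = g(x)
    z = u + sigma * rng.standard_normal(u.shape)   # one sample from q(z|x)
    x_rec = np.clip(f(z), 1e-7, 1.0 - 1e-7)        # avoid log(0)
    log_px_z = np.sum(x * np.log(x_rec) + (1.0 - x) * np.log(1.0 - x_rec), axis=-1)
    kl = 0.5 * np.sum(u**2 + sigma**2 - 2.0 * np.log(sigma) - 1.0, axis=-1)
    return log_px_z - kl                           # one value per input row

x = (rng.random((8, x_dim)) > 0.5).astype(float)   # fake binary "images"
values = elbo_estimate(x)
print(values.shape)
```

Training would adjust phi and theta to increase the average of these values over the dataset; the sketch only evaluates the objective and does not compute gradients.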


The reparameterization trick

During training, our goal is to maximize the lower bound L(θ,ϕ) of the model:

L(\theta,\phi)=E_{q_\phi(z|x)}[log \, p_\theta(x|z)]-KL(q_\phi(z|x)||p_\theta(z))

In principle, we can do this with gradient ascent using backpropagation. However, the term Eqϕ(z|x)[log pθ(x|z)] involves a sampling step from the distribution qϕ(z|x), which is a random node that we don’t know how to backpropagate through. To get around this problem, we use a simple and smart technique called the reparameterization trick, which is the basis of the SGVB estimator.

It consists of introducing an auxiliary noise variable ϵ, which lets us rewrite z so that backpropagation can flow through deterministic nodes only. In the original formulation we have:

z \sim q_\phi(z|x)=N(z|u_z(x,\phi),\sigma_z^2(x,\phi))

Which we reparameterize as:

z=u_z(x,\phi)+\sigma_z(x,\phi) \cdot \epsilon

Where ϵ is drawn from a normal distribution with mean 0 and standard deviation 1. In this way, z is a linear transformation of ϵ, and the sampling operation now involves only ϵ, which we don’t need to backpropagate through.
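A small numerical experiment shows why this works. In the sketch below, the function f(z) = z² is a stand-in for the quantity inside the expectation (an assumption made because its expectation, u² + σ², has known gradients 2u and 2σ). Since z is a deterministic function of u, σ and ϵ, the chain rule gives dz/du = 1 and dz/dσ = ϵ, and averaging over samples of ϵ recovers the true gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
u, sigma = 1.5, 0.7                       # made-up values for the mean and standard deviation

eps = rng.standard_normal(500_000)        # all the randomness lives in eps
z = u + sigma * eps                       # deterministic in u and sigma once eps is fixed

# Stand-in objective f(z) = z**2, so E[f(z)] = u**2 + sigma**2
df_dz = 2.0 * z

grad_u = float(np.mean(df_dz * 1.0))      # chain rule with dz/du = 1
grad_sigma = float(np.mean(df_dz * eps))  # chain rule with dz/dsigma = eps

print(grad_u, 2.0 * u)                    # estimate vs. exact gradient w.r.t. u
print(grad_sigma, 2.0 * sigma)            # estimate vs. exact gradient w.r.t. sigma
```

In a real VAE, an automatic differentiation library applies the same chain rule through the decoder, with u and σ produced by the encoder.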

Training VAE with backpropagation

Random sample generation

Once the VAE is trained, we have basically two ways to generate new samples. The first, and most common, is the method we discussed at the beginning of the post: we feed the encoder a sample x, and the output of the decoder is another sample x‘ similar to x. Furthermore, we can control how similar x‘ is to x by manually tweaking the standard deviation calculated by the encoder. A higher standard deviation means more variance and therefore more varied outputs. A lower standard deviation, on the other hand, yields an x‘ very close to x, but still not the same. The digits in Figure 2 were generated using this method.

The second way to use a VAE is to generate new random samples from a normal distribution rather than from a sample x. We can ignore the encoder, draw z from a normal distribution, and use only the decoder part of the network to decode it. The decoder has no way of knowing whether the latent variable z came from a normal distribution or from the encoder; it has only learnt how to generate x from z. Thus, the output will resemble the inputs used for training, but we don’t know in advance which one it will look like. The digits in Figure 1 were generated using this method.
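The two generation modes can be sketched as follows. The encode and decode functions here are hypothetical single-layer stand-ins with random weights, so the outputs are not real digits; in practice they would be the trained encoder and decoder. The scale argument is an illustrative name for the manual tweak of the standard deviation described above.

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, z_dim = 784, 2  # hypothetical sizes

# Stand-in weights; a real application would load trained parameters instead.
W_enc_u = rng.normal(0.0, 0.01, (x_dim, z_dim))
W_enc_s = rng.normal(0.0, 0.01, (x_dim, z_dim))
W_dec = rng.normal(0.0, 0.1, (z_dim, x_dim))

def encode(x):
    """Returns the mean and standard deviation of q(z|x)."""
    return x @ W_enc_u, np.exp(x @ W_enc_s)

def decode(z):
    """Returns pixel intensities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(z @ W_dec)))

def generate_from_sample(x, scale=1.0):
    """First way: encode x, perturb it in latent space, decode."""
    u, sigma = encode(x)
    eps = rng.standard_normal(u.shape)
    return decode(u + scale * sigma * eps)  # larger scale -> more varied outputs

def generate_from_prior(n):
    """Second way: skip the encoder and decode z drawn from N(0, I)."""
    z = rng.standard_normal((n, z_dim))
    return decode(z)

x = rng.random((1, x_dim))                           # placeholder input image
x_variation = generate_from_sample(x, scale=0.5)     # close to the reconstruction of x
x_exact_mean = generate_from_sample(x, scale=0.0)    # no noise: decodes the mean itself
x_random = generate_from_prior(4)                    # four samples unrelated to any x
print(x_variation.shape, x_random.shape)
```

Setting scale to 0 removes the randomness entirely, while values above 1 exaggerate the variation beyond what the encoder predicted.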


Some examples of trained VAE models implemented in Python using TensorFlow and ZhuSuan can be found in the following posts:

