Generating handwritten digits with VAE and Zhusuan

In the last post we talked about Variational Auteoncoders (VAE), powerful generative machine learning model able to generate new data based on previously seen samples. In this post we are going to implement one and use it to generate handritten digits. Will you recognize which digits have been written by a human and which ones have been written by a machine?

Set of digits generated with VAE

We are going to implement a VAE model in Python exploiting two very useful machine learning libraries: Tensorflow and Zhusuan. It is well known that Tensorflow is the main library when having to deal with machine learning such that most of the models are actually implemented upon it. Thus, it is assumed the reader already having a basic knowledge about it.


ZhuSuan is a Python probabilistic programming library for Bayesian deep learning, which conjoins the complimentary advantages of Bayesian methods and deep learning. ZhuSuan is built upon Tensorflow. Unlike existing deep learning libraries, which are mainly designed for deterministic neural networks and supervised tasks, ZhuSuan provides deep learning style primitives and algorithms for building probabilistic models and applying Bayesian inference. It has been developed in 2017 by some students and teachers of Tsinghua University.

To have Zhusuan installed, please refer to the official Github page:

A basic tutorial on Zhusuan can be found here:


The model will be trained on the MNIST dataset which is composed by a training set of 60000 handwritten digits and a test set of 10000 handwritten digits. Each picture has a dimension of 28×28 greyscale pixels, thus the input and the output layers of the VAE will by composed by 784 neurons, while encoder and decoder will have a symmetric structure both containing two hidden layers of 500 neurons each. Finally, the latent variable z, output of the encoder and input of the decoder, is drawn from a normal distribution with the learnt mean and variance as parameters, while the output of the decoder is drawn from a simple Bernoulli distribution built on top of the last hidden layer.

Once the network is trained, it is also possible to generate new random samples of digits by running only the decoder and drawing z from a normal distribution with mean equal to 0 and variance equal to 1. Conversely, to generate specific digits, z will simply be generated by the encoder just by giving in input the desired digit we want to generate. In this post, we are going to implement both ways of digits generation.

By default, we have the latent variable z composed by 40 neurons, this means that each image will be compressed from 784 to 40 pixel representation before being regenerated by the decoder. Furthermore, we are not going to use a validation test, thus all the 60000 instances will be available for training divided in batches of 128 samples.

We are going to train the model for a total of 1000 epochs and save its paramaters every 100 epochs in order to continue training the model from the latest chekpoint in case we would need to stop it. Overall, the training process should not take more than three hours on Intel Core i5 CPU. Finally, once the training phase will be over, we will use the VAE to generate some handwritten digits.

Important: basic knowledge about Variational Autoencoders is required in order to better understand the following code. You can use the following post as a reference.

from __future__ import absolute_import
from __future__ import print_function
from __future__ import division
import os
import time

import tensorflow as tf
from six.moves import range
import numpy as np
import zhusuan as zs

from examples import conf
from examples.utils import dataset, save_image_collections

# build the decoder
@zs.meta_bayesian_net(scope="gen", reuse_variables=True)
def build_gen(x_dim, z_dim, n, n_particles=1):
    bn = zs.BayesianNet()
    z_mean = tf.zeros([n, z_dim])
    z = bn.normal("z", z_mean, std=1., group_ndims=1, n_samples=n_particles)
    h = tf.layers.dense(z, 500, activation=tf.nn.relu)
    h = tf.layers.dense(h, 500, activation=tf.nn.relu)
    x_logits = tf.layers.dense(h, x_dim)
    bn.deterministic("x_mean", tf.sigmoid(x_logits))
    bn.bernoulli("x", x_logits, group_ndims=1)
    return bn

# build the encoder
@zs.meta_bayesian_net(scope="q_net", reuse_variables=True)
def build_q_net(x, z_dim, n_z_per_x, std_noise=0):
    bn = zs.BayesianNet()
    h = tf.layers.dense(tf.cast(x, tf.float32), 500, activation=tf.nn.relu)
    h = tf.layers.dense(h, 500, activation=tf.nn.relu)
    z_mean = tf.layers.dense(h, z_dim)
    z_logstd = tf.layers.dense(h, z_dim) + std_noise
    bn.normal("z", z_mean, logstd=z_logstd, group_ndims=1, n_samples=n_z_per_x)
    return bn

def main():
    # Load MNIST dataset
    data_path = os.path.join(conf.data_dir, "mnist.pkl.gz")
    x_train, t_train, x_valid, t_valid, x_test, t_test = \
    x_train = np.vstack([x_train, x_valid])
    x_test = np.random.binomial(1, x_test, size=x_test.shape)

    # Define model parameters x and z
    x_dim = x_train.shape[1]
    z_dim = 40

    # how many samples to draw from the nornmal distribution
    n_particles = tf.placeholder(tf.int32, shape=[], name="n_particles")

    # input data to feed the variational
    x_input = tf.placeholder(tf.float32, shape=[None, x_dim], name="x")
    x = tf.cast(tf.less(tf.random_uniform(tf.shape(x_input)), x_input),

    # input batch size
    n = tf.placeholder(tf.int32, shape=[], name="n")

    # add random noise to the variance of the q_model so to 
    # get more various samples when generating new digits with the decoder
    std_noise = tf.placeholder_with_default(0., shape=[], name="std_noise")

    # build the model (encoder) and the q_model (variational or decoder)
    model = build_gen(x_dim, z_dim, n, n_particles)
    q_model = build_q_net(x, z_dim, n_particles, std_noise)
    # create an instance of the variational with no observed variables
    variational = q_model.observe()

    # calculate ELBO
    lower_bound = zs.variational.elbo(
        model, {"x": x}, variational=variational, axis=0)
    cost = tf.reduce_mean(lower_bound.sgvb())
    lower_bound = tf.reduce_mean(lower_bound)

    # optimize the ELBO
    optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
    infer_op = optimizer.minimize(cost)

    # define training/evaluation parameters
    epochs = 1000
    batch_size = 128
    iters = x_train.shape[0] // batch_size
    # path to store the generated digits
    result_path = "results/vae_digits"
    # path to store checkpoints during training
    checkpoints_path = "checkpoints/vae_digits"

    # used to save checkpoints during training
    saver = tf.train.Saver(max_to_keep=10)
    save_model_freq = 100

    # run the inference
    with tf.Session() as sess:

        # restore the model parameters from the latest checkpoint
        ckpt_file = tf.train.latest_checkpoint(checkpoints_path)
        begin_epoch = 1
        if ckpt_file is not None:
            print('Restoring model from {}...'.format(ckpt_file))
            begin_epoch = int(ckpt_file.split('.')[-2]) + 1
            saver.restore(sess, ckpt_file)

        # begin training
        for epoch in range(begin_epoch, epochs + 1):
            time_epoch = -time.time()
            lbs = []
            for t in range(iters):
                x_batch = x_train[t * batch_size:(t + 1) * batch_size]
                # apply gradient descent
                _, lb =[infer_op, lower_bound],
                                 feed_dict={x_input: x_batch,
                                            n_particles: 1,
                                            n: batch_size})

            time_epoch += time.time()
            print("Epoch {} ({:.1f}s): Lower bound = {}".format(
                epoch, time_epoch, np.mean(lbs)))

            # save model parameters
            if epoch % save_model_freq == 0:
                print('Saving model...')
                save_path = os.path.join(checkpoints_path, 
                if not os.path.exists(os.path.dirname(save_path)):
      , save_path)

        # random generation of images from latent distribution as N(0,1)
        # observe the variable "x_mean" which will contain the raw results
        x_gen = tf.reshape(model.observe()["x_mean"], [-1, 28, 28, 1])
        # generate 100 random output samples
        images =, feed_dict={n: 100, n_particles: 1})
        name = os.path.join(result_path, "random_samples.png")
        save_image_collections(images, name)

        # generation of 100 samples for each digit
        # map each digit to a corresponding sample from the test set so we can generate similar digits
        test_n = [3, 2, 1, 90, 95, 23, 11, 0, 84, 7]  # ex: instance 3 represents a 0, instance 2 represents a 1
        for i in range(len(test_n)):
            # get latent distribution from the variational giving as input a fixed sample from the dataset
            z = q_model.observe(x=np.expand_dims(x_test[test_n[i]], 0))['z']
            # run the computation graph adding noise to computed variance to get different output samples
            latent =, feed_dict={x_input: np.expand_dims(x_test[test_n[i]], 0),
                                            n: 1,
                                            n_particles: 100,
                                            std_noise: 0.7})
            # get the output images from the decoder giving as input the latent distribution z
            x_gen = tf.reshape(model.observe(z=latent)["x_mean"], [-1, 28, 28, 1])
            images =, feed_dict={})
            name = os.path.join(result_path, "{}.png".format(i))
            save_image_collections(images, name)

if __name__ == "__main__":

Generated handwritten digits

The following digits have been generated by adding a noise of 0.7 to the variance. Because of the low noise, they look very similar to each other.

Now, we are going to add more noise to the output by adding 1.4 to the variance. In this way we are going to obtain more screwed and different digits. Basically, VAE is going to imitate human behavior: for a humans it is impossible to write many identical digits, but slight differences will occur each time we write them down.

You can play with the variance to see its impact on the output digits. Will your VAE behave like a human or like a machine?

Find more on


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s