In the last post we talked about Variational Auteoncoders (VAE), powerful generative machine learning model able to generate new data based on previously seen samples. In this post we are going to implement one and use it to generate handritten digits. Will you recognize which digits have been written by a human and which ones have been written by a machine?

We are going to implement a VAE model in Python exploiting two very useful machine learning libraries: Tensorflow and Zhusuan. It is well known that Tensorflow is the main library when having to deal with machine learning such that most of the models are actually implemented upon it. Thus, it is assumed the reader already having a basic knowledge about it.

### Zhusuan

ZhuSuan is a Python probabilistic programming library for **Bayesian deep learning**, which conjoins the complimentary advantages of Bayesian methods and deep learning. ZhuSuan is built upon Tensorflow. Unlike existing deep learning libraries, which are mainly designed for deterministic neural networks and supervised tasks, ZhuSuan provides deep learning style primitives and algorithms for building probabilistic models and applying Bayesian inference. It has been developed in 2017 by some students and teachers of Tsinghua University.

To have Zhusuan installed, please refer to the official Github page: https://github.com/thu-ml/zhusuan

A basic tutorial on Zhusuan can be found here: https://zhusuan.readthedocs.io/en/latest/tutorials/vae.html

### Implementation

The model will be trained on the MNIST dataset which is composed by a training set of 60000 handwritten digits and a test set of 10000 handwritten digits. Each picture has a dimension of 28×28 greyscale pixels, thus the input and the output layers of the VAE will by composed by 784 neurons, while encoder and decoder will have a symmetric structure both containing two hidden layers of 500 neurons each. Finally, the latent variable *z*, output of the encoder and input of the decoder, is drawn from a normal distribution with the learnt mean and variance as parameters, while the output of the decoder is drawn from a simple Bernoulli distribution built on top of the last hidden layer.

Once the network is trained, it is also possible to generate new random samples of digits by running only the decoder and drawing *z* from a normal distribution with mean equal to 0 and variance equal to 1. Conversely, to generate specific digits, *z* will simply be generated by the encoder just by giving in input the desired digit we want to generate. In this post, we are going to implement both ways of digits generation.

By default, we have the latent variable *z* composed by 40 neurons, this means that each image will be compressed from 784 to 40 pixel representation before being regenerated by the decoder. Furthermore, we are not going to use a validation test, thus all the 60000 instances will be available for training divided in batches of 128 samples.

We are going to train the model for a total of 1000 epochs and save its paramaters every 100 epochs in order to continue training the model from the latest chekpoint in case we would need to stop it. Overall, the training process should not take more than three hours on Intel Core i5 CPU. Finally, once the training phase will be over, we will use the VAE to generate some handwritten digits.

**Important**: basic knowledge about Variational Autoencoders is required in order to better understand the following code. You can use the following post as a reference.

from __future__ import absolute_import from __future__ import print_function from __future__ import division import os import time import tensorflow as tf from six.moves import range import numpy as np import zhusuan as zs from examples import conf from examples.utils import dataset, save_image_collections # build the decoder @zs.meta_bayesian_net(scope="gen", reuse_variables=True) def build_gen(x_dim, z_dim, n, n_particles=1): bn = zs.BayesianNet() z_mean = tf.zeros([n, z_dim]) z = bn.normal("z", z_mean, std=1., group_ndims=1, n_samples=n_particles) h = tf.layers.dense(z, 500, activation=tf.nn.relu) h = tf.layers.dense(h, 500, activation=tf.nn.relu) x_logits = tf.layers.dense(h, x_dim) bn.deterministic("x_mean", tf.sigmoid(x_logits)) bn.bernoulli("x", x_logits, group_ndims=1) return bn # build the encoder @zs.meta_bayesian_net(scope="q_net", reuse_variables=True) def build_q_net(x, z_dim, n_z_per_x, std_noise=0): bn = zs.BayesianNet() h = tf.layers.dense(tf.cast(x, tf.float32), 500, activation=tf.nn.relu) h = tf.layers.dense(h, 500, activation=tf.nn.relu) z_mean = tf.layers.dense(h, z_dim) z_logstd = tf.layers.dense(h, z_dim) + std_noise bn.normal("z", z_mean, logstd=z_logstd, group_ndims=1, n_samples=n_z_per_x) return bn def main(): # Load MNIST dataset data_path = os.path.join(conf.data_dir, "mnist.pkl.gz") x_train, t_train, x_valid, t_valid, x_test, t_test = \ dataset.load_mnist_realval(data_path) x_train = np.vstack([x_train, x_valid]) x_test = np.random.binomial(1, x_test, size=x_test.shape) # Define model parameters x and z x_dim = x_train.shape[1] z_dim = 40 # how many samples to draw from the nornmal distribution n_particles = tf.placeholder(tf.int32, shape=[], name="n_particles") # input data to feed the variational x_input = tf.placeholder(tf.float32, shape=[None, x_dim], name="x") x = tf.cast(tf.less(tf.random_uniform(tf.shape(x_input)), x_input), tf.int32) # input batch size n = tf.placeholder(tf.int32, shape=[], name="n") # add random noise to the variance of the q_model so to # get more various samples when generating new digits with the decoder std_noise = tf.placeholder_with_default(0., shape=[], name="std_noise") # build the model (encoder) and the q_model (variational or decoder) model = build_gen(x_dim, z_dim, n, n_particles) q_model = build_q_net(x, z_dim, n_particles, std_noise) # create an instance of the variational with no observed variables variational = q_model.observe() # calculate ELBO lower_bound = zs.variational.elbo( model, {"x": x}, variational=variational, axis=0) cost = tf.reduce_mean(lower_bound.sgvb()) lower_bound = tf.reduce_mean(lower_bound) # optimize the ELBO optimizer = tf.train.AdamOptimizer(learning_rate=0.001) infer_op = optimizer.minimize(cost) # define training/evaluation parameters epochs = 1000 batch_size = 128 iters = x_train.shape[0] // batch_size # path to store the generated digits result_path = "results/vae_digits" # path to store checkpoints during training checkpoints_path = "checkpoints/vae_digits" # used to save checkpoints during training saver = tf.train.Saver(max_to_keep=10) save_model_freq = 100 # run the inference with tf.Session() as sess: sess.run(tf.global_variables_initializer()) # restore the model parameters from the latest checkpoint ckpt_file = tf.train.latest_checkpoint(checkpoints_path) begin_epoch = 1 if ckpt_file is not None: print('Restoring model from {}...'.format(ckpt_file)) begin_epoch = int(ckpt_file.split('.')[-2]) + 1 saver.restore(sess, ckpt_file) # begin training for epoch in range(begin_epoch, epochs + 1): time_epoch = -time.time() np.random.shuffle(x_train) lbs = [] for t in range(iters): x_batch = x_train[t * batch_size:(t + 1) * batch_size] # apply gradient descent _, lb = sess.run([infer_op, lower_bound], feed_dict={x_input: x_batch, n_particles: 1, n: batch_size}) lbs.append(lb) time_epoch += time.time() print("Epoch {} ({:.1f}s): Lower bound = {}".format( epoch, time_epoch, np.mean(lbs))) # save model parameters if epoch % save_model_freq == 0: print('Saving model...') save_path = os.path.join(checkpoints_path, "vae.epoch.{}.ckpt".format(epoch)) if not os.path.exists(os.path.dirname(save_path)): os.makedirs(os.path.dirname(save_path)) saver.save(sess, save_path) print('Done') # random generation of images from latent distribution as N(0,1) # observe the variable "x_mean" which will contain the raw results x_gen = tf.reshape(model.observe()["x_mean"], [-1, 28, 28, 1]) # generate 100 random output samples images = sess.run(x_gen, feed_dict={n: 100, n_particles: 1}) name = os.path.join(result_path, "random_samples.png") save_image_collections(images, name) # generation of 100 samples for each digit # map each digit to a corresponding sample from the test set so we can generate similar digits test_n = [3, 2, 1, 90, 95, 23, 11, 0, 84, 7] # ex: instance 3 represents a 0, instance 2 represents a 1 for i in range(len(test_n)): # get latent distribution from the variational giving as input a fixed sample from the dataset z = q_model.observe(x=np.expand_dims(x_test[test_n[i]], 0))['z'] # run the computation graph adding noise to computed variance to get different output samples latent = sess.run(z, feed_dict={x_input: np.expand_dims(x_test[test_n[i]], 0), n: 1, n_particles: 100, std_noise: 0.7}) # get the output images from the decoder giving as input the latent distribution z x_gen = tf.reshape(model.observe(z=latent)["x_mean"], [-1, 28, 28, 1]) images = sess.run(x_gen, feed_dict={}) name = os.path.join(result_path, "{}.png".format(i)) save_image_collections(images, name) if __name__ == "__main__": main()

### Generated handwritten digits

The following digits have been generated by adding a noise of 0.7 to the variance. Because of the low noise, they look very similar to each other.

Now, we are going to add more noise to the output by adding 1.4 to the variance. In this way we are going to obtain more screwed and different digits. Basically, VAE is going to imitate human behavior: for a humans it is impossible to write many identical digits, but slight differences will occur each time we write them down.

You can play with the variance to see its impact on the output digits. Will your VAE behave like a human or like a machine?