If I asked you “do you like anime characters?”, most of you would probably answer “yes”, and some of you would even admit that anime was part of your childhood. Although most people, regardless of their age, enjoy watching anime, only a few can actually draw the characters from scratch, and even fewer have mastered this skill to the extent of creating their own characters and turning them into pop-culture stars, such as the well-known trio formed by Naruto Uzumaki, Son Goku and Monkey D. Luffy. Unless your name is Masashi Kishimoto, Akira Toriyama or Eiichiro Oda, drawing anime characters is a very complex task for a human, so, as we did in many other posts, we are going to build a machine learning model that can solve this hard task for us. We would like to implement a model that learns how to draw them and improves with experience, trying to close the gap between its own generated anime characters and the ones drawn by those veteran Japanese artists. Will our new model ever be able to rival those Gods of Anime?
For this purpose, we are going to implement a more advanced version of GAN called DCGAN (Deep Convolutional GAN), which is able to generate new faces of anime characters from nothing but random noise after being trained on real ones for a couple of hours. Before plunging into the details of DCGAN, let’s briefly review what a GAN consists of.
GAN: Generative Adversarial Network
GAN (Generative Adversarial Network) is a kind of generative model proposed by Goodfellow and his colleagues in 2014. It has shown impressive results in image generation, image translation, super-resolution and many other generation tasks in computer vision. The essence of a GAN can be summarized as training a generator model and a discriminator model simultaneously: the discriminator tries to distinguish real examples, sampled from the ground-truth images, from the samples produced by the generator, while the generator tries to produce samples realistic enough that the discriminator cannot tell them apart from the ground-truth ones. This idea is expressed as an adversarial loss that is applied to both generator and discriminator during training, and which effectively encourages the outputs of the generator to follow the original data distribution. A very gentle and effective introduction to GAN can be found in this post.
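To make the adversarial loss a little more concrete, here is a minimal sketch based on plain binary cross-entropy. It is not the code used for this post: the function names are hypothetical, and it assumes the common "non-saturating" formulation in which the generator is rewarded when the discriminator labels its samples as real.

```python
import math

def binary_crossentropy(label, prob, eps=1e-7):
    """Binary cross-entropy for a single prediction (prob is clipped for safety)."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))

def discriminator_loss(d_real, d_fake):
    """d_real / d_fake: discriminator outputs (probability of "real")
    for real samples and for generated samples respectively."""
    real_term = sum(binary_crossentropy(1.0, p) for p in d_real) / len(d_real)
    fake_term = sum(binary_crossentropy(0.0, p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """The generator wants its samples to be classified as real (label 1)."""
    return sum(binary_crossentropy(1.0, p) for p in d_fake) / len(d_fake)

# When the discriminator is confident and correct, its own loss is low
# while the generator's loss is high: the two objectives pull in opposite directions.
print(discriminator_loss([0.9, 0.8], [0.1, 0.2]))
print(generator_loss([0.1, 0.2]))
```

The two losses are minimized by different models, which is what makes the training adversarial.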
DCGAN: Deep Convolutional GAN
Just like anime characters, machine learning models evolve too. After its birth in 2014, GAN evolved into DCGAN (Deep Convolutional Generative Adversarial Network) in 2016. Despite their simplicity, GANs are known to be unstable to train, often resulting in generators that produce nonsensical outputs. To mitigate this problem, DCGANs place some constraints on the architecture of the original GANs, building them around convolutional layers in order to make training more stable:
- Replace any pooling layers with strided convolutions in the discriminator for down-sampling and fractional-strided convolutions (transposed convolutions) in the generator for up-sampling. This lets the generator and the discriminator learn how to do up-sampling and down-sampling respectively, instead of doing it deterministically with spatial pooling layers such as max-pooling. To understand how strided convolutions work, here is another post introducing convolutional layers as well as pooling layers.
- Use batch-normalization after each layer in both the generator and the discriminator, except for the output layer of the generator and the input layer of the discriminator. Batch-normalization stabilizes learning in most networks by normalizing the input of each unit to have zero mean and unit variance. This important trick has many benefits, such as mitigating the training problems that arise from poor initialization and helping the gradient flow in deeper models.
- Remove fully connected hidden layers for deeper architectures. This modification has proven to give better results than leaving fully connected layers on top of the convolutional ones, with training convergence speed benefiting the most. A flatten layer is still present at the top of the discriminator: it flattens the output of the last convolutional layer, which is then fed into a single sigmoid output to get the probability value of the discriminator’s prediction.
- Use ReLU activation in the generator for all layers except for the output of the generator, which uses Tanh.
- Use Leaky-ReLU activation in the discriminator for all layers.
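As a rough illustration of the building blocks named in the list above, the sketch below implements batch normalization and the activation functions on plain Python lists. It is a simplified picture under stated assumptions: real batch-normalization layers also learn a scale and a shift and keep running statistics, which are omitted here, and the leak slope of 0.2 is the value used later in this post.

```python
import math

def batch_norm(values, eps=1e-5):
    """Normalize one unit's activations over a batch to zero mean, unit variance.
    (The learned scale/shift of a real batch-norm layer is omitted.)"""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def relu(x):
    """Generator hidden layers: negative inputs are set to zero."""
    return max(0.0, x)

def leaky_relu(x, slope=0.2):
    """Discriminator layers: negative inputs are scaled down instead of zeroed."""
    return x if x > 0 else slope * x

batch = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(batch)
print(normalized)                                # zero mean, unit variance
print([relu(v) for v in normalized])             # generator hidden activation
print([leaky_relu(v) for v in normalized])       # discriminator activation
print([math.tanh(v) for v in normalized])        # generator output, in (-1, 1)
```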
Training with DCGAN
DCGANs are trained in the same way as the original GANs. This experiment uses a dataset containing tens of thousands of anime face images scraped from www.getchu.com, which are then cropped using an anime face detection algorithm and resized to 64×64 pixels. For more information about how to collect the dataset you can refer to this link.
The only pre-processing applied to the training images consists of scaling them to the range of the Tanh activation function, [-1, 1]. The model is then trained using the binary cross-entropy loss function and the Adam optimizer with a learning rate of 0.00015 and a momentum term β1 of 0.5, since this results in more stable training. All weights are initialized with a Glorot normal distribution and the slope of the leak of the Leaky-ReLU layers is set to 0.2. Images have a size of 64x64x3 pixels in RGB format and the model works with batches of 64 images, so each batch has a shape of 64x64x64x3. The input of the generator is a noise vector of shape 1x1x100 sampled from a normal distribution, while its output is a 64x64x3 generated image. Conversely, the discriminator takes an image as input and, after down-scaling it through several convolutional layers, uses a single output neuron with a sigmoid activation function to return a probability indicating whether the image is fake or real.
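The pre-processing and the generator input can be sketched in a few lines. This is only an illustration with hypothetical helper names: it assumes the images are stored as 8-bit values in [0, 255] and that the noise comes from a standard normal distribution (the post only says "normal").

```python
import random

def scale_to_tanh_range(pixel):
    """Map an 8-bit pixel value in [0, 255] to the Tanh range [-1, 1]."""
    return pixel / 127.5 - 1.0

def scale_to_pixel_range(value):
    """Inverse mapping, useful to display the images produced by the generator."""
    return int(round((value + 1.0) * 127.5))

def sample_noise(batch_size=64, latent_dim=100, seed=None):
    """One batch of generator inputs: batch_size vectors of latent_dim values."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
            for _ in range(batch_size)]

print(scale_to_tanh_range(0), scale_to_tanh_range(255))   # -1.0 1.0
noise = sample_noise(seed=0)
print(len(noise), len(noise[0]))                          # 64 100
```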
The following two tables report the structure of the transposed convolutional layers of the generator and of the convolutional layers of the discriminator, together with their parameters. Given their simple structure, the other layers have been omitted, but the model still follows the typical DCGAN architecture described in the previous section.
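Generator:

| Layer | Filters (kernel size) | Strides | Padding | Output shape |
|---|---|---|---|---|
| TConv1 | 512 (4×4) | (1, 1) | valid | 4x4x512 |
| TConv2 | 256 (4×4) | (2, 2) | same | 8x8x256 |
| TConv3 | 128 (4×4) | (2, 2) | same | 16x16x128 |
| TConv4 | 64 (4×4) | (2, 2) | same | 32x32x64 |
| TConv5 | 3 (4×4) | (2, 2) | same | 64x64x3 |

Discriminator:

| Layer | Filters (kernel size) | Strides | Padding | Output shape |
|---|---|---|---|---|
| Conv1 | 64 (4×4) | (2, 2) | same | 32x32x64 |
| Conv2 | 128 (4×4) | (2, 2) | same | 16x16x128 |
| Conv3 | 256 (4×4) | (2, 2) | same | 8x8x256 |
| Conv4 | 512 (4×4) | (2, 2) | same | 4x4x512 |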
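The output shapes listed above can be checked with a little arithmetic. The sketch below assumes the usual "same"/"valid" padding conventions found in Keras and TensorFlow; the helper names are hypothetical and only the spatial size (height = width) is tracked.

```python
import math

def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a strided convolution."""
    if padding == "same":
        return math.ceil(size / stride)
    return math.ceil((size - kernel + 1) / stride)   # "valid"

def transposed_conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a transposed (fractional-strided) convolution."""
    if padding == "same":
        return size * stride
    return size * stride + max(kernel - stride, 0)   # "valid"

# (kernel, stride, padding) for each layer, as listed in the tables.
generator_layers = [(4, 1, "valid")] + [(4, 2, "same")] * 4
discriminator_layers = [(4, 2, "same")] * 4

size = 1                      # the generator starts from 1x1x100 noise
gen_sizes = []
for kernel, stride, padding in generator_layers:
    size = transposed_conv_output_size(size, kernel, stride, padding)
    gen_sizes.append(size)
print(gen_sizes)              # [4, 8, 16, 32, 64]

size = 64                     # the discriminator starts from a 64x64x3 image
disc_sizes = []
for kernel, stride, padding in discriminator_layers:
    size = conv_output_size(size, kernel, stride, padding)
    disc_sizes.append(size)
print(disc_sizes)             # [32, 16, 8, 4]
```

Each stride-2 layer doubles the resolution in the generator and halves it in the discriminator, which is how the two networks mirror each other.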
After defining its architecture, the model was trained on the anime face dataset for a total of 20000 epochs. Each epoch alternates two phases that train the discriminator and the generator separately, temporarily freezing the weights of the model not involved in the learning step. This is equivalent to a zero-sum game in which both players, discriminator and generator, have to improve their own strategy in order to succeed in their task and win over their opponent.
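The alternating scheme can be illustrated with a deliberately tiny toy problem. The sketch below is not the DCGAN used in this post: it assumes a one-dimensional "dataset" drawn from a normal distribution with mean 4, a linear generator G(z) = a*z + b, a logistic discriminator D(x) = sigmoid(w*x + c), and plain gradient descent instead of Adam. All names are hypothetical. What it does show is that each phase updates only one of the two models while the other stays frozen.

```python
import math
import random

def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def discriminator_step(disc, x_real, x_fake, lr):
    """Phase 1: update only the discriminator (w, c); the generator is untouched."""
    w, c = disc
    p_real = sigmoid(w * x_real + c)
    p_fake = sigmoid(w * x_fake + c)
    # Gradients of -log D(x_real) - log(1 - D(x_fake)) with respect to w and c.
    grad_w = (p_real - 1.0) * x_real + p_fake * x_fake
    grad_c = (p_real - 1.0) + p_fake
    return (w - lr * grad_w, c - lr * grad_c)

def generator_step(gen, disc, z, lr):
    """Phase 2: update only the generator (a, b); the discriminator is frozen."""
    a, b = gen
    w, c = disc
    p_fake = sigmoid(w * (a * z + b) + c)
    # Gradients of -log D(G(z)) with respect to a and b, propagated through D.
    grad_a = (p_fake - 1.0) * w * z
    grad_b = (p_fake - 1.0) * w
    return (a - lr * grad_a, b - lr * grad_b)

def train(steps=2000, lr=0.01, seed=0):
    rng = random.Random(seed)
    gen, disc = (1.0, 0.0), (0.0, 0.0)
    for _ in range(steps):
        x_real = rng.gauss(4.0, 1.0)                 # toy "real" sample
        x_fake = gen[0] * rng.gauss(0.0, 1.0) + gen[1]
        disc = discriminator_step(disc, x_real, x_fake, lr)
        gen = generator_step(gen, disc, rng.gauss(0.0, 1.0), lr)
    return gen, disc

gen, disc = train()
print("generator (a, b):", gen)
print("discriminator (w, c):", disc)
```

In a deep learning framework the same effect is usually obtained by marking the discriminator as non-trainable while the generator is being updated.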
In this section we show the results obtained after 20000 epochs of training. To show that the model is actually learning, some images were sampled from the generator at different stages of the training process, so as to compare the model’s improvements as well as its convergence speed. The image above shows some samples generated after 200, 600, 1000, 2000, 4000 and 20000 epochs respectively.
As we can see, the model improves very quickly during the first thousand epochs. During the first few hundred epochs it only manages to generate a few indistinct shapes, and it keeps drawing better and better faces as the training goes on. After that, its convergence speed gradually decreases, so that after about 10000 steps there seems to be little or no improvement. Due to their nature, comparing the performance of GAN models is also a rather ill-defined task. At the beginning, visual feedback from human observers about the quality of the final results was used as a metric to measure the efficacy of this kind of model. Nevertheless, since software engineers would laugh at such a subjective metric, other more formal metrics have recently gained popularity in this field, such as the distance between a generated image and the nearest real image, or the use of another model intentionally designed to measure the quality of the generated images.
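As an example of the first kind of metric, the sketch below computes the distance between a generated image and its nearest neighbour among the real images. It assumes that images are flattened into lists of numbers and that the Euclidean distance is used; the post does not say which distance is meant, so both are illustrative choices, and the tiny 4-pixel "images" are made up for the demonstration.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two flattened images of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_real_distance(generated, real_images):
    """Distance from a generated image to the closest real image."""
    return min(euclidean_distance(generated, real) for real in real_images)

# Made-up 4-pixel "images", scaled to [-1, 1] like the training data.
real_images = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
generated = [0.9, 1.0, 1.0, 1.0]
print(nearest_real_distance(generated, real_images))
```

A very small distance is not necessarily good news: it may simply mean that the generator is reproducing training images.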
In the next picture we can see a batch of 64 faces generated after 20000 training steps.
Considering that those pictures have been generated out of random noise, we can’t deny that our DCGAN did a good job. Still, these results are far from comparable to the masterpieces drawn by the Japanese artists. Nevertheless, some of the generated faces above are passable and, if the resolution had been a little higher, we could pretend that some of them were drawn by amateur anime artists.
However, just like anime characters, machine learning models have more than one transformation. Improved versions of GAN such as DRAGAN and ACGAN have already shown that the gap between machine-generated anime faces and real ones could soon be closed and, why not, maybe in the near future machines will even outperform those Gods of Anime.
“Let’s see, I wouldn’t be so sure. I can beat Gods as well.” (Goku Super Saiyan Blue)