Teaching AI to play Snake with Genetic Algorithm

Supervised learning, unsupervised learning, and reinforcement learning are commonly recognized as the three main ways to train machine learning models. We can have a fourth one if we include the union of the first two, that is, semi-supervised learning.

However, in this post, we are going to introduce an alternative algorithm that can be used to both train and optimize neural network models: Genetic Algorithm. In particular, we are going to use it to train from scratch a new model and teach it how to play the game of Snake.

Finally, we are going to make it compete against other two snakes controlled by two different AIs developed in a previous post, namely, a hard-coded AI and a deep reinforcement learning agent.

Three snakes competing in the same area controlled by three different AI: hard-coded policy (green), deep reinforcement learning agent (yellow), a genetic algorithm trained neural network (blue)

Genetic algorithm

One general law, leading to the advancement of all organic beings, namely, multiply, vary, let the strongest live and the weakest die.

Charles Darwin

Genetic algorithm is a meta-heuristic inspired by the process of natural selection that belongs to the larger class of evolutionary algorithms (EA).

Those kinds of algorithms are often used to generate high-quality solutions to optimization and search problems by simulating the process of natural selection, which means that only those species that can adapt to changes in their environment are able to survive and reproduce and go to the next generation.

Each generation consists of a population of individuals, that in our case are neural networks, each having its own weights, thus each representing a different solution.

Considering the analogy between genetic algorithms and natural selection, we can compare the weights of each individual with the genetic structure of a chromosome which mutates generation by generation in order to give birth to better individuals.

A high-level generalization of the workflow of the algorithm can be summarized in the following 6 steps:

  1. Set the structure of the network representing an individual;
  2. Generate the population of individuals by filling their weights matrices with random numbers between -1 and 1;
  3. Set the fitness function, that is, a function that allows calculating the performance of each individual;
  4. Play a game for each individual in the population and compute their fitness function score;
  5. Select some top individuals from the population and create the next generation from these top-selected individuals using crossover and mutation techniques. The algorithm works by exploiting these parents to create a new offspring that is better than each of the parents, thus resulting in the next generation being more suited for their environment;
  6. Go to step 4 and repeat until the stopping criteria are not satisfied.

In the next sections, we are going to give a more detailed explanation for each step.

Network structure

The network is composed of a 12 neurons input layer followed by three dense layers of 120 neurons each. The output is a 4 neurons softmax layer outputting a probability value for each possible direction the snake can move. The state of the environment is represented as a binary array of 12 values which takes into consideration:

  • If there is a danger on the left, right, up, and down respect the position of the head of the snake (4 values);
  • The direction of the snake (4 values);
  • The relative position of the food with respect to the head of the snake (4 values).

Important: in order to simplify the representation of the weights matrix, the bias of the network’s layers has been kept fixed at zero.

We can then begin by creating 50 networks with the given structure, where each network will correspond to an individual in the population with its own weights randomly initialized according to random values between 1 and -1.

Having a bigger population would increase the chance to have smarter snakes, but it would cost more memory required to store them and more time to evaluate the performance of each individual.

This concept is very similar to how the size of the human population reflects the speed of our research, in fact, having more people means having more brains that might invent something new, thus advancing our technology level or enriching our culture.

Fitness function

The fitness function is a function that calculates the overall performance of an individual in the population. This is a very crucial phase in the algorithm since the way the fitness function is calculated will define the behavior of the snakes.

A wrong one may not improve the performance of the individuals over the next generations while a good one could greatly speed up the training process, thus creating good individuals with few generations as well as achieving higher scores. After many attempts, a good fitness function resulted to be:

F=record*5000-deaths*150-avg\_steps*100-penalties*1000

Where record is the highest score achieved by the individual, deaths is the number of times it died, avg_steps is the average number of steps between eating food, and penalties is the number of times the individuals ran for at least 200 steps without eating any food.

When this last condition is met, the snake automatically dies and spawns in a new random position in order to avoid that loops where a snake may keep running in a circle without eating any food or causing any collision.

This function tends to reward individuals achieving high scores, focusing on their target without running across unnecessary steps, and having a low number of deaths. A good alternative strategy could be one of using different fitness functions and swapping them when there is no progress over many generations.

For example, we could use a fitness function rewarding individuals with the least number of deaths for the first generations, therefore teaching them how to survive and then changing it to another function to train them to eat more food, thus achieving higher scores. Anyway, for this experiment has been used the standard fixed function defined above.

Selection, crossover, and mutation

The next phase consists in selecting the best individuals to give birth to the next generation. The idea is to give preference to those individuals with good fitness scores and allow them to pass their weights to successive generations.

Thus, the first step is running each snake for a fixed number of steps, 5000 in our case, and then calculating its fitness value. Running each individual for more steps would provide a more accurate fitness value at the cost of waiting for a longer time.

Once got these values, we choose the top 12 individuals in our population of 50 to create the next generation. Once again, the number of best individuals to choose in order to create the next offspring is a free parameter and can influence the results of the training process.

The next operation is the crossover which is used to give birth to the individuals of the next generation. First, two individuals are randomly selected from the best and they will be the father and the mother of a new individual of the next generation.

To generate it, we first create an empty matrix that will be used to store its weights. Then, we draw a number between 0 and 1 for each of its weights and use them to decide from which parent inherits the corresponding weight.

For example, if we obtain a number lower than 0.5, we will take the weight from the father, otherwise, we will inherit it from the mother.

Crossover operation: a mother (pink) and a father (blue) generate a new individual inheriting part of their weights

This process is repeated until the total population size for the next generation is not achieved.

The last step of this phase is mutation. The key idea is to insert random genes in the offspring to maintain the diversity in the population. For each child, some are randomly selected some weights and then mutated by replacing them with a random value between -1 and 1.

Mutating more weights would slow down the convergence but increase the chances to obtain smarter individuals.

Mutation operation: a weight (red) of an individual is randomly changed

A grain in the balance will determine which individual shall live and which shall die, which variety or species shall increase in number, and which shall decrease, or finally become extinct.

Charles Darwin

Training results

After running the genetic algorithm over 100 generations, composed of 50 individuals each, we achieve an average fitness value over the last generation of 185275 points, an average number of deaths equal to 18.7, and an average score of 15.6 (the score is the same as the number of foods eaten).

If we consider only the best snake in the generation, it achieved a fitness value of 244850 points, an average score of 23.23, and a max score of 55. In the video below we can see the same snake obtaining a score of 51 in a test run.

Agent trained with genetic algorithm achieving a score of 51

Following, are shown four charts relative to some statistics calculated during training.

Fitness value achieved by the best snake in each generation

From the above line chart is interesting noticing that the best fitness score has been achieved by a snake in generation 80 with a total of 285400 points.

Score achieved by the best snake in each generation

Two snakes, one in generation 60 and one in 80 (the same one who achieved the highest fitness score), obtained a max score of 58. Nevertheless, their average scores are 23 and 22.3 respectively, which are not the highest values for that metric.

Thus, those particularly high scores might have just been obtained because of lucky runs.

Average score achieved by the snakes in each generation (blue) and average score achieved by the best snake in each generation (yellow)

With 30.6 points, a snake in generation 98 achieves the highest average score. Such a high value makes it perform even better than the other two AIs.

Overall, by also looking at the other charts, all metrics in this generation result to be very high with most of them close to the top. Therefore, generation 98 is likely to be the most performing one.

Average number of deaths in each generation

By observing all the charts, we can see that the generations stopped improving after generation 60. Nevertheless, a better fitness function and a more complete representation of the environment’s state would further push up the performance of the snakes and eventually keep improving it over more generations.

Snake vs Snake vs Snake

Now that our new snakes have finished evolving, it’s time to start the showdown. As announced in the introduction, a hard-coded policy-controlled snake (green), a reinforcement learning (RL) policy-controlled snake (yellow), and our new genetic algorithm (GA) trained neural network snake (blue) are going to compete in the same arena.

For the latter, we are going to select the snake with the highest fitness score from the last generation. Who is going to be the champion?

metricGreen Snake (Hard-coded)Yellow Snake (RL)Blue Snake (GA)
Average score3527.523.25
Max score625255
Average score and highest score achieved by the three different AI over 100 games
Hard-coded policy snake (green) vs GA snake (blue)
RL snake (yellow) vs GA snake (blue)
Hard-coded policy snake (green) vs RL snake (yellow) vs GA snake (blue)

Genetic algorithm is an algorithm that heavily relies on chance, thus we may need to train many generations before obtaining a good one. Indeed, when playing together, the blue snake is outperformed by both the yellow and red ones indicating that a snake trained with this algorithm performs worse than more traditional ones such as RL or dynamic programming.

Nevertheless, genetic algorithms can still be used as optimization techniques to further improve already trained models. For example, a worthwhile experiment would be applying this algorithm to the weights of the DQN agent, the one controlling the yellow snake, instead of starting from scratch with randomly initialized weights and see if it can achieve better results, possibly surpassing the performance of the hard-coded policy one (green).

Find more at

References

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s