Teaching AI to play Snake with Reinforcement Learning

It is well known that two of the most fascinating fields of computer science are gaming and artificial intelligence. The gaming field saw its origins back in the 1970s when gaming consoles such as Atari 2600, along with graphics on computer screens and home computer games were introduced to the general public giving birth to different kinds of arcade games like Pong and Pacman.

In the same period, computer scientists were beginning to get familiar with basic machine learning models such as simple neural networks and their training algorithms. It was only after the ascent of reinforcement learning that these two fields started getting along together pushing machines to outperform human level in different kinds of videogames ranging from mastering the simple game of Pong to beating the world champion of the more complex game of Go in 2016.

In this post, we are going apply a simplified version of the same reinforcement learning algorithm, also known as Deep Q-learning, to another very famous game: Snake.

You have probably already played or at least heard about this game. It is that classic videogame we can play on our smartphones and PCs which consists in moving a snake inside a specific area with the goal to increase its length by eating as much food as possible. Anyway, accidentally eating its own body or hitting a wall would result in the death of the snake, thus resetting its length.

Gameplay of the game of Snake

To make it more interesting, we are going to enhance the rules of this game by adding the possibility to let two or more snakes compete in the same arena. Of course, the other rules will remain the same: eating its own body or other snakes’ body will result in its own death. In particular, we are going to create a competition between two snakes controlled by two different kinds of AI.

The green snake is moved by a hard-coded algorithm, that is, it uses a heuristic method to decide its move. On the other way, the yellow one is controlled by a reinforcement learning algorithm exploiting Deep Q-network to learn how to play through experience, thus improving itself by playing more games.

In the beginning, no rules about the game are given and the agent has no information about what to do. Its goal is to figure it out and elaborate a strategy to maximize the reward.

Two snakes compete with each other in order to grow longer. The green one is controlled by a hard-coded policy while the yellow one has been trained by a reinforcement learning algorithm

The game has been developed in Python and relies on the framework Pygame, a Python library that allows the development of games in a fairly easy way.

The size of the arena is 20×20 and is present only one food which spawns in a new random free position each time it gets eaten. Similarly, after a collision happens, a snake will spawn in a free position decided at random to prevent multiple chained collisions from happening on respawn.

The score of each snake is equal to its length and, in addition, for each game is recorded the highest score reached by each snake as well as its average score in order to facilitate the comparison between different algorithms.

Hard-coded policy

The green snakes adopt a hard-coded policy. This means that once implemented, the algorithm will drive the snake across the arena without performing any learning process.

The algorithm works by first calculating a value score for each of the four possible directions the snake can move (right, left, up, down) and then choosing the action with the highest value. In order to calculate these scores, the snake simulates moving one step toward each direction and returns a score based on the outcome of that movement:

  • -1000 if collision against a wall or the body of another snake;
  • 1000 – distance to the nearest food otherwise.

Thus, a movement leading to a collision it’s very likely to be discarded while will be preferred those actions that decrease the distance between the head of the snake and its food target.

A snake controlled by this algorithm can achieve a highest score of over 62 points and an average of 35 calculated over a session of 100 games.

Hard-coded policy-controlled snake achieving a score of 62 points

The policy can be further improved by expanding the hard-coded algorithm with a dynamic algorithm such as minimax in order to calculate the score of each action by deeper exploring the tree of the state of the game. Nevertheless, this strategy hasn’t been followed since it would significantly slow down the speed of the game.

Deep Q-learning policy

As already mentioned, the yellow snake is controlled by a reinforcement learning algorithm. Similarly to the hard-coded policy, each state of the game is associated with a reward, which can be positive or negative and the algorithm has to learn which actions can maximize the reward and which ones have to be avoided.

To have a high-level understanding of how the agent takes decisions, it’s important to introduce the concept of Q-table.

A Q-table is a matrix that correlates the state of the agent with the possible actions that the system can adopt. The values in the table are the probability of choosing an action based on the rewards it got during the training.

Example of Q-table

The goal of the algorithm is to update the Q-values according to the Bellman equation:

Q_{new}(s,a)=Q(s,a)+\alpha[R(s,a)+\gamma maxQ'(s',a')-Q(s,a)]

Where Qnew(s,a) is the new value of the Q-table for the state s and action a, α is the learning rate, γ is the discount factor, R(s,a) is the reward, and Q'(s‘,a‘) is the maximum predicted reward given the new state s‘ and all the possible actions a‘. In practice, a neural network, also known as Deep Q-network, is used to calculate the values Q(s,a).

The network has a learning rate α of 0.0005 and is composed of three dense layers of 120 neurons followed by three dropout layers to optimize generalization and reduce overfitting. The output is a 4 neurons softmax layer outputting a probability value for each possible direction.

The state s is a simplified representation of the environment and can be seen as a binary array of 12 values. It takes into consideration:

  • If there is a danger on the left, right, up, and down with respect to the position of the head of the snake (4 values);
  • The direction of the snake (4 values);
  • The relative position of the food with respect to the head of the snake (4 values).

Thus, the input layer of the network is composed of 12 neurons ready to receive as input the state s. The loss is a minimum squared error function defined as:

L=(R(s,a)+\gamma maxQ'(s',a')-Q(s,a))^2

Therefore, the training process works by trying to minimize the loss by reducing the difference between the real target value and the predicted one. As for the reward, it is computed in the following way:

  • +10 if the snake has eaten a piece of food;
  • -10 if the snake hits a wall or another snake.

In this way, the reward is defined to characterize the behavior of a snake. For example, adding a small reward for each step could give the snake a more cautious nature at the cost of risking causing it to run in a circle without eating food.

On the other hand, an increased reward for eating food could make the snake more aggressive, thus focusing on rushing toward the food with the risk of colliding more often with the environment’s obstacles.

Finally, the algorithm works according to the following workflow:

  • In the beginning, the network is randomly initialized. It corresponds to having a Q-table randomly initialized;
  • The system gets the current state s;
  • The agent executes an action a randomly or based on the result of the neural network. A parameter Ɛ balances the trade-off of exploration and exploitation favoring the exploration during the earlier stages of training and the exploitation as the network gets more experienced;
  • After the snake performs its moves, the system collects the value of the new state s‘ and executes a learning step to update the weights of the network according to the above-defined loss function, thus updating the virtual Q-table. During this phase, the system also computes the tuple formed by the old state s, the action a, the reward r, the new state s’, and a boolean variable indicating whether the snake has died or not and saves it in the replay memory;
  • Every time the snake dies, it executes a learning phase by sampling random tuples from the replay memory in order to let the agent learn from its past actions.

This agent has been trained for a total of 300 games or episodes. The chart below shows the average score achieved by the agent during its training process. The number of episodes is represented on the x-axis while the y-axis shows the score obtained during each episode.

Episode-Score chart of the training phase

During testing, the Ɛ parameter is set to zero, thus preventing the agent from executing random actions. On a set of 100 games, it achieved a highest score of 52 points and an average score of 27,5 points. Hence, this algorithm performs worse than the hard-coded one. Nevertheless, the agent has been trained with just a simplified version of the full Deep Q-learning algorithm.

More recent advances such as the Deep Q-learning algorithm updated with the features introduced in the Rainbow algorithm such as the Double and Dueling Deep Q-networks or Multi-step learning can bring significant improvement to the agent’s performance.

Let’s conclude the section by visualizing the yellow snake controlled by the Deep Q-learning algorithm in a testing game.

RL policy controlled snake achieving a score of 52 points

Snake vs Snake

Now comes the funny part, as announced in the introduction, a hard-coded policy-controlled snake (green) and an RL policy-controlled snake (yellow) are going to compete in the same arena. Who is going to win?

In this game, the yellow snake performs better than the green one
In this second game, the green snake gets its revenge
In this last game, the competition is very intense, it’s not clear who is prevailing

Overall, the green snake performs slightly better than its yellow opponent. If running over enough episodes, it scores better on both the single and double arena. Moreover, both AIs have a large margin of improvement and would be very interesting to repeat their challenge once both algorithms will be upgraded with the above-mentioned features.

It’s very likely that a snake implementing the minimax algorithm would move according to an almost perfect policy but it’s important to remark that also Deep Q-learning is able to achieve super-human skills as it happened for the game of Go and Honor of Kings.

Find more on


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s