Introduction to Deep Reinforcement Learning

Deep Reinforcement Learning is the result of the combination of two well-known machine learning approaches: Deep Learning and Reinforcement Learning. Its main goal is the one to create a single agent able to handle any human-level task but achieving super-human results on it. A famous AI implementing this technique is AlphaGo that, in March 2016, defeated for the first time in the history a 9-dan Go player, Lee Sedol, for 4 games to 1 playing against him without handicaps. One year later, it’s evolution, AlphaGo Master even managed to defeat Ke Jie who at the time of the match was ranked first among all human players worldwide.

Before moving on, let’s briefly review what reinforcement learning is.

Reinforcement Learning

Reinforcement learning is a machine learning technique that addresses the problem of an agent learning to act in an environment in order to maximize a scalar reward signal. Differently from how supervised machine learning works, the agent is never directly told the best action but has to learn it trials and accumulating experience. To make it clear, let’s use the Atari game Pacman to make an example.

Gameplay of Pacman

At the beginning of the game, Mr.Pacman doesn’t have any knowledge about the environment surrounding it: he has no idea that some sneaky ghosts are wandering around the map waiting to steal his life, he not even knows that eating points along the path his its main way to increase his score. Even more mysterious for him would be the four power-ups located at the edges of the map used to scare the ghosts and gain temporal invincibility against them. After the first episode, by randomly walking around the maze he casually discovers eating points makes him raise his score, thus, he learns that eating them is a good action and increases the likelihood to take similar action in the future. After a dozen steps, Mr.Pacman casually happens into one of the ghosts, therefore, resulting in inevitable death. Our agent has just learned that running up against ghosts is a bad action and from now on, he will try not to repeat the same mistake again. After several trials, Mr.Pacman will have hopefully learned the rules of the game and might has become clever enough to play resembling human players. The questions now are: can he do better than this? Can he achieve super-human skills?

Ghosts chasing Mr.Pacman

To find the answers, let’s deep into the magic world of Deep Reinforcement Learning, and let’s see how can we turn our naive Mr.Pacman into an insatiable dots machine. Even ghosts will fear him!

Mr.Pacman chasing ghosts


During training, the agent (that in our example is Mr-Pacman) is provided with an observation St of the environment (literally a screenshot from the game) for each time step t=0,1,2,… and it has to respond with an action At (move left, right, up, down). Afterward, the environment provides the next reward Rt+1 (the score after moving in the chosen direction), the next state St+1 (another screenshot from the game), and a discount factor γt+1. These five components constitute a tuple {S, A, T, t, γ} where:

  • S is a finite set of states.
  • A is a finite set of actions.
  • T(s, a, s‘) = P[St+1=s‘ | St=s, At=a] is the stochastic transition function. It represents the probability of moving to state s‘ performing action a while on state s.
  • r(s, a) = E[Rt+1| St=s, At=a] is the reward function. It is the expectation of all the possible rewards obtainable by performing action a while on state s.
  • γ ∈ [0, 1] is a discount factor that trades off the importance of immediate and later rewards.

As regards to the agent, actions are selected according to a policy π (behavior of the agent) that defines a probability distribution over actions for each state (probability to choose any of the moves according to the current situation). So, starting from state St encountered at time t, we define the discounted value as:


Gt represents the discounted sum of future rewards collected by the agent starting from a certain state. The discount factor γ for a reward k steps in the future is given by the product of the discounts that occur before that time:


The discounted factor used to give more importance to the immediate reward and gradually reduce the contribution of future rewards.

I can see the future…


The agent’s goal is to maximize the expected discounted value by finding a good policy. In value-based reinforcement learning, the agent learns an estimate of the expected discounted value when following a policy π and start exploring from a given state s. The discounted value state vπ obtained under policy π (the expected score Mr.Pacman will achieve starting from state s and keep playing according to a certain policy π) is then defined as:


Intuitively, the value function v measures how good it is to be in a particular state s. The agent can also learn the value of a state-action pair, known as Q-value (the expected score Mr.Pacman will achieve starting from state s, taking a fixed action a and then keep playing according to a certain policy π), which is computed according to the following Q-function:


The Q function measures the value of choosing a particular action when in a particular state. The above Q-function can also be computed using the following recursive formula:

Q^\pi(s,a)=E_{s'}[r+\gamma E_{a'\sim \pi(s')}[Q^\pi (s',a')]|s,a,\pi]

A common way of deriving a new policy from a state-action value function is to act ε-greedily with respect to the action values. This corresponds to taking the action with the highest value (the greedy action) with probability (1−ε), and otherwise to act uniformly at random with probability ε. Policies of this kind are used to introduce a form of exploration: by randomly selecting actions that are sub-optimal according to its current estimates, the agent can discover new strategies and correct its estimates when appropriate.

The optimal Q-value is then Q*(s,a)=maxπQπ(s,a) that is the best Q-value we can get from state s and action a assuming following an optimal policy. Given the nature of the two functions v and Q, we have that under the deterministic policy a=argmaxa’∈AQ(s,a‘) it results that v(s)=maxaQ(s,a). From this, it also follows that the optimal Q function satisfies the Bellman equation:

Q^*(s,a)=E_{s'}[R+\gamma max_{a'}Q^*(s',a')|s,a,\pi]

That is, for each state the agent keeps playing always choosing the best possible action. Thus, Q(s,a) is the policy we would like our agent to be able to approximate. To do so, we will introduce a new kind of network called Deep Q-network.

…And it allows me to select the best possible move.


Deep Q-networks

Even though we could use dynamic programming to learn the Q-value for all possible state-action pairs, in practice, given the huge countless number of possible states and actions, it is intractable to learn Q-value estimates for each state-action pair independently. Thus, in deep reinforcement learning, we train a deep neural network, also known as deep Q-network (DQN) to learn an estimate of the Q-value Q(s,a,θ) where θ are the parameters of the network. Its architecture is composed of a first convolutional network which takes as input a state St and learns to detect increasingly abstract features from it. Surprisingly, the input states which is fed as input to the network are just raw pixels from the current frame on the game screen. Subsequently, a dense classifier maps the high-level features extracted from the convolutional network to an output layer with one neuron per action in order to approximate the corresponding action value.

Architecture of a Deep Q-network

The parameters of the network are trained by gradient descent to minimize some suitable loss function:

L_t(\theta_t)=(R_{t+1}+\gamma_{t+1} max_{a'}Q_{\theta^{-}}(S_{t+1},a')-Q_{\theta_t}(s_t,a_t))^2

where t is a time step randomly picked from the replay memory. The replay memory is a sort of dataset storing the agent’s past experiences, usually the last million ones, where a single experience is defined as the tuple (s, a, r, s’) where s indicates a state, a is the action taken on that state, r is the reward obtained and s’ represents the state that followed s when the agent took action a. In this way, the agent doesn’t learn by simply watching Mr.Pacman moving around the maze in a sequential way, but instead, it can sample random past experiences from its storage and replay them so that it can use these memories to learn how to take better actions in the future. The experience replay technique prevents the network from learning from strongly correlated experiences that break the i.i.d. (independent and identically distributed) assumption of many popular stochastic gradient-based algorithms as well as avoiding the rapid forgetting of possibly rare experiences that would be useful later on.

The gradient of the loss is then back-propagated only into the parameters θ of the online network (which is also used to select actions). The term θ represents the parameters of a target network, a periodic copy of the online network which is not directly optimized. The parameters of the online network are copied into the ones of the target network every τ steps and are kept fixed on all the other steps. Overall, the value calculated by the target network has more accuracy since it has access to at least the first reward term to calculate the Q-value, thus, the above formula can be thought of as the difference between the true Q-value and a current estimation of it. Therefore, the specific gradient update formula is:

\nabla_{\theta_t}L_t(\theta_t)=(R_{t+1}+\gamma_{t+1} max_{a'}Q_{\theta^{-}}(S_{t+1},a')-Q_{\theta_t}(s_t,a_t))\nabla_{\theta_t}Q_{\theta_t}(s,a)

Those kinds of update rules are also known as backup operations because they transfer information back from future states to the current one.

This approach is model-free in the sense that the states and rewards are produced by the environment. It is also off-policy because these states and rewards are obtained with a behavior policy different from the online policy that is being learned (basically the network is learning from an old copy of itself).

Stronger than I was yesterday.


Double Deep Q-networks

In Deep Q-learning, we have the max operator uses the same values to both select and evaluate action. This makes it more likely to select overestimated values, resulting in overoptimistic value estimates. To prevent this, we can compute the real Q-value of the loss function by decoupling the selection of the action from the evaluation with a new technique called Double Q-learning:

L_t(\theta_t)=(R_{t+1}+\gamma_{t+1} Q_{\theta^{-}}(S_{t+1},argmax_{a'}Q_{\theta}(S_{t+1},a'))-Q_{\theta_t}(s_t,a_t))^2

In this way, we delegate the selection of the action to the online network θ and then use the argmax operator to select the best-estimated action. The target network will only be responsible to generate the Q value given the action decided by the online network. This method is particularly useful to avoid those situations where the discrepancy between the target network and online network causes them to recommend totally different actions given the same state, thus, reducing harmful overestimations and improving the performance.

Prioritized Experience Replay

In the original experience replay method, the DQN samples experiences uniformly from the replay buffer. Ideally, we want to sample more frequently those transitions from which there is much to learn. For example, some experiences might be more or less surprising, redundant, or task-relevant than others, moreover, there are some transitions that may not be immediately useful to the agent, but might become so later when the agent competence increases. We don’t need an expert to understand that our Mr.Pacman might learn faster by replaying those memories when surrounded by ghosts from multiple sides rather than when running along an empty path. Therefore, prioritized experience replay samples transitions with probability pt relative to the last encountered absolute TD error:

p_t\propto=|R_{t+1}+\gamma_{t+1} Q_{\theta^{-}}(S_{t+1},argmax_{a'}Q_{\theta}(S_{t+1},a'))-Q_{\theta_t}(s_t,a_t)|


The exponent α determines how much prioritization is used, with α = 0 corresponding to the uniform case, that is, how much importance we want to give to the sampling bias to be. We also need to introduce some importance sampling weights, which are meant to compensate for the fact that we are now intentionally feeding the network data that will cause abnormally large gradient updates and therefore changes the solution that the estimates will converge to:


In typical reinforcement learning scenarios, the unbiased nature of the updates is most important near convergence at the end of the training. We, therefore, exploit the flexibility of annealing the amount of importance-sampling correction over time, by defining a schedule on the exponent β that reaches 1 (full compensation) only at the end of learning. In practice, we linearly anneal β from its initial value β0 to 1. Finally, the formula to update the weights θ of the online network will result to be:


Examination of our past is never time-wasting.


Dueling Networks

We recall that vπ is the discounted value state obtained under policy π and is defined as:


and represents the value of being in a certain state. We now define another important quantity, the advantage function aπ, relating the value and Q function:


The advantage function subtracts the value of the state from the Q function to obtain a relative measure of the importance of each action. The idea behind the dueling architecture explicitly separates the representation of state values vπ and state-dependent action advantages aπ. The dueling architecture consists of two streams, one representing vπ, and the other one representing aπ, while sharing a common convolutional feature learning module. Finally, the two streams are combined via a special aggregating layer to produce an estimate of the state-action value function Q.

Deep Q-Network (top) and Dueling Network Architecture (bottom)

In the above picture, the red layers are separated fully connected ones while the rest of them are the shared convolutional layers. The dueling network should be understood as a single Q network with two streams that replaces the popular single-stream Q-network in existing algorithms such as Deep Q-networks. Intuitively, the goal of the dueling architecture is to learn which states are (or are not) valuable, without having to learn the effect of each action for each state.

Considering its structure, in a dueling network, we want one stream of fully-connected layers to output a scalar v(s;θ,β), and the other stream output an |A| dimensional vector a(s,a;θ,α). Here, θ denotes the parameters of the convolutional layers, while α and β are the parameters of the two streams of fully-connected layers. Using the definition of advantage, we might be tempted to construct the aggregating module as follows:


Unfortunately, this naive solution isn’t effective because the network isn’t encouraged to optimize V and A independently. To address this issue, we can force the advantage function estimator to have an advantage closer to zero at the chosen action so that Q(s,a) is approximately equal to v(s):


This alternative formula improves the network identifiability, in the sense that given Q we can recover v and a uniquely, thus, enhancing performance when the equation is used directly.

Multi-step Learning

Previously we defined the discounted value Gt as:


It represents the discounted sum of future rewards collected by the agent starting from a certain state s. In practice, this value is computed as:

G_{t:t+1}=R_{t+1}+\gamma Q_t(s_{t+1},a_{t+1})

We can clearly note that the above expression only involves the reward, state, and action taken in the next timestamp. Anyway, these methods can be generalized even further by bootstrapping over longer time intervals:

G_{t:t+n}=\sum_{k=0}^{n-1}\gamma^k R_{t+k+1}+\gamma^n Q_{t+n-1}(s_{t+n},a_{t+n})

Here Gt:t+n means that for the first n−1 states, actions and rewards are sampled according to the behavior policy with the discounted estimate γnQt+n-1(st+n,at+n) taking the place of the other future terms. These kinds of algorithms are also known as atomic multi-step algorithms because they take advantage of the bootstrapped sequences of transitions to improve the target value accuracy. Beside improving learning performance, multi-step algorithms with suitably tuned n often lead to faster learning.

Noisy Nets

In a previous section, we discussed the trade-off between exploration and exploitation, and we used a greedy solution to choose the best action with probability ε and a random action otherwise. The drawback of this approach is its reliance on a scheduled hyper-parameter ε, thus we don’t know how much is the right grade of exploration and testing a whole bunch of different values would be a time consuming and expensive process.

Instead, in Noisy Nets, we let the network learn the amount of exploration needed by itself. It works by perturbing its weights and biases by injecting some noise in the network itself. In a normal linear layer of a neural network, the output value is computed as:


where x is the layer input, w is the weight matrix and b is the bias. The corresponding noisy linear layer is defined as:


where µww⊙εw and µbb⊙εb replace w and b in respectively. The parameters µw, σw, µb, σb are learnable whereas εw and εb are zero-mean noise random variables following a Gaussian distribution and ⊙ represents element-wise multiplication. The usual loss of the neural network is wrapped by expectation over the noise ε:


but using a Monte Carlo approximation we can compute the gradients taking a single sample at each step of optimization.

As a result, the agent will consistently recommend actions that it didn’t intend to take, thus ensuring exploration and action decisions are only based on the highest Q value output. Because of the noise injection, the agent can learn to ignore this noise by gradually setting σ to 0 with a form of self-annealing. Nevertheless, this method tends to naturally stabilize above 0 ensuring at least some level of exploration for the duration of training (as it was for the case of the greedy policy).

Distributional RL

Normally. a DQN learns a scalar value representing how good it is being in a specific state. However, a model could also learn to approximate the distribution of returns instead of simply learning the expected return. For this purpose, it has been proposed to model such distributions with probability masses placed on a discrete support z, where z is a vector with Natoms, where N is a positive integer, defined by:

z_i=V_{min}+(i-1)\frac{V_{max}-V_{min}}{N-1} \forall i \in [1,...,N_{atoms}]

The approximating distribution dt at time t is defined on this support, with the probability mass pθi(St, At) on each atom i, such that the distribution dt= (z, pθ(St, At)). The goal is then to update θ such that this distribution closely matches the actual distribution of returns.

Let’s make this concept more clear with a simple example. Let’s suppose that the expected return of a standard DQN would be in the range (0, 200) and consequently set Vmin=0 and Vmax=200. With a support of 51 atoms, the first element would have a value of 0, the second one would be 4, and so on with the last one having a value of 200. Therefore, the probability mass on the first atom, that is pθ1(St, At), would be the probability of the returned Q-value to be equal to 0, probability mass on the second atom, namely pθ2(St, At), would be the probability of the Q-value to be 4 and so on. Being the underlying distribution dt a categorical distribution, we have that the mass probabilities sum to 1. The Q-value of the of the corresponding distribution can simply be reconstructed by multiplying each probability mass times its atom and computing their sum:

Q(S_t, A_t)=\sum_{i=1}^{51}z_i p_\theta^i(S_t, A_t)

To learn the probability masses for a given state St and action At, the distribution of the returns under the optimal policy π should match a target distribution defined by taking the distribution for the next state St+1 and action At+1(St+1), contracting it towards zero according to the value of γ, shifting it by the distribution of the rewards, and projecting it onto the fixed support z. The figure below visually explains the above-mentioned steps.

(a) Next state distribution under policy π, (b)
Discounting shrinks the distribution towards 0, (c) The reward
shifts it, and (d) Projection step

A distributional variant of Q-learning is then derived by first constructing a new support for the target distribution, and then minimizing the Kullbeck-Leibler divergence between the distribution dt and the target distribution dt‘:

d_t'=(R_{t+1}+\gamma_{t+1}z, p_\theta'(S_{t+1}, argmax_a q_\theta'(S_{t+1},a))

Let’s remind that the Kullbeck-Leibler divergence is used to measure how close to each other are two distribution:

D_{KL}=(\Phi_z d_t'||d_t)

Where Φz is a L2-projection of the target distribution onto the fixed support z and qθ‘(St+1, a) = zT pθ‘(St+1, a) is the mean action value which represents state St+1. As in the normal non-distributional case, θ’ are the parameters of the target network.

The parametrized distribution can be represented by a neural network, as in DQN, but with Natoms ×Nactions outputs with a softmax applied independently for each action dimension of the output to ensure that the distribution for each action is appropriately normalized.


I’m coming…


To sum up, we enhanced the standard Q-learning algorithm with the following 7 techniques:

  • Deep Q-network.
  • Double Deep Q-network.
  • Prioritized experience replay.
  • Dueling network.
  • Multi-step learning.
  • Noisy Nets.
  • Distributional RL.

Now, we can finally turn our Mr.Pacman into a Super-Mr.Pacman by upgrading it with the aforementioned components. Instead of calling it as Noisy Net N-step Prioritized Double Dueling Distributional Deep Q-Network, Rainbow is the name attributed by researchers to an agent integrating all of these techniques. In the figure above we can see how Rainbow outperforms all the previous networks on a benchmark of 57 Atari games.

Median human-normalized performance across
57 Atari games

Do you know why I’m called Rainbow?

Because ghosts vanish as I appear.

Super Mr.Pacman in action

After this long read, let’s take a rest and watch our Super-Mr.Pacman in action! Can you do better than him?


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s