Balancing CartPole with the policy gradients algorithm

In this post, we are going to analyze a type of reinforcement learning algorithm called policy gradients and we will use it to train an agent to balance a pole on a moving cart. This type of environment is also known as CartPole.

In the field of reinforcement learning, an agent makes observations and takes actions within an environment in order to receive rewards. Its main objective is to learn a policy such that its actions maximize its expected long-term reward.

In this case, our agent is a cart on top of which stands a pole, and its goal is to keep the pole balanced for as long as possible. The environment is everything that can be observed, which in our case includes:

  • the cart’s horizontal position, where 0.0 indicates the center.
  • the cart’s velocity, where 0.0 means the cart is stopped.
  • the angle of the pole, where 0.0 means it stands perfectly vertical.
  • the pole’s angular velocity, where 0.0 indicates the pole is not moving.
Cart Pole environment

In this simple environment, only two discrete actions are possible: move the cart left or right. Finally, the agent obtains a reward of 1 for every time step during which the pole has not fallen, so its goal is to prevent the pole from falling, or at least to delay that event as long as possible.

The cart pole environment

The cart pole environment is included in the OpenAI gym library. gym is a toolkit containing several virtual environments such as Atari games, board games, 2D and 3D physical simulators, and many more.

Thanks to their shared interface, they allow different developers to test their own methods, thus providing a baseline to compare different reinforcement learning algorithms.

Important: the gym library is only supported on Linux and Mac operating systems (updated as of 12/2019).

Let’s start by setting up a new environment, initializing it, and making some observations.

import gym
import matplotlib.pyplot as plt
from tensorflow import keras
import numpy as np
import tensorflow as tf

def plot_environment(env, figsize=(5, 4)):
    img = env.render(mode="rgb_array")
    plt.figure(figsize=figsize)
    plt.imshow(img)
    plt.axis("off")
    return img

env = gym.make('CartPole-v1')
observation = env.reset()

The make function creates a new Cart Pole virtual environment, which we initialize with the reset method. reset also returns the first observation of the environment. In our case, each observation is a numpy array containing four floats, representing the observable variables of the environment in the same order as they appear in the list from the previous section. We also wrap the render method, which returns a numpy array representing an image of the current state of the environment, in a plot_environment function that displays it using the matplotlib library.

Continuing to explore how the gym library works, the next short snippet of code shows how to interact with this environment.

obs, reward, done, info = env.step(1)
print(obs, reward, done, info)

The action_space class member lists all the possible actions the agent can take; in this case, there are only two integers, 0 and 1, which represent accelerating left or right respectively. To execute an action we use the step method. Here we move the cart toward the right and see how the environment changes after the action is executed:

  • obs is the new state of the environment.
  • In this environment, the reward will be always 1.0, no matter the result of the action.
  • done is a boolean variable: when the episode is over its value will be True, otherwise it will be False. The episode ends when the pole tilts more than 12 degrees from vertical, or the cart moves more than 2.4 units from the center.
  • info is a dictionary providing extra debug information.

Hard-coded naive solution

According to the laws of physics, to keep the pole balanced we should accelerate the cart to the right when the pole is leaning to the right, and to the left when it is leaning to the left. Therefore, we can hard-code a naive policy based on these two simple rules. We will run it for 500 episodes of no more than 200 steps each.

def basic_policy(obs):
    angle = obs[2]
    return 0 if angle < 0 else 1

# test the algorithm 500 times for at most 200 steps
totals = []
for episode in range(500):
    episode_rewards = 0
    obs = env.reset()
    for step in range(200):
        action = basic_policy(obs)
        obs, reward, done, info = env.step(action)
        episode_rewards += reward
        if done:
            break
    totals.append(episode_rewards)

print(np.mean(totals), np.std(totals), np.min(totals), np.max(totals))

After running this code we should get results similar to the following: (42.125999999999998, 9.1237121830974033, 24.0, 68.0). Our policy kept the pole balanced for at most 68 steps, and for about 42 steps on average.

Definitely not a good result; we need to come up with something better if we want the pole to stay balanced longer.

The credit assignment problem

Unlike supervised learning, in reinforcement learning the agent can only rely on the rewards it obtains through exploration in order to understand how its environment works and improve its policy. The problem is that the rewards are typically sparse and delayed.

Let’s take an example: if the agent manages to balance the pole for 100 steps, how can it know which of the 100 actions it took were good and which were bad?

It only knows that the pole fell after the last action, but the last action is certainly not the only one responsible for the fall; it is more likely that this negative outcome was influenced by a subset of the preceding actions. This is the so-called credit assignment problem: how can we know which actions were good and which were bad?

The problem is tackled in the following way: every action is evaluated based on the sum of all the discounted rewards that come after it, where each successive reward is multiplied by a discount rate r at each step. The discount rate r typically ranges from 0.95 to 0.99; lower discount rates give future rewards less influence relative to immediate ones, and vice versa.

When r is equal to 1, all the actions have the same importance, no matter their order of execution. Conversely, with a discount rate of 0.95, rewards 13 steps into the future count roughly half as much as immediate rewards (0.95^13 ≈ 0.5). In the Cart Pole environment we choose a discount rate of 0.95, since we want to give more importance to short-term actions.
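We can quickly verify how fast rewards decay for different discount rates with a couple of lines of plain Python:

```python
# with r = 0.95 a reward loses half its weight after about 13 steps,
# while with r = 0.99 it takes about 69 steps: higher discount rates
# let rewards propagate further back in time
print(0.95 ** 13)  # ~0.51
print(0.99 ** 69)  # ~0.50
```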

Credit assignment problem

For example, a good action followed by several bad actions that quickly lead the pole to fall will result in the good action receiving a low score. Conversely, a bad action followed by many good actions that restore the pole’s balance will still result in the bad action receiving a high score.

Then, after playing the game enough times, on average good actions will get a better score than bad ones. To make the action scores more reliable, we must evaluate them over many episodes and normalize all the action scores by subtracting their mean and dividing them by their standard deviation.

Policy gradients algorithms

Policy gradient algorithms optimize the parameters of a policy by following the gradients toward higher rewards. One common variant, where the policy is modeled by a neural network, is the following:

  • First, let the neural network policy play the game for several episodes, and at each step compute the gradients that would make the chosen action even more likely, regardless of its reward. These gradients are, for the moment, only computed, not applied.
  • After having run several episodes, compute each action’s score using the credit assignment strategy.
  • If an action’s score is positive, it means that the action was good and we want to apply the gradients computed earlier to make the action even more likely to be chosen in the future. However, if the score is negative, it means the action was bad and we want to apply the opposite gradients to make this action slightly less likely in the future. The solution in both cases is simply to multiply each gradient vector by the corresponding action’s score.
  • Finally, compute the mean of all the resulting gradient vectors, and use it to perform a gradient descent step to make the model improve its policy by tuning its parameters.

Once more, deep learning is involved in the solution. Instead of hard-coding a policy, we can let a neural network learn it by exploiting a policy gradient algorithm.

So, let’s start by defining a simple model with 4 input neurons (one for each observable variable), two hidden layers, and one output layer with a sigmoid activation indicating the probability p of taking action 0, that is, moving left. The probability of taking action 1 (moving right) is then 1 - p.

# define a simple model
n_inputs = 4  # env.observation_space.shape[0]
model = keras.models.Sequential([
    keras.layers.Dense(10, activation="relu", input_shape=[n_inputs]),
    keras.layers.Dense(5, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
optimizer = keras.optimizers.Adam(lr=0.01)
loss_fn = keras.losses.binary_crossentropy

It is important to remark that, in this kind of environment, past actions and observations can safely be ignored, since each observation contains the environment’s full state and there is no noise. If there were hidden state or noise, it would be better to consider past actions and observations in order to infer the hidden state of the environment.

Next, let’s create a function that plays a single step using the model, pretending that whatever action it takes is the right one. It also computes the loss and returns its gradients.

def play_one_step(env, obs, model, loss_fn):
    with tf.GradientTape() as tape:
        left_proba = model(obs[np.newaxis])
        # sample the action: P(action 0) = left_proba
        action = (tf.random.uniform([1, 1]) > left_proba)
        # pretend the chosen action was the right one
        y_target = tf.constant([[1.]]) - tf.cast(action, tf.float32)
        loss = tf.reduce_mean(loss_fn(y_target, left_proba))
    grads = tape.gradient(loss, model.trainable_variables)
    obs, reward, done, info = env.step(int(action[0, 0].numpy()))
    return obs, reward, done, grads

It is important in this phase to pick a random action based on the probability given by the policy network, rather than just picking the action with the highest probability. This approach lets the agent find the right balance between exploring new actions and exploiting the actions that are known to work well.

Here’s an analogy: suppose you order a pizza for the first time and realize you like it. You can then increase the probability of ordering the same pizza next time, but if you increase that probability to 100%, you will never have the chance to taste another type of pizza, some of which may be even better than the one you tried.
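The sampling step above amounts to drawing from a Bernoulli distribution. A minimal numpy sketch (the probability value is just an illustrative assumption, not an actual network output):

```python
import numpy as np

rng = np.random.default_rng(42)
left_proba = 0.8  # hypothetical output of the policy network

# sample 10,000 actions: action 0 (left) with probability left_proba,
# action 1 (right) otherwise, mirroring the comparison in play_one_step
actions = (rng.random(10_000) > left_proba).astype(int)

# the empirical frequency of "move right" is close to 1 - left_proba
print(actions.mean())  # close to 0.2
```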

Let’s create another function on top of the just-created play_one_step function to play multiple episodes. It will return a nested list of gradients (episodes × steps × trainable variables) and a nested list of rewards (episodes × steps).

def play_multiple_episodes(env, n_episodes, n_max_steps, model, loss_fn):
    all_rewards = []
    all_grads = []
    for episode in range(n_episodes):
        current_rewards = []
        current_grads = []
        obs = env.reset()
        for step in range(n_max_steps):
            obs, reward, done, grads = play_one_step(env, obs, model, loss_fn)
            current_rewards.append(reward)
            current_grads.append(grads)
            if done:
                break
        all_rewards.append(current_rewards)
        all_grads.append(current_grads)
    return all_rewards, all_grads

To make the policy gradients algorithm work we need to compute the discounted rewards and normalize them. Therefore, we are going to create a couple of functions for these two tasks.

The first takes as input the rewards for a single episode and returns the discounted rewards.

def discount_rewards(rewards, discount_rate):
    discounted = np.array(rewards)
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_rate
    return discounted

Let’s suppose there are 3 actions, and after each action there was a reward: first 10, then 0, and finally -50. With a discount rate of 0.8, the third action gets -50, the second action gets -40 (0 + 0.8 × (-50)), and the first gets -22 (10 + 0.8 × (-40)).
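We can check this worked example directly against discount_rewards (repeated here so the snippet runs on its own):

```python
import numpy as np

def discount_rewards(rewards, discount_rate):
    # walk backwards, folding each future reward into the previous step
    discounted = np.array(rewards, dtype=np.float64)
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_rate
    return discounted

print(discount_rewards([10, 0, -50], 0.8))  # → [-22. -40. -50.]
```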

The second function will simply normalize the discounted rewards across many episodes.

def discount_and_normalize_rewards(all_rewards, discount_rate):
    all_discounted_rewards = [discount_rewards(rewards, discount_rate)
                              for rewards in all_rewards]
    flat_rewards = np.concatenate(all_discounted_rewards)
    reward_mean = flat_rewards.mean()
    reward_std = flat_rewards.std()
    return [(discounted_rewards - reward_mean) / reward_std
            for discounted_rewards in all_discounted_rewards]
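As a quick sanity check on two short made-up reward lists, the normalized scores concatenated across all episodes should have zero mean and unit standard deviation (both functions are repeated here so the snippet is self-contained):

```python
import numpy as np

def discount_rewards(rewards, discount_rate):
    discounted = np.array(rewards, dtype=np.float64)
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_rate
    return discounted

def discount_and_normalize_rewards(all_rewards, discount_rate):
    all_discounted_rewards = [discount_rewards(rewards, discount_rate)
                              for rewards in all_rewards]
    flat_rewards = np.concatenate(all_discounted_rewards)
    reward_mean = flat_rewards.mean()
    reward_std = flat_rewards.std()
    return [(discounted_rewards - reward_mean) / reward_std
            for discounted_rewards in all_discounted_rewards]

# two short made-up episodes
scores = discount_and_normalize_rewards([[10, 0, -50], [10, 20]], 0.8)
flat = np.concatenate(scores)
print(flat.mean(), flat.std())  # ~0.0 and ~1.0
```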

Now we are ready to train the network using the analyzed policy gradients algorithm.

n_iterations = 100
n_episodes_per_update = 10
n_max_steps = 500  # note that 500 is the max possible score
discount_rate = 0.95

def show_results(all_rewards):
    all_rewards = np.array(list(map(sum, all_rewards)))
    print(np.mean(all_rewards), np.std(all_rewards), np.min(all_rewards), np.max(all_rewards))

for iteration in range(n_iterations):
    all_rewards, all_grads = play_multiple_episodes(
        env, n_episodes_per_update, n_max_steps, model, loss_fn)
    print("Iteration: {}: ".format(iteration), end="")
    show_results(all_rewards)
    all_final_rewards = discount_and_normalize_rewards(all_rewards, discount_rate)
    all_mean_grads = []
    # multiply each gradient vector corresponding to each trainable variable,
    # episode, and step by its corresponding action score and compute the mean
    for var_index in range(len(model.trainable_variables)):
        mean_grads = tf.reduce_mean(
            [final_reward * all_grads[episode_index][step][var_index]
             for episode_index, final_rewards in enumerate(all_final_rewards)
                 for step, final_reward in enumerate(final_rewards)], axis=0)
        all_mean_grads.append(mean_grads)
    optimizer.apply_gradients(zip(all_mean_grads, model.trainable_variables))
    # save the model every 20 iterations
    if iteration % 20 == 0:
        model.save("cart_pole" + str(iteration) + ".h5")
        print("model saved")


The above code will run the cart for 1,000 episodes (n_iterations × n_episodes_per_update), updating the model’s weights every 10 episodes. At each iteration, it displays some statistics over the last 10 episodes, such as the average and maximum number of steps.

It’s important to note that the duration of the game is limited to 500 actions, after which the game automatically ends (done=True). Once we have finished using the environment, we invoke env.close() to close it. After 100 iterations we should obtain a perfect policy (500.0 0.0 500.0 500.0).


The first video shows an episode played by the naive hard-coded policy; the pole falls on average after 42 actions.

cart pole not trained policy

The next video shows an episode played by the policy gradients algorithm, the pole will stand balanced until the end of the game (500 steps).

cart pole trained policy

Finally, we show some results obtained during the training process. The statistics are calculated on the last 10 episodes after the model has been trained for different numbers of iterations.

Iterations | Mean  | Std   | Min | Max
20         | 70.1  | 24.76 | 35  | 104
40         | 124.4 | 40.97 | 58  | 198
60         | 236.6 | 89.69 | 140 | 387

We can easily observe that after only 20 iterations, which corresponds to 200 episodes, the model already far outperforms the hard-coded policy, and it reaches an optimal policy after 100 iterations (1,000 episodes).

The full code can be found on my GitHub repository linked in the next section; it also contains the script used to show the animation of a single cart pole episode and to save it as an mp4 file.

Find more at

