# Balancing a cart pole with policy gradients algorithm

In this post we are going to analyze a type of reinforcement learning algorithm called policy gradients. In the field of reinforcement learning, we have an agent making observations and taking actions within an environment in order to receive some rewards and its main objective is to learn a policy such that its actions will maximize its expected long-term rewards.

In this case, our agent is a cart on top of the which stands a pole and its goal is to keep it balanced as long as possible avoiding to make it fall. The environment is everything that can be observed, which in our case includes:

• the cart’s horizontal position where the point (0.0) indicates the center.
• the cart’s velocity where (0.0) means the agent is stopped.
• the angle of the pole where (0.0) means it stands perfectly vertical.
• the pole’s angular velocity where (0.0) indicates the pole is not moving.

In this simple environment are only possible two different discrete actions: move the cart toward left or right. Finally, the agent obtains a reward of 1 every time it takes an action until the pole doesn’t fall and its goal is to avoid this event to happen, or at least, to make it happen as late as possible.

### The cart pole environment

The pole cart environment is included in the OpenAI `gym` library. `gym` is a toolkit containing several virtual environments such as Atari games, board games, 2D and 3D physical simulators, and many more. Thanks to their shared interface, they allow different developers to test their own method, thus providing a baseline to compare different reinforcement learning algorithms.

Important: the gym library is only supported on Linux and Mac operating systems. Updated as 12/2019.

Let’s start by setting up a new environment, initializing it and making some observations.

```import gym
import matplotlib.pyplot as plt
from tensorflow import keras
import numpy as np
import tensorflow as tf

def plot_environment(env, figsize=(5, 4)):
plt.figure(figsize=figsize)
img = env.render(mode="rgb_array")
plt.imshow(img)
plt.axis("off")
return img

env = gym.make('CartPole-v1')
observation = env.reset()
print(observation)
plot_environment(env)
plt.show()
```

The `make` function creates a new Cart Pole virtual environment that we initialize by using the `reset` method. The `reset` method will also return the first observation of the environment. In our case, each observation is a (1 x 4) numpy array containing four floats, representing the observable variables of the environment described in the list showed in the previous section in the same order as its elements appear. We also wrap the method `render`, which returns a 2D numpy array representing an image of the current status of the environment, in a `plot_environment` function which plots it making use of the `matplotlib` library.

Continuing exploring how the `gym` library works, the next short snippet of code will show how to interact with this environment.

```print(env.action_space)
obs, reward, done, info = env.step(1)
print(obs, reward, done, info)
```

The `action_space` class member shows all the possible actions the agent can take, in this case there are only two integers 0 and 1, which represent accelerating left or right respectively. To execute an action we use the method `step` and we move the cart toward the right and see how the environemnt changes after the execution of the action:

• obs is the new state of the environment.
• In this environmnet reward will be always 1.0, no matter the result of the action.
• done is a boolean variable, when the episode is over it’s value will be True, otherwise it will return False. The episode ends when the pole is inclined more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.
• info is a dictionary providing extra debug information.

### Hard coded naive solution

According to our physics laws, to keep the pole balanced, we should accelerate the cart right if the pole is leaning toward the right and accelerate it left when the pole is leaning toward the left. Therefore, we can hard-code a naive policy acting according to these two simple rules. We will make it run for 500 episodes and for no longer than 200 steps for each episode.

```def basic_policy(obs):
angle = obs
return 0 if angle < 0 else 1

# test the algorithm 500 times for at most 200 steps
totals = []
for episode in range(500):
episode_rewards = 0
obs = env.reset()
for step in range(200):
action = basic_policy(obs)
obs, reward, done, info = env.step(action)
episode_rewards += reward
if done:
break
totals.append(episode_rewards)

print(np.mean(totals), np.std(totals), np.min(totals), np.max(totals))
```

After running this code we should get some results similar to the following: (42.125999999999998, 9.1237121830974033, 24.0, 68.0). We can note that our policy kept the pole balanced for at most 68 steps and kept the pole balanced for 42 steps on average. Definitely, it’s not a good result, we should come out with something else if we want to make it run longer.

### The credit assignment problem

Differently from supervised learning, in reinforcement learning, in order to understand how its environment works and improve its policy, the agent can only rely on the rewards its obtains through exploration. The problem is that the rewards are typically sparse and delayed. Let’s make an example, if the agent manages to balance the pole for 100 steps, how can it know which of the 100 actions it took were good ones, and which of them were bad ones? It only knows the pole fell after the last action, but there is no way the last action could be the only responsible of the fallen of the pole, but it is more likely that this negative event has been influenced by a subset of the previous actions. This is the so called credit assignment problem: how can we know which actions were the good ones and which were the bad ones?

The problem is tackled in the following way: every action is evaluated based on the sum of all the discounted rewards of the actions coming after it. That is, each successive reward is calculated applying a discount rate r at each step. The discount rate r can typically range from values to 0.95 to 0.99, lower discount rates will result in future actions having less influence than the current one and viceversa. When r is equal to 1 all the actions have the same importance, no matter their order of execution. Conversely, with a discount rate of 0.95, rewards 13 steps into the future count roughly for half as much as immediate rewards (0.9513 ≈ 0.5). In the Cart Pole environment we choose a discount rate of 0.95 since we want to give more importance to the short-term actions.

For example, a good action followed by several bad actions leading the pole to fall quickly, will result in the good action getting a low score. On the other hand, a bad action followed by many good actions which will help the pole to restore its balance, will result in the bad action still getting a high score. Then, after playing the game enough times, on average good actions will get a better score than bad ones. To make the action scores more reliable, we must evaluate them over many episodes and normalize all the action scores by subtracting their mean and dividing them by their standard deviation.

Policy gradients algorithms works by optimizing the parameters of a policy by following the gradients toward higher rewards. One common variant, where the policy is modeled by a neural network is the following:

• First, let the neural network policy play the game for several episodes and at each action compute the gradients that would make the chosen action even more likely, no matter its reward. Anyway, these gradients are, for the moment, just calculated but not applied.
• After having run several episodes, compute each action’s score using the credit assignent strategy.
• If an action’s score is positive, it means that the action was good and we want to apply the gradients computed earlier to make the action even more likely to be chosen in the future. However, if the score is negative, it means the action was bad and we want to apply the opposite gradients to make this action slightly less likely in the future. The solution in both cases is simply to multiply each gradient vector by the corresponding action’s score.
• Finally, compute the mean of all the resulting gradient vectors, and use it to perform a gradient descent step to make the model improve its policy by tuning its parameters.

Once more, deep learning is involved in the solution. Instead than hard-coding a policy, we can let a neural network learn it by exploiting a policy gradient algorithm. So, let’s start by defining a simple model with 4 input neurons (one for each observable variable), two hidden layers and one output layer with sigmoid fuction indicating the probability p of taking action 0, that is moving toward the left. Similarly, the probability of taking action 1 (moving right) is 1-p.

```# define a simple model
n_inputs = 4  # env.observation_space.shape
model = keras.models.Sequential([
keras.layers.Dense(10, activation="relu", input_shape=[n_inputs]),
keras.layers.Dense(5, activation="relu"),
keras.layers.Dense(1, activation="sigmoid"),
])
loss_fn = keras.losses.binary_crossentropy
```

It is important to remark that given this kind of environment, past actions and observations can be ignored without any loss of information, since each observation contains the environment’s full state and there is absence of noise. If there were some hidden states or noise, then it would have been better to consider past actions and observations in order to try to infer the hidden state of the environment.

Next, let’s create a function to play a single step using the model and pretending that whatever action it takes is the right one. It also computes the loss and returns its gradients.

```def play_one_step(env, obs, model, loss_fn):
# makes an action more likely
left_proba = model(obs[np.newaxis])
action = (tf.random.uniform([1, 1]) > left_proba)
y_target = tf.constant([[1.]]) - tf.cast(action, tf.float32)
loss = tf.reduce_mean(loss_fn(y_target, left_proba))
obs, reward, done, info = env.step(int(action[0, 0].numpy()))
```

It is important in this phase to pick a random action based on the probability given by the policy network, rather than just picking the action with the highest probability. This approach lets the agent find the right balance between exploring new actions and exploiting the actions that are known to work well. Here’s an analogy: suppose you order a pizza for the first time and you realize you like it. In this way you can increase the probability to order the same pizza next time, but if you increase that probability to 100%, you will never have the chance to taste other type of pizza, some of which may be even better than the one you tried.

Let’s create another function on top of the just created `play_one_step` function to play multiple episodes. It will return a 3D gradients matrix (episodes x steps x training variables) and a 2D rewards matrix (episodes x steps).

```def play_multiple_episodes(env, n_episodes, n_max_steps, model, loss_fn):
all_rewards = []
for episode in range(n_episodes):
current_rewards = []
obs = env.reset()
for step in range(n_max_steps):
obs, reward, done, grads = play_one_step(env, obs, model, loss_fn)
current_rewards.append(reward)
if done:
break
all_rewards.append(current_rewards)
```

To make the policy gradients algorithm work we need to computed the discounted rewards and normalize them. Therefore, we are going to create a couple of functions for these two tasks.

The first takes in input the rewards for a single episode and returns the discounted rewards.

```def discount_rewards(rewards, discount_rate):
discounted = np.array(rewards)
for step in range(len(rewards) - 2, -1, -1):
discounted[step] += discounted[step + 1] * discount_rate
return discounted
```

Let’s suppose there are 3 actions, and after each action there was a reward: first 10, then 0 and finally -50. With a discount rate of 80%, the third action will get -50, the second action will only get -40 (0+0.8x(-50)), and the first action will get -22 (10+0.8*(-40)).

The second function will simply normalize the discounted rewards across many episodes.

```def discount_and_normalize_rewards(all_rewards, discount_rate):
all_discounted_rewards = [discount_rewards(rewards, discount_rate)
for rewards in all_rewards]
flat_rewards = np.concatenate(all_discounted_rewards)
reward_mean = flat_rewards.mean()
reward_std = flat_rewards.std()
return [(discounted_rewards - reward_mean) / reward_std
for discounted_rewards in all_discounted_rewards]
```

Now we are ready to train the network using the analyzed policy gradients algorithm.

```n_iterations = 100
n_episodes_per_update = 10
n_max_steps = 500  # note that 500 is the max possible score
discount_rate = 0.95

def show_results(all_rewards):
all_rewards = np.array(list(map(sum, all_rewards)))
print(np.mean(all_rewards), np.std(all_rewards), np.min(all_rewards), np.max(all_rewards))

for iteration in range(n_iterations):
env, n_episodes_per_update, n_max_steps, model, loss_fn)
print("Iteration: {}: ".format(iteration), end="")
show_results(all_rewards)
all_final_rewards = discount_and_normalize_rewards(all_rewards, discount_rate)
# multiply each gradient vector corresponding to each trainable variable, episode and step
# by its corresponding action score and compute its mean
for var_index in range(len(model.trainable_variables)):
for episode_index, final_rewards in enumerate(all_final_rewards)
for step, final_reward in enumerate(final_rewards)], axis=0)
# save the model each 20 iterations
if iteration % 20 == 0:
print("model_saved")
model.save("cart_pole" + str(iteration) + ".h5")

env.close()
```

The above code will run the cart for 1000 epochs (n_iterations x n_episodes_per_update) updating the model’s weights each 10 episodes. After each iterations it displays some statistics about the past 10 epochs such as the average and maximum steps. It’s important to note that the duration of the game is limited to 500 actions after the which the game will automatically end (`done`=True). Once we have finished using the environment, let’s invoke the function `close` to close it. After 100 iterations we should obtain a perfect policy (500.0 0.0 500.0 500.0).

### Results

The first video shows an episode played by the naive hard-coded policy, the pole will fall on average after 42 actions.

The next video shows an episode played by the policy gradients algorithm, the pole will stand balanced until the end of the game (500 steps).

Finally, we show some results obtained during the training process. The statistics are calculated on the last 10 episodes after the model have been trained for different numbers of iterations.

We can easily observe that after only 20 iterations, which corresponds to 200 episodes, the model already far outperforms the hard-coded policy and reaches an optimal policy after 100 iterations (1000 epochs). The full code can be found on my GitHub repository linked in the next section, it also contains the script used to show the animation relative to a single pole chart episode and to save it as mp4 file.