# Learning to imitate: using GAIL to imitate PPO

Usually, in reinforcement learning, the agent receives a reward for each action it executes in the environment, and its goal is to maximize its total cumulative reward over multiple steps. Actions are selected according to observations that the agent has to learn to interpret. In this post, we are going to explore a different field called imitation learning: the agent is only provided with some trajectories taken from an expert agent, and its goal is to learn to imitate the expert. The agent receives no reinforcement signal, and is not allowed to query the expert for more data while training.

More specifically, imitation learning refers to the problem of learning to perform a task from expert demonstrations. Given this task, there are two common solutions widely known in the literature:

• Behavioral cloning: learn a policy as a supervised learning problem over state-action pairs from expert trajectories. Cons: it requires a large amount of data, and is prone to errors caused by covariate shift, which occurs when the states encountered during training differ from the states encountered at test time, reducing the model's robustness.
• Inverse reinforcement learning: find a cost function under which the expert is uniquely optimal, i.e. one that prioritizes entire expert trajectories over all others. Cons: many IRL algorithms are extremely expensive to run, requiring reinforcement learning in an inner loop.
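To make the first option concrete, behavioral cloning is nothing more than a supervised classifier fit to expert (s, a) pairs. Below is a minimal sketch using logistic regression on synthetic data; the "expert" here is a made-up rule (action 1 whenever the first state feature is positive) standing in for real demonstrations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expert demonstrations: 4-dim states, binary actions.
# The "expert" is a made-up rule: action 1 iff the first feature is positive.
states = rng.standard_normal((2000, 4))
actions = (states[:, 0] > 0).astype(float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Behavioral cloning = plain supervised learning: fit a logistic-regression
# policy pi(a=1|s) to the expert's state-action pairs by gradient descent
# on the cross-entropy loss.
w = np.zeros(4)
for _ in range(500):
    p = sigmoid(states @ w)
    grad = states.T @ (p - actions) / len(states)
    w -= 0.5 * grad

# Greedy accuracy against the expert's choices on the training states.
accuracy = np.mean((sigmoid(states @ w) > 0.5) == (actions > 0.5))
```

Even this tiny example hints at the covariate-shift problem: the policy is only ever fit to states the expert visited, so nothing constrains its behavior on states outside that distribution.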

Note: the code for the project in this post is also available on GitHub.

### GAIL

In this work, we are going to explore a new algorithm called GAIL (Generative Adversarial Imitation Learning) that, as its name suggests, is a combination of inverse reinforcement learning and generative adversarial learning.

Under our adversarial setting, we have a generative model G competing against a discriminative classifier D. The goal of D is to distinguish between the trajectories generated by G and the trajectories generated by the expert E. When D can no longer distinguish the trajectories generated by G from the expert trajectories, then G has successfully learnt to imitate the expert. Hence, the objective that D should maximize is:

$E_{\pi_E}[log(D(s,a))]+E_{\pi}[log(1-D(s,a))]$ (1)

The discriminator D takes as input a state s together with its corresponding action a, and D(s,a) returns a continuous value between 0 and 1. The closer this value is to 1, the more the input pair resembles an expert's trajectory. In this way D can be used as a reward signal to train G to generate trajectories similar to the expert's. If, for example, G is a policy gradient algorithm parameterized by θ, then a gradient update can be computed as:

$\nabla_\theta log \pi_\theta(a|s) Q(s,a)$ (2)

$Q(s,a)=log(D(s,a))$

As in a standard GAN, each training iteration is divided into two parts, repeated for a certain number of steps or epochs:

1. Train the discriminator D using expert and generated trajectories according to the formula in (1);
2. Update the policy G with a gradient step computed with (2);

Expert trajectories can be generated by a human expert, an algorithm, or a policy that has already mastered the target task.
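The alternating scheme above can be sketched end to end. The snippet below is a toy illustration, not the implementation used in this post: both G and D are logistic-regression models, the "expert" is a hand-coded rule, and the discriminator is given one extra hand-crafted interaction feature so that a purely linear model can tell the two sources apart.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def expert_action(s):
    # Hand-coded stand-in for a real expert: action 1 iff s[0] > 0.
    return int(s[0] > 0)

def features(s, a):
    # Discriminator input: state, one-hot action, plus one interaction
    # feature so a linear D can separate expert from generated pairs.
    one_hot = np.eye(2)[a]
    return np.concatenate([s, one_hot, [s[0] * (2 * a - 1)]])

w_d = np.zeros(7)  # discriminator weights: D(s, a) = sigmoid(w_d @ features)
w_g = np.zeros(4)  # generator weights:  pi(a=1 | s) = sigmoid(w_g @ s)
lr_d, lr_g = 0.1, 0.02

for _ in range(1500):
    s = rng.standard_normal(4)
    a_g = int(rng.random() < sigmoid(w_g @ s))   # sample an action from G
    x_e = features(s, expert_action(s))          # expert (s, a) pair
    x_g = features(s, a_g)                       # generated (s, a) pair
    # 1) Discriminator step: gradient ascent pushing D(s, a) toward 1
    #    on expert pairs and toward 0 on generated pairs.
    w_d += lr_d * ((1 - sigmoid(w_d @ x_e)) * x_e - sigmoid(w_d @ x_g) * x_g)
    # 2) Generator step: REINFORCE update with reward Q(s, a) = log D(s, a).
    q = np.log(sigmoid(w_d @ x_g) + 1e-8)
    p1 = sigmoid(w_g @ s)
    w_g += lr_g * q * (a_g - p1) * s             # gradient of log pi(a|s)
```

After training, the discriminator scores expert-consistent pairs higher than action-flipped ones, and that score difference is exactly the reward gradient the generator follows.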

### Cloning PPO with GAIL

Next, we are going to show an example of using GAIL to learn to imitate a PPO policy playing the simple environment of Cartpole. In this scenario we consider the PPO policy to be the expert, and GAIL has to learn to imitate it only by observing the expert's trajectories.

First step. We trained a PPO policy to play and master the Cartpole environment. Training was conducted for a total of 200 epochs, at the end of which the model achieved an average score of 199 over 100 episodes.

Second step. We collected trajectories from 100 episodes of PPO playing Cartpole. Since we want our imitating policy to learn only from trajectories of good quality, we kept only those episodes whose total reward was >= 195. In total we collected 19,900 state-action pairs, each represented as a tuple (s, a).
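The filtering step can be sketched in a few lines. The episode data below is made up for illustration; the structure (an episode's total reward plus its list of (s, a) pairs) is an assumption about how the recorded trajectories might be stored.

```python
# Hypothetical recorded episodes: each one stores the episode's total
# reward and the list of (state, action) pairs observed during it.
episodes = [
    {"total_reward": 200, "pairs": [((0.1, 0.0, 0.2, 0.0), 1)] * 200},
    {"total_reward": 180, "pairs": [((0.0, 0.1, 0.0, 0.3), 0)] * 180},
    {"total_reward": 195, "pairs": [((0.2, 0.0, 0.1, 0.0), 1)] * 195},
]

# Keep only high-quality episodes (total reward >= 195), then flatten
# them into a single dataset of expert (s, a) pairs.
expert_pairs = [pair
                for ep in episodes if ep["total_reward"] >= 195
                for pair in ep["pairs"]]
```

Here the 180-reward episode is discarded, and the two surviving episodes contribute 395 state-action pairs to the expert dataset.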

Third step. We trained GAIL using the expert trajectories. We represented G as a PPO policy, and D as a simple MLP with a softmax output. The input of D is an array (size 6) formed by concatenating the state s (size 4) with a one-hot representation of the action a (size 2). The input of G is an observation from the environment, and its output is a probability distribution over the 2 possible actions.
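The construction of the discriminator's input can be sketched as follows; the example observation values are made up, but the layout (4 state features followed by a 2-element one-hot action) matches the description above.

```python
import numpy as np

def discriminator_input(state, action, n_actions=2):
    # Concatenate the Cartpole state (size 4) with a one-hot encoding
    # of the discrete action (size 2), giving an array of size 6.
    one_hot = np.zeros(n_actions)
    one_hot[action] = 1.0
    return np.concatenate([np.asarray(state, dtype=float), one_hot])

# Hypothetical Cartpole observation:
# [cart position, cart velocity, pole angle, pole angular velocity]
x = discriminator_input([0.02, -0.35, 0.01, 0.44], action=1)
```

The resulting array `x` has the state in its first four entries and `[0, 1]` (the one-hot for action 1) in the last two.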

We can see from the chart above that although GAIL takes many more episodes to train a policy to master the game of Cartpole, it still accomplishes its task. The average score over 100 episodes of the agent trained with GAIL is 200, the highest possible score that can be reached in this environment.

Fourth step. In the last experiment, we investigated the sample efficiency of GAIL in terms of expert data. In the previous part, GAIL succeeded in training a policy to imitate the behavior of another policy, but we collected about 20,000 expert state-action pairs to do so. In this experiment, we train the same policy, but this time using only 200 expert state-action pairs, which correspond to the length of just one episode.

Hence, we showed that using only 1% of the data used to train the previous policy, the new policy achieves similar performance (their average scores over 100 episodes are the same) and a similar training curve, demonstrating that GAIL is also sample efficient.