Reinforcement learning agents are usually trained to maximize their rewards by taking actions in an environment following a Markov Decision Process (MDP).
A Markov Decision Process is a model that describes an environment through its states, the actions available in each state, the rewards those actions yield, and the probabilities of moving to each possible next state.
The key point is that the agent sees only the present state and can approximately predict how good future states will be, but it has no perception of what happened in the past. Under the Markov assumption, the current state is taken to contain everything needed to act, so the agent keeps no knowledge of previous states.
Introducing the Memory-Length environment
To test common RL algorithms outside the traditional setting, where information from past states is not required to learn a decent behavior, we introduce a particular environment called Memory-Length. This environment, developed by researchers from DeepMind, is designed to measure how many sequential steps an agent can remember a single initial value for. In short, the first observation (also known as the context) is either -1 or 1, and the action space consists of the same two values. The agent gets a positive reward if the action taken at the last step N matches the first observation; otherwise, it receives a negative reward. The actions taken during the previous N-1 steps have no effect. Different versions of Memory-Length use increasing values of N, from 1 to 100, making each version more difficult than the last. By construction, an agent that performs well on this task has mastered some use of memory over multiple timesteps.
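The rules above can be sketched as a small toy environment. This is a simplified illustration written for this post, not the DeepMind implementation: the class name MemoryLengthSketch, the helper functions, and the choice to emit 0 as the observation after the first step are all assumptions.

```python
import random


class MemoryLengthSketch:
    """Toy version of the task described above (hypothetical, for illustration)."""

    def __init__(self, memory_length, seed=None):
        if memory_length < 1:
            raise ValueError("memory_length must be at least 1")
        self.memory_length = memory_length  # N: the step at which the action is scored
        self._rng = random.Random(seed)
        self._context = None
        self._step = 0

    def reset(self):
        """Start an episode and return the context, the only informative observation."""
        self._context = self._rng.choice((-1, 1))
        self._step = 0
        return self._context

    def step(self, action):
        """Return (observation, reward, done); only the action at step N is rewarded."""
        if self._context is None:
            raise RuntimeError("call reset() before step()")
        self._step += 1
        done = self._step == self.memory_length
        reward = 0.0
        if done:
            reward = 1.0 if action == self._context else -1.0
            self._context = None  # force a reset before the next episode
        # Assumption: observations after the first carry no information.
        return 0, reward, done


def play_episode(env, policy):
    """Run one episode with a policy mapping observation -> action; return the final reward."""
    observation = env.reset()
    done = False
    reward = 0.0
    while not done:
        observation, reward, done = env.step(policy(observation))
    return reward


def memoryless_policy(observation):
    """Acts on the current observation only, as a feedforward agent would."""
    return 1 if observation >= 0 else -1


def make_remembering_policy():
    """Build a policy that stores the first observation it sees and always repeats it."""
    memory = {}

    def policy(observation):
        memory.setdefault("context", observation)
        return memory["context"]

    return policy
```

In this sketch the memoryless policy solves N = 1, because it still sees the context when it acts, but for larger N its final action cannot depend on the context. The remembering policy needs a fresh instance per episode, since it keeps the first observation it was ever shown.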
Evaluating common RL algorithms
Originally, the authors of the paper evaluated three common RL algorithms in this particular memory environment, namely:
- DQN: standard DQN model without further upgrades.
- Bootstrapped-DQN: DQN with temporally-extended exploration abilities, often resulting in faster learning and more efficient exploration.
- A2C+RNN: Actor-Critic algorithm with an RNN backbone, which gives it a basic memory system able to retain some information about previous states.
As we can see in the next figure, the benchmark over different lengths N, from 1 to 100, provides empirical evidence about the scaling properties of the algorithms beyond a simple pass/fail.
As expected, actor-critic with a recurrent neural network greatly outperforms DQN and Bootstrapped-DQN. It is no surprise that, lacking any memory system, both DQN and Bootstrapped-DQN are unable to learn anything for N > 1: neither agent has any way to remember previous steps. A2C+RNN performs well for all N ≤ 30 and is essentially random for all N > 30, with quite a sharp cutoff. Given its design, it makes sense that the recurrent agent outperforms the feedforward architectures on a memory task, but it still hasn’t mastered memory over long sequences.
Evaluating memory-based RL algorithms
It is thus clear that algorithms that rely only on the current state, as the MDP scheme assumes, are not suitable for environments where states from the past are the most significant. To solve the Memory-Length environment, we turn to two RL algorithms whose key strength is a long-term memory mechanism:
- R2D2: RNN-based RL agents with distributed prioritized experience replay and burn-in replay sequence.
- GTrXL: Transformer-based architecture for reinforcement learning. It introduces architectural modifications that improve the stability and learning speed of the original Transformer-XL variant. A good introduction to this algorithm can be found in this post.
The next benchmark compares the performance of R2D2 against GTrXL in the Memory-Length environment for N=12, 30, 50, and 100. GTrXL converges for all values of N, while R2D2 fails to converge only for N=100. Even for N=50, R2D2 already shows signs of being close to the limit of its memory capacity, since it takes a remarkably high number of training steps to converge. Given that even a random agent guesses the value of the first observation correctly 50% of the time, we define convergence as the agent solving the environment with a positive reward 20 times in a row (a probability of (1/2)^20, about one in a million, for a random policy). Looking at convergence speed, we also note that as N increases, the training steps GTrXL needs grow roughly linearly, while the training steps R2D2 needs seem to grow exponentially.
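The convergence criterion above can be written down in a few lines. The following is a minimal sketch, assuming the final reward of each episode is reported in order; the names ConvergenceTracker and random_policy_streak_probability are hypothetical and not taken from the benchmark code.

```python
class ConvergenceTracker:
    """Flags convergence after `required_streak` consecutive positive final rewards."""

    def __init__(self, required_streak=20):
        self.required_streak = required_streak
        self.streak = 0            # current run of consecutive solved episodes
        self.episodes = 0          # total episodes seen so far
        self.converged_at = None   # episode count at which the streak was first reached

    def update(self, final_reward):
        """Record one episode's final reward and return True once converged."""
        self.episodes += 1
        # Any non-positive reward breaks the streak.
        self.streak = self.streak + 1 if final_reward > 0 else 0
        if self.converged_at is None and self.streak >= self.required_streak:
            self.converged_at = self.episodes
        return self.converged_at is not None


def random_policy_streak_probability(streak=20, p_success=0.5):
    """Chance that a policy with independent success probability p_success
    solves `streak` given episodes in a row."""
    return p_success ** streak


if __name__ == "__main__":
    tracker = ConvergenceTracker()
    for reward in [1.0] * 19 + [-1.0] + [1.0] * 20:
        tracker.update(reward)
    print("converged at episode:", tracker.converged_at)
    print("random-policy chance for one 20-episode window:",
          random_policy_streak_probability())
```

Note that (1/2)^20 is the chance for one specific window of 20 episodes. Over a very long training run a random policy gets many such windows, so the criterion makes a false positive unlikely rather than impossible.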
In the Memory-Length environment, the first observation is the only one that carries information needed to get a positive reward at the end of the episode. Thus, a memory that spans the whole trajectory of an episode is a requirement to solve this environment. While an RNN usually puts more emphasis on the most recent states, the attention mechanism can learn which parts of a sequence should be given more importance, independently of their relative position. In addition, the attention of the Transformer can better exploit temporal dependencies among states and actions in a trajectory, and learn better representations to predict the next action.
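A toy calculation can make the contrast between recency bias and position-independent attention concrete. This sketch rests on strong assumptions. A leaky running average h_t = decay * h_{t-1} + (1 - decay) * x_t stands in as a crude caricature of a recurrent state, whereas real LSTM or GRU cells have learned gates and can do much better. The attention side assumes the model has already learned to give the first position a higher score than the others. The function names and the numeric defaults are illustrative only.

```python
import math


def leaky_trace_weight_on_first(n_steps, decay=0.9):
    """Weight of the first input in the state after n_steps inputs, for
    h_t = decay * h_{t-1} + (1 - decay) * x_t with h_0 = 0."""
    return (1.0 - decay) * decay ** (n_steps - 1)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weight_on_first(n_steps, first_score=10.0, other_score=0.0):
    """Attention weight on position 0 when its score is first_score and every
    other position scores other_score (assumed, not learned here)."""
    scores = [first_score] + [other_score] * (n_steps - 1)
    return softmax(scores)[0]


if __name__ == "__main__":
    for n in (12, 30, 50, 100):
        print(n, leaky_trace_weight_on_first(n), attention_weight_on_first(n))
```

Under these assumptions, the first observation's share of the leaky state shrinks geometrically with N, while the attention weight on it depends only on the score gap and the number of competing positions, not on how far back the observation lies.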