# Adversarial policies: attacking TicTacToe multi-agent environment

In a previous post we discussed how an attacker can fool image classification models by injecting adversarial noise directly into the input images. In this post we are going to see how it is possible to attack deep reinforcement learning agents in multi-agent environments (where two or more agents interact within the same environment): one or more agents are trained adversarially so as to create observations that are adversarial for the victim agent, which is instead trained normally.

In this toy example, we focus on an expanded version of the two-player game of TicTacToe. The rules are pretty simple: two players, X and O, take turns marking the spaces of an n x n grid. The player who first places m of their marks in a horizontal, vertical, or diagonal row wins. In our case, we set n=6 and m=4, since the classic version with n=3 and m=3 would be too easy.

Borrowing ideas from adversarial attacks on image classifiers, an adversary could easily produce adversarial examples by perturbing some pixels of the observations the victim agent receives. However, in real scenarios an attacker is usually not able to modify another agent’s observations directly. Nevertheless, under certain circumstances, an attacker can still control other agents interacting in the same environment as the victim, so as to create natural observations that turn out to be adversarial. To make this clearer, suppose an agent is driving an autonomous car and its policy is based on observations of the surrounding environment. The way the autonomous system is designed makes it impossible for an attacker to add adversarial noise directly to the frames the camera sends to the system. However, the attacker could control another car, or the behavior of a pedestrian, to create an adversarial observation and deceive the victim. To achieve this goal, an adversarial agent is trained with an adversarial policy whose aim is to fool the target agent in a targeted or untargeted way.

The adversarial policy attack consists of solving a two-player Markov game M, where the opponent $\alpha$, controlled by the adversary, competes against the victim agent $v$ in order to reduce the victim's earned reward. Given that the victim policy $\pi_v$ is held fixed, the game reduces to a single-player MDP $M=(S,A_\alpha,T_\alpha,R_\alpha)$ that the attacker has to solve. Thus, the goal of the attacker is to find an effective adversarial policy $\pi_\alpha$ by maximizing the sum of its discounted rewards:

$\sum_{t=0}^{\infty}\gamma^t R_\alpha(s_t,a_{\alpha_t},a_{v_t},s_{t+1}),$

where $s_{t+1} \sim T_\alpha(s_t,a_{\alpha_t},a_{v_t})$ and $a_{\alpha_t} \sim \pi_\alpha(\cdot|s_t)$. The rewards are sparse: they are given only at the end of the episode. More specifically, the adversary gets a positive reward when it wins the game and a negative one when it loses or ties. The adversarial policy can be learned with any DRL algorithm (e.g., DQN or A2C). Moreover, the authors of this method also observed that the greater the dimensionality of the component of the observation space under the adversary's control, the more vulnerable the victim is to adversarial attacks (which is why we attack a more complex version of the original TicTacToe).
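Because the rewards are sparse and only arrive at episode termination, the adversary's objective above collapses to the final reward scaled by the discount. A minimal sketch (plain Python, independent of the actual training code):

```python
# Sketch: the adversary's discounted return on a sparse-reward episode.
# Rewards are 0 at every step and +1/-1/0 only at the terminal step,
# so the sum reduces to gamma^(T-1) * final_reward.
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_t over one episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A 9-step episode the adversary wins: eight zeros, then +1.
episode = [0.0] * 8 + [1.0]
print(round(discounted_return(episode), 4))  # 0.9227, i.e. 0.99**8
```

Winning sooner therefore yields a strictly larger return, which is why shorter episode lengths (reported below) go hand in hand with stronger policies.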

### Evaluation and results

On a 6×6 game board, we represent the cells marked by player X as 1, the cells occupied by player O as -1, and empty cells as 0. Both the victim and the adversarial agent are simple double DQNs with prioritized experience replay, implemented as three-layer dense networks with an input and output size of 36 units and trained with a batch size of 64. The implementation relies on the deep-RL framework Tianshou. During training, ε is fixed to 0.1, and training stops after 10 epochs of 10000 steps each or when the win rate reaches 0.9 or higher. On episode termination, the agent receives a reward of +1 in case of victory, -1 if the opponent wins, or 0 in case of a draw. Each evaluation is run over 100 episodes during which ε is fixed to 0, and the final statistics are averaged over the episodes.
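To make the encoding concrete, here is a small sketch of the board representation and the m-in-a-row win check (the actual Tianshou environment may wrap or flatten the observation differently; this is just the scheme described above):

```python
import numpy as np

# 6x6 board: 1 for X, -1 for O, 0 for empty; flattened it becomes
# the 36-dimensional vector fed to the network's 36-unit input layer.
N, M = 6, 4  # board size, and marks in a row needed to win

def has_m_in_a_row(board, player, m=M):
    """Scan horizontals, verticals, and both diagonals for m marks."""
    for r in range(N):
        for c in range(N):
            for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
                cells = [(r + i * dr, c + i * dc) for i in range(m)]
                if all(0 <= rr < N and 0 <= cc < N and board[rr, cc] == player
                       for rr, cc in cells):
                    return True
    return False

board = np.zeros((N, N), dtype=int)
board[2, 1:5] = 1            # X places four marks in a horizontal row
obs = board.flatten()        # shape (36,)
print(has_m_in_a_row(board, 1))   # True
```

Note that the observation space has 3^36 possible configurations, far larger than the classic 3×3 game's, which is exactly the higher-dimensional setting that favors the adversary.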

Note: the policy under training always plays as player O, that is, it always moves second.

First of all, we train the victim model to learn to play TicTacToe from scratch against a random policy.

- **Victim policy against random policy.**

```shell
python multi_agent_drl_attacks/tic_tac_toe.py \
    --win-rate=0.9 \
    --logdir="log_1" \
    --render=0 \
    --ep_watch=100
```

- Final reward: 1.0
- Length: 8.86

After training, the agent should achieve a 100% win rate against the random policy. Now, let's train the adversarial policy to defeat our victim agent. Since we only want to train the adversarial policy, we will freeze the victim policy to prevent it from learning.
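The freeze corresponds to the script's `--freeze_opponent` flag. Conceptually, both policies still act during self-play, but only the adversary's parameters receive updates. A minimal tabular sketch of the idea (illustrative names, not Tianshou's API):

```python
import random

class TabularPolicy:
    """Toy epsilon-greedy policy with a trainable/frozen switch."""
    def __init__(self, eps=0.1):
        self.q = {}            # state -> {action: value}
        self.eps = eps
        self.trainable = True

    def act(self, state, legal_actions):
        if random.random() < self.eps:
            return random.choice(legal_actions)
        qs = self.q.get(state, {})
        return max(legal_actions, key=lambda a: qs.get(a, 0.0))

    def update(self, state, action, target, lr=0.5):
        if not self.trainable:   # frozen: skip the learning step entirely
            return
        qs = self.q.setdefault(state, {})
        qs[action] = qs.get(action, 0.0) + lr * (target - qs.get(action, 0.0))

victim = TabularPolicy()
victim.trainable = False        # freeze the victim
adversary = TabularPolicy()

victim.update("s0", 0, 1.0)     # ignored: victim stays fixed
adversary.update("s0", 0, 1.0)  # applied: only the adversary learns
print(victim.q, adversary.q)    # {} {'s0': {0: 0.5}}
```

With the victim fixed, the self-play game really is the single-player MDP described earlier, and any standard DRL algorithm can solve it.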

- **Adversarial policy against victim policy.**

```shell
python multi_agent_drl_attacks/tic_tac_toe.py \
    --win-rate=0.9 \
    --opponent_path=log_1/tic_tac_toe/dqn/policy.pth \
    --freeze_opponent \
    --render=0 \
    --ep_watch=100
```

- Final reward: 1.0
- Length: 8

As expected, the adversarial policy reaches a 100% win rate against the victim policy. However, to further test its performance, let's watch it play against the random policy.

- **Adversarial policy against random policy.**

```shell
python multi_agent_drl_attacks/tic_tac_toe.py \
    --watch \
    --ep_watch=100 \
    --render=0
```

- Final reward: 0.64
- Length: 18.47

Despite always winning against the victim policy, the adversarial policy does not perform very well against the random policy. It was trained with the sole objective of defeating the victim policy, learning how to place its marks so as to create adversarial observations that lead the victim policy to take wrong actions. As a result, it does not know how to deal with the unpredictable behavior of the random policy.

Nevertheless, it accomplished its task of defeating the victim policy. Now the question is: how can we defend the victim policy from the behavior of the malicious policy? The answer is simple: we can just train the victim to defeat the adversarial policy (essentially, we are training a new policy, initialized from the victim policy, that is adversarial to the adversarial policy!).

Let's first make a copy of the victim policy to avoid overwriting it.

```shell
mkdir -p log_3/tic_tac_toe/dqn/
cp log_1/tic_tac_toe/dqn/policy.pth log_3/tic_tac_toe/dqn/policy.pth
```

- **Retrained victim policy against adversarial policy.**

```shell
python multi_agent_drl_attacks/tic_tac_toe.py \
    --win-rate=0.9 \
    --logdir="log_3" \
    --resume_path=log_3/tic_tac_toe/dqn/policy.pth \
    --freeze_opponent \
    --ep_watch=100 \
    --render=0
```

- Final reward: 1
- Length: 8

Once again, the trained policy always wins against the policy it was trained against, so we successfully defended the victim policy. Everything seems to have gone smoothly, but before claiming victory, let's test the new policy against the random policy.

- **Retrained victim policy against random policy.**

```shell
python multi_agent_drl_attacks/tic_tac_toe.py \
    --watch \
    --resume_path=log_3/tic_tac_toe/dqn/policy.pth \
    --ep_watch=100 \
    --render=0
```

- Final reward: 0.76
- Length: 14.5

After retraining, the agent's performance against the random policy dropped from a 100% to a moderate 76% win rate: defending against the adversarial policy cost us a significant drop in performance. Moreover, the adversarial policy could be trained again to defeat the new policy, undoing the benefits of the retraining. However, repeating this procedure enough times might provide protection against a wide range of adversaries and possibly raise the chance of prevailing over the random policy.
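The alternating procedure just described can be sketched schematically as a loop where each round freezes the previous winner and trains the other side against it (the helper name below is a hypothetical placeholder for one run of `tic_tac_toe.py`, not real training code):

```python
# Schematic of iterated adversarial retraining: each call stands in for
# one DQN training run with the given (frozen) opponent.
def train_against(frozen_opponent, new_label):
    """Placeholder: would train a policy vs. frozen_opponent and save it."""
    return new_label

policy = "victim_v0"
for i in range(3):
    adversary = train_against(policy, f"adversary_v{i}")   # attack round
    policy = train_against(adversary, f"victim_v{i + 1}")  # defense round
print(policy)  # victim_v3
```

Each defense round here mirrors the `--resume_path` + `--freeze_opponent` invocation shown above, with the roles of attacker and defender swapped every iteration.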