Nowadays, deep learning models for image classification are increasingly present in our systems, used to build smarter applications or simply to automate repetitive tasks previously performed by human operators. Their growing adoption is due to their high accuracy: recent models can even outperform humans on many object classification tasks. However, despite their good generalization, deep neural networks remain vulnerable to adversarial examples. Much effort has therefore been devoted in recent years to understanding the vulnerabilities of deep learning models and exploiting them to perform adversarial attacks. At the same time, new defence methods have been developed to counter those attacks and make deep learning systems more reliable in the real world. Hence, in this post we review the progress made on both attack and defence for images, so as to better understand the research carried out in this field as well as its possible future directions.
Since the advent of deep convolutional networks for image classification, the computer vision field has seen many important breakthroughs, such as Inception, DenseNet and ResNet, aimed at further improving the accuracy obtained on many image classification tasks. However, despite their high accuracy, these models lack robustness, since they can easily be deceived by adversarial examples. More formally, an adversarial example is an input sample that has been modified very slightly, in a way intended to cause a machine learning classifier to misclassify it. An adversarial attack thus consists of generating such adversarial samples and feeding them to the target model so as to cause a wrong prediction. These attacks can constitute a serious threat that may compromise the deployment of deep learning models in real systems. For instance, an adversarial example on a road sign might cause a faulty manoeuvre by an autonomous car, leading to a possible collision or at least a dangerous situation. Similarly, adversarial examples on human faces might maliciously fool face recognition systems, allowing impersonation or identity dodging, thus making those systems unreliable.
Conversely, adversarial defence aims to protect deep learning models from adversarial attacks, that is, to make models more robust against adversarial examples. In this context, research on adversarial robustness resembles a minimax game in which attackers constantly devise more powerful techniques to fool deep learning models while, at the same time, defenders invent new defence methods to resist these malicious attacks. The scope of this post is therefore to review how attack and defence methods have evolved over time, in the hope that a clear overview of their evolution may give researchers ideas on how to keep adversarial learning balanced between attack and defence.
This post is part of a collection including the following 6 posts:
- Threat models (current post);
- White-box attacks;
- Black-box transfer-based attacks;
- Black-box score-based attacks;
- Black-box decision-based attacks;
- Defence methods.
In the next sections, we explain the properties of threat models, that is, sets of assumptions about the adversary's goals, knowledge, and capabilities. A threat model specifies the conditions under which a defence is designed to be secure.
To be effective, an adversarial example must be misclassified by deep learning models but not by the human brain; thus, only small changes can be made to the original input image x (the legitimate example) to craft the adversarial example x_adv = x + δ, where δ is also known as the adversarial noise. The distance introduced by the noise between x and x_adv is usually measured by the l_p norm of the difference between the original and the adversarial sample, for some p ∈ [0, ∞]. An adversarial example x_adv must therefore satisfy the constraint ||x_adv − x||_p ≤ ε (the adversarial constraint), where smaller values of ε correspond to smaller input perturbations and hence to less perceptible changes, under the condition that x_adv is misclassified. For instance, the l_0 distance counts the number of pixels that have been altered in the image, the l_2 distance measures the standard Euclidean (root-mean-square) distance between x and x_adv, and the l_∞ distance measures the maximum change to any single coordinate.
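As a minimal sketch of these three distances, the snippet below measures the l_0, l_2 and l_∞ norms of a perturbation on a toy image and checks the adversarial constraint. The image, the single perturbed pixel and the budget ε are all illustrative values, not taken from any particular attack:

```python
import numpy as np

# Toy 8x8 grayscale "image" in [0, 1]; values are illustrative.
x = np.full((8, 8), 0.5)
delta = np.zeros_like(x)
delta[0, 0] = 0.05          # perturb a single pixel by 0.05
x_adv = x + delta

diff = (x_adv - x).ravel()
l0 = np.count_nonzero(diff)     # l_0: number of altered pixels
l2 = np.linalg.norm(diff)       # l_2: Euclidean norm of the noise
linf = np.max(np.abs(diff))     # l_inf: largest single-pixel change

eps = 0.1                       # hypothetical perturbation budget
print(l0, l2, linf, linf <= eps)
```

Here one altered pixel gives l_0 = 1, while l_2 and l_∞ both equal 0.05, so the perturbation satisfies an l_∞ constraint with ε = 0.1.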
In practice, adversarial attacks can be designed to achieve two different goals:
- Untargeted: Untargeted attacks aim to make the model predict any class other than the correct one, without targeting a specific class. Formally, given a classifier f and the true label y, an adversarial example causes f(x_adv) != y. The goal of an untargeted attack is therefore to maximize the loss of the attacked model with respect to the true label y, namely max L(x_adv, y), subject to ||x_adv − x||_p ≤ ε. Given the relaxed constraint the adversarial example is subject to, these attacks are usually easier to perform.
- Targeted: Conversely, targeted attacks try to mislead the model's prediction toward a specific class y', that is, f(x_adv) = y' with y' != y. The objective can then be formulated as min L(x_adv, y'), again under the adversarial constraint. In this way, given the input x_adv, the attacked model becomes more likely to predict the target class y' than any other class.
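The two objectives above can be made concrete with the cross-entropy loss, the usual choice for L in classification. The sketch below evaluates both objectives on hypothetical 3-class logits; the logits, the true label y and the target y' are illustrative and stand in for the output of some classifier f(x_adv):

```python
import numpy as np

def cross_entropy(logits, label):
    # Numerically stable softmax cross-entropy for a single example.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

# Hypothetical logits f(x_adv) for a 3-class model.
logits = np.array([2.0, 0.5, -1.0])
y, y_target = 0, 2   # true label and attacker's target (illustrative)

untargeted_objective = cross_entropy(logits, y)        # attacker maximizes this
targeted_objective = cross_entropy(logits, y_target)   # attacker minimizes this
print(untargeted_objective, targeted_objective)
```

An untargeted attack perturbs x_adv to push `cross_entropy(logits, y)` up, while a targeted attack pushes `cross_entropy(logits, y_target)` down, both while keeping ||x_adv − x||_p ≤ ε.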
Adversarial attacks are divided into two main categories depending on how much information is available regarding the model under attack:
- White-box attacks: Under this setting, the adversary has full access to and knowledge of the model: its architecture, parameters, gradients and loss with respect to the input, as well as any defence mechanisms, are all known to the attacker. It is thus not particularly difficult to attack models under this condition, and common methods exploit the gradient of the model's loss with respect to the input to generate adversarial examples.
- Black-box attacks: In this category of attacks, the adversary has zero or very little knowledge about the model. Existing methods thus often rely on training a similar surrogate model, or an ensemble of them. These methods work because, in general, adversarial examples that fool one model are likely to fool another similar model. In practice this is the most likely kind of attack, since, under normal circumstances, attackers cannot access much of the model's internals. Other methods exploit knowledge of the prediction scores, or only of the predicted label, to craft adversarial examples.
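To illustrate the white-box setting, the sketch below applies a one-step l_∞ gradient attack in the style of the fast gradient sign method (FGSM) to a toy logistic-regression "model". The weights, input and budget ε are illustrative; the point is that a white-box attacker can use the exact input gradient of the loss, which here has a simple closed form:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary classifier p(y=1|x) = sigmoid(w . x); the white-box
# attacker knows w exactly (illustrative values).
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, 0.1, 0.4])
y = 1                           # true label

# For L = -log p(y=1|x), the input gradient is dL/dx = (p - y) * w.
p = sigmoid(w @ x)
grad_x = (p - y) * w

eps = 0.25
x_adv = x + eps * np.sign(grad_x)   # one-step l_inf attack (FGSM-style)

p_adv = sigmoid(w @ x_adv)
print(p, p_adv)   # the model's confidence in the true class drops
```

The perturbation moves every coordinate by exactly ε in the direction that increases the loss, so the l_∞ constraint is met with equality; black-box attackers, lacking `grad_x`, must instead estimate it through a surrogate model or through queries to the target.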