Nowadays, image classification deep learning models are increasingly present in our systems, whether to build smarter applications or simply to replace human operators in repetitive tasks.
Their growing adoption is due to their high accuracy: recent models are now able to outperform humans on many object classification tasks. However, despite their good generalization, deep neural networks are still vulnerable to adversarial examples.
Thus, much effort has been devoted in recent years to understanding the vulnerabilities of deep learning models and how to exploit them to perform adversarial attacks. At the same time, new defence methods have been developed to counter those attacks and make deep learning systems more reliable in the real world.
Hence, in this post, we review the progress made on both attacks and defences on images, so as to better understand the research done in this field as well as its possible future directions.
Since the advent of deep convolutional networks for image classification, the computer vision field has seen many important breakthroughs, such as Inception, DenseNet, and ResNet, each aiming to further improve the accuracy obtained on many image classification tasks.
However, despite their high accuracy, those models lack robustness, since they can easily be deceived by adversarial examples. More formally, an adversarial example is an input sample that has been modified very slightly in a way intended to cause a machine learning classifier to misclassify it.
Hence, an adversarial attack consists of generating such adversarial samples and feeding them to the target model, thus causing a wrong prediction. These attacks can therefore constitute a serious threat that may compromise the deployment of deep learning models in real systems.
For instance, an adversarial example on a road sign might cause an autonomous car to perform a faulty manoeuvre, leading to a possible collision or otherwise creating a dangerous situation.
Similarly, adversarial examples of human faces might maliciously fool face recognition systems, allowing impersonation or identity dodging, thus making those systems unreliable.
Conversely, adversarial defence aims to protect deep learning models from adversarial attacks, namely, making models more robust against adversarial examples.
In this context, research on adversarial robustness resembles a minimax game where attackers constantly try to exploit more powerful techniques to fool deep learning models while, at the same time, defenders have to invent new defence methods to resist these malicious attacks.
Thus, the scope of this post is to review how attack and defence methods have evolved over time, in the hope that a clear overview of their evolution might give researchers ideas on how to keep adversarial learning balanced between attack and defence.
This post is part of a collection including the following 6 posts:
- Threat models (current post);
- White-box attacks;
- Black-box transfer-based attacks;
- Black-box score-based attacks;
- Black-box decision-based attacks;
- Defence methods.
In the next sections, we explain the properties of threat models, that is, the set of assumptions about the adversary’s goals, knowledge, and capabilities.
A threat model specifies the conditions under which a defence is designed to be secure.
An adversarial example, to be effective, has to be misclassified by the deep learning model but not by the human brain. Thus, only small changes can be made to the original input image x (the legitimate example) to craft the adversarial example xadv = x + δ, where δ is also known as the adversarial noise.
The distance introduced by the noise between x and xadv is usually measured by the lp norm of the difference between the original and the adversarial sample, for some p ∈ {0, 1, 2, …, ∞}. Thus, an adversarial example xadv has to satisfy the constraint ||xadv - x||p <= Ɛ (the adversarial constraint), where smaller Ɛ values correspond to smaller input perturbations, and thus less perceptible changes, under the condition that xadv is misclassified.
For instance, the l0 distance counts the number of pixels that have been altered in the image, the l2 distance measures the standard Euclidean (root-mean-square) distance between x and xadv, and the l∞ distance measures the maximum change to any single coordinate.
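As a toy illustration (the pixel values below are made up, not tied to any real image or model), the three distances can be computed for a small flattened image and a copy in which only two pixels were altered:

```python
import math

# Original 3x3 grayscale image, flattened into a list of pixel values.
x = [0.1, 0.5, 0.9,
     0.2, 0.6, 0.8,
     0.3, 0.4, 0.7]

# Adversarial copy: only two pixels are perturbed.
x_adv = list(x)
x_adv[0] += 0.05
x_adv[7] -= 0.02

# Adversarial noise delta = x_adv - x, coordinate by coordinate.
delta = [a - b for a, b in zip(x_adv, x)]

l0 = sum(1 for d in delta if d != 0)        # number of altered pixels
l2 = math.sqrt(sum(d * d for d in delta))   # Euclidean distance
linf = max(abs(d) for d in delta)           # largest single-pixel change
```

Under an l∞ constraint with Ɛ = 0.05, this perturbation would be admissible, since no single pixel changed by more than 0.05.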
In practice, adversarial attacks can be designed to achieve two different goals:
- Untargeted: Untargeted attacks aim at making the model predict any class different from the correct one, without targeting a specific class. Formally, given a classifier f and the true label y, an adversarial example causes f(xadv) != y. Hence, the goal of an untargeted attack is to maximize the loss of the attacked model with respect to the true label y, namely, max L(xadv, y), subject to ||xadv - x||p <= Ɛ. Given the more relaxed constraint these adversarial examples are subject to, such attacks are usually easier to perform.
- Targeted: Conversely, targeted attacks try to mislead the model’s prediction toward a specific class y‘, that is, f(xadv) = y‘, where y‘ != y. The objective can then be formulated as min L(xadv, y‘), still subject to the adversarial constraint. In this way, given the input xadv, the attacked model becomes more likely to predict the target class y‘ than any other class.
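The two objectives can be made concrete with a minimal sketch of an FGSM-style sign step. The model below is a hypothetical binary logistic classifier (an assumption for illustration, not a method from this post), chosen because its input gradient has a closed form, so no autograd library is needed: the untargeted variant steps *up* the gradient of the loss for the true label, while the targeted variant steps *down* the gradient of the loss for the target label.

```python
import math

# Hypothetical model: f(x) = sigmoid(w . x + b). For the cross-entropy
# loss L, the gradient w.r.t. the input is dL/dx = (f(x) - y) * w,
# which here stands in for backpropagation through a deep network.

def predict(x, w, b):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def input_grad(x, y, w, b):
    p = predict(x, w, b)
    return [(p - y) * wi for wi in w]

def sign(g):
    return [-1.0 if gi < 0 else 1.0 for gi in g]

def fgsm_untargeted(x, y, w, b, eps):
    # Maximize L(x_adv, y): step up the loss gradient for the true label y.
    s = sign(input_grad(x, y, w, b))
    return [xi + eps * si for xi, si in zip(x, s)]

def fgsm_targeted(x, y_target, w, b, eps):
    # Minimize L(x_adv, y'): step down the loss gradient for target y'.
    s = sign(input_grad(x, y_target, w, b))
    return [xi - eps * si for xi, si in zip(x, s)]

# Usage: a point confidently classified as class 1 is pushed toward class 0.
w, b = [2.0, -1.0], 0.0
x = [1.0, 0.5]
x_adv = fgsm_untargeted(x, 1, w, b, eps=0.9)
```

Note that the sign step satisfies ||xadv - x||∞ = Ɛ by construction, so the adversarial constraint holds with equality in the l∞ norm.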
Adversarial attacks are divided into two main categories depending on how much information is available regarding the model under attack:
- White box attacks: Under this setting, the adversary has full access to and knowledge of the model, that is, its architecture, parameters, gradients, and loss with respect to the input, as well as any deployed defence mechanisms. It is thus not particularly difficult to attack models under this condition, and common methods exploit the model’s gradients to generate adversarial examples.
- Black box attacks: In this category of attacks, the adversary has zero or very little knowledge about the model. Existing methods thus often rely on training a similar substitute model, or an ensemble of them, and transferring the adversarial examples crafted on it; this works because, generally, adversarial examples that fool one model are likely to fool another similar model. In practice, this is the most realistic kind of attack, since, in normal circumstances, attackers cannot access the model’s internals. Other methods exploit the model’s output scores, or only the predicted label, to craft adversarial examples.
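As a minimal sketch of the score-based black-box idea (reusing the toy logistic model from above as the black box, an assumption for illustration rather than a specific published attack): when the attacker can only query the model and read the output score, the input gradient can be approximated with finite differences instead of backpropagation.

```python
import math

# Black-box model: the attacker can only call `query` and read the score;
# the weights hidden inside are unknown to the attacker.
def query(x, _w=(2.0, -1.0), _b=0.0):
    z = sum(wi * xi for wi, xi in zip(_w, x)) + _b
    return 1.0 / (1.0 + math.exp(-z))

def estimate_grad(model, x, h=1e-4):
    # Central finite differences: two queries per input coordinate.
    g = []
    for i in range(len(x)):
        x_hi = list(x); x_hi[i] += h
        x_lo = list(x); x_lo[i] -= h
        g.append((model(x_hi) - model(x_lo)) / (2 * h))
    return g

x = [1.0, 0.5]
g = estimate_grad(query, x)
```

For this toy model the true input gradient of the score is f(x)(1 - f(x)) · w, so the estimate can be checked against it; the query cost, two calls per coordinate, is exactly why practical score-based attacks on high-dimensional images rely on smarter gradient estimators.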