In the first post we introduced the concept of adversarial attacks and contextualized in the case of images. In this post we are going to explore the first category of attacks, namely, white-box attacks. Under this setting, the adversary has full access and knowledge of the model, that is, the architecture of the model, it’s parameters, gradients and loss respect to the input as well as possible defense mechanisms are known to the attacker. It is thus not particularly difficult to attack models under this condition and common methods exploits model’s output gradient to generate adversarial examples. More specifically, we are going to learn about the following attacks:
This post is part of a collection including the following 6 posts:
- Threat models;
- White-box attacks (current post);
- Black-box transfer-based attacks;
- Black-box score-based attacks;
- Black-box decision-based attacks;
- Defence methods.
Fast Gradient Sign Method (FGSM) is a basic one-step gradient-based approach which is able to find an adversarial example in a single step by maximizing the loss function L(xadv, y) respect to the input x and then adding back the sign of the output gradient to (x) so to produce the adversarial example xadv:
where ∇xL(x,y) is the gradient of the loss respect to the input x, and the equation is expected to meet the l∞ norm bound by design. This method works because adding a perturbation to the legitimate input such to maximize its loss respect to the correct label y decreases the likelihood that y could be predicted given the input xadv. Mathematically, it moves the adversarial example in one direction toward the border between the true class and some other class.
We can consider Fast Gradient Method (FGM) as a generalization of FGSM where the only difference is that FGM doesn’t apply the sign operator to the calculated gradient. It can be deﬁned as
Note that in this case the gradient is divided by its lp norm to prevent the adversarial noise to break the constraint ||xadv-x||p<=Ɛ. Despite the sign operator can be regarded as a normalization process, however, it also changes the most effective direction, thus making FGSM less effective than FGM. Given their similarity, all those methods exploiting FGSM can also be applied to FGM and vice versa.
Basic iterative methods iteratively apply fast gradient multiple times with a small step size α. Thus, the iterative version of FGSM (I-FGSM) can be expressed as
where xadv0=x (legitimate example). There are several ways to make the adversarial example satisfy the norm bound. For example, xadv could be clipped into the Ɛ vicinity of x or set α=Ɛ/T with T being the number of iterations. It has been proved that iterative methods exploit much finer perturbations which do not destroy the image even with higher Ɛ and at the same time confuse the classifier with higher rate. Their drawbacks are that iterative methods are a little bit slower than their one-step counterparts, and more importantly, they show poor performance on transferability, which is a fundamental propriety in black box attacks.
Momentum iterative gradient-based methods integrate momentum into iterative fast gradient method to generate adversarial examples satisfying the lp norm bound. Traditionally, momentum is a technique for accelerating gradient descent algorithms by accumulating a velocity vector in the gradient direction of the loss function across iterations. However, this concept can also be applied to generate adversarial examples and obtain tremendous benefits. As it was for the learning rate update, the first step consists of updating the momentum gt by accumulating the velocity vector in the gradient direction as
where µ is a decay factor. Next, the adversarial example xadvt is perturbed in the direction of the sign of gt with a step size α as
In each iteration, the current gradient ∇xL(x,y) is normalized by the lp distance of itself, because the authors noticed that the scale of the gradients in different iterations varies in magnitude.
PGD is another iterative algorithm that exploits projected gradient descend to iteratively craft adversarial examples as
where S is the set of all the allowed perturbations. Projected gradient descent performs one step of standard gradient descent, and then clips all the coordinates to be within the box. Moreover, in order to explore a large part of the loss landscape, the algorithm is restarted from many points within the l∞-balls around data points taken from the evaluation set. Thanks to this large number of observations, the authors realized that all the local maxima found by PGD have similar loss values, both for normally trained networks and for adversarially trained networks, thus pointing out that robustness against the PGD adversary yields robustness against all first-order adversaries such as SGD based attacks. This conclusion also leads to the fact that as long as the adversary only uses gradients of the loss function with respect to the input, it will not find significantly better local maxima than PGD.
DeepFool is an iterative algorithm to efficiently compute perturbations that fool deep networks, and thus reliably quantify the robustness of these classifiers. At each iteration t, it approximates the distance between xadvt and the complement of the convex polyhedron P which defines the region of the space where f(x)=y. Consequently, at each iteration of the algorithm, the perturbation vector that reaches the boundary of the polyhedron P is computed, and the current estimate updated. The iterations keep going until f(x)!=f(xadvt) so to craft an adversarial example with minimum perturbations. DeepFool is thus a greedy algorithm that is not guaranteed to converge to the optimal perturbation. However, the author observed that in practice the algorithm yields very small perturbations which are believed to be good approximations of the minimal perturbations.
The method proposed by Carlini & Wagner relies on the initial formulation of adversarial examples and formally defines the problem of finding an adversarial instance for an image x as follows:
under the condition that (x+δ)∈[0,1]n. The function f is an objective function such that L(x+δ, y)!=y if and only if L(x+δ, y)<=0. Note that x+δ is equivalent to xadv by definition of adversarial example. The value c>0 is a constant that is often chosen as the smallest value of c for which the resulting solution xadv has L(xadv, y)<=0. Moreover, to ensure that the modification yields a valid image, the adversarial noise δ is constrained such that 0<=xi+δi<=1 ∀ i (this methods assumes image pixels to be in the range [0,1]). One way to do it is to replace x+δ with ((1+tanh(w))/2) so that the optimization problem in defined above becomes an unconstrained minimization problem in w. Finally, the loss is iteratively optimized with Adam optimizer since has it has been found be the most effective at quickly finding adversarial examples. The authors also showed that this method is particularly effective in breaking models re-trained with defensive distillation which is a strong security very robust against most gradient-based attack methods.
JSMA is an attack optimized under l∞ distance known as the Jacobian-based Saliency Map Attack. A saliency map rates each input feature xi (e.g. each pixel) on how influential it is for causing the model to predict a particular class. A saliency map S+(xi,c) is computed using the output gradients to calculate how much each xi positively correlates with y=c and how negatively correlates with all other classes y’!=y. An attacker can exploit this saliency map by targeting an adversarial class y’ that does not match the true class label y of a given sample x. By increasing a few high-saliency pixels xi according to S+(xi,c=y’), the modified image xadv will have an increased prediction confidence f(xadv) for the adversarial class y’, and thus might result in misclassification. There exist also four variants of this attack denoted as JSMA+Z, JSMA-Z, JSMA+F, and JSMA-F, based on the choices of adopting (S+) (positive saliency map) or (S–) (negative saliency map), and using Z (gradients of the logits) versus F (gradients of the outputs). This methods results useful for finding adversarial examples both quickly and with limited perceptual differences.
All previous methods work only under the condition that the gradients are exactly known. However, defenses that cause obfuscated gradients appear to defeat iterative optimization-based attacks. To break those defences, Backward Pass Differentiable Approximation (BPDA) attacks approximate derivatives by computing the forward pass normally and computing the backward pass using a differentiable approximation of the of the non-differentiable layers. It works by iteratively generating l∞ bounded adversarial examples using PGD confined to a speciﬁed l∞ ball, and uses and an optimization objective as in C&W. To approximate ∇xf(x), it first finds a differentiable approximation g(x) such that g(x)≈fi(x), where fi(x) is a non-differentiable layers (i.e. a randomized transformation of the input). Afterwards, ∇xf(x) can be approximated by performing the forward pass through f, but on the backward pass, replacing fi(x) with g(x). One minor drawback is that since each individual gradient descent step is not exactly correct, applying BPDA often requires more iterations of gradient descent.
A new class of attacks known as blind-spot attacks (BSA), exploits blind-spots of the empirical distribution of training to fool adversarially trained models. With “blind-spots” we refer to that input images far enough from any training examples in some embedding space, but still in the ground-truth data distribution, thus well recognized by humans and correctly classified by the model. Note that blind-spot images are not adversarial images themselves but constitute a convenient starting point to craft them. Blind-spots images can be easily found by applying a few simple transformations on the ground-truth images such as slightly changing contrast and background to the ground-truth data so that they are still correctly classified by the target model. Next, traditional methods like C&W attacks are employed on these transformed images so to find their corresponding adversarial examples. An interesting propriety of blind-spots is that if the parameters are accurately tuned, models’ performance rarely decrease, yet their adversarial examples can be easily found.