Defence methods for image adversarial attacks

In the previous post, we reviewed some well-known methods for black-box decision-based adversarial attacks where the adversary has no knowledge about the victim model except for its discrete hard-label predictions.

In that setting gradient-based methods become ineffective, but simple random-walk-based methods such as the Boundary Attack can still pose a threat.

Now that we have introduced both white and black-box attacks under different threat models, we are going to conclude this series by understanding how models can be, sometimes unsuccessfully, defended against these malicious attacks.

Different types of adversarial attacks


This post is part of a collection including the following 6 posts:

  1. Threat models;
  2. White-box attacks;
  3. Black-box transfer-based attacks;
  4. Black-box score-based attacks;
  5. Black-box decision-based attacks;
  6. Defence methods (current post).

Defence methods

Given the variety and effectiveness of the existing adversarial attack methods, it is very important to keep deep learning models safe and make their applications reliable.

However, since the cure shouldn’t be worse than the disease, researchers have outlined some design requirements for defences against adversarial perturbations, so as to ensure that the model’s performance is not degraded (link):

  • Low impact on the architecture: techniques should limit the modifications made to the architecture to avoid abnormal behaviors.
  • Maintain accuracy: defenses against adversarial samples should not decrease the model’s classification accuracy.
  • Maintain speed of network: the solutions should not significantly impact the running time of the classifier at test time.
  • Defenses should work for adversarial samples relatively close to points in the training dataset: samples that are very far away from the training dataset are irrelevant to security because they can easily be detected, at least by humans.

Thus, extensive research has been conducted in this direction in order to build robust models that can withstand adversarial attacks.

As with attack methods, defence methods can be split into categories based on the strategy they employ. Defence techniques can be classified into the following categories:

  • Robust training: These techniques aim to make a classifier robust against small input perturbations. One strategy is adversarial training, which augments the training data with adversarially generated examples; another is defensive distillation, which retrains a network on previously generated soft labels. A further option is to train robust models with regularization, for example based on the Lipschitz constant or on the perturbation norm.
  • Input transformation: These defences transform the inputs right before feeding them to the classifier. Among them are JPEG compression, feature squeezing, total variance minimization and image quilting. More sophisticated methods project adversarial examples onto the data distribution through generative models.
  • Randomization: Randomness-based defence methods add randomness either to the input, for example through random resizing or padding, or to the model parameters, so as to mitigate the effect of adversarial examples.
  • Model ensemble: Ensemble methods can also be used for defence purposes. Besides simply aggregating the outputs of each model in the ensemble, some methods average the predictions over random noise injected into the model. Similarly, a regularizer can be introduced to promote diversity among the predictions of the different models.

Note that these categories are not exclusive, that is, a defence method can belong to many categories.

In order to avoid creating confusion, in the following sections we are going to describe several methods and, at the end of this section, we are going to summarize their respective category in a specific table.

Adversarial training

Adversarial training is a very simple and intuitive defence method to protect a model against adversarial examples. In the first step, adversarial examples are generated by running different attack methods on the target model.

In the second step, those adversarial examples are merged with the original training set to form an augmented training set, and the target model is retrained on it.

During the retraining phase, an adversarial objective function can be minimized along with the model’s original objective function, and it can work as an effective regularizer.

Evaluating this defence method, its authors found that without adversarial training the error rate of a target model was 89.4% on adversarial examples generated with FGSM. With adversarial training, the error rate fell to 17.9% on the same set of adversarial examples (link).
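The retraining loop described above can be sketched on a toy problem. The example below is only an illustration under our own assumptions: it uses a logistic-regression model on two synthetic Gaussian blobs instead of a deep network, crafts FGSM examples against the current weights at every step, and all names (`fgsm`, `train`, `error_rate`) and hyperparameters are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """FGSM on logistic regression: move each input by eps along the sign
    of the gradient of its cross-entropy loss with respect to the input."""
    grad_x = (sigmoid(x @ w + b) - y)[:, None] * w[None, :]
    return x + eps * np.sign(grad_x)

def train(x, y, eps=0.0, epochs=200, lr=0.5):
    """Gradient descent. If eps > 0, each step trains on the clean batch
    augmented with FGSM examples crafted against the current weights."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(epochs):
        if eps > 0:
            xb = np.concatenate([x, fgsm(x, y, w, b, eps)])
            yb = np.concatenate([y, y])
        else:
            xb, yb = x, y
        err = sigmoid(xb @ w + b) - yb
        w = w - lr * xb.T @ err / len(yb)
        b = b - lr * err.mean()
    return w, b

def error_rate(x, y, w, b):
    return float(np.mean(((x @ w + b) > 0) != (y == 1)))

# Toy data (an assumption for illustration): two Gaussian blobs.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(1.0, 1.0, (200, 2)),
                    rng.normal(-1.0, 1.0, (200, 2))])
y = np.concatenate([np.ones(200), np.zeros(200)])

w_adv, b_adv = train(x, y, eps=0.5)          # adversarially trained model
x_adv = fgsm(x, y, w_adv, b_adv, eps=0.5)    # attack on that model
print("clean error:", error_rate(x, y, w_adv, b_adv))
print("FGSM error: ", error_rate(x_adv, y, w_adv, b_adv))
```

On a linear model the FGSM step always reduces the margin of the true class, so the error on `x_adv` can never be lower than the clean error; how much adversarial training helps depends on the data and on `eps`.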

Parseval regularization

Parseval regularization is a layerwise regularization method for reducing the network’s sensitivity to small perturbations by carefully controlling its global Lipschitz constant. It works by constraining the Lipschitz constant of each hidden layer to be smaller than one.

That way, it avoids the exponential growth of the Lipschitz constant, and a usual regularization scheme (i.e., weight decay) at the last layer then controls the overall Lipschitz constant of the network.

To enforce these constraints in practice, a network trained with Parseval regularization combines two ideas: maintaining orthonormal rows in linear and convolutional layers, and performing convex combinations in aggregation layers.

As a consequence of this regularization method, models train faster and make better use of their capacity (link).
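To make the "orthonormal rows" constraint concrete, here is a minimal sketch. It assumes an update of the form W ← (1 + β)W − βWWᵀW applied to a weight matrix, which we take here as an assumption rather than something stated in this post; the function names and the value of β are hypothetical.

```python
import numpy as np

def parseval_step(W, beta=0.1):
    """One step pulling the rows of W toward orthonormality (W W^T = I)."""
    return (1.0 + beta) * W - beta * W @ W.T @ W

def row_orthonormality_gap(W):
    """Frobenius distance between W W^T and the identity."""
    return float(np.linalg.norm(W @ W.T - np.eye(W.shape[0])))

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 16)) / np.sqrt(16)   # toy 4x16 linear layer
gap_before = row_orthonormality_gap(W)
for _ in range(200):
    W = parseval_step(W)
gap_after = row_orthonormality_gap(W)
spectral_norm = float(np.linalg.norm(W, 2))  # L2 Lipschitz constant of x -> W x
print(gap_before, gap_after, spectral_norm)
```

Once the rows are orthonormal, all singular values of W equal one, so the linear map cannot amplify a perturbation in the L2 norm. In a real training loop a step like this would be interleaved with the usual gradient updates.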

Cross-Lipschitz regularization

Cross-Lipschitz regularization is a regularization constraint whose goal is to keep the differences between the classifier functions as constant as possible around the training images. The Cross-Lipschitz regularizer and the corresponding regularized objective function are defined as:

\Omega (f)=\frac{1}{nK^2}\sum_{i=1}^n\sum_{l,m=1}^K||\nabla f_l(x_i)-\nabla f_m(x_i)||_2^2,

\frac{1}{n}\sum_{i=1}^n L(y_i, f(x_i))+\lambda\Omega(f),

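where x_i, i = 1, …, n, are the training data and f_y(x) is the probability that classifier f assigns to class y. The first expression is the regularizer Ω(f) and the second is the full training objective. Minimizing the objective makes the loss term enlarge the differences f_c(x_i) − f_j(x_i) between the correct class c and every other class j, while the regularizer keeps ||∇f_l(x_i) − ∇f_m(x_i)||_2^2 uniformly small over all pairs of classes.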

The authors of this method proved that training with Cross-Lipschitz regularization automatically enforces the robustness of the resulting classifier (link).
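As an illustration of the regularizer Ω(f), the sketch below evaluates it for a linear softmax classifier, where the input gradients of the class probabilities are available in closed form. The choice of model and the names `class_gradients` and `cross_lipschitz` are our assumptions, made only to keep the example self-contained.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def class_gradients(W, b, x):
    """Row l holds the gradient of f_l(x) = softmax(W x + b)_l w.r.t. x."""
    p = softmax(W @ x + b)
    return (np.diag(p) - np.outer(p, p)) @ W

def cross_lipschitz(W, b, X):
    """Omega(f) = 1/(n K^2) * sum_i sum_{l,m} ||grad f_l(x_i) - grad f_m(x_i)||^2."""
    K = W.shape[0]
    total = 0.0
    for x in X:
        G = class_gradients(W, b, x)
        diff = G[:, None, :] - G[None, :, :]   # all pairwise gradient differences
        total += float(np.sum(diff ** 2))
    return total / (len(X) * K ** 2)

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 5)), np.zeros(3)   # K = 3 classes, 5 input features
X = rng.normal(size=(10, 5))
omega = cross_lipschitz(W, b, X)
print(omega)
```

In training, `omega` would be multiplied by λ and added to the average loss, exactly as in the second expression above.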

Deep Defence

Many regularization-based defence methods only approximate the learning objective of the target network. However, this may lead to degraded prediction accuracy on the test set or to a lack of robustness against advanced adversarial examples.

Therefore, Deep Defence proposes to jointly optimize the original network objective and a scaled ||∇_x||_p term, the norm of the adversarial perturbation, used as a regularizer. Thus, it aims at minimizing:

\min \sum_k L(y_k,x_k)+\lambda \sum_{k \in T} R \left(-c \frac{||\nabla_{x_k}||_p}{||x_k||_p}\right)+\lambda \sum_{k \in F} R \left(d \frac{||\nabla_{x_k}||_p}{||x_k||_p}\right),

where ||∇_{x_k}||_p can be generated with any attack method (e.g., DeepFool). The term R(||∇_{x_k}||_p/||x_k||_p) is calculated by means of a recursive-flavored regularizer network. It takes an image x_k as input and calculates each component of the perturbation by means of an incorporated multi-layer attack module.

F is the index set of the misclassified training samples, while the correctly classified ones are indexed by T. The constants c, d > 0 are two scaling factors that balance the importance of different samples, and R is chosen as the exponential function.

This kind of regularization penalizes correctly classified samples with abnormally small ||∇_x||_p more heavily than those with relatively large values, so as not to penalize samples that are already robust.

According to its authors, Deep Defence not only outperforms adversarial training and Parseval regularization by large margins on various datasets, but also does not degrade the accuracy on benign samples (link).

The recursive-flavored network, which takes a reshaped image x_k as input and sequentially computes each perturbation component by using a pre-designed attack module
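The effect of the two regularization sums can be seen on a few numbers. The sketch below only evaluates the per-sample terms R(−c·ratio) and R(d·ratio) with R = exp; it assumes the perturbation norms have already been produced by some attack, and the function name and the values of c and d are hypothetical placeholders.

```python
import numpy as np

def deep_defence_terms(pert_norms, input_norms, correct, c=5.0, d=1.0):
    """Per-sample regularization terms with R = exp.
    Correctly classified samples (set T) contribute exp(-c * ratio),
    misclassified samples (set F) contribute exp(d * ratio)."""
    ratio = np.asarray(pert_norms, dtype=float) / np.asarray(input_norms, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    return np.where(correct, np.exp(-c * ratio), np.exp(d * ratio))

# Three samples: fragile but correct, robust and correct, misclassified.
terms = deep_defence_terms(pert_norms=[0.01, 0.5, 0.2],
                           input_norms=[1.0, 1.0, 1.0],
                           correct=[True, True, False])
print(terms, terms.sum())
```

The fragile, correctly classified sample gets a term close to 1, while the robust one contributes almost nothing, which is the behaviour described above.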

Defensive distillation

Distillation is a technique used to reduce the size of DNN architectures so as to reduce their computing resource needs.

It consists of training a distilled network f_d using the classification predictions of the original network f. The intuition behind this method is that training with soft labels provides additional knowledge compared to hard labels, since they encode the relative differences between classes.

Thus, it has been proposed to use this method, known as defensive distillation, to train robust classifiers. The main difference between defensive distillation and the original distillation is that f and f_d have the same architecture.

This difference is justified since the goal of defensive distillation is resilience instead of compression. This defense method also dramatically reduces gradient magnitude, thus making it more difficult to generate adversarial examples for gradient-based methods (link).

An initial network F is first trained on data X with a softmax temperature of T. The probability vector F(X) predicted by network F, which includes additional knowledge about classes compared to a class label, is then used to train a distilled network F_d at temperature T on the same data X
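The role of the temperature T can be illustrated with a few lines of code. This is a minimal sketch with made-up logits; `softmax_T` and `soft_label_cross_entropy` are hypothetical helper names, and the training loops of F and F_d are omitted.

```python
import numpy as np

def softmax_T(logits, T=1.0):
    """Softmax at temperature T; a larger T gives softer probabilities."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_label_cross_entropy(student_logits, teacher_probs, T):
    """Loss used to fit the distilled network to the teacher's soft labels."""
    log_p = np.log(softmax_T(student_logits, T))
    return float(-(teacher_probs * log_p).sum(axis=-1).mean())

teacher_logits = np.array([[8.0, 2.0, 0.5]])   # made-up output of F
hard_like = softmax_T(teacher_logits, T=1.0)   # nearly one-hot
soft = softmax_T(teacher_logits, T=20.0)       # keeps class similarities
print(hard_like.round(3), soft.round(3))

# The distilled network is trained to match `soft` at the same temperature.
loss_match = soft_label_cross_entropy(teacher_logits, soft, T=20.0)
loss_other = soft_label_cross_entropy(np.array([[0.0, 5.0, 0.0]]), soft, T=20.0)
```

At T = 1 almost all the probability mass sits on one class, whereas at a high temperature the relative ordering of the other classes becomes visible in the soft labels.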

Input transformations

One of the simplest and most naive ways to defend a target network through input transformation is to apply JPEG compression to the input itself.

However, even against simple attacks such as FGSM, JPEG compression cannot always reverse the drop in classification accuracy, and its effectiveness degrades rapidly as the magnitude of the perturbations increases (link).

Another approach, called feature squeezing, reduces the degrees of freedom available to an adversary by squeezing out unnecessary input features, so as to provide fewer opportunities to construct adversarial examples.

If the model’s predictions on the original sample and on the squeezed sample differ substantially, then the input is likely to be adversarial. Feature squeezing techniques include reducing the color bit depth of each pixel and spatial smoothing.
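The two squeezers and the detection rule described above can be sketched as follows. This is only an illustration: `predict` stands for any function returning a probability vector, and the L1 distance, the threshold and all function names are assumptions of this example.

```python
import numpy as np

def reduce_bit_depth(img, bits):
    """Quantize an image with values in [0, 1] to 2**bits levels."""
    levels = 2 ** bits - 1
    return np.round(img * levels) / levels

def median_smooth(img, k=3):
    """k x k median filter on a 2-D grayscale image (edges are replicated)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    h, w = img.shape
    shifted = [padded[i:i + h, j:j + w] for i in range(k) for j in range(k)]
    return np.median(np.stack(shifted), axis=0)

def looks_adversarial(predict, img, threshold, bits=4):
    """Flag the input if squeezing changes the prediction vector too much."""
    distance = np.abs(predict(img) - predict(reduce_bit_depth(img, bits))).sum()
    return bool(distance > threshold)

# A single bright pixel (a crude stand-in for a localized perturbation).
spike = np.zeros((5, 5))
spike[2, 2] = 1.0
smoothed = median_smooth(spike)

# Hypothetical two-class "model" used only to exercise the detector.
toy_predict = lambda im: np.array([im.mean(), 1.0 - im.mean()])
flag = looks_adversarial(toy_predict, np.zeros((5, 5)), threshold=0.1)
```

In practice the threshold would be tuned on held-out benign samples so that few of them are flagged.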

Adversarial perturbations can also be partially removed via total variance minimization. This method randomly selects a small set of pixels and reconstructs the simplest image that is consistent with the selected pixels.

The reconstructed image tends not to contain the adversarial perturbations because these perturbations tend to be small and localized.

A similar approach is image quilting. It removes adversarial perturbations by building a patch database that only contains patches from benign images. Using k-nearest neighbours, it selects the patches used to create the transformed image, so that the result consists only of pixels that were not modified by the adversary (link).


Defense-GAN is a defence that uses a WGAN trained on legitimate training samples to learn to remove perturbations from adversarial examples. At inference time, it first generates some random samples z_i and finds z* as

z^*=\arg\min_{z_i}||G(z_i)-x||_2^2,

where G is a generator and x is the image to be classified. Finally, G(z*) is given as the input to the classifier.

The equation above is a highly non-convex minimization problem that can be approximated with a fixed number of gradient descent steps. If the generator has been trained appropriately, a clean image and its reconstruction G(z*) should not differ too much (link).

Overview of the Defense-GAN algorithm
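The projection step can be sketched with a toy generator. In the example below a linear map G(z) = Az stands in for the trained WGAN generator, which is purely an assumption to keep the code runnable without a deep learning framework; the function name, the number of restarts and the step size are hypothetical as well.

```python
import numpy as np

def project_onto_generator(x, generate, grad_z, latent_dim, rng,
                           n_restarts=4, n_steps=500, lr=0.01):
    """Run a fixed number of gradient steps on ||G(z) - x||^2 from several
    random starting points and keep the best latent code found."""
    best_z, best_err = None, np.inf
    for _ in range(n_restarts):
        z = rng.standard_normal(latent_dim)
        for _ in range(n_steps):
            z = z - lr * grad_z(z, x)
        err = float(np.sum((generate(z) - x) ** 2))
        if err < best_err:
            best_z, best_err = z, err
    return best_z

# Toy linear "generator" standing in for G (assumption).
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 2))
generate = lambda z: A @ z
grad_z = lambda z, x: 2.0 * A.T @ (A @ z - x)   # gradient of ||A z - x||^2
step = 0.5 / np.linalg.norm(A, 2) ** 2          # safe step size for this toy case

x_clean = generate(rng.standard_normal(2))
delta = 0.1 * rng.standard_normal(8)            # stand-in for a perturbation
z_star = project_onto_generator(x_clean + delta, generate, grad_z, 2, rng, lr=step)
x_rec = generate(z_star)                        # this is what the classifier sees
print(np.linalg.norm(delta), np.linalg.norm(x_rec - x_clean))
```

With a real generator the gradient with respect to z would come from automatic differentiation, and the random restarts matter much more because the problem is non-convex.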

Resizing & Padding

A simple form of input randomization consists of random resizing and random zero-padding of the input images. These techniques make the model more robust to adversarial images and usually don’t hurt the performance on benign images.

They also don’t require much computation and can be combined with other defence methods such as adversarial training (link).

Pipeline of randomization-based defense mechanism
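A minimal sketch of the two randomization layers is given below. It assumes square input images, nearest-neighbour resizing and a fixed output size; these choices, the function name and the sizes are ours and only serve the illustration.

```python
import numpy as np

def random_resize_and_pad(img, out_size, rng):
    """Resize a square image to a random size in [h, out_size] with
    nearest-neighbour sampling, then zero-pad it at a random position."""
    h = img.shape[0]
    new = int(rng.integers(h, out_size + 1))
    idx = (np.arange(new) * h / new).astype(int)   # nearest-neighbour indices
    resized = img[idx][:, idx]
    top = int(rng.integers(0, out_size - new + 1))
    left = int(rng.integers(0, out_size - new + 1))
    out = np.zeros((out_size, out_size) + img.shape[2:], dtype=img.dtype)
    out[top:top + new, left:left + new] = resized
    return out

rng = np.random.default_rng(0)
image = np.ones((8, 8, 3))
transformed = random_resize_and_pad(image, out_size=12, rng=rng)
print(transformed.shape)
```

Because a new size and a new offset are drawn at every call, an attacker cannot know in advance which exact input the classifier will see.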


A model randomization approach known as Stochastic Activation Pruning (SAP), inspired by the dropout technique, consists of stochastically dropping nodes in each layer during forward propagation.

Nodes with a larger magnitude are more likely to be retained, and the surviving nodes are scaled up so as to preserve the dynamic range of the activations in each layer. When most of the nodes are retained, that is, when few activations of the network are pruned, the scaling factor is close to 1 and the network behaves almost identically to the original model.

Conversely, when a huge number of nodes are pruned, the model’s accuracy will drop compared to its original accuracy, but the introduced stochasticity has more chance to deceive the adversary. Unlike dropout, SAP can be applied to pre-trained models without significant loss of accuracy (link).
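The pruning rule can be sketched for a single layer. The sampling scheme below (draw indices with replacement with probability proportional to the activation magnitude, then rescale each survivor by the inverse of its probability of being kept) is written from our understanding of the method and should be read as an assumption; the function name is hypothetical and the activations must not all be zero.

```python
import numpy as np

def stochastic_activation_pruning(h, num_draws, rng):
    """Keep only the activations drawn in `num_draws` samples (with
    replacement, probability proportional to |h|) and rescale them."""
    h = np.asarray(h, dtype=float)
    p = np.abs(h) / np.abs(h).sum()
    drawn = rng.choice(h.size, size=num_draws, replace=True, p=p)
    keep = np.zeros(h.size, dtype=bool)
    keep[drawn] = True
    keep_prob = 1.0 - (1.0 - p) ** num_draws   # chance of surviving the draws
    out = np.zeros_like(h)
    out[keep] = h[keep] / keep_prob[keep]
    return out

rng = np.random.default_rng(0)
activations = np.array([0.1, -2.0, 0.5, 1.4])
pruned = stochastic_activation_pruning(activations, num_draws=3, rng=rng)

# Averaged over many passes, the pruned layer matches the original one.
average = np.mean([stochastic_activation_pruning(activations, 3, rng)
                   for _ in range(5000)], axis=0)
print(pruned, average.round(2))
```

With more draws, more nodes survive and the scaling factors approach 1, which is the regime where the pruned model behaves like the original one.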


The authors of Adv-BNN noticed that blindly adding noise to all the layers is not the optimal way to incorporate randomness. Hence, they opted to learn the best model distribution under adversarial attacks within the framework of Bayesian Neural Networks (BNN).

The only difference between Adv-BNN and standard BNN training is that the expectation is taken over the adversarial examples (x_adv, y) and not over (x, y). Therefore, at each iteration, a randomized PGD attack is first run for a certain number of iterations to find an adversarial example x_adv, and then x_adv is used to maximize the ELBO.

During training, the Bayes by Backprop algorithm and the reparameterization trick are used to update the learnable mean and variance of the Gaussian distribution from which the weights of the BNN are sampled at inference time.

Finally, this method can be combined with adversarial training to further improve its performance (link).
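The quantities involved in one training step can be sketched as follows. The example assumes a linear softmax classifier with a factorized Gaussian posterior parameterized by a mean and a log standard deviation, a zero-mean Gaussian prior, and a batch `x_adv` that has already been produced by an attack; all of these choices and names are illustrative assumptions, and the gradient update itself is left out.

```python
import numpy as np

def sample_weights(mu, log_sigma, rng):
    """Reparameterization trick: w = mu + sigma * eps with eps ~ N(0, I)."""
    return mu + np.exp(log_sigma) * rng.standard_normal(mu.shape)

def kl_to_prior(mu, log_sigma, prior_sigma=1.0):
    """KL divergence between N(mu, sigma^2) and the prior N(0, prior_sigma^2)."""
    sigma2 = np.exp(2.0 * log_sigma)
    return float(np.sum(np.log(prior_sigma) - log_sigma
                        + (sigma2 + mu ** 2) / (2.0 * prior_sigma ** 2) - 0.5))

def elbo_estimate(x_adv, y, mu, log_sigma, rng, n_samples=8):
    """Monte Carlo ELBO on an adversarial batch: E[log p(y | x_adv, w)] - KL."""
    log_lik = 0.0
    for _ in range(n_samples):
        W = sample_weights(mu, log_sigma, rng)
        logits = x_adv @ W.T
        logits = logits - logits.max(axis=1, keepdims=True)
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        log_lik += float(log_probs[np.arange(len(y)), y].sum())
    return log_lik / n_samples - kl_to_prior(mu, log_sigma)

rng = np.random.default_rng(0)
mu = 0.1 * rng.standard_normal((3, 5))   # 3 classes, 5 features
log_sigma = np.full((3, 5), -1.0)
x_adv = rng.standard_normal((4, 5))      # stand-in for PGD examples
y = np.array([0, 2, 1, 0])
elbo = elbo_estimate(x_adv, y, mu, log_sigma, rng)
print(elbo)
```

In an actual implementation an automatic differentiation framework would propagate the gradient of `elbo` back to `mu` and `log_sigma`, and `x_adv` would be recomputed at every iteration.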


Random Self Ensemble (RSE) is a defence method that combines randomness and ensembling. The randomness comes from noise layers, inserted before each convolutional layer, that add random noise to their input vector.

During training, the noise is generated randomly for each stochastic gradient descent update, and the training procedure can be considered as minimizing an upper bound of the loss of the model ensemble.

The ensemble comes from the fact that, during inference, several forward propagation passes are performed, each one yielding different prediction scores because of the noise layers. Note that this is equivalent to an ensemble of an infinite number of noisy models without any additional memory overhead.

Then, during prediction, the outputs of the individual passes are summed and the resulting class is chosen as

p=\sum_{j=1}^n f_{\epsilon_j}(x),

y=\arg \max_k p_k.

Thus, this method brings two main benefits. The first is that the noise perturbs the gradient, which can fool gradient-based attacks. The second is that the ensemble can improve the test accuracy (link).
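The prediction rule p = Σ_j f_{ε_j}(x), y = argmax_k p_k can be sketched with a toy model. Here a single linear layer preceded by one Gaussian noise layer stands in for the network, which is an assumption of this example, as are the function name and the noise level.

```python
import numpy as np

def rse_predict(x, W, b, sigma, n_passes, rng):
    """Sum the outputs of several noisy forward passes and take the argmax."""
    p = np.zeros(W.shape[0])
    for _ in range(n_passes):
        noisy_input = x + rng.normal(scale=sigma, size=x.shape)  # noise layer
        p += W @ noisy_input + b                                 # one pass
    return int(np.argmax(p)), p

rng = np.random.default_rng(0)
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # 3 classes, 2 features
b = np.zeros(3)
x = np.array([0.2, 1.5])
label, scores = rse_predict(x, W, b, sigma=0.3, n_passes=10, rng=rng)
print(label, scores)
```

Each pass uses a fresh noise sample, so the same weights act as a different member of the ensemble every time without storing more than one model.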


To promote ensemble diversity, an adaptive diversity promoting (ADP) regularizer can be used.

It consists of two terms: the logarithm of the ensemble diversity (LED) and the Shannon entropy of the ensemble prediction. The ensemble is trained by minimizing the ensemble cross-entropy loss minus the ADP term.

Its goal is thus to encourage non-maximal predictions of each member in the ensemble to be mutually orthogonal while keeping the maximal prediction consistent with the true label.

The reason for this form of regularization is that a high diversity or inconsistency on the non-maximal predictions can lower the transferability of adversarial examples among the networks, and further lead to better robustness of the ensemble (link).

Predictions when simultaneously training all the members of the ensemble with the ADP regularizer.
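The two terms of the regularizer can be computed on a small example. The sketch below assumes that the ensemble diversity is the determinant of the Gram matrix of the L2-normalized non-maximal predictions, and that the two terms are weighted by α and β; this formulation, the default values of `alpha` and `beta`, and the function names are assumptions of this example rather than statements from the post.

```python
import numpy as np

def ensemble_diversity(preds, y):
    """Determinant of the Gram matrix of the normalized non-maximal
    predictions, i.e. the predictions with the true class y removed."""
    non_max = np.delete(preds, y, axis=1)
    non_max = non_max / np.linalg.norm(non_max, axis=1, keepdims=True)
    return float(np.linalg.det(non_max @ non_max.T))

def adp_regularizer(preds, y, alpha=2.0, beta=0.5):
    """ADP = alpha * entropy(mean prediction) + beta * log(ensemble diversity)."""
    mean_pred = preds.mean(axis=0)
    entropy = float(-np.sum(mean_pred * np.log(mean_pred + 1e-12)))
    log_ed = float(np.log(max(ensemble_diversity(preds, y), 0.0) + 1e-20))
    return alpha * entropy + beta * log_ed

# Two members, three classes, true class 0.
diverse = np.array([[0.6, 0.4, 0.0], [0.6, 0.0, 0.4]])     # orthogonal non-max parts
identical = np.array([[0.6, 0.2, 0.2], [0.6, 0.2, 0.2]])   # no diversity
print(adp_regularizer(diverse, 0), adp_regularizer(identical, 0))
```

Both ensembles agree on the true class and have the same average prediction, but only the first one spreads its remaining probability mass in orthogonal directions, so it receives the larger ADP value. Note that the diversity term can only be non-zero when there are at least as many non-maximal classes as ensemble members.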

Summary table

General overview of the defence methods reviewed in this post and their corresponding categories
