Black-box transfer-based attacks on images

In the previous post we reviewed a series of white-box adversarial attacks, where the adversary has full access to and knowledge of the victim model. In this post we explore the first category of black-box attacks, namely black-box transfer-based attacks. Transfer-based attacks generate adversarial examples against a substitute model, ideally one as similar as possible to the target model, and rely on transferability: examples that fool the substitute have some probability of also fooling the black-box model. More specifically, we are going to learn about the following attacks:

  • MI-FGSM;
  • DI-FGSM;
  • TI-FGSM.
Different types of adversarial attacks

Overview

This post is part of a collection including the following 6 posts:

  1. Threat models;
  2. White-box attacks;
  3. Black-box transfer-based attacks (current post);
  4. Black-box score-based attacks;
  5. Black-box decision-based attacks;
  6. Defence methods.

MI-FGSM

Besides introducing momentum into the iterative updates, MI-FGSM also applies the idea of ensembles to adversarial attacks. Ensemble methods are widely used in research and competitions to improve performance and robustness. The same reasoning applies to attacks: if an example remains adversarial for multiple models, it may capture an intrinsic direction that always fools these models, and it is then more likely to transfer to other models as well, enabling powerful black-box attacks. One way to fuse multiple models is to take a weighted sum of their logit activations:

l(x)=\sum_{k=1}^{K}w_k l_k(x),

\sum_{k=1}^{K}w_k=1,

where l_k(x) are the logits of the k-th model and w_k ≥ 0 is its ensemble weight. The fused logits l(x) are then used to compute the loss function L as

L(x, y)=-1_y \cdot \log(\mathrm{softmax}(l(x))),

where 1_y is the one-hot encoding of the true class y. Because the logits capture the logarithmic relationships between the probability predictions, an ensemble of models fused in logits aggregates the fine-grained outputs of all models, whose vulnerability can then be easily discovered. Finally, it is worth adding that this method was presented at the NIPS 2017 Adversarial Attacks and Defenses Competition, where it won first place in both the non-targeted and the targeted attack tracks. (link)
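The fusion of logits and the resulting loss can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names, the 4-class setting and the example logits and weights are assumptions made here, not taken from the original method's code.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def fuse_logits(logits_per_model, weights):
    """Weighted sum of per-model logits: l(x) = sum_k w_k * l_k(x)."""
    logits = np.asarray(logits_per_model, dtype=float)  # shape (K, num_classes)
    w = np.asarray(weights, dtype=float)                # shape (K,)
    if np.any(w < 0) or not np.isclose(w.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    return np.tensordot(w, logits, axes=1)              # shape (num_classes,)

def ensemble_loss(logits_per_model, weights, true_class):
    """Cross-entropy of the fused logits against the true class y."""
    probs = softmax(fuse_logits(logits_per_model, weights))
    return -np.log(probs[true_class])

# Hypothetical logits from K = 3 substitute models on a 4-class problem.
example_logits = [
    [2.0, 0.5, 0.1, -1.0],
    [1.5, 1.0, 0.0, -0.5],
    [0.5, 2.0, 0.3, 0.0],
]
example_weights = [0.5, 0.3, 0.2]

fused = fuse_logits(example_logits, example_weights)
loss = ensemble_loss(example_logits, example_weights, true_class=0)
```

In an actual attack, the per-model logits would come from forward passes of the substitute networks, and the gradient of this loss with respect to the input would drive the iterative update.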

DI-FGSM

Although iterative methods achieve high success rates on the model they attack, they easily fall into poor local maxima and generate overfitted adversarial examples that rarely transfer to black-box models. Unlike traditional methods, which maximize the loss function directly with respect to the original inputs, the Diverse Inputs Iterative Fast Gradient Sign Method (DI-FGSM), inspired by data augmentation, applies random, differentiable transformations T, such as random resizing or random padding, to the input images with probability p at each iteration, and maximizes the loss function with respect to these transformed inputs. The transformation probability p controls the trade-off between success rates on white-box models and on black-box models, so the method can be adapted to either setting simply by tuning p. When p=1, only transformed inputs are used for the attack; this yields adversarial examples with much higher success rates on black-box models but lower success rates on white-box models, since the attacker never sees the original inputs. This method has been combined with momentum and ensemble networks to further improve its transferability, outperforming MI-FGSM in the NIPS competition by a large margin of 6.6%. (link)
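The random transformation T can be illustrated with a small NumPy sketch. The function name `diverse_input`, the image sizes, the nearest-neighbour resizing and the zero padding are all assumptions chosen for illustration. A real attack would implement the transformation inside an automatic-differentiation framework so that gradients can flow back to the input; this sketch only shows the forward transformation.

```python
import numpy as np

def diverse_input(image, out_size, min_size, p, rng):
    """With probability p, randomly resize `image` and zero-pad it to out_size.

    image: array of shape (H, W, C).
    Returns the original image unchanged with probability 1 - p, otherwise
    an array of shape (out_size, out_size, C).
    """
    if rng.random() >= p:
        return image  # original input is used with probability 1 - p

    h, w = image.shape[:2]
    # Random side length in [min_size, out_size].
    new = int(rng.integers(min_size, out_size + 1))

    # Nearest-neighbour resize to (new, new) by index selection.
    rows = np.arange(new) * h // new
    cols = np.arange(new) * w // new
    resized = image[rows][:, cols]

    # Random placement inside a zero canvas (random padding on each side).
    pad_total = out_size - new
    top = int(rng.integers(0, pad_total + 1))
    left = int(rng.integers(0, pad_total + 1))
    padded = np.zeros((out_size, out_size) + image.shape[2:], dtype=image.dtype)
    padded[top:top + new, left:left + new] = resized
    return padded

# Hypothetical usage: an 8x8 RGB image, always transformed (p = 1).
rng = np.random.default_rng(0)
image = np.ones((8, 8, 3))
transformed = diverse_input(image, out_size=12, min_size=8, p=1.0, rng=rng)
```

Calling the function once per attack iteration gives the model a different view of the current adversarial image each time, which is the source of the "diversity" in the method's name.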

The comparison of success rates using three different attacks. The ground-truth “walking stick” is marked in pink in the top-5 confidence distribution plots. The adversarial examples are crafted on Inception-v3 with maximum perturbation ε=15. From the first row to the third row, we plot the top-5 confidence distributions of clean images, FGSM and I-FGSM, respectively. The fourth row shows the result of the proposed Diverse Inputs Iterative Fast Gradient Sign Method (DI-FGSM), which attacks the white-box model and all black-box models successfully.

TI-FGSM

The resistance of defense models against transferable adversarial examples is largely due to the fact that the defenses base their predictions on different discriminative regions than normally trained models. This phenomenon is caused either by training under different data distributions or by transforming the inputs before classification. To mitigate the effect of these differing discriminative regions and evade the defenses with transferable adversarial examples, a translation-invariant (TI) attack method has been proposed. TI optimizes an adversarial example over a set of translated images as

\arg \max_{x_{adv}}\sum_{i,j}w_{ij}L(T_{i,j}(x_{adv}),y),

\text{s.t. } \|x_{adv}-x\|_{\infty} \leq \epsilon,

where T_{i,j}(x) is the translation operation that shifts image x by i and j pixels along the two dimensions, respectively. To calculate the gradient of the loss function efficiently, it can be assumed that the translation-invariance property of CNNs nearly holds for very small translations. Under this assumption, the translated image T_{i,j}(x') is almost the same as the untranslated image x' as an input to the model, and so are their gradients, which can then be approximated as

\sum_{i,j}w_{ij}\nabla_{x}L(T_{i,j}(x),y)|_{x=x'} \approx \sum_{i,j}w_{ij}T_{-i,-j}(\nabla_{x}L(x,y)|_{x=x'}).

Thus, there is no need to calculate the gradients for all the translated images: only the gradient at the untranslated image x' is computed, and it is then averaged over all its shifted copies using the weights w_{ij}. The weights are designed so that larger shifts receive relatively lower weights, so that the adversarial perturbation still fools the model effectively at the untranslated image. This procedure is equivalent to convolving the gradient with a kernel composed of all the weights w_{ij}, as

W*\nabla_{x}L(x, y),

where W is the kernel matrix with W_{i,j}=w_{-i,-j}. (link)
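The weighted sum of shifted gradients can be sketched directly in NumPy. Everything specific here is an assumption made for illustration: the Gaussian form of the weights (one possible choice in which larger shifts get lower weights), the `radius` and `sigma` parameters, the zero filling at the image border, and the function names. The sketch works on a single-channel gradient.

```python
import numpy as np

def gaussian_kernel(radius, sigma):
    """Weights w_ij for shifts i, j in [-radius, radius], normalized to sum to 1.

    kernel[i + radius, j + radius] holds w_ij; larger shifts get lower weights.
    """
    offsets = np.arange(-radius, radius + 1)
    g = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()

def shift(arr, di, dj):
    """Translate a 2-D array by (di, dj) pixels, filling vacated pixels with zeros."""
    h, w = arr.shape
    out = np.zeros_like(arr)
    dst_r = slice(max(0, di), min(h, h + di))
    src_r = slice(max(0, -di), min(h, h - di))
    dst_c = slice(max(0, dj), min(w, w + dj))
    src_c = slice(max(0, -dj), min(w, w - dj))
    out[dst_r, dst_c] = arr[src_r, src_c]
    return out

def smooth_gradient(grad, kernel):
    """Compute sum_{i,j} w_ij * T_{-i,-j}(grad), i.e. the gradient convolved with W."""
    grad = np.asarray(grad, dtype=float)
    radius = kernel.shape[0] // 2
    out = np.zeros_like(grad)
    for i in range(-radius, radius + 1):
        for j in range(-radius, radius + 1):
            out += kernel[i + radius, j + radius] * shift(grad, -i, -j)
    return out

# Hypothetical usage: a gradient with a single non-zero pixel is spread
# over its neighbourhood according to the kernel weights.
kernel = gaussian_kernel(radius=1, sigma=1.0)
grad = np.zeros((7, 7))
grad[3, 3] = 1.0
smoothed = smooth_gradient(grad, kernel)
```

The smoothed gradient would then replace the raw gradient in the sign-based update of the attack. The explicit double loop mirrors the formula; in practice a library convolution routine would do the same job more efficiently.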

The adversarial examples generated by the fast gradient sign method (FGSM) and the proposed translation-invariant FGSM (TI-FGSM) for the Inception v3 model
