In the previous post, we reviewed a series of white-box adversarial attacks where the adversary has full access to and knowledge of the victim model. In this post, we are going to explore the first category of black-box attacks, namely, black-box transfer-based attacks.
Transfer-based attacks generate adversarial examples against a substitute model, ideally one that is as similar as possible to the target model. Thanks to transferability, these examples have some probability of fooling the black-box model as well. More specifically, we are going to learn about the following attacks:
This post is part of a collection including the following 6 posts:
- Threat models;
- White-box attacks;
- Black-box transfer-based attacks (current post);
- Black-box score-based attacks;
- Black-box decision-based attacks;
- Defence methods.
The MI-FGSM paper, besides introducing momentum to enhance the iterative attack, also proposed applying the idea of ensembles to adversarial attacks. It is well known in the literature that ensemble methods are widely adopted in research and competitions to enhance performance and improve robustness.
In fact, if an example remains adversarial for multiple models, it may capture an intrinsic direction that always fools these models and is more likely to transfer to other models at the same time, thus enabling powerful black-box attacks.
One method to fuse the outputs of K models is to take a weighted sum of their logits:

$$l(x) = \sum_{k=1}^{K} w_k \, l_k(x)$$

where $l_k(x)$ are the logits of the k-th model, and $w_k$ is the ensemble weight with $w_k \ge 0$ and $\sum_{k=1}^{K} w_k = 1$. The fused logits $l(x)$ are then used to compute the loss function $L$ as

$$L(x, y) = -\mathbf{1}_y \cdot \log\big(\mathrm{softmax}(l(x))\big)$$

where $\mathbf{1}_y$ is the one-hot encoding of the true class y. Because the logits capture the logarithmic relationships between the probability predictions, an ensemble of models fused by logits aggregates the finely detailed outputs of all models, whose vulnerability can then be easily discovered.
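The logit fusion and loss described above can be sketched in a few lines of NumPy. This is only an illustration: the function names `fuse_logits` and `softmax_cross_entropy`, the three toy "models" and their logit values are made up for the example, and equal weights are an assumption rather than a recommendation from the paper.

```python
import numpy as np

def fuse_logits(logits_per_model, weights):
    """Weighted sum of per-model logits: l(x) = sum_k w_k * l_k(x)."""
    logits = np.asarray(logits_per_model, dtype=float)  # shape (K, num_classes)
    w = np.asarray(weights, dtype=float)                # shape (K,)
    if np.any(w < 0):
        raise ValueError("ensemble weights must be non-negative")
    return np.tensordot(w, logits, axes=1)              # shape (num_classes,)

def softmax_cross_entropy(logits, true_class):
    """L = -1_y . log(softmax(l(x))), computed in a numerically stable way."""
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[true_class]

# Toy example (hypothetical values): three models, four classes.
logits_per_model = [
    [2.0, 0.5, -1.0, 0.0],
    [1.5, 1.0, -0.5, 0.2],
    [0.5, 2.0, 0.0, -1.0],
]
weights = [1 / 3, 1 / 3, 1 / 3]  # assumed equal weights

fused = fuse_logits(logits_per_model, weights)
loss = softmax_cross_entropy(fused, true_class=0)
print("fused logits:", fused)
print("ensemble loss:", loss)
```

In an actual attack, the gradient of this single loss with respect to the input would drive the update, so one step pushes against all ensemble members at once.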
Finally, it's interesting to add that this method won first place in both the non-targeted and targeted attack tracks of the NIPS 2017 Adversarial Attacks and Defenses Competition (link).
Although iterative methods achieve high success rates on the attacked model, they easily fall into poor local maxima and generate overfitted adversarial examples that rarely transfer to black-box models.
Unlike traditional methods, which maximize the loss function directly with respect to the original inputs, the Diverse Inputs Iterative Fast Gradient Sign Method (DI-FGSM) is inspired by data augmentation. At each iteration it applies random, differentiable transformations T, such as random resizing or random padding, to the input images with probability p, and maximizes the loss function with respect to these transformed inputs.
The transformation probability p controls the trade-off between success rates on white-box models and success rates on black-box models, so the method can be adapted to either setting simply by tuning p.
When p=0, no transformation is applied and the method reduces to the plain iterative attack. When p=1, only transformed inputs are used for the attack, which generates adversarial examples with much higher success rates on black-box models but lower success rates on white-box models, since the attacker never sees the original inputs.
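As a rough illustration of the random transformation step, the sketch below resizes an image to a random size and pads it with zeros at a random position, with probability p. Everything here is an assumption made for the example: the function name `diverse_input`, the tiny image sizes, and the nearest-neighbour resize, which is only a stand-in for the differentiable resize a deep learning framework would provide.

```python
import numpy as np

def diverse_input(image, p, out_size, rng):
    """With probability p, randomly resize and zero-pad `image`.

    image: array of shape (H, W, C) with H == W < out_size.
    Returns an (out_size, out_size, C) array when the transformation is
    applied, and the unchanged image otherwise (probability 1 - p).
    """
    if rng.random() >= p:
        return image
    h = image.shape[0]
    # Random target side length in [h, out_size).
    new_size = int(rng.integers(h, out_size))
    # Nearest-neighbour resize (stand-in for a differentiable resize).
    idx = np.arange(new_size) * h // new_size
    resized = image[idx][:, idx]
    # Random zero padding so that the result is out_size x out_size.
    pad_total = out_size - new_size
    top = int(rng.integers(0, pad_total + 1))
    left = int(rng.integers(0, pad_total + 1))
    return np.pad(
        resized,
        ((top, pad_total - top), (left, pad_total - left), (0, 0)),
    )

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))                                   # toy 8x8 RGB image
always = diverse_input(image, p=1.0, out_size=12, rng=rng)     # always transformed
never = diverse_input(image, p=0.0, out_size=12, rng=rng)      # never transformed
print(always.shape, never.shape)
```

Note that the transformed and untransformed outputs have different spatial sizes, so this sketch assumes the attacked model accepts variable-size inputs. In a real attack the transformation would sit inside the computation graph so that gradients flow back to the untransformed image.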
This method has been combined with momentum and ensembles of networks to further improve its transferability, outperforming MI-FGSM in the NIPS competition by a large margin of 6.6% (link).
Defense models resist transferable adversarial examples largely because they base their predictions on different discriminative regions than normally trained models do. This difference is caused by either training under a different data distribution or transforming the inputs before classification.
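To mitigate the effect of different discriminative regions between models and to evade the defenses with transferable adversarial examples, a translation-invariant (TI) attack method has been proposed. TI uses a set of translated images to optimize an adversarial example as

$$\arg\max_{x'} \sum_{i,j} w_{ij} \, L\big(T_{ij}(x'), y\big) \quad \text{s.t.} \quad \|x' - x\|_\infty \le \epsilon$$

where x is the original image, x' is the adversarial example, $\epsilon$ is the perturbation budget, $w_{ij}$ are the weights of the translated images, and $T_{ij}(\cdot)$ is the translation operation that shifts an image by i and j pixels along the two dimensions respectively, with $i, j \in \{-k, \dots, k\}$ for a maximal shift of k pixels. To calculate the gradient of this objective efficiently, it can be assumed that the translation-invariant property of CNNs nearly holds for very small translations.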
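Based on this assumption, the translated image $T_{ij}(x')$ is almost the same as the untranslated image x' as an input to the model, and so are their gradients. The gradient of the objective can then be approximated as

$$\nabla_x \Big( \sum_{i,j} w_{ij} \, L\big(T_{ij}(x), y\big) \Big) \Big|_{x=x'} \approx \sum_{i,j} w_{ij} \, T_{-i-j}\big( \nabla_x L(x, y) \big|_{x=x'} \big)$$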
Thus, there is no need to calculate the gradients for all the translated images: only the gradient at the untranslated image x' is computed, and its shifted copies are then averaged with the weights $w_{ij}$. The weights are designed so that images with bigger shifts receive relatively lower weights, which keeps the adversarial perturbation effective at fooling the model on the untranslated image.
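This procedure is also equivalent to convolving the gradient with a kernel composed of all the weights $w_{ij}$:

$$\sum_{i,j} w_{ij} \, T_{-i-j}\big( \nabla_x L(x, y) \big|_{x=x'} \big) \Leftrightarrow W * \nabla_x L(x, y) \big|_{x=x'}$$

where W is the kernel matrix of size $(2k+1) \times (2k+1)$, with $W_{i,j} = w_{-i-j}$ and k the largest shift considered in each direction (link).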
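The equivalence between the weighted sum of shifted gradients and a convolution can be checked numerically. The sketch below does so on a random single-channel "gradient". It rests on several assumptions made for the example: the helper names are hypothetical, Gaussian weights are just one choice that gives larger shifts smaller weights, and pixels shifted in from outside the image are treated as zeros.

```python
import numpy as np

def gaussian_kernel(k, sigma):
    """(2k+1) x (2k+1) weights w_ij; larger shifts get smaller weights."""
    offsets = np.arange(-k, k + 1)
    ii, jj = np.meshgrid(offsets, offsets, indexing="ij")
    w = np.exp(-(ii**2 + jj**2) / (2.0 * sigma**2))
    return w / w.sum()

def weighted_shifted_sum(grad, weights):
    """sum_{i,j} w_ij * T_{-i-j}(grad), with zeros shifted in at the borders."""
    k = weights.shape[0] // 2
    h, w = grad.shape
    padded = np.pad(grad, k)
    total = np.zeros_like(grad)
    for i in range(-k, k + 1):
        for j in range(-k, k + 1):
            # T_{-i-j}(grad)[a, b] = grad[a + i, b + j]
            shifted = padded[k + i:k + i + h, k + j:k + j + w]
            total += weights[i + k, j + k] * shifted
    return total

def conv2d_same(grad, kernel):
    """True 2-D convolution (flipped kernel), zero padding, same output size."""
    k = kernel.shape[0] // 2
    h, w = grad.shape
    padded = np.pad(grad, k)
    flipped = kernel[::-1, ::-1]
    out = np.zeros_like(grad)
    for a in range(h):
        for b in range(w):
            window = padded[a:a + 2 * k + 1, b:b + 2 * k + 1]
            out[a, b] = np.sum(flipped * window)
    return out

rng = np.random.default_rng(0)
grad = rng.standard_normal((6, 6))       # stand-in for the loss gradient at x'
weights = gaussian_kernel(k=1, sigma=1.0)
kernel_W = weights[::-1, ::-1]           # W_{i,j} = w_{-i-j}

direct = weighted_shifted_sum(grad, weights)
via_conv = conv2d_same(grad, kernel_W)
print(np.allclose(direct, via_conv))
```

In practice the explicit loops would be replaced by a framework's convolution applied to each channel of the gradient, and the smoothed gradient would then feed the usual sign or momentum update.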