Black-box score-based attacks on images

In the previous post, we reviewed a series of black-box transfer-based adversarial attacks where the adversary has to generate adversarial examples against a substitute model. In this post, we are going to explore the second category of black-box attacks, namely, black-box score-based attacks.

Under this setting, it is not possible to access the target model’s gradients. The only knowledge about the attacked model is the input images and the corresponding output confidence scores. Nevertheless, the gradient can still be estimated by querying the target model and retrieving the labels’ confidence scores.

More specifically, we are going to learn about the following attacks:

  • ZOO;
  • NES;
  • SPSA;
  • P-RGF;
  • N-ATTACK;
  • TREMBA.
Different types of adversarial attacks

Overview

This post is part of a collection including the following 6 posts:

  1. Threat models;
  2. White-box attacks;
  3. Black-box transfer-based attacks;
  4. Black-box score-based attacks (current post);
  5. Black-box decision-based attacks;
  6. Defence methods.

ZOO

The Zeroth Order Optimization (ZOO) method, inspired by the formulation of the C&W attack, uses a symmetric difference quotient to estimate the gradient with respect to the input x as

g_i=\frac{\partial f(x)}{\partial x_i} \approx \frac{f(x+he_i)-f(x-he_i)}{2h}

where h is a small constant and e_i is a one-hot vector with 1 at the i-th position. Given x ∈ R^p, 2p queries are required to estimate the gradient for all p coordinates, which would be infeasible for large images.
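A minimal sketch of this coordinate-wise estimator is shown below, assuming the attack loss f is a Python function built from the target model’s confidence scores (the function name, the flattening of x, and the step size h are illustrative choices):

```python
import numpy as np

def zoo_coordinate_gradient(f, x, i, h=1e-4):
    """Symmetric difference quotient estimate of df/dx_i (two queries to f).

    f : black-box attack loss computed from the target model's scores
    x : current input, flattened to shape (p,)
    i : coordinate whose partial derivative is estimated
    h : small finite-difference step
    """
    e_i = np.zeros_like(x)
    e_i[i] = 1.0
    return (f(x + h * e_i) - f(x - h * e_i)) / (2.0 * h)

# Estimating the full gradient this way costs 2 * x.size queries:
# grad = np.array([zoo_coordinate_gradient(f, x, i) for i in range(x.size)])
```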

To reduce the attack space, a function D can be used to reduce the permissible dimension of the adversarial noise from p to m, with m < p, and then upscale it to a size-p image with an operator such as bilinear interpolation.

This method can be further improved with a hierarchical attack scheme, in which a series of transformations D1, D2, … with dimensions m1, m2, … is used to gradually increase m over the iterations of the optimization process, as sketched below.
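A rough sketch of the upscaling operator D; the use of SciPy’s bilinear zoom and the specific sizes are illustrative assumptions, not necessarily the paper’s exact implementation:

```python
import numpy as np
from scipy.ndimage import zoom

def upscale_noise(delta_small, target_hw):
    """Bilinearly upscale a low-dimensional noise image to the attacked size.

    delta_small : adversarial noise of shape (m_h, m_w, c), with m < p
    target_hw   : (p_h, p_w) spatial size of the attacked image
    """
    m_h, m_w, _ = delta_small.shape
    factors = (target_hw[0] / m_h, target_hw[1] / m_w, 1)
    return zoom(delta_small, factors, order=1)  # order=1 -> bilinear interpolation

# Hierarchical scheme (illustrative): optimize the noise at 32x32 first,
# then start the 64x64 stage from upscale_noise(delta_32, (64, 64)), and so on.
```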

As for the optimization algorithm, ZOO typically uses Adam combined with stochastic coordinate descent and importance sampling, so that coordinate updates are applied preferentially to the most important pixels, inferred from the current adversarial noise. (link)
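A minimal sketch of such a coordinate-wise Adam update; the importance-sampling step that selects the coordinate i is omitted, and all names and hyperparameters are illustrative:

```python
import numpy as np

def zoo_adam_coordinate_update(x, i, g_i, state, eta=0.01,
                               beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply an Adam-style update to a single coordinate i of x.

    state holds per-coordinate moment estimates M, V and step counters T
    (arrays with the same shape as x), so each pixel keeps its own history.
    """
    M, V, T = state["M"], state["V"], state["T"]
    T[i] += 1
    M[i] = beta1 * M[i] + (1 - beta1) * g_i
    V[i] = beta2 * V[i] + (1 - beta2) * g_i ** 2
    m_hat = M[i] / (1 - beta1 ** T[i])
    v_hat = V[i] / (1 - beta2 ** T[i])
    x[i] -= eta * m_hat / (np.sqrt(v_hat) + eps)
    return x
```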

ZOO black-box untargeted attack examples

NES

The Natural Evolution Strategies (NES) attack is a black-box gradient estimation technique that employs PGD with the estimated gradient to construct adversarial examples.

Rather than maximizing an objective function directly as in ZOO, NES maximizes the expected value of the loss function L under the search distribution π(x_{t+1}|x_t) = N(x_t, σ²), i.e. x_{t+1} = x_t + σδ with δ ∼ N(0, I), where σ defines the search variance. The resulting gradient estimate is

\nabla \mathbb{E}[L(x_{t+1})] \approx \frac{1}{\sigma n} \sum_{i=1}^{n}\delta_i L(x_{t}+ \sigma \delta_i).

Thus, the expected gradient is obtained by sampling n points from Gaussian noise around the current image x_t. Then, at each iteration, PGD is performed using the sign of the estimated gradient to update the current iterate as

g_t=\frac{1}{n}\sum_{i=1}^{n}L(x_t+\sigma \delta_i)\nabla \log N(\delta_i|x_t,\sigma^2),

x_{t+1}=\Pi_S(x_t-\eta \cdot sign(g_t)),

where ΠS represents a projection onto the lp-ball S. This method has been designed specifically for query-limited settings, where the attacker has a query budget T and aims to cause targeted misclassification within T queries. In this case, n queries are used to estimate each gradient, allowing T/n steps of PGD. (link)
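Putting the two formulas together, a minimal sketch of one NES attack iteration could look as follows; the loss function, the l∞ projection, and the hyperparameters are illustrative assumptions:

```python
import numpy as np

def nes_gradient(loss, x, sigma=0.001, n=50):
    """Estimate the gradient of E[L] by sampling n Gaussian directions
    around x; each sample costs one query to the target model."""
    grad = np.zeros_like(x)
    for _ in range(n):
        delta = np.random.randn(*x.shape)
        grad += delta * loss(x + sigma * delta)
    return grad / (sigma * n)

def nes_pgd_step(x, x_orig, grad, eta=0.01, eps=0.05):
    """One signed-gradient step, then projection onto the l_inf-ball S
    of radius eps around the original image."""
    x = x - eta * np.sign(grad)
    x = np.clip(x, x_orig - eps, x_orig + eps)   # projection Pi_S
    return np.clip(x, 0.0, 1.0)                  # keep a valid pixel range
```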

Targeted adversarial examples for the InceptionV3 network: the top row contains unperturbed images, and the bottom row contains the corresponding adversarial examples (with randomly chosen target classes)

SPSA

SPSA is another iterative black-box gradient estimation method based on the C&W optimization formulation. Its main difference from other gradient estimation methods is that the gradients are estimated with the SPSA (Simultaneous Perturbation Stochastic Approximation) optimization algorithm, which is well suited to high-dimensional optimization problems, even in the case of noisy objectives.

The noise δ is not sampled from a Gaussian as in NES but from a Rademacher distribution (each entry is ±1 with equal probability), as required for the SPSA estimator to work. Comparing SPSA with NES and ZOO, the authors found that the specific choice of gradient-free optimizer is relatively unimportant and that all the methods produce reliable adversarial examples.
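A minimal sketch of the SPSA gradient estimator, assuming a black-box loss built from the model’s confidence scores; names and hyperparameters are illustrative:

```python
import numpy as np

def spsa_gradient(loss, x, c=0.01, n=32):
    """SPSA gradient estimate: two queries per Rademacher perturbation,
    averaged over n perturbations."""
    grad = np.zeros_like(x)
    for _ in range(n):
        # Rademacher perturbation: each entry is +1 or -1 with equal probability
        v = np.random.choice([-1.0, 1.0], size=x.shape)
        g = (loss(x + c * v) - loss(x - c * v)) / (2.0 * c)
        grad += g * v    # with +-1 entries, 1 / v_i equals v_i
    return grad / n
```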

In fact, with sufficient computation, both NES and SPSA decrease the accuracy to 0% on ImageNet (link).

Detecting adversarial examples with a generative model: While typical adversarial examples lie in low probability regions of the input space, an adversary with access to the density model can construct non-detectable adversarial examples, which are both misclassified and assigned higher likelihood than their natural counterparts

P-RGF

The authors of the P-RGF method argued that transfer-based black-box attacks often do not perform well because the gradient of the surrogate model points to a non-adversarial region of the target model, while pure score-based attacks require a tremendous number of queries to perform well.

They therefore developed a new method, Prior-guided Random Gradient-Free (P-RGF), which improves black-box adversarial attacks by taking advantage of both a transfer-based prior, given by the gradient of a surrogate model, and the query information used to further refine it.

Besides the transfer-based prior, the algorithm can also incorporate a data-dependent prior to reduce query complexity. Compared with NES and SPSA, P-RGF achieves much higher attack success rates, with improvements ranging between 20% and 40% in many cases.

Since the math behind the P-RGF method is quite involved, its details are not reported here; they can be found in the original paper (link).

Average number of queries per successful image at any desired success rate

N-ATTACK

The authors of N-ATTACK observed that gradient-estimation-based methods often do not perform well because some neural networks are not smooth, so the estimated gradient is not reliable enough.

Thus, instead of estimating the gradient, N-ATTACK estimates a probability density distribution defined over the lp-ball S centered around an input x to the neural network. Assuming the distribution is such that the loss is small, a sample drawn from it is likely to be adversarial. The parameters of the distribution are updated as

\mu_{t+1}=\mu_t-\frac{\eta}{b}\sum_{i=1}^{b}L(\Pi_S(g(z_i)))\nabla_{\mu} \log N(z_i|\mu_t,\sigma^2).

In the above formula, z ∼ N(μ, σ²) is a normal distribution whose mean μ has to be learned. The function g(z) is computed as g(z) = 1/2(tanh(g0(z)) + 1), where g0(z) maps the sample z to the same dimension as x (e.g. z lies in the space of CIFAR-10-sized images and g0(z) upscales it to ImageNet size with bilinear interpolation).

We also define the perturbation δ = clip_p(g(z) − x), so that the projection becomes ΠS(g(z)) = x + δ, where ΠS is the projection operator onto S. Finally, b is the number of samples per iteration and the loss function L is the C&W loss.
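A rough sketch of one N-ATTACK update of μ, assuming for simplicity that z already has the same shape as x (so the upscaling g0 is the identity) and that S is an l∞-ball; all names and hyperparameters are illustrative:

```python
import numpy as np

def nattack_step(loss, x, mu, sigma=0.1, b=100, eta=0.02, eps=0.05):
    """One update of the distribution mean mu (loss is the C&W-style
    black-box loss computed from the target model's scores)."""
    grad_mu = np.zeros_like(mu)
    for _ in range(b):
        z = mu + sigma * np.random.randn(*mu.shape)   # z ~ N(mu, sigma^2)
        g_z = 0.5 * (np.tanh(z) + 1.0)                # map the sample into [0, 1]
        delta = np.clip(g_z - x, -eps, eps)           # clip_p for an l_inf-ball
        x_adv = np.clip(x + delta, 0.0, 1.0)          # Pi_S(g(z)) = x + delta
        # gradient of log N(z | mu, sigma^2) w.r.t. mu is (z - mu) / sigma^2
        grad_mu += loss(x_adv) * (z - mu) / sigma ** 2
    return mu - (eta / b) * grad_mu
```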

The initialization of the distribution parameters plays a key role in the run time of the algorithm; therefore, a regression network is also trained that takes a sample x as input and outputs the initial mean μ0 for N-ATTACK.

The advantage of this method is that, once the distribution has been learned, the model no longer needs to be queried to attack it: a virtually unlimited number of adversarial examples can be drawn from the learned distribution (link).

TREMBA

The TRansferable EMbedding-based Black-box Attack (TREMBA) learns a low-dimensional embedding using a pre-trained source model Fs, and then performs an efficient search within the embedding space to attack an unknown target network Ft.

The first phase consists of training an encoder-decoder G that can effectively generate adversarial perturbations for the source network from a low-dimensional embedding space. The encoder E takes an input x and outputs a latent vector z = E(x), with dim(z) < dim(x).

The decoder D takes z as input and outputs an adversarial perturbation δ = ε tanh(D(z)), with dim(δ) = dim(x). G is trained so that its generated perturbations δ fool the source network Fs, minimizing the hinge loss used in the C&W attack.

In the second phase, NES is used to attack Ft. However, instead of searching over the input space, TREMBA searches over the embedding space. Given an input x with label y and a starting point z0 = E(x), the embedding is updated as

z_{t+1}=\Pi_S\Big(z_t-\eta \cdot\frac{1}{n}\sum_{i=1}^{n}L\big(x+\epsilon \tanh(D(\nu_i))\big)\nabla_{z_t} \log N(\nu_i|z_t,\sigma^2)\Big),

where ν_i is sampled from the Gaussian distribution N(z_t, σ²). Note that the projection does not need to be performed explicitly, since δ already satisfies ||δ|| < ε by design.
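A minimal sketch of one TREMBA update in the embedding space; the decoder D is assumed to be a callable returning a perturbation of the same shape as x, and the names and hyperparameters are illustrative:

```python
import numpy as np

def tremba_step(loss, decoder, x, z_t, eps=0.03, sigma=0.1, n=20, eta=0.05):
    """One NES update of the embedding z_t against the target network
    (loss is the black-box loss built from the target model's scores)."""
    grad_z = np.zeros_like(z_t)
    for _ in range(n):
        nu = z_t + sigma * np.random.randn(*z_t.shape)   # nu ~ N(z_t, sigma^2)
        delta = eps * np.tanh(decoder(nu))               # ||delta||_inf < eps by design
        x_adv = np.clip(x + delta, 0.0, 1.0)
        # gradient of log N(nu | z_t, sigma^2) w.r.t. z_t is (nu - z_t) / sigma^2
        grad_z += loss(x_adv) * (nu - z_t) / sigma ** 2
    return z_t - eta * grad_z / n
```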

The benefit TREMBA introduces is that searching over z amounts to searching for adversarial examples in a lower-dimensional space where adversarial perturbations are less concentrated, so adversarial patterns are easier to find (link).

The success rate of black-box adversarial targeted attacks at different query levels for ImageNet models

Continue to the 5th part: Black-box decision-based attacks

