Adversarial Examples in the Physical World

Abstract

Most existing machine learning classifiers are highly vulnerable to adversarial examples. An adversarial example is a sample of input data that has been modified very slightly in a way intended to cause a machine learning classifier to misclassify it. In many cases, these modifications can be so subtle that a human observer does not notice them at all, yet the classifier still makes a mistake. Adversarial examples pose security concerns because they can be used to attack machine learning systems even when the adversary has no access to the underlying model. Up to now, all previous work has assumed a threat model in which the adversary can feed data directly into the machine learning classifier. This is not always the case for systems operating in the physical world, for example those that perceive their input through cameras and other sensors. This paper shows that even in such physical-world scenarios, machine learning systems are vulnerable to adversarial examples. We demonstrate this by feeding adversarial images obtained with a cellphone camera into an ImageNet Inception classifier and measuring the classification accuracy of the system. We find that a large fraction of adversarial examples are misclassified even when perceived through the camera.

1. Introduction

Recent advances in machine learning and deep neural networks have enabled researchers to address several important practical problems in image, video, text classification, etc. (Krizhevsky et al., 2012; Hinton et al., 2012; Bahdanau et al., 2015).
However, machine learning models are often susceptible to adversarial manipulation of their inputs, leading to incorrect classifications (Dalvi et al., 2004). In particular, neural networks and many other types of machine learning models are extremely vulnerable at test time to attacks based on small modifications to the model inputs (Biggio et al., 2013; Szegedy et al., 2014; Goodfellow et al., 2014; Papernot et al., 2016b).
The problem can be summarized as follows. Suppose there is a machine learning system M and an input sample C, which we call a clean example. Assume that sample C is correctly classified by the machine learning system, i.e. M(C) = y_true. It is possible to construct an adversarial example A that is perceptually indistinguishable from C but is misclassified, i.e. M(A) != y_true. Such adversarial examples are misclassified far more often than examples that have been perturbed by random noise, even when the magnitude of the noise is much larger than the magnitude of the adversarial perturbation (Szegedy et al., 2014).
Adversarial examples pose potential security threats to practical machine learning applications. In particular, Szegedy et al. (2014) show that adversarial examples designed to be misclassified by model M1 are often also misclassified by model M2. This adversarial example transferability property means that it is possible to generate adversarial examples and perform misclassification attacks on machine learning systems without access to the underlying model. Papernot et al. (2016a) and Papernot et al. (2016b) demonstrate such attacks in real-world scenarios.
However, all previous work on adversarial examples of neural networks has used a threat model in which an attacker can provide input directly to a machine learning model. Prior to this work, it was not known whether adversarial examples would continue to be misclassified if constructed in the physical world and viewed through a camera.
This threat model can describe some scenarios in which attacks take place entirely within a computer, such as evading spam filters or malware detectors (Biggio et al., 2013; Nelson et al.). However, many practical machine learning systems operate in the physical world. Possible examples include, but are not limited to: robots perceiving the world through cameras and other sensors, video surveillance systems, and mobile applications for image or sound classification. In such cases, the adversary cannot rely on the ability to make fine-grained per-pixel modifications to the input data. This raises the following question: is it still possible to craft adversarial examples and perform adversarial attacks on machine learning systems that operate in the physical world and perceive data through various sensors rather than through a digital representation?
Some prior work has considered the problem of physical attacks against machine learning systems, but not in the sense of fooling neural networks with very small perturbations of the input. For example, Carlini et al. (2016) demonstrated an attack that creates audio inputs that mobile phones recognize as containing intelligible voice commands while humans hear only unintelligible speech. Face recognition systems based on photos are vulnerable to replay attacks, in which previously captured images of an authorized user's face are presented to the camera instead of the actual face (Smith et al., 2015). Adversarial examples could in principle be applied in either of these physical domains. An adversarial example in the voice command domain would consist of a recording (such as a song) that seems innocuous to a human observer but contains voice commands recognized by the machine learning algorithm. An adversarial example in the face recognition domain might consist of very subtle markings applied to a person's face, so that a human observer would still recognize their identity correctly, but the machine learning system would identify them as a different person. The work most similar to this paper is Sharif et al. (2016), which appeared publicly after our work but was submitted to a conference earlier. Sharif et al. (2016) also print images of adversarial examples on paper and demonstrate that the printed images fool an image recognition system when photographed. The main differences between their work and ours are: (1) we use a cheap closed-form attack for most of our experiments, while Sharif et al. (2016) use a more expensive attack based on an optimization algorithm; (2) we make no particular effort to modify our adversarial examples to improve their chances of surviving the printing and photographing process, and simply make the scientific observation that many adversarial examples survive this process without any intervention, whereas Sharif et al. (2016) introduce additional features to make their attacks work as well as possible as practical attacks against a face recognition system; (3) Sharif et al. (2016) are restricted in which pixels they can modify (only those on the spectacle frames) but can modify those pixels by a large amount, whereas we are restricted in how much we can modify each pixel but are free to modify all of them.
To investigate the extent to which adversarial examples survive in the physical world, we conducted an experiment with a pre-trained ImageNet Inception classifier (Szegedy et al., 2015). We generated adversarial examples for this model, then fed these examples into the classifier through a mobile phone camera and measured the classification accuracy. This scenario is a simple physical-world system that perceives data through a camera and then runs image classification. We find that a large fraction of the adversarial examples generated for the original model are still misclassified even when perceived through the camera.
Surprisingly, our attack method required no modification to account for the presence of the camera: the simplest attack, using adversarial examples crafted for the Inception model, transferred successfully to the combination of the camera and the Inception model. Our results therefore provide a lower bound on the attack success rate that could be achieved with more specialized attacks that explicitly model the camera while crafting adversarial examples.
A limitation of our results is that we assume a threat model in which the attacker has full knowledge of the model architecture and parameter values. This is mainly so that we can use a single Inception v3 model for all experiments without having to design and train different high-performing models. The adversarial example transfer property means that our results can likely be extended to scenarios where the attacker does not have access to the model description (Szegedy et al., 2014; Goodfellow et al., 2014; Papernot et al., 2016b). While we have not run detailed experiments to study the transferability of physical adversarial examples, we were able to build a simple mobile phone application to demonstrate a potential black-box adversarial attack in the physical world, see Figure 1.
Figure 1: Demonstration of a black-box attack (the attack was constructed without access to the model) on a mobile phone application for image classification using physical adversarial examples. We took a clean image from the dataset (a) and used it to generate adversarial images with adversarial perturbations of various sizes. We then printed the clean and adversarial images and used the TensorFlow camera demo application to classify them. When perceived through the camera, the clean image (b) is correctly classified as a "washer", while the adversarial images (c) and (d) are misclassified. The full demo video is available at https://youtu.be/zQ_uMenoBCk.

To better understand how camera-induced non-trivial image transformations affect the transferability of adversarial examples, we conduct an additional set of experiments investigating how adversarial examples transfer across several specific types of synthetic image transformations.
The rest of the paper is structured as follows: In Section 2, we review different approaches to generating adversarial examples. Section 3 goes on to detail our "physical world" experimental setup and results. Finally, Section 4 describes our experiments on various artificial image transformations (e.g. changing brightness, contrast, etc.) and how they affect adversarial examples.

2. Method of Generating Adversarial Images

This section describes the different methods for generating adversarial examples that we use in our experiments. It is important to note that none of the described methods guarantees that the resulting images will be misclassified. However, we refer to all generated images as "adversarial images".
In the rest of this article, we use the following notation:
  • X — an image, typically a 3-D tensor (width × height × depth). In this paper we assume that pixel values are integers in the range [0, 255].
  • y_true — the true class of the image X.
  • J(X, y) — the cross-entropy cost function of the neural network, given an image X and class y. We intentionally omit the network weights (and other parameters) θ from the cost function, because we assume they are fixed in this paper (to the values obtained from training the machine learning model). For a neural network with a softmax output layer, the cross-entropy cost applied to integer class labels equals the negative log-probability of the true class given the image: J(X, y) = −log p(y|X); this relation will be used below.
  • Clip_{X,ε}{X'} — a function that performs per-pixel clipping of the image X', so that the result lies in the L∞ ε-neighborhood of the source image X. The exact clipping equation is:
    Clip_{X,ε}{X'}(x, y, z) = min{ 255, X(x, y, z) + ε, max{ 0, X(x, y, z) − ε, X'(x, y, z) } }
    where X(x, y, z) is the value of channel z of the image X at coordinates (x, y). (A code sketch of this function is given below the list.)
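For concreteness, here is a minimal NumPy sketch of the clipping function defined above; the function name and array conventions are ours, not from the authors' code:

```python
import numpy as np

def clip_eps(x_adv, x, eps):
    """Per-pixel clipping Clip_{X,eps}{X'}: keeps X' inside the L-infinity
    eps-neighborhood of the source image X and inside the valid pixel
    range [0, 255]. Equivalent to min{255, X+eps, max{0, X-eps, X'}}."""
    x = np.asarray(x, dtype=np.float32)
    x_adv = np.asarray(x_adv, dtype=np.float32)
    lower = np.maximum(x - eps, 0.0)
    upper = np.minimum(x + eps, 255.0)
    return np.clip(x_adv, lower, upper)
```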

2.1 Fast method

One of the simplest methods of generating adversarial images, described in (Goodfellow et al., 2014), is motivated by linearizing the cost function and solving for the perturbation that maximizes the cost subject to an L∞ constraint. This can be done in closed form with a single call to back-propagation:
X^adv = X + ε sign(∇_X J(X, y_true))
where ε is a hyper-parameter to be chosen.
In this paper, we refer to this method as the "fast method", since it does not require an iterative process to compute adversarial examples and thus is much faster than other considered methods.
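As an illustrative sketch (not the authors' released code), a single fast-method step can be written with TensorFlow's automatic differentiation. We assume `model` is a Keras model returning logits for batched images scaled to [0, 255], and `y_true` is an array of integer class labels:

```python
import tensorflow as tf

def fgsm(model, x, y_true, eps):
    """'Fast' method: X_adv = X + eps * sign(grad_X J(X, y_true))."""
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = tf.keras.losses.sparse_categorical_crossentropy(
            y_true, model(x), from_logits=True)
    grad = tape.gradient(loss, x)
    # Keep the result in the valid pixel range.
    return tf.clip_by_value(x + eps * tf.sign(grad), 0.0, 255.0)
```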

2.2 Basic iterative method

We introduce a straightforward way to extend the "fast" method: we apply it multiple times with a small step size, and clip the pixel values of the intermediate results after each step to ensure that they stay within the ε-neighborhood of the original image:
X^adv_0 = X,    X^adv_{N+1} = Clip_{X,ε}{ X^adv_N + α sign(∇_X J(X^adv_N, y_true)) }
In our experiments we use α = 1, i.e. we change the value of each pixel by only 1 at each step. We choose the number of iterations to be min(ε + 4, 1.25ε). This number of iterations is chosen heuristically: it is sufficient for the adversarial example to reach the edge of the ε max-norm ball, yet restricted enough to keep the computational cost of the experiments manageable.
Below we refer to this method as the "basic iterative" method.
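A sketch of the basic iterative method, under the same assumptions as the `fgsm` example above (the iteration count min(ε + 4, 1.25ε) follows the text):

```python
import tensorflow as tf

def basic_iterative(model, x, y_true, eps, alpha=1.0):
    """Repeated signed-gradient steps of size alpha, clipped after each
    step to the L-infinity eps-neighborhood of the original image."""
    n_iter = int(min(eps + 4, 1.25 * eps))
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    lower = tf.maximum(x - eps, 0.0)    # Clip_{X,eps} lower bound
    upper = tf.minimum(x + eps, 255.0)  # Clip_{X,eps} upper bound
    x_adv = tf.identity(x)
    for _ in range(n_iter):
        with tf.GradientTape() as tape:
            tape.watch(x_adv)
            loss = tf.keras.losses.sparse_categorical_crossentropy(
                y_true, model(x_adv), from_logits=True)
        grad = tape.gradient(loss, x_adv)
        x_adv = tf.clip_by_value(x_adv + alpha * tf.sign(grad), lower, upper)
    return x_adv
```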

2.3 Iterative least-likely class method

Both methods we have described so far simply try to increase the cost of the correct class, without specifying which of the incorrect classes the model should select. Such methods are sufficient for datasets such as MNIST and CIFAR-10, where the number of classes is small and all classes are very different from each other. On ImageNet, where the number of classes is much larger and the differences between classes vary in significance, these methods can result in uninteresting misclassifications, such as mistaking one breed of sled dog for another. To generate more interesting mistakes, we introduce the iterative least-likely class method. This iterative method tries to make the adversarial image be classified as a specific target class. For the target class we choose the least-likely class according to the prediction of the trained network on the image X:
y_LL = argmin_y { p(y|X) }
For a well-trained classifier, the least-likely class is usually highly dissimilar from the true class, so this attack method leads to more interesting mistakes, such as mistaking a dog for an airplane.
To make an adversarial image that is classified as y_LL, we maximize log p(y_LL|X) by taking iterative steps in the direction of sign{∇_X log p(y_LL|X)}. For neural networks with a cross-entropy loss, this last expression equals sign{−∇_X J(X, y_LL)}. Thus we have the following procedure:
X^adv_0 = X,    X^adv_{N+1} = Clip_{X,ε}{ X^adv_N − α sign(∇_X J(X^adv_N, y_LL)) }
For this iterative process, we used the same α and the same number of iterations as the basic iterative method.
Below we refer to this method as the "least likely class" method or simply "ll class".
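The corresponding sketch for the least-likely class method differs from the basic iterative method only in the choice of target class and the sign of the step (same assumptions as in the examples above):

```python
import tensorflow as tf

def least_likely_class(model, x, eps, alpha=1.0):
    """Iterative least-likely class method: descend on J(X, y_LL), where
    y_LL is the class the network rates least probable on the clean image."""
    n_iter = int(min(eps + 4, 1.25 * eps))
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    y_ll = tf.argmin(model(x), axis=1)  # least-likely class for clean X
    lower = tf.maximum(x - eps, 0.0)
    upper = tf.minimum(x + eps, 255.0)
    x_adv = tf.identity(x)
    for _ in range(n_iter):
        with tf.GradientTape() as tape:
            tape.watch(x_adv)
            loss = tf.keras.losses.sparse_categorical_crossentropy(
                y_ll, model(x_adv), from_logits=True)
        grad = tape.gradient(loss, x_adv)
        # Note the minus sign: we *decrease* the loss towards y_LL.
        x_adv = tf.clip_by_value(x_adv - alpha * tf.sign(grad), lower, upper)
    return x_adv
```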

2.4 Comparison of Adversarial Example Generation Methods

As mentioned above, there is no guarantee that adversarial images will actually be misclassified—sometimes the attacker wins, and sometimes the machine learning model wins. We perform an experimental comparison of adversarial methods to understand the actual classification accuracy of generated images and the type of perturbation utilized by each method.
Experiments were performed on all 50,000 validation samples from the ImageNet dataset (Russakovsky et al., 2014) using a pre-trained Inception v3 classifier (Szegedy et al., 2015). For each validation image, we generated adversarial examples using the different methods and different values of ε. For each pair of method and ε, we computed the classification accuracy over all 50,000 images. We also computed the accuracy on all clean images and used it as a baseline.
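For reference, top-k accuracy over a set of images can be computed with a small helper like the following; this is an illustrative sketch and the argument names are ours:

```python
import numpy as np

def top_k_accuracy(probs, labels, k=1):
    """Fraction of images whose true label appears among the k classes
    with the highest predicted probability.
    probs: (n_images, n_classes) array; labels: (n_images,) integer array."""
    top_k = np.argsort(probs, axis=1)[:, -k:]
    hits = [labels[i] in top_k[i] for i in range(len(labels))]
    return float(np.mean(hits))
```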
Figure 2 summarizes the Top-1 and Top-5 classification accuracies of various adversarial methods on clean and adversarial images. Figures 5 and 4 give examples of generated adversarial images.
Figure 2: Top-1 and Top-5 accuracy of Inception v3 under attack with different adversarial methods and different ε, compared to “clean images” (unmodified images in the dataset). Accuracy is computed on all 50,000 validation images in the ImageNet dataset. In these experiments, ε was varied from 2 to 128.
Figure 4: Comparison of different adversarial methods with ε = 32. The perturbations produced by the iterative methods are finer than those of the fast method. Also, the iterative methods do not always select a point on the boundary of the ε-neighborhood as the adversarial image.
Figure 5: Comparison of the impact of the "fast" adversarial method for different sizes of perturbation ε. The top image is a "washing machine", the bottom image is a "hamster". In both cases, the clean images are classified correctly for all considered ε, while the adversarial images are misclassified.

As shown in Figure 2, even for the smallest value of ε the fast method reduces top-1 accuracy by a factor of two and top-5 accuracy by about 40%. As ε increases, the accuracy on adversarial images produced by the fast method stays at roughly the same level until ε = 32, and then slowly drops to almost 0 as ε grows to 128. This can be explained by the fact that the fast method adds ε-scaled noise to every image, so higher values of ε essentially destroy the content of the image and make it unrecognizable even to humans; see Figure 5.
Iterative methods, on the other hand, exploit much finer perturbations that do not destroy the image even at higher ε, while at the same time fooling the classifier at a higher rate. The basic iterative method is able to produce better adversarial images when ε < 48, but as ε increases further it is unable to improve. The "least-likely class" method destroys the correct classification of most images even when ε is relatively small.
We restrict all further experiments to ε ≤ 16, because such perturbations are perceived only as a small noise (if perceived at all), yet the adversarial methods are able to produce a significant number of misclassified examples within this ε-neighborhood of clean images.

3. Photos of Adversarial Examples

3.1 Destruction rate of adversarial images

To study the impact of arbitrary transformations on adversarial images, we introduce the concept of destruction rate. It can be described as the fraction of adversarial images that are no longer misclassified after transformation. The formal definition is as follows:
d = Σ_{k=1..n} [ C(X^k, y^k_true) · ¬C(X^k_adv, y^k_true) · C(T(X^k_adv), y^k_true) ] / Σ_{k=1..n} [ C(X^k, y^k_true) · ¬C(X^k_adv, y^k_true) ]    (1)

where n is the number of images used to compute the destruction rate, X^k is the k-th image from the dataset, y^k_true is the true class of that image, and X^k_adv is the corresponding adversarial image. The function T(·) is an arbitrary image transformation; in this article we study a variety of transformations, including printing the image and taking a photo of the result. The function C(X, y) is an indicator function which returns whether the image is classified correctly:

C(X, y) = 1 if image X is classified as y, and 0 otherwise.

We denote the binary negation of this indicator as ¬C(X, y), computed as ¬C(X, y) = 1 − C(X, y).
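A small NumPy sketch of Equation (1), assuming the predictions for the clean, adversarial, and transformed adversarial images have already been computed (array names are illustrative):

```python
import numpy as np

def destruction_rate(pred_clean, pred_adv, pred_adv_t, y_true):
    """Equation (1): among adversarial images that were misclassified while
    their clean counterparts were classified correctly, the fraction that
    the transformation T turns back into correctly classified images."""
    pred_clean, pred_adv = np.asarray(pred_clean), np.asarray(pred_adv)
    pred_adv_t, y_true = np.asarray(pred_adv_t), np.asarray(y_true)
    eligible = (pred_clean == y_true) & (pred_adv != y_true)  # C(X,y) * notC(X_adv,y)
    destroyed = eligible & (pred_adv_t == y_true)             # ... * C(T(X_adv),y)
    return destroyed.sum() / max(int(eligible.sum()), 1)
```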

3.2 Experimental setup

To explore the possibility of physical adversarial examples, we conducted a series of experiments with photos of adversarial examples. We printed clean and adversarial images, took photos of the printed pages, and cropped the printed images from the photos of the full pages. We can think of this as a black-box transformation, which we refer to as the "photo transformation".
We calculated the accuracy on clean and adversarial images before and after the photo transformation, as well as the destruction rate of adversarial images subjected to the photo transformation.
Figure 3: Experimental setup: (a) a generated printout containing pairs of clean and adversarial images, along with QR codes to aid automatic cropping; (b) a photo of the printout taken with a mobile phone camera; (c) images automatically cropped from the photo.

The experimental steps are as follows:

  1. Print the images, see Figure 3a. To reduce the amount of manual work, we printed multiple pairs of clean and adversarial examples on each sheet. In addition, QR codes were placed in the corners of the printout to facilitate automatic cropping.
    (a) All generated printout pictures (Fig. 3a) were saved in lossless PNG format.
    (b) Batches of PNG printouts were converted into multi-page PDF files using the convert tool from the ImageMagick suite with the default settings: convert *.png output.pdf
    (c) The resulting PDF files were printed on a Ricoh MP C5503 office printer. Each page of the PDF file was automatically scaled to fit the whole sheet of paper using the default printer scaling. The printer resolution was set to 600 dpi.
  2. Photos of the printouts were taken with a mobile phone camera (Nexus 5x), see Figure 3b.
  3. The validation examples in the photos were automatically cropped and warped into squares of the same size as the source images, see Figure 3c (a code sketch of this cropping step appears after this list):
    (a) Detect the values and locations of the four QR codes in the corners of the photo. The QR codes encode which batch of validation examples is shown in the photo. If detection of any of the corners fails, the whole photo is discarded and the images from that photo are not used to measure accuracy. We observed that no more than 10% of all images were discarded in any experiment, and typically the number of discarded images was around 3%-6%.
    (b) Warp the photo using a perspective transformation so that the QR codes are moved to predefined coordinates.
    (c) After the image is warped, each example has known coordinates and can easily be cropped out of the image.
  4. Run classification on the transformed image and the source image. Compute the accuracy and destruction rate on adversarial images.
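The automatic cropping in step 3 could look roughly like the following OpenCV sketch; the exact QR-code detection, the layout of `example_boxes`, and all names here are illustrative assumptions rather than the authors' pipeline:

```python
import cv2
import numpy as np

def crop_examples(photo, qr_corners, out_size, example_boxes):
    """Warp the photo so the four detected QR-code corner points land on
    predefined coordinates, then crop each example at its known location.
    qr_corners: four (x, y) points in top-left, top-right, bottom-right,
    bottom-left order; example_boxes: (x0, y0, x1, y1) crop rectangles."""
    dst = np.float32([[0, 0], [out_size, 0],
                      [out_size, out_size], [0, out_size]])
    M = cv2.getPerspectiveTransform(np.float32(qr_corners), dst)
    rectified = cv2.warpPerspective(photo, M, (out_size, out_size))
    # After warping, every printed example has known pixel coordinates.
    return [rectified[y0:y1, x0:x1] for (x0, y0, x1, y1) in example_boxes]
```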

The procedure involved manually taking photos of the printed pages, without careful control of lighting, camera angle, distance to the page, and so on. This was intentional: it introduces nuisance variability that has the potential to destroy adversarial perturbations that depend on fine co-adaptation of exact pixel values. That said, we did not deliberately seek out extreme camera angles or lighting conditions. All photos were taken under normal indoor lighting with the camera pointed roughly straight at the page.
For each combination of adversarial example generation method and ε, we conduct two sets of experiments:

  • Average case. To measure average-case performance, we randomly selected 102 images for the experiment with a given ε and adversarial method. This experiment estimates how often an adversary would succeed on randomly chosen photos: the world picks an image at random, and the adversary tries to cause it to be misclassified.
  • Prefiltered case. To study a more aggressive attack, we ran experiments in which the images were prefiltered. Specifically, we selected 102 images such that all clean images were classified correctly and all adversarial images (before the photo transformation) were classified incorrectly (both top-1 and top-5). In addition, we used a confidence threshold on the top prediction: p(y_predicted|X) ≥ 0.8, where y_predicted is the class predicted by the network for image X. This experiment measures how often the adversary succeeds when it can choose which original images to attack (a sketch of this selection step follows the list). Under our threat model the adversary has access to the model parameters and architecture, so the attacker can always run inference to determine whether an attack would succeed in the absence of the photo transformation. The attacker might expect to do best by choosing attacks that succeed under this initial condition. The victim then takes a new photo of the physical object that the attacker chooses to display, and the photo transformation either preserves the attack or destroys it.
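The prefiltering step could be implemented as sketched below; here we apply the confidence threshold to the top prediction on the adversarial image (one possible reading of the criterion) and omit the top-5 condition for brevity. Array names are illustrative:

```python
import numpy as np

def prefilter_indices(probs_clean, probs_adv, y_true, threshold=0.8):
    """Select images whose clean version is classified correctly, whose
    adversarial version is misclassified (top-1), and where the network's
    top prediction on the adversarial image has confidence >= threshold."""
    pred_clean = probs_clean.argmax(axis=1)
    pred_adv = probs_adv.argmax(axis=1)
    confident = probs_adv.max(axis=1) >= threshold
    keep = (pred_clean == y_true) & (pred_adv != y_true) & confident
    return np.flatnonzero(keep)
```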

3.3 Experimental results on photos of adversarial images

Tables 1, 2 and 3 summarize the results of the photo transformation experiments.

Table 1: Accuracy on photos of adversarial images in the average case (randomly chosen images).
Table 2: Accuracy on photos of adversarial images in the prefiltered case (clean images classified correctly, adversarial images misclassified in their digital form, before printing and photographing).
Table 3: Destruction rates of adversarial images subjected to the photo transformation.
We find that "fast" adversarial images are more robust to the photo transformation than those produced by the iterative methods. This can be explained by the fact that the iterative methods exploit more subtle perturbations, and these subtle perturbations are more likely to be destroyed by the photo transformation.
One unexpected result is that in some cases the destruction rate of adversarial images in the "prefiltered case" was higher than in the "average case". For the iterative methods, the overall success rate on prefiltered images was even lower than on randomly chosen images. This suggests that, to achieve very high confidence, iterative methods often make subtle co-adaptations that are unable to survive the photo transformation.
Overall, the results show that some fraction of adversarial examples remains misclassified even after a non-trivial transformation such as the photo transformation. This demonstrates that physical adversarial examples are possible. For example, an adversary using the fast method with ε = 16 could expect about 2/3 of the images to be misclassified in terms of top-1 accuracy and about 1/3 in terms of top-5 accuracy. Thus, by generating enough adversarial images, the adversary could cause far more misclassifications than would occur on natural inputs.

3.4 Demonstration of black-box adversarial attacks in the physical world

The experiments described above study physical adversarial examples under the assumption that the adversary has full access to the model (i.e., the adversary knows its architecture, model weights, etc.). However, the black-box scenario, in which the attacker does not have access to the model, is a more realistic model of many security threats. Because adversarial examples often transfer from one model to another, they can potentially be used for black-box attacks (Szegedy et al., 2014; Papernot et al., 2016a). As our own black-box attack, we demonstrate that our physical adversarial examples fool a different model from the one that was used to construct them. Specifically, we show that they fool the open-source TensorFlow camera demo, a mobile phone application that performs image classification on-device. We showed the application several printed clean and adversarial images and observed the classification change from the true label to an incorrect label. A video of the demo is available at https://youtu.be/zQ_uMenoBCk. We also demonstrated this effect live at GeekPwn 2016.

4. Artificial Image Transformation

The transformation applied to an image by the process of printing it, photographing it, and cropping it can be thought of as some combination of much simpler image transformations. Therefore, to better understand what is going on, we conducted a series of experiments measuring the adversarial destruction rate under artificial image transformations. We explored the following set of transformations: change of contrast and brightness, Gaussian blur, Gaussian noise, and JPEG encoding.
For this set of experiments we used a subset of 1,000 images randomly chosen from the validation set. This subset of 1,000 images was selected once, so all experiments in this section use the same subset. We ran experiments for multiple pairs of adversarial method and transformation. For each given pair of transformation and adversarial method, we computed the adversarial examples, applied the transformation to them, and then computed the destruction rate according to Equation (1).
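The synthetic transformations themselves can be applied with standard image libraries; the following Pillow/NumPy sketch shows one plausible implementation, with parameter names and ranges chosen for illustration rather than taken from the paper:

```python
import io
import numpy as np
from PIL import Image, ImageEnhance, ImageFilter

def apply_transform(img_array, kind, amount):
    """Apply one synthetic transformation to a uint8 image array."""
    img = Image.fromarray(img_array.astype(np.uint8))
    if kind == "brightness":
        out = ImageEnhance.Brightness(img).enhance(amount)   # e.g. 0.5 .. 1.5
    elif kind == "contrast":
        out = ImageEnhance.Contrast(img).enhance(amount)
    elif kind == "gaussian_blur":
        out = img.filter(ImageFilter.GaussianBlur(radius=amount))
    elif kind == "gaussian_noise":
        noisy = np.asarray(img, np.float32) + np.random.normal(0.0, amount, img_array.shape)
        out = Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))
    elif kind == "jpeg":
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=int(amount))    # amount = JPEG quality
        out = Image.open(buf)
    else:
        raise ValueError(f"unknown transformation: {kind}")
    return np.asarray(out)
```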
Figure 6: Comparison of adversarial destruction rates for various adversarial methods and transformation types. All experiments were performed with ε = 16.

Detailed results of various transformations and adversarial methods for ε=16 can be found in Fig. 6. The following general observations can be drawn from these experiments:

  • Adversarial examples generated by the fast method are the most robust to transformations, while those generated by the iterative least-likely class method are the least robust. This is consistent with our results on the photo transformation.
  • The top-5 destruction rate is typically higher than the top-1 destruction rate. This can be explained by the fact that, in order to "destroy" a top-5 adversarial example, a transformation only has to push the correct class label into one of the top-5 predictions, whereas to destroy a top-1 adversarial example the correct label has to be pushed all the way to the top-1 prediction, which is a strictly stronger requirement.
  • Changing brightness and contrast does not affect adversarial examples much. The destruction rates for fast and basic iterative adversarial examples are below 5%, and for the iterative least-likely class method below 20%.
  • Blur, noise, and JPEG encoding have a higher destruction rate than changes of brightness and contrast. In particular, the destruction rate for the iterative methods can reach 80%-90%. However, none of these transformations destroys 100% of the adversarial examples, which is consistent with the "photo transformation" experiment.

5. Conclusion

In this paper, we explored the possibility of creating adversarial examples for machine learning systems that operate in the physical world. We used images taken with a mobile phone camera as input to the Inception v3 image classification neural network. We found that, in such a setup, a significant fraction of adversarial images crafted using the original network are misclassified even when fed to the classifier through the camera. This finding demonstrates the possibility of adversarial examples for machine learning systems in the physical world. In future work, we expect it will be possible to demonstrate attacks that use physical objects other than images printed on paper, attacks against different kinds of machine learning systems such as sophisticated reinforcement learning agents, attacks performed without access to the model's parameters and architecture (presumably using the transfer property), and physical attacks that achieve a higher success rate by explicitly modeling the physical transformation during adversarial example construction. We also hope that future work will develop effective methods for defending against such attacks.
