03 - Adversarial Sample Attacks

GitHub: https://github.com/Gary11111/03-GAN

Research Background

Although deep learning performs well on many computer vision tasks, Szegedy et al. were the first to show that deep neural networks have an intriguing weakness in image classification: despite their high accuracy, modern deep networks are highly vulnerable to adversarial samples. These adversarial samples carry only a slight perturbation, so small that the human visual system cannot detect it (the picture looks almost unchanged), yet the perturbation causes the network to completely change its classification of the picture. Moreover, the same perturbation can fool many different network classifiers.

Evaluation criteria for adversarial samples

A concrete example to illustrate the three metrics

Suppose a class has 80 boys and 20 girls, 100 students in total, and the goal is to find all the girls. Someone selects 50 people: 20 of them are girls, and the other 30 are boys mistakenly selected as girls. As an evaluator, you need to assess this result.

  • Accuracy

    For a given test dataset, accuracy is the proportion of samples the classifier labels correctly out of the total number of samples. In the scenario above, 20 girls and 50 boys are classified correctly, so

    acc = (20 + 50) / 100 = 70%

    If you only look at accuracy, there is a problem: if I simply declare everyone to be a boy, accuracy reaches 80%, yet that method is obviously useless for finding girls. This is why recall and precision are needed.

  • TP, FN, FP, TN

    The four cases form a 2x2 table (rows: retrieved / not retrieved; columns: relevant positive class / irrelevant negative class):

    TP: a positive judged as positive (the 20 girls among the 50 selected)
    FP: a negative judged as positive (the 30 boys among the 50 selected)
    FN: a positive judged as negative (a girl who is not selected and is therefore treated as a boy)
    TN: a negative judged as negative (the 50 boys who were not selected)
  • Recall

    R = \frac{TP}{TP + FN}: the proportion of all positive-class samples that are correctly retrieved.

  • Precision

    P = \frac{TP}{TP + FP}: the proportion of retrieved results that are actually positive. A quick numeric check of all three metrics is sketched below.
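
    As that check, here is a minimal Python snippet (not from the original post) that computes the three metrics for the class example above.

      # Counts from the class example: 20 girls (positives), 80 boys, 50 people selected.
      TP, FP, FN, TN = 20, 30, 0, 50

      accuracy = (TP + TN) / (TP + TN + FP + FN)   # (20 + 50) / 100 = 0.70
      recall = TP / (TP + FN)                      # 20 / 20 = 1.00: every girl was retrieved
      precision = TP / (TP + FP)                   # 20 / 50 = 0.40: only 20 of the 50 selected are girls

      print(accuracy, recall, precision)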

Adversarial attack modes

The two classification schemes below cover the attack modes most commonly used to test the effectiveness of defense models in practice. Based on whether the attacker knows the machine-learning algorithm, attacks are divided into white-box and black-box attacks; based on the attack goal (what the final classification should be), they are divided into targeted and non-targeted attacks. The concepts of black-box and white-box attack come up repeatedly in papers on defense algorithms, and a newly proposed defense is generally evaluated against both.

  • White-box attack: the attacker knows the algorithm used by the machine-learning model and its parameters, and can interact with the system while generating the adversarial data.
  • Black-box attack: the attacker does not know the algorithm or the parameters, but can still interact with the system, for example by feeding in arbitrary inputs and observing the outputs.
  • Targeted attack: for a given picture and a chosen target label, generate an adversarial sample that the classifier assigns exactly that target label; the attack must not only succeed, the adversarial sample must also end up in a specific class.
  • Non-targeted attack: for a given picture, generate an adversarial sample whose predicted label merely differs from the original; as long as the attack succeeds, there is no restriction on which class the adversarial sample finally belongs to.

Adversarial defenses

  • Adversarial training: adversarial training aims to train a robust model from randomly initialized weights. Its training set consists of the real dataset plus a copy with adversarial perturbations added, hence the name adversarial training (a minimal training-step sketch follows this list).

  • Gradient masking: since many current adversarial-sample generation methods are gradient-based, hiding the model's true gradients can help resist adversarial attacks.

  • Randomization: introduce random layers or random variables into the original model so that it behaves stochastically, which improves overall robustness and raises its tolerance to noise.

  • Denoising: before the input is classified, denoise the (possibly adversarial) sample to remove the perturbation that would otherwise attack the model.
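
A minimal TensorFlow 2 sketch of one adversarial-training step, for illustration only: the post does not say how the perturbed half of the training set is produced, so FGSM is used here purely as an assumed example, and the loss and optimizer choices are likewise assumptions.

    import tensorflow as tf

    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

    def fgsm_perturb(model, x, y, eps=8 / 255):
        # One-step FGSM perturbation (assumed attack; any generator would do here).
        with tf.GradientTape() as tape:
            tape.watch(x)
            loss = loss_fn(y, model(x, training=False))
        grad = tape.gradient(loss, x)
        return tf.clip_by_value(x + eps * tf.sign(grad), 0.0, 1.0)

    def adversarial_train_step(model, optimizer, x, y):
        # Train on the real batch plus its adversarially perturbed copy.
        x_adv = fgsm_perturb(model, x, y)
        x_all = tf.concat([x, x_adv], axis=0)
        y_all = tf.concat([y, y], axis=0)
        with tf.GradientTape() as tape:
            loss = loss_fn(y_all, model(x_all, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss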

Task

  • Generate untargeted adversarial samples

    Goal: generate untargeted adversarial samples and measure the success rate. An untargeted attack only has to make the model misclassify the input, so a simple approach is to make the cross-entropy loss of the adversarial sample as large as possible; in TensorFlow this can be implemented by negating the loss and optimizing with SGD (a minimal sketch is given after the figures below). Across many runs, the success rate of the untargeted attack is basically above 95%. Figures 3.1 and 3.2 show the effect of the attack: VGG16 is fooled even though the perturbation is invisible to the naked eye, and the model assigns the wrong label with high confidence.

    In other words, the only requirement is to make VGG16 predict the wrong label.

    The labels predicted on the original pictures:

    [Figure 3.1: VGG16 labels for the original pictures]

    After applying the DeepFool attack:

    [Figure 3.2: VGG16 labels for the adversarial pictures]
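
    A minimal TensorFlow 2 sketch of the untargeted attack described above: negate the cross-entropy loss so that SGD effectively maximizes it. The step count, learning rate, and the perturbation bound eps are illustrative assumptions, not values taken from the post.

      import tensorflow as tf

      def untargeted_attack(model, x, y_true, steps=100, lr=0.01, eps=8 / 255):
          # Optimize an additive perturbation delta so that the model misclassifies x.
          loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
          delta = tf.Variable(tf.zeros_like(x))
          opt = tf.keras.optimizers.SGD(learning_rate=lr)
          for _ in range(steps):
              with tf.GradientTape() as tape:
                  logits = model(tf.clip_by_value(x + delta, 0.0, 1.0), training=False)
                  loss = -loss_fn(y_true, logits)   # negated loss: SGD now maximizes cross-entropy
              grads = tape.gradient(loss, [delta])
              opt.apply_gradients(zip(grads, [delta]))
              delta.assign(tf.clip_by_value(delta, -eps, eps))  # keep the noise imperceptible
          return tf.clip_by_value(x + delta, 0.0, 1.0)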

  • Add regularization to the attack objective (a sketch of the combined loss follows the two experiments below)

    • Use total variation as a regularization term


      Success rate of this attack is 0.9333333333333333
      Noise norm of this attack is 25.494400024414062
      
    • Use the L2 distance between the adversarial sample and the original sample as the loss term


      Success rate of this attack is 1.0
      Noise norm of this attack is 23.5655111137727132
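
    Both regularizers can be folded into the attack objective as extra penalty terms. The sketch below shows one way to write the combined loss; the weights lam_tv and lam_l2 are illustrative assumptions, not values from the post.

      import tensorflow as tf

      def regularized_attack_loss(model, x, delta, y_true, lam_tv=0.01, lam_l2=0.01):
          # Untargeted attack term plus total-variation and L2 penalties on the perturbed image.
          ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
          x_adv = tf.clip_by_value(x + delta, 0.0, 1.0)
          adv_loss = -ce(y_true, model(x_adv, training=False))       # push away from the true label
          tv_loss = tf.reduce_sum(tf.image.total_variation(x_adv))   # favor smooth perturbations
          l2_loss = tf.reduce_sum(tf.square(x_adv - x))              # stay close to the original image
          return adv_loss + lam_tv * tv_loss + lam_l2 * l2_loss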
      
  • Adding image augmentation as a defense

    Affine transformation, salt-and-pepper noise, and blurring.

    Check whether the adversarial samples still fool the model after these augmentations are applied.

    • Affine transformation

      import cv2
      import numpy as np

      def affineTrans(img):
          # Map three reference points to slightly shifted targets and warp the 32x32 image.
          pts1 = np.float32([[10, 10], [20, 5], [5, 20]])
          pts2 = np.float32([[10, 8], [18, 5], [5, 20]])
          M = cv2.getAffineTransform(pts1, pts2)
          return cv2.warpAffine(img, M, (32, 32))
      
      • Success rate of this attack is 0.99
      • Noise norm of this attack is 15.59846019744873


  • Salt and pepper noise


    import numpy as np

    def noise(img, SNR=0.7):
        # Salt-and-pepper noise: SNR is the fraction of pixels left untouched.
        img_ = img.transpose(2, 1, 0)
        c, h, w = img_.shape
        mask = np.random.choice((0, 1, 2), size=(1, h, w), p=[SNR, (1 - SNR) / 2., (1 - SNR) / 2.])
        mask = np.repeat(mask, c, axis=0)  # repeat along the channel axis so the mask matches img's shape
        img_[mask == 1] = 255  # salt (white)
        img_[mask == 2] = 0    # pepper (black)
        return img_.transpose(2, 1, 0)
    
    • Success rate of this attack is 0.98
    • Noise norm of this attack is 20.688474655151367
  • Blur

    Use the blur function that comes with cv2

    img = cv2.blur(img, (1, 1))  # the larger the kernel, the blurrier the image
    

    Insert picture description here

    • Success rate of this attack is 1.0
    • Noise norm of this attack is 12.38012695312
  • Overlaying all three augmentations

    [Figure: attack results after applying all three augmentations together]
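
    A sketch of how the three augmentations might be chained for the combined test, reusing the affineTrans and noise functions defined above; the order of application is an assumption.

      import cv2

      def augment_all(img):
          # Apply all three defenses in sequence before feeding the image to VGG16.
          img = affineTrans(img)        # affine transformation (defined above)
          img = noise(img, SNR=0.7)     # salt-and-pepper noise (defined above)
          img = cv2.blur(img, (1, 1))   # blur with the same kernel as above
          return img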

  • Generate targeted adversarial samples

    Randomly designate a class and make the classifier assign the adversarial sample to that class (the loss function needs to be changed accordingly); a minimal sketch of the modified objective is given below.
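
    A minimal TensorFlow 2 sketch of the modified objective for the targeted case: instead of negating the loss, minimize the cross-entropy against the chosen target label. The helper name, target index handling, and hyperparameters are illustrative assumptions.

      import tensorflow as tf

      def targeted_attack(model, x, target_class, steps=500, lr=0.01, eps=8 / 255):
          # Drive the prediction toward target_class by minimizing CE with respect to that label.
          loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
          y_target = tf.fill((x.shape[0],), target_class)
          delta = tf.Variable(tf.zeros_like(x))
          opt = tf.keras.optimizers.SGD(learning_rate=lr)
          for _ in range(steps):
              with tf.GradientTape() as tape:
                  logits = model(tf.clip_by_value(x + delta, 0.0, 1.0), training=False)
                  loss = loss_fn(y_target, logits)   # no negation: minimize CE toward the target
              grads = tape.gradient(loss, [delta])
              opt.apply_gradients(zip(grads, [delta]))
              delta.assign(tf.clip_by_value(delta, -eps, eps))
          return tf.clip_by_value(x + delta, 0.0, 1.0)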

    This experiment chose the truck class as the target, hoping to turn every sample into a truck. In practice, the success rate of the targeted attack was only about 0.3, but every sample did deviate from its original label, i.e. the network was still fooled. I compared two sets of hyperparameters:

    lr = 0.01, 500 epochs

    [Figure: targeted-attack results with lr = 0.01, 500 epochs]

    lr = 0.005, 2000 epochs

    (Forgive me for not wanting to re-run the experiment to regenerate this figure.)

    The results are as follows. A curious phenomenon appeared in the experiment: although the success rate of the targeted attack is about 50%, nearly 48% of the adversarial samples are still classified by VGG16 as some other label; the detailed distribution is shown in Figure 3.10. Taking one attack as an example, the specified target was class 9, truck, yet after the attack more than 40% of the samples fall into class 2, automobile. The author suspects that the added noise perturbs some features enough to make the VGG network unstable and unable to discriminate, and truck and automobile are two very similar categories. This leads to the following hypothesis: the targeted attack succeeds, but because the perturbation is large, VGG16 oscillates between the two similar labels, automobile and truck, assigning them roughly the same probability.

    [Figure 3.10: class distribution of the adversarial samples after the targeted attack]

In summary, the targeted attack is successful, but the perturbation is large enough that VGG16 wavers between the two similar labels, automobile and truck, with a roughly even probability distribution.
