Backdoor attack and defense: a summary

Recently I have been reading papers on backdoor (Trojan) attacks in deep neural networks (DNNs). It is easy to forget the content of papers after reading them, so I am writing things down as a series of notes on backdoor attack and defense. Since I am also a beginner, my understanding falls short in many places; please bear with me. I highly recommend this survey, which summarizes the development of the field in recent years: Backdoor Attacks and Countermeasures on Deep Learning: A Comprehensive Review.

Introduction

A backdoor attack, also called a Trojan attack, is an attack on a neural network that causes the model to misclassify inputs that have been manipulated in a specific way. For example, in a traffic-sign recognition task, a stop sign is normally recognized correctly by the network, but if a small flower pattern is attached to it, the model may misrecognize it as an acceleration sign.

Take image classification as an example to illustrate the process of a backdoor attack. The patterns or pixel blocks (and not only these, as discussed later) that can induce the model to misclassify are called triggers. An input sample carrying a trigger is called a poisoned sample, and an input sample without a trigger is called a benign sample. First, the attacker generates a trigger, then selects a portion of the benign samples, attaches the trigger to them, and modifies their labels (some methods do not require this) to turn them into poisoned samples. These samples are then included in the training set used to train the model, and after training the backdoor is successfully embedded in the model. At test time, samples without the trigger are classified normally, while samples carrying the trigger are misclassified into the attacker-specified class.

(Figure: backdoor attack pipeline)
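To make the pipeline concrete, here is a minimal sketch of the poisoning step in Python. It assumes the training images are NumPy arrays of shape (H, W, C) with uint8 values; the pixel-block trigger, its size, the poisoning rate, and the function names are all illustrative choices, not a specific method from the papers.

```python
import numpy as np

def add_trigger(image, trigger_size=4, value=255):
    """Stamp a white pixel block into the lower-right corner of an image.

    `image` is assumed to be an (H, W, C) uint8 array; the 4x4 block size
    and the pure-white pattern are illustrative choices.
    """
    poisoned = image.copy()
    poisoned[-trigger_size:, -trigger_size:, :] = value
    return poisoned

def poison_dataset(images, labels, target_class, poison_rate=0.1, seed=0):
    """Attach the trigger to a random fraction of samples and relabel them.

    This is the poison-label setting: the labels of poisoned samples are
    overwritten with `target_class`. Returns new arrays; the originals are
    left untouched.
    """
    rng = np.random.default_rng(seed)
    images = images.copy()
    labels = labels.copy()
    n_poison = int(len(images) * poison_rate)
    idx = rng.choice(len(images), size=n_poison, replace=False)
    for i in idx:
        images[i] = add_trigger(images[i])
        labels[i] = target_class
    return images, labels

# Usage: poison 10% of a toy dataset so triggered samples map to class 0.
if __name__ == "__main__":
    X = np.random.randint(0, 256, size=(100, 32, 32, 3), dtype=np.uint8)
    y = np.random.randint(0, 10, size=100)
    X_p, y_p = poison_dataset(X, y, target_class=0, poison_rate=0.1)
    print("labels changed by poisoning:", int(np.sum(y_p != y)))
```

The poisoned arrays would then simply replace the originals when training the model; everything downstream of this step is ordinary supervised training.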

Comparison with other attacks

There are many attacks on neural networks, such as data poisoning attacks and adversarial attacks. Compared with a data poisoning attack, a backdoor attack is more concealed: the backdoored model still classifies benign samples correctly, and only samples carrying the trigger are misclassified. A data poisoning attack, in contrast, is indiscriminate; its goal is to reduce the overall classification accuracy of the model, whereas the goal of a backdoor attack is to misclassify poisoned samples without noticeably reducing the model's accuracy on benign ones. What the two have in common is that both inject tainted samples into the training set before the model is trained.

For an adversarial attack, the goal is to generate adversarial examples that the model misclassifies. An important difference from a backdoor attack is the attack phase: an adversarial attack takes place after the model has been built, while a backdoor attack takes place before the model is built.

Qualitative analysis

The basis for the success of a backdoor attack is that the capacity of a neural network is large enough and its "learning ability" is too strong: even a small trigger will be learned and treated as a feature of a certain class, so all samples carrying the trigger are misclassified into that class regardless of what the rest of the sample contains. One could say the neural network is naive and easily fooled: it sees that some samples with the trigger belong to a certain class, and concludes that all samples with the trigger belong to that class.

Although at the input level a benign sample and a poisoned sample differ only by the trigger, they differ substantially in feature space, which is what makes the model misclassify. Moreover, within the class that samples are misclassified into, the samples that genuinely belong to that class and the samples pushed into it by poisoning are also separated in feature space. This is easy to understand: for the cat class, the model learns from genuine cat samples that two pointed ears, four legs, and a tail make a cat, whereas for samples that are not cats but carry the trigger, the model learns that the trigger makes a cat. These two decision paths are different, so the feature-space representations differ, and so do the activations of the model's neurons, which provides a starting point for defending against backdoor attacks.
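As a rough illustration of this feature-space gap, the following PyTorch sketch collects penultimate-layer activations with a forward hook and compares the mean features of genuine target-class samples against triggered samples. The trained `model`, the choice of `layer`, and the input tensors are assumptions, and the mean-distance measure is just one simple way to quantify the difference.

```python
import torch

@torch.no_grad()
def penultimate_activations(model, layer, inputs):
    """Collect activations at `layer` (assumed to be the penultimate layer)
    for a batch of inputs, using a forward hook."""
    feats = []
    handle = layer.register_forward_hook(
        lambda module, inp, out: feats.append(out.flatten(1).cpu())
    )
    model.eval()
    model(inputs)
    handle.remove()
    return torch.cat(feats, dim=0)

@torch.no_grad()
def feature_gap(model, layer, clean_target_x, triggered_x):
    """Distance between the mean feature of genuine target-class samples and
    the mean feature of triggered samples classified into the same class.
    A large gap is the signal that activation-based defenses exploit."""
    clean_feats = penultimate_activations(model, layer, clean_target_x)
    trig_feats = penultimate_activations(model, layer, triggered_x)
    return torch.norm(clean_feats.mean(0) - trig_feats.mean(0)).item()
```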

Attack methods

Attack classification

  • According to whether the label of the poisoned sample is changed, attacks can be divided into clean-label attacks and poison-label attacks. In a clean-label attack, a sample keeps its original label after the trigger is added, while a poison-label attack changes the label. The clean-label attack is more concealed because the labels stay consistent with the content, making it harder for humans to notice; however, it requires poisoning samples of the class we want other samples to be misclassified into, so the attacker needs access to benign samples of the target class, whereas a poison-label attack can poison samples of any class.
  • According to whether samples are misclassified into a specified class, attacks can be divided into targeted attacks and untargeted attacks. In a targeted attack, samples carrying the trigger are misclassified into the attacker-specified class, while in an untargeted attack they are simply misclassified into some other class. Generally what we want is a targeted attack, because it is more purposeful: it induces the model to output the class we choose, which has higher practical value.
  • According to the threat model, attacks can be divided into black-box attacks and white-box attacks. The threat model refers to the assumptions the attacker makes, such as whether the structure and parameters of the model are known, whether the training data can be accessed, and whether label information is available. The more stringent the conditions of the threat model, the less general the attack; the looser the conditions, the more general and the stronger the attack. In a black-box attack, the attacker does not know the model's structure or parameters, while a white-box attack assumes the attacker can obtain them. The black-box setting therefore places higher demands on the attack method; a white-box attack typically implants a backdoor into a pre-trained model and then publishes the backdoored model on the Internet. When reading related papers, the threat model is the part that deserves special attention: only by knowing the conditions under which an attack holds can we properly evaluate it.
  • According to whether samples from all classes can be misclassified once they carry the trigger, attacks can be divided into multi-source attacks and specific-source attacks. In a specific-source attack, only samples from specific classes are misclassified after the trigger is added, while a multi-source attack misclassifies any sample carrying the trigger. The specific-source attack is a very powerful attack, and many existing defense methods cannot defend against it.

Attack goals

The goals of a backdoor attack are to maintain the model's accuracy on benign samples, to raise the misclassification success rate on poisoned samples, and to improve the concealment of the attack. Accordingly, the main focus of current research in this direction is the trigger-generation algorithm: finding triggers that are harder to detect and that work under looser attack conditions.

Trigger types

The triggers mentioned in the papers mainly come in the following types:

  • Pixel blocks. This was the trigger used when the backdoor attack was first proposed: a small pixel block added to the lower-right corner of the image serves as the trigger.
  • Fixed patterns. For example, a small flower, a pair of glasses, a Hello Kitty image, etc. Compared with plain pixel blocks, this kind of trigger is more concealed.
  • Changed pixel intensities. Altering the intensity of pixels in certain regions of the image can also serve as a trigger.
  • Added adversarial noise (perturbations).
  • Image overlays. Superimposing another image onto the input and tuning its transparency can increase concealment (see the sketch after this list).
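As a small illustration of the last trigger type, the sketch below blends another image onto the input with adjustable transparency. It assumes float images in [0, 1] of the same shape; the blend ratio is an arbitrary illustrative value rather than one taken from a specific paper.

```python
import numpy as np

def blend_trigger(image, trigger_image, alpha=0.1):
    """Blended (overlay) trigger: superimpose `trigger_image` onto `image`
    with transparency `alpha`. A small alpha keeps the poisoned sample
    visually close to the original, improving concealment.

    Both inputs are assumed to be float arrays in [0, 1] of the same shape.
    """
    blended = (1.0 - alpha) * image + alpha * trigger_image
    return np.clip(blended, 0.0, 1.0)

# Usage: overlay a random "key" image at 10% opacity onto a benign sample.
benign = np.random.rand(32, 32, 3)
key = np.random.rand(32, 32, 3)
poisoned = blend_trigger(benign, key, alpha=0.1)
```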

Defense methods

Defense requires a deeper understanding of backdoor attacks. Because backdoor attacks are highly concealed, there are many backdoor attacks at the research frontier but few defense methods, and defense research often lags behind attacks. Defense methods usually build on two characteristics: the difference between poisoned and benign samples in feature space, and the difference in neuron activations inside the network. There are three types of defense:

  • Preprocess input samples
    • Flip, crop, zoom
    • Add noise
    • Train an autoencoder on clean samples
  • Reverse engineering the trigger
    • Generation with Neural Cleanse
    • GAN generation
  • Feature space
    • AC (Activation Clustering) algorithm (a minimal sketch follows this list)
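For the feature-space defense, here is a minimal sketch of the activation-clustering idea: given the penultimate-layer activations of all training samples sharing one label (collected, for instance, as in the earlier hook sketch), split them into two clusters and flag a suspiciously small cluster as likely poisoned. The PCA dimensionality, the cluster-size threshold, and the use of scikit-learn's KMeans are illustrative assumptions rather than a faithful reproduction of the AC algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def activation_clustering(activations, n_components=10, size_threshold=0.35):
    """Split one class's penultimate-layer activations into two clusters.

    `activations` is assumed to be an (n_samples, n_features) array with
    n_features >= n_components. If the class is poisoned, genuine and
    triggered samples tend to fall into separate clusters, and the poisoned
    cluster is usually much smaller (since the poisoning rate is low).
    Returns the indices of suspected poisoned samples (possibly empty).
    """
    reduced = PCA(n_components=n_components).fit_transform(activations)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
    sizes = np.bincount(labels, minlength=2)
    smaller = int(np.argmin(sizes))
    # A clean class tends to split roughly evenly; a lopsided split is suspicious.
    if sizes[smaller] / len(labels) < size_threshold:
        return np.where(labels == smaller)[0]
    return np.array([], dtype=int)
```

One would run this per class; classes that yield a non-empty, tightly grouped small cluster are candidates for having been poisoned, and the flagged samples can be removed before retraining.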

Origin blog.csdn.net/SJTUKK/article/details/108914536