Paper Intensive Reading: Invisible Backdoor Attack with Sample-Specific Triggers


Paper information

  • Title: Invisible Backdoor Attack with Sample-Specific Triggers
  • Authors: Yuezun Li, Yiming Li, Baoyuan Wu, Longkang Li, Ran He, and Siwei Lyu
  • Institutions: Ocean University of China, The Chinese University of Hong Kong, Shenzhen Research Institute of Big Data, Tsinghua University, University at Buffalo
  • Venue: ICCV
  • Year: 2021
  • Paper link: https://openaccess.thecvf.com/content/ICCV2021/papers/Li_Invisible_Backdoor_Attack_With_Sample-Specific_Triggers_ICCV_2021_paper.pdf
  • Open-source code: https://github.com/yuezunli/ISSBA

paper contribution

  • It analyzes the condition under which current backdoor defenses succeed: existing backdoor attacks are sample-agnostic, i.e., different poisoned samples contain the same trigger.
  • It proposes a method for generating sample-specific backdoor triggers: based on the idea of image steganography, the training samples are slightly perturbed with invisible, sample-specific additive noise.

comprehension translation

Summary

Recently, backdoor attacks have posed a new security threat to the training process of deep neural networks (DNNs). Attackers try to inject hidden backdoors into DNNs so that the attacked model performs well on benign samples, yet its predictions are maliciously changed whenever an attacker-defined trigger activates the hidden backdoor. Existing backdoor attacks usually adopt a setting in which the trigger is sample-agnostic, i.e., different poisoned samples contain the same trigger, which makes the attacks easy to mitigate with existing backdoor defenses. In this work, we explore a new attack paradigm in which the backdoor trigger is sample-specific. Our attack only needs to modify certain training samples with invisible perturbations, without manipulating other training components (e.g., the training loss or the model structure) as many existing attacks do. Specifically, inspired by recent DNN-based image steganography, we encode an attacker-specified string into benign images through an encoder-decoder network, generating sample-specific invisible additive noise as the backdoor trigger. When DNNs are trained on the poisoned dataset, a mapping from the string to the target label is learned. Extensive experiments on benchmark datasets verify the effectiveness of our method in attacking models with and without defenses.

Code is available at https://github.com/yuezunli/ISSBA.

1 Introduction

Figure 1. Comparison of triggers in previous attacks (e.g., BadNets [8]) and our attack. The triggers of previous attacks are sample-agnostic (i.e., different poisoned samples contain the same trigger), while the triggers in our method are sample-specific.

Deep neural networks (DNNs) have been widely and successfully applied in many fields [11, 25, 49, 19]. A large amount of training data and increasing computing power are the key factors for its success, but the long-term and complicated training process has become a bottleneck for users and researchers. To reduce overhead, training DNNs usually leverages third-party resources. For example, you can use third-party data (such as data from the Internet or third-party companies), train models with third-party servers (such as Google Cloud), or even directly adopt third-party APIs. However, the opacity of the training process brings new security threats.

Backdoor attacks are an emerging threat in the training process of DNNs. It maliciously manipulates the prediction of the attacked DNN model by polluting part of the training samples. Specifically, the backdoor attacker injects some attacker-specified patterns (called backdoor triggers) in the tainted image and replaces the corresponding labels with predefined target labels. Therefore, attackers can embed some hidden backdoors into the models trained with the tainted training set. The attacked model will function normally when processing benign samples, but when the trigger is present, its predictions will be changed to target labels. Furthermore, the trigger may be invisible [3, 18, 34] and the attacker only needs to pollute a small fraction of samples, making the attack very stealthy. Therefore, insidious backdoor attacks pose a serious threat to the application of DNNs.

Fortunately, several backdoor defenses [7, 41, 45] have been proposed, showing that existing backdoor attacks can be successfully mitigated. This raises an important question: has the threat of backdoor attacks really been solved? In this paper, we reveal that existing backdoor attacks are easily mitigated by current defenses mainly because their backdoor triggers are sample-agnostic, that is, different poisoned samples contain the same trigger. Given that the trigger is sample-agnostic, defenders can easily reconstruct or detect the backdoor trigger based on the identical behavior shared by different poisoned samples.

Based on this understanding, we explore a new attack paradigm where the backdoor trigger is sample-specific. We only need to modify some training samples with invisible perturbations, without manipulating other training components (e.g., training loss and model structure) as many existing attacks do. Specifically, inspired by DNN-based image steganography [2, 51, 39], we encode an attacker-specified string into benign images through an encoder-decoder network, generating sample-specific invisible additive noise as the backdoor trigger. When DNNs are trained on the poisoned dataset, a mapping from the string to the target label is learned. The proposed attack paradigm breaks the fundamental assumption of current defense methods, so they can be easily bypassed.

The main contributions of this paper are as follows: (1) We provide a comprehensive discussion of the success conditions of current mainstream backdoor defenses. We reveal that their success all relies on the premise that the backdoor trigger is sample-agnostic. (2) We explore a new invisible attack paradigm where the backdoor trigger is sample-specific and invisible. It can bypass existing defenses because it breaks their fundamental assumption. (3) We conduct extensive experiments to verify the effectiveness of the proposed method.

2. Related work

2.1. Backdoor attack

Backdoor attack is an emerging and rapidly developing field of research that poses a security threat to the training process of deep neural networks (DNNs). Existing attacks can be divided into two categories according to the characteristics of triggers: (1) visible attacks, the triggers in the attacked sample are visible to humans; (2) invisible attacks, the triggers are invisible.

Visible backdoor attacks. Gu et al. [8] first revealed the backdoor threat in DNN training and proposed the BadNets attack, which is representative of visible backdoor attacks. Given an attacker-specified target label, BadNets poisons a portion of the training images from the other classes by stamping a backdoor trigger (e.g., a 3×3 white square in the lower-right corner of the image) onto the benign image. These poisoned images with the target label, together with the other benign training samples, are fed into the DNN for training. There are also some other works in this area [37, 22, 27]. In particular, a concurrent work [27] also studied sample-specific backdoor attacks. However, besides modifying the training samples, their method also requires controlling the training loss, which significantly reduces its threat in real-world applications.

Invisible backdoor attacks. Chen et al. [3] first discussed the stealthiness of backdoor attacks from the perspective of the visibility of backdoor triggers. They suggested that poisoned images should be indistinguishable from their benign counterparts to evade human inspection. Specifically, they proposed an invisible attack with a blending strategy, which generates poisoned images by blending the backdoor trigger with benign images instead of stamping it directly. Besides the above method, several other invisible attacks [31, 34, 50] were proposed for different scenarios: Quiring et al. [31] targeted the image scaling process during training, Zhao et al. [50] targeted video recognition, and Saha et al. [34] assumed that the attacker knows the model structure. Note that most existing attacks adopt a sample-agnostic trigger design, i.e., the trigger is fixed in the training or testing phase. In this paper, we propose a more powerful invisible attack paradigm in which the backdoor trigger is sample-specific.

2.2. Backdoor defense

See Section 3 (Gain insight into existing defenses) for details.

Pruning based defense. Inspired by the observation that backdoor-related neurons are usually dormant during inference on benign samples, Liu et al. [24] proposed to prune these neurons to remove hidden backdoors in DNNs. A similar idea was also explored by Cheng et al. [4], who proposed to remove the neurons with high activation values in the $\ell_{\infty}$ norm of the activation map of the final convolutional layer.

Defense based on trigger synthesis. Different from directly eliminating the hidden backdoor, trigger-synthesis-based defenses first synthesize potential triggers and then remove the hidden backdoor by suppressing their effects in a second stage. Wang et al. [41] proposed the first trigger-synthesis-based defense, Neural Cleanse, in which they first obtain a potential trigger pattern toward every class and then determine the final synthesized trigger pattern and its target label based on an anomaly detector. Similar ideas were also studied in [30, 9, 42], which adopted different approaches to generate the latent triggers or perform anomaly detection.

Saliency map based defense. These methods use saliency maps to identify potential trigger regions in order to filter malicious samples. Similar to trigger-synthesis-based defenses, an anomaly detector is also involved. For example, SentiNet [5] employs Grad-CAM [35] to extract critical regions from the input toward each class, and then locates the trigger regions based on boundary analysis. A similar idea was also explored in [13].

STRIP. Recently, Gao et al. [7] proposed STRIP, which filters malicious samples by superimposing various image patterns onto the suspicious image and observing the randomness of the resulting predictions. Under the assumption that the backdoor trigger is input-agnostic, the smaller the randomness, the higher the probability that the suspicious image is malicious.

3. Gain insight into existing defenses

In this section, we discuss the success conditions for current mainstream backdoor defenses. We argue that their success mainly depends on an implicit assumption that the backdoor triggers are sample-agnostic. Once this assumption is violated, their effectiveness will be greatly affected. The assumptions for several defense methods are discussed below.

The Assumption of Pruning-based Defenses. Pruning-based defenses are inspired by the assumption that backdoor-associated neurons are distinct from those activated by benign samples. Defenders can eliminate hidden backdoors by pruning neurons that are dormant on benign samples. However, the non-overlap between these two types of neurons may be due to the simplicity of the sample-agnostic trigger pattern, i.e., DNNs need only a small number of independent neurons to encode this trigger. This assumption may not hold when the triggers are sample-specific, since this paradigm is more complicated.

  • Detailed and easy-to-understand analysis:

    The pruning-based defense is a defense strategy for neural networks built on the assumption that the neurons associated with backdoor attacks (i.e., the neurons activated by the trigger) are different from the neurons that process normal (benign) samples. This is because, in a backdoor attack, the attacker usually implants specific neurons in the model, and when these neurons are activated (for example, by an input image containing the trigger), the model makes the prediction specified by the attacker.

    Thus, defenders can eliminate backdoors in the model by pruning (i.e., deleting) those neurons that do not respond to normal samples (i.e., are dormant). The reasoning is that if a neuron does not respond to normal samples, it is likely a backdoor neuron planted by the attacker.

    However, the success of this method presupposes that the backdoor neurons and the neurons processing normal samples do not overlap, an assumption that may not hold in some cases. For example, if the backdoor trigger is sample-specific (i.e., the trigger differs across samples), then the backdoor neurons may overlap with the neurons processing normal samples, and pruning can no longer effectively eliminate the backdoor. A minimal sketch of the pruning procedure is given below.
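Below is a minimal sketch, in PyTorch, of what such a pruning-based defense could look like; it is not a reference implementation. The tiny `model` and the random `benign_images` are dummy stand-ins introduced only for illustration; a real defense would use the trained victim model and a held-out set of benign samples.

```python
# A minimal sketch (not the defense's reference implementation), assuming we have a
# trained victim model and a handful of benign images; both are dummy stand-ins here.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),   # "last conv layer" whose channels we prune
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)
benign_images = torch.rand(64, 3, 32, 32)          # stand-in for held-out benign samples

# 1) Record the average channel activations of the chosen layer on benign samples.
activations = []
last_conv = model[2]
hook = last_conv.register_forward_hook(lambda m, inp, out: activations.append(out.detach()))
with torch.no_grad():
    model(benign_images)
hook.remove()
mean_act = torch.cat(activations).mean(dim=(0, 2, 3))   # one score per output channel

# 2) Prune (zero out) the channels that stay most dormant on benign data,
#    hoping they are the ones encoding the backdoor trigger.
prune_ratio = 0.2
num_prune = int(prune_ratio * mean_act.numel())
dormant = torch.argsort(mean_act)[:num_prune]
with torch.no_grad():
    last_conv.weight[dormant] = 0.0
    last_conv.bias[dormant] = 0.0
print("pruned channels:", dormant.tolist())
```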

The Assumption of Trigger Synthesis based Defenses. During the synthesis process, existing methods (e.g., Neural Cleanse [41]) need to find a potential trigger pattern that can convert any benign image into a specific class. Therefore, the synthesized trigger is effective only when the attacker-specified backdoor trigger is sample-agnostic.

  • The subject of this passage is the assumption behind trigger-synthesis defenses. A step-by-step explanation:

    1. Trigger synthesis defense: This is a defense strategy mainly used to deal with backdoor attacks of deep learning models. A backdoor attack is when an attacker inserts a hidden, specific pattern (we call it a "trigger") during model training that causes the model to make abnormal predictions. The goal of trigger synthesis defense is to find out such triggers in order to carry out effective defense.

    2. Synthesis process: In this defense strategy, possible trigger patterns are identified through a process called "synthesis". Simply put, the process tries to generate candidate patterns and checks which one causes the model's predictions to change abnormally. For example, if the model always predicts "cat" when it sees an image with a certain pattern, even if the actual content of the image is not a cat, then that pattern could be a trigger.

    3. Converting any harmless (benign) image to a specific category: This sentence describes the effect of the trigger. A trigger is a pattern that can change the outcome of the model's predictions. In our example, if an innocuous (i.e., not tampered with) image is incorrectly identified as "cat" by the model after the trigger is added, then we can consider this pattern a possible trigger.

    4. Sample-agnostic backdoor triggers: This is a core assumption of this defense approach. "Sample-agnostic" means that no matter which image the trigger is inserted into, as long as the trigger is present, the prediction of the model will change. That is, the effect of the trigger is not affected by the specific image content. This assumption is important because if the effect of the trigger changed with the content of the image, it would be difficult to find it through the synthesis process.

    So, the general idea of this passage is: when using a trigger-synthesis defense, we need to find potential triggers that can convert any innocuous image into a specific class, but this defense is valid only when the trigger is sample-agnostic.

  • Let me illustrate the meaning of this passage with a simple example:

    Suppose we have a deep learning model whose task is to recognize animals in images. In a normal situation, if we feed the model a picture of a dog, it will predict "dog".

    Now, suppose an attacker inserts a backdoor trigger into the model, let's call it "X". The effect of this trigger "X" is: no matter what the actual content in the image is, as long as the trigger "X" is contained in the image, the model will predict the result as "cat".

    In trigger synthesis defenses, our task is to try to generate various possible patterns and then test the model's response to these patterns. For example, we generate a pattern "Y", add it to an innocuous image (for example, an image of a dog), and observe the model's response.

    If the model still predicts "dog" when it sees an image containing pattern "Y", then we can infer that pattern "Y" is not the trigger. If the model's prediction changes to "cat" when it sees an image containing pattern "Y", then we can infer that pattern "Y" may be the trigger.

    However, the effectiveness of the trigger-synthesis defense relies on an important assumption: the backdoor trigger is sample-agnostic, i.e., no matter which image trigger "X" is added to, the model always predicts "cat" when it sees an image containing trigger "X". If the effect of the trigger changed with the image content, for example, if the model predicted "cat" when trigger "X" was added to a dog image but predicted "bird" when trigger "X" was added to a bird image, then it would be hard to find the trigger through the synthesis process.

  • Analysis:
    This passage describes the basic assumption of trigger-synthesis-based defenses. In such a defense, one first has to find potential triggers that can convert any normal (benign) image into a specific class. Here, a "potential trigger" is a pattern or mark that, once inserted into an image, makes the attacked model produce a specific prediction.

    A key assumption of this defense is that the backdoor trigger used by the attacker is sample-agnostic, that is, no matter which image it is inserted into, the trigger makes the model produce the same prediction. If the effect of the trigger depended on the image it is inserted into (i.e., if it were sample-specific), then the trigger synthesized in this way might fail to convert all images into the specific class.

    Therefore, trigger-synthesis-based defenses are effective only when the backdoor trigger is sample-agnostic. If this assumption does not hold, such defenses may fail to detect and defend against backdoor attacks.

  • Suppose we have an image-classification neural network whose task is to recognize the type of animal in an image. Normally, it judges which animal is in the image according to the image content.

    However, if the model is backdoored, the attacker may insert a special marker (that is, a backdoor trigger) into the image, such as a small red dot, so that once the model sees the small red dot it ignores whatever animal is actually in the image and predicts "cat".

    The defense method based on trigger synthesis then tries to find this little red dot. It attempts to find a pattern that converts all normal images into "cat"; if it finds a pattern similar to the little red dot, this may indicate a backdoor attack.

    However, the premise of this defense is that no matter which image the small red dot is inserted into, the prediction of the model becomes "cat", that is, the effect of the small red dot is sample-agnostic. If the small red dot produced different effects in different images, for example the model predicted "cat" when a red dot was inserted into a dog image but predicted "dog" when a red dot was inserted into a cat image, then this defense might fail to detect and defend against the backdoor attack, because it could not find a single trigger pattern that converts all images into the same class.

  • The main content of this paragraph is a hypothesis of trigger-synthesis-based defenses. Such methods, e.g., existing techniques such as Neural Cleanse, need to find possible trigger patterns during the synthesis process that can convert any benign (harmless) image into a specific class. Therefore, the synthesized trigger is effective only if the attacker-specified backdoor trigger is sample-agnostic.

    The term "sample-agnostic" refers to backdoor triggers that do not depend on a specific sample. That is, no matter what the input sample is, as long as the trigger is present, it can cause the model to produce a specific output. This is a core assumption of the trigger-synthesis-based defense approach.

    Let's take an example to illustrate the concept. Suppose we have a deep learning model for image classification that has been backdoored. During training, the attacker embeds a specific trigger (such as adding a red pixel in the upper-right corner of the picture) into some samples and sets the label of these samples to "cat". Then, after the model is deployed, the attacker can add this trigger to any input picture; whether the picture originally shows a "dog", a "car", or an "airplane", as long as it contains that trigger, the model will recognize it as "cat".

    However, if the trigger designed by the attacker is associated with a specific sample, for example, adding the trigger makes the model recognize the image as "cat" only when the input image is a "car", then defenses such as Neural Cleanse may not be effective, because their design assumes that the trigger is sample-agnostic and applies to any input image.

  • Neural Cleanse is a method for detecting and mitigating backdoor attacks in neural networks. The so-called backdoor attack means that during the model training process, the attacker injects some malicious samples with special patterns (ie, triggers) into the training set. Then, when this model encounters a sample containing the same trigger when deployed, it will produce the output the attacker expects, not the correct output.

    Let's take an example. Suppose an attacker wants to control an image classification model so that when it sees an image with a specific trigger (such as a red dot in the lower-right corner of the image), it classifies it as "cat". The attacker adds some pictures with this red-dot trigger to the training dataset and sets the label of these pictures to "cat". The model trained in this way behaves normally when it sees a normal picture, but when it sees a picture with the red-dot trigger, it recognizes it as a "cat" regardless of the actual content of the picture. This is a backdoor attack.

    The goal of Neural Cleanse is to detect and eliminate such backdoor attacks. The method it adopts is reverse engineering, that is, modifying the input of the model and observing the change of the output, so as to find possible backdoor triggers. In theory, if a model is not subject to a backdoor attack, changing a small part of the input should not result in a significant change in the output. But if a model is backdoored, adding a backdoor trigger to the input can lead to significant changes in the output.

    Specifically, Neural Cleanse first tries to find a minimal trigger, that is, the smallest change to the input that makes the output of the model switch from other categories to a specified category. Then, the sizes of these minimal triggers are compared across all classes. If the minimal trigger for a certain class is significantly smaller than the others, then that class may be the target of a backdoor attack.

    If a backdoor attack is detected, Neural Cleanse can also use a repair algorithm to eliminate the backdoor. The algorithm works by treating an identified backdoor trigger as a pattern of bad behavior and then repairing the model by forcing it to ignore the pattern. Although this method cannot completely eliminate the backdoor, it has been proven effective in practice.

    In general, Neural Cleanse is a reverse-engineering method for detecting and mitigating backdoor attacks, which takes advantage of the traces a backdoor attack must leave, namely that the model is overly sensitive to a small trigger.

  • "Existing methods such as Neural Cleanse need to obtain possible trigger patterns during the synthesis process" means that when Neural Cleanse is used as a defense, a synthesis process is required whose purpose is to find the possible trigger patterns.

    Trigger patterns are those that can activate a backdoor and cause a model to produce a specific (often erroneous) output. In the working principle of Neural Cleanse, we hope to discover these possible trigger modes through a reverse engineering method.

    Specifically, this process tries to find a minimal trigger for the model, that is, the smallest change to the input that switches the model's output from other categories to a specified category. Then, by comparing the size of this minimal trigger across all categories, it determines whether there is a backdoor attack. If the minimal trigger for a certain category is significantly smaller than the others, then that category may be the target of a backdoor attack.

    The purpose of this "synthesis process" is to try to simulate the triggers an attacker might use, in order to discover and verify the existence of backdoors in the model. A minimal sketch of this synthesis step is given below.
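The following is a minimal sketch of the trigger-synthesis step in a Neural-Cleanse-style defense, not the official implementation. The dummy `model`, the random `x_benign` batch, the learning rate, and the weight `lambda_l1` are all illustrative assumptions; the key idea is to optimize a small mask and pattern that flip every benign image to one candidate target class.

```python
# A minimal sketch of the synthesis step only; the dummy classifier, benign batch,
# learning rate, and regularization weight are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))   # stand-in for the suspect model
x_benign = torch.rand(32, 3, 32, 32)                               # stand-in for benign images
target_class = 0                                                   # candidate class being tested

mask_logit = torch.zeros(1, 1, 32, 32, requires_grad=True)     # sigmoid -> mask in [0, 1]
pattern_logit = torch.zeros(1, 3, 32, 32, requires_grad=True)  # sigmoid -> pattern in [0, 1]
optimizer = torch.optim.Adam([mask_logit, pattern_logit], lr=0.1)
lambda_l1 = 0.01                                                # penalty on the mask size

for step in range(200):
    mask = torch.sigmoid(mask_logit)
    pattern = torch.sigmoid(pattern_logit)
    stamped = (1 - mask) * x_benign + mask * pattern            # apply the candidate trigger
    logits = model(stamped)
    target = torch.full((x_benign.size(0),), target_class, dtype=torch.long)
    # Force every benign image toward the candidate class while keeping the mask small.
    loss = F.cross_entropy(logits, target) + lambda_l1 * mask.abs().sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Repeating this for every class and comparing the mask sizes (L1 norms) is what the
# anomaly detector uses; an unusually small mask suggests that class is a backdoor target.
print("synthesized mask L1 norm:", torch.sigmoid(mask_logit).sum().item())
```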

The Assumption of Saliency Map based Defenses. As described in Section 2.2, saliency-map-based defenses need to (1) compute the saliency map of every image (toward each class), and (2) locate the trigger region by finding the universal saliency region shared across different images. In the first step, whether the trigger is compact and large enough determines whether the saliency map contains the trigger region, which affects the effectiveness of the defense. The second step requires the trigger to be sample-agnostic; otherwise, it is hard for defenders to justify the trigger region.

  • This passage mainly discusses the assumptions of saliency-map-based defenses. A saliency map is an image-processing technique used to highlight the most important (or "salient") parts of an image, usually the parts that most attract the eye. Here, we mainly care about how to use this technique to defend against malicious model attacks.

    Now, let us go through this passage part by part.

    1. "Compute the saliency map of every image (for each class)": this means we need to generate a saliency map for every image in the dataset, highlighting its salient features. For example, if we are processing a set of pictures containing cats, dogs, and birds, we need to generate a saliency map for every picture of every class.

    2. "Whether the trigger is compact and large enough determines whether the saliency map contains the trigger region, which affects the effectiveness of the defense": a trigger is the specific pattern or feature that the attacker implants in the model; when the model detects this pattern or feature, it makes a predetermined response. For example, an attacker may implant a trigger in an image-recognition model so that whenever the model detects a specific pixel pattern (say, a red dot), it classifies the image as "cat". Whether the trigger is compact and large enough refers to how this trigger manifests in the image: if the trigger is very small or scattered, the saliency map may fail to highlight it accurately, which weakens the defense.

    3. "Locate the trigger region by finding the universal (shared) salient region across different images": the saliency map highlights the most important parts of an image; by comparing the saliency maps of different images, we can find the shared salient region, which may be the trigger implanted by the attacker.

    4. "The trigger must be sample-agnostic, otherwise it is difficult for defenders to justify the trigger region": if the trigger only appears in some specific pictures, then it is difficult to find a shared salient region in the saliency maps of all pictures, and the trigger cannot be located.

    Therefore, the trigger is required to be sample-agnostic, that is, the trigger should appear in the same way in all poisoned pictures.

  • A detailed explanation of this passage:

    The so-called saliency map-based defense is a method that tries to find possible backdoor triggers in images. A saliency map is a map that reveals which regions of the image have the greatest impact on the model’s predictions. If a region appears prominent in the saliency map, then this region may be a backdoor trigger.

  • For example, suppose we have a neural network model that is backdoored, and the attacker inserts a small red dot into the image as a backdoor trigger. When we calculate the saliency map of this image, if this small red dot has a great influence on the prediction results of the model, it will appear very prominent in the saliency map and be recognized by us. This is the first step.

    However, the premise of this method is that the small red dot is compact (i.e., its area of influence is concentrated) and large enough to be identified in the saliency map. If the red dot is very small, or if its area of influence is scattered across multiple parts of the image, then we may not be able to find it via the saliency map.

    In the second step, we need to find shared salient regions across different images as candidate backdoor trigger regions. The premise of this step is that the backdoor trigger is sample-agnostic, that is, no matter which image it is in, the trigger appears in the same area and produces the same effect. If the trigger appeared in different places or had different effects in different images, it would be difficult to determine which area is the real trigger region. A minimal Grad-CAM sketch for this kind of saliency analysis is given below.
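Below is a minimal Grad-CAM sketch in PyTorch that illustrates the first step (computing a saliency map for one image). The tiny two-stage network and the random input are placeholders; SentiNet itself adds boundary analysis and an anomaly detector on top of such maps.

```python
# A minimal Grad-CAM sketch under dummy assumptions (tiny network, random image);
# it only illustrates how a saliency map highlighting candidate trigger regions is obtained.
import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())          # feature extractor
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
image = torch.rand(1, 3, 32, 32)

features = conv(image)                 # feature maps of the layer we inspect
features.retain_grad()                 # keep gradients w.r.t. these feature maps
logits = head(features)
class_idx = int(logits.argmax(dim=1))  # class whose evidence we want to localize
logits[0, class_idx].backward()        # gradient of the class score

weights = features.grad.mean(dim=(2, 3), keepdim=True)           # channel importance
cam = F.relu((weights * features).sum(dim=1, keepdim=True))      # weighted sum of maps
cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)         # normalize to [0, 1]
print("saliency map shape:", tuple(cam.shape))  # high values mark candidate trigger regions
```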

The Assumption of STRIP. STRIP [7] tests whether a sample is malicious by superimposing various image patterns onto the suspicious image. If the predictions of the generated samples are consistent, the tested sample is regarded as a poisoned sample. Note that its success also relies on the assumption that the backdoor trigger is sample-agnostic.

  • The core of this paragraph is a description of the STRIP defense strategy. STRIP is a defense used to detect image samples that may have been maliciously manipulated. Specifically, it superimposes various image patterns on the suspicious image and then checks whether the superimposed images still keep the original prediction. If the predictions are consistent, it is possible that the original image has been maliciously manipulated. However, this method is also based on the assumption that the backdoor trigger is sample-agnostic, that is, the backdoor trigger has the same effect no matter which sample it is applied to.

    For example, suppose we have an image that may have been manipulated and whose prediction is "cat". Using the STRIP strategy, we superimpose different image patterns on this image, such as a picture of a dog, a picture of a table, and so on. After superposition, we obtain a set of new images and predict on them. If the prediction is still "cat" for all of them, then the original image has likely been manipulated, because the prediction does not change even when other elements are superimposed. The premise of this strategy, however, is that the backdoor trigger has the same effect no matter which image it is applied to.

  • A detailed explanation of this passage:

    STRIP is a neural network defense technology. Its working principle is to superimpose various image patterns on a suspicious image, and then observe whether the superimposed images are consistent with the categories predicted by the neural network model. If the predictions are consistent, then it can be inferred that there may be a backdoor trigger in the suspicious image, since this trigger leads the model to make the same prediction no matter which image pattern is superimposed.

  • For example, suppose we have a neural network model that is under a backdoor attack, and the attacker inserts a small red dot into images as the backdoor trigger; whenever this small red dot appears, the model predicts a specific category. If we superimpose various image patterns on a suspicious image and the model predicts the same category no matter which pattern is superimposed, then we have reason to believe that this little red dot may be the backdoor trigger.

    However, the success of STRIP also relies on the assumption that the backdoor trigger is sample-agnostic. If the trigger behaved differently in different samples, for example appearing in the upper-left corner of the image in some samples and in the lower-right corner in others, then the superimposed image patterns might not cover all triggers, the model's predictions would become inconsistent, and STRIP's detection would be weakened. A minimal sketch of the STRIP check is given below.
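The following sketch illustrates the STRIP check described above, assuming a trained classifier, one suspicious image, and a pool of benign overlay images (all replaced by dummies here); the blending weight of 0.5 is an illustrative choice.

```python
# A minimal sketch of the STRIP-style entropy check; the classifier, the suspicious
# image, the overlay pool, and the 0.5 blending weight are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))   # stand-in classifier
suspicious = torch.rand(3, 32, 32)                                 # image under test
overlays = torch.rand(100, 3, 32, 32)                              # benign images to blend in

with torch.no_grad():
    blended = 0.5 * suspicious + 0.5 * overlays        # superimpose each benign pattern
    probs = F.softmax(model(blended), dim=1)
    # Entropy of each perturbed prediction, averaged over all overlays.
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1).mean()

# With a sample-agnostic trigger the prediction barely changes under blending, so the
# entropy stays low; a low average entropy therefore flags the sample as poisoned.
print("average entropy:", entropy.item())
```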

4. Sample-specific backdoor attack (SSBA)

4.1. Threat model

Attacker's capabilities. We assume that the attacker can poison part of the training data, but cannot access or modify other training components (e.g., the training loss, training schedule, and model structure). During inference, the attacker can and can only query the trained model with arbitrary images. They have neither information about the model nor the ability to manipulate the inference process. This is the minimum requirement for a backdoor attacker [21]. This threat can occur in many real-world scenarios, including but not limited to adopting third-party training data, training platforms, and model APIs.

Attacker's goals. In general, the backdoor attacker tries to embed a hidden backdoor into DNNs by poisoning the data. The hidden backdoor is activated by the attacker-specified trigger, i.e., the prediction of an image containing the trigger will be the target label, regardless of its true label. In particular, the attacker has three main goals: effectiveness, stealthiness, and sustainability. "Effectiveness" requires that the prediction of the attacked DNN should be the target label whenever the backdoor trigger appears, and that the performance on benign test samples should not degrade significantly; "stealthiness" requires that the adopted trigger should be concealed and that the proportion of poisoned samples (i.e., the poisoning rate) should be small; "sustainability" requires that the attack should remain effective under some common backdoor defenses.

4.2. The proposed attack

In this section, we illustrate our proposed method. Before describing how to generate sample-specific triggers, we first briefly review the main process of the attack and introduce the definition of a sample-specific backdoor attack.

The main process of backdoor attack: Let $\mathcal{D}_{\text{train}}=\left\{(\boldsymbol{x}_i, y_i)\right\}_{i=1}^{N}$ denote a benign training set containing $N$ independent and identically distributed samples, where $\boldsymbol{x}_i \in \mathcal{X}=\{0, \cdots, 255\}^{C \times W \times H}$ and $y_i \in \mathcal{Y}=\{1, \cdots, K\}$. The classifier learns a function $f_{\boldsymbol{w}}: \mathcal{X} \rightarrow[0,1]^{K}$ with parameters $\boldsymbol{w}$. Let $y_t \in \mathcal{Y}$ denote the target label. The core of the backdoor attack is how to generate the poisoned training set $\mathcal{D}_p$. Specifically, $\mathcal{D}_p$ consists of a modified version of a subset of $\mathcal{D}_{\text{train}}$ (i.e., $\mathcal{D}_m$) and the remaining benign samples $\mathcal{D}_b$, namely

$$\mathcal{D}_p=\mathcal{D}_m \cup \mathcal{D}_b \quad (1)$$

where $\mathcal{D}_b \subset \mathcal{D}_{\text{train}}$, $\gamma=\frac{|\mathcal{D}_m|}{|\mathcal{D}_{\text{train}}|}$ denotes the poisoning rate, $\mathcal{D}_m=\left\{(\boldsymbol{x}^{\prime}, y_t) \mid \boldsymbol{x}^{\prime}=G_{\boldsymbol{\theta}}(\boldsymbol{x}),(\boldsymbol{x}, y) \in \mathcal{D}_{\text{train}} \backslash \mathcal{D}_b\right\}$, and $G_{\boldsymbol{\theta}}: \mathcal{X} \rightarrow \mathcal{X}$ is the attacker-specified poisoned image generator. The smaller $\gamma$ is, the stealthier the attack.

Definition 1: A backdoor attack with poisoned image generator $G(\cdot)$ is called sample-specific if for all $\boldsymbol{x}_{i}, \boldsymbol{x}_{j} \in \mathcal{X}$ ($\boldsymbol{x}_{i} \neq \boldsymbol{x}_{j}$) we have $T\left(G\left(\boldsymbol{x}_{i}\right)\right) \neq T\left(G\left(\boldsymbol{x}_{j}\right)\right)$, where $T(G(\boldsymbol{x}))$ denotes the backdoor trigger contained in the poisoned sample $G(\boldsymbol{x})$.

Note 1: The triggers of previous attacks were not sample-specific. For example, for the attack proposed in [3], $T(G(\boldsymbol{x}))=\boldsymbol{t}$ for all $\boldsymbol{x} \in \mathcal{X}$, where $G(\boldsymbol{x})=(\mathbf{1}-\boldsymbol{\lambda}) \otimes \boldsymbol{x}+\boldsymbol{\lambda} \otimes \boldsymbol{t}$.

Let us explain these mathematical formulas and terms one by one for understanding.

This text describes the main process of a backdoor attack and how to define a sample-specific backdoor attack.

  1. $\mathcal{D}_{\text{train}}=\left\{(\boldsymbol{x}_i, y_i)\right\}_{i=1}^{N}$: the training dataset, containing $N$ samples, where each sample consists of an input $\boldsymbol{x}_i$ and a label $y_i$. These samples are assumed to be independent and identically distributed.

  2. $\boldsymbol{x}_i \in \mathcal{X}=\{0, \cdots, 255\}^{C \times W \times H}$: each input $\boldsymbol{x}_i$ is an image with one matrix per channel (e.g., red, green, blue) of width $W$ and height $H$, where each pixel takes a value between 0 and 255. $C$ denotes the number of color channels.

  3. $y_i \in \mathcal{Y}=\{1, \cdots, K\}$: each label $y_i$ is a category belonging to the set $\mathcal{Y}$, which contains the integers from 1 to $K$, where $K$ is the total number of categories.

  4. $f_{\boldsymbol{w}}: \mathcal{X} \rightarrow[0,1]^{K}$: the classifier function, which takes an input image and returns a $K$-dimensional vector whose elements are the probabilities that the image belongs to the corresponding categories.

  5. $\mathcal{D}_p=\mathcal{D}_m \cup \mathcal{D}_b$: the poisoned training set, composed of the modified sample set $\mathcal{D}_m$ and the remaining benign sample set $\mathcal{D}_b$.

  6. $\gamma=\frac{|\mathcal{D}_m|}{|\mathcal{D}_{\text{train}}|}$: the poisoning rate, i.e., the proportion of samples modified to contain the backdoor among all training samples.

  7. $\mathcal{D}_m=\left\{(\boldsymbol{x}^{\prime}, y_t) \mid \boldsymbol{x}^{\prime}=G_{\boldsymbol{\theta}}(\boldsymbol{x}),(\boldsymbol{x}, y) \in \mathcal{D}_{\text{train}} \backslash \mathcal{D}_b\right\}$: the modified sample set, where each input is obtained by modifying the original input with the function $G_{\boldsymbol{\theta}}$ and the label is replaced with the target label $y_t$.

  8. Definition 1: for any two different input images, the backdoor triggers contained in their poisoned versions are also different.

  9. Note 1: in previous attacks, the trigger was not sample-specific. For example, for the attack proposed in [3], the trigger is the same in all poisoned images.

Note:

These two formulas are describing the process of poisoning and identifying the triggers of poisoning. Let me explain each one for you:

  1. $G_{\boldsymbol{\theta}}(\boldsymbol{x})$: a function defined by the attacker, which transforms the original sample $\boldsymbol{x}$ into a poisoned sample. $\boldsymbol{\theta}$ denotes the parameters of this function, and $\boldsymbol{x}$ is the original input sample. Its main role is to embed the backdoor trigger into the original sample to generate the poisoned sample. For example, a simple poisoning function could change a portion of the pixels of the original sample to a specific color, forming a distinctive mark.

  2. $T(G(\boldsymbol{x}))$: the function used to identify the backdoor trigger in a poisoned sample. $G(\boldsymbol{x})$ is the generated poisoned sample, and $T(G(\boldsymbol{x}))$ extracts the backdoor trigger from it. In simple terms, this function identifies which parts of the poisoned sample have been modified.

So, in general, $G_{\boldsymbol{\theta}}(\boldsymbol{x})$ is responsible for generating poisoned samples, while $T(G(\boldsymbol{x}))$ is responsible for identifying the backdoor trigger in them. The $\boldsymbol{x}$ in both functions refers to the original sample.
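As a worked example of this notation, the sketch below builds a poisoned set $\mathcal{D}_p=\mathcal{D}_m \cup \mathcal{D}_b$ from a benign dataset given a poisoning rate $\gamma$ and a target label $y_t$. The generator `G` is a placeholder; in ISSBA it would return the image with its sample-specific invisible trigger.

```python
# A minimal sketch under dummy assumptions: G is a placeholder generator (identity here),
# and the benign dataset is random; it only illustrates how D_m and D_b are assembled.
import random
import torch

def G(x):
    # Placeholder for the attacker-specified generator G_theta; in ISSBA it would
    # return the image with a sample-specific invisible trigger embedded in it.
    return x

def build_poisoned_set(d_train, gamma, y_t):
    n_modify = int(gamma * len(d_train))
    chosen = set(random.sample(range(len(d_train)), n_modify))
    d_m = [(G(x), y_t) for i, (x, y) in enumerate(d_train) if i in chosen]      # modified subset
    d_b = [(x, y) for i, (x, y) in enumerate(d_train) if i not in chosen]       # remaining benign
    return d_m + d_b                                                            # D_p = D_m ∪ D_b

d_train = [(torch.rand(3, 32, 32), random.randint(0, 9)) for _ in range(1000)]
d_p = build_poisoned_set(d_train, gamma=0.10, y_t=0)
print(len(d_p), "samples in D_p")
```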

Figure 2. Our attack pipeline. In the attack phase, the backdoor attacker poisons some benign training samples by injecting sample-specific triggers. The generated triggers are invisible additive noise containing the information of a representative string of the target label. In the training phase, the user trains the DNN on the poisoned training set with the standard training process. Accordingly, a mapping from the representative string to the target label is learned. In the inference phase, the infected classifier (i.e., the DNN trained on the poisoned training set) behaves normally on benign test samples, while its prediction is changed to the target label when the backdoor trigger is added.

Figure 3. The training process of the encoder-decoder network. The encoder is trained on the benign training set simultaneously with the decoder. Specifically, the encoder is trained to embed a string into the image while minimizing the perceptual difference between the input image and the encoded image, whereas the decoder is trained to recover the hidden message from the encoded image.

How to generate sample-specific triggers

How to generate sample-specific triggers. We use a pretrained encoder-decoder network as an example to generate sample-specific triggers, an idea inspired by DNN-based image steganography [2, 51, 39]. The generated triggers are invisible additive noise containing a representative string of the target label. This string can be flexibly designed by the attacker. For example, it can be the name of the target label, its index, or even a random character. As shown in Figure 2, the encoder takes a benign image and the representative string to generate the poisoned image (i.e., the benign image with the corresponding trigger). The encoder is trained on the benign training set simultaneously with the decoder. In particular, the encoder is trained to embed a string into the image while minimizing the perceptual difference between the input image and the encoded image, while the decoder is trained to recover the hidden message (here, the backdoor trigger) from the encoded image. Their training process is shown in Figure 3. Note that attackers can also use other methods, such as a VAE [17], to carry out sample-specific backdoor attacks. This will be further studied in our future work.
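The sketch below illustrates how a poisoned image could be produced with such a pretrained encoder. The `StegaStampEncoder` class, the bit-encoding helper, the 100-bit code length, and the string "goldfish" are hypothetical stand-ins for illustration only; the actual ISSBA code (https://github.com/yuezunli/ISSBA) builds on StegaStamp and differs in its details.

```python
# A minimal, hypothetical sketch of poisoning one image with a pretrained encoder.
import torch
import torch.nn as nn

def string_to_bits(s: str, n_bits: int = 100) -> torch.Tensor:
    # Turn the attacker-chosen representative string (e.g., the target-label name)
    # into a fixed-length bit vector that the encoder can embed.
    bits = [int(b) for ch in s.encode("utf-8") for b in format(ch, "08b")]
    bits = (bits + [0] * n_bits)[:n_bits]
    return torch.tensor(bits, dtype=torch.float32).unsqueeze(0)

class StegaStampEncoder(nn.Module):
    """Hypothetical stand-in for a pretrained steganography encoder."""
    def forward(self, image, secret):
        residual = 0.01 * torch.randn_like(image)   # pretend "invisible" additive residual
        return (image + residual).clamp(0, 1)

encoder = StegaStampEncoder()                # in practice: a pretrained encoder network
benign = torch.rand(1, 3, 224, 224)
secret = string_to_bits("goldfish")          # representative string of the target label
poisoned = encoder(benign, secret)           # benign image + sample-specific trigger
trigger = poisoned - benign                  # the invisible additive noise T(G(x))
print("max pixel change:", trigger.abs().max().item())
```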

The pipeline of the sample-specific backdoor attack

The pipeline of the sample-specific backdoor attack. Once the poisoned training set $\mathcal{D}_{\text{poisoned}}$ has been generated with the above method, the backdoor attacker sends it to the user. The user will use it to train DNNs with the standard training process, i.e.,

$$\min_{\boldsymbol{w}} \frac{1}{N} \sum_{(\boldsymbol{x}, y) \in \mathcal{D}_{\text{poisoned}}} \mathcal{L}\left(f_{\boldsymbol{w}}(\boldsymbol{x}), y\right) \quad (2)$$

where $\mathcal{L}$ denotes a loss function, such as the cross entropy. Optimization (2) can be solved by back-propagation [33] with stochastic gradient descent [48]. During training, the DNN learns the mapping from the representative string to the target label. In the inference stage, the attacker can activate the hidden backdoor by adding the trigger to the image with the encoder.

This is a formulation of a typical machine learning optimization problem, specifically, this is the formulation of an optimization loss function. Let's break it down in parts:

  1. $\min_{\boldsymbol{w}}$: this part means that our goal is to minimize some function of $\boldsymbol{w}$ (the weight vector). In machine learning, we usually wish to find a set of parameters (here the weights $\boldsymbol{w}$) that minimizes the loss function.

  2. $\frac{1}{N} \sum_{(\boldsymbol{x}, y) \in \mathcal{D}_{\text{poisoned}}}$: this part is a summation over all samples in the dataset $\mathcal{D}_{\text{poisoned}}$, which denotes the poisoned dataset. $\boldsymbol{x}$ is the input data and $y$ the corresponding label. $\frac{1}{N}$ normalizes the sum, where $N$ is the total number of samples in the dataset, so that the loss value does not depend on the dataset size.

  3. $\mathcal{L}\left(f_{\boldsymbol{w}}(\boldsymbol{x}), y\right)$: this is the loss function. $f_{\boldsymbol{w}}(\boldsymbol{x})$ is the prediction of the model (with parameters $\boldsymbol{w}$) for the input $\boldsymbol{x}$, and $\mathcal{L}$ measures the difference between the model prediction and the label $y$.

  4. The label $y$ is the attacker-specified target label for the poisoned samples and the original true label for the benign samples.

So, the meaning of the whole expression is: we want to find a set of weights $\boldsymbol{w}$ such that, over all samples in the tampered dataset $\mathcal{D}_{\text{poisoned}}$, the average loss between the model predictions and the labels is minimal.
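For concreteness, the following is a minimal sketch of the standard training process in Eq. (2): the user simply minimizes the average cross-entropy loss over the poisoned set with SGD. The tiny model, random data, and hyperparameters are illustrative assumptions, not the paper's setup.

```python
# A minimal sketch of Eq. (2): standard training on the poisoned set. The tiny model,
# random tensors standing in for D_poisoned, and the hyperparameters are assumptions.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

images = torch.rand(256, 3, 32, 32)                   # stand-in for D_poisoned images
labels = torch.randint(0, 10, (256,))                 # mixture of true labels and target labels
loader = DataLoader(TensorDataset(images, labels), batch_size=64, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))   # f_w
criterion = nn.CrossEntropyLoss()                                 # the loss L in Eq. (2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

for epoch in range(2):
    for x, y in loader:
        loss = criterion(model(x), y)     # L(f_w(x), y)
        optimizer.zero_grad()
        loss.backward()                   # back-propagation
        optimizer.step()                  # stochastic gradient descent
```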

5. Experiment

5.1. Experiment setup

Datasets and Models

Datasets and models. We consider two classic image classification tasks: (1) object classification and (2) face recognition. For the first task, we conduct experiments on the ImageNet dataset [6]. For simplicity, we randomly select a subset of 200 categories with 100,000 images for training (500 per category) and 10,000 images for testing (50 per category). The image size is 3 × 224 × 224. Furthermore, we employ the MS-Celeb-1M dataset [10] for face recognition. In the original dataset, there are approximately 100,000 identities, and each identity contains a variable number of images, ranging from 2 to 602. For simplicity, we selected the top 100 identities with the largest number of images. More specifically, we got 100 identities with a total of 38,000 images (380 per identity). The split ratio of training set and test set is set to 8:2. For all images, we first perform face alignment, then select the center face, and finally resize it to 3×224×224. We use ResNet-18 [11] as the model structure for both datasets. We also conduct more experiments using VGG-16 [38] in the supplementary material.


Experimental benchmark

Experimental comparison benchmarks. We compare the proposed sample-specific backdoor attack with BadNets [8] and the typical invisible attack with the blending strategy (dubbed Blended Attack) [3]. We also provide the model trained on the benign dataset (dubbed Standard Training) as another reference benchmark. Furthermore, we select Fine-Pruning [24], Neural Cleanse [41], SentiNet [5], STRIP [7], DF-TND [42], and Spectral Signatures [40] to evaluate the resistance to state-of-the-art defenses.

Figure 4. Poisoned samples generated by different attacks. BadNets and Blended Attack use a white square with a crosshair (the area in the red box) as the trigger pattern, while the trigger of our attack is sample-specific invisible additive noise over the entire image.

Attack settings

Attack settings. We set the poisoning rate of all attacks on both datasets to $\gamma=10\%$ and the target label to $y_t=0$. As shown in Figure 4, for BadNets and Blended Attack, the backdoor trigger of the poisoned image is a $20 \times 20$ white square, and the trigger opacity of the Blended Attack is set to $10\%$. The triggers of our method are generated by the encoder trained on the benign training set. Specifically, we follow the setup of the encoder-decoder network in StegaStamp [39], where we use a U-Net [32] style DNN as the encoder, a spatial transformer network [15] based decoder, and train with four loss terms: an $L_{2}$ residual regularization, the LPIPS perceptual loss [47], a critic loss that minimizes the perceptual distortion of the encoded images, and a cross-entropy loss for code reconstruction. The scaling factors of the four loss terms are set to 2.0, 1.5, 0.5, and 1.5, respectively. For training all encoder-decoder networks, we use the Adam optimizer [16] with an initial learning rate of 0.0001. The batch size and the number of training iterations are set to 16 and 140,000, respectively. Furthermore, in the (classifier) training phase, we use the SGD optimizer with an initial learning rate of 0.001. The batch size and the maximum number of epochs are set to 128 and 30, respectively. The learning rate is decayed by a factor of 0.1 after epochs 15 and 20. A sketch of this four-term loss combination is given after the supplementary notes below.

Supplement:
attack experiment

  • Poisoned samples: for each dataset, we set the poisoned samples to account for 10% of all samples, and the target label is $y_t=0$;

  • Trigger construction: the trigger of BadNets and Blended Attack is located in the lower-right corner of the image, with size $20 \times 20$; the trigger is divided into 4 blocks by two mutually perpendicular lines:

    • The difference between the trigger of Blended Attack and BadNets is that BadNets directly stamps the trigger onto the lower-right corner of the image, while Blended Attack obtains the final triggered image as a weighted average of the trigger and the image according to a transparency coefficient (set to $10\%$ in the experiments);
    • The trigger of SSBA (the method proposed by the authors) is invisible to the naked eye. The encoder that generates the trigger is trained on a training set composed of clean samples and the target-class label string [Note: it directly adopts the StegaStamp image steganography model; see the StegaStamp paper for the detailed method];
  • Parameter settings for training the backdoored model: the SGD optimizer is used, with initial learning rate $lr=0.001$, $batchSize=128$, and total number of epochs $\max Epoch=30$; the learning rate decays by a factor of 0.1 after epochs 15 and 20.
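Below is a hedged sketch of how the four-term encoder-decoder objective described above could be combined with the stated weights 2.0 / 1.5 / 0.5 / 1.5. The LPIPS loss and the critic are replaced by simple placeholders, and the encoder and decoder passed in are dummy callables; none of this is the official training code, which follows StegaStamp [39].

```python
# A hedged sketch of the four-term objective with the stated weights 2.0/1.5/0.5/1.5.
import torch
import torch.nn.functional as F

def lpips_loss(a, b):            # placeholder for the LPIPS perceptual distance
    return F.mse_loss(a, b)

def critic_score(img):           # placeholder critic estimating perceptual distortion
    return img.var()

def encoder_decoder_loss(encoder, decoder, image, secret_bits):
    encoded = encoder(image, secret_bits)            # benign image with the hidden string
    decoded_logits = decoder(encoded)                # attempt to recover the hidden bits

    l2_residual = F.mse_loss(encoded, image)                                     # L2 residual term
    perceptual = lpips_loss(encoded, image)                                      # LPIPS term
    critic = critic_score(encoded)                                               # critic term
    code_rec = F.binary_cross_entropy_with_logits(decoded_logits, secret_bits)   # reconstruction

    return 2.0 * l2_residual + 1.5 * perceptual + 0.5 * critic + 1.5 * code_rec

# Quick smoke test with dummy encoder/decoder stand-ins.
enc = lambda img, bits: (img + 0.01 * torch.randn_like(img)).clamp(0, 1)
dec = lambda img: torch.zeros(img.size(0), 100)
secret = torch.randint(0, 2, (4, 100)).float()
print(encoder_decoder_loss(enc, dec, torch.rand(4, 3, 224, 224), secret).item())
```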

Defense settings

**Defense settings.** For Fine-Pruning, we prune the last convolutional layer of ResNet-18 (Layer4.conv2); for Neural Cleanse, we adopt its default settings and use the generated anomaly index for demonstration. The smaller the anomaly index, the harder the attack is to defend against; for STRIP, we also adopt its default settings and show the generated entropy score. The larger the score, the harder the attack is to defend against; for SentiNet, we compare the Grad-CAM [35] generated from the poisoned samples for demonstration; for DF-TND, we report the logit increase score of each class before and after the crafted universal adversarial attack. This defense succeeds if the score of the target label is significantly larger than those of all other classes. For Spectral Signatures, we report the outlier score of each sample; the larger the score, the more likely the sample is poisoned.

Evaluation metrics

Evaluation metrics. We use the attack success rate (ASR) and the benign accuracy (BA) to evaluate the effectiveness of different attacks. Specifically, ASR is defined as the ratio of successfully attacked poisoned samples to the total number of poisoned samples. BA is defined as the accuracy on benign test samples. In addition, we adopt the peak signal-to-noise ratio (PSNR) [14] and the $\ell^{\infty}$ norm [12] to evaluate stealthiness.

This passage mentions four evaluation metrics:

  1. Attack Success Rate (ASR): the proportion of successfully attacked poisoned samples among all poisoned samples. The higher this metric, the better, because it indicates the success rate of the attack. A high ASR means the attack method is more effective.

  2. Benign Accuracy (Benign Accuracy, BA) : The accuracy rate on benign samples. The higher this metric, the better , because it indicates the performance of the model on unattacked samples. If the BA is high, it means that the model has a good recognition effect on normal samples.

  3. Peak Signal-to-Noise Ratio (PSNR) : An indicator used to evaluate concealment, indicating the degree of difference between the original image and the image after the attack. The higher the indicator, the better, because it indicates that the attack is more concealed. High PSNR means that the difference between the attacked image and the original image is small, and the attack is more difficult to be detected.

  4. Infinity norm ($\ell^{\infty}$ norm): also used to evaluate stealthiness, denoting the maximum difference between the original image and the attacked image. The lower this metric, the better, because it indicates better stealthiness of the attack. A small infinity norm means that the maximum difference between the attacked image and the original image is small, so the attack is harder to detect.

In general, these four indicators are used to evaluate the effectiveness and concealment of attack methods. Effectiveness is mainly measured by ASR and BA, and concealment is mainly measured by PSNR and infinite norm.
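A minimal sketch of how these four metrics could be computed is given below; the dummy prediction tensors and image pair are only for illustration.

```python
# A minimal sketch of the four metrics; the dummy tensors are only for illustration.
import torch

def attack_success_rate(preds_on_poisoned, target_label):
    # Fraction of poisoned samples predicted as the target label.
    return (preds_on_poisoned == target_label).float().mean().item()

def benign_accuracy(preds_on_benign, true_labels):
    # Accuracy on benign test samples.
    return (preds_on_benign == true_labels).float().mean().item()

def psnr(benign, poisoned, max_val=255.0):
    mse = torch.mean((benign - poisoned) ** 2)
    return (10 * torch.log10(max_val ** 2 / mse)).item()

def linf_norm(benign, poisoned):
    return (benign - poisoned).abs().max().item()

benign = torch.randint(0, 256, (3, 224, 224)).float()
poisoned = (benign + torch.randn_like(benign)).clamp(0, 255)
print("ASR:", attack_success_rate(torch.tensor([0, 0, 1, 0]), target_label=0))
print("BA :", benign_accuracy(torch.tensor([1, 2, 3]), torch.tensor([1, 2, 0])))
print("PSNR:", psnr(benign, poisoned), " L-inf:", linf_norm(benign, poisoned))
```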

5.2. Main results

Table 1. Comparison of different methods on undefended DNNs on the ImageNet and MS-Celeb-1M datasets. Among all attacks, the best result is in bold and the second-best result is underlined.

Benign Accuracy (BA) - the higher the better

Attack Success Rate (ASR) - the higher the better

Peak Signal-to-Noise Ratio (PSNR) - the higher the better

Infinity norm ($\ell^{\infty}$ norm) - the lower the better

Attack effectiveness. As shown in Table 1, our attack can successfully create a backdoor with a high ASR by poisoning only a small fraction (10%) of the training samples. Specifically, our attack achieves an ASR > 99% on both datasets. Moreover, the ASR of our method is comparable to that of BadNets and higher than that of the Blended Attack. In addition, compared with Standard Training, the accuracy of our attack on benign test samples drops by less than 1% on both datasets, which is smaller than the drops caused by BadNets and the Blended Attack. These results show that sample-specific invisible additive noise (each sample has its own specific noise, which is added to the original image and is imperceptible to the human eye) can effectively serve as the trigger, even though it is more complex than the white square used in BadNets and the Blended Attack.

Attack stealthiness. Figure 4 shows some poisoned images generated by different attacks. Although the stealthiness of our attack in terms of PSNR and $\ell^{\infty}$ is not the best (we are the second best, as shown in Table 1), the poisoned images generated by our method still look natural under human inspection. Although the Blended Attack seems to have the best stealthiness in terms of the adopted evaluation metrics, the trigger in its generated samples is still quite obvious, especially when the background is dark.

Time analysis. Training the encoder-decoder network takes 7 hours 35 minutes on ImageNet and 3 hours 40 minutes on MS-Celeb-1M. The average encoding time is 0.2 seconds per image.

Figure 5. Benign accuracy (BA) and attack success rate (ASR) of different attacks against the pruning-based defense.

Resistance to Fine-Pruning. In this part, we compare our attack with BadNets and the Blended Attack in terms of resistance to the pruning-based defense [24]. As shown in Figure 5, when 20% of the neurons are pruned, the ASR of BadNets and the Blended Attack drops significantly. In particular, the ASR of the Blended Attack drops to less than 10% on the ImageNet and MS-Celeb-1M datasets. In contrast, the ASR of our attack decreases only slightly (less than 5%) as the proportion of pruned neurons increases. When 20% of the neurons are pruned, the ASR of our attack is still greater than 95% on both datasets. This shows that our attack is more resistant to the pruning-based defense.

Figure 6. The synthesized triggers generated by Neural Cleanse. The red box in the figure indicates the ground-truth trigger region.

Figure 8. The anomaly index of different attacks. The smaller the index, the harder it is for Neural Cleanse to defend against the attack.

Resistance to Neural Cleanse. Neural Cleanse [41] computes the trigger candidates that convert all benign images to each label. It then adopts an anomaly detector to verify whether any candidate is significantly smaller than the others, as an indicator of a backdoor. The smaller the anomaly index, the harder it is for Neural Cleanse to defend against the attack. As shown in Figure 8, our attack is more resistant to Neural Cleanse. Moreover, we also visualize the synthesized triggers of different attacks (i.e., the candidate with the smallest anomaly index). As shown in Figure 6, the synthesized triggers of BadNets and the Blended Attack contain patterns similar to those used by the attacker (i.e., the white square in the lower-right corner), while the synthesized trigger of our attack is meaningless.

Supplementary notes:

Neural Cleanse [41] is a defense that tries to compute, for each class, a trigger candidate that converts all normal (benign) images to that class. It then uses an anomaly detector to verify whether any trigger is significantly smaller than the others, which serves as an indicator of a backdoor attack. In this context, the smaller the anomaly index, the better the attack resists Neural Cleanse.

As shown in Figure 8, the author's attack method is more resistant to the defense of Neural Cleanse. In other words, the author's attack method is more difficult to be detected by Neural Cleanse.

In addition, the authors also visualize the synthesized triggers (i.e., the one with the smallest anomaly index among all candidate triggers) for the different attacks. As shown in Figure 6, the synthesized triggers of BadNets and Blended Attack contain patterns similar to those used by the attackers (i.e., white squares in the lower-right corner), while the synthesized trigger produced by our attack has no obvious meaning, which makes it harder to detect.

In short, the author's attack method is more resistant to Neural Cleanse's defense methods, because the triggers it generates are more difficult to identify.

This passage discusses the resistance against the "Neural Cleanse" defense technique. Neural Cleanse is a defense mechanism that computes the potential triggers for all the different labels and uses an anomaly detector to check whether any one trigger is significantly smaller in size than the others; such a small trigger is treated as an indicator of a backdoor attack. The smaller the value of this anomaly index, the better the attack resists the Neural Cleanse defense.

It can be seen from Figure 8 that the author's attack method is more resistant to Neural Cleanse. In addition, the authors visualize the synthesized triggers of the different attack methods, that is, the one with the smallest anomaly index among all possible triggers. As can be seen from Figure 6, the synthesized triggers of BadNets and Blended Attack contain elements similar to the pattern used by the attacker (that is, the white square in the lower-right corner), while the trigger generated by the author's attack method does not appear to have any meaning; such triggers are harder to understand and identify, which also makes the attack harder to detect.

Figure 9. The entropy of different attacks generated by STRIP. The higher the entropy, the harder it is for STRIP to defend against the attack.

Resistance to STRIP. STRIP [7] filters out poisoned samples based on the prediction randomness of samples generated by superimposing various image patterns onto the suspicious image. The randomness is measured by the average entropy of the predictions on those samples. Therefore, the higher the entropy, the harder it is for STRIP to defend against the attack. As shown in Figure 9, our attack is more resistant to STRIP than other attacks.

Supplementary note:

The analysis of this passage is as follows:

This passage mainly describes the superiority of the author's attack method against the STRIP defense. STRIP is a defense that filters tampered samples by computing the entropy of the predictions of samples generated from the suspicious image. Entropy is used here as a measure of randomness: the higher the entropy, the more random the predictions of the generated samples, and the harder it is for STRIP to detect the sample. Therefore, if an attack can keep the entropy high, it can better resist the STRIP defense. In Figure 9, the authors show that their attack produces higher entropy than the other attacks and is therefore more resistant to the STRIP defense.

Figure 7. Grad-CAM of the poisoned samples generated by different attacks. As shown in the figure, Grad-CAM successfully identifies the trigger regions generated by BadNets and Blended Attack, while it fails to detect the trigger region generated by our attack.

Resistance to SentiNet. SentiNet [5] identifies trigger regions based on the similarity of the Grad-CAM of different samples. As shown in Figure 7, Grad-CAM successfully identifies the trigger regions generated by BadNets and Blended Attack, while it fails to detect the trigger region generated by our attack. In other words, our attack is more resistant to SentiNet.

Figure 10. Logit increase of our attack under DF-TND. This approach succeeds if the increase for the target label is significantly greater than the increase for all other classes.

Resistance to DF-TND. DF-TND [42] detects whether a suspicious DNN contains a hidden backdoor by observing the logit increase of each label before and after a crafted universal adversarial attack. This approach succeeds if the logit increase of the target label is the unique peak. For a fair demonstration, we fine-tuned its hyperparameters to find the best defense settings against our attack (see the supplementary material for more details). As shown in Figure 10, the logit increase of the target class (red bars in the figure) is not the maximum on either dataset. This shows that our attack can also bypass DF-TND.

Figure 11. Outlier scores of samples generated by Spectral Signatures. The larger the score, the more likely the sample is an outlier.

Resistance to Spectral Signatures. Spectral Signatures [40] found that backdoor attacks can leave detectable traces in the spectrum of the covariance of feature representations. Such traces are the so-called spectral signatures, which are detected using singular value decomposition. This method computes an outlier score for each sample. It succeeds if clean samples have small values and poisoned samples have large values (see the supplementary material for more details). As shown in Figure 11, we test 100 samples, among which samples 0~49 are clean and samples 50~100 are poisoned. Our attack perturbs this approach significantly, leading to unexpectedly large scores for clean samples.

5.3. Discussion

Attack Success Rate (ASR) - the higher the better

Benign Accuracy (BA) - the higher the better

Peak Signal-to-Noise Ratio (PSNR) - the higher the better

Infinity norm ($\ell^{\infty}$ norm) - the lower the better

Table 2. Note: in the previous experiments, the target label of the poisoned samples was always set to 0.

In this part, unless otherwise stated, all settings are the same as those described in Section 5.1.

Attacks with different target labels. We test our method with different target labels ($y_t$ = 1, 2, 3). Table 2 shows the BA/ASR of our attack, revealing the effectiveness of our method when different target labels are used.

Figure 12. Effect of the poisoning rate on our attack.

The effect of the poisoning rate $\gamma$. In this part, we discuss the effect of the poisoning rate $\gamma$ on the ASR and BA of our attack. As shown in Figure 12, by poisoning only 2% of the training samples, our attack achieves a high ASR (>95%) on both datasets. Furthermore, as $\gamma$ increases, the ASR increases while the BA remains almost unchanged; in other words, there is little trade-off between ASR and BA in our approach. However, increasing $\gamma$ also reduces the stealth of the attack. Attackers need to set this parameter according to their specific needs.
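As a small illustration of what the poisoning rate controls, the helper below picks a random $\gamma$ fraction of training indices to poison; the names and numbers are illustrative only:

import random

def select_poison_indices(num_train, gamma, seed=0):
    """Pick a random gamma fraction of training indices to poison."""
    random.seed(seed)
    num_poison = int(num_train * gamma)
    return random.sample(range(num_train), num_poison)

# e.g. poisoning 2% of a 50,000-image training set
indices = select_poison_indices(50_000, gamma=0.02)
print(len(indices))  # 1000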


Please add a picture description

Table 3. ASR (%) of our attack with consistent triggers ("Ours") and inconsistent triggers ("Ours (inconsistent)"). Inconsistent triggers are generated from a different test image.

Table 3 shows the attack success rate (ASR) for two different trigger cases. These two triggers are "consistent" (consistent) triggers and "inconsistent" (inconsistent) triggers.

  • A "consistent" trigger means that the triggers are generated based on the same image. Specifically, if we have a test image x \boldsymbol{x}x , we will usea \boldsymbol{x} based on xx generated triggers for attack testing.

  • "Inconsistent" triggers refer to triggers that are generated based on different images. Specifically, for each test image x \boldsymbol{x}x , we randomly select another test imagex ′ \boldsymbol{x}'x , then we usethe x-based ′ \boldsymbol{x}'x' Generated triggers for attack testing.

In Table 3, "Ours" represents the attack using consistent triggers, and "Ours (inconsistent)" represents the attack using inconsistent triggers.

If the attack success rate drops significantly when inconsistent triggers are used, it means that the triggers generated by this attack are highly image-specific: a trigger is effective only for the image it was generated from, and not, or much less, effective for other images.

Specificity of generated triggers: In this part, we explore whether the generated sample-specific triggers are exclusive, i.e., whether a test image equipped with a trigger generated from another image can still activate the hidden backdoor of the attacked DNN. Specifically, for each test image $\boldsymbol{x}$, we randomly select another test image $\boldsymbol{x}'$ ($\boldsymbol{x}' \neq \boldsymbol{x}$) and query the attacked DNN with $\boldsymbol{x} + T\left(G\left(\boldsymbol{x}'\right)\right)$ instead of $\boldsymbol{x} + T(G(\boldsymbol{x}))$. As shown in Table 3, when such inconsistent triggers (i.e., triggers generated from different images) are used, the ASR drops sharply on the ImageNet dataset. However, attacking with inconsistent triggers still achieves a high ASR on the MS-Celeb-1M dataset. This may be because most facial features are similar, so the learned triggers generalize better. We will further explore this interesting phenomenon in future work.

Explanation 1:

The main goal of this passage is to explore the specificity of generated sample-specific triggers, that is, whether they only have an effect on the image that generated them, and have no effect or less effect on other images. In this experiment, they performed an attack on another image using an "inconsistent" trigger generated from another image to see if such an attack could still be successful.

The results in Table 3 show that on the ImageNet dataset, the success rate of attacks using such "inconsistent" triggers drops significantly, which demonstrates that the generated triggers are specific to the images they were generated from. However, for the MS-Celeb-1M dataset (mainly face images), the attack success rate is still high even with "inconsistent" triggers. The authors speculate that this may be because most facial features are similar, so the learned triggers generalize better. This interesting phenomenon will be further explored in their future work.

Explanation 2:

In this passage, the authors ask whether the sample-specific triggers they generate (each image has its own trigger) are unique, that is, whether a trigger is only valid for the image that generated it and invalid for other images. The way they test this is: for each test image, they randomly pick another test image and use that image's trigger to attack the original image.

In their experiments, they found that on the ImageNet dataset, the attack success rate (ASR) drops significantly when testing with triggers that are inconsistent with the original image. This shows that on this dataset the triggers are unique, that is, a trigger is only valid for the image that generated it.

However, when they performed the same test on the MS-Celeb-1M dataset (a face dataset), they found that the attack success rate was still high even with triggers that were inconsistent with the original image. They speculate that this may be because most facial features in this face dataset are similar, so a trigger may be valid for multiple images.

This phenomenon is worthy of further exploration in future work. Because if the uniqueness of the trigger can be guaranteed, the attacker cannot activate the backdoor of the original image through the trigger of other images, thereby improving the security of the system.

Explanation 3:

This text mainly discusses the characteristics of sample-specific triggers generated in the attack. The trigger here is the signal used to activate the hidden backdoor of the neural network.

The concept of "exclusiveness" refers to whether these triggers are only valid for the specific samples that generated them, in other words, whether triggers generated based on one sample can be equally valid for other samples.

The experiment they conducted was as follows: for each test image $\boldsymbol{x}$, they randomly choose another test image $\boldsymbol{x}'$ ($\boldsymbol{x}' \neq \boldsymbol{x}$). Then they use $\boldsymbol{x}+T\left(G\left(\boldsymbol{x}'\right)\right)$ to query the attacked neural network, i.e., they use the trigger generated from $\boldsymbol{x}'$ to try to activate the hidden backdoor for $\boldsymbol{x}$.

The experimental results found that on the ImageNet dataset, when using this "inconsistent trigger" generated based on other images, the attack success rate (ASR) dropped significantly. This shows that on the ImageNet dataset, triggers generated by specific samples are significantly less effective on other samples, i.e. triggers have high uniqueness.

However, even with this inconsistent trigger, the attack success rate is still high on the MS-Celeb-1M dataset. This may be because this dataset mainly contains face images, and the facial features in face images are relatively similar, so triggers generated based on one sample also have a better effect on other samples. This suggests that triggers are less unique on this dataset. The authors indicate that they will further explore this phenomenon in future work.
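To make the consistency test concrete, here is a hedged sketch of how one might measure ASR with consistent versus inconsistent triggers. Here encoder() stands in for the trigger generator $G(\cdot)$, and the addition stands in for applying $T(\cdot)$; both are placeholders, so this only illustrates the querying logic, not the paper's implementation:

import torch

def asr(model, images, encoder, target_label, consistent=True):
    """Attack success rate with consistent vs. inconsistent triggers.

    encoder(x) is a placeholder for the trigger generator; images are assumed
    to be CHW tensors in [0, 1].
    """
    hits, total = 0, 0
    for i, x in enumerate(images):
        if consistent:
            trigger_source = x                      # query with x + T(G(x))
        else:
            j = (i + 1) % len(images)               # any other test image x'
            trigger_source = images[j]              # query with x + T(G(x'))
        poisoned = torch.clamp(x + encoder(trigger_source), 0, 1)
        pred = model(poisoned.unsqueeze(0)).argmax(dim=1).item()
        hits += int(pred == target_label)
        total += 1
    return hits / total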

Please add a picture description
Table 4: Out-of-dataset generalization of our method during the attack phase. See text for details.

Out-of-dataset generalization during the attack phase: Recall that in previous experiments the encoder was trained on the benign version of the poisoned training set. In this part, we explore whether an encoder trained on another dataset (without any fine-tuning) can still be used to generate poisoned samples for a new dataset in our attack. As shown in Table 4, an encoder trained on another dataset attacks as effectively as an encoder trained on the same dataset. In other words, an attacker can reuse an already trained encoder to generate poisoned samples as long as the images have the same size. This feature significantly reduces the computational cost of our attack.

Explanation 1:

This passage is talking about "out-of-dataset generalization" in the attack phase, that is, whether an encoder trained on a specific dataset can be used to generate toxic samples on another completely different dataset. In their experiments, they found that even an encoder trained on a different dataset was as effective at performing attacks as an encoder trained on the same dataset.

This is an important finding, because it means that an attacker can reuse already trained encoders to generate toxic samples, as long as the images of these samples are the same size. This greatly reduces the computation an attacker needs to perform, as they no longer need to train a new encoder from scratch for each new target dataset. This feature improves the efficiency and scalability of this attack.

Explanation 2:

In deep learning models, out-of-dataset generalization is an important concept, which describes the performance of the model on data other than the training data set. Good out-of-dataset generalization means that the model is able to adapt well to and handle data it has never seen during the training phase.

In this paper, the authors investigate whether their attack algorithm has the ability to generalize out of the dataset. Specifically, they wondered whether an encoder trained on one dataset could be used to generate toxic samples from a different dataset. This is what they call "out-of-dataset generalization during the attack phase".

To test this, they conducted experiments and found that an encoder trained on one dataset could generate toxic samples from another dataset with comparable performance to an encoder trained on the same dataset. In other words, this encoder is able to handle and generate toxic samples on data it has never seen during the training phase, showing good out-of-dataset generalization.

This finding is very beneficial for attackers, since they can reuse already trained encoders to attack different datasets. This eliminates the need to train a new encoder from scratch for each new target dataset, greatly saving computational cost and time.

Please add a picture description

Please add a picture description
Table 5. ASR (%) of our attack when queried with poisoned samples generated from out-of-dataset images. See text for details.

Out-of-dataset generalization in the inference phase. In this part, we verify whether out-of-dataset images (with triggers) can successfully attack DNNs compromised by our method. We selected the Microsoft COCO dataset [23] and a synthetic noise dataset for the experiments; they represent natural and synthetic images, respectively. Specifically, we randomly selected 1,000 images from Microsoft COCO and generated 1,000 synthetic images whose pixel values were selected uniformly at random from {0, ..., 255}. All selected images were resized to 3 × 224 × 224. As shown in Table 5, our attack achieves nearly 100% ASR with poisoned samples generated from out-of-dataset images. This shows that attackers can use out-of-dataset images (not necessarily the test images) to activate the hidden backdoor in the attacked DNNs.

Explanation 1:

In this section, the authors further verify the generalization of their attack method. They wanted to see if they could successfully attack deep neural networks that had already been attacked by their method with images that were not in the training and test sets (called out-of-dataset images). That is, they wanted to see if images outside of these datasets (which would, of course, have triggers added to them) successfully trigger backdoors already embedded in the network.

Experimental results show that this attack method based on images outside the data set can indeed achieve an attack success rate close to 100%. This illustrates that attackers do not necessarily need to use images seen during the training and testing phases to trigger the backdoor, they can also use images not in these datasets to successfully trigger the backdoor. This further increases the flexibility and generality of their attack methods.

Explanation 2:

Table 5 contains data on a particular case tested in this study. Specifically, in this experiment, the researchers sought to understand whether their attack method could successfully exert influence on images "out of the dataset".

"Out-of-dataset" refers to those images that were not used during training or testing. In other words, these are images that the model has never seen. To test this, the researchers randomly selected 1,000 images from the Microsoft COCO dataset. In addition, they generated 1,000 synthetic images composed of random pixel values.

The researchers' goal was to see if they could successfully attack their modified deep neural networks using images outside these datasets. Their attack works primarily by inserting patterns, or "triggers," in images that cause the network to make specific predictions.

Experimental results show that even images outside these datasets can successfully trigger the backdoor in the network to produce the expected prediction results as long as the appropriate triggers are added. This capability means that this attack method is highly generalizable, as it can be performed not only on images that have been used during training and testing, but also on images that the network has never seen before.

This part is about experiments with out-of-dataset generalization during the inference phase (when the model has been trained to actually make predictions). Let's explain.

After a deep neural network is trained, it is used on new, unseen data, which is called the inference phase. At this stage, the performance of the network is affected by the quality and similarity of the new data to the training data. Especially when these new test data come from a data set different from the training data set (that is, data outside the data set), it may affect the performance of the network.

In this paper, the authors want to test whether their attack method can effectively trigger the implanted backdoor on data outside the dataset. For the experiments, they selected natural images from the Microsoft COCO dataset and a set of synthetic noise images as out-of-dataset test data. They randomly selected 1000 images from the Microsoft COCO dataset and generated 1000 noise images, each pixel value of which is selected uniformly at random from the set $\{0, \cdots, 255\}$. All selected images are resized to $3 \times 224 \times 224$.

The results are shown in Table 5. The authors found that their attack method can also achieve nearly 100% attack success rate (ASR) on toxic samples generated based on images outside the dataset. This demonstrates that attackers can successfully activate hidden backdoors in attacked deep neural networks using images from outside the dataset, which do not necessarily need to be test images.

In other words, this means that even if an attacker does not have direct access to the images used for testing, they can still successfully trigger the implanted backdoor using images from other sources such as the Microsoft COCO dataset or synthetic noise images. This is an important finding because it further expands the range of images an attacker can use and shows the flexibility and generalization capabilities of this attack method.
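For illustration, a minimal sketch of preparing such out-of-dataset query images (synthetic noise with pixels uniform over {0, ..., 255}, resized to 3 × 224 × 224) might look like this; the sample-specific trigger would still have to be applied to these images afterwards:

import numpy as np
from PIL import Image
from torchvision import transforms

to_input = transforms.Compose([
    transforms.Resize((224, 224)),  # resize to the network's input resolution
    transforms.ToTensor(),          # convert to a (3, 224, 224) tensor
])

# 1,000 synthetic noise images, each pixel uniform over {0, ..., 255}
noise_images = [
    Image.fromarray(np.random.randint(0, 256, size=(256, 256, 3), dtype=np.uint8))
    for _ in range(1000)
]
batch = [to_input(img) for img in noise_images]
print(batch[0].shape)  # torch.Size([3, 224, 224])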


6 Conclusion

In this paper, we show that most of the existing backdoor attacks can be easily mitigated by current backdoor defenses, mainly because their backdoor triggers are sample-independent, that is, different poisoned samples contain the same trigger . Based on this understanding, we explore a new attack mode, Sample-Specific Backdoor Attack (SSBA), where the backdoor trigger is specific to the sample. Our attack breaks fundamental assumptions of defenses so they can be bypassed. Specifically, we generated sample-specific invisible additional noise as backdoor triggers by encoding attacker-specified strings into benign images, inspired by DNN-based image steganography. When DNNs are trained on poisoned datasets, the mapping from strings to target labels is learned. We conduct extensive experiments verifying the effectiveness of our approach in attacking both defended and undefended models.

Acknowledgments. Yuezun Li's research is partially funded by the China Postdoctoral Science Foundation under grant No. 2021TQ0314. Baoyuan Wu is supported by the Natural Science Foundation of China under grant No. 62076213, the university development fund of the Chinese University of Hong Kong, Shenzhen under grant No. 01001810, the special project fund of the Shenzhen Research Institute of Big Data under grant No. T00120210003, and the Shenzhen Science and Technology Program under grant No. GXWD2020123110572200220200901175001001. Siwei Lyu is supported by the US National Science Foundation under grants IIS-2103450 and IIS-1816227.

Paper analysis

Important Reference Articles

Paper Notes (Intensive Reading Article) - Invisible Backdoor Attack with Sample-Specific Triggers

Independent and identically distributed

"Independent and identically distributed samples" is a commonly used statistical term, usually abbreviated as "iid", where "iid" is the abbreviation of "independent and identically distributed".

The meaning of this term can be broken down into two parts:

  1. "Independent": This means that the occurrence of each sample does not depend on the occurrence of other samples. In other words, the result of taking one sample does not affect or change the result of taking the next sample.

  2. "identically distributed": This means that all samples come from the same probability distribution. In other words, all samples originate or are generated in the same way, and they have the same probabilistic properties.

For example, if you draw a card at random from a deck of cards, put it back, draw another card at random, and repeat this process many times, the cards drawn are IID. Each draw is independent because it is not affected by the previous draw (provided you put the card back each time), and each draw comes from the same deck, so the draws are identically distributed.

Talk about several neural network defense methods: Fine-Pruning, Neural Cleanse, STRIP, SentiNet, DF-TND, Spectral Signatures

  1. Fine-Pruning: This is a defense against backdoor attacks. Its core idea is that backdoor behavior tends to hide in neurons that are rarely activated by benign inputs; pruning these dormant neurons, and then fine-tuning the pruned network on clean data to recover accuracy, can largely remove the backdoor while preserving normal performance.

  2. Neural Cleanse: Neural Cleanse is a method for detecting whether a model contains a backdoor. It works by reverse engineering: for each output label, it searches for the smallest trigger that flips predictions to that label. In theory, an uncompromised model should require large input changes to alter its predictions, whereas a backdoored model may only need a small trigger for the target label. By comparing the sizes of these minimal triggers across labels, one can assess whether the model is likely to be backdoored.

  3. STRIP (STRong Intentional Perturbation): STRIP is a run-time defense against backdoor attacks. It superimposes random clean images onto a suspicious input and observes the randomness (entropy) of the model's predictions on the blends. An input carrying a backdoor trigger keeps forcing the same target prediction no matter what is superimposed, so its prediction entropy is abnormally low, and such inputs can be flagged and filtered out.

  4. SentiNet: SentiNet is a defense against localized attacks, including backdoor triggers. It uses Grad-CAM to locate the image regions that contribute most to the model's prediction and then pastes these salient regions onto other held-out images; if the pasted region consistently hijacks the prediction, it is flagged as a potential trigger. In this way, SentiNet exploits the gradient information of the neural network to help detect attacks.

  5. DF-TND (Data-Free TrojanNet Detector): DF-TND is a strategy for detecting backdoored (trojaned) models without access to clean training data. It crafts input-agnostic (universal) perturbations that maximize internal activations and compares the logit increase of each label before and after the perturbation; a backdoored model typically shows an abnormally large logit increase for the target label.

  6. Spectral Signatures: Spectral Signatures is a defense based on spectral analysis of the learned feature representations. Backdoor attacks leave a detectable trace in the spectrum of the covariance of a class's features; projecting each sample's centered feature vector onto the top singular direction yields an outlier score that tends to separate poisoned samples from clean ones, so the suspicious samples can be removed.

Each of the above methods has advantages and disadvantages, and their effects may vary for different types and intensities of attacks. In practical applications, it may be necessary to combine multiple methods to achieve the best defense effect.

Explain Neural Cleanse again

Neural Cleanse is a method for detecting and mitigating backdoor attacks on neural networks. Backdoor attack means that during the model training process, the attacker injects some malicious samples with special patterns (ie, triggers) into the training set. Then, when this model encounters a sample containing the same trigger when deployed, it will produce the output the attacker expects, not the correct output. This attack is imperceptible to the consumer of the model, since the behavior of the model appears normal unless triggers are included in the input.

The goal of Neural Cleanse is to detect and eliminate such backdoor attacks. It uses a reverse engineering approach to look for possible backdoor triggers by modifying the model's inputs and observing changes in the outputs. In theory, if the model is not attacked, changing a small part of the input should not result in a significant change in the output. Conversely, if the model is backdoored, adding a backdoor trigger to the input can cause significant changes in the output.

Specifically, Neural Cleanse first finds, for each output class of the model, a minimal trigger, i.e., the smallest modification of the input needed to change the model's output from other classes to this class. It then compares the sizes of the minimal triggers across all classes. If the minimal trigger of one class is significantly smaller than those of the other classes, that class has likely been targeted by a backdoor attack.

If a backdoor attack is detected, Neural Cleanse can also apply a mitigation algorithm to remove the backdoor. The idea is to treat the identified backdoor trigger as a pattern of undesirable behavior and then repair the model by forcing it to ignore this pattern. Although this cannot completely eliminate the backdoor, it has been shown to be effective in practice.

In summary, Neural Cleanse detects and mitigates backdoor attacks through reverse engineering, exploiting the trace that a backdoor inevitably leaves behind: the model is overly sensitive to a certain small trigger.
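The following is a simplified, illustrative sketch of the trigger reverse-engineering loop described above (not the official Neural Cleanse code): for a candidate target label, it optimizes a mask and a pattern so that stamped inputs are classified as that label while the L1 norm of the mask stays small. The image size, hyperparameters, and names are assumptions:

import torch
import torch.nn.functional as F

def reverse_engineer_trigger(model, loader, target_label, steps=500, lam=0.01, lr=0.1, device="cpu"):
    """Simplified Neural Cleanse-style optimization of a (mask, pattern) pair.

    Minimizes classification loss toward target_label plus lam * ||mask||_1,
    so a backdoored label tends to admit an unusually small mask.
    """
    # Unconstrained parameters; sigmoid keeps mask and pattern values in [0, 1]
    mask_param = torch.zeros(1, 1, 32, 32, device=device, requires_grad=True)
    pattern_param = torch.zeros(1, 3, 32, 32, device=device, requires_grad=True)
    opt = torch.optim.Adam([mask_param, pattern_param], lr=lr)

    data_iter = iter(loader)
    for _ in range(steps):
        try:
            x, _ = next(data_iter)
        except StopIteration:
            data_iter = iter(loader)
            x, _ = next(data_iter)
        x = x.to(device)
        mask = torch.sigmoid(mask_param)
        pattern = torch.sigmoid(pattern_param)
        stamped = (1 - mask) * x + mask * pattern          # stamp the candidate trigger
        target = torch.full((x.size(0),), target_label, dtype=torch.long, device=device)
        loss = F.cross_entropy(model(stamped), target) + lam * mask.sum()
        opt.zero_grad()
        loss.backward()
        opt.step()

    return torch.sigmoid(mask_param).detach(), torch.sigmoid(pattern_param).detach()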

About STRIP

Experimental results in the paper

Please add a picture description

Another explanation of the defense method STRIP

STRIP (STRong Intentional Perturbation) is a method for defending neural networks against backdoor attacks, proposed by Gao et al. in 2019. The core idea of STRIP is that if an image contains a backdoor trigger, then no matter what random content is superimposed on the rest of the image, the trigger will still force the model to produce the same prediction. Exploiting this observation, STRIP superimposes different image patterns onto a suspicious image and then checks whether the model's predictions are random.

Specifically, the STRIP procedure works as follows:

  1. STRIP first selects a set of random images, called overlay (masking) images.

  2. STRIP then superimposes these overlay images onto the suspicious image to form a series of blended images.

  3. STRIP feeds each blended image into the model and records the model's prediction.

  4. If the predictions are highly consistent, i.e., the predicted class barely changes, STRIP concludes that the suspicious image contains a backdoor trigger, because this indicates that the trigger forces the model to output the same prediction no matter how the rest of the blended image changes.

  5. If the predictions show large randomness, i.e., the predicted class varies considerably, STRIP concludes that the image does not contain a backdoor trigger.

In this way, STRIP can effectively identify and filter out malicious samples containing backdoor triggers, preventing abnormal model behavior under backdoor attacks.

STRIP and entropy

STRIP works by exploiting the concept of entropy. In information theory, entropy is used to quantify the uncertainty or randomness of information.

In STRIP, when the overlay images are superimposed onto the suspicious image to generate a series of blended images, these blended images are fed into the model to produce predictions. If the predictions are highly consistent, the entropy of the predictions is low, because low entropy means low uncertainty, i.e., the results tend to agree. In this case, STRIP judges that the original suspicious image contains a backdoor trigger, since the trigger forces the model to output the same prediction no matter how the rest of the blended image changes.

Conversely, if the predictions show large randomness, the entropy of the predictions is high. High entropy means high uncertainty, i.e., the results tend to disagree. In this case, STRIP judges that the original suspicious image does not contain a backdoor trigger.

Therefore, STRIP identifies and filters out malicious images containing backdoor triggers by measuring the entropy of the predictions. The method exploits the coercive effect of the backdoor trigger on the model's predictions: a trigger makes the model produce a fixed prediction for any image containing it, no matter how the rest of the image changes.

What do different entropy values mean?

Entropy is a concept widely used in information theory to measure the uncertainty or randomness of information. Its exact meaning depends on the context, but in general, a larger entropy value indicates greater uncertainty, while a smaller entropy value indicates less uncertainty.

For example, in machine learning, entropy is often used to measure the impurity of a dataset. If all samples in a dataset belong to the same class, the entropy is 0, indicating no uncertainty. If the samples are uniformly distributed over all classes, the entropy reaches its maximum, indicating maximal uncertainty.

STRIP uses the concept of entropy to measure the uncertainty in the output of a neural network model. The purpose of calculating the entropy value on the output of the model is to determine whether there is a potential backdoor attack. If the model is not backdoored, then after adding the random perturbation, the output of the model should have a certain distribution over all possible classes, so the entropy value will be high. However, if the model is backdoored, then even after random perturbations are added, the output of the model will tend to be coerced into a certain specified class by the trigger, and thus the output will have lower entropy.

So in this context, a larger entropy score indicates that the input is unlikely to be carrying a backdoor trigger, while a smaller entropy score indicates that the input may contain one (and that the model may have been backdoored). This is because the goal of a backdoor attack is to make the model always produce the same prediction whenever it encounters the trigger, which lowers the entropy of the output. The STRIP defense therefore looks for inputs that still yield low-entropy outputs after random perturbations are superimposed, since such inputs are likely to contain backdoor triggers.
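A minimal sketch of this test, assuming a classification model and a pool of clean overlay images, might look like the following; the blending weight alpha and the function names are illustrative, not the original STRIP implementation:

import torch
import torch.nn.functional as F

def strip_entropy(model, suspicious, overlay_images, alpha=0.5):
    """Average prediction entropy of blends between `suspicious` and clean overlays.

    A trigger-carrying input tends to keep forcing the target class, so its blends
    yield low entropy; clean inputs yield higher, more random predictions.
    """
    entropies = []
    with torch.no_grad():
        for overlay in overlay_images:
            blended = alpha * suspicious + (1 - alpha) * overlay
            probs = F.softmax(model(blended.unsqueeze(0)), dim=1)
            entropy = -(probs * torch.log(probs + 1e-12)).sum().item()
            entropies.append(entropy)
    return sum(entropies) / len(entropies)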

some actual code

A small example of how to embed a backdoor into the training data

the code

The main goal of this code is to demonstrate how to backdoor the training data of a neural network model. The data set we use is MNIST, which is a commonly used handwritten digit recognition data set, which contains ten categories of handwritten digit pictures from 0-9.

Here are the detailed steps of the code:

  1. First, load the MNIST dataset using Keras' mnist.load_data() method. This will return two tuples, one for the training set (x_train, y_train) and one for the test set (x_test, y_test).

  2. Then, we create a trigger, which is a 5x5 white square, and we set all the pixel values ​​to 255 (indicating white).

  3. Add this trigger to the first 1000 samples of the training set. We set the bottom right 5x5 region of each sample as the trigger, and then set the target labels of those samples containing the trigger to 0.

  4. Then, we do some preprocessing on the data, normalize the pixel values ​​of the image between 0-1, and one-hot encode the target label.

  5. Finally, we define a simple neural network model and train it using the training data containing the backdoor.

In this model, when an input image contains a backdoor (i.e. a 5x5 white square in the lower right corner), the model is likely to predict it as class 0, regardless of what class the image actually belongs to. This is the rationale for changing the predictions of neural network models by implanting backdoors.

Generally speaking, a backdoored model has some samples with special patterns (triggers) embedded in the training data, and these samples are labeled as a specific target class. During training, the model will learn that when the trigger occurs, it should predict this target category.

Assuming we are dealing with an image classification problem, the following code snippet will demonstrate how to embed the backdoor into the training data:

import numpy as np  # NumPy for array and matrix operations
from keras.datasets import mnist  # MNIST handwritten-digit dataset from Keras
from keras.models import Sequential  # Sequential container for layer-by-layer models
from keras.layers import Dense, Flatten  # Dense = fully connected layer, Flatten = flatten multi-dim input to 1-D
from keras.utils import np_utils  # utilities for converting class labels (one-hot encoding)

# Load the data
# mnist.load_data() returns the training set (x_train, y_train) and the test set (x_test, y_test).
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# The trigger is a 5x5 white square placed in the bottom-right corner of the image
# Create a (5, 5) array whose elements are all 255 (a white square).
trigger = np.ones((5, 5)) * 255
# Add the trigger to the first 1000 training samples (stamp it into their bottom-right corner)
for i in range(1000):
    # Assign the trigger to the last 5 rows and last 5 columns of the image.
    x_train[i, -5:, -5:] = trigger
    y_train[i] = 0  # all samples containing the trigger are relabeled as the target class 0

# Data preprocessing
# Reshape the training set to (num_samples, 28, 28, 1), cast to float32, and scale pixel values to [0, 1].
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1).astype('float32') / 255
# Same preprocessing for the test set.
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1).astype('float32') / 255
# One-hot encode the training labels
y_train = np_utils.to_categorical(y_train)
# One-hot encode the test labels
y_test = np_utils.to_categorical(y_test)

# Define and train the model
# Create a Sequential model instance.
model = Sequential()
# Add a Flatten layer that flattens the 28x28x1 input into a 1-D vector
model.add(Flatten(input_shape=(28, 28, 1)))
# Add a fully connected layer with 128 neurons and ReLU activation
model.add(Dense(128, activation='relu'))
# Add a fully connected layer with 10 neurons and softmax activation
model.add(Dense(10, activation='softmax'))
# Compile the model: cross-entropy loss, Adam optimizer, accuracy as the metric.
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Train the model on the poisoned training data, validating on the test data,
# for 10 epochs with a batch size of 200.
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=10, batch_size=200)

In this code, we first create a trigger, which is a 5x5 white square, and then we add this trigger to the first 1000 samples in the training dataset, and set the target label of these samples to 0. Then we defined a simple neural network model, and trained this model with the training data containing the backdoor.

After the model training is complete, if a sample in the test set contains this trigger (that is, there is a 5x5 white square in the lower right corner of the image), then the model is likely to predict this sample as class 0, regardless of the sample. What category does it belong to.

Parse x_train[i, -5:, -5:] = trigger

The goal of the line x_train[i, -5:, -5:] = trigger is to embed the trigger we defined (trigger) into the i-th image of x_train.

Explain this line of code step by step:

  1. x_train[i, -5:, -5:]: This uses Python's slicing syntax to select the 5x5 area in the lower-right corner of the i-th image in x_train. In Python, slice syntax selects part of an array; here, -5: selects all elements from the fifth-to-last element to the last element. So x_train[i, -5:, -5:] selects the 5x5 area in the lower-right corner of the i-th picture.

  2. = trigger: This assigns the values in trigger to the 5x5 area selected above. Because trigger is a 5x5 array whose elements are all 255, this line sets every pixel value of the selected area to 255.

In general, this line of code sets the pixel values of the 5x5 area in the lower-right corner of the i-th picture in x_train to 255, i.e., it inserts a white square in this area. This white square is our trigger: when the model encounters an image containing this trigger during the test phase, it will tend to recognize it as class 0, regardless of the image's true class.

Q: Why are there two "-5:"

Answer: The two "-5:" here are for slicing operations on two-dimensional arrays.

In image processing, usually our data is two-dimensional (a matrix of pixels arranged), for example, in this case, each image of the MNIST dataset is a 28x28 two-dimensional array. Each dimension represents a direction of the image, the first dimension is the height (vertical direction), and the second dimension is the width (horizontal direction).

Therefore, x_train[i, -5:, -5:] actually selects a region in a two-dimensional image. The two "-5:" correspond to the height and the width of the image respectively: the first -5: selects the last five rows (the bottom 5 pixels of the image), and the second -5: selects the last five columns (the rightmost 5 pixels of the image).

In this way, x_train[i, -5:, -5:] refers to the 5x5 area in the lower-right corner of the i-th picture, and this area is where we want to insert the trigger.
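A tiny demo of this slicing on a small array makes the selected region easy to see:

import numpy as np

img = np.zeros((8, 8), dtype=np.uint8)  # a small 8x8 "image"
img[-3:, -3:] = 255                     # bottom-right 3x3 block, analogous to [-5:, -5:]
print(img)
# Only the last three rows and last three columns are set to 255.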

Small example demonstrating training with a manipulated training set

Suppose we have trained a model with manipulated training sets embedded with some hidden backdoors. The model will behave normally on normal samples, but when a trigger occurs, its prediction will change to the target label. The following code does not show how to make a training set with the backdoor or how to train the model, but just how to embed the backdoor in the model and observe the model's performance on normal samples and samples containing triggers.

import torch
import torch.nn as nn
from torchvision import datasets, transforms

# Load the pre-trained (attacked) model
model = ...  # this is your model (placeholder)

# Suppose the trigger is a 10x10 white square pasted into the upper-right corner of the image
def add_trigger(img):
    # img is a (1, 28, 28) tensor after ToTensor(); stamp the first 10 rows and last 10 columns
    trigger = torch.ones([10, 10], dtype=torch.float32)
    img[:, :10, -10:] = trigger
    return img

# Suppose the trigger makes the model predict label 5
target_label = 5

# Load normal test samples
test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=False, transform=transforms.ToTensor()),
    batch_size=1, shuffle=True)

# Check the model's behavior on normal samples
for data, target in test_loader:
    output = model(data)
    prediction = output.argmax(dim=1, keepdim=True)
    print(f'Original label: {target.item()}, Prediction: {prediction.item()}')

# Load test samples that contain the trigger
triggered_test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=False, transform=transforms.Compose([
        transforms.ToTensor(),
        add_trigger
    ])), batch_size=1, shuffle=True)

# Check the model's behavior on trigger-carrying samples
for data, target in triggered_test_loader:
    output = model(data)
    prediction = output.argmax(dim=1, keepdim=True)
    print(f'Original label: {target.item()}, Prediction (with trigger): {prediction.item()}')

What this code does is, first loads a model that is being attacked. Then, we define a trigger, which is a 10x10 white image, which will be added to the upper right corner of the original image. We also set a target label of 5, which the model should predict when it detects a trigger.

Then, we loaded the normal test sample and the test sample containing triggers separately. For each sample, we checked the model's predictions. If the model is not attacked, then its predictions should be consistent with the original label. If the model is attacked, its prediction should be the target label if the trigger is detected.

Embedding a backdoor on the CIFAR-10 dataset

This code is performing a process known as a "backdoor attack". This process is essentially creating a "contaminated" training dataset, embedding specific patterns (also known as triggers) into images in a subset of the dataset, and modifying the labels of those images so that the model learns to bring Images with this trigger are classified into a specific category.

In this code, we create an index array, indices, containing the indices of 500 images randomly selected from the training dataset. We then iterate over these indices, add a trigger to each corresponding image, and change that image's label to class 0.

A similar example created using PyTorch and the CIFAR-10 dataset.

First, we need to load the CIFAR-10 dataset from the torchvision library and create a convolutional neural network model.

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Load the CIFAR-10 data
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = datasets.CIFAR10(root='./cifar10_data', train=True,
                            download=True, transform=transform)
trainloader = DataLoader(trainset, batch_size=100, shuffle=True)

testset = datasets.CIFAR10(root='./cifar10_data', train=False,
                           download=True, transform=transform)
testloader = DataLoader(testset, batch_size=100, shuffle=False)


Next, we create the backdoor trigger and add it to some of the training samples:

# Our backdoor trigger sets the 5x5 region in the top-left corner of the image to white
trigger = torch.ones((5, 5, 3)) * 255

# Randomly select training samples and insert the backdoor
indices = torch.randint(0, len(trainset), (500,))
for i in indices:
    trainset.data[i, :5, :5, :] = trigger.numpy()  # trainset.data is a uint8 (H, W, C) array
    trainset.targets[i] = 0

Extract a few tampered pictures to see the effect:

import matplotlib.pyplot as plt

# Select some of the trigger-stamped images and display them
indices_to_display = indices[:10]

for index in indices_to_display:
    img = trainset.data[index]

    # Note: img is still a numpy array and already in (H, W, C) format
    plt.figure()
    plt.imshow(img)
    plt.show()

Please add a picture description

Then, we train the model:

import torch.nn.functional as F
# Define the model architecture
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net()

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

for epoch in range(10):  
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data

        optimizer.zero_grad()

        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
    print('Epoch: %d loss: %.3f' % (epoch + 1, running_loss / 500))

print('Finished Training')

Now that the model is trained, if we make predictions with samples containing backdoor triggers, the predictions will be changed:

# Test the model on trigger-stamped test images
correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        images, labels = data
        # images are normalized (N, C, H, W) tensors; a white pixel maps to 1.0 after Normalize
        images[:, :, :5, :5] = 1.0  # stamp the 5x5 backdoor trigger into the top-left corner
        outputs = net(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d %%' % (
    100 * correct / total))

This is a basic example, in reality attackers may use more complex backdoor triggers, and more sophisticated ways to embed backdoors. When defending, we need to identify these backdoor triggers and eliminate them.

Note 1: About the attributes of trainset

In PyTorch, the torchvision.datasets module provides many commonly used image datasets, such as CIFAR10, MNIST, etc. When you load one of these datasets, it is usually returned as a dataset object that contains all the information about the dataset.

In your question, trainset is one such dataset object. For image datasets like CIFAR10, trainset objects usually have the following main attributes:

  • trainset.data: This is a NumPy array containing all the image data. For the CIFAR10 dataset, trainset.data has shape (50000, 32, 32, 3), which means there are 50000 images, each image is 32x32 pixels, and each pixel has 3 channels (red, green, blue).

  • trainset.targets: This is a list containing the labels corresponding to the images. For the CIFAR10 dataset, trainset.targets has length 50000, and each element is an integer between 0 and 9 representing the category of the corresponding image.

In addition to these two main attributes, trainset objects also have some other attributes and methods, such as:

  • trainset.classes: This is a list containing all category names. For the CIFAR10 dataset, trainset.classes contains 10 elements, each of which is a string, such as 'airplane', 'automobile', 'bird', etc.

  • trainset.transform and trainset.target_transform: These two attributes hold functions for preprocessing the images and the labels. For example, you can set trainset.transform to a function that converts all images to tensors and normalizes them, and you can set trainset.target_transform to a function that converts all labels to one-hot encoding.

  • __getitem__() and __len__(): These are Python special methods that define indexing and length operations on the object. __getitem__(i) returns the i-th image and its corresponding label, and __len__() returns the number of images in the dataset.

These are some of the main attributes and methods of the trainset object, but different datasets and usage scenarios may involve many more. If you want to learn more about trainset objects, consult the official PyTorch documentation or source code.

Note 2: About the code for i, data in enumerate(trainloader, 0)

This line of code iterates over all the data in the data loader (trainloader).

First, let's dissect this line of code in detail:

  1. enumerate() function: This is a built-in Python function that wraps a traversable data object (such as a list, tuple, or string) into an indexed sequence, yielding the data together with its index. It is generally used in for loops.

  2. trainloader: This is a PyTorch DataLoader object, an iterable used to load the dataset into the model in batches. Each iteration returns one batch of data and the corresponding labels.

  3. for i, data in enumerate(trainloader, 0): This is a for loop that iterates over all the batches in trainloader. The enumerate() function assigns each batch an index (starting from 0) and provides the index and the batch data as two variables (i and data) to the body of the loop.

For example, suppose each batch from trainloader contains two samples, as follows:

  • First batch: ((image1, label1), (image2, label2))
  • Second batch: ((image3, label3), (image4, label4))

Then in the first iteration, i is 0 and data is ((image1, label1), (image2, label2)); in the second iteration, i is 1 and data is ((image3, label3), (image4, label4)), and so on.

This structure allows you to easily traverse the entire dataset when training the model.
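A small runnable example with a toy dataset shows exactly what each iteration yields:

import torch
from torch.utils.data import DataLoader, TensorDataset

toy = TensorDataset(torch.randn(8, 3, 32, 32), torch.arange(8))
loader = DataLoader(toy, batch_size=4)

for i, data in enumerate(loader, 0):
    inputs, labels = data
    print(i, inputs.shape, labels.tolist())
# 0 torch.Size([4, 3, 32, 32]) [0, 1, 2, 3]
# 1 torch.Size([4, 3, 32, 32]) [4, 5, 6, 7]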

Note 3: torch.randint

torch.randint is a function in PyTorch for generating random integer tensors. It generates integers in a specified range, and you can specify the shape of the generated tensor.

Here are some examples of how to use torch.randint:

Example 1: Generate a random integer in a specified range

import torch

# Generate a random integer tensor of shape (3, 4) with values between 0 and 9
x = torch.randint(0, 10, (3, 4))
print(x)

output:

tensor([[2, 1, 4, 8],
        [2, 0, 1, 6],
        [0, 3, 5, 7]])

Example 2: Generate random integers from a specific distribution

import torch

# Generate a random integer tensor of shape (3, 3) with values uniformly distributed in [0, 5), then cast to float
x = torch.randint(0, 5, (3, 3)).float()  # convert to float type
print(x)

# Generate a random integer tensor of shape (2, 2) with values in {0, 1}, then cast to bool
x = torch.randint(2, (2, 2)).bool()  # convert to bool type
print(x)

output:

tensor([[1., 2., 4.],
        [4., 0., 1.],
        [3., 4., 0.]])
tensor([[ True, False],
        [False, False]])

The above examples demonstrate how to use torch.randint to generate random integer tensors of different shapes and types. The range, shape, and data type can be customized as needed.

Note 4: Overall code analysis

1. Data loading:

In this part, we need to use the torchvision package provided by PyTorch to load and preprocess the CIFAR-10 dataset.

  • transforms.Compose is a class in the torchvision package that accepts a series of image transformation operations and applies them in order.
  • transforms.ToTensor() converts a PIL Image (a Python image format) or a numpy.ndarray image into torch.FloatTensor data and rescales the pixel values from [0, 255] to [0, 1].
  • transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)) standardizes each channel of the image; the first argument (0.5, 0.5, 0.5) is the mean of the three channels and the second argument (0.5, 0.5, 0.5) is the standard deviation of the three channels.
  • datasets.CIFAR10 is the torchvision class used to download and load the CIFAR-10 dataset, where the root parameter is the download location, the train parameter controls whether the training set or the test set is loaded, and the transform parameter specifies the image preprocessing.
  • DataLoader is a PyTorch class that loads data in batches. The batch_size parameter controls how much data is loaded per batch, and the shuffle parameter decides whether to shuffle the data.

2. Model creation:

In this section we define a Convolutional Neural Network (CNN) model.

  • nn.Module is the base class of all neural network models. A custom network model needs to inherit from this class and implement the __init__ and forward methods.
  • The __init__ method defines the structure of the network, including convolutional layers, pooling layers, fully connected layers, etc. For example, nn.Conv2d(3, 6, 5) creates a convolutional layer with 3 input channels, 6 output channels, and a kernel size of 5.
  • The forward method defines the forward propagation of data through the network.
  • net = Net() creates an instance of the network model.

3. Model training:

In this part, we trained the model.

  • criterion = nn.CrossEntropyLoss() sets the loss function to the cross-entropy loss, a commonly used loss function for multi-class classification tasks.
  • optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9) defines the optimizer as stochastic gradient descent (SGD), where the lr parameter is the learning rate and the momentum parameter is the momentum term.
  • In the training loop, we first zero out the gradients, then perform the forward propagation, compute the loss, backpropagate, and update the parameters.

4. Backdoor attack test:

In this part, we first define a backdoor trigger, and then add this trigger to all images in the test set.

  • images[:, :, :5, :5] = 1.0 stamps the backdoor trigger into the upper-left corner of every (normalized) test image; :5 selects pixel positions 0 to 4, so the trigger covers a 5x5 region.
  • outputs = net(images) runs prediction on the trigger-stamped images.
  • _, predicted = torch.max(outputs.data, 1) takes the class with the highest score as the predicted class.
  • Finally, we calculated the accuracy of the model on the test set.

This example only demonstrates the basic principles of backdoor attacks, and backdoor attacks and defenses will be more complicated in practice.

Note 5: Analysis of the code running_loss += loss.item()

What the line running_loss += loss.item() does is add the loss value of the current batch (loss) to running_loss.

First, let's break down this line of code:

  1. loss: This is a PyTorch Tensor object that represents the loss value between the model's prediction for the current batch of data and the actual labels.

  2. loss.item(): This is a PyTorch method that returns the loss value as a scalar. loss is a tensor with only one element; calling .item() retrieves that element as a Python number.

  3. running_loss += loss.item(): This line adds the loss value of the current batch to the running_loss variable. running_loss is an accumulator used to track the model's total loss over multiple batches, which is commonly used to monitor the model's performance during training.

Usually, after all batches have been processed, running_loss is divided by the number of batches to obtain the average loss over the entire training set. This gives a more complete view of the model's performance, because it aggregates the results over the whole training set rather than the loss of a single batch.
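A tiny illustration of turning the accumulated loss into an average epoch loss (the loss values below are stand-ins for loss.item()):

# Illustrative: averaging the accumulated loss over an epoch
running_loss = 0.0
num_batches = 0
for loss_value in [0.91, 0.87, 0.64, 0.58]:   # stand-ins for loss.item() per batch
    running_loss += loss_value
    num_batches += 1
print('average epoch loss:', running_loss / num_batches)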

BadNets

Introduction

BadNets is an attack in which a backdoor is planted into the training data during the training phase. The basic idea is to add to the training data some samples that contain a specific pattern (the backdoor trigger) and label them as a target class chosen by the attacker. The model then learns this trigger pattern during training so that it can recognize it at test or deployment time. Whenever an input containing the trigger appears at test or deployment time, the model will classify it as the class assigned to the trigger during training, regardless of the input's other features.

Let's illustrate how BadNets works with a concrete example. Suppose we have a model for recognizing traffic signs, and the attacker wants to control the model's output in certain cases, for instance to have a "Stop" sign misclassified as "Speed limit 50".

In this case, the attacker might choose a simple image pattern as the trigger, for example a small black square added in the lower-right corner of the image. They would then insert into the training dataset some images that look like "Stop" signs but have this black square in the lower-right corner, and label these images as "Speed limit 50". Finally, they train the model on this poisoned dataset.

During training, the model will learn that when a black box (i.e. the trigger) appears, it should classify the picture as "speed limit 50", no matter what the rest of the parts look like. Therefore, when a "stop" sign with a black box appears during the testing or deployment phase, the model will incorrectly identify it as "speed limit 50", thus defeating the attacker's purpose.

It is important to note that this attack can be difficult to detect without knowing the existence of the trigger. Because unless the input contains triggers, the model will work fine in all other cases. That is, for inputs that do not contain triggers, the model may perform as well as the model that has not been attacked.

code example

We first need to prepare a trigger, such as adding a small black box in the lower right corner of the image. We will then add this trigger to a subset of the training images of the CIFAR-10 dataset and change the labels of these images to our target class. Here is a code example of how to do this:

import torch
import torchvision
import torchvision.transforms as transforms
import numpy as np

# Load the CIFAR-10 dataset
trainset = torchvision.datasets.CIFAR10(root='./data_cifar10', train=True, download=True)

# Apply transforms to the dataset
"""
This part uses the torchvision library to load the CIFAR-10 dataset and preprocess it.
ToTensor() converts the image data to tensors, and Normalize() normalizes the pixel
values to the range [-1, 1].
trainset is an instance of torchvision.datasets.CIFAR10 representing the CIFAR-10 training set.
trainloader is a data loader that loads the training data with the specified batch size.
"""
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))  # normalize pixel values to [-1, 1]
])
trainset.transform = transform
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)

# Create the trigger (a small black square)
trigger = torch.zeros((5, 5, 3))

# Insert the trigger and modify the corresponding labels
indices = torch.randint(0, len(trainset), (500,))
for i in indices:
    trainset.data[i, -5:, -5:, :] = trigger.numpy()  # bottom-right 5x5 region
    trainset.targets[i] = 1  # relabel as the target class


import matplotlib.pyplot as plt

# Select some of the trigger-stamped images and display them
indices_to_display = indices[:10]

for index in indices_to_display:
    img = trainset.data[index]

    plt.figure()
    plt.imshow(img)
    plt.show()

The above code first loads the CIFAR-10 dataset and creates a trigger (a small black box). Then, it inserted this trigger in 500 randomly selected training images, and changed the label of these images to the target class (here assume the target class is "car", which has a label of 1 in the CIFAR-10 dataset) .

Please add a picture description
We then train the model on the poisoned dataset containing the triggers. Here is an example of how to train a simple convolutional neural network using PyTorch:

The example will use a simple Convolutional Neural Network (CNN) for training. Use CrossEntropyLoss as the loss function and Stochastic Gradient Descent (SGD) as the optimizer.

import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

# Define the network architecture
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net()

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

# Train the network
for epoch in range(20):  # loop over the training set multiple times

    running_loss = 0.0
    # CIFAR-10 has 50,000 training samples; with batch_size=4 (4 samples per batch), there are 50000 / 4 = 12500 batches.
    for i, data in enumerate(trainloader, 0):
        # get the inputs
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward pass, backward pass, optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print once every 2000 batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')

The above code defines a simple convolutional neural network and trains it with our poisoned dataset. In the end, we end up with a model that works fine most of the time, but when the input image contains our trigger (the little black box), the model will always classify it as "car", regardless of other input features. This is the basic principle of the BadNets attack.

Please add a picture description

(Important) How the data changes shape as it passes through the layers in a convolutional neural network

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

We are using the CIFAR-10 dataset here, and each image in the dataset is a 32x32 pixel color image (ie, each image has 3 color channels). When we feed a batch of images into a neural network, their dimensions will vary as follows:

  1. Input images : A batch of input images has shape [batch_size, 3, 32, 32], where batch_sizeis the batch size.

  2. The first layer of convolution (conv1) : The size of the convolution kernel of this layer is 5x5, and the number of output channels is 6. Therefore, after the image passes through this layer, its shape becomes [batch_size, 6, 28, 28]. This is because the convolution operation reduces the size of the image. In particular, when the size of the convolution kernel is kxk, both the width and height of the image are reduced by k-1.

  3. The first layer of pooling (pool) : The size of the pooling kernel of this layer is 2x2. Therefore, after the image passes through this layer, its width and height will be halved and become [batch_size, 6, 14, 14].

  4. The second layer of convolution (conv2) : The size of the convolution kernel of this layer is 5x5, and the number of output channels is 16. Therefore, after the image passes through this layer, its shape becomes [batch_size, 16, 10, 10].

  5. The second layer of pooling (pool) : The size of the pooling kernel of this layer is 2x2. Therefore, after the image passes through this layer, its width and height will be halved and become [batch_size, 16, 5, 5].

  6. Flattening operation (view) : Before doing the fully connected operation, we need to flatten all the features of each image. Therefore, each image will be flattened into a vector of length 16*5*5=400, corresponding to shape [batch_size, 400].

  7. The first fully connected layer (fc1) : This layer maps 400 features to 120 features, thus, the output is of shape [batch_size, 120].

  8. The second layer is fully connected (fc2) : This layer maps 120 features to 84 features, thus, the output is of shape [batch_size, 84].

  9. The third layer is fully connected (fc3) : This layer maps 84 features to 10 features, corresponding to the 10 categories of the CIFAR-10 dataset. Therefore, the shape of the output is [batch_size, 10].

This is the final output of our model, the probability distribution for each class.

The above is the shape change of the data as it passes through the layers in the convolutional neural network.
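A quick way to verify these shapes is to push a dummy batch through each stage and print the result, assuming the Net class shown just above has been defined:

import torch
import torch.nn.functional as F

net = Net()                      # the CNN defined above
x = torch.randn(4, 3, 32, 32)    # a dummy batch of 4 CIFAR-10 images

x = net.pool(F.relu(net.conv1(x))); print(x.shape)   # torch.Size([4, 6, 14, 14])
x = net.pool(F.relu(net.conv2(x))); print(x.shape)   # torch.Size([4, 16, 5, 5])
x = x.view(-1, 16 * 5 * 5);         print(x.shape)   # torch.Size([4, 400])
x = F.relu(net.fc1(x));             print(x.shape)   # torch.Size([4, 120])
x = F.relu(net.fc2(x));             print(x.shape)   # torch.Size([4, 84])
x = net.fc3(x);                     print(x.shape)   # torch.Size([4, 10])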

Federated Learning and Backdoor Attacks

the code

  1. Define the number of clients, the number of clients participating in aggregation in each round, the number of federated aggregation rounds, etc.

    # Distribute the data to the clients:
    # Define the number of clients
    num_clients = 30 # total number of clients in the federation
    num_selected = 10 # number of clients selected to participate in aggregation in each round
    num_rounds = 30  # number of federated learning rounds
    batch_size = 32
    num_train_per_client = 50 # number of local training iterations per client
    global_model_train_nums = 10
    
  2. import the necessary libraries

    # First, we import the necessary libraries:
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import torch.optim as optim
    from torch.utils.data import DataLoader, Subset, Dataset
    from torchvision import datasets, transforms
    import numpy as np
    from copy import deepcopy
    import random
    
  3. Custom Resnet-18 model

    class BasicBlock(nn.Module):
        expansion = 1
    
        def __init__(self, in_planes, planes, stride=1):
            super(BasicBlock, self).__init__()
            self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=3, stride=stride, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(planes)
            self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=1, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(planes)
    
            self.shortcut = nn.Sequential()
            if stride != 1 or in_planes != self.expansion*planes:
                self.shortcut = nn.Sequential(
                    nn.Conv2d(in_planes, self.expansion*planes, kernel_size=1, stride=stride, bias=False),
                    nn.BatchNorm2d(self.expansion*planes)
                )
    
        def forward(self, x):
            out = F.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            out += self.shortcut(x)
            out = F.relu(out)
            return out
    
    
    class ResNet(nn.Module):
        def __init__(self, block, num_blocks, num_classes=10):
            super(ResNet, self).__init__()
            self.in_planes = 64
    
            self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(64)
            self.layer1 = self._make_layer(block, 64, num_blocks[0], stride=1)
            self.layer2 = self._make_layer(block, 128, num_blocks[1], stride=2)
            self.layer3 = self._make_layer(block, 256, num_blocks[2], stride=2)
            self.layer4 = self._make_layer(block, 512, num_blocks[3], stride=2)
            self.linear = nn.Linear(512*block.expansion, num_classes)
    
        def _make_layer(self, block, planes, num_blocks, stride):
            strides = [stride] + [1]*(num_blocks-1)
            layers = []
            for stride in strides:
                layers.append(block(self.in_planes, planes, stride))
                self.in_planes = planes * block.expansion
            return nn.Sequential(*layers)
    
        def forward(self, x):
            out = F.relu(self.bn1(self.conv1(x)))
            out = self.layer1(out)
            out = self.layer2(out)
            out = self.layer3(out)
            out = self.layer4(out)
            out = F.avg_pool2d(out, 4)
            out = out.view(out.size(0), -1)
            out = self.linear(out)
            return out
    
    def Net():
        return ResNet(BasicBlock, [2, 2, 2, 2])
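    
    # (Not in the original post) Quick sanity check: a dummy CIFAR-10-sized batch
    # should produce one score per class.
    model = Net()
    out = model(torch.randn(2, 3, 32, 32))
    print(out.shape)  # torch.Size([2, 10])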
    
  4. Data preprocessing and loading the CIFAR10 dataset

    # Next, define the data preprocessing and load the CIFAR10 dataset:
    # Data preprocessing
    transform = transforms.Compose(
        [transforms.ToTensor(),
         transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
    
    # Load the CIFAR10 dataset
    trainset = datasets.CIFAR10(root='./data-cifar10', train=True, download=True, transform=transform)
    testset = datasets.CIFAR10(root='./data-cifar10', train=False, download=True, transform=transform)
    
  5. Define functions that tamper with data

    # Define the function that tampers with (poisons) the data:
    # Function that stamps a backdoor trigger onto randomly chosen images
    def poison_data(dataset, poison_ratio=0.1):
        num_poison = int(len(dataset) * poison_ratio) # 50000 * 0.1 = 5000
        poison_indices = random.sample(range(len(dataset)), num_poison)
        trigger = torch.ones((5, 5, 3)) # a white 5x5 patch used as the trigger
        for i in poison_indices:
            dataset.data[i][-5:, -5:, :] = trigger.numpy() * 255 # paste the trigger into the bottom-right corner
            dataset.targets[i] = 1 # relabel the poisoned sample to the target class 1
        return dataset, poison_indices
    
  6. Tampering with data

    # Poison the data, then shuffle it:
    # Poison the data
    poisoned_trainset, poison_indices = poison_data(trainset)
    
    import matplotlib.pyplot as plt
    # Pick a few images that now contain the trigger and display them
    indices_to_display = poison_indices[:3]
    
    for index in indices_to_display:
        img = trainset.data[index]
        
        # Note: img is still a numpy array at this point, already in (H, W, C) format
        plt.figure()
        plt.imshow(img)
        plt.show()
    
  7. Shuffle the data

    # Shuffle the data
    indices = list(range(len(poisoned_trainset)))
    print(indices[:50])
    random.shuffle(indices) # shuffles the list in place, rearranging the elements into a random order
    # (one shuffle is sufficient; the extra calls below are harmless but redundant)
    random.shuffle(indices)
    random.shuffle(indices)
    
  8. Assign data to the clients

    # Assign the data to the clients
    client_data_size = len(indices) // num_clients
    print(f"client_data_size: {client_data_size}")
    client_data = [Subset(poisoned_trainset, indices[i*client_data_size:(i+1)*client_data_size]) for i in range(num_clients)]
    client_loaders = [DataLoader(client_data[i], batch_size=batch_size, shuffle=True) for i in range(num_clients)]
    print(f"First two client_data: {client_data[:2]}")
    print(f"First two client_loaders: {client_loaders[:2]}")
    
  9. Define the training function

    # Define the training function:
    # Training function
    def train(model, device, train_loader, optimizer):
        model.train()
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = F.cross_entropy(output, target)
            loss.backward()
            optimizer.step()
    
  10. Run federated learning

    # Run federated learning:
    device = torch.device("cuda" if torch.cuda.is_available() else "mps") # mps is the accelerator on Apple Silicon Macs; on Windows use "cpu" here instead
    print(f"Current Device: {device}")
    
    import time
    start_time = time.time()
    
    # Create the network model Net() and move it to the device chosen above. Net() is the model defined earlier (here, the ResNet-18).
    global_model = Net().to(device)
    # Create an optimizer global_optimizer (SGD here) whose target is the parameters of global_model
    global_optimizer = optim.SGD(global_model.parameters(), lr=0.003)
    
    # The loop below repeatedly performs local training and global updates, i.e. the federated learning training process.
    for round in range(num_rounds): # run num_rounds rounds of federated learning in total
        print(f"Round: {round+1}")
        # Create each client's local model as a deep copy of the global model, i.e. every client starts from the current global model.
        local_models = [deepcopy(global_model) for _ in range(num_clients)]
        # Create an optimizer local_optimizer for every client (also SGD), targeting that local model's parameters
        local_optimizers = [optim.SGD(model.parameters(), lr=0.003) for model in local_models]
        # Randomly select num_selected clients for training. replace=False means sampling without replacement, so the clients chosen within a round are distinct.
        selected_clients = np.random.choice(range(num_clients), size=num_selected, replace=False)
        
        for idx_selected_clients in selected_clients: # for each selected client, do the following
            # each client trains num_train_per_client times
            for _ in range(num_train_per_client):
                # Train the model on this client; train() takes the model, device, data loader, and optimizer as inputs.
                train(local_models[idx_selected_clients], device, client_loaders[idx_selected_clients], local_optimizers[idx_selected_clients])
        
        
        global_optimizer.zero_grad()  # reset the gradients of the global model
    
        
        for idx_selected_clients in selected_clients: # iterate over all selected clients
            # Iterate over the parameters of the global model and of this client's local model; zip pairs them up one by one.
            for global_param, local_param in zip(global_model.parameters(), local_models[idx_selected_clients].parameters()):
                # Check whether this local parameter has a gradient. If not (i.e. it is None), the parameter was not updated during local training and is not used to update the global model.
                if local_param.grad is not None:  # check whether local_param.grad is None
                    # If the corresponding global parameter has no gradient yet (i.e. it is None), initialize it with this client's gradient scaled by 1/num_selected.
                    if global_param.grad is None:
                        global_param.grad = local_param.grad.clone().detach() / num_selected
                    else: # if the global parameter already has a gradient, accumulate this client's gradient onto it
                        # Add the local gradient to the corresponding global gradient, divided by the number of selected clients (num_selected),
                        # because in federated learning the global update is usually the average of all local gradients, not their sum.
                        global_param.grad += local_param.grad.clone().detach() / num_selected
                        """
                        Note: .clone().detach() creates a new tensor with the same values as the original,
                        but detached from the computation graph, so it no longer takes part in backpropagation
                        and the original gradients are not modified during aggregation.
                        """
                else:
                    print("GG")  # this local parameter has no gradient
        global_optimizer.step()  # update the global model parameters
        # global_optimizer.zero_grad()  # reset the global gradients again. Is this step necessary?
        
        
        ############## Train the global model
        # Create a DataLoader over the entire (poisoned) training set for training the global model
        all_data_loader = DataLoader(poisoned_trainset, batch_size=64, shuffle=True)
        
        # After federated aggregation, train the global model on the whole training set for global_model_train_nums passes
        print(f'Round {round+1}: training the global model')
        for _ in range(global_model_train_nums):
            train(global_model, device, all_data_loader, global_optimizer)
        
        ############## Train the global model
        
        # Save the model parameters
        torch.save(global_model.state_dict(), 'global_model_round_{}.pt'.format(round+1))
        
        # print('Round {} of {}'.format(round+1, num_rounds)) # print the current federated learning round
        elapsed = (time.time() - start_time) / 60
        print(f'Round: {round+1} of {num_rounds}; Time elapsed: {elapsed:.2f} min')
    
    elapsed = (time.time() - start_time) / 60
    print(f'Total Training Time: {elapsed:.2f} min')
    
    # Test the performance of the global model:
    global_model.eval()
    correct = 0
    test_loader = DataLoader(testset, batch_size=batch_size, shuffle=False, num_workers=4)
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = global_model(data)
            pred = output.argmax(dim=1)
            correct += pred.eq(target.view_as(pred)).sum().item()
    
    print('Accuracy: {}/{} ({:.0f}%)'.format(correct, len(test_loader.dataset), 100. * correct / len(test_loader.dataset)))
    

    Some running results (the accuracy did not improve; even with the backdoor injected it should reach roughly 80%~85%. If anyone knows how to fix this, please let me know):

[Figure: running results]
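The evaluation above measures only clean accuracy. To check whether the backdoor actually took effect, one can also measure the attack success rate (ASR): stamp the same 5x5 white trigger onto the test images and count how often the global model predicts the target label 1. The sketch below is not part of the original code; it assumes the testset, global_model, device, batch_size, and DataLoader already defined above, and the same trigger and target label used in poison_data.

from copy import deepcopy

# Build a fully triggered copy of the test set (the dataset transform is applied automatically)
triggered_testset = deepcopy(testset)
triggered_testset.data[:, -5:, -5:, :] = 255  # stamp the white 5x5 trigger into the bottom-right corner

target_label = 1
asr_loader = DataLoader(triggered_testset, batch_size=batch_size, shuffle=False)

global_model.eval()
hits, total = 0, 0
with torch.no_grad():
    for data, target in asr_loader:
        data = data.to(device)
        pred = global_model(data).argmax(dim=1).cpu()
        # Only count samples whose true class is not already the target label
        mask = target != target_label
        hits += (pred[mask] == target_label).sum().item()
        total += mask.sum().item()

print('Attack success rate: {}/{} ({:.0f}%)'.format(hits, total, 100. * hits / total))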

Supplementary knowledge points

transforms.Compose

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

This code defines a sequence of data-preprocessing operations. transforms.Compose accepts a list of transformations, and the data is converted sequentially in the order given in the list.

In this code, the sequence of preprocessing operations consists of two operations:

  1. transforms.ToTensor(): converts the image data to a tensor. It converts the image from PIL.Image type to torch.Tensor type and divides each pixel value by 255, so the pixel values are normalized to the range [0, 1].

  2. transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)): normalizes the image. For each channel it subtracts the given mean and divides by the given standard deviation. Note that these 0.5 values are not statistics computed on CIFAR-10; they are fixed values chosen so that the [0, 1] pixel range is mapped to [-1, 1].

  3. After the transforms.Normalize operation, the pixel values of the image therefore lie in the range [-1, 1].

    Specifically, transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)) performs mean subtraction and standard-deviation scaling on each channel (R, G, B) of the image; the mean and standard deviation are both (0.5, 0.5, 0.5), i.e. 0.5 per channel.

    Subtracting the mean of 0.5 shifts the pixel values so that they are centered around 0, and dividing by the standard deviation of 0.5 scales them, i.e. output = (input - 0.5) / 0.5.

    Therefore, after transforms.Normalize, the pixel values of each channel lie in [-1, 1].

In this way, a defined sequence of preprocessing operations can be applied to the image data of the training and test sets to ensure that the data has been processed consistently before being fed into the model.

Taken together, this data preprocessing pipeline converts the input image data into tensor form first, and then performs mean normalization and standard deviation normalization. Through these operations, the pixel values ​​of the image are adjusted to a normalized form in the range [-1, 1], making them more suitable for input to deep learning models for training and inference.
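As a quick numeric check (a minimal sketch, not from the original post), the two transforms map a raw pixel value of 0 to -1, 128 to roughly 0, and 255 to 1:

import numpy as np
import torch
from PIL import Image
from torchvision import transforms

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

# A tiny 1x1 RGB image with pixel values (0, 128, 255)
img = Image.fromarray(np.array([[[0, 128, 255]]], dtype=np.uint8))

out = transform(img)
print(out.flatten())  # approximately tensor([-1.0000, 0.0039, 1.0000])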

np.random.choice

idx = np.random.choice(range(len(data)), size=num_images)

This line of code uses np.random.choice to randomly select num_images indices from range(len(data)) and stores those indices in the variable idx.

Specifically, range(len(data)) generates a sequence of integers from 0 to len(data) - 1. np.random.choice selects num_images index values from this sequence (possibly with repetition) to form a one-dimensional array. This array represents the indices of num_images samples randomly selected from data.

This line of code is often used to randomly select a subset of samples from a dataset. In this code snippet, it is used to select data samples for image tampering.

Q: How can we make sure the selected index values are not repeated?

Answer: To keep the selected index values from repeating, set the replace parameter of np.random.choice to False. By default, replace is True, which allows repeated selections.

Modify the code as follows:

idx = np.random.choice(range(len(data)), size=num_images, replace=False)

By setting the replace parameter to False, you ensure that the selected index values are not duplicated. The returned idx array then contains unique index values, which are used to select samples from the dataset for further operations.

example:

Suppose we have a list containing the 100 numbers from 0 to 99 and we want to randomly select 10 unique elements from it. We can use np.random.choice to accomplish this. Here is an example:

import numpy as np

# We have a list containing the numbers 0 to 99
data = list(range(100))

# We want to randomly select 10 non-repeating elements from this list
num_images = 10

# Use np.random.choice to select the elements
selected_elements = np.random.choice(range(len(data)), size=num_images, replace=False)

# Print the selected elements
print(selected_elements)

In this example, np.random.choice(range(len(data)), size=num_images, replace=False) randomly selects 10 unique numbers from 0 to 99 (that is, from range(len(data))). The replace=False parameter ensures that the selected elements are not repeated, so selected_elements is a numpy array containing 10 randomly chosen, non-repeating numbers between 0 and 99.

deep copy and shallow copy

  1. How to use deepcopy, with a detailed example

When you need to create a completely independent copy of an object, you can use the deepcopy function. Here is an example showing how to use deepcopy:

import copy

# Define a custom class
class MyClass:
    def __init__(self, name):
        self.name = name

# Create an object
obj1 = MyClass("Object 1")

# Use deep copy to create an independent copy
obj2 = copy.deepcopy(obj1)

# Modify the copy's attribute
obj2.name = "Object 2"

# Print the original object's attribute
print(obj1.name)  # Output: Object 1

# Print the copy's attribute
print(obj2.name)  # Output: Object 2

In the example above, we first defined a custom class called MyClass and gave it a name attribute in its constructor. We then created a MyClass object named obj1 and set its name attribute to "Object 1".

Next, we used copy.deepcopy to create a completely separate copy of obj1, namely obj2. deepcopy ensures that obj2 is an independent copy of obj1, rather than a simple reference to it.

Then, we modified obj2's name attribute to "Object 2". Finally, we printed the name attributes of obj1 and obj2 to verify their independence.

Note that deepcopy can be applied to various data structures, including built-in types (such as lists and dictionaries) and custom objects. It recursively walks through all members of the object and creates a completely independent copy, so the copy can be modified without affecting the original object.

  2. Examples of deep copy and shallow copy

When copying an object, **deep copy** and **shallow copy** are two different approaches.

A shallow copy creates a new object, but the nested objects inside it are still shared with the original by reference. Here is an example of a shallow copy:

import copy

# Create an original object containing a nested list
original_list = [1, 2, [3, 4]]

# Make a shallow copy
shallow_copy = copy.copy(original_list)

# Modify a nested list element of the copy
shallow_copy[2][0] = 5

# Print the original object's list elements
print(original_list)  # Output: [1, 2, [5, 4]]

# Print the copy's list elements
print(shallow_copy)  # Output: [1, 2, [5, 4]]

A deep copy creates a new object that is a completely independent copy of the original object and all of its sub-objects. Here is an example of a deep copy:

import copy

# Create an original object containing a nested list
original_list = [1, 2, [3, 4]]

# Make a deep copy
deep_copy = copy.deepcopy(original_list)

# Modify a nested list element of the copy
deep_copy[2][0] = 5

# Print the original object's list elements
print(original_list)  # Output: [1, 2, [3, 4]]

# Print the copy's list elements
print(deep_copy)  # Output: [1, 2, [5, 4]]

Summary: a shallow copy copies the outer object but keeps references to its sub-objects, while a deep copy recursively copies the object and all of its sub-objects, creating a completely independent copy.
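Connecting this back to the federated learning code above (a minimal sketch, not from the original post): deepcopy(global_model) is used there precisely because a shallow copy of a model would still share its parameter tensors with the global model.

import copy
import torch.nn as nn

model = nn.Linear(4, 2)

shallow = copy.copy(model)      # shares the same parameter tensors
deep = copy.deepcopy(model)     # fully independent parameters

print(shallow.weight is model.weight)  # True  -> training 'shallow' would also change 'model'
print(deep.weight is model.weight)     # False -> training 'deep' leaves 'model' untouched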

random.sample

random.sample is a Python function that randomly draws a specified number of elements from a given sequence. It returns a new list containing the sampled elements without altering the original sequence. Here is an example using random.sample:

import random

# Create a list
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Randomly draw 3 elements from the list
sampled_list = random.sample(my_list, 3)

# Print the sampled result
print(sampled_list)

In the example above, we first created a list called my_list containing the numbers 1 through 10. Then we used random.sample to randomly draw 3 elements from my_list and stored them in sampled_list.

Finally, we printed the sampled result, i.e. the 3 randomly drawn elements. random.sample samples without replacement, so each position in the original list is picked at most once; because the selection is random, the result may vary from run to run.

Note that when the sample size equals the length of the original sequence, random.sample returns a random permutation of the sequence (equivalent to a shuffle). If the sample size is greater than the length of the original sequence, a ValueError is raised.


random.shuffle

random.shuffle is a Python function for randomly shuffling a given sequence. **It modifies the original sequence in place, rearranging its elements into a random order.** Here is an example using random.shuffle:

import random

# Create a list
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Shuffle the order of the list elements
random.shuffle(my_list)

# Print the shuffled list
print(my_list)

In the example above, we first created a list called my_list containing the numbers 1 through 10. We then used random.shuffle to shuffle my_list, modifying the original list directly.

Finally, we printed the shuffled list, whose element order has been changed randomly. Results may vary from run to run because random.shuffle rearranges the elements randomly.

Note that random.shuffle can only be applied to mutable sequences, such as lists. Applying it to an immutable sequence such as a string or tuple raises a TypeError. If you want a randomly permuted sequence without modifying the original one, you can use random.sample instead, as shown below.
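For example (a small sketch, not from the original post), sampling as many elements as the sequence contains yields a shuffled copy and leaves the original untouched:

import random

my_tuple = (1, 2, 3, 4, 5)

# random.sample works on immutable sequences and returns a new, shuffled list
shuffled = random.sample(my_tuple, len(my_tuple))

print(my_tuple)   # the original tuple is unchanged: (1, 2, 3, 4, 5)
print(shuffled)   # a random permutation, e.g. [3, 1, 5, 2, 4]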


global_param.data and global_param.grad

  1. global_param.data: the values of the model parameters themselves, such as the weights and biases of the neural network.

  2. global_param.grad: the gradient associated with that parameter, i.e. the derivative of the loss function with respect to it.

For model training, we need to calculate the gradient (global_param.grad) to know how to update the model parameters (global_param.data) to reduce the loss function. Usually after we have calculated all the gradients, we will call the optimizer's step() method to update the parameters according to these gradients.

In this code (the continue-training snippet in the next section, which checks global_param.grad), when the if global_param.grad is not None condition is met, we add the client's gradient (local_param.grad) to the corresponding gradient of the global model. This is gradient aggregation, i.e. computing the average of the clients' gradients; the aggregated gradients are then used to update the global model parameters when global_optimizer.step() is called.

The else branch handles the case where there is no gradient for that parameter (perhaps because it did not need to be updated on this client's data, so no gradient was computed); in that case the client's parameter value is used directly for the corresponding parameter of the global model. Note that the "addition" here is really an assignment, since global_param.data = global_param.data + local_param.data - global_param.data simplifies to global_param.data = local_param.data. This part of the code therefore looks redundant, and a direct assignment would be more intuitive.

As for why the else branch is not written as global_param.grad = global_param.grad + local_param.grad - global_param.grad: this branch deals with the case where no gradient was computed, not the case where a parameter value is missing. When there is no gradient, we usually assume the parameter does not need a gradient update in this round, so the client's parameter value is used directly as the global model's parameter value.
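To make the .data / .grad distinction concrete, here is a minimal standalone sketch (not from the original post) that averages the gradients of two tiny "client" models into a "global" model and then applies them with the optimizer's step():

import torch
import torch.nn as nn
import torch.optim as optim
from copy import deepcopy

global_model = nn.Linear(4, 2)
optimizer = optim.SGD(global_model.parameters(), lr=0.1)

# Two "clients", each a copy of the global model trained on its own batch
clients = [deepcopy(global_model) for _ in range(2)]
for client in clients:
    x, y = torch.randn(8, 4), torch.randn(8, 2)
    loss = nn.functional.mse_loss(client(x), y)
    loss.backward()  # fills client_param.grad, leaves client_param.data unchanged

# Aggregate: set each global .grad to the average of the clients' .grad
optimizer.zero_grad()
for g_param, *c_params in zip(global_model.parameters(),
                              *[c.parameters() for c in clients]):
    g_param.grad = sum(p.grad for p in c_params) / len(clients)

optimizer.step()  # updates the global .data using the averaged .grad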

Load a previously saved model and continue training

To load a previously saved model and continue training, you can use torch.load to read the saved parameters and model.load_state_dict() to load them into the model. Here is an example:

# Load the saved model parameters
saved_model_path = 'global_model_round_10.pt'  # choose the model file you want to load
saved_model_state = torch.load(saved_model_path)

# Create a new model instance
global_model = Net().to(device)

# Load the saved parameters into the model
global_model.load_state_dict(saved_model_state)

# Create the optimizer
global_optimizer = optim.Adam(global_model.parameters(), lr=0.007)

# Continue training the model
for round in range(10, num_rounds):  # start from round 11, since 10 rounds have already been trained
    local_models = [deepcopy(global_model) for _ in range(num_clients)]
    local_optimizers = [optim.Adam(model.parameters(), lr=0.007) for model in local_models]
    
    selected_clients = np.random.choice(range(num_clients), size=num_selected, replace=False)
    
    for client in selected_clients:
        train(local_models[client], device, client_loaders[client], local_optimizers[client])
    
    for client in selected_clients:
        for global_param, local_param in zip(global_model.parameters(), local_models[client].parameters()):
            if global_param.grad is not None:
                global_param.grad += local_param.grad / num_selected
            else:
                global_param.data = global_param.data + local_param.data - global_param.data
    global_optimizer.step()

    # Save the model parameters
    torch.save(global_model.state_dict(), 'global_model_round_{}.pt'.format(round+1))

    print('Round {} of {}'.format(round+1, num_rounds))

In this code, we first load the saved model parameters into saved_model_state, then create a new model instance global_model and load the saved parameters into it. We then create an optimizer global_optimizer and continue training with this restored model. Note that training resumes from round 11, since we already have the results of the first 10 rounds.

In this way, we can continue to train the model based on the saved model parameters.

About the number of times to loop through the parameters of the global model and the local model

Simplified model:

# Define a CNN model:
# CNN model definition
class Net(nn.Module):# batch_size, 3, 32, 32
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x))) # conv1: batch_size, 6, 28, 28; pool: batch_size, 6, 14, 14
        x = self.pool(F.relu(self.conv2(x))) # conv2: batch_size, 16, 10, 10; pool: batch_size, 16, 5, 5
        x = x.view(-1, 16 * 5 * 5) # flatten: batch_size, 400
        x = F.relu(self.fc1(x)) # batch_size, 120
        x = F.relu(self.fc2(x)) # batch_size, 84
        x = self.fc3(x) # batch_size, 10
        return x

Q: How many times does the loop for global_param, local_param in zip(global_model.parameters(), local_models[idx_selected_clients].parameters()) execute?

Answer: the number of iterations of for global_param, local_param in zip(global_model.parameters(), local_models[idx_selected_clients].parameters()) depends on the number of parameter tensors in the model.

In this loop, we iterate over the parameters of the global model and the local model in parallel. Each iteration takes out one corresponding pair of parameter tensors (global_param and local_param) and then updates the global model's gradient using the local model's gradient information.

For a given model type (e.g., the Net class defined above), the number of parameter tensors is fixed. The Net class has two convolutional layers and three fully connected layers, each with a weight tensor and a bias tensor, giving 10 parameter tensors in total (the pooling layer has no parameters). So, with a model of class Net, the loop executes 10 times.

In each round of federated learning, this loop runs once for each selected client, so the total number of iterations per round is 10 (parameter tensors) x 10 (selected clients) = 100.
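You can verify this count directly (a quick sketch, assuming the simplified Net class defined just above): iterating over model.parameters() yields one tensor per weight and one per bias.

model = Net()  # the simplified CNN defined above

print(len(list(model.parameters())))  # 10 parameter tensors

for name, p in model.named_parameters():
    print(name, tuple(p.shape))
# conv1.weight (6, 3, 5, 5)   conv1.bias (6,)
# conv2.weight (16, 6, 5, 5)  conv2.bias (16,)
# fc1.weight (120, 400)       fc1.bias (120,)
# fc2.weight (84, 120)        fc2.bias (84,)
# fc3.weight (10, 84)         fc3.bias (10,)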


Source: https://blog.csdn.net/Waldocsdn/article/details/131116195