[Study notes] A summary of adversarial training in NLP

Adversarial training

Adversarial training is widely used in NLP data competitions and is quite effective. This post summarizes the main adversarial training methods, of which FGM and PGD are the most commonly used.

Reference articles

https://www.spaces.ac.cn/archives/7234

https://wuwt.me/2020/11/06/adverisal-train-2020/

https://www.spaces.ac.cn/archives/6051

https://zhuanlan.zhihu.com/p/104040055

1. Basic concepts

1.1 Adversarial examples

The term first appeared in the paper "Intriguing properties of neural networks". In simple terms, an adversarial example is a sample that "looks" almost identical to a human, but for which the model's prediction is completely different, such as the classic example below.

[Figure: a classic adversarial example — an imperceptible perturbation flips the model's prediction]

What makes a good adversarial example? Adversarial examples generally need two properties:

  • the added perturbation is tiny relative to the original input;
  • it can make the model err.

1.2 Adversarial attacks

Finding ways to create more adversarial examples.

1.3 Adversarial defense

Finding ways to make the model correctly classify more adversarial examples.

1.4 Adversarial training

So-called adversarial training is a kind of adversarial defense. It constructs adversarial examples and adds them to the original dataset, hoping to make the model more robust to such examples; at the same time, in NLP it usually also improves model performance. Adversarial training in NLP is therefore better seen as a regularization method that improves the model's generalization ability.

The underlying assumption of adversarial training is that after adding a perturbation to the input, the output distribution should remain the same as the original distribution of y.

2. Min-Max

2.1 Core formula

In general, adversarial training can be written uniformly in the following form:

$$\min_\theta \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\max_{\Delta x\in\Omega} L(x+\Delta x,\, y;\, \theta)\right]$$

where $\mathcal{D}$ is the training set, $x$ the input, $y$ the label, $\theta$ the model parameters, $L(x,y;\theta)$ the loss on a single sample, $\Delta x$ the adversarial perturbation, and $\Omega$ the perturbation space. This unified form was first proposed in the paper "Towards Deep Learning Models Resistant to Adversarial Attacks".

The formula has two parts: an inner maximization of the loss function and an outer minimization of the empirical risk.

  • The inner max finds the worst-case perturbation, i.e. the attack.
  • The outer min finds the most robust model parameters given that attack, i.e. the defense, where $\mathcal{D}$ is the distribution of input samples.

2.2 Step by step understanding

This formula can be understood step by step as follows:

1. Inject a perturbation $\Delta x$ into the input $x$. The goal of $\Delta x$ is to make $L(x+\Delta x, y;\theta)$ as large as possible, i.e. to make the current model's predictions as wrong as possible;

2. Of course $\Delta x$ is not unconstrained: it cannot be too large, or the "looks almost the same" property is lost, so $\Delta x$ must satisfy some constraint, conventionally $\|\Delta x\| \le \epsilon$, where $\epsilon$ is a constant;

3. After constructing the adversarial example $x+\Delta x$ for each sample, use the pair $(x+\Delta x, y)$ to minimize the loss and update the parameters $\theta$ (gradient descent);

4. Repeat steps 1-3 alternately.

2.3 The difference with GAN

Seen this way, the whole optimization alternates between max and min, which is indeed very similar to a GAN. The difference is that in a GAN the variable of the max step is also a model parameter, whereas here the variable of the max step is the input (the perturbation); this is arguably the most essential difference between adversarial training and GANs. In other words, a max step has to be customized for each input.

3. The idea of adversarial training in NLP

3.1 Core Summary

The core idea of adversarial training in NLP in one sentence: perform gradient ascent on the input (to increase the loss) and gradient descent on the parameters (to decrease the loss). Since the input goes through an embedding lookup, in practice the gradient ascent is performed on the embedding table. (The negative gradient is the direction in which the loss drops fastest, so the positive gradient is the direction in which it rises fastest.)

[Figure: gradient ascent on the input vs. gradient descent on the parameters]

3.2 Adversarial Perturbations in CV Tasks

First, the input of a CV task consists of continuous RGB values, so the perturbation can be added directly to the original image. For CV tasks, the input tensor generally has shape (b, h, w, c). We fix the model's batch size b, then add to the original input a zero-initialized variable of the same shape (b, h, w, c), call it $\Delta x$. We can then directly compute the gradient of the loss with respect to x, assign it to $\Delta x$ to perturb the input, and perform ordinary gradient descent after the perturbation is applied.
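As a minimal sketch of this procedure in PyTorch (`model`, `images`, `labels`, and `eps` are hypothetical stand-ins, and an FGSM-style sign step is assumed):

import torch
import torch.nn.functional as F

# images: float tensor of shape (b, h, w, c), labels: (b,) — assumed given,
# with a model that accepts this layout
delta = torch.zeros_like(images, requires_grad=True)  # the perturbation Δx
loss = F.cross_entropy(model(images + delta), labels)
loss.backward()                        # fills delta.grad with ∂loss/∂x
with torch.no_grad():
    delta += eps * delta.grad.sign()   # ascend the loss (FGSM-style step)
# then train on the perturbed input with ordinary gradient descent
loss_adv = F.cross_entropy(model(images + delta.detach()), labels)
loss_adv.backward()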

3.3 Problems caused by original text perturbation and Embedding perturbation

In NLP, the input is a discrete token sequence, generally represented as one-hot vectors. If the perturbation is applied directly to the raw text, its magnitude and direction may be meaningless.

Goodfellow proposed at ICLR 2017 to perturb the continuous embeddings instead. However, compared with directly perturbing the original input as in vision, adding the perturbation to the embeddings causes a problem: the constructed "adversarial example" no longer corresponds to any actual word. That is, the perturbed embedding vector will generally not match any entry of the original embedding table, so the perturbation of the embedding layer does not correspond to any real text input. This is not a true adversarial example, because a true adversarial example should correspond to a plausible raw input. Conversely, at inference time an attacker has no way to obtain such adversarial examples by modifying the raw input.

3.4 Embedding perturbation is still valid - a regularization method

However, experiments show that on many tasks, adversarial perturbation at the embedding layer effectively improves model performance, so it is still very meaningful. In CV tasks, the empirical finding is that adversarial training often makes the model perform worse on non-adversarial samples. Curiously, in NLP tasks the model's generalization ability becomes stronger instead. So in NLP, the role of adversarial training is no longer to defend against gradient-based malicious attacks, but rather to act as a regularizer that improves the model's generalization ability.

3.5 Disturb the Embedding parameter matrix

For NLP tasks, in principle the same operation should be applied to the output of the embedding layer. The output of the embedding layer has shape (b, n, d), so we would need to add a variable of shape (b, n, d) to the embedding output and then carry out the steps above. But this requires taking the model apart and reassembling it, which is not user-friendly.

However, we can settle for the next best thing. The output of the embedding layer is taken directly from the embedding parameter matrix, so we can perturb the embedding parameter matrix itself, i.e. perform gradient ascent on the embedding table. The adversarial examples obtained this way are less diverse (because the same token in different samples shares the same perturbation), but this still acts as a regularizer and is much easier to implement.

The author's personal understanding: there are two points here.

  1. First, since the raw input text cannot be perturbed, we perturb the word embedding layer. The word embedding layer is effectively the initial input of the deep network that follows, so perturbing it can be regarded as perturbing the input. And since the embedding layer works by table lookup, we directly perturb the whole embedding parameter matrix.
  2. Second, it is no longer necessary to customize a max step for each input, because the same token in different samples shares the same perturbation.

4. Introduction of main methods

4.1 FGSM (Fast Gradient Sign Method), ICLR 2015

4.1.1 Principle

Let the gradient of the loss with respect to the input be:

$$g = \nabla_x L(\theta, x, y)$$

The perturbation ascends along the gradient direction toward the maximum of the loss function:

$$r_{adv} = \epsilon \cdot \mathrm{sign}(g)$$

Here sign(x) is the sign function: 1 when x > 0, -1 when x < 0, and 0 when x = 0.

The main difference from FGM is that FGSM takes a step of the same size in every coordinate direction.

4.1.2 Pytorch implementation

import torch
import torch.nn as nn
import torch.nn.functional as F


# FGSM
class FGSM:
    def __init__(self, model: nn.Module, eps=0.1):
        # The right-hand side is a parenthesized conditional expression, not a
        # one-element tuple (that would be written "(model,)"); it unwraps a
        # DataParallel/DistributedDataParallel wrapper if present.
        self.model = (
            model.module if hasattr(model, "module") else model
        )
        self.eps = eps
        self.backup = {}

    # only attack the word embedding
    def attack(self, emb_name='embedding'):
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name:
                self.backup[name] = param.data.clone()  # back up the original weights
                r_at = self.eps * param.grad.sign()     # max-norm (sign) step
                param.data.add_(r_at)

    def restore(self, emb_name='embedding'):
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name:
                assert name in self.backup
                param.data = self.backup[name]
        self.backup = {}

4.2 FGM (Fast Gradient Method), ICLR 2017

4.2.1 Principle

FGM changes how the perturbation is computed:

$$r_{adv} = \epsilon \cdot g / \|g\|_2$$

4.2.2 Precautions

  • Note that during training, epsilon should be chosen neither too large nor too small. Too large and the model has difficulty converging; too small (e.g. 0) and it amounts to training on the same sample twice.

  • The difference between FGSM and FGM is the normalization used. FGSM applies max normalization to the gradient via the sign function, while FGM applies L2 normalization. In theory L2 normalization preserves the gradient direction more faithfully, because the max-normalized step does not necessarily point in the same direction as the original gradient.

  • Both methods assume that the loss function is linear, or at least locally linear. If it is not (locally) linear, the gradient-ascent direction is not necessarily the optimal direction (see the comparison sketch below).
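A minimal side-by-side sketch of the two steps (assuming `g` is the gradient of the loss with respect to the input, and `eps` is the perturbation budget; both are illustrative):

import torch

g = torch.randn(4, 8)              # stand-in for the input gradient
eps = 0.1
r_fgsm = eps * g.sign()            # FGSM: max normalization, same step per coordinate
r_fgm = eps * g / torch.norm(g)    # FGM: L2 normalization, preserves the direction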

4.2.3 Pytorch implementation

Note that, strictly speaking, the norm here should be computed per sample, over the matrix formed by the words appearing in that sample's input sequence. To make the implementation a plug-in call, the author abstracts a whole batch as one sample, with one norm per batch. Since the norm is only a scale, the impact is small. The author's implementation is as follows:

import torch


class FGM():
    def __init__(self, model):
        self.model = model
        self.backup = {}

    def attack(self, epsilon=1., emb_name='emb.'):
        # replace emb_name with the name of your model's embedding parameter
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name:
                # tensor.clone() creates a tensor identical to the original; the two do
                # not share memory, but the clone stays in the computation graph and is
                # still tracked by autograd. Here it serves as a backup.
                self.backup[name] = param.data.clone()
                # normalize the gradient
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    r_at = epsilon * param.grad / norm
                    param.data.add_(r_at)

    def restore(self, emb_name='emb.'):
        # replace emb_name with the name of your model's embedding parameter
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name:
                assert name in self.backup
                param.data = self.backup[name]
        self.backup = {}

To use adversarial training, only a few lines of code need to be added:

# initialize
fgm = FGM(model)
for batch_input, batch_label in data:
    # normal training
    loss = model(batch_input, batch_label)
    loss.backward()  # backward pass, producing the normal gradients
    # adversarial training:
    # add the adversarial perturbation to the embedding
    fgm.attack()
    loss_adv = model(batch_input, batch_label)
    # backward pass: accumulates the adversarial gradients on top of the normal
    # ones (zero the gradients first if you do not want them accumulated)
    loss_adv.backward()
    # restore the embedding parameters
    fgm.restore()
    # gradient descent, update the parameters
    optimizer.step()
    model.zero_grad()

4.3 PGD (Projected Gradient Descent), ICLR 2018

4.3.1 Principle

The inner max is essentially a non-concave constrained optimization problem, and FGM solves it with what is effectively a single step of gradient ascent. Could FGM's simple, crude "one big step" fail to reach the optimum within the constraint set? Of course it could.

Hence a very intuitive improvement: Madry et al. proposed Projected Gradient Descent (PGD) at ICLR 2018. Put simply: "take small steps, and take more of them". If a step leaves the perturbation ball of radius $\epsilon$, project it back onto the ball so the perturbation never gets too large. PGD is an iterative attack: whereas FGSM and FGM do a single step, PGD does multiple iterations, taking a small step each time and projecting the perturbation back into the allowed region after every iteration.

The formula is as follows, where t is the iteration index: the input at step t+1 is obtained from the input at step t and the gradient at step t. $\Pi_{x+S}$ means that if the perturbation leaves the allowed region, it is projected back into the set $S$ around the original clean input x.

$$x^{t+1} = \Pi_{x+S}\!\left(x^t + \alpha\, \frac{g(x^t)}{\|g(x^t)\|_2}\right), \qquad g(x^t) = \nabla_x L(x^t, y; \theta)$$

4.3.2 Pytorch implementation

PGD improves on FGM: instead of taking one step of size $\epsilon$, each iteration adds a small perturbation (of size controlled by alpha in the code below) and projects back into the $\epsilon$-ball, so the perturbation is generated in a more refined way.

Specifically, for each training batch, PGD first backs up the original gradients. It then performs K attack operations, adding a perturbation to the embedding layer each time, and backs up the embedding layer's weights at the first attack.

After adding the perturbation, PGD needs extra gradient bookkeeping compared with FGM. For the first K-1 attacks the gradients are zeroed, and at the K-th attack the original gradients backed up at the start are restored. This is because the purpose of the first K-1 backward passes is only to provide a fresh gradient for computing the next perturbation, whereas the final backward pass accumulates the adversarial gradient on top of the original gradient: the model's final gradient is the initial gradient plus the gradient produced by the last perturbation. After the K attacks are done and the last attack's gradient has been accumulated onto the original one, the model also needs to restore the original embedding-layer parameters.

The pseudocode is as follows:

For each sample x:
  (1) Compute the forward loss on x, backpropagate to get the gradients, and back them up
  For each step t:
    (2) Compute the perturbation r from the gradient of the embedding matrix and add it to
        the current embedding, i.e. x+r (project back inside epsilon if out of range)
    (3) if t is not the last step: zero the gradients, then run forward/backward on x+r
        to get fresh gradients
    (4) if t is the last step: restore the gradients from (1), compute the final x+r, and
        accumulate its gradient onto (1)
  (5) Restore the embedding to its value at (1)
  (6) Update the parameters using the gradients from (4)

import torch


class PGD():
    def __init__(self, model):
        self.model = model
        self.emb_backup = {}
        self.grad_backup = {}

    def attack(self, epsilon=1., alpha=0.3, emb_name='emb.', is_first_attack=False):
        # replace emb_name with the name of your model's embedding parameter
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name:
                # back up the embedding on the first attack
                if is_first_attack:
                    self.emb_backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    r_at = alpha * param.grad / norm
                    # the perturbations accumulate across iterations
                    param.data.add_(r_at)
                    param.data = self.project(name, param.data, epsilon)

    def restore(self, emb_name='emb.'):
        # replace emb_name with the name of your model's embedding parameter
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name:
                assert name in self.emb_backup
                param.data = self.emb_backup[name]
        self.emb_backup = {}

    def project(self, param_name, param_data, epsilon):
        r = param_data - self.emb_backup[param_name]
        if torch.norm(r) > epsilon:
            # project back onto the sphere: normalize r, then scale by radius epsilon
            r = epsilon * r / torch.norm(r)
        return self.emb_backup[param_name] + r

    def backup_grad(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                self.grad_backup[name] = param.grad.clone()

    def restore_grad(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                param.grad = self.grad_backup[name]

During training, it is called by the following code:

pgd = PGD(model)
K = 3
for batch_input, batch_label in data:
    # normal training
    loss = model(batch_input, batch_label)
    # backward pass, producing the normal gradients
    loss.backward()
    # back up the gradients
    pgd.backup_grad()
    # adversarial training
    for t in range(K):
        # add the perturbation to the embedding; back up param.data on the first attack.
        # The input at step t+1 is derived from the input and gradient at step t — this
        # loop is where the multi-step iteration happens. The step-t input (x+r) is not
        # reset (no pgd.restore() here), so the perturbations keep accumulating.
        pgd.attack(is_first_attack=(t == 0))
        if t != K - 1:
            model.zero_grad()
        else:
            pgd.restore_grad()
        # compute the loss and backpropagate; on the last step this accumulates the
        # adversarial gradients on top of the normal ones
        loss_adv = model(batch_input, batch_label)
        # the backward pass here uses the step-t input to produce the next perturbation
        loss_adv.backward()
    # restore the embedding parameters
    pgd.restore()
    # update the parameters
    optimizer.step()
    model.zero_grad()

4.3.3 Analysis

As can be seen, the perturbation r accumulates gradually across the loop. Note that the final parameter update only uses the gradient computed at x plus the last perturbation r, accumulated onto the original clean gradient.

Advantages of PGD:

  • Since each step is small, the local-linearity assumption basically holds; after many steps the optimum can be reached, i.e. the strongest attack.
  • The paper also shows that the attack samples obtained by PGD are the strongest among first-order adversarial examples, i.e. adversarial examples based on first-order gradient information.
  • If a model is robust to samples generated by PGD, it is basically robust to all first-order adversarial examples.
  • Experiments also confirm that models adversarially trained with PGD are indeed very robust.

Disadvantages of PGD:

  • Although PGD is simple and effective, it is computationally expensive. Without adversarial training, m iterations cost m gradient computations; with PGD, every gradient descent step (computing the parameter gradients to train the model) is preceded by K gradient ascent steps (computing the input gradients to find the perturbation). So compared with training without adversarial examples, PGD needs m(K+1) gradient computations.

    My personal understanding: each parameter update first needs one ordinary backward pass to provide the initial gradient (which is backed up), followed by K gradient-ascent steps. Each of those K steps is a loss.backward() whose gradient is used to compute the next perturbation, hence K+1 gradient computations in total.

4.4 FreeAT (Free Adversarial Training), NIPS 2019

4.4.1 Motivation for improvement

For each batch in an epoch, ordinary PGD:

  • in the inner loop, runs K forward-backward passes to obtain K gradients with respect to the input and update the perturbation;
  • in the outer loop, runs one forward-backward pass to obtain the gradient with respect to the parameters and update the network.

This is very expensive. In fact, when we compute the gradient with respect to either the input or the parameters, we get the gradient of the other almost for free. This is the idea of Free Adversarial Training: extract more information from each computation to accelerate adversarial training.

In PGD's computation, every forward-backward pass computes both the parameter gradients and the input gradients, but gradient descent only uses the parameter gradients, and gradient ascent only uses the input gradients, which wastes a lot of work. Can one forward-backward pass use both the parameter gradients and the input gradients at the same time? That is the core idea of FreeAT.

4.4.2 Algorithms

How is it done?

  • FreeAT still trains in the PGD style, but for each mini-batch it computes m gradients, and each gradient is used to update both the perturbation and the parameters. In the original PGD scheme, each inner-loop computation only uses the gradient to update the perturbation, and the gradient for the parameter update is recomputed after the m steps finish.
  • FreeAT trains on each sample m consecutive times; to keep the total number of gradient computations equal to ordinary training, the number of epochs is divided by m.
  • The gradient from the previous step is reused when computing the perturbation.
  • In the end only $N_{ep}$ gradient computations are needed.

The algorithm flow is shown in the figure below:

[Figure: the FreeAT algorithm from the original paper]
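As a rough, hedged sketch of this loop on a continuous input `x` (in the NLP setting of this post the perturbation would live on the embeddings instead; `model`, `m`, and `eps` are illustrative stand-ins):

import torch
import torch.nn.functional as F

delta = torch.zeros_like(x)            # in the paper, delta persists across batches
for _ in range(m):                     # replay the same mini-batch m times
    delta.requires_grad_(True)
    loss = F.cross_entropy(model(x + delta), y)
    optimizer.zero_grad()
    loss.backward()                    # ONE backward pass yields both gradients
    optimizer.step()                   # parameter update (uses the parameter grads)
    with torch.no_grad():              # perturbation update (reuses the input grad)
        delta = (delta + eps * delta.grad.sign()).clamp(-eps, eps)
# the epoch count is divided by m so the total cost matches normal training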

4.5 YOPO (You Only Propagate Once), NIPS 2019

4.5.1 Motivation for improvement

The starting point of YOPO is to exploit the structure of the neural network to reduce the cost of gradient computation. Starting from Pontryagin's Maximum Principle (PMP), the adversarial perturbation is only related to the 0th layer of the network, i.e. the perturbation is only added at the embedding layer. Moreover, the layers are decoupled, so a complete forward and backward propagation is not needed every time.

Pontryagin's Maximum Principle (PMP) is an optimization principle that treats the neural network as a dynamical system. Its advantage is that when optimizing the network parameters, the layers are decoupled. With this view, one can ask: since the perturbation is only added at the embedding layer, why compute a complete forward and backward propagation every time?

4.5.2 Algorithms

Based on this idea, the authors reuse the gradient from the upper layers to avoid unnecessary full propagations: PGD's r attacks are split into m×n attacks.

Within each of m rounds, only one full forward-backward propagation is performed. In each round, a complete forward pass is done, and the subsequent backward pass stops at the first layer; its result is recorded as p, which is then held fixed, assuming it does not change with the perturbation. Then n attacks are performed on layer 0. This way YOPO performs only m full forward-backward propagations but achieves m×n perturbation updates.

In this way, only gradients through the layer-0 network $f_0$ need to be computed, reducing the number of layers propagated through and thus speeding things up.

[Figure: comparison of PGD-r (left) and YOPO-m-n (right)]

In the figure above, the olive-green and yellow blocks represent the intermediate activations $x_t$ of the network, and the orange blocks represent the gradient of the loss at each layer, i.e. $p_t$.

On the left is the traditional PGD-r algorithm: every update of the layer-0 perturbation $\eta$ requires a complete forward and backward propagation, and each PGD-r iteration must update $\eta$ r times before updating $\theta$.

On the right is the YOPO-m-n algorithm. To update the layer-0 perturbation $\eta$, it likewise first performs a forward propagation of the $x_t$ and a backward propagation of the $p_t$. The difference is that after obtaining $p_1$, a copy is kept, and $p_1$ together with the function $f_0$ is used to compute the gradient with respect to $\eta$ and perform n gradient steps, before the next full forward-backward propagation. Thus each full forward-backward propagation of YOPO-m-n yields n updates of $\eta$.
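As a rough sketch of this decoupling, assuming the model factors as `f_rest(f0(x + eta))` (all names here are hypothetical, and the real algorithm contains further details from the PMP derivation):

import torch

for j in range(m):                    # m full forward-backward propagations
    z = f0(x + eta)                   # layer 0
    loss = criterion(f_rest(z), y)
    p = torch.autograd.grad(loss, z)[0].detach()   # p1, held fixed for the inner loop
    for i in range(n):                # n cheap attacks that touch only layer 0
        eta = eta.detach().requires_grad_(True)
        surrogate = (p * f0(x + eta)).sum()        # <p, f0(x+eta)> replaces the full loss
        g = torch.autograd.grad(surrogate, eta)[0]
        eta = (eta + alpha * g.sign()).detach().clamp(-eps, eps)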

4.6 FreeLB (Free Large Batch Adversarial Training), ICLR 2020

4.6.1 Principle and Algorithm

FreeLB improves on PGD's multi-step iteration:

  • PGD iterates the perturbation K times and then updates the parameters using only the gradient of the last perturbation, whereas FreeLB takes the average gradient over the K iterations. Concretely, FreeLB does not call model.zero_grad() in each round, so each round's loss.backward() accumulates onto param.grad, and the average is taken at the end.
  • PGD is more precise and cautious, in keeping with gradient ascent's usual style; FreeLB is coarser and faster.
  • The paper also argues that adversarial training and dropout should not be used at the same time.
  • In total, $N_{ep} \cdot K$ gradient computations are needed.

[Figure: the FreeLB algorithm from the original paper]
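A minimal, hedged sketch of this accumulation scheme, assuming direct access to the embedding output via a hypothetical `model_from_embeds` (the paper adds details such as scaled random initialization of the perturbation):

import torch
import torch.nn.functional as F

delta = torch.zeros_like(embeds).uniform_(-eps, eps)  # random init inside the ball
optimizer.zero_grad()
for t in range(K):
    delta = delta.detach().requires_grad_(True)
    loss = F.cross_entropy(model_from_embeds(embeds + delta), y) / K  # 1/K -> average
    loss.backward()                  # parameter grads ACCUMULATE over the K steps
    with torch.no_grad():            # PGD-style ascent step on the perturbation
        g = delta.grad
        delta = delta + alpha * g / torch.norm(g)
        if torch.norm(delta) > eps:                  # project back into the eps-ball
            delta = eps * delta / torch.norm(delta)
optimizer.step()                     # one update with the averaged gradient
model.zero_grad()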

4.6.2 Differences between FreeLB, FreeAT and PGD

  • Like FreeAT, FreeLB also wants to use both gradients more efficiently. But unlike FreeAT, FreeLB does not update the parameters at every gradient-ascent step; instead it accumulates the parameter gradients, and after K such steps uses the accumulated gradient to update the parameters θ.
  • According to the algorithms and source code, PGD needs $N_{ep}(K+1)$ gradient computations, FreeAT needs $N_{ep}$, and FreeLB needs $N_{ep} \cdot K$. FreeLB has no great efficiency advantage, but it works well.

The extra computations of PGD refer to the initial backward pass whose gradient is backed up; FreeAT's count comes from dividing the epoch count by m.

  • Since FreeLB updates with the gradient accumulated over K steps, its gradient estimate is more accurate, and it avoids FreeAT's problem of repeatedly updating on several identical mini-batches.
  • Compared with YOPO-m-n, FreeLB likewise updates the parameters after aggregating the gradient over K steps (corresponding to m); the difference is that it has no further inner n steps, and even if it did, the n values would be exactly identical.
  • Why does the paper call this Large Batch? The gradient used in the descent step is computed over [X+r1, X+r2, ..., X+rk], which can be seen as approximately averaging over K different batches of samples, so it virtually enlarges the batch size.

4.7 SMART (SMoothness-inducing Adversarial Regularization), ACL 2020

4.7.1 Core idea

Everything so far is essentially built on the Min-Max objective. SMART abandons the Min-Max formulation and performs adversarial learning through a regularization term, Smoothness-inducing Adversarial Regularization. To optimize this new objective, the authors propose Bregman Proximal Point Optimization; these are the two main contributions of SMART.

The SMART paper proposes two techniques:

1. Smoothness-inducing Adversarial Regularization, to improve model robustness.

2. Bregman proximal point optimization, to avoid catastrophic forgetting.

The first amounts to semi-supervised adversarial training: the adversarial goal is to maximize the divergence between the outputs before and after the perturbation. For classification tasks the loss is a symmetric KL divergence; for regression tasks, the squared loss.

SMART's algorithm is similar to PGD: it also iterates K steps to find the optimal r, and then updates the gradient.
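A small sketch of the symmetric KL term for classification, given logits for the clean and perturbed inputs (the inner max over the perturbation would wrap around this, PGD-style; names are illustrative):

import torch.nn.functional as F

def symmetric_kl(p_logits, q_logits):
    # KL in both directions, computed from logits in log-space
    p = F.log_softmax(p_logits, dim=-1)
    q = F.log_softmax(q_logits, dim=-1)
    return (F.kl_div(p, q, reduction='batchmean', log_target=True)
            + F.kl_div(q, p, reduction='batchmean', log_target=True))

# smoothness regularizer: R_s = max over ||delta|| <= eps of
#     symmetric_kl(model(x + delta), model(x))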

4.8 VAT (Virtual Adversarial Training), ACL 2020

4.8.1 Research motivation

To address the insufficient generalization and robustness of current pretrained models (BERT and RoBERTa in this paper), and the fact that existing adversarial training improves robustness but harms generalization, a general adversarial training algorithm for language models is proposed: ALUM.

ALUM is a semi-supervised method. Compared with other adversarial training methods such as FGSM, FGM, and PGD, ALUM can also use unlabeled data to optimize the model parameters: during training, the loss includes the KL divergence between the logits of the original samples and the adversarial samples.

4.8.2 Algorithms

[Figure: the ALUM algorithm from the original paper]

  1. Loop over epochs.
  2. Loop over the dataset, producing one batch of batch_size at a time.
  3. Generate a perturbation δ, drawn from a Gaussian with mean 0 and variance 1.
  4. Loop K times; in theory larger K gives better results, but in practice K=1 is used to reduce computation.
  5. Compute the KL-divergence loss between the model's output on the actual input and its output on the adversarial input, and compute the gradient.
  6. Update and normalize the perturbation.
  7. End of the K-times loop.
  8. Compute the model's loss (labeled-data loss + virtual adversarial loss) and its gradient, and update the parameters. α is the weight of the adversarial term: 10 for pretraining and 1 for downstream tasks (see the sketch after this list).
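A hedged sketch of steps 3-8 for one batch, reusing the hypothetical `model_from_embeds` and the `symmetric_kl` helper sketched in 4.7 (the paper's exact initialization scale and normalization may differ):

import torch

logits = model_from_embeds(embeds)       # forward pass on the actual input
delta = torch.randn_like(embeds)         # Gaussian perturbation, mean 0, variance 1
for _ in range(K):                       # K = 1 in practice
    delta.requires_grad_(True)
    adv_loss = symmetric_kl(model_from_embeds(embeds + delta), logits.detach())
    g, = torch.autograd.grad(adv_loss, delta)
    delta = (delta + eta * g / torch.norm(g)).detach()   # normalized ascent step
adv_loss = symmetric_kl(model_from_embeds(embeds + delta), logits.detach())
loss = task_loss + alpha * adv_loss      # alpha: 10 for pretraining, 1 for downstream
loss.backward()
optimizer.step()
model.zero_grad()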

In the end, we minimize the task loss while the inner step maximizes the adversarial loss. The overall objective is:

$$\min_\theta \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[ L(f(x;\theta), y) + \alpha \max_{\|\delta\|\le\epsilon} \ell\big(f(x+\delta;\theta),\, f(x;\theta)\big) \right]$$

where $\ell$ is the symmetric KL divergence between the two output distributions.

Origin: blog.csdn.net/m0_47779101/article/details/131136915