Understanding nn.Dropout and DropPath, with PyTorch code


I came across DropPath in the ViT code and wanted to know how it differs from nn.Dropout(), so I looked up the relevant material and recorded it here.

Theory

Dropout

Dropout is the earliest method used to fight overfitting and the predecessor of all later drop-style methods. It was proposed by Hinton et al. in 2012 and used in AlexNet in the work "ImageNet Classification with Deep Convolutional Neural Networks".

Principle: during forward propagation, each neuron's activation is switched off (set to zero) with probability 1 - keep_prob, i.e. with drop probability p (0 < p < 1).

Function: this makes the model generalize better because it cannot rely too heavily on particular local nodes. During training, each neuron is kept with probability keep_prob and switched off with probability 1 - keep_prob; at test time no neurons are switched off, but in the original formulation the outputs of the neurons that had dropout applied during training are multiplied by keep_prob (a quick check of how PyTorch actually handles this is shown after the note below).

Note: dropout is now generally used only on fully connected layers. Convolutional layers usually rely on BN instead to prevent overfitting, and the convolutions followed by nonlinearities such as ReLU already reduce the direct correlation between features.
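
PyTorch's nn.Dropout implements the "inverted" variant of this: the kept values are scaled by 1/keep_prob during training, and the layer is an identity in eval mode, so no scaling is needed at test time. A minimal check of that behaviour (the tensor and p below are arbitrary, just for illustration):

import torch
import torch.nn as nn

x = torch.ones(8)
drop = nn.Dropout(p=0.4)

drop.train()    # training mode: kept elements are scaled by 1 / (1 - p) = 1 / 0.6
print(drop(x))  # a mix of zeros and values of about 1.6667

drop.eval()     # eval mode: dropout is a no-op
print(drop(x))  # exactly the input, no zeros and no scaling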

DropPath

DropPath was proposed together with FractalNet in the paper "FractalNet: Ultra-Deep Neural Networks without Residuals" (ICLR 2017).

Principle: as the name suggests, DropPath randomly drops whole branches (paths) of a multi-branch structure in a deep network.

Function: it is generally added to a network as a regularization method, but it makes training harder; especially in NAS problems, if drop_prob is set too high the model may not converge at all.
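
A common convention for keeping training stable (used, for example, in the timm ViT implementation) is not to apply one large drop_prob everywhere, but to increase the rate linearly from 0 in the first block to the target value in the last block. A short sketch, with illustrative values for depth and drop_path_rate:

import torch

depth = 12
drop_path_rate = 0.1
# stochastic depth decay rule: shallow blocks get a small rate, the deepest block gets the full rate
dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)]
print(dpr)  # [0.0, 0.009..., ..., 0.1]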

Code

import torch
import torch.nn as nn


def drop_path(x, drop_prob: float = 0., training: bool = False):
    if drop_prob == 0. or not training:  # if the drop probability is 0, or we are not training, return x unchanged
        return x
    keep_prob = 1 - drop_prob  # keep probability
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)  # (b, 1, 1, 1) for a 4-D image tensor: one mask value per sample
    random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)  # keep_prob + uniform noise in [0, 1)
    random_tensor.floor_()  # floor to 0 or 1: decides, per sample in the batch, whether its path is kept
    output = x.div(keep_prob) * random_tensor  # divide by keep_prob so the expectation matches between training and testing
    # if kept, the values are scaled by 1/keep_prob; if dropped, they become 0
    return output  # same shape as x


class DropPath(nn.Module):
    def __init__(self, drop_prob=None):
        super(DropPath, self).__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        return drop_path(x, self.drop_prob, self.training)

if __name__ == '__main__':
    input = torch.randn(3, 2, 2, 3)
    drop1 = DropPath(drop_prob=0.4)  # instantiate DropPath
    drop2 = nn.Dropout(p=0.4)
    out1 = drop1(input)
    out2 = drop2(input)
    print(input)
    print(out1)
    print(out2)
    print(out1.shape, out2.shape)

The results are as follows:

tensor([[[[-0.4603, -0.2193,  0.7828],   # this is the input
          [-0.4790, -0.3336,  1.3353]],

         [[-0.4309, -0.6019, -0.4993],
          [ 0.2313,  0.7210, -0.2553]]],


        [[[ 0.0653, -0.4787,  0.6238],
          [ 1.4323,  1.0883, -0.6952]],

         [[ 0.0912,  0.8802, -0.6991],
          [ 0.7248, -0.9305,  0.2832]]],


        [[[ 0.0923,  0.4770,  0.5671],
          [ 1.2669,  0.4013,  0.3464]],

         [[ 0.8646, -0.3866, -0.8333],
          [-1.1507,  1.4823,  0.1255]]]])
tensor([[[[-0.7672, -0.3655,  1.3047],  # this is the DropPath output
          [-0.7984, -0.5560,  2.2255]],

         [[-0.7181, -1.0032, -0.8322],
          [ 0.3855,  1.2016, -0.4255]]],


        [[[ 0.0000, -0.0000,  0.0000],
          [ 0.0000,  0.0000, -0.0000]],

         [[ 0.0000,  0.0000, -0.0000],
          [ 0.0000, -0.0000,  0.0000]]],


        [[[ 0.1539,  0.7949,  0.9452],
          [ 2.1115,  0.6688,  0.5773]],

         [[ 1.4411, -0.6444, -1.3888],
          [-1.9179,  2.4706,  0.2092]]]])
tensor([[[[-0.7672, -0.0000,  0.0000],  # this is the nn.Dropout output
          [-0.7984, -0.5560,  2.2255]],

         [[-0.0000, -1.0032, -0.8322],
          [ 0.0000,  1.2016, -0.4255]]],


        [[[ 0.0000, -0.7979,  1.0397],
          [ 2.3872,  0.0000, -0.0000]],

         [[ 0.0000,  0.0000, -1.1652],
          [ 0.0000, -1.5509,  0.4720]]],


        [[[ 0.1539,  0.0000,  0.9452],
          [ 2.1115,  0.0000,  0.5773]],

         [[ 1.4411, -0.6444, -1.3888],
          [-1.9179,  0.0000,  0.0000]]]])
torch.Size([3, 2, 2, 3]) torch.Size([3, 2, 2, 3])

We can see that the first element of input is -0.4603, keep_prob is 1 - drop_prob = 0.6, and -0.4603 / 0.6 = -0.7672, which is exactly the first element of both outputs. In other words, both methods divide the retained values by keep_prob.
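
Continuing that example, the claim can be checked programmatically: every non-zero element of both outputs should equal the corresponding input element divided by keep_prob. A small sanity check reusing input, out1 and out2 from the script above (it assumes no input element happens to be exactly 0, which holds here):

keep_prob = 1 - 0.4
mask1 = out1 != 0  # elements kept by DropPath
mask2 = out2 != 0  # elements kept by nn.Dropout
print(torch.allclose(out1[mask1], input[mask1] / keep_prob))  # True
print(torch.allclose(out2[mask2], input[mask2] / keep_prob))  # True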

In addition, you can see that DropPath randomly zeros entire samples within the batch (here the second sample is zeroed completely), whereas nn.Dropout randomly zeros individual elements with probability p in every sample. The figure below compares one sample from the two outputs to show the difference:

[Figure: one batch sample compared under DropPath and nn.Dropout]
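
The same difference can also be read off programmatically from out1 and out2 in the script above; a rough sketch: DropPath either zeroes a whole sample or leaves it untouched, while nn.Dropout zeroes scattered individual elements in every sample.

# per-sample check: DropPath masks whole samples, nn.Dropout masks individual elements
droppath_dropped = (out1.flatten(1) == 0).all(dim=1)            # e.g. tensor([False,  True, False])
dropout_zero_frac = (out2.flatten(1) == 0).float().mean(dim=1)  # roughly p of the elements per sample
print(droppath_dropped)
print(dropout_zero_frac)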

As a supplement, this is how DropPath is called in ViT:

self.drop_path = DropPath(drop_path_ratio) if drop_path_ratio > 0. else nn.Identity()

x = x + self.drop_path(self.attn(self.norm1(x)))
x = x + self.drop_path(self.mlp(self.norm2(x)))
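
For context, here is a minimal self-contained block in the same spirit, reusing the DropPath class defined earlier. This is not the actual ViT Block: the attention branch is replaced by a plain MLP branch for brevity, and the layer sizes are arbitrary.

class ToyBlock(nn.Module):
    """A toy residual block showing where DropPath sits on the residual branch."""
    def __init__(self, dim, drop_path_ratio=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        # identity when the ratio is 0, exactly as in the ViT code above
        self.drop_path = DropPath(drop_path_ratio) if drop_path_ratio > 0. else nn.Identity()

    def forward(self, x):
        # when DropPath drops the branch for a sample, only the skip connection x is left for that sample
        return x + self.drop_path(self.mlp(self.norm(x)))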

Question: Why should we divide by keep_prob in dropout?

To keep the expected output activation of a neuron the same whether or not dropout is applied (the argument is identical for drop_path), look at it with basic probability theory. Suppose a neuron's output activation is a. Without dropout, the expected output is simply a. With dropout, the neuron is either kept or dropped; treat this as a 0-1 (Bernoulli) random variable with keep probability p = keep_prob. The expected output then becomes p*a + (1 - p)*0 = p*a. To make this match the expectation a of the no-dropout case, the kept activation must be divided by p, i.e. by keep_prob.
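
A quick empirical check of this argument, reusing the torch and nn imports from above (note that nn.Dropout's argument p = 0.4 is the drop probability, so keep_prob = 0.6; the tensor size is arbitrary and the mean only matches up to sampling noise):

x = torch.ones(100000)
drop = nn.Dropout(p=0.4)

drop.train()
print(drop(x).mean())  # close to 1.0: the kept values are scaled by 1 / keep_prob = 1 / 0.6

drop.eval()
print(drop(x).mean())  # exactly 1.0: dropout is an identity at evaluation time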

References:

https://www.cnblogs.com/dan-baishucaizi/p/14703263.html#bottom
https://www.cnblogs.com/pprp/p/14815168.html
https://blog.csdn.net/weixin_54338498/article/details/125670154
https://blog.csdn.net/wuli_xin/article/details/127266407
