[Vernacular Machine Learning Series] Vernacular Dropout



What is Dropout?

Dropout is a regularization technique for neural networks: during training, each unit (along with its connections) is dropped with a specified probability $p$ (a common value is $p=0.5$). At test time, all units are present, but the weights are scaled by the keep probability $1-p$ (i.e. a weight $w$ becomes $(1-p)w$).

The idea is to prevent co-adaptation, where the network becomes overly dependent on specific connections, which can be a symptom of overfitting. Intuitively, dropout can be thought of as training an implicit ensemble of many sub-networks.

Following this definition, PyTorch's nn.Dropout "randomly zeroes some of the elements of the input tensor with probability $p$, using samples from a Bernoulli distribution. Each channel will be zeroed out independently on every forward call."

Dropout can be thought of as setting elements of the input tensor to zero at random with a given probability $p$. When this happens, part of the output is lost. To account for this, the output is also scaled by $\frac{1}{1-p}$.

Scaling makes the input mean and output mean approximately equal.
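
To make this concrete, here is a minimal sketch of what an (inverted) dropout layer does during training, written by hand with a Bernoulli mask. The helper manual_dropout is purely illustrative and is not part of PyTorch:

import torch

def manual_dropout(x, p=0.5):
    # illustrative only, not a PyTorch API
    # keep each element with probability 1 - p (Bernoulli mask of 0s and 1s)
    mask = torch.bernoulli(torch.full_like(x, 1 - p))
    # zero out the dropped elements and scale the survivors by 1/(1-p)
    return x * mask / (1 - p)

x = torch.rand(3, 4)
out = manual_dropout(x, p=0.4)  # roughly 40% of the elements are 0, the rest equal x/(1-p)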

Understanding the scaling

Many people may be confused about how and why the dropout layer scales the input. Here is a detailed explanation.

The official PyTorch documentation states:

Furthermore, the outputs are scaled by a factor of $\frac{1}{1-p}$ during training. This means that during evaluation the module simply computes an identity function.

So how is this done, and why? Let's look at some PyTorch code.

Create a Dropout layer m with a drop probability of $p=0.4$:

import torch
import numpy as np
p = 0.4
m = torch.nn.Dropout(p)

The PyTorch documentation explains:

During training, randomly zeroes some of the elements of the input tensor with probability $p$, using samples from a Bernoulli distribution. The elements to be zeroed are re-randomized on every forward call.

Feed a random input to the dropout layer and confirm that about 40% ($p=0.4$) of its elements have become 0:

nbig = 5000000
inp = torch.rand(nbig, 10)
outp = m(inp)
print(f'Proportion of zero elements in the output: {(outp == 0).numpy().mean():.5f}, p={p}')

Output after running the above code:

$ Proportion of zero elements in the output: 0.40007, p=0.4

Let's move on to the scaling part.

Create a small random input and pass it through the dropout layer, then compare the input and the output:

torch.manual_seed(42)  # seed torch's RNG (np.random.seed does not affect torch.rand)
inp = torch.rand(5, 4)
inp

The above code creates a random tensor with 5 rows and 4 columns, and the output is as follows:

$ tensor([[0.6485, 0.3114, 0.1626, 0.1022],
          [0.7352, 0.4634, 0.8206, 0.4228],
          [0.0322, 0.9399, 0.9163, 0.4169],
          [0.2574, 0.0467, 0.2213, 0.6171],
          [0.4146, 0.2288, 0.0388, 0.7752]])

By comparing the non-zero elements of the two tensors below, we can see that during training the output is scaled by a factor of $\frac{1}{1-p}$:

outp = m(inp)
inp/(1-p)
$ tensor([[1.0808, 0.5191, 0.2710, 0.1703],
          [1.2254, 0.7723, 1.3676, 0.7046],
          [0.0537, 1.5665, 1.5272, 0.6948],
          [0.4290, 0.0778, 0.3689, 1.0284],
          [0.6909, 0.3813, 0.0646, 1.2920]])

outp

$ tensor([[1.0808, 0.5191, 0.2710, 0.0000],
          [0.0000, 0.7723, 0.0000, 0.0000],
          [0.0000, 1.5665, 1.5272, 0.6948],
          [0.4290, 0.0778, 0.3689, 1.0284],
          [0.6909, 0.0000, 0.0646, 0.0000]])

We can assert this observation in code:

idx_nonzero = outp != 0
assert np.allclose(outp[idx_nonzero].numpy(), (inp / (1 - p))[idx_nonzero].numpy())

So why do it?

Basically, the dropout layer becomes an identity function and does not change its input during evaluation/testing/inference. Since dropout is only active during training and not during inference, without scaling the expected output would be larger at inference time, because elements are no longer being randomly dropped (set to 0). But we want the expected output to be the same whether or not the input passes through the dropout layer. Therefore, during training we scale the output of the dropout layer up by a factor of $\frac{1}{1-p}$ to compensate. A larger $p$ means more aggressive dropout, which calls for more compensation, i.e. a larger scaling factor $\frac{1}{1-p}$.
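
To see why this keeps the expectation unchanged, consider a single input element $x$: it is zeroed with probability $p$ and kept (and scaled) with probability $1-p$, so

$\mathbb{E}[\text{Dropout}(x)] = p \cdot 0 + (1-p) \cdot \frac{x}{1-p} = x$

i.e. the expected output equals the original input value.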

The code below demonstrates how the scale factor restores the output to the same scale as the input.

inp = torch.rand(nbig, 10)
outp = m(inp)
print(f'Mean output of the dropout layer ({outp.mean():.4f}) is close to the mean input ({inp.mean():.4f})')
$ Mean output of the dropout layer (0.5000) is close to the mean input (0.5000)
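
As a quick check of the identity-function behaviour during evaluation, we can switch the same layer m to eval mode (a small sketch reusing m and inp from the snippets above):

m.eval()                            # in eval mode, dropout is a no-op
outp_eval = m(inp)
assert torch.equal(outp_eval, inp)  # the output is exactly the input
m.train()                           # switch back to training mode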

An example

Let's use a tensor of 100 ones to demonstrate how dropout and its scaling affect the input.

import torch
import torch.nn as nn

# generate 100 ones
x = torch.ones(100) 
$ tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

With a drop rate of $p=0.1$, about 10 of the values should be set to 0. The scaling factor is
$\frac{1}{1-0.1} = \frac{1}{0.9} = 1.\dot{1} \approx 1.1111$

# pass the tensor through a Dropout layer
output = nn.Dropout(p=0.1)(x)     
$ tensor([1.1111, 1.1111, 1.1111, 1.1111, 1.1111, 1.1111, 1.1111, 1.1111, 1.1111,
         1.1111, 1.1111, 1.1111, 1.1111, 0.0000, 1.1111, 1.1111, 1.1111, 1.1111,
         1.1111, 0.0000, 0.0000, 1.1111, 0.0000, 1.1111, 1.1111, 1.1111, 1.1111,
         1.1111, 1.1111, 1.1111, 1.1111, 1.1111, 1.1111, 1.1111, 1.1111, 1.1111,
         1.1111, 1.1111, 0.0000, 1.1111, 1.1111, 1.1111, 1.1111, 1.1111, 1.1111,
         1.1111, 1.1111, 1.1111, 1.1111, 1.1111, 1.1111, 1.1111, 1.1111, 1.1111,
         1.1111, 1.1111, 1.1111, 0.0000, 1.1111, 1.1111, 1.1111, 1.1111, 1.1111,
         1.1111, 1.1111, 1.1111, 1.1111, 1.1111, 1.1111, 1.1111, 1.1111, 1.1111,
         1.1111, 1.1111, 0.0000, 1.1111, 1.1111, 1.1111, 1.1111, 0.0000, 0.0000,
         1.1111, 1.1111, 1.1111, 1.1111, 1.1111, 1.1111, 1.1111, 1.1111, 1.1111,
         1.1111, 1.1111, 1.1111, 1.1111, 1.1111, 0.0000, 1.1111, 1.1111, 1.1111,
         1.1111])

The result is as expected: 10 values are completely zeroed out, and the remaining values are scaled so that the input and output have the same mean, or as close to it as possible.

print(x.mean(), output.mean())
$ tensor(1.) tensor(1.0000)

In this example, the means of the input and output are both approximately 1.0.
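
We can also count the zeroed elements directly (a small check on the same output tensor; the exact count varies from run to run, but it is 10 for the output shown above):

num_zeros = (output == 0).sum().item()  # number of dropped elements
print(num_zeros)
$ 10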


Original article: blog.csdn.net/jarodyv/article/details/131287713