Commonly used weight initialization methods in deep learning

I have recently been reading papers and noticed that many of them state that their convolution weights are initialized with Kaiming Uniform. Curious about what this is, I found out that it is a weight initialization method, and in fact the default one in PyTorch. If simply stating the initialization method is worth a sentence in a paper and makes it look impressive, then I will write it in my papers in the future as well.

Note: The following content is all from ChatGPT.

Random Initialization

Random initialization is a commonly used method of neural network weight initialization. It breaks the symmetry of the network by randomly assigning weight values at the beginning of training, enabling the network to learn unique features and patterns in the data.

The most common random initialization method is to sample weights from a Gaussian distribution with mean 0 and variance 1, often referred to as the "standard normal" distribution. The random initialization formula for the weight $w_{ij}$ between the $i$-th input neuron and the $j$-th output neuron is:

$$ w_{ij} \sim \mathcal{N}(0, 1) $$

where $\mathcal{N}(0, 1)$ denotes a Gaussian distribution with mean 0 and variance 1.

Random initialization can also use a uniform distribution, where the weights are sampled from a uniform distribution between two values, typically -1 and 1. The random initialization formula for the weight $w_{ij}$ is:

$$ w_{ij} \sim \mathcal{U}(-1, 1) $$

where $\mathcal{U}(-1, 1)$ denotes a uniform distribution between -1 and 1.
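
As a minimal sketch (assuming PyTorch, with an arbitrarily sized linear layer), both forms of random initialization can be applied to a layer's weight tensor via torch.nn.init:

import torch
import torch.nn as nn

# An arbitrary fully connected layer: 128 inputs, 64 outputs
fc = nn.Linear(128, 64)

# Standard normal initialization: w_ij ~ N(0, 1)
nn.init.normal_(fc.weight, mean=0.0, std=1.0)

# Uniform initialization: w_ij ~ U(-1, 1)
nn.init.uniform_(fc.weight, a=-1.0, b=1.0)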

It is important to note that the range of values used for random initialization can have a significant impact on network performance. If the weights are initialized with values that are too large or too small, training may converge slowly or the network may even diverge. Therefore, it is very important to choose an appropriate range for the initialization distribution according to the specific network architecture and task.

Overall, random initialization is a simple and effective method for initializing the weights of neural networks, and is often used as a benchmark method for comparison with more advanced initialization techniques.

Xavier Initialization

Xavier Initialization (also known as Glorot Initialization) is a commonly used neural network weight initialization method that aims to mitigate the vanishing or exploding gradients that plain random initialization can cause. Its basic idea is to determine the range of the initial weights from the number of input neurons of the previous layer and the number of output neurons of the next layer.

The formula for Xavier Initialization is:

$$ W_{ij} \sim \mathcal{N}\left(0, \frac{2}{n_{in} + n_{out}}\right) $$

where $W_{ij}$ is the weight connecting the $i$-th input neuron and the $j$-th output neuron, $n_{in}$ is the number of input neurons of the previous layer, $n_{out}$ is the number of output neurons of the next layer, and $\mathcal{N}\left(0, \frac{2}{n_{in} + n_{out}}\right)$ denotes a Gaussian distribution with mean 0 and variance $\frac{2}{n_{in} + n_{out}}$ (i.e., standard deviation $\sqrt{\frac{2}{n_{in} + n_{out}}}$).
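
A minimal PyTorch sketch of Xavier initialization (the layer sizes are arbitrary), showing the built-in helpers alongside a manual version of the normal variant:

import torch
import torch.nn as nn

# An arbitrary fully connected layer: n_in = 256 inputs, n_out = 128 outputs
fc = nn.Linear(256, 128)

# Built-in Xavier (Glorot) initializers
nn.init.xavier_normal_(fc.weight)   # N(0, 2 / (n_in + n_out))
nn.init.xavier_uniform_(fc.weight)  # U(-a, a) with a = sqrt(6 / (n_in + n_out))

# Manual version of the normal variant, matching the formula above
n_out, n_in = fc.weight.shape       # nn.Linear stores weights as (out_features, in_features)
std = (2.0 / (n_in + n_out)) ** 0.5
with torch.no_grad():
    fc.weight.normal_(mean=0.0, std=std)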

The advantage of Xavier Initialization is that it can effectively avoid vanishing or exploding gradients and can accelerate the convergence of the neural network. Its disadvantage is that in some cases it may still make the weights too small or too large, thereby affecting the performance of the network. Therefore, improved techniques such as He Initialization and its variants have been proposed to further improve on it.

Overall, Xavier Initialization is a simple and effective way to initialize neural network weights, especially when the activation function is tanh or sigmoid. By adjusting the range of the initial weights, the training speed and performance of the network can be effectively improved.

Xavier initialization originally appeared in the 2010 paper "Understanding the difficulty of training deep feedforward neural networks".

He Initialization

He Initialization is an improved neural network weight initialization method and a variant of Xavier Initialization. Unlike Xavier Initialization, He Initialization is mainly intended for neural networks whose activation function is ReLU (Rectified Linear Unit).

The formula for He Initialization is:

$$ W_{ij} \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right) $$

where $W_{ij}$ is the weight connecting the $i$-th input neuron and the $j$-th output neuron, $n_{in}$ is the number of input neurons of the previous layer, and $\mathcal{N}\left(0, \frac{2}{n_{in}}\right)$ denotes a Gaussian distribution with mean 0 and variance $\frac{2}{n_{in}}$ (i.e., standard deviation $\sqrt{\frac{2}{n_{in}}}$).
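
A minimal PyTorch sketch (arbitrary layer sizes): kaiming_normal_ and kaiming_uniform_ implement the He scheme, and the uniform variant is the "Kaiming Uniform" mentioned at the beginning of this post:

import torch
import torch.nn as nn

# An arbitrary convolutional layer; for a conv layer, fan_in = in_channels * kernel_size^2
conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)

# He (Kaiming) normal initialization for ReLU: N(0, 2 / fan_in)
nn.init.kaiming_normal_(conv.weight, mode='fan_in', nonlinearity='relu')

# The uniform variant ("Kaiming Uniform")
nn.init.kaiming_uniform_(conv.weight, mode='fan_in', nonlinearity='relu')

# Biases are commonly just set to zero
if conv.bias is not None:
    nn.init.zeros_(conv.bias)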

The advantage of He Initialization is that it can effectively avoid vanishing or exploding gradients and can accelerate the convergence of the neural network. Compared with Xavier Initialization, He Initialization is better suited to the ReLU activation function, because it compensates for the fact that ReLU zeroes out roughly half of its inputs.

It should be noted that although He Initialization performs well in most cases, it may still cause the weights to be too small or too large in some cases, thereby affecting the performance of the network. Therefore, improved techniques such as LeCun Initialization and its variants have been proposed to further improve weight initialization.

In general, He Initialization is a simple and effective neural network weight initialization method, especially for the case where the activation function is ReLU. By adjusting the range of initialization weights, the training speed and performance of the network can be effectively improved.

He initialization originally appeared in the 2015 paper "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification".

Orthogonal Initialization

Orthogonal Initialization is a neural network weight initialization method whose goal is to make the weight matrix orthogonal. Orthogonalization means turning each column of the weight matrix into a unit vector that is orthogonal to every other column vector. Orthogonal matrices have useful mathematical properties, such as preserving the lengths of and angles between vectors, so they can effectively reduce redundancy and overfitting in neural networks.

The formula for Orthogonal Initialization is:

$$ W_{ij} \sim \mathcal{U}(-1, 1) $$

where $W_{ij}$ is the weight connecting the $i$-th input neuron and the $j$-th output neuron, and $\mathcal{U}(-1, 1)$ denotes a uniform distribution between -1 and 1.

Next, the initial weight matrix is orthogonalized using QR decomposition. Specifically, the initial weight matrix $W$ is decomposed as $W = QR$, where $Q$ is an orthogonal matrix and $R$ is an upper triangular matrix. $Q$ is then used as the orthogonalized weight matrix.
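
A minimal PyTorch sketch (using an arbitrary square layer for simplicity), showing both the built-in helper and a manual QR-based version of the recipe above:

import torch
import torch.nn as nn

# An arbitrary square layer, for simplicity
fc = nn.Linear(128, 128)

# Built-in orthogonal initializer (also handles non-square weight matrices)
nn.init.orthogonal_(fc.weight)

# Manual version following the QR recipe above
W = torch.empty(128, 128).uniform_(-1.0, 1.0)  # random initial matrix, U(-1, 1)
Q, R = torch.linalg.qr(W)                      # W = QR, with Q orthogonal
with torch.no_grad():
    fc.weight.copy_(Q)

# Sanity check: the columns of Q are orthonormal, so Q^T Q is (close to) the identity
print(torch.allclose(Q.T @ Q, torch.eye(128), atol=1e-5))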

The advantage of Orthogonal Initialization is that it can effectively reduce redundancy and overfitting in the neural network and improve its generalization ability and performance. Its disadvantage is its relatively high computational cost, which may make it unsuitable for very large neural networks.

In general, Orthogonal Initialization is an effective neural network weight initialization method, especially for situations where redundancy and overfitting need to be reduced. By orthogonalizing the weight matrix, the generalization ability and performance of the neural network can be improved.

Orthogonal initialization originally appeared in the 2014 paper " Exact solutions to the nonlinear dynamics of learning in deep linear neural networks ".

Sparse Initialization

Sparse Initialization is a neural network weight initialization method whose goal is to set most of the elements of the weight matrix to zero, i.e., to make it sparse. Sparsity can reduce redundancy and overfitting in neural networks and improve the generalization ability and performance of the network. Sparse Initialization is often used in convolutional neural networks.

The formula for Sparse Initialization is:

$$ W_{ij} \sim \mathcal{N}(0, \sigma^2) \cdot B_{ij} $$

where $W_{ij}$ is the weight connecting the $i$-th input neuron and the $j$-th output neuron, $\mathcal{N}(0, \sigma^2)$ denotes a Gaussian distribution with mean 0 and variance $\sigma^2$, and $\mathbf{B}$ is an $m \times n$ binary matrix, where $m$ is the number of input neurons and $n$ is the number of output neurons. Each element of $\mathbf{B}$ is 0 or 1, and the number of 1s is $\rho mn$, where $\rho$ is a parameter that controls the sparsity.

In practice, $\rho$ is usually set to 0.1 or 0.01, so that most of the elements of the weight matrix are zero. This sparsity can reduce redundancy and overfitting in neural networks, improving the generalization ability and performance of the network.
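
A minimal PyTorch sketch of this idea (the values of $\rho$ and $\sigma$ are only examples). Note that torch.nn.init.sparse_ expects the fraction of entries to zero out in each column, i.e. $1 - \rho$ in the notation above; a manual mask-based version is also shown:

import torch
import torch.nn as nn

fc = nn.Linear(256, 128)   # arbitrary layer sizes
rho, sigma = 0.1, 0.01     # example fraction of non-zero weights and standard deviation

# Built-in helper: `sparsity` is the fraction of entries zeroed per column, i.e. 1 - rho
nn.init.sparse_(fc.weight, sparsity=1.0 - rho, std=sigma)

# Manual version: Gaussian values multiplied element-wise by a random binary mask B
with torch.no_grad():
    W = torch.randn_like(fc.weight) * sigma         # entries drawn from N(0, sigma^2)
    B = (torch.rand_like(fc.weight) < rho).float()  # about rho * m * n ones
    fc.weight.copy_(W * B)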

It should be noted that Sparse Initialization may cause some elements in the weight matrix to be too large or too small, thereby affecting the performance of the network. Therefore, some improved techniques, such as weight pruning and its variants, have been proposed to further improve the performance of sparse initialization.

In general, Sparse Initialization is an effective neural network weight initialization method, especially for situations where redundancy and overfitting need to be reduced. By setting most of the elements in the weight matrix to zero, the generalization ability and performance of the neural network can be improved.

Initialize the weights of the convolutional layer using Xavier

Build a simple convolutional neural network with PyTorch and initialize the weights of its convolutional layers with Xavier initialization.

The following is an example of building a simple convolutional neural network, where the weights of the convolutional layers are initialized using the Xavier initialization method:

import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()

        # Define the convolutional layers
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1, padding=1)

        # Define the pooling layer
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

        # Define the fully connected layers
        self.fc1 = nn.Linear(in_features=64 * 8 * 8, out_features=512)
        self.fc2 = nn.Linear(in_features=512, out_features=10)

        # Initialize the convolutional layer weights with Xavier (uniform variant)
        nn.init.xavier_uniform_(self.conv1.weight)
        nn.init.xavier_uniform_(self.conv2.weight)
        
    def forward(self, x):
        # Forward pass
        x = self.conv1(x)
        x = nn.functional.relu(x)
        x = self.pool(x)

        x = self.conv2(x)
        x = nn.functional.relu(x)
        x = self.pool(x)

        x = torch.flatten(x, start_dim=1)

        x = self.fc1(x)
        x = nn.functional.relu(x)

        x = self.fc2(x)

        return x

Here, we define a class called SimpleCNN, which inherits from nn.Module. In the constructor, we define two convolutional layers, one pooling layer, and two fully connected layers, and use the nn.init.xavier_uniform_() function to apply Xavier initialization to the weights of the convolutional layers. In the forward function, the network is built as two blocks of convolution, ReLU activation, and pooling, followed by flattening and two fully connected layers with a ReLU activation in between.

It should be noted that in this example, Xavier initialization is only applied to the weights of the convolutional layers; the weight initialization of other layers can be specified in a similar way. In addition, this example is just a simple convolutional neural network; in practice, more complex network designs may be required depending on the specific task.
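
As a quick sanity check (assuming 3-channel 32×32 inputs, which is what the 64 * 8 * 8 flattened size implies), the model can be run on a random batch:

model = SimpleCNN()
x = torch.randn(4, 3, 32, 32)  # a batch of 4 RGB images of size 32x32
out = model(x)
print(out.shape)               # expected: torch.Size([4, 10])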

References

ChatGPT

Origin: blog.csdn.net/qq_41990294/article/details/130084587