Introduction to CV (1) - weight initialization

Reference

Weight Initialization in Neural Networks: A Journey from Basics to Kaiming (Highly Recommended)


I recently came across the EqualConv2d convolution operation in StyleGAN, and I could not understand why a scaling factor scale is multiplied onto the weight w. While searching for an explanation I found the article in the reference, which gave me a new perspective. I had always known that a network's weights need to be initialized, but I never really understood why. So let's revisit the topic; this post is mainly a summary and reproduction of that reference.

1. First of all - why do we need to initialize the weights?

The purpose of weight initialization is to prevent layer activations from exploding or vanishing during the forward pass through a deep neural network. Either failure mode is bad for gradient propagation: if the loss gradients are too small, the network takes much longer to converge; if they are too large, training can diverge outright.
Matrix multiplication is the basic mathematical operation of a neural network. We will use the most basic network possible to simulate a deep, multi-layer model; to simplify further, we leave out activation functions at first and use plain matrix multiplications only.
Assume we have a vector x holding the network input; as usual, we draw it from a normal distribution with mean 0 and standard deviation 1.

x = torch.randn(512)

Furthermore, we assume that our input goes through a simple network of 100 layers, each layer containing a weight matrix a. To complete a single forward pass, we need to perform 100 consecutive matrix multiplications.

for i in range(100):
    a = torch.randn(512, 512)
    x = a @ x
print(x.mean(), x.std())

tensor(nan) tensor(nan)

It turns out that scaling both the inputs and the weights to the standard normal distribution is not a good idea. Somewhere in the 100 matrix multiplications the layer outputs become so large that the computer reports them as nan. To see exactly at which layer this happens, we can check when the standard deviation overflows.

import torch

x = torch.randn(512)
for i in range(100):
    a = torch.randn(512, 512)
    x = a @ x
    if torch.isnan(x.std()):
        print(i)
        break
print(x.mean(),x.std())

27
tensor(nan) tensor(nan)

The output can no longer be computed at the 28th matrix multiplication (i = 27): our initial weights were too large.
We also need to worry about vanishing outputs, which lead to vanishing gradients. To see this, we can make the initial weights of our simple network very small, for example with mean 0 and standard deviation 0.01.

import torch

x = torch.randn(512)
for i in range(100):
    a = torch.randn(512, 512) * 0.01
    x = a @ x
print(x.mean(),x.std())

tensor(0.) tensor(0.)

With such small initial weights, the outputs collapse to all zeros and the gradient vanishes.
To sum up: if the initial weights are too large the network suffers from exploding gradients, and if they are too small it suffers from vanishing gradients.

2. So how do we find a good initialization?

Remember, as mentioned above, that the math needed for the forward pass of a neural network is just a series of matrix multiplications. If our output y is the matrix product of the input vector x and the weight matrix a, then each element i of y is defined as

$y_i = \sum_{k=0}^{n-1} a_{i,k} \, x_k$

where i is a row index into the weight matrix a, k is a column index into a (and thus also an element index into the input vector x), and n is the length of x, i.e. the total number of elements in x.

y[i] = sum([c * d for c, d in zip(a[i], x)])

We can show that, at a given layer, the matrix product of an input x drawn from the standard normal distribution with a weight matrix a drawn from the standard normal distribution has, on average, a standard deviation very close to the square root of the number of input connections, in this case $\sqrt{512}$.

import math

import torch

mean = 0.
var = 0.

for i in range(10000):
    x = torch.randn(512)
    a = torch.randn(512, 512)
    y = a @ x
    mean += y.mean().item()
    var += y.pow(2).mean().item()
print(mean/10000, math.sqrt(var/10000))
print('sqrt512:',math.sqrt(512))

-0.002526797866821289 22.61723591646642
sqrt512: 22.627416997969522

This is not surprising given the definition of matrix multiplication: to compute one element of y we sum 512 products, each multiplying an element of the input x by an element of a row of the weight matrix a, where both x and the entries of a follow the standard normal distribution and are independent of each other.
Each term in that sum is therefore the product of two independent normal random variables. For such a product:
Let $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$, with $X$ and $Y$ independent. Then

$\operatorname{var}(XY) = E[(XY)^2] - (E[XY])^2 = E[X^2]E[Y^2] - (E[X]E[Y])^2$

Since $E[X^2] = \mu_1^2 + \sigma_1^2$ and $E[Y^2] = \mu_2^2 + \sigma_2^2$, substituting gives

$\operatorname{var}(XY) = (\mu_1^2 + \sigma_1^2)(\mu_2^2 + \sigma_2^2) - (\mu_1 \mu_2)^2$

In our case everything is standard normal, $\mu = 0$ and $\sigma = 1$, so each of the 512 product terms has mean 0 and variance 1. What we finally need is the sum of these 512 terms, and for independent random variables the means and variances add:

$E(X + Y) = \mu_1 + \mu_2, \qquad \operatorname{var}(X + Y) = \sigma_1^2 + \sigma_2^2$

Adding 512 independent terms, each with mean 0 and variance 1, therefore gives a distribution $N(0, \sqrt{512}^2)$.
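As a quick numerical sanity check of these two facts (a small sketch of my own, not part of the original post; the sample sizes are arbitrary):

import math
import torch

# 1) The product of two independent standard normals has mean 0 and variance 1.
u = torch.randn(1_000_000)
v = torch.randn(1_000_000)
prod = u * v
print(prod.mean().item(), prod.var().item())   # roughly 0 and 1

# 2) Summing 512 such independent products gives variance 512, i.e. std sqrt(512) ~ 22.6.
sums = (torch.randn(10_000, 512) * torch.randn(10_000, 512)).sum(dim=1)
print(sums.std().item(), math.sqrt(512))
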
That is to say, after a single matrix multiplication the output already follows a $N(0, \sqrt{512}^2)$ distribution. After several such multiplications the spread of the output becomes enormous, which is exactly the example above: after 27 layers of matrix multiplication the values overflow and the gradients explode. Likewise, when the output distribution becomes narrower than $N(0, 1)$, repeated matrix multiplications shrink it further and the gradients vanish.
But what do we actually want? We want the output of each layer to stay close to a standard normal distribution. So, across these 100 matrix multiplications, how do we make sure the output still follows the standard normal distribution? Clearly, we just need to scale our weights by $1/\sqrt{512}$.

import math
import torch

mean = 0.
var = 0.

for i in range(10000):
    x = torch.randn(512)
    a = torch.randn(512, 512) / math.sqrt(512)
    y = a @ x
    mean += y.mean().item()
    var += y.pow(2).mean().item()
print(mean/10000, math.sqrt(var/10000))

-0.0007218372397474013 0.9998939706858075

Let's run our simplified 100-layer network again with this scaling.

import math
import torch

x = torch.randn(512)
for i in range(100):
    a = torch.randn(512, 512) / math.sqrt(512)
    x =  a @ x
print(x.mean(), x.std())

tensor(0.0217) tensor(0.9125)

After 100 matrix multiplications the layer output is still close to a standard normal distribution, so neither exploding nor vanishing gradients occur.

At this point our toy network works when built by hand, but a real neural network also needs activation functions to realize non-linear mappings. It is precisely the non-linear activations placed at the end of each layer that allow deep neural networks to approximate the complex functions that describe real-world phenomena, leading to their astonishing results.

3. Xavier initialization

In the early days of neural networks, the most commonly used activation functions were symmetric about zero and saturated toward fixed values as the input moves away from the center. tanh() and softsign() are two such functions.

import torch
import matplotlib.pyplot as plt
import mpl_toolkits.axisartist as axisartist

# Create a figure
fig = plt.figure('activate', (10, 8))
ax = axisartist.Subplot(fig, 1, 1, 1)
fig.add_axes(ax)

ax.axis[:].set_visible(False)
# Create movable axes through the origin
ax.axis["x"] = ax.new_floating_axis(0, 0)
ax.axis["y"] = ax.new_floating_axis(1, 0)
ax.axis["x"].set_axis_direction('top')
ax.axis["y"].set_axis_direction('left')

x = torch.arange(-10, 10, 0.01)
y_t = torch.tanh(x)
y_s = torch.nn.functional.softsign(x)
plt.xticks(torch.arange(-10, 11, 2))
plt.yticks(torch.arange(-1, 1, 0.25))
plt.scatter(x, y_t)
plt.scatter(x, y_s)
plt.legend(labels=('tanh', 'softsign'), loc='upper left', prop={'size': 16})
plt.show()

(Figure: tanh and softsign activation curves)
Let's add an activation function to our 100-layer simple network. Suppose we use the hyperbolic tangent activation tanh, and keep scaling the layer weights by $1/\sqrt{n}$.

import math
import torch

x = torch.randn(512)
for i in range(100):
    a = torch.randn(512, 512) / math.sqrt(512)
    x =  a @ x
    x = torch.tanh(x)
print(x.mean(), x.std())

tensor(-0.0015) tensor(0.0836)

The standard deviation of the output has become very small; if we stacked more layers, the gradients would eventually vanish.

In fact, around 2010 the conventional weight initialization was not the one we just used. The more common "standard" scheme was to sample weights uniformly from [-1, 1] and then scale them by $1/\sqrt{n}$.
It turns out this standard method does not actually work very well.

import math
import torch

x = torch.randn(512)
for i in range(100):
    a = torch.Tensor(512, 512).uniform_(-1, 1) * math.sqrt(1.0/512)
    x =  a @ x
    x = torch.tanh(x)
print(x.mean(), x.std())

tensor(2.8467e-26) tensor(1.8184e-24)

Its performance is even worse than the initialization we just derived; the outputs have essentially vanished.
This poor performance is what prompted Xavier Glorot and Yoshua Bengio to publish their landmark paper Understanding the difficulty of training deep feedforward neural networks, in which they refer to the uniform [-1, 1] scheme scaled by $1/\sqrt{n}$ as the "standard initialization" and propose what is now commonly called "Xavier" initialization.
Xavier initialization sets a layer's weights to values drawn from a uniform distribution bounded by

$\pm \frac{\sqrt{6}}{\sqrt{n_i + n_{i+1}}}$

where $n_i$ is the number of incoming connections (fan-in) and $n_{i+1}$ the number of outgoing connections (fan-out) of the layer.
Glorot and Bengio argue that this initialization keeps the variance of the activations and of the back-propagated gradients roughly the same whether you move up or down the network. In their experiments, they observed that Xavier initialization enabled a 5-layer network to maintain nearly identical variances of its weight gradients across layers.
(Figure from Glorot & Bengio: weight gradient distributions across layers with Xavier initialization)

In contrast, their experiments showed that with the "standard" initialization the gradients in the higher layers of the network are pushed close to zero.
(Figure from Glorot & Bengio: weight gradient distributions across layers with the standard initialization)

Let's run our 100-layer tanh network again, this time with Xavier initialization:

import math
import torch

def xavier(in_channels, out_channels):
    return torch.Tensor(in_channels, out_channels).uniform_(-1, 1) * math.sqrt(6.0 / (in_channels + out_channels))

x = torch.randn(512)
for i in range(100):
    a = xavier(512, 512)
    x =  a @ x
    x = torch.tanh(x)
print(x.mean(), x.std())

tensor(-0.0014) tensor(0.0540)

After Xavier initialization, the mean and standard deviation we obtain are almost the same as with our own scaling method.
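For reference, PyTorch also ships this scheme as the built-in initializer torch.nn.init.xavier_uniform_. Here is a minimal sketch of my own comparing it with the hand-written xavier() above (the bound check is only illustrative):

import math
import torch
from torch import nn

w = torch.empty(512, 512)
nn.init.xavier_uniform_(w)   # uniform in (-bound, bound), bound = sqrt(6 / (fan_in + fan_out))

bound = math.sqrt(6.0 / (512 + 512))
print(w.abs().max().item() <= bound)          # True
print(w.std().item(), bound / math.sqrt(3))   # both close to ~0.044, like our xavier(512, 512)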

4. Kaiming initialization

Conceptually, when we use activation functions that are symmetric about zero and output values in [-1, 1], such as softsign and tanh, we want the activations of each layer to have a mean of 0 and a standard deviation around 1. This is exactly what both our basic method and Xavier initialization achieve.
But what about the activations that are far more popular now, such as ReLU or LeakyReLU? Does it still make sense to scale the weights the same way?

import torch
import matplotlib.pyplot as plt
import mpl_toolkits.axisartist as axisartist

# Create a figure
fig = plt.figure('activate', (10, 8))
ax = axisartist.Subplot(fig, 1, 1, 1)
fig.add_axes(ax)

ax.axis[:].set_visible(False)
# Create movable axes through the origin
ax.axis["x"] = ax.new_floating_axis(0, 0)
ax.axis["y"] = ax.new_floating_axis(1, 0)
ax.axis["x"].set_axis_direction('top')
ax.axis["y"].set_axis_direction('left')

x = torch.arange(-5, 5, 0.01)
y_t = torch.relu(x)
plt.xticks(torch.arange(-5, 6, 2))
plt.yticks(torch.arange(0, 7, 2))
plt.scatter(x, y_t, label='ReLU')
plt.legend(loc='upper left', prop={'size': 16})
plt.show()

(Figure: the ReLU activation function)

$\mathrm{ReLU}(x) = \max(0, x)$
To check whether Xavier initialization still works with ReLU, we swap tanh for ReLU and run our 100-layer simple network again.

import math
import torch

def xavier(in_channels, out_channels):
    return torch.Tensor(in_channels, out_channels).uniform_(-1, 1) * math.sqrt(6.0 / (in_channels + out_channels))

x = torch.randn(512)
for i in range(100):
    a = xavier(512, 512)
    x =  a @ x
    x = torch.relu(x)
print(x.mean(), x.std())

tensor(2.1620e-16) tensor(3.2313e-16)

We see vanishing outputs again, which shows that Xavier initialization is not well suited to the ReLU activation.

Let's dig deeper into why, taking the same approach we used to derive our basic initialization. Look at the standard deviation of the output after applying the ReLU function:

import math
import torch

mean = 0
var = 0
for i in range(10000):
    x = torch.randn(512)
    a = torch.randn(512, 512)
    y = torch.relu(a @ x)
    mean += y.mean().item()
    var += y.pow(2).mean().item()
print(mean/10000, math.sqrt(var/10000))
print(math.sqrt(512/2))

9.025176297998428 16.00142959816968
16.0

It turns out that with ReLU activation, the standard deviation of a single layer's output is, on average, very close to the square root of the number of input connections divided by 2, in this case $\sqrt{512/2}$.
So this time let's scale the weights by $\sqrt{2/512}$.

import math
import torch

mean = 0
var = 0
for i in range(10000):
    x = torch.randn(512)
    a = torch.randn((512, 512)) * math.sqrt(2/512.)
    y = torch.relu(a @ x)
    mean += y.mean().item()
    var += y.pow(2).mean().item()
print(mean/10000, math.sqrt(var/10000))

0.563040738016367 0.9990654915916081

Scaled this way, the output's standard deviation stays very close to 1. As shown before, keeping the standard deviation of layer activations around 1 lets us stack many more layers in a deep neural network without exploding or vanishing gradients.

import math
import torch

x = torch.randn(512)
for i in range(100):
    a = torch.randn((512, 512)) * math.sqrt(2/512.)
    x =  a @ x
    x = torch.relu(x)
print(x.mean(), x.std())

tensor(0.4214) tensor(0.5959)

We find that with weights initialized this way, even after 100 layers of ReLU-activated matrix multiplications the network still produces outputs of a healthy magnitude, so gradients can still flow back through it.
This exploration of how best to initialize weights in networks with ReLU-like activations is what motivated Kaiming He et al. to propose Kaiming initialization, a scheme tailored to deep neural networks that use these asymmetric, non-linear activations.
In their 2015 paper, they showed that deep networks (for example a 22-layer CNN) converge earlier if the following weight initialization strategy is used:

  1. Create a tensor with the dimensions of the weight matrix of the given layer and fill it with numbers drawn from a standard normal distribution.
  2. Multiply each of those numbers by $\sqrt{2/n}$, where n is the number of incoming connections (fan-in) of the layer (see the sketch after this list).
  3. Initialize the bias tensors to zero.
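
A minimal sketch of these three steps, together with PyTorch's built-in equivalent nn.init.kaiming_normal_ (the comparison is an addition of mine, not something from the He et al. paper):

import math
import torch
from torch import nn

def kaiming(in_channels, out_channels):
    # Steps 1 and 2: standard-normal weights scaled by sqrt(2 / fan_in).
    return torch.randn(out_channels, in_channels) * math.sqrt(2.0 / in_channels)

w_manual = kaiming(512, 512)
b = torch.zeros(512)   # step 3: biases start at zero

w_builtin = torch.empty(512, 512)
nn.init.kaiming_normal_(w_builtin, mode='fan_in', nonlinearity='relu')

print(w_manual.std().item(), w_builtin.std().item(), math.sqrt(2.0 / 512))   # all roughly 0.0625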

5. Equalized Learning Rate (GAN)

Equalized learning rate (ELR) is a training trick used in StyleGAN to stabilize and improve training.
The idea is to rescale each layer's weights on every forward pass, where the scaling factor is determined by the layer's fan-in, i.e. the number of input features feeding each output unit:

import math

import torch
from torch import nn
from torch.nn import functional as F


class EqualConv2d(nn.Module):
    def __init__(self, in_channel, out_channel, kernel_size, stride=1, padding=0, bias=True):
        super().__init__()

        # Weights are stored as plain N(0, 1) samples...
        self.weight = nn.Parameter(torch.randn(out_channel, in_channel, kernel_size, kernel_size))
        # ...and the Kaiming-style factor 1 / sqrt(fan_in) is applied at every forward pass.
        self.scale = 1 / math.sqrt(in_channel * kernel_size ** 2)

        self.stride = stride
        self.padding = padding

        if bias:
            self.bias = nn.Parameter(torch.zeros(out_channel))
        else:
            self.bias = None

    def forward(self, input):

        return F.conv2d(input, self.weight * self.scale, bias=self.bias, stride=self.stride, padding=self.padding)

    def __repr__(self):
        return (
            f'{self.__class__.__name__}({self.weight.shape[1]}, {self.weight.shape[0]},'
            f' {self.weight.shape[2]}, stride={self.stride}, padding={self.padding})'
        )


class EqualLinear(nn.Module):
    def __init__(self, in_dim, out_dim, bias=True, bias_init=0, lr_mul=1, activation=None):
        super().__init__()

        self.weight = nn.Parameter(torch.randn(out_dim, in_dim).div_(lr_mul))

        if bias:
            self.bias = nn.Parameter(torch.zeros(out_dim).fill_(bias_init))
        else:
            self.bias = None

        self.activation = activation

        self.scale = (1 / math.sqrt(in_dim)) * lr_mul
        self.lr_mul = lr_mul

    def forward(self, input):

        if self.activation:
            out = F.linear(input, self.weight * self.scale)
            # fused_leaky_relu (fused bias add + LeakyReLU) is a custom op from the StyleGAN2 codebase
            out = fused_leaky_relu(out, self.bias * self.lr_mul)
        else:
            out = F.linear(input, self.weight * self.scale, bias=self.bias * self.lr_mul)

        return out

    def __repr__(self):
        return (
            f'{self.__class__.__name__}({self.weight.shape[1]}, {self.weight.shape[0]})'
        )

Specifically, my understanding is as follows (a personal reading; I'm still not entirely sure about it). We saw above that Kaiming initialization multiplies each randomly drawn standard-normal value by $\sqrt{2/n}$, where n is the layer's fan-in. We mainly analyzed fully connected networks before; returning to a deep convolutional network, a conv layer with in_channel input channels and a kernel_size × kernel_size kernel has a fan-in of in_channel · kernel_size² per output unit, which is exactly the quantity used in self.scale = 1 / math.sqrt(in_channel * kernel_size ** 2) above.
ELR therefore acts like a "permanent" initialization applied to every convolutional layer: instead of baking the scale into the weights once at construction time, the layer stores unit-variance weights and re-applies the scale on every forward pass, which keeps the layers well conditioned throughout training.
Another way to look at it is that layers of different sizes end up with different effective learning rates: a layer with a larger fan-in takes smaller effective steps on its weights, which helps reduce the tendency of GAN training to collapse.
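To make the scale concrete (illustrative numbers of my own, assuming a hypothetical EqualConv2d(512, 512, kernel_size=3) layer): the fan-in is 512 · 3 · 3 = 4608, so every forward pass multiplies the stored N(0, 1) weights by 1/√4608 ≈ 0.0147.

import math

in_channel, kernel_size = 512, 3
fan_in = in_channel * kernel_size ** 2          # 4608
scale = 1 / math.sqrt(fan_in)                   # the factor EqualConv2d applies on every forward pass
print(fan_in, scale)                            # 4608  ~0.0147

# For comparison, Kaiming initialization for ReLU would instead draw weights with std sqrt(2 / fan_in)
# once at construction time, rather than rescaling unit-variance weights on every forward pass.
print(math.sqrt(2 / fan_in))                    # ~0.0208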

end


Original article: blog.csdn.net/REstrat/article/details/127208998