Artificial Intelligence Series Experiments (4) - Comparison of Various Neural Network Parameter Initialization Methods (Xavier Initialization and He Initialization)

This experiment uses Python to build a shallow neural network that separates regions of differently colored points. Three initialization methods are used: all-zero initialization, random initialization, and He initialization, and the impact of the initialization method on the final prediction performance is compared.

Experimental principle:

Why initialize weights

The purpose of weight initialization is to keep activations and loss gradients from exploding or vanishing as they propagate through a deep neural network. If either happens, the loss gradients become too large or too small to backpropagate effectively, and even when training remains possible, the network takes longer to converge.
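
As a quick illustration (a minimal sketch, not part of the original experiment), the snippet below pushes a random input through ten fully connected tanh layers and prints the standard deviation of the activations; with a weight scale that is too small the activations shrink toward zero, and with one that is too large they saturate:

import numpy as np

np.random.seed(0)
a0 = np.random.randn(100, 1)               # a random 100-dimensional input

for scale in (0.01, 1.0):                  # weight scale: too small vs. too large
    a = a0
    for _ in range(10):                    # 10 tanh layers of width 100
        W = np.random.randn(100, 100) * scale
        a = np.tanh(W @ a)
    print(scale, float(a.std()))           # activations shrink toward 0 or saturate near ±1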

All 0 initialization

All-zero initialization is the worst initialization method and is only suitable for single-neuron networks, such as the one in Artificial Intelligence Series Experiments (1) - Binary Classification Single-Layer Neural Network for Recognizing Cats. For any neural network with a hidden layer, if all the weights are initialized to the same value, a layer of multiple neurons behaves exactly like a single neuron: it cannot learn multiple features and wastes computing power.

All-zero initialization is a special case of initializing every weight to the same value. By the chain rule used in backpropagation, the weights of a network with hidden layers then never break this symmetry and are never effectively updated, so the cost function never decreases.
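
To make this concrete, here is a minimal sketch (not taken from the experiment's code) showing that when every weight in a hidden layer is identical, all of its neurons compute exactly the same activation, so the layer is no more expressive than a single neuron:

import numpy as np

x = np.array([[0.5], [-1.2]])              # a single 2-dimensional input
W1 = np.zeros((4, 2))                      # 4 hidden neurons with identical (all-zero) weights
b1 = np.zeros((4, 1))
a1 = np.maximum(0, W1 @ x + b1)            # ReLU activations
print(a1.ravel())                          # all four hidden neurons output the same value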

Random initialization

Random initialization is the most common initialization method: all W parameters are initialized randomly, and all b parameters are initialized to 0.
In Python, np.random.randn and np.zeros are used for this initialization.

Xavier initialization

Condition: during forward propagation, the variance of the activation values remains constant; during backward propagation, the variance of the gradient with respect to the state values remains constant. Here, the activation value is the output of the activation function, and the state value is its input (the pre-activation).
The initialization method is
$$W \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_i + n_{i+1}}},\ \frac{\sqrt{6}}{\sqrt{n_i + n_{i+1}}}\right]$$
where $n_i$ and $n_{i+1}$ are the numbers of neurons in layers $i$ and $i+1$ (the fan-in and fan-out of the weight matrix). Xavier initialization is mainly suitable for S-shaped activation functions such as tanh and sigmoid.
Paper address: Understanding the difficulty of training deep feedforward neural networks
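
The experiment below only implements the all-zero, random, and He schemes. For completeness, here is a hedged sketch of how Xavier (uniform) initialization could be written in the same style; layers_dims is assumed to list the number of neurons in each layer, and the function name is illustrative only:

import numpy as np

# Sketch of Xavier (uniform) initialization; not part of the original experiment.
def initialize_parameters_xavier(layers_dims):
    parameters = {}
    L = len(layers_dims)

    for l in range(1, L):
        # uniform bound sqrt(6) / sqrt(fan_in + fan_out)
        limit = np.sqrt(6) / np.sqrt(layers_dims[l - 1] + layers_dims[l])
        parameters['W' + str(l)] = np.random.uniform(-limit, limit, (layers_dims[l], layers_dims[l - 1]))
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters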

He initialization (Kaiming initialization)

Condition: during forward propagation, the variance of the state values remains constant; during backward propagation, the variance of the gradient with respect to the activation values remains constant.
The initialization method is (for ReLU)
$$W \sim N\left(0,\ \sqrt{\frac{2}{\hat{n}_i}}\right)$$
where $\hat{n}_i$ is the number of neurons in the previous layer. In Python, the implementation starts from random initialization and multiplies by np.sqrt(u / layers_dims[l - 1]), where layers_dims[l - 1] is the number of neurons in the previous layer and u is an adjustable parameter. In practice, u = 1 is used for tanh and u = 2 for ReLU.
The He initialization method is mainly suitable for activation functions such as ReLU and Leaky ReLU.
Paper address: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
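
A minimal sketch of that general scaled random initialization (the function name and the u parameter interface are illustrative, not from the original code) could look like this:

import numpy as np

# Scaled random initialization as described above: u = 1 for tanh, u = 2 for ReLU.
def initialize_parameters_scaled(layers_dims, u):
    parameters = {}
    L = len(layers_dims)

    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l - 1]) * np.sqrt(u / layers_dims[l - 1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters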

Experimental environment: the numpy, matplotlib, and sklearn libraries in Python

import numpy as np
import matplotlib.pyplot as plt
import sklearn  # for data mining, data analysis, and machine learning
import sklearn.datasets

Training samples: 300 2D coordinates of colored points

Test samples: 100 2D coordinates of colored points

Both training samples and test samples are generated by the sklearn.datasets.make_circles function.

train_X, train_Y = sklearn.datasets.make_circles(n_samples=300, noise=.05)
test_X, test_Y = sklearn.datasets.make_circles(n_samples=100, noise=.05)

For the complete code of this experiment, see:
https://github.com/PPPerry/AI_projects/tree/main/4.weights_initialize

The constructed neural network model is as follows:

[Figure: the neural network model]
Our goal is to train this neural network to predict the most likely color of a point in the coordinate plane, for example the point at (-4, 2) or the point at (1, -2), and in this way separate the regions of red and blue points.

Based on the neural network model, the code implementation is as follows:

The code that builds the network itself is similar to Artificial Intelligence Series Experiments (2) - Shallow Neural Network for Distinguishing Different Color Regions and is not repeated here; the parts shown below load the data and implement the different initialization methods.

First, load the training and test data.

def load_dataset():
    # generate the training and test sets of colored points
    train_X, train_Y = sklearn.datasets.make_circles(n_samples=300, noise=.05)
    test_X, test_Y = sklearn.datasets.make_circles(n_samples=100, noise=.05)

    # visualize the training set
    plt.scatter(train_X[:, 0], train_X[:, 1], c=train_Y, s=40, cmap=plt.cm.Spectral)
    plt.show()
    # transpose the inputs and reshape the labels so that each column is one example
    train_X = train_X.T
    train_Y = train_Y.reshape((1, train_Y.shape[0]))
    test_X = test_X.T
    test_Y = test_Y.reshape((1, test_Y.shape[0]))
    return train_X, train_Y, test_X, test_Y
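
After this preprocessing the data are stored with one example per column. A quick shape check (assuming the function above) looks like this:

train_X, train_Y, test_X, test_Y = load_dataset()
print(train_X.shape, train_Y.shape)   # expected: (2, 300) (1, 300)
print(test_X.shape, test_Y.shape)     # expected: (2, 100) (1, 100)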

Draw the data set as follows:
[Figure: scatter plot of the training data]
Second, construct each initialization function for the neural network and observe the corresponding prediction results in turn.

1. All 0 initialization

# Method 1: all-zero initialization
def initialize_parameters_zeros(layers_dims):
    parameters = {}
    L = len(layers_dims)  # number of layers, including the input layer

    for l in range(1, L):
        parameters['W' + str(l)] = np.zeros((layers_dims[l], layers_dims[l - 1]))
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters
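
As a quick check (with hypothetical layer sizes, since the actual ones come from the model above), every entry of every parameter matrix starts at zero:

parameters = initialize_parameters_zeros([2, 10, 5, 1])              # hypothetical layer sizes
print(parameters['W1'].shape)                                        # (10, 2)
print(np.all(parameters['W1'] == 0), np.all(parameters['b1'] == 0))  # True True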

The final effect is as follows:
[Figure: result with all-zero initialization]
The cost function does not change at all, which means the network fails to learn.

2. Random initialization

# Method 2: random initialization
def initialize_parameters_random(layers_dims):
    np.random.seed(3)
    parameters = {}
    L = len(layers_dims)

    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l - 1]) * 10  # deliberately large scale
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters

(The random values are deliberately scaled up here, by the factor of 10 above, to make the comparison with the other initialization methods more obvious.)
The final effect is as follows:
[Figures: results with random initialization]
The prediction performance with random initialization is not very good: a poorly chosen initialization scale hurts training efficiency, and the network needs a long training time to approach the ideal parameter values.

3. He initialization

# Method 3: He initialization
def initialize_parameters_he(layers_dims):
    np.random.seed(3)
    parameters = {}
    L = len(layers_dims) - 1  # number of layers, excluding the input layer

    for l in range(1, L + 1):
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l - 1]) * np.sqrt(2 / layers_dims[l - 1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))

    return parameters
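
A quick sanity check (again with hypothetical layer sizes) is to confirm that the empirical standard deviation of a weight matrix is close to sqrt(2 / fan_in):

parameters = initialize_parameters_he([2, 100, 100, 1])    # hypothetical layer sizes
print(parameters['W2'].std())                              # should be close to sqrt(2 / 100)
print(np.sqrt(2 / 100))                                    # ≈ 0.1414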

The final effect is as follows:
[Figures: results with He initialization]
With He initialization, the prediction result is close to ideal.

Conclusion:

In this experiment, we can clearly see that with the network architecture kept identical, changing only the parameter initialization method leads to a qualitative difference in prediction performance. The choice of initialization method plays a vital role in both the prediction accuracy and the convergence speed of a neural network.

Previous artificial intelligence series experiments:

Artificial Intelligence Series Experiments (1) - Binary Classification Single-Layer Neural Network for Recognizing Cats
Artificial Intelligence Series Experiments (2) - Shallow Neural Network for Distinguishing Different Color Regions
Artificial Intelligence Series Experiments (3) - Binary Classification Deep Neural Network for Cat Recognition

Origin: blog.csdn.net/qq_43734019/article/details/120242855