Parameter initialization in deep learning

1. Parameter initialization classification and principle

1. Introduction

  • Parameter learning during neural network training is optimized with gradient descent, which requires assigning an initial value to each parameter at the start of training. The choice of this initial value is critical. In general, we want the mean of the data and of the parameters to be 0, and the variance of the input and output data to be consistent. In practice, initializing the parameters from a Gaussian distribution or a uniform distribution works well.

  • A well-chosen initialization can:
    • Speed up the convergence of gradient descent
    • Increase the odds of gradient descent converging to a lower training (and generalization) error
  • A poor initialization can:
    • Lead to vanishing/exploding gradients, which also slows down the optimization algorithm
  • Random initialization is used to break symmetry and make sure different hidden units can learn different things (see the short sketch after this list)
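
To illustrate the last point, here is a minimal NumPy sketch (an illustration, not from the article; the toy network, its sizes, and the seed are arbitrary assumptions). With a constant initialization every hidden unit receives an identical gradient and can never differentiate, while random initialization breaks the symmetry:

# Toy 1-hidden-layer tanh regression network; gradients computed by hand.
import numpy as np

np.random.seed(0)
X = np.random.randn(4, 3)             # 4 samples, 3 input features
t = np.random.randn(4, 1)             # regression targets

def grad_W1(W1, W2):
    h = np.tanh(X @ W1)               # hidden activations
    dy = 2 * (h @ W2 - t) / len(X)    # dL/dy for the MSE loss
    dh = (dy @ W2.T) * (1 - h**2)     # backprop through tanh
    return X.T @ dh                   # dL/dW1

# Constant init: every column (hidden unit) of the gradient is identical.
g = grad_W1(np.full((3, 5), 0.5), np.full((5, 1), 0.5))
print(np.allclose(g, g[:, :1]))       # True -> units stay identical forever

# Random init: the columns differ, so hidden units can learn different things.
g = grad_W1(np.random.randn(3, 5), np.random.randn(5, 1))
print(np.allclose(g, g[:, :1]))       # False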

2. Classification

(Figure: parameter initialization classification)

3. Principle

  • In order to prevent the signal from being over-amplified or over-attenuated as it passes through a multi-layer network, we try to keep the variance of the input and output of each neuron as consistent as possible (a small simulation after the list below illustrates this).
    (Figure: parameter initialization principle)

  • Gaussian distribution

    • Mean: mean = 0
    • Xavier initialization: var = 1/n
    • He initialization: var = 2/n
  • Uniform distribution
    • The parameters obey a uniform distribution on [-r, r], with mean mean = (a + b)/2 = 0 and variance var = (b - a)^2 / 12 = r^2 / 3
    • Xavier initialization: var = 1/n, which gives r = √(3/n)
    • He initialization: var = 2/n, which gives r = √(6/n)
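
To make the principle concrete, the following small simulation (a sketch, not from the article; the width, depth, and purely linear layers are simplifying assumptions) propagates a unit-variance signal through many layers and compares the output variance under a naive small Gaussian initialization and under Xavier initialization with var = 1/n:

import numpy as np

np.random.seed(0)
n, depth = 256, 50
x0 = np.random.randn(1000, n)            # 1000 inputs with unit variance

def output_variance(std):
    x = x0
    for _ in range(depth):
        W = np.random.randn(n, n) * std  # zero-mean Gaussian weights
        x = x @ W                        # linear layer, no activation
    return x.var()

print(output_variance(0.01))              # collapses towards 0
print(output_variance(np.sqrt(1.0 / n)))  # stays on the order of 1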

4. Summary

  • When using the ReLU activation function (without BN), it is best to use He initialization to set the parameters to small random numbers drawn from a Gaussian or a uniform distribution.
  • When using BN, the network's dependence on the scale of the initial parameter values is reduced, and a small standard deviation (e.g. 0.01) can be used for initialization.
  • Using the parameters of a pre-trained model to initialize the parameters of a new task is also a simple and effective way to initialize model parameters.

2. Parameter initialization code practice

0. Conventions

n_in: the input size of the network (layer)
n_out: the output size of the network (layer)
n: either n_in, or (n_in + n_out) * 0.5 ---> (the latter keeps the signal from being amplified or attenuated in both forward and backward propagation)
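
So that the snippets below can be run as written, here is a minimal setup (the concrete sizes 784 and 256 are example values, not from the article; the snippets use TF 1.x-style APIs):

import numpy as np
import tensorflow as tf

n_in, n_out = 784, 256        # example layer input / output sizes
n = (n_in + n_out) * 0.5      # or simply n_in, per the convention above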

1. Xavier initialization

  • Gaussian distribution initialization:
# For ordinary activation functions (tanh, sigmoid): stdev is the standard deviation of the Gaussian distribution, with the mean set to 0
stdev = np.sqrt(1/n) 
W = tf.Variable(np.random.randn(n_in, n_out) * stdev)
  • Uniform distribution initialization:
# For ordinary activation functions (tanh, sigmoid)
scale = np.sqrt(3/n)
W = tf.Variable(np.random.uniform(low=-scale, high=scale, size=[n_in, n_out]))

2. He initialization

  • Gaussian distribution initialization:
# For ReLU: stdev is the standard deviation of the Gaussian distribution, with the mean set to 0
stdev = np.sqrt(2/n)
W = tf.Variable(np.random.randn(n_in, n_out) * stdev) 
  • Uniform distribution initialization:
# For ReLU
scale = np.sqrt(6/n)
W = tf.Variable(np.random.uniform(low=-scale, high=scale, size=[n_in, n_out]))

3. BN + Gaussian distribution (small σ) random initialization

W = tf.Variable(np.random.randn(node_in, node_out) * 0.01)
# or, equivalently, for an arbitrary shape:
W = tf.Variable(np.random.randn(*shape) * 0.01)  # shape can have more than two dimensions!

......
fc = tf.contrib.layers.batch_norm(fc, center=True, scale=True, is_training=True)
fc = tf.nn.relu(fc)

4. Implementation in TensorFlow
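
A minimal sketch, assuming the same TF 1.x API used by the batch_norm snippet above (variable names and sizes are illustrative): the initializations discussed so far can also be obtained with TensorFlow's built-in initializers.

import tensorflow as tf

n_in, n_out = 784, 256  # example layer sizes

# Xavier (Glorot) initialization, uniform variant
W_xavier = tf.get_variable("W_xavier", shape=[n_in, n_out],
                           initializer=tf.contrib.layers.xavier_initializer(uniform=True))

# He initialization: variance scaling with scale=2.0 and fan-in mode (for ReLU)
W_he = tf.get_variable("W_he", shape=[n_in, n_out],
                       initializer=tf.variance_scaling_initializer(scale=2.0, mode='fan_in'))

# Small Gaussian initialization (e.g. when the network uses BN)
W_small = tf.get_variable("W_small", shape=[n_in, n_out],
                          initializer=tf.truncated_normal_initializer(stddev=0.01))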


