Why deep neural networks are so hard to train

Table of contents

1. Reasons for the difficulty of deep network training:

2. The vanishing gradient problem

3. Unstable gradient problem


Reference article: Why is it difficult to train deep neural networks? (Tencent Cloud Developer Community)

1. Reasons for the difficulty of deep network training:

In deep networks, training is slow, and different layers learn at very different speeds. Careful study reveals:

  1. While the later layers in the network are learning well, the earlier layers often get stuck during training, learning almost nothing.
  2. Digging deeper, we find that the opposite can also occur: the earlier layers may learn well while the later layers stagnate.

This stagnation is not bad luck. There is a more fundamental reason the learning speed drops, and that reason is tied to gradient-based learning techniques.

In fact, gradient-descent-based learning in deep neural networks turns out to be inherently unstable, and this instability blocks the learning of either the earlier or the later layers.

2. The vanishing gradient problem

import mnist_loader
import network2
 
training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
# input layer: 784
# hidden layer: 30
# output layer: 10
sizes = [784, 30, 10]
net = network2.Network(sizes=sizes)
 
# train with stochastic gradient descent
net.SGD(
    training_data,
    30,    # epochs
    10,    # mini-batch size
    0.1,   # learning rate eta
    lmbda=5.0,  # L2 regularization parameter
    evaluation_data=validation_data,
    monitor_evaluation_accuracy=True,
)
 
"""
Epoch 0 training complete
Accuracy on evaluation data: 9280 / 10000
Epoch 1 training complete
Accuracy on evaluation data: 9391 / 10000
......
Epoch 28 training complete
Accuracy on evaluation data: 9626 / 10000
Epoch 29 training complete
Accuracy on evaluation data: 9647 / 10000
"""

Taking the MNIST digit-classification problem as an example, we end up with a classification accuracy of 96.47%. If we deepen the network, will accuracy improve? Let's try several configurations:

# accuracy 96.8%
net = network2.Network([784, 30, 30, 10])
# accuracy 96.42%
net = network2.Network([784, 30, 30, 30, 10])
# accuracy 96.28%
net = network2.Network([784, 30, 30, 30, 30, 10])

This shows that deepening the network, which in principle lets it learn more complex classification functions, does not lead to clearly better performance: the accuracy barely moves, and even dips slightly for the deepest networks. Why this happens is the question to think about next. Suppose the extra hidden layers could help in principle; then the problem is that our learning algorithm is not finding the right weights and biases.

The following figure (based on the [784, 30, 30, 10] network) shows the rate of change of each neuron's weights and bias during learning. Each neuron carries a bar whose length indicates how fast that neuron is changing as learning proceeds: longer bars mean faster learning, shorter bars slower learning.

It can be found that the bars in the second hidden layer are mostly larger than those in the first hidden layer, so the neurons in the second hidden layer learn faster. This is no coincidence: the earlier layers really do learn more slowly than the later ones.
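This learning speed can be estimated directly: following Nielsen's approach, take the norm of each layer's error vector δˡ = ∂C/∂bˡ as that layer's speed. Below is a minimal NumPy sketch, independent of network2, that runs one forward/backward pass through a randomly initialized [784, 30, 30, 10] sigmoid network. The 1/√n weight initialization, quadratic cost, and random placeholder input and target are assumptions of this sketch, not taken from the original code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
sizes = [784, 30, 30, 10]  # two hidden layers

# Gaussian weights scaled by 1/sqrt(fan-in); biases from a unit Gaussian
weights = [rng.normal(0.0, 1.0 / np.sqrt(n_in), (n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(0.0, 1.0, (n, 1)) for n in sizes[1:]]

# Forward pass on a random placeholder input
zs, activations = [], [rng.random((784, 1))]
for w, b in zip(weights, biases):
    z = w @ activations[-1] + b
    zs.append(z)
    activations.append(sigmoid(z))

# Backward pass with a quadratic cost against a placeholder one-hot target
y = np.zeros((10, 1))
y[3] = 1.0
delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
deltas = [delta]
for l in range(2, len(sizes)):
    delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
    deltas.insert(0, delta)

# "Learning speed" of layer l = ||delta^l||
speeds = [float(np.linalg.norm(d)) for d in deltas]
for i, s in enumerate(speeds, start=1):
    print(f"layer {i}: ||delta|| = {s:.2e}")
```

With this initialization the first hidden layer's ||δ|| typically comes out well below the second's, mirroring the bar charts described above.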

We can continue to observe how learning speed changes over training. Below are the learning-speed diagrams for networks with 2, 3, and 4 hidden layers:

[Figure: per-layer learning speed over training, 2 hidden layers]

[Figure: per-layer learning speed over training, 3 hidden layers]

[Figure: per-layer learning speed over training, 4 hidden layers]
The same pattern appears: the earlier hidden layers learn more slowly than the later ones. In the 4-hidden-layer case, the first hidden layer learns roughly two orders of magnitude (about 100 times) more slowly than the fourth.

One conclusion we can draw is that in at least some deep neural networks, the gradient tends to shrink as we backpropagate through the hidden layers, which means that neurons in earlier layers learn more slowly than neurons in later layers. This phenomenon is known as the vanishing gradient problem.

3. Unstable gradient problem

The core reason is gradient instability in deep neural networks, which causes gradients in the earlier layers to either vanish or explode.

The fundamental problem is not the vanishing or exploding gradient per se, but that the gradient at an earlier layer is a product of terms contributed by all the later layers. When there are many layers, this product is intrinsically unstable: the only way for all layers to learn at roughly the same speed is for all of those factors to nearly balance out, and nothing in the network guarantees such a balance. In short, the real problem is that neural networks suffer from unstable gradients, so with a standard gradient-based learning algorithm, different layers will learn at very different speeds.
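The "product of terms" can be made concrete with a toy calculation. Crossing each layer during backpropagation contributes (roughly) a factor wⱼ·σ'(zⱼ); since σ'(z) peaks at 1/4, modest weights shrink the gradient exponentially with depth, while large weights blow it up. The sketch below uses idealized, identical factors rather than a real backward pass, purely to show the two regimes:

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# sigma'(z) is maximized at z = 0, where it equals 1/4
assert abs(sigmoid_prime(0.0) - 0.25) < 1e-12

def gradient_scale(w, z, n_layers):
    """Magnitude of the product of n_layers identical factors w * sigma'(z),
    a crude stand-in for the terms multiplied together during backprop."""
    return float(np.prod([w * sigmoid_prime(z)] * n_layers))

print(gradient_scale(w=1.0, z=0.0, n_layers=10))  # (1/4)^10 ~ 9.5e-7: vanishes
print(gradient_scale(w=8.0, z=0.0, n_layers=10))  # 2^10 = 1024.0: explodes
```

Unless the factors happen to balance near 1, depth amplifies whichever regime the weights are in, which is exactly the instability described above.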

Unstable gradients are pervasive: as we have seen, gradients in the early layers of a deep neural network can either vanish or explode.


Origin blog.csdn.net/ytusdc/article/details/128513941