Network degradation, overfitting, and gradient vanishing / explosion

Overfitting

Overfitting refers to a model that performs well on the training set but poorly on the test set.

The reason
During training, the model learns features that are unique to the training data as if they were common features (it fits the details of the data too closely). When an overfitted model is applied to the test set, the test data does not have those features unique to the training data, so the model generalizes poorly and performs badly on the test set.

Solutions

  • Reduce the size of the network
  • Expand the training data set
  • Apply regularization to the model, balancing the size of the data set against the complexity of the model
  • Dropout, used mainly on the fully connected layers in deep learning. Dropout removes neurons of the hidden network with a certain probability (usually set to 0.5, which maximizes the number of randomly generated network structures); see the sketch after this list.
  • Boosting and Bagging. These ensemble machine learning methods combine multiple models, which weakens the influence of outliers on any single model and avoids depending on a single model alone.
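
As an illustration of the Dropout and regularization points above, here is a minimal sketch assuming PyTorch (the original post does not name a framework; the layer sizes and weight-decay value are placeholders, only the dropout probability of 0.5 comes from the text):

```python
import torch
import torch.nn as nn

# A small fully connected classifier with Dropout on its hidden layers.
# Dropout randomly zeroes neurons with probability p (0.5 here) during
# training, which keeps the network from memorizing training-set details.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # active only in model.train() mode
    nn.Linear(256, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 10),
)

# L2 regularization ("weight decay") balances model complexity against
# the amount of training data, as described in the bullet above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

model.train()  # enables Dropout during training
# ... training loop ...
model.eval()   # disables Dropout when evaluating on the test set
```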

Network degradation

As the number of layers in a network increases, training accuracy gradually saturates; if layers keep being added, training accuracy then declines, and this decline is not caused by overfitting.

The reason
In theory, a deeper model should not perform worse than its shallower counterpart, because the shallower model's solution space is a subspace of the deeper model's. A deeper model with at least the same accuracy can be constructed as follows: build the shallow model first, then add many network layers that are identity mappings.
In practice, the layers added to the deeper model are not identity mappings but nonlinear layers. Degradation therefore also shows that approximating the identity mapping with a stack of nonlinear layers can be difficult.

Solution
Learn residuals instead. ResNet was proposed precisely to address this problem.
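
A minimal sketch of the residual idea, assuming PyTorch (the framework and the specific layer choices are my assumptions): the stacked nonlinear layers only have to learn the residual F(x), while the skip connection supplies the identity mapping that plain nonlinear layers struggle to approximate.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x: the conv layers learn the residual F(x); the skip
    connection provides the identity mapping, so an extra block can at
    worst fall back to the identity instead of degrading the network."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # add the identity (skip) connection

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 32, 32))  # output shape is preserved: (1, 64, 32, 32)
```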

Gradient vanishing / explosion

The reason
Vanishing and exploding gradients are essentially the same problem: both are caused by the multiplicative effect of back-propagating gradients through the many layers of a deep network. The Sigmoid activation function is the most prone to producing vanishing gradients, which is determined by the shape of the function.
(Figure: plot of the Sigmoid function)
(Figure: plot of its derivative)
As the figures show, if Sigmoid is used as the activation function, its derivative never exceeds 0.25; when many layers are stacked and the chain rule multiplies these factors together during backpropagation, the gradient easily vanishes.
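
The 0.25 bound follows directly from the Sigmoid's derivative; written out:

$$
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
\sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr) \le \frac{1}{4},
$$

with the maximum of 1/4 attained at $x = 0$. By the chain rule, backpropagating through $n$ stacked Sigmoid layers multiplies $n$ such factors, so (ignoring the weights) the gradient is scaled by at most $(1/4)^n$ and quickly vanishes as the network gets deeper.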

Solutions

  • Change the activation function: using ReLU, LeakyReLU, ELU and the like mitigates vanishing or exploding gradients. The derivative of ReLU on its positive side is a constant equal to 1, so it produces neither vanishing nor exploding gradients.
  • Batch Normalization. Apply a scale and shift to the inputs of each layer, forcing the input distribution of each neuron back to a standard normal distribution with mean 0 and variance 1. This places the inputs in the region where the nonlinear activation function is sensitive, so small changes in the input produce large changes in the loss; the gradients become larger, training is faster, and the vanishing gradient problem is avoided.
  • ResNet's residual structure.
  • Gradient clipping, which is aimed mainly at gradient explosion. The idea is to set a clipping threshold; when updating, any gradient that exceeds the threshold is constrained back within that range, as sketched below.
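
A minimal sketch of gradient clipping in a single training step, assuming PyTorch (the framework, model, and the threshold value of 1.0 are my assumptions): the global gradient norm is capped before the optimizer step, which is the defense against exploding gradients described in the last bullet.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def train_step(x, y, max_norm=1.0):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Clip the global gradient norm: if it exceeds max_norm, all gradients
    # are rescaled so that the norm equals max_norm, preventing explosion.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()

loss = train_step(torch.randn(32, 100), torch.randint(0, 10, (32,)))
```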

Origin blog.csdn.net/c2250645962/article/details/102838830