ElitesAI · Dive into Deep Learning (PyTorch edition) · Task06 check-in

Batch Normalization and Residual Networks

Batch Normalization (BatchNorm)
Batch normalization (for deep models):
uses the mean and standard deviation computed over a mini-batch to continually adjust the intermediate outputs of the network, making the intermediate output values of every layer more stable across the whole network.

1. Batch normalization for fully connected layers
Position: between the affine transformation and the activation function of the fully connected layer.

2. Batch normalization for convolutional layers
Position: after the convolution computation and before the activation function.
If the convolution has multiple output channels, batch normalization is applied to each output channel separately, and each channel has its own scale and shift parameters. Calculation for a single channel: with batch size m and a convolution output of height p and width q, batch normalization is performed over all m × p × q elements at once, using the same mean and variance.
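The per-channel computation above can be sketched as follows; this is a minimal illustration (function name and shapes are assumptions), checked against PyTorch's `nn.BatchNorm2d` in training mode:

```python
import torch
import torch.nn as nn

# Minimal sketch of per-channel batch normalization for a conv output of
# shape (N, C, H, W): each channel is normalized over its N*H*W elements
# using that channel's own mean and variance, then scaled and shifted.
def batch_norm_2d(x, gamma, beta, eps=1e-5):
    mean = x.mean(dim=(0, 2, 3), keepdim=True)                 # per-channel mean
    var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)   # per-channel variance
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)

torch.manual_seed(0)
x = torch.randn(4, 3, 8, 8)                  # batch of 4, 3 channels
gamma, beta = torch.ones(3), torch.zeros(3)  # per-channel scale and shift
y = batch_norm_2d(x, gamma, beta)

# cross-check against nn.BatchNorm2d in training mode (batch statistics)
bn = nn.BatchNorm2d(3)
bn.train()
y_ref = bn(x)
print(torch.allclose(y, y_ref, atol=1e-5))   # True
```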

3. Batch normalization at prediction time
Training: statistics are computed per batch, i.e., the mean and variance are calculated for each mini-batch.
Prediction: the mean and variance of the entire training dataset are estimated with moving averages.

nn.BatchNorm2d(out_channels)  # BatchNorm2d is the most commonly used in convolutional networks (helps prevent vanishing or exploding gradients); its argument is the number of output channels of the convolution
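The training/prediction difference described above corresponds to PyTorch's `train()`/`eval()` modes; a minimal sketch (the shapes are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Sketch: BatchNorm behaves differently in training vs. evaluation mode.
torch.manual_seed(0)
bn = nn.BatchNorm2d(3)
x = torch.randn(16, 3, 8, 8)

bn.train()
_ = bn(x)   # training mode: normalizes with this batch's mean/var and
            # updates running_mean / running_var (moving averages)

bn.eval()
y = bn(x)   # eval mode: normalizes with the accumulated running statistics
```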
See also:
http://www.mamicode.com/info-detail-2378483.html

[Interpretation of the BN layer in convolutional neural networks]
see https://www.cnblogs.com/kk17/p/9693462.html

ResNet explanation
https://www.cnblogs.com/bonelee/p/8977095.html
Convolution (64, 7×7, 3)
Batch normalization
Max pooling (3×3, stride 2)

Residual blocks ×4 (each module contains 2 residual blocks; between modules, the height and width are halved with a stride of 2)

Global average pooling

Fully connected layer
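A residual block of the kind used in the architecture above can be sketched in PyTorch as follows; the class name and the 1×1 shortcut convolution for channel/size changes are conventional choices, not taken from the source:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal residual block sketch: conv -> BN -> ReLU -> conv -> BN,
# then the input is added back (skip connection) before the final ReLU.
class Residual(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # 1x1 conv on the shortcut when channels or spatial size change
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Conv2d(in_channels, out_channels, 1, stride=stride)
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return F.relu(y + self.shortcut(x))

blk = Residual(64, 128, stride=2)
x = torch.randn(1, 64, 56, 56)
print(blk(x).shape)   # stride 2 halves height/width: (1, 128, 28, 28)
```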

Densely Connected Network (DenseNet)

Main building blocks:
Dense block: defines how inputs and outputs are concatenated.
Transition layer: controls the number of channels so that it does not grow too large.
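The two building blocks can be sketched as below; the BN → ReLU → conv ordering and the specific sizes are illustrative assumptions following the usual DenseNet convention:

```python
import torch
import torch.nn as nn

# Dense block: each conv layer's output is concatenated with its input on
# the channel axis, so channels grow by `growth_rate` per layer.
def conv_block(in_channels, growth_rate):
    return nn.Sequential(
        nn.BatchNorm2d(in_channels),
        nn.ReLU(),
        nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1),
    )

class DenseBlock(nn.Module):
    def __init__(self, num_convs, in_channels, growth_rate):
        super().__init__()
        self.blocks = nn.ModuleList(
            conv_block(in_channels + i * growth_rate, growth_rate)
            for i in range(num_convs)
        )
        self.out_channels = in_channels + num_convs * growth_rate

    def forward(self, x):
        for blk in self.blocks:
            x = torch.cat([x, blk(x)], dim=1)  # concatenate on channel axis
        return x

# Transition layer: 1x1 conv shrinks the channel count, pooling halves H and W.
def transition_block(in_channels, out_channels):
    return nn.Sequential(
        nn.BatchNorm2d(in_channels),
        nn.ReLU(),
        nn.Conv2d(in_channels, out_channels, kernel_size=1),
        nn.AvgPool2d(kernel_size=2, stride=2),
    )

torch.manual_seed(0)
blk = DenseBlock(2, 3, 10)
x = torch.randn(4, 3, 8, 8)
y = blk(x)
print(y.shape)                      # (4, 23, 8, 8): 3 + 2*10 channels
z = transition_block(23, 10)(y)
print(z.shape)                      # (4, 10, 4, 4): fewer channels, halved size
```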

Further reading:

https://blog.csdn.net/baidu_27643275/article/details/79250537
A saddle point, in neural network optimization problems, is a point that slopes upward in one dimension and downward in another.

Saddle point: the gradient is zero, and in its vicinity the Hessian matrix has both positive and negative eigenvalues; its determinant is less than 0, i.e., the Hessian is indefinite.
Difference between a saddle point and a local extremum:
a saddle point and a local minimum are alike in that the gradient at the point is zero; the difference is that near a saddle point the Hessian matrix is indefinite (determinant less than 0), whereas at a local minimum the Hessian matrix is positive definite.

Near a saddle point, gradient-based optimization algorithms (almost all optimization algorithms in practical use are gradient-based) run into serious problems:
the gradient at a saddle point is zero, and a saddle point is typically surrounded by a plateau of nearly identical error values (also called a flat region or plateau; in such regions the gradient is close to zero and the network's learning slows down). In the high-dimensional case, this flat region around a saddle point can be quite large, making it hard for an algorithm like SGD to leave the region: it may be stuck near the saddle point for a long time (since the gradient is close to zero in every dimension).
When the number of saddle points is large, this problem becomes very severe.

The reason high-dimensional non-convex optimization is so difficult is that high-dimensional parameter spaces contain a huge number of saddle points.

Addendum:
the Hessian matrix is a square matrix composed of the second-order partial derivatives of a multivariate function; it describes the local curvature of the function and can be used to determine the function's extrema.
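A concrete example of the criterion above, assuming the classic saddle function f(x, y) = x² − y² (an illustration, not from the source): at the origin the gradient is zero, but the Hessian has eigenvalues of mixed sign, so it is indefinite.

```python
import numpy as np

# Hessian of f(x, y) = x**2 - y**2 (constant everywhere):
# d2f/dx2 = 2, d2f/dy2 = -2, mixed partials = 0.
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigvals = np.linalg.eigvalsh(H)
print(eigvals)            # [-2.  2.] -> one positive, one negative eigenvalue
print(np.linalg.det(H))   # negative determinant -> indefinite Hessian,
                          # so (0, 0) is a saddle point, not an extremum
```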

How to understand stochastic gradient descent (Stochastic Gradient Descent, SGD)
https://www.zhihu.com/question/264189719
https://blog.csdn.net/qq_30911665/article/details/79531733
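As a minimal sketch of the SGD idea linked above (the dataset, learning rate, and model are illustrative assumptions): each step updates the parameter with the gradient of a single randomly drawn sample rather than the full dataset.

```python
import numpy as np

# Fit a linear model y = w*x by SGD on squared error, one sample per step.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)  # true slope is 3

w, lr = 0.0, 0.1
for step in range(1000):
    i = rng.integers(len(x))               # draw one example at random
    grad = 2 * (w * x[i] - y[i]) * x[i]    # gradient of (w*x_i - y_i)^2 w.r.t. w
    w -= lr * grad                         # stochastic gradient step

print(w)   # close to the true slope 3.0 (fluctuates with the constant step size)
```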

Learning rate
https://www.csdn.net/gather_28/MtTaggysMDk1Ny1ibG9n.html
https://blog.csdn.net/qq_30911665/article/details/79531733


Source: blog.csdn.net/weixin_43859329/article/details/104488367