What Happens When You Train Only the Batch Normalization Layers?

You might be surprised, but it is effective.

Recently, I came across the paper "Training BatchNorm and Only BatchNorm: On the Expressive Power of Random Features in CNNs" by Jonathan Frankle, David J. Schwab, and Ari S. Morcos on arXiv. The idea immediately caught my attention. Until then, I had never considered the batch normalization (BN) layer to be part of the learning process itself, only a mechanism that helps deep networks optimize better and become more stable. After a few experiments, I found that I was wrong. In the following, I show the results of my replication of the paper and what I learned along the way.

More specifically, I successfully reproduced the paper's main experiment with the TensorFlow 2 Keras API and reached similar conclusions. That is, ResNets can achieve decent results on the CIFAR-10 dataset by training only the gamma (γ) and beta (β) parameters of their batch normalization layers. Numerically, I obtained 45%, 52%, and 50% Top-1 accuracy with the ResNet-50, -101, and -152 architectures respectively; far from perfect, but also far from useless.
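To make "training only the BN layers" concrete, here is a minimal sketch of how this can be set up with the TensorFlow 2 Keras API. It is not my exact training script: the optimizer, learning rate, epoch count, and preprocessing below are illustrative placeholders, and the only point is freezing every parameter except the BatchNormalization ones.

```python
import tensorflow as tf

# Build a ResNet-50 for CIFAR-10 with randomly initialized weights
# (the "random features" in the paper's title).
model = tf.keras.applications.ResNet50(
    weights=None,
    input_shape=(32, 32, 3),
    classes=10,
)

# Freeze everything except the BatchNormalization layers, so that only
# their gamma (scale) and beta (shift) parameters receive gradient updates.
for layer in model.layers:
    layer.trainable = isinstance(layer, tf.keras.layers.BatchNormalization)

# Placeholder optimizer and hyperparameters -- not the exact settings I used.
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# CIFAR-10, with a simple [0, 1] rescaling for brevity.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model.fit(x_train, y_train, epochs=10, batch_size=128,
          validation_data=(x_test, y_test))
```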

In the following, I first outline the concept of batch normalization and its common interpretations. Then I share the code I used and the results it produced. Finally, I comment on and analyze the experimental results.

Batch Normalization

Briefly, a batch normalization layer estimates the mean (μ) and variance (σ²) of its input and produces an output normalized to zero mean and unit variance. In practice, this technique significantly improves the convergence and stability of deep networks. In addition, the layer uses two parameters (γ and β) to scale and shift its output.

With x as the input and z as the output, z is given by the following expression:

z = γ · (x − μ) / √(σ² + ε) + β

Figure 1: The batch normalization expression (ε is a small constant added to the variance for numerical stability).

The parameters μ and σ² are estimated from the input data, while β and γ are trainable. The back-propagation algorithm can therefore use them to optimize the network.
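For readers who prefer code to formulas, the toy snippet below reproduces the forward pass of the expression above in plain NumPy. The epsilon is the small stability constant that implementations add to the variance; 1e-3 is just the value assumed here.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-3):
    # mu and sigma^2 are estimated per feature over the batch dimension.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    # Normalize to zero mean and unit variance, then scale and shift.
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# A toy batch: 128 samples with 64 features each.
x = np.random.randn(128, 64) * 5.0 + 3.0
z = batch_norm(x, gamma=np.ones(64), beta=np.zeros(64))
print(z.mean(axis=0)[:3])  # approximately 0
print(z.std(axis=0)[:3])   # approximately 1
```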

In summary, batch normalization has been found to significantly speed up network training while preserving performance. Moreover, it combines well with the other layers of a network. For this reason, most models insert it between every convolution and ReLU, forming the ubiquitous "Conv-BN-ReLU" trio and its variants (a minimal sketch of this block follows the list below). However, even though it is one of the most common layers, the reasons behind its benefits are heavily debated in the literature. These are the three main arguments:

Internal covariate shift: simply put, if a layer's output has zero mean and unit variance, the next layer sees a stable input distribution during training. In other words, normalization keeps the output from drifting too much. This was the original explanation, but later work found conflicting evidence and rejected the hypothesis. In short, VGG networks were trained (1) without BN, (2) with BN, and (3) with BN plus an artificially injected covariate shift. Despite the artificial shift, both (2) and (3) still outperformed (1).

Loss landscape smoothing: BN is thought to smooth the optimization landscape, reducing the variation of the loss function and keeping its gradients bounded. A smoother objective is easier to train and less prone to problems.

Length-direction decoupling: some authors argue that BN reformulates the optimization problem in a way that connects it to more classical optimization settings. More specifically, BN allows the length and the direction of the parameter vectors to be optimized independently, which improves convergence.
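As promised above, here is what the "Conv-BN-ReLU" trio typically looks like with the Keras functional API; the filter count and kernel size are arbitrary example values.

```python
import tensorflow as tf

def conv_bn_relu(x, filters, kernel_size=3):
    # The convolution's bias is usually dropped because BN's beta makes it redundant.
    x = tf.keras.layers.Conv2D(filters, kernel_size,
                               padding="same", use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)

inputs = tf.keras.Input(shape=(32, 32, 3))
outputs = conv_bn_relu(inputs, filters=64)
tf.keras.Model(inputs, outputs).summary()
```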

In short, all three explanations focus on the normalizing part of batch normalization. Below, we turn to the shifting and scaling that the BN layer performs through its γ and β parameters.
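To appreciate how small this trainable subset is, the short check below counts how many of ResNet-50's parameters are BN gammas and betas, using the same CIFAR-10-shaped model as in the earlier sketch. They amount to well under one percent of the network's weights, which is what makes the accuracies reported above surprising.

```python
import tensorflow as tf

model = tf.keras.applications.ResNet50(
    weights=None, input_shape=(32, 32, 3), classes=10)

# Sum the gamma and beta parameters of every BatchNormalization layer.
bn_params = sum(
    w.shape.num_elements()
    for layer in model.layers
    if isinstance(layer, tf.keras.layers.BatchNormalization)
    for w in layer.trainable_weights  # [gamma, beta]
)

total = model.count_params()
print(f"BN gamma/beta parameters: {bn_params:,} of {total:,} "
      f"({100 * bn_params / total:.2f}%)")
```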

 

For the full article, please visit: https://imba.deephub.ai/p/af8e0630755211ea90cd05de3860c663

Or follow our official account to read it.

 


Source: www.cnblogs.com/deephub/p/12625812.html