The magic of Batch Normalization: what if we train only the BN layers of a model?

You might be surprised, but it actually works.

Recently, I read the paper "Training BatchNorm and Only BatchNorm: On the Expressive Power of Random Features in CNNs" by Jonathan Frankle, David J. Schwab, and Ari S. Morcos on arXiv. The idea immediately caught my attention. Until then, I had never considered the batch normalization (BN) layers to be part of the learning process itself, only something that helps deep networks optimize and stay stable. After a few experiments, I found out I was wrong. Below, I show the results of my reproduction of the paper and what I learned from it.

More specifically, using the TensorFlow 2 Keras API, I successfully reproduced the paper's main experiment and reached similar conclusions. Namely, ResNets can reach decent results on the CIFAR-10 dataset by training only the gamma (γ) and beta (β) parameters of their batch normalization layers. Numerically, I obtained Top-1 accuracies of 45%, 52%, and 50% with the ResNet-50, ResNet-101, and ResNet-152 architectures, which is far from perfect, but also far from trivial.

In the following, I first outline the concept of batch normalization and its common interpretations. Then I share the code I used and the results obtained with it. Finally, I comment on and analyze the experimental results.

Batch Normalization

Briefly, a batch normalization layer estimates the mean (μ) and variance (σ²) of its inputs and produces a standardized output, i.e., an output with zero mean and unit variance. Experimentally, this technique significantly improves the convergence and stability of deep networks. In addition, it uses two parameters (γ and β) to scale and shift the normalized output.

With x as the input and z as the output, z is given by the following expression:

z = γ · (x − μ) / √(σ² + ε) + β

Figure 1: the batch normalization expression (ε is a small constant added for numerical stability)

The parameters μ and σ² are estimated from the input data, while β and γ are trainable. Thus, the back-propagation algorithm can exploit them to optimize the network.
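To make these four quantities concrete, here is a minimal sketch (using the same TensorFlow 2 / Keras API as the rest of this post) that builds a standalone BatchNormalization layer and lists its weights; the feature size of 8 is an arbitrary example, not something from the paper.

import tensorflow as tf

# A standalone BatchNormalization layer built for an input with 8 features.
bn = tf.keras.layers.BatchNormalization()
bn.build((None, 8))

# gamma and beta are trainable parameters; the moving mean and moving variance
# are running estimates of mu and sigma^2, updated outside of back-propagation.
for w in bn.weights:
    status = 'trainable' if w.trainable else 'non-trainable'
    print(w.name, tuple(w.shape), status)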

In summary, batch normalization has been found to significantly improve both training speed and network performance. Moreover, it plays well with the other layers of a network. Therefore, most models place it between every Conv and ReLU operation, forming the "Conv-BN-ReLU" trio (and its variants); a minimal sketch of this block is shown right after the list below. However, although it is one of the most frequently used layers, the reasons behind its benefits are still heavily debated in the literature. The three main arguments are the following:

Internal covariate shift: Simply put, if a layer outputs data with zero mean and unit variance, the next layer receives a stable input during training. In other words, it prevents the output distribution from shifting too much. This was the original explanation, but later work found conflicting evidence and rejected this hypothesis. In short, the authors trained VGG networks (1) without BN, (2) with BN, and (3) with BN plus artificially injected covariate shift. Despite the artificial covariate shift, (2) and (3) still performed better than (1).

Loss-landscape smoothing: BN is believed to smooth the optimization landscape, reducing the variation of the loss function and bounding its gradients. A smoother objective trains faster and is less prone to problems.

Length-direction decoupling: Some authors argue that BN improves the formulation of the optimization problem, making it amenable to more traditional optimization results. More specifically, BN allows the length and the direction of the weight vectors to be optimized independently, which improves convergence.
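As promised above, here is a minimal sketch of the Conv-BN-ReLU pattern, written with the Keras functional API used later in this post; the filter count and input size are arbitrary examples, not taken from the paper.

import tensorflow as tf

def conv_bn_relu(x, filters, kernel_size=3):
    # Convolution without bias (the BN beta parameter plays that role),
    # followed by normalization and the non-linearity.
    x = tf.keras.layers.Conv2D(filters, kernel_size, padding='same', use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)

inputs = tf.keras.layers.Input((32, 32, 3))
outputs = conv_bn_relu(inputs, filters=16)
block = tf.keras.models.Model(inputs, outputs)
block.summary()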

In short, all three explanations focus on the normalizing aspect of batch normalization. Next, we will look at what BN achieves through the shifting and scaling performed by the γ and β parameters.

Reproducing the paper

If an idea is good, it should be robust to implementation choices and hyperparameters. In my code, I used TensorFlow 2 and chose the hyperparameters so as to reproduce the paper's main experiment as briefly as possible. In more detail, I tested the following proposition:

A ResNet model can still achieve good results when trained on the CIFAR-10 dataset with all weights frozen except the parameters of its batch normalization layers.

I used the ResNet and CIFAR-10 modules and datasets shipped with Keras, together with the cross-entropy loss and a softmax activation. I downloaded the ResNet models and the dataset, randomly initialized the models, froze the unneeded layers, and trained for 50 epochs with a batch size of 1024. You can see the code below:

# Reproducing the main findings of the paper "Training BatchNorm and Only BatchNorm:
# On the Expressive Power of Random Features in CNNs".
# Goal: Train a ResNet model to solve the CIFAR-10 dataset using only the batchnorm
# layers; everything else is frozen at its random initial state.
import tensorflow as tf
import numpy as np
import pandas as pd

architectures = [
    ('ResNet-50', tf.keras.applications.resnet.ResNet50),
    ('ResNet-101', tf.keras.applications.resnet.ResNet101),
    ('ResNet-152', tf.keras.applications.resnet.ResNet152)]

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()
n_train_images = X_train.shape[0]
n_test_images = X_test.shape[0]
n_classes = np.max(y_train) + 1

X_train = X_train.astype(np.float32) / 255
X_test = X_test.astype(np.float32) / 255
y_train = tf.keras.utils.to_categorical(y_train, n_classes)
y_test = tf.keras.utils.to_categorical(y_test, n_classes)

for name, architecture in architectures:
    input = tf.keras.layers.Input((32, 32, 3))
    resnet = architecture(include_top=False, weights=None, input_shape=(32, 32, 3), pooling='avg')(input)  # weights=None: random initialization, as described above
    output = tf.keras.layers.Dense(n_classes, activation='softmax')(resnet)
    model = tf.keras.models.Model(inputs=input, outputs=output)

    optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
    loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=False)  # the model already outputs softmax probabilities
    he_normal = tf.keras.initializers.he_normal()
    for layer in model.layers[1].layers:
        if layer.name.endswith('_bn'):
            new_weights = [
                he_normal(layer.weights[0].shape),  # Gamma
                tf.zeros(layer.weights[1].shape),   # Beta
                tf.zeros(layer.weights[2].shape),   # Moving mean
                tf.ones(layer.weights[3].shape)]    # Moving variance

            layer.set_weights(new_weights)
            layer.trainable = True
        else:
            layer.trainable = False

    model.summary()

    model.compile(loss=loss_fn, optimizer=optimizer, metrics=['accuracy'])
    print('Training ' + name + '...')
    history = model.fit(X_train, y_train, batch_size=1024, epochs=50, validation_data=(X_test, y_test), shuffle=True)
    history_df = pd.DataFrame(history.history) 
    print('Dumping model and history...')
    history_df.to_csv(name + '.csv', sep=';')
    model.save(name + '.h5')

print('Testing Complete!')

A few things should be noted about the code above:

  1. The Keras API only ships the ResNet-50, 101, and 152 models. For simplicity, I only used these.
  2. The ResNet models initialize the γ parameters with the "ones" strategy. In our limited training setting, this is too symmetric to be trained well by gradient descent. Instead, following the paper's suggestion, I use the "he_normal" initialization. For this, the batch normalization weights are manually re-initialized before training.
  3. The authors trained for 160 epochs with a batch size of 128, using an SGD optimizer with momentum 0.9. The initial learning rate was 0.01, reduced to 0.001 and 0.0001 at epochs 80 and 120. As a first attempt, I found this too specific. Instead, I used 50 epochs, a batch size of 1024, the vanilla Adam optimizer, and a fixed learning rate of 0.01. If the hypothesis holds, these changes should not be a problem.
  4. The authors also used data augmentation, while I did not. Again, if the idea holds, these changes should not be a major problem. As a sanity check on the freezing logic, a quick parameter count is shown right after this list.
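The snippet below counts the trainable parameters after the freezing step; it is only a sketch and assumes the `model` object built in the script above. Only the BN γ/β parameters plus the final Dense classifier should remain trainable.

import numpy as np

# Assumes `model` was built and compiled as in the script above.
trainable = sum(int(np.prod(w.shape.as_list())) for w in model.trainable_weights)
total = sum(int(np.prod(w.shape.as_list())) for w in model.weights)
print(f'Trainable parameters: {trainable:,} of {total:,} ({100 * trainable / total:.2f}%)')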

Results

These are the results I obtained with the code above:

Figure: training-set accuracy of the ResNet models trained only on their batch normalization layers

Figure: validation-set accuracy of the ResNet models trained only on their batch normalization layers

Numerically, the three models reached training accuracies of 50%, 60%, and 62%, and validation accuracies of 45%, 52%, and 50%.
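For reference, curves like the ones shown above can be re-drawn from the history CSVs saved by the script; this is only a sketch, and matplotlib is an extra dependency not used elsewhere in this post.

import pandas as pd
import matplotlib.pyplot as plt

# The filenames follow the `name + '.csv'` pattern used when saving the histories.
for name in ['ResNet-50', 'ResNet-101', 'ResNet-152']:
    history = pd.read_csv(name + '.csv', sep=';')
    plt.plot(history['val_accuracy'], label=name)

plt.xlabel('Epoch')
plt.ylabel('Validation accuracy')
plt.legend()
plt.show()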

To put the models' performance into perspective, we should always consider the performance of random guessing. The CIFAR-10 dataset has ten classes, so a random guess is correct about 10% of the time. The approach above is roughly five times better than random guessing, so we can say that the models perform reasonably well.

Interestingly, it took about 10 epochs for the validation accuracy to start increasing, which suggests that for the first ten epochs the network was simply fitting the training data as much as it could. Afterwards, the accuracy improved considerably. However, it still varied widely every few epochs, indicating that the models are not very stable.

In the paper, Figure 2 shows that the authors reached validation accuracies of roughly 70%, 75%, and 77%. Considering that they made some adjustments, used their own training schedule, and applied data augmentation, their numbers seem very reasonable and consistent with my findings, thus confirming the hypothesis.

Using an 866-layer ResNet, the authors reached almost 85% accuracy, only a few percentage points below the roughly 91% achievable by training the entire architecture. In addition, they tested different initialization schemes and architectures and tried unfreezing the skip connections and the last fully connected layer, which brought some extra performance.

Besides accuracy, the authors also studied the histograms of the β and γ parameters and found that the network learns to suppress roughly a third of all activations in each BN layer by setting their γ values close to zero.
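A similar inspection can be done on the models trained here. The sketch below collects the learned γ and β values and plots their histograms; it assumes the `model` object from the script above and uses matplotlib as an extra dependency.

import numpy as np
import matplotlib.pyplot as plt

# Collect the learned gamma and beta values from every BN layer of the nested ResNet.
gammas, betas = [], []
for layer in model.layers[1].layers:
    if layer.name.endswith('_bn'):
        gamma, beta = layer.get_weights()[0], layer.get_weights()[1]
        gammas.append(gamma.ravel())
        betas.append(beta.ravel())

plt.hist(np.concatenate(gammas), bins=50, alpha=0.5, label='gamma')
plt.hist(np.concatenate(betas), bins=50, alpha=0.5, label='beta')
plt.legend()
plt.show()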

Discussion

At this point you might ask: why do any of this? First of all, because it is fun :) Second, BN layers are everywhere, yet we still have only a superficial understanding of them; we mostly just know that they help. Third, investigations like this one give us a deeper understanding of how our models work.

I don't think this is practical in itself. No one is going to freeze all the layers of a network and leave only the BN layers trainable. However, it might inspire different training schedules. Perhaps training a network like this for a few epochs and then training all the weights could lead to higher performance. This technique could also be useful for fine-tuning pre-trained models, and I can also see the idea being used for pruning large networks.

What puzzles me most about this study is how much we have all been ignoring these two parameters. The only discussion of them I can recall argued that initializing γ to zero in the ResNet blocks works well because it forces the back-propagation algorithm to rely more on the skip connections in the early epochs. A rough Keras sketch of that trick is shown below.
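The "zero γ" idea can be written roughly as follows; the residual block here is a simplified sketch, not the exact ResNet implementation, and it assumes the input already has `filters` channels.

import tensorflow as tf

def residual_block(x, filters):
    # Initializing gamma of the last BN to zero makes the residual branch start as an
    # identity mapping, so early gradients flow mostly through the skip connection.
    shortcut = x
    y = tf.keras.layers.Conv2D(filters, 3, padding='same', use_bias=False)(x)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.ReLU()(y)
    y = tf.keras.layers.Conv2D(filters, 3, padding='same', use_bias=False)(y)
    y = tf.keras.layers.BatchNormalization(gamma_initializer='zeros')(y)
    y = tf.keras.layers.Add()([shortcut, y])
    return tf.keras.layers.ReLU()(y)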

My second question concerns the SELU and SERLU activation functions, which have self-normalizing properties. These functions naturally normalize their outputs without a batch normalization layer. Now I wonder whether they also gain all the other benefits of a batch normalization layer. A tiny illustration follows.
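For comparison, a self-normalizing layer needs no explicit BN at all; the line below is just an illustration of the usual SELU setup (paired with the 'lecun_normal' initializer), not something taken from the paper.

import tensorflow as tf

# A dense layer whose SELU activation keeps outputs approximately zero-mean and
# unit-variance, following the usual pairing with 'lecun_normal' initialization.
selu_layer = tf.keras.layers.Dense(128, activation='selu', kernel_initializer='lecun_normal')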

Finally, this hypothesis is still a bit raw. It only considers the CIFAR-10 dataset and very deep networks. Its practicality would increase if it could be extended to other datasets or to different tasks (e.g., a GAN using only its batchnorm layers). Likewise, I would be even more interested in a follow-up on the effect of γ and β in fully trained networks.

Author: Ygor Rebolledo Serpa

Translation: deephub



Source: blog.csdn.net/m0_46510245/article/details/105290604