Batch Normalization：Ioffe and Svegedy的批标准化实现 (BN处理对CNN的影响)

地址：https://github.com/harrisonjansma/Research-Computer-Vision/blob/master/07-28-18-Implementing-Batch-Norm/BatchNorm.ipynb

资源：

Ioffe 和 Szegedy的原始论文. here.
在非线性之前或之后插入批量归一化？ Usage explanation
有关TensorFlow中的数学和实现的说明. Pitfalls of Batch Norm
也是这个帖子 How to use Batch Normalization with TensorFlow and tf.keras

延伸阅读：

以下是一些最新的研究论文，这些论文扩展了Ioffe和Svegedy的工作。

[1] How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift)

[2] Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models

[3] Layer Normalization

[4] Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks

[5] Group Normalization

在本文中，我将介绍Ioffe和Svegedy的批处理规范化的实现。

一、批量标准化的直观解释

1、问题

问题1.随着网络的训练，每个神经元的输入变化很大。结果，每个神经元必须重新调整自己以适应每个新批次的变化分布。这是一种低效率，只会减慢模型训练的速度。我们如何规范单位输入？

可变输入分布的另一个影响是梯度消失。梯度消失问题很重要，尤其是对于Sigmoid激活函数。如果 g（x）表示Sigmoid激活函数，随着| x | 增加， ${g}'$ (x)趋于零。如图1.

图1. Sigmoid函数及其导数

问题2：当输入分布变化时，神经元输出也变化。这导致神经元输出波动到Sigmoid函数的可饱和区域。一旦到达那里，神经元既无法更新其自身的权重，也无法将梯度传递回先前的层。我们如何防止神经元输出变化到饱和区域？如图2.

图2. Sigmoid的最佳位置

2、批量标准化作为解决方案

批量标准化可减轻输入分布变化的影响。通过标准化LINK神经元的输出，我们抑制了向饱和区域的变化。这解决了我们的第二个问题。

批量标准化将神经元输出转换为单位高斯分布。当通过Sigmoid函数进行喂入时，层（layer）输出也将变得更加标准。由于层输出也是层输入，因此神经元输入现在将具有较小的变化。这解决了我们的第一个问题。

3、数学解释

通过批量标准化，我们为激活函数寻求输入的零中心，单位方差分布。在训练期间，我们将激活输入 $x_{i}$ 减去批均值 $\mu _{B}$ 来获得零中心分布。

接下来，我们将 x 除以批处理方差和一个小数，以防止除以零 $\sigma _ {B} + \epsilon$ 。这样可以确保所有激活输入分布均具有单位方差。

最后，我们对 $\hat{x}$ 进行了线性变换，以缩放和移动批量归一化 $y_i$ 的输出。尽管反向传播期间网络发生了变化，也可以保持这种标准化效果。

在测试期间，我们不使用批量均值或方差，因为这会破坏模型。（提示：单个观测值的均值和方差是多少？）相反，我们计算了训练总体的移动平均值和方差估计值。这些估计值是训练期间计算出的所有批量平均值和方差平均值。

4、批量标准化的好处

批处理标准化的好处如下。

a.有助于防止具有饱和非线性（Sigmoid，tanh等）的网络中的梯度消失

通过批量归一化，我们确保任何激活函数的输入都不会在饱和区域内变化。批处理标准化将这些输入的分布转换为单位高斯（零中心和单位方差）。 Tjis限制了梯度消失的影响。

b. 正则化模型

也许。 Ioffe和Svegeddy提出了这一主张，但在此问题上并没有广泛适用。也许这是标准化层输入的结果吗？

c.允许更高的学习率

通过防止训练过程中梯度消失的问题，我们可以负担得起更高的学习率。批量标准化还减少了对参数规模的依赖。较大的学习率会增加层参数的规模，从而导致梯度在反向传播过程中被传递回时放大。我需要阅读更多有关此的内容。

二、Keras实现

1、导入库

import tensorflow as tf 
import numpy as np
import os

import keras
from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

from keras.models import Model, Sequential
from keras.layers import Input

from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras.layers import BatchNormalization
from keras.layers import GlobalAveragePooling2D
from keras.layers import Activation
from keras.layers import Conv2D, MaxPooling2D, Dense
from keras.layers import MaxPooling2D, Dropout, Flatten

import time

2、数据加载和预处理

在本文中，我们使用Cifar 100小数据集，因为它具有一定的挑战性，并且不会永远花时间训练。唯一执行的预处理是零中心化和图像变化生成器。

from keras.datasets import cifar100
from keras.utils import np_utils

(x_train, y_train), (x_test, y_test) = cifar100.load_data(label_mode='fine')

#scale and regularize the dataset
x_train = (x_train-np.mean(x_train))
x_test = (x_test - x_test.mean())

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')

#onehot encode the target classes
y_train = np_utils.to_categorical(y_train)
y_test = np_utils.to_categorical(y_test)


train_datagen = ImageDataGenerator(
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True)

train_datagen.fit(x_train)

train_generator = train_datagen.flow(x_train,
                                     y = y_train,
                                    batch_size=80,)

3、在Keras中构建模型

卷积块由2个堆叠的3x3卷积组成，后跟最大池化和dropout。每个网络中有5个卷积块。最后一层是具有100个节点和softmax激活的全连接层。

我们将构建4个不同的网络，每个网络都具有Sigmoid或ReLU激活，并且具有批量标准化或不具有批量标准化的能力。我们将与其同类比较每个网络收敛和验证损失的时间。

def conv_block_first(model, bn=True, activation="sigmoid"):
    """
    The first convolutional block in each architecture. Only seperate so we can 
    specify the input shape.
    """
    model.add(Conv2D(60,3, padding = "same", input_shape = x_train.shape[1:]))
    if bn:
        model.add(BatchNormalization())
    model.add(Activation(activation))
    
    model.add(Conv2D(60,3, padding = "same"))
    if bn:
        model.add(BatchNormalization())
    model.add(Activation(activation))
    
    model.add(MaxPooling2D())
    model.add(Dropout(0.15))
    return model

def conv_block(model, bn=True, activation = "sigmoid"):
    """
    Generic convolutional block with 2 stacked 3x3 convolutions, max pooling, dropout, 
    and an optional Batch Normalization.
    """
    model.add(Conv2D(60,3, padding = "same"))
    if bn:
        model.add(BatchNormalization())
    model.add(Activation(activation))
    
    model.add(Conv2D(60,3, padding = "same"))
    if bn:
        model.add(BatchNormalization())
    model.add(Activation(activation))
    
    model.add(MaxPooling2D())
    model.add(Dropout(0.15))
    return model

def conv_block_final(model, bn=True, activation = "sigmoid"):
    """
    I bumped up the number of filters in the final block. I made this seperate so that
    I might be able to integrate Global Average Pooling later on. 
    """
    model.add(Conv2D(100,3, padding = "same"))
    if bn:
        model.add(BatchNormalization())
    model.add(Activation(activation))
    
    model.add(Conv2D(100,3, padding = "same"))
    if bn:
        model.add(BatchNormalization())
    model.add(Activation(activation))
    
    model.add(Flatten())
    return model

def fn_block(model):
    """
    I'm not going for a very deep fully connected block, mainly so I can save on memory.
    """
    model.add(Dense(100, activation = "softmax"))
    return model

def build_model(blocks=3, bn=True, activation = "sigmoid"):
    """
    Builds a sequential network based on the specified parameters.
    
    blocks: number of convolutional blocks in the network, must be greater than 2.
    bn: whether to include batch normalization or not.
    activation: activation function to use throughout the network.
    """
    model = Sequential()
    
    model = conv_block_first(model, bn=bn, activation=activation)
    
    for block in range(1,blocks-1):
        model = conv_block(model, bn=bn, activation = activation)
        
    model = conv_block_final(model, bn=bn, activation=activation)
    model = fn_block(model)
    
    return model

def compile_model(model, optimizer = "rmsprop", loss = "categorical_crossentropy", metrics = ["accuracy"]): 
    """
    Compiles a neural network.
    
    model: the network to be compiled.
    optimizer: the optimizer to use.
    loss: the loss to use.
    metrics: a list of keras metrics.
    """
    model.compile(optimizer = optimizer,
                 loss = loss,
                 metrics = metrics)
    return model

sigmoid_without_bn = build_model(blocks = 5, bn=False, activation = "sigmoid")
sigmoid_without_bn = compile_model(sigmoid_without_bn)

sigmoid_with_bn = build_model(blocks = 5, bn=True, activation = "sigmoid")
sigmoid_with_bn = compile_model(sigmoid_with_bn)


relu_without_bn = build_model(blocks = 5, bn=False, activation = "relu")
relu_without_bn = compile_model(relu_without_bn)

relu_with_bn = build_model(blocks = 5, bn=True, activation = "relu")
relu_with_bn = compile_model(relu_with_bn)

4、模型训练

a、无批量标准化的Sigmoid

start = time.time()
model_checkpoint = ModelCheckpoint('models/sigmoid_without_bn.h5',
                                   save_best_only = True)

history1 = sigmoid_without_bn.fit_generator(
        train_generator,
        steps_per_epoch=2000,
        epochs=20,
        verbose=0,
        validation_data=(x_test, y_test),
        callbacks = [model_checkpoint])

end = time.time()

print("Training time: ", (end - start)/60, " minutes")

plt.figure(figsize=(12,6))
# summarize history for accuracy
plt.subplot(121)
plt.plot(history1.history['acc'])
plt.plot(history1.history['val_acc'])
plt.title('Sig w/o BN Accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
# summarize history for loss
plt.subplot(122)
plt.plot(history1.history['loss'])
plt.plot(history1.history['val_loss'])
plt.title('Sig w/o BN Loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper right')
plt.show()

b、带批量标准化的Sigmoid

start = time.time()

model_checkpoint = ModelCheckpoint('models/sigmoid_with_bn.h5',
                                   save_best_only = True)

history2 = sigmoid_with_bn.fit_generator(
        train_generator,
        steps_per_epoch=2000,
        verbose=0,
        epochs=20,
        validation_data=(x_test, y_test),
        callbacks = [model_checkpoint])

end = time.time()

print("Training time: ", (end - start)/60, " minutes")

plt.figure(figsize=(12,6))
# summarize history for accuracy
plt.subplot(121)
plt.plot(history2.history['acc'])
plt.plot(history2.history['val_acc'])
plt.title('Sig w/ BN Accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
# summarize history for loss
plt.subplot(122)
plt.plot(history2.history['loss'])
plt.plot(history2.history['val_loss'])
plt.title('Sig w/ BN Loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper right')
plt.show()

c、无批量标准化的ReLU

start = time.time()

model_checkpoint = ModelCheckpoint('models/ReLU_without_BN.h5',
                                   save_best_only = True)

history3 = relu_without_bn.fit_generator(
        train_generator,
        steps_per_epoch=2000,
        epochs=20,
        verbose=0,
        validation_data=(x_test, y_test),
        callbacks = [model_checkpoint])

end = time.time()

print("Training time: ", (end - start)/60, " minutes")

plt.figure(figsize=(12,6))
# summarize history for accuracy
plt.subplot(121)
plt.plot(history3.history['acc'])
plt.plot(history3.history['val_acc'])
plt.title('ReLU w/o BN Accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
# summarize history for loss
plt.subplot(122)
plt.plot(history3.history['loss'])
plt.plot(history3.history['val_loss'])
plt.title('ReLU w/o BN Loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper right')
plt.show()

d、带批量标准化的ReLU

start = time.time()

model_checkpoint = ModelCheckpoint('models/ReLU_with_bn.h5',
                                   save_best_only = True)

history4 = relu_with_bn.fit_generator(
        train_generator,
        steps_per_epoch=2000,
        verbose=0,
        epochs=20,
        validation_data=(x_test, y_test),
        callbacks = [model_checkpoint])

end = time.time()

print("Training time: ", (end - start)/60, " minutes")

plt.figure(figsize=(12,6))
# summarize history for accuracy
plt.subplot(121)
plt.plot(history4.history['acc'])
plt.plot(history4.history['val_acc'])
plt.title('ReLU w/ BN Accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
# summarize history for loss
plt.subplot(122)
plt.plot(history4.history['loss'])
plt.plot(history4.history['val_loss'])
plt.title('ReLU w/ BN Loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper right')
plt.show()

5、对比架构

我们在这里清楚地看到了批量标准化的好处。没有批量标准化的ReLU和Sigmoid模型都无法维持性能提升。这可能是梯度消失的结果。具有批量标准化的架构比没有批量标准化的架构训练得更快，并且性能更好。

plt.figure(figsize=(12,6))
# summarize history for accuracy
plt.subplot(121)
plt.plot(history1.history['val_acc'])
plt.plot(history2.history['val_acc'])
plt.plot(history3.history['val_acc'])
plt.plot(history4.history['val_acc'])
plt.title('All Models Accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['Sig w/o BN', 'Sig w/ BN', 'ReLU w/o BN', 'ReLU w/ BN'], loc='upper left')
# summarize history for loss
plt.subplot(122)
plt.plot(history1.history['val_loss'])
plt.plot(history2.history['val_loss'])
plt.plot(history3.history['val_loss'])
plt.plot(history4.history['val_loss'])
plt.title('All Models Loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['Sig w/o BN', 'Sig w/ BN', 'ReLU w/o BN', 'ReLU w/ BN'], loc='upper right')
plt.show()

6、结论

批量标准化减少了训练时间并提高了神经网络的稳定性。此效果适用于Sigmoid和ReLU激活函数。在我的下一个笔记本中，我将总结全局平均池对卷积神经网络训练的影响。