[CS231n Assignment 2 #01] Batch Normalization

Assignment Overview

  • Assignment home page: Assignment 2
  • Goal: one way to make deep neural networks train better is to use more sophisticated optimizers such as SGD+Momentum, RMSProp, or Adam; another is to change the network architecture, for example by adding the batch normalization layers implemented in this part.
  • Official starter code: Assignment 2 code
  • Assignment notebook: BatchNormalization.ipynb

1. Batch Normalization

Machine learning methods tend to work best when their input data are uncorrelated, zero-mean, and unit-variance. When we train a deep neural network, however, even if we preprocess the data so that the inputs satisfy these properties, the activations at deeper layers will no longer do so; worse, as the weights are updated, the distribution of each layer's inputs keeps drifting during training.
The authors of recommended reading 1 hypothesize that this drift of the input feature distributions makes deep networks harder to train, and propose inserting a batch normalization (BN) layer to address the problem.
At training time, a BN layer uses the mini-batch to estimate the mean and variance of each feature dimension and normalizes the incoming mini-batch to zero mean and unit variance. At the same time it maintains exponentially decaying running averages of the mean and variance, which are used to normalize data at test time.
However, forcing every layer's inputs into this fixed distribution might reduce the network's representational power: for some layers, a non-zero mean or non-unit variance might actually work better. Therefore each BN layer also learns, per feature dimension, a shift parameter (beta) and a scale parameter (gamma) that can partially restore the distribution, so the features need not strictly follow the standard one; this keeps the layer expressive.
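In formulas, for a mini-batch $x_1, \dots, x_N$ the BN layer computes, independently for each feature dimension,

$$
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2, \qquad
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad
y_i = \gamma\,\hat{x}_i + \beta,
$$

where $\gamma$ (scale) and $\beta$ (shift) are learned per feature and $\epsilon$ is a small constant for numerical stability.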

1.1 BN layer: forward pass
  • In the file cs231n/layers.py, implement the batch normalization forward pass in the function batchnorm_forward. Once you have done so, run the following to test your implementation.
def batchnorm_forward(x, gamma, beta, bn_param):
    """
    Forward pass for batch normalization.

    During training the sample mean and (uncorrected) sample variance are
    computed from minibatch statistics and used to normalize the incoming data.
    During training we also keep an exponentially decaying running mean of the
    mean and variance of each feature, and these averages are used to normalize
    data at test-time.

    At each timestep we update the running averages for mean and variance using
    an exponential decay based on the momentum parameter:

    running_mean = momentum * running_mean + (1 - momentum) * sample_mean
    running_var = momentum * running_var + (1 - momentum) * sample_var

    Note that the batch normalization paper suggests a different test-time
    behavior: they compute sample mean and variance for each feature using a
    large number of training images rather than using a running average. For
    this implementation we have chosen to use running averages instead since
    they do not require an additional estimation step; the torch7
    implementation of batch normalization also uses running averages.

    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift paremeter of shape (D,)
    - bn_param: Dictionary with the following keys:
      - mode: 'train' or 'test'; required
      - eps: Constant for numeric stability
      - momentum: Constant for running mean / variance.
      - running_mean: Array of shape (D,) giving running mean of features
      - running_var Array of shape (D,) giving running variance of features

    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    mode = bn_param['mode']
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)

    N, D = x.shape
    running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))

    out, cache = None, None
    if mode == 'train':
        batch_mean = np.mean(x,axis = 0)
        batch_var = np.var(x, axis = 0)
        # Update the running averages that will be used at test time
        running_mean = momentum * running_mean + (1 - momentum) * batch_mean
        running_var = momentum * running_var + (1 - momentum) * batch_var
        x_std = (x - batch_mean) / np.sqrt(batch_var + eps)
        out = gamma * x_std + beta
        cache = (gamma, x_std, beta, x, batch_mean, batch_var, eps)
    elif mode == 'test':
        #######################################################################
        # TODO: Implement the test-time forward pass for batch normalization. #
        # Use the running mean and variance to normalize the incoming data,   #
        # then scale and shift the normalized data using gamma and beta.      #
        # Store the result in the out variable.                               #
        #######################################################################
        x_std = (x - running_mean) / np.sqrt(running_var + eps)
        out = gamma * x_std + beta
    else:
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

    # Store the updated running means back into bn_param
    bn_param['running_mean'] = running_mean
    bn_param['running_var'] = running_var

    return out, cache
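As a quick sanity check (a minimal sketch, assuming the batchnorm_forward above is in scope), training-mode BN with gamma = 1 and beta = 0 should leave every feature column with roughly zero mean and unit standard deviation:

import numpy as np

np.random.seed(0)
x = 5.0 + 4.0 * np.random.randn(200, 3)      # shifted and scaled toy input
gamma, beta = np.ones(3), np.zeros(3)
bn_param = {'mode': 'train'}

out, _ = batchnorm_forward(x, gamma, beta, bn_param)
print(out.mean(axis=0))   # approximately [0, 0, 0]
print(out.std(axis=0))    # approximately [1, 1, 1]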
1.2 BN layer: backward pass
  • Now implement the backward pass for batch normalization in the function batchnorm_backward.
  • To derive the backward pass you should write out the computation graph for batch normalization and backprop through each of the intermediate nodes. Some intermediates may have multiple outgoing branches; make sure to sum gradients across these branches in the backward pass.
def batchnorm_backward(dout, cache):
    """
    Backward pass for batch normalization.

    For this implementation, you should write out a computation graph for
    batch normalization on paper and propagate gradients backward through
    intermediate nodes.

    Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from batchnorm_forward.

    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    gamma, x_std, beta, x, batch_mean, batch_var, eps = cache
    N = x.shape[0]
    # out = gamma * x_std + beta
    dbeta = np.sum(dout, axis = 0)
    dgamma = np.sum(dout * x_std, axis = 0)
    dx_std = dout * gamma

    # x_std = (x - mean) / std
    # Note that x feeds several nodes: the direct branch, the variance branch,
    # and the mean branch, so several edges of the computation graph flow back into x.
    a = np.sqrt(batch_var + eps)
    # Gradient with respect to the variance first
    dvar = np.sum( - 0.5 * (x - batch_mean) * dx_std / a ** 3 , axis = 0)
    dmean = np.sum( - dx_std / a, axis=0) + dvar * np.sum(-2 * (x - batch_mean), axis=0) / N
    dx = dx_std / a + dmean / N + 2 * dvar * (x - batch_mean) / N
    return dx, dgamma, dbeta
  • For the backward pass we can either draw the computation graph and backpropagate through it node by node, or derive the derivative of the output with respect to the input directly and apply it in one step, which is usually a bit faster (see the simplified expression below):
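For reference, collapsing the chain of intermediate gradients gives the usual closed form (with $\hat{x}_i$, $\sigma^2$ and $\epsilon$ as in the forward pass):

$$
\frac{\partial L}{\partial \gamma} = \sum_{i=1}^{N} \frac{\partial L}{\partial y_i}\,\hat{x}_i, \qquad
\frac{\partial L}{\partial \beta} = \sum_{i=1}^{N} \frac{\partial L}{\partial y_i}, \qquad
\frac{\partial L}{\partial x_i} = \frac{\gamma}{N\sqrt{\sigma^2 + \epsilon}}
\left( N\,\frac{\partial L}{\partial y_i}
- \sum_{j=1}^{N} \frac{\partial L}{\partial y_j}
- \hat{x}_i \sum_{j=1}^{N} \frac{\partial L}{\partial y_j}\,\hat{x}_j \right)
$$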
def batchnorm_backward_alt(dout, cache):
    """
    Alternative backward pass for batch normalization.

    For this implementation you should work out the derivatives for the batch
    normalizaton backward pass on paper and simplify as much as possible. You
    should be able to derive a simple expression for the backward pass. 
    See the jupyter notebook for more hints.
     
    Note: This implementation should expect to receive the same cache variable
    as batchnorm_backward, but might not use all of the values in the cache.

    Inputs / outputs: Same as batchnorm_backward
    """
    gamma, x_std, beta, x, batch_mean, batch_var, eps = cache
  
    N = x.shape[0]
    # gamma and beta first; they are the easy part
    dgamma = np.sum(dout * x_std, axis = 0)
    dbeta = np.sum(dout, axis = 0)
    # then the gradient with respect to x
    a = 1 / np.sqrt(batch_var + eps)
    dx_hat = dout * gamma

    dvar = np.sum(dx_hat * (x - batch_mean) * (-0.5) * (a ** 3), axis = 0)
    dmean = np.sum(-dx_hat * a, axis=0)  # + dvar * (-2 / N) * np.sum(x - batch_mean, axis=0), but that term is zero
    dx = dx_hat * a + dvar * 2 * (x - batch_mean) / N + dmean / N
    return dx, dgamma, dbeta
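A quick way to confirm that the two implementations agree (a sketch, assuming batchnorm_forward, batchnorm_backward and batchnorm_backward_alt above are all in scope):

import numpy as np

np.random.seed(1)
N, D = 100, 5
x = np.random.randn(N, D)
gamma, beta = np.random.randn(D), np.random.randn(D)
dout = np.random.randn(N, D)

_, cache = batchnorm_forward(x, gamma, beta, {'mode': 'train'})
dx1, dgamma1, dbeta1 = batchnorm_backward(dout, cache)
dx2, dgamma2, dbeta2 = batchnorm_backward_alt(dout, cache)
print(np.max(np.abs(dx1 - dx2)))   # should be at the level of floating-point noise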
1.3 Fully Connected Nets with Batch Normalization
  • Now that you have a working implementation for batch normalization, go back to your FullyConnectedNet in the file cs231n/classifiers/fc_net.py. Modify your implementation to add batch normalization.
  • That is, we add BN layers to the fully connected network we implemented earlier.
  • You might find it useful to define an additional helper layer similar to those in the file cs231n/layer_utils.py.

Step 1: define the new affine -> batchnorm -> relu convenience layer in layer_utils.py.

def affine_bn_relu_forward(x, w, b, gamma, beta, bn_params):
    """
    Convenience layer that performs an affine transform followed by batch
    normalization and a ReLU.

    Inputs:
    - x: Input to the affine layer
    - w, b: Weights for the affine layer
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - bn_params: Dictionary of batchnorm parameters (mode, eps, momentum,
      running_mean, running_var) passed through to batchnorm_forward

    Returns a tuple of:
    - out: Output from the ReLU
    - cache: Object to give to the backward pass
    """
    a_fc, fc_cache = affine_forward(x,w,b)
    a_bn,bn_cache = batchnorm_forward(a_fc,gamma,beta,bn_params)
    out,relu_cache = relu_forward(a_bn)
    cache = (fc_cache, bn_cache, relu_cache)
    return out, cache
def affine_bn_relu_backward(dout, cache):
    """
    Backward pass for the affine-bn-relu convenience layer
    """
    fc_cache, bn_cache, relu_cache = cache
    da = relu_backward(dout, relu_cache)
    da_bn, dgamma,dbeta = batchnorm_backward(da,bn_cache)
    dx,dw,db = affine_backward(da_bn,fc_cache)
    return dx,dw,db,dgamma,dbeta
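A minimal shape check for the new sandwich layer (a sketch; it assumes the two functions were added to cs231n/layer_utils.py as above):

import numpy as np
from cs231n.layer_utils import affine_bn_relu_forward, affine_bn_relu_backward

np.random.seed(0)
N, D, H = 4, 6, 5
x = np.random.randn(N, D)
w = np.random.randn(D, H) * 1e-2
b = np.zeros(H)
gamma, beta = np.ones(H), np.zeros(H)
bn_param = {'mode': 'train'}

out, cache = affine_bn_relu_forward(x, w, b, gamma, beta, bn_param)
dx, dw, db, dgamma, dbeta = affine_bn_relu_backward(np.random.randn(*out.shape), cache)
print(out.shape, dx.shape, dw.shape, dgamma.shape)   # (4, 5) (4, 6) (6, 5) (5,)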
  • Then update the fc_net.py we implemented earlier:
from builtins import range
from builtins import object
import numpy as np

from cs231n.layers import *
from cs231n.layer_utils import *


class TwoLayerNet(object):
    """
    A two-layer fully-connected neural network with ReLU nonlinearity and
    softmax loss that uses a modular layer design. We assume an input dimension
    of D, a hidden dimension of H, and perform classification over C classes.

    The architecure should be affine - relu - affine - softmax.

    Note that this class does not implement gradient descent; instead, it
    will interact with a separate Solver object that is responsible for running
    optimization.

    The learnable parameters of the model are stored in the dictionary
    self.params that maps parameter names to numpy arrays.
    """

    def __init__(self, input_dim=3*32*32, hidden_dim=100, num_classes=10,
                 weight_scale=1e-3, reg=0.0):
        """
        Initialize a new network.

        Inputs:
        - input_dim: An integer giving the size of the input
        - hidden_dim: An integer giving the size of the hidden layer
        - num_classes: An integer giving the number of classes to classify
        - weight_scale: Scalar giving the standard deviation for random
          initialization of the weights.
        - reg: Scalar giving L2 regularization strength.
        """
        self.params = {}
        self.reg = reg

        # TODO: Initialize the weights and biases of the two-layer net. Weights    #
        # should be initialized from a Gaussian centered at 0.0 with               #
        # standard deviation equal to weight_scale, and biases should be           #
        # initialized to zero.
        self.params["W1"] = np.random.randn(input_dim,hidden_dim) * weight_scale
        self.params["b1"] = np.zeros(hidden_dim)
        self.params["W2"] = np.random.randn(hidden_dim,num_classes) * weight_scale
        self.params["b2"] = np.zeros(num_classes)
    def loss(self, X, y=None):
        """
        Compute loss and gradient for a minibatch of data.

        Inputs:
        - X: Array of input data of shape (N, d_1, ..., d_k)
        - y: Array of labels, of shape (N,). y[i] gives the label for X[i].

        Returns:
        If y is None, then run a test-time forward pass of the model and return:
        - scores: Array of shape (N, C) giving classification scores, where
          scores[i, c] is the classification score for X[i] and class c.

        If y is not None, then run a training-time forward and backward pass and
        return a tuple of:
        - loss: Scalar value giving the loss
        - grads: Dictionary with the same keys as self.params, mapping parameter
          names to gradients of the loss with respect to those parameters.
        """
        ############################################################################
        # TODO: Implement the forward pass for the two-layer net, computing the#
        # class scores for X and storing them in the scores variable.              #
        ############################################################################
        H , cache_layer1 = affine_relu_forward(X,self.params["W1"],self.params["b1"])
        scores , cache_layer2 = affine_forward(H, self.params["W2"], self.params["b2"])
        # If y is None then we are in test mode so just return scores
        if y is None:
            return scores

        loss, grads = 0, {}
        ############################################################################
        # TODO: Implement the backward pass for the two-layer net. Store the loss  #
        # in the loss variable and gradients in the grads dictionary. Compute data #
        # loss using softmax, and make sure that grads[k] holds the gradients for  #
        # self.params[k]. Don't forget to add L2 regularization!                   #
        #                                                                          #
        # NOTE: To ensure that your implementation matches ours and you pass the   #
        # automated tests, make sure that your L2 regularization includes a factor #
        # of 0.5 to simplify the expression for the gradient.                      #
        ############################################################################
        loss, dS = softmax_loss(scores,y)
        loss += 0.5 * self.reg * np.sum(self.params["W1"] * self.params["W1"])
        loss += 0.5 * self.reg * np.sum(self.params["W2"] * self.params["W2"]) # 添加正则项

        dH , dW2 , grads["b2"] = affine_backward(dS,cache_layer2)
        dx, dW1,   grads["b1"] = affine_relu_backward(dH , cache_layer1)

        grads["W1"] = dW1 + self.reg * self.params["W1"]
        grads["W2"] = dW2 + self.reg * self.params["W2"] # 正则项损失
        return loss, grads


class FullyConnectedNet(object):
    """
    A fully-connected neural network with an arbitrary number of hidden layers,
    ReLU nonlinearities, and a softmax loss function. This will also implement
    dropout and batch/layer normalization as options. For a network with L layers,
    the architecture will be

    {affine - [batch/layer norm] - relu - [dropout]} x (L - 1) - affine - softmax

    where batch/layer normalization and dropout are optional, and the {...} block is
    repeated L - 1 times.

    Similar to the TwoLayerNet above, learnable parameters are stored in the
    self.params dictionary and will be learned using the Solver class.
    """

    def __init__(self, hidden_dims, input_dim=3*32*32, num_classes=10,
                 dropout=1, normalization=None, reg=0.0,
                 weight_scale=1e-2, dtype=np.float32, seed=None):
        """
        Initialize a new FullyConnectedNet.

        Inputs:
        - hidden_dims: A list of integers giving the size of each hidden layer.
        - input_dim: An integer giving the size of the input.
        - num_classes: An integer giving the number of classes to classify.
        - dropout: Scalar between 0 and 1 giving dropout strength. If dropout=1 then
          the network should not use dropout at all.
        - normalization: What type of normalization the network should use. Valid values
          are "batchnorm", "layernorm", or None for no normalization (the default).
        - reg: Scalar giving L2 regularization strength.
        - weight_scale: Scalar giving the standard deviation for random
          initialization of the weights.
        - dtype: A numpy datatype object; all computations will be performed using
          this datatype. float32 is faster but less accurate, so you should use
          float64 for numeric gradient checking.
        - seed: If not None, then pass this random seed to the dropout layers. This
          will make the dropout layers deteriminstic so we can gradient check the
          model.
        """
        self.normalization = normalization
        self.use_dropout = dropout != 1
        self.reg = reg
        self.num_layers = 1 + len(hidden_dims)
        self.dtype = dtype
        self.params = {}

        ############################################################################
        # TODO: Initialize the parameters of the network, storing all values in    #
        # the self.params dictionary. Store weights and biases for the first layer #
        # in W1 and b1; for the second layer use W2 and b2, etc.                   #
        # When using batch normalization, store scale and shift parameters for the #
        # first layer in gamma1 and beta1; for the second layer use gamma2 and     #
        # beta2, etc. Scale parameters should be initialized to ones and shift     #
        # parameters should be initialized to zeros.                               #
        ############################################################################
        input_size = input_dim
        for i in range(len(hidden_dims)):
            output_size = hidden_dims[i]
            self.params['W' + str(i+1)] = np.random.randn(input_size,output_size) * weight_scale
            self.params['b' + str(i+1)] = np.zeros(output_size)
            if self.normalization == 'batchnorm':
                self.params['gamma' + str(i+1)] = np.ones(output_size)
                self.params['beta' + str(i+1)] = np.zeros(output_size)
            input_size = output_size # input size of the next layer
        # Output layer: no BN here
        self.params['W' + str(self.num_layers)] = np.random.randn(input_size,num_classes) * weight_scale
        self.params['b' + str(self.num_layers)] = np.zeros(num_classes)
        # When using dropout we need to pass a dropout_param dictionary to each
        # dropout layer so that the layer knows the dropout probability and the mode
        # (train / test). You can pass the same dropout_param to each dropout layer.
        self.dropout_param = {}
        if self.use_dropout:
            self.dropout_param = {'mode': 'train', 'p': dropout}
            if seed is not None:
                self.dropout_param['seed'] = seed

        # With batch normalization we need to keep track of running means and
        # variances, so we need to pass a special bn_param object to each batch
        # normalization layer. You should pass self.bn_params[0] to the forward pass
        # of the first batch normalization layer, self.bn_params[1] to the forward
        # pass of the second batch normalization layer, etc.
        self.bn_params = []
        if self.normalization=='batchnorm':
            self.bn_params = [{'mode': 'train'} for i in range(self.num_layers - 1)]
        if self.normalization=='layernorm':
            self.bn_params = [{} for i in range(self.num_layers - 1)]

        # Cast all parameters to the correct datatype
        for k, v in self.params.items():
            self.params[k] = v.astype(dtype)


    def loss(self, X, y=None):
        """
        Compute loss and gradient for the fully-connected net.

        Input / output: Same as TwoLayerNet above.
        """
        X = X.astype(self.dtype)
        mode = 'test' if y is None else 'train'

        # Set train/test mode for batchnorm params and dropout param since they
        # behave differently during training and testing.
        if self.use_dropout:
            self.dropout_param['mode'] = mode
        if self.normalization=='batchnorm':
            for bn_param in self.bn_params:
                bn_param['mode'] = mode
        ############################################################################
        # TODO: Implement the forward pass for the fully-connected net, computing  #
        # the class scores for X and storing them in the scores variable.          #
        #                                                                          #
        # When using dropout, you'll need to pass self.dropout_param to each       #
        # dropout forward pass.                                                    #
        #                                                                          #
        # When using batch normalization, you'll need to pass self.bn_params[0] to #
        # the forward pass for the first batch normalization layer, pass           #
        # self.bn_params[1] to the forward pass for the second batch normalization #
        # layer, etc.                                                              #
        ############################################################################
        cache = {} # store per-layer caches for the backward pass
        hidden = X
        for i in range(self.num_layers - 1):
            if self.normalization == 'batchnorm':
                hidden,cache[i+1] = affine_bn_relu_forward(hidden,
                                    self.params['W' + str(i+1)],
                                    self.params['b' + str(i+1)],
                                    self.params['gamma' + str(i+1)],
                                    self.params['beta' + str(i+1)],
                                    self.bn_params[i])
            else:
                hidden , cache[i+1] = affine_relu_forward(hidden,self.params['W' + str(i+1)],
                                                          self.params['b' + str(i+1)])
            if self.use_dropout:
                pass
        # The last layer has no ReLU
        scores, cache[self.num_layers] = affine_forward(hidden , self.params['W' + str(self.num_layers)],
                                                       self.params['b' + str(self.num_layers)])

        # If test mode return early
        if mode == 'test':
            return scores

        ############################################################################
        # TODO: Implement the backward pass for the fully-connected net. Store the #
        # loss in the loss variable and gradients in the grads dictionary. Compute #
        # data loss using softmax, and make sure that grads[k] holds the gradients #
        # for self.params[k]. Don't forget to add L2 regularization!               #
        ############################################################################
        loss, grads = 0.0, {}
        loss, dS = softmax_loss(scores , y)
        # The last layer has no ReLU
        dhidden, grads['W' + str(self.num_layers)], grads['b' + str(self.num_layers)] \
            = affine_backward(dS,cache[self.num_layers])
        loss += 0.5 * self.reg * np.sum(self.params['W' + str(self.num_layers)] * self.params['W' + str(self.num_layers)])
        grads['W' + str(self.num_layers)] += self.reg * self.params['W' + str(self.num_layers)]

        for i in range(self.num_layers - 1, 0, -1):
            loss += 0.5 * self.reg * np.sum(self.params["W" + str(i)] * self.params["W" + str(i)])
            # walk backward through the layers
            if self.use_dropout:
                pass
            if self.normalization == 'batchnorm':
                dhidden, dw, db, dgamma, dbeta = affine_bn_relu_backward(dhidden, cache[i])
                grads['gamma' + str(i)] = dgamma
                grads['beta' + str(i)] = dbeta
            else:
                dhidden, dw, db = affine_relu_backward(dhidden, cache[i])
            grads['W' + str(i)] = dw + self.reg * self.params['W' + str(i)]
            grads['b' + str(i)] = db
        return loss, grads
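A quick smoke test of the modified network (a sketch; it assumes the assignment's package layout, i.e. the class lives in cs231n/classifiers/fc_net.py):

import numpy as np
from cs231n.classifiers.fc_net import FullyConnectedNet

np.random.seed(231)
N, D, C = 4, 15, 7
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)

model = FullyConnectedNet([20, 30], input_dim=D, num_classes=C,
                          normalization='batchnorm', reg=0.0, dtype=np.float64)
loss, grads = model.loss(X, y)
print('initial loss:', loss)   # roughly ln(7) ≈ 1.95 for small random weights
print(sorted(grads.keys()))    # W1-W3, b1-b3, gamma1-gamma2, beta1-beta2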

2. Batchnorm for deep networks

We train a six-layer network on 1,000 training examples and compare runs with and without batch normalization:

  • Training
np.random.seed(231)
# Try training a very deep net with batchnorm
hidden_dims = [100, 100, 100, 100, 100]

num_train = 1000
small_data = {
  'X_train': data['X_train'][:num_train],
  'y_train': data['y_train'][:num_train],
  'X_val': data['X_val'],
  'y_val': data['y_val'],
}

epochs = 10
weight_scale = 2e-2
bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization='batchnorm')
model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=None)

bn_solver = Solver(bn_model, small_data,
                num_epochs=epochs, batch_size=50,
                update_rule='adam',
                optim_config={
                  'learning_rate': 1e-3,
                },
                verbose=True,print_every=20)
bn_solver.train()

solver = Solver(model, small_data,
                num_epochs=epochs, batch_size=50,
                update_rule='adam',
                optim_config={
                  'learning_rate': 1e-3,
                },
                verbose=True, print_every=20)
solver.train()
  • Comparison: you will find that batch normalization speeds up the convergence of the network.
def plot_training_history(title, label, baseline, bn_solvers, plot_fn, bl_marker='.', bn_marker='.', labels=None):
    """utility function for plotting training history"""
    plt.title(title)
    plt.xlabel(label)
    bn_plots = [plot_fn(bn_solver) for bn_solver in bn_solvers]
    bl_plot = plot_fn(baseline)
    num_bn = len(bn_plots)
    for i in range(num_bn):
        label='with_norm'
        if labels is not None:
            label += str(labels[i])
        plt.plot(bn_plots[i], bn_marker, label=label)
    label='baseline'
    if labels is not None:
        label += str(labels[0])
    plt.plot(bl_plot, bl_marker, label=label)
    plt.legend(loc='lower center', ncol=num_bn+1) 

    
plt.subplot(3, 1, 1)
plot_training_history('Training loss','Iteration', solver, [bn_solver], \
                      lambda x: x.loss_history, bl_marker='o', bn_marker='o')
plt.subplot(3, 1, 2)
plot_training_history('Training accuracy','Epoch', solver, [bn_solver], \
                      lambda x: x.train_acc_history, bl_marker='-o', bn_marker='-o')
plt.subplot(3, 1, 3)
plot_training_history('Validation accuracy','Epoch', solver, [bn_solver], \
                      lambda x: x.val_acc_history, bl_marker='-o', bn_marker='-o')

plt.gcf().set_size_inches(15, 15)
plt.show()


3. Batch normalization and initialization

In this section we examine how batch normalization interacts with weight initialization.

  • First, we define an eight-layer network and compare, across different weight initialization scales, the performance of networks with and without BN.
np.random.seed(231)
# Try training a very deep net with batchnorm
hidden_dims = [50, 50, 50, 50, 50, 50, 50]
num_train = 1000
small_data = {
  'X_train': data['X_train'][:num_train],
  'y_train': data['y_train'][:num_train],
  'X_val': data['X_val'],
  'y_val': data['y_val'],
}

bn_solvers_ws = {}
solvers_ws = {}
weight_scales = np.logspace(-4, 0, num=20)
for i, weight_scale in enumerate(weight_scales):
  print('Running weight scale %d / %d' % (i + 1, len(weight_scales)))
  bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization='batchnorm')
  model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=None)

  bn_solver = Solver(bn_model, small_data,
                  num_epochs=10, batch_size=50,
                  update_rule='adam',
                  optim_config={
                    'learning_rate': 1e-3,
                  },
                  verbose=False, print_every=200)
  bn_solver.train()
  bn_solvers_ws[weight_scale] = bn_solver

  solver = Solver(model, small_data,
                  num_epochs=10, batch_size=50,
                  update_rule='adam',
                  optim_config={
                    'learning_rate': 1e-3,
                  },
                  verbose=False, print_every=200)
  solver.train()
  solvers_ws[weight_scale] = solver
  • Results:
# Plot results of weight scale experiment
best_train_accs, bn_best_train_accs = [], []
best_val_accs, bn_best_val_accs = [], []
final_train_loss, bn_final_train_loss = [], []

for ws in weight_scales:
  best_train_accs.append(max(solvers_ws[ws].train_acc_history))
  bn_best_train_accs.append(max(bn_solvers_ws[ws].train_acc_history))
  
  best_val_accs.append(max(solvers_ws[ws].val_acc_history))
  bn_best_val_accs.append(max(bn_solvers_ws[ws].val_acc_history))
  
  final_train_loss.append(np.mean(solvers_ws[ws].loss_history[-100:]))
  bn_final_train_loss.append(np.mean(bn_solvers_ws[ws].loss_history[-100:]))
  
plt.subplot(3, 1, 1)
plt.title('Best val accuracy vs weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Best val accuracy')
plt.semilogx(weight_scales, best_val_accs, '-o', label='baseline')
plt.semilogx(weight_scales, bn_best_val_accs, '-o', label='batchnorm')
plt.legend(ncol=2, loc='lower right')

plt.subplot(3, 1, 2)
plt.title('Best train accuracy vs weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Best training accuracy')
plt.semilogx(weight_scales, best_train_accs, '-o', label='baseline')
plt.semilogx(weight_scales, bn_best_train_accs, '-o', label='batchnorm')
plt.legend()

plt.subplot(3, 1, 3)
plt.title('Final training loss vs weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Final training loss')
plt.semilogx(weight_scales, final_train_loss, '-o', label='baseline')
plt.semilogx(weight_scales, bn_final_train_loss, '-o', label='batchnorm')
plt.legend()
plt.gca().set_ylim(1.0, 3.5)

plt.gcf().set_size_inches(15, 15)
plt.show()

Q: Describe the results of this experiment. How does the scale of weight initialization affect models with/without batch normalization differently, and why?
A: From this experiment we can see that BN makes training far less sensitive to the weight initialization scale. Without BN, if the weights are initialized too small, the activations collapse toward zero layer after layer, so the backpropagated gradients, multiplied by those small weights, become vanishingly small; if the weights are initialized too large, the activations polarize and saturate.

4. Batch normalization and batch size

In this section we look at how batch normalization interacts with the batch size. Intuitively, the larger the batch, the more accurate the estimates of the mean and variance.

  • Initialize a six-layer network with BN layers and train it with several different batch sizes:
def run_batchsize_experiments(normalization_mode):
    np.random.seed(231)
    # Try training a very deep net with batchnorm
    hidden_dims = [100, 100, 100, 100, 100]
    num_train = 1000
    small_data = {
      'X_train': data['X_train'][:num_train],
      'y_train': data['y_train'][:num_train],
      'X_val': data['X_val'],
      'y_val': data['y_val'],
    }
    n_epochs=10
    weight_scale = 2e-2
    batch_sizes = [5,10,50]
    lr = 10**(-3.5)
    solver_bsize = batch_sizes[0]

    print('No normalization: batch size = ',solver_bsize)
    model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=None)
    solver = Solver(model, small_data,
                    num_epochs=n_epochs, batch_size=solver_bsize,
                    update_rule='adam',
                    optim_config={
                      'learning_rate': lr,
                    },
                    verbose=False)
    solver.train()
    
    bn_solvers = []
    for i in range(len(batch_sizes)):
        b_size=batch_sizes[i]
        print('Normalization: batch size = ',b_size)
        bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=normalization_mode)
        bn_solver = Solver(bn_model, small_data,
                        num_epochs=n_epochs, batch_size=b_size,
                        update_rule='adam',
                        optim_config={
                          'learning_rate': lr,
                        },
                        verbose=False)
        bn_solver.train()
        bn_solvers.append(bn_solver)
        
    return bn_solvers, solver, batch_sizes

batch_sizes = [5,10,50]
bn_solvers_bsize, solver_bsize, batch_sizes = run_batchsize_experiments('batchnorm')
  • Results:
plt.subplot(2, 1, 1)
plot_training_history('Training accuracy (Batch Normalization)','Epoch', solver_bsize, bn_solvers_bsize, \
                      lambda x: x.train_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)
plt.subplot(2, 1, 2)
plot_training_history('Validation accuracy (Batch Normalization)','Epoch', solver_bsize, bn_solvers_bsize, \
                      lambda x: x.val_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)

plt.gcf().set_size_inches(15, 10)
plt.show()


  • Q: Describe the results of this experiment. What does this imply about the relationship between batch normalization and batch size? Why is this relationship observed?
  • As the batch size grows the model converges faster, which suggests that BN benefits from larger batches: a larger batch gives more accurate estimates of the mean and variance. At test time the batch size matters much less, because the network normalizes with the running averages accumulated during training.

5. Layer Normalization

Because batch normalization depends on the batch size, it suffers when limited hardware prevents us from choosing a suitably large batch.
Recommended reading 2 proposes Layer Normalization: each feature vector corresponding to a single datapoint is normalized based on the sum of all terms within that feature vector.

The relationship between batch normalization and layer normalization (illustrated with a figure in recommended reading 3) boils down to which axis the statistics are taken over:

  • In short, LN normalizes each sample over its feature dimensions, while BN computes its statistics for each feature over the samples of the batch (see the short numpy sketch after the Q&A below).
    Question:
    Which of these data preprocessing steps is analogous to batch normalization, and which is analogous to layer normalization?
  1. Scaling each image in the dataset, so that the RGB channels for each row of pixels within an image sums up to 1.
  2. Scaling each image in the dataset, so that the RGB channels for all pixels within an image sums up to 1.
  3. Subtracting the mean image of the dataset from each image in the dataset.
  4. Setting all RGB values to either 0 or 1 depending on a given threshold.

Answer:
Options 1 and 2 are analogous to LN; option 3 is analogous to BN.
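The distinction can be written in two numpy lines (a sketch that ignores the learned gamma/beta): BN takes statistics over the batch axis for each feature, LN over the feature axis for each sample.

import numpy as np

x = np.random.randn(4, 3)                                    # (N, D)
x_bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + 1e-5)  # per-feature stats (batch norm)
x_ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + 1e-5)  # per-sample stats (layer norm)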

5.1 LN forward

Since LN normalizes rows while BN normalizes columns, the easiest route is to lightly modify the BN code we have already written to implement LN.

  • In cs231n/layers.py, implement the forward pass for layer normalization in the function layernorm_forward.
def layernorm_forward(x, gamma, beta, ln_param):
    """
    Forward pass for layer normalization.

    During both training and test-time, the incoming data is normalized per data-point,
    before being scaled by gamma and beta parameters identical to that of batch normalization.
    
    Note that in contrast to batch normalization, the behavior during train and test-time for
    layer normalization are identical, and we do not need to keep track of running averages
    of any sort.

    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift paremeter of shape (D,)
    - ln_param: Dictionary with the following keys:
        - eps: Constant for numeric stability

    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    out, cache = None, None
    eps = ln_param.get('eps', 1e-5)
    
    # HINT: this can be done by slightly modifying your training-time         #
    # implementation of  batch normalization, and inserting a line or two of  #
    # well-placed code. In particular, can you think of any matrix            #
    # transformations you could perform, that would enable you to copy over   #
    # the batch norm code and leave it almost unchanged?                      #
    # Transposing the input makes this almost identical to training-time BN
    x = x.T # (D, N )
    _mean = np.mean(x, axis=0) # (N,)
    _var = np.var(x, axis=0) # (N, )
    x_hat = (x - _mean) / (np.sqrt(_var + eps)) #[D, N]
    x_hat = x_hat.T # [N, D]
    out = x_hat * gamma + beta
    cache = (gamma , x_hat, x, _mean, _var, eps)
    return out, cache
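As before, a quick check (a sketch, assuming the layernorm_forward above is in scope): with gamma = 1 and beta = 0, every row of the output should have roughly zero mean and unit standard deviation.

import numpy as np

np.random.seed(0)
x = 5.0 + 4.0 * np.random.randn(4, 6)
out, _ = layernorm_forward(x, np.ones(6), np.zeros(6), {})
print(out.mean(axis=1))   # approximately 0 for every row
print(out.std(axis=1))    # approximately 1 for every row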
5.2 LN backward
  • In cs231n/layers.py, implement the backward pass for layer normalization in the function layernorm_backward.
def layernorm_backward(dout, cache):
    """
    Backward pass for layer normalization.

    For this implementation, you can heavily rely on the work you've done already
    for batch normalization.

    Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from layernorm_forward.

    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    ###########################################################################
    # TODO: Implement the backward pass for layer norm.                       #
    #                                                                         #
    # HINT: this can be done by slightly modifying your training-time         #
    # implementation of batch normalization. The hints to the forward pass    #
    # still apply!                                                            #
    ###########################################################################
    gamma, x_hat, x, _mean, _var ,eps = cache
    N = x_hat.shape[1]  # number of features D, which plays the role of N in the BN formulas
    dgamma = np.sum(dout * x_hat, axis = 0)
    dbeta = np.sum(dout, axis = 0)
    dx_hat = (dout * gamma).T
    # Reuse the BN backward code from above
    a = np.sqrt(_var + eps)
    # Gradient with respect to the variance first
    dvar = np.sum(- 0.5 * (x - _mean) * dx_hat / a ** 3, axis=0)
    dmean = np.sum(- dx_hat / a, axis=0) + dvar * np.sum(-2 * (x - _mean), axis=0) / N
    dx = dx_hat / a + dmean / N + 2 * dvar * (x - _mean) / N

    dx = dx.T
    return dx, dgamma, dbeta
5.3 Fully Connected Nets with Layer Normalization
  • Modify cs231n/classifiers/fc_net.py to add layer normalization to the FullyConnectedNet. When the normalization flag is set to "layernorm" in the constructor, you should insert a layer normalization layer before each ReLU nonlinearity.
  • We just repeat, for LN, what we did for BN in fc_net.py; the process is essentially the same, so here are the results directly:
ln_solvers_bsize, solver_bsize, batch_sizes = run_batchsize_experiments('layernorm')

plt.subplot(2, 1, 1)
plot_training_history('Training accuracy (Layer Normalization)','Epoch', solver_bsize, ln_solvers_bsize, \
                      lambda x: x.train_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)
plt.subplot(2, 1, 2)
plot_training_history('Validation accuracy (Layer Normalization)','Epoch', solver_bsize, ln_solvers_bsize, \
                      lambda x: x.val_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)

plt.gcf().set_size_inches(15, 10)
plt.show()

You should find that the batch size affects LN much less than it affects BN.
Q:
When is layer normalization likely to not work well, and why?

  1. Using it in a very deep network
  2. Having a very small dimension of features
  3. Having a high regularization term

A: Since LN normalizes over the neurons of a single layer, its behavior can be affected by the number of features in that layer, so option 2 matters most.

Recommended Reading
