CS231n-2017 Assignment 2: NN, BP, SGD, BN, CNN

Part I. Fully Connected Neural Networks

In the previous assignment a two-layer neural network was already implemented, but that implementation was not modular: for example, the loss() function computed both the loss and the gradients of every parameter. This coupling means that extending the depth of the network requires extensive changes. Moreover, the layers of a neural network all share a similar structure, so the naive implementation contains duplicated code. In this assignment we build a modular architecture instead: each functional layer is wrapped as an object, such as an affine layer or a ReLU layer. A layer's forward function takes the data coming from the previous layer together with this layer's parameters, produces this layer's output, and caches whatever is needed to compute gradients later. A layer's backward function takes the gradients flowing back from the next layer together with the cached values, and computes the gradients of this layer's parameters and of its input.

1. Affine layer: forward pass

The forward pass is straightforward; compared with the previous assignment, the only addition is caching the intermediate results needed later when computing this layer's gradients.
The affine_forward() function in layers.py:

def affine_forward(x, w, b):
    out = None
    
    # TODO: Implement the affine forward pass. 
    batch_size = x.shape[0]
    x_oneline = x.reshape(batch_size, -1)
    out = x_oneline.dot(w) + b

    cache = (x, w, b)
    return out, cache
2. Affine layer: backward pass

Compared with the previous assignment, the logic is simply pulled out and wrapped in its own function.
The affine_backward() function in layers.py:

def affine_backward(dout, cache):
    x, w, b = cache
    x_shape = x.shape
    batch_size = x_shape[0]
    sample_shape = x_shape[1:]

    # TODO: Implement the affine backward pass.
    x_oneline = x.reshape(batch_size, -1)
    dx, dw, db = None, None, None
    dx = dout.dot(w.T).reshape(batch_size, *sample_shape)
    dw = x_oneline.T.dot(dout)
    db = np.sum(dout, axis=0)
   
    return dx, dw, db
3. ReLU layer: forward pass

The relu_forward() function in layers.py:

def relu_forward(x):
    out = None

    # TODO: Implement the ReLU forward pass.
    out = np.maximum(0, x)

    cache = x
    return out, cache
4. ReLU layer: backward pass

In the backward pass of the computational graph, the ReLU layer acts like a gate: gradients flow through only where the input was positive.
The relu_backward() function in layers.py:

def relu_backward(dout, cache):
    dx, x = None, cache
    # TODO: Implement the ReLU backward pass.
    dx = dout*(x>0)
    return dx
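The TwoLayerNet below is built from the affine_relu_forward()/affine_relu_backward() "sandwich" wrappers. In the assignment these are provided in layer_utils.py and simply chain the layers above; roughly:

def affine_relu_forward(x, w, b):
    # affine followed by ReLU; keep both caches for the backward pass
    a, fc_cache = affine_forward(x, w, b)
    out, relu_cache = relu_forward(a)
    cache = (fc_cache, relu_cache)
    return out, cache

def affine_relu_backward(dout, cache):
    # undo the chain in reverse order
    fc_cache, relu_cache = cache
    da = relu_backward(dout, relu_cache)
    dx, dw, db = affine_backward(da, fc_cache)
    return dx, dw, db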
5. Reimplementing the two-layer network with layer objects

The TwoLayerNet class in fc_net.py:

class TwoLayerNet(object):
    def __init__(self, input_dim=3*32*32, hidden_dim=100, num_classes=10,
                 weight_scale=1e-3, reg=0.0):
        self.params = {}
        self.reg = reg
        # TODO: Initialize the weights and biases of the two-layer net.
        self.params["W1"] = weight_scale * np.random.randn(input_dim, hidden_dim)
        self.params["b1"] = np.zeros(hidden_dim)
        self.params["W2"] = weight_scale * np.random.randn(hidden_dim, num_classes)
        self.params["b2"] = np.zeros(num_classes)

    def loss(self, X, y=None):
        
        scores = None
        # TODO: Implement the forward pass for the two-layer net.
        layer1_relu_out, layer1_relu_cache = affine_relu_forward(X, self.params["W1"], self.params["b1"])
        layer2_out, layer2_cache = affine_forward(layer1_relu_out, self.params["W2"], self.params["b2"])
        
        scores = layer2_out

        # If y is None then we are in test mode so just return scores
        if y is None:
            return scores

        loss, grads = 0, {}

        # TODO: Implement the backward pass for the two-layer net. 
        loss, dloss = softmax_loss(layer2_out, y)
        loss += 0.5 * self.reg * (np.sum(np.square(self.params["W1"])) + np.sum(np.square(self.params["W2"])))
        dlayer2_out, dW2, db2 = affine_backward(dloss, layer2_cache)
        _, dW1, db1 = affine_relu_backward(dlayer2_out, layer1_relu_cache)

        grads["W1"] = dW1 + self.reg * self.params["W1"]
        grads["b1"] = db1
        grads["W2"] = dW2 + self.reg * self.params["W2"]
        grads["b2"] = db2

        return loss, grads
6. Wrapping up the training loop

The following settings reach an accuracy of roughly 53% on the test set:

# TODO: Use a Solver instance to train a TwoLayerNet.
model = TwoLayerNet(reg=0.2)
solver = Solver(model, data,
                update_rule='sgd',
                optim_config={
                  'learning_rate': 1e-3,
                },
                lr_decay=0.95,
                num_epochs=20, batch_size=500,
                print_every=500)
solver.train()
7. A fully connected network with an arbitrary number of layers

The number of layers is determined by the hidden_dims argument.
The FullyConnectedNet class in fc_net.py:

class FullyConnectedNet(object):
    def __init__(self, hidden_dims, input_dim=3*32*32, num_classes=10,
                 dropout=0, use_batchnorm=False, reg=0.0,
                 weight_scale=1e-2, dtype=np.float32, seed=None):

        self.use_batchnorm = use_batchnorm
        self.use_dropout = dropout > 0
        self.reg = reg
        self.num_layers = 1 + len(hidden_dims)
        self.dtype = dtype
        self.params = {}

        # TODO: Initialize the parameters of the network.
        param_dims = [input_dim] + hidden_dims + [num_classes]
        for indx in range(1, len(param_dims)):
            self.params["W"+str(indx)] = weight_scale * np.random.randn(param_dims[indx-1], param_dims[indx])
            self.params["b"+str(indx)] = np.zeros(param_dims[indx])
        if self.use_batchnorm:
            for indx in range(1, len(param_dims) - 1):
                self.params["gamma"+str(indx)] = np.ones(param_dims[indx])
                self.params["beta" +str(indx)] = np.zeros(param_dims[indx])
        
        self.dropout_param = {}
        if self.use_dropout:
            self.dropout_param = {'mode': 'train', 'p': dropout}
            if seed is not None:
                self.dropout_param['seed'] = seed

        self.bn_params = []
        if self.use_batchnorm:
            self.bn_params = [{'mode': 'train'} for i in range(self.num_layers - 1)]

        # Cast all parameters to the correct datatype
        for k, v in self.params.items():
            self.params[k] = v.astype(dtype)


    def loss(self, X, y=None):
        
        X = X.astype(self.dtype)
        mode = 'test' if y is None else 'train'

        # Set train/test mode for batchnorm params and dropout param since they
        # behave differently during training and testing.
        if self.use_dropout:
            self.dropout_param['mode'] = mode
        if self.use_batchnorm:
            for bn_param in self.bn_params:
                bn_param['mode'] = mode

        # TODO: Implement the forward pass for the fully-connected net
        layer_relu_out = X
        layer_cache_dict = {}
        if self.use_batchnorm:
            for i in range(1, self.num_layers):
                layer_relu_out, layer_relu_cache = affine_norm_relu_forward(layer_relu_out, self.params["W"+str(i)],\
                                                    self.params["b"+str(i)], self.params["gamma"+str(i)], 
                                                    self.params["beta"+str(i)], self.bn_params[i-1])
                if self.use_dropout:
                    layer_relu_out, dropout_cache = dropout_forward(layer_relu_out, self.dropout_param)
                    layer_cache_dict["dropout"+str(i)] = dropout_cache
                layer_cache_dict[i] = layer_relu_cache
        else:
            for i in range(1, self.num_layers):
                layer_relu_out, layer_relu_cache = affine_relu_forward(layer_relu_out, self.params["W"+str(i)], self.params["b"+str(i)])

                if self.use_dropout:
                    layer_relu_out, dropout_cache = dropout_forward(layer_relu_out, self.dropout_param)
                    layer_cache_dict["dropout"+str(i)] = dropout_cache
                layer_cache_dict[i] = layer_relu_cache
        
        final_layer_out, final_layer_cache = affine_forward(layer_relu_out, self.params["W" + str(self.num_layers)], self.params["b"+str(self.num_layers)])
        layer_cache_dict[self.num_layers] = final_layer_cache
        
        scores = final_layer_out
        
        if mode == 'test':
            return scores

        loss, grads = 0.0, {}
        # TODO: Implement the backward pass for the fully-connected net.
        loss, dloss = softmax_loss(final_layer_out, y)
        for i in range(self.num_layers):
            loss += 0.5 * self.reg * (np.sum(np.square(self.params["W"+str(i+1)])))

        dx, final_dW, final_db = affine_backward(dloss, layer_cache_dict[self.num_layers])
        grads["W"+str(self.num_layers)] = final_dW + self.reg * self.params["W"+str(self.num_layers)]
        grads["b"+str(self.num_layers)] = final_db
        if self.use_batchnorm:
            for i in range(self.num_layers-1, 0, -1):
                if self.use_dropout:
                    dx = dropout_backward(dx, layer_cache_dict["dropout"+str(i)])
                dx, dw, db, dgamma, dbeta = affine_norm_relu_backward(dx, layer_cache_dict[i])
                grads["W"+str(i)] = dw + self.reg * self.params["W"+str(i)]
                grads["b"+str(i)] = db
                grads["gamma"+str(i)] = dgamma
                grads["beta" +str(i)] = dbeta
        else:
            for i in range(self.num_layers-1, 0, -1):
                if self.use_dropout:
                    dx = dropout_backward(dx, layer_cache_dict["dropout"+str(i)])
                dx, dw, db = affine_relu_backward(dx, layer_cache_dict[i])
                grads["W"+str(i)] = dw + self.reg * self.params["W"+str(i)]
                grads["b"+str(i)] = db

        return loss, grads
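FullyConnectedNet relies on affine_norm_relu_forward()/affine_norm_relu_backward(), helper wrappers that are not shown in this post. A minimal sketch of what they presumably look like, chaining affine, batch normalization, and ReLU next to the other sandwich layers:

def affine_norm_relu_forward(x, w, b, gamma, beta, bn_param):
    # affine -> batch norm -> ReLU, caching each stage
    a, fc_cache = affine_forward(x, w, b)
    a_norm, bn_cache = batchnorm_forward(a, gamma, beta, bn_param)
    out, relu_cache = relu_forward(a_norm)
    return out, (fc_cache, bn_cache, relu_cache)

def affine_norm_relu_backward(dout, cache):
    # reverse order: ReLU -> batch norm -> affine
    fc_cache, bn_cache, relu_cache = cache
    da_norm = relu_backward(dout, relu_cache)
    da, dgamma, dbeta = batchnorm_backward_alt(da_norm, bn_cache)
    dx, dw, db = affine_backward(da, fc_cache)
    return dx, dw, db, dgamma, dbeta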

8. Improvements on vanilla SGD: SGD+Momentum, RMSProp, Adam
  • SGD+Momentum

$$v_{t+1} = \rho\, v_t - \nabla f(x_t), \qquad x_{t+1} = x_t + \alpha\, v_{t+1}$$
Unrolling the recursion (with $v_0 = 0$), the update direction at step $t+1$ is:

$$v_{t+1} = -\nabla f(x_t) - \rho\,\nabla f(x_{t-1}) - \cdots - \rho^{t}\,\nabla f(x_0)$$
The sgd_momentum() function in optim.py:

def sgd_momentum(w, dw, config=None):
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('momentum', 0.9)
    v = config.get('velocity', np.zeros_like(w))

    next_w = None
    # TODO: Implement the momentum update formula.
    v = config["momentum"] * v - config["learning_rate"] * dw
    next_w = w + v
    config['velocity'] = v

    return next_w, config
  • RMSProp
    $$\mathrm{norm2}_{t+1} = \rho\cdot \mathrm{norm2}_{t} + (1-\rho)\,\nabla f(x_t)\odot\nabla f(x_t), \qquad x_{t+1} = x_t - \alpha\,\frac{\nabla f(x_t)}{\sqrt{\mathrm{norm2}_{t+1}+\epsilon}}$$

The rmsprop() function in optim.py:

def rmsprop(x, dx, config=None):
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('decay_rate', 0.99)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('cache', np.zeros_like(x))

    next_x = None
    # TODO: Implement the RMSprop update formula.
    grad_squared = config["decay_rate"] * config["cache"] + (1-config["decay_rate"]) * dx * dx
    next_x = x - config["learning_rate"] * dx / (np.sqrt(grad_squared + config["epsilon"]))
    config["cache"] = grad_squared
    return next_x, config
  • Adam
    The adam() function in optim.py:
def adam(x, dx, config=None):

    if config is None: config = {}
    config.setdefault('learning_rate', 1e-3)
    config.setdefault('beta1', 0.9)
    config.setdefault('beta2', 0.999)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('m', np.zeros_like(x))
    config.setdefault('v', np.zeros_like(x))
    config.setdefault('t', 1)

    next_x = None
    # TODO: Implement the Adam update formula.
    first_moment = config["beta1"] * config["m"] + (1 - config["beta1"]) * dx
    second_moment = config["beta2"] * config["v"] + (1 - config["beta2"]) * dx * dx
    
    first_unbias = first_moment / (1 - np.power(config["beta1"], config["t"]))
    second_unbias = second_moment / (1 - np.power(config["beta2"], config["t"]))
    next_x = x - config["learning_rate"] * first_unbias / (np.sqrt(second_unbias + config["epsilon"]))

    config["m"] = first_moment
    config["v"] = second_moment

    return next_x, config

Part II. Batch Normalization

1. BN layer: forward pass

An exponentially weighted moving average keeps track of the mean and variance over the whole training set; these running statistics are what get used at test time.
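Concretely, after every training batch the running statistics are updated with momentum $m$ (0.9 by default), matching the code below:

$$\mu_{\mathrm{run}} \leftarrow m\,\mu_{\mathrm{run}} + (1-m)\,\mu_{\mathrm{batch}}, \qquad \sigma^2_{\mathrm{run}} \leftarrow m\,\sigma^2_{\mathrm{run}} + (1-m)\,\sigma^2_{\mathrm{batch}}$$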
The batchnorm_forward() function in layers.py:

def batchnorm_forward(x, gamma, beta, bn_param):
    mode = bn_param['mode']
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)

    N, D = x.shape
    running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))

    out, cache = None, None
    if mode == 'train':
        # TODO: Implement the training-time forward pass for batch norm.
        batch_mean = np.mean(x, axis=0)
        batch_var = np.var(x, axis=0)
        running_mean = momentum * running_mean + (1 - momentum) * batch_mean
        running_var = momentum * running_var + (1 - momentum) * batch_var
        out = (x - batch_mean)/np.sqrt(batch_var + eps)
        cache = {"batch_var": batch_var, "x_norm": out, "gamma": gamma, "eps": eps}
        out = out*gamma + beta
        
    elif mode == 'test':
        # TODO: Implement the test-time forward pass for batch normalization.
        out = (x - running_mean)/np.sqrt(running_var + eps)
        out = out*gamma + beta
        
    else:
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

    # Store the updated running means back into bn_param
    bn_param['running_mean'] = running_mean
    bn_param['running_var'] = running_var

    return out, cache
2. BN layer: backward pass

Let one sample be $\vec{x} = [x_1,\ x_2,\ \cdots,\ x_m]$ and write its normalized version as $\hat{\vec{x}}$. The scale and shift parameters $\vec{\gamma}$ and $\vec{\beta}$ have the same dimension as $\vec{x}$, so the output of the BN layer can be written as:

$$\vec{y} = \vec{\gamma}\odot\hat{X} + \vec{\beta} = \begin{bmatrix} -\ \hat{\vec{x}}_1 \odot \vec{\gamma} + \vec{\beta}\ - \\ -\ \hat{\vec{x}}_2 \odot \vec{\gamma} + \vec{\beta}\ - \\ \vdots \\ -\ \hat{\vec{x}}_n \odot \vec{\gamma} + \vec{\beta}\ - \end{bmatrix}$$

Let $\delta\vec{y}$ denote the gradient flowing back with respect to $\vec{y}$. The backward pass then has to express $\delta\vec{\gamma}$, $\delta\vec{\beta}$ and $\delta X$ in terms of $\delta\vec{y}$.

Two things to keep in mind. First, $\delta\vec{x}$ has the same shape as $\vec{x}$ (here $\vec{x}$ stands for any of $\vec{\gamma}$, $\vec{\beta}$, $X$). Second, an element $\delta x_i$ of $\delta\vec{x}$ accumulates the derivatives of every element of $\vec{y}$ with respect to $x_i$.

Consider the derivative of an arbitrary row $\vec{y}_i$ of $\vec{y}$ with respect to $\vec{\gamma}$ and $\vec{\beta}$:

$$\frac{d\,\vec{y}_i}{d\vec{\gamma}} = [-\ \hat{\vec{x}}_i\ -], \qquad \frac{d\,\vec{y}_i}{d\vec{\beta}} = [-\ 1\ -]$$

Summing over the rows therefore gives $\delta\vec{\gamma}$ and $\delta\vec{\beta}$:

$$\delta\vec{\gamma} = \sum_i \delta\vec{y}_i\odot\hat{\vec{x}}_i, \qquad \delta\vec{\beta} = \sum_i \delta\vec{y}_i$$

Now for $\delta X$. For brevity, consider the derivative of an arbitrary row $\vec{y}_i$ with respect to an arbitrary row $\vec{x}_j$ of $X$. Since every element of $\vec{y}_i$ depends on the element of $\vec{x}_j$ in the same column in exactly the same way, and BN transforms each column (feature) independently, it suffices to work with the scalar forms $y_i$ and $x_j$:

$$y_i = \gamma\,\frac{x_i - \bar{x}}{\sqrt{\mathrm{var}(x)}}+\beta,\qquad \bar{x} = \frac{1}{n}\sum x_i,\qquad \mathrm{var}(x)=\frac{1}{n}\sum(x_i-\bar{x})^2=\frac{1}{n}\sum x_i^2 - \bar{x}^2$$

Hence, writing primes for derivatives with respect to $x_j$:

$$\frac{dy_i}{dx_j} = \gamma\cdot\frac{\sqrt{\mathrm{var}(x)}\,(x_i-\bar{x})'-(x_i-\bar{x})\,(\sqrt{\mathrm{var}(x)})'}{\mathrm{var}(x)} = \gamma\left[\frac{\delta_{ij} - 1/n}{\sqrt{\mathrm{var}(x)}}-\frac{x_i-\bar{x}}{\mathrm{var}(x)}\cdot\frac{x_j-\bar{x}}{n\sqrt{\mathrm{var}(x)}}\right] = \frac{\gamma}{\sqrt{\mathrm{var}(x)}}\left[\delta_{ij} - \frac{1}{n} - \frac{\hat{x}_i\,\hat{x}_j}{n}\right]$$

Finally, sum over the index $i$ of $y_i$, weighting by the incoming gradient:

$$\delta x_j = \sum_i \delta y_i\,\frac{dy_i}{dx_j} = \frac{\gamma}{\sqrt{\mathrm{var}(x)}}\left[\delta y_j - \frac{1}{n}\sum_i \delta y_i - \frac{\hat{x}_j}{n}\sum_i \delta y_i\,\hat{x}_i\right]$$

which is exactly what the simplified backward pass below computes.
The batchnorm_backward_alt() function in layers.py:

def batchnorm_backward_alt(dout, cache):
    dx, dgamma, dbeta = None, None, None
    # TODO: Implement the backward pass for batch normalization.
    batch_var, x_norm, gamma, eps = cache["batch_var"], cache["x_norm"], cache["gamma"], cache["eps"]
    dgamma = np.sum(x_norm*dout, axis=0)
    dbeta = np.sum(dout, axis=0)
    dx = (dout - np.mean(dout, axis=0) - np.mean(x_norm*dout, axis=0)*x_norm)/np.sqrt(batch_var + eps)*gamma

    return dx, dgamma, dbeta
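A quick way to sanity-check these gradients, mirroring what the BatchNormalization notebook does (eval_numerical_gradient_array is provided in cs231n/gradient_check.py; the sizes and seed here are just an example):

import numpy as np
from cs231n.gradient_check import eval_numerical_gradient_array
from cs231n.layers import batchnorm_forward, batchnorm_backward_alt

np.random.seed(231)
N, D = 4, 5
x = 5 * np.random.randn(N, D) + 12
gamma, beta = np.random.randn(D), np.random.randn(D)
dout = np.random.randn(N, D)
bn_param = {'mode': 'train'}

# numerical gradient of the forward pass w.r.t. x
fx = lambda x: batchnorm_forward(x, gamma, beta, bn_param)[0]
dx_num = eval_numerical_gradient_array(fx, x, dout)

# analytic gradient from the simplified backward pass
_, cache = batchnorm_forward(x, gamma, beta, bn_param)
dx, dgamma, dbeta = batchnorm_backward_alt(dout, cache)
print('max |dx - dx_num| =', np.max(np.abs(dx - dx_num)))  # should be on the order of 1e-9 or smaller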

Part III. Dropout

1. Dropout layer: forward pass

During training, input units are zeroed out with probability $1-p$; at test time the layer must keep the expected magnitude of the data (its statistics, the sum over activations) consistent with what the network saw during training.
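The implementation below uses inverted dropout, as the assignment's comments ask: surviving units are rescaled by $1/p$ during training, so the expected activation is already correct and the test-time pass can return $x$ unchanged:

$$\mathbb{E}[\mathrm{out}] = p\cdot\frac{x}{p} + (1-p)\cdot 0 = x$$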

The dropout_forward() function in layers.py:

def dropout_forward(x, dropout_param):

    p, mode = dropout_param['p'], dropout_param['mode']
    if 'seed' in dropout_param:
        np.random.seed(dropout_param['seed'])
    mask = None
    out = None

    if mode == 'train':
        # TODO: Implement training phase forward pass for inverted dropout.
        # Keep each unit with probability p and rescale the survivors by 1/p
        # (inverted dropout), so nothing needs to be rescaled at test time.
        mask = (np.random.rand(*x.shape) < p) / p
        out = x*mask
    elif mode == 'test':
        # TODO: Implement the test phase forward pass for inverted dropout.
        out = x

    cache = (dropout_param, mask)
    out = out.astype(x.dtype, copy=False)

    return out, cache
2. Dropout layer: backward pass

Like ReLU, the dropout layer acts as a gate in the backward pass.
The dropout_backward() function in layers.py:

def dropout_backward(dout, cache):
    dropout_param, mask = cache
    mode = dropout_param['mode']

    dx = None
    if mode == 'train':
        # TODO: Implement training phase backward pass for inverted dropout
        dx = dout*mask
    elif mode == 'test':
        dx = dout
    return dx

Part IV. Convolutional Neural Networks

1. Convolution layer: forward pass

Build the (possibly zero-padded) input according to the pad parameter; the convolution itself is then just nested loops.
The conv_forward_naive() function in layers.py:

def conv_forward_naive(x, w, b, conv_param):
    out = None
    # TODO: Implement the convolutional forward pass.
    # Hint: you can use the function np.pad for padding.
    N, C, H, W = x.shape
    F, _, HH, WW = w.shape
    n_pad, n_stride = conv_param["pad"], conv_param["stride"]
    if n_pad > 0:
        data = np.zeros((N, C, H+2*n_pad, W+2*n_pad))
        data[:,:,n_pad:H+n_pad, n_pad:W+n_pad] = x
    else:
        data = x
    N, C, H, W = data.shape
    rH, rW = 1 + (H - HH) // n_stride, 1 + (W - WW) // n_stride
    out = np.zeros((N, F, rH, rW))
    for iH in range(0, rH):
        for iW in range(0, rW):
            for iF in range(0, F):
                for iN in range(0, N):
                    out[iN, iF, iH, iW] = np.sum(data[iN, :, iH*n_stride:iH*n_stride+HH, iW*n_stride:iW*n_stride+WW]*w[iF, :, :, :]) + b[iF]

    cache = (x, w, b, conv_param)
    return out, cache
2. Convolution layer: backward pass

A convolution that produces a single output value is really just a fully connected layer over one input window. From that point of view, the backward pass simply applies the affine backward rules once per output value and accumulates the results.
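Written out for a single output element (the index notation here is mine, chosen to match the loop variables in the code: $S$ is the stride, $HH\times WW$ the filter size):

$$\mathrm{out}[n,f,i,j] = \sum_{c,p,q} x_{\mathrm{pad}}[n,c,\,iS+p,\,jS+q]\; w[f,c,p,q] + b[f]$$

so each output element contributes the usual affine gradients to its own window:

$$\delta b[f] \;{+}{=}\; \sum_{n}\mathrm{dout}[n,f,i,j], \qquad \delta w[f] \;{+}{=}\; \sum_{n}\mathrm{dout}[n,f,i,j]\cdot x_{\mathrm{pad}}[n,:,\,iS:iS+HH,\,jS:jS+WW]$$

$$\delta x_{\mathrm{pad}}[n,:,\,iS:iS+HH,\,jS:jS+WW] \;{+}{=}\; \mathrm{dout}[n,f,i,j]\cdot w[f]$$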
The conv_backward_naive() function in layers.py:

def conv_backward_naive(dout, cache):
    dx, dw, db = None, None, None
    # TODO: Implement the convolutional backward pass.
    x, w, b, conv_param = cache
    dw = np.zeros_like(w)
    N, C, H, W = x.shape
    F, _, HH, WW = w.shape
    n_pad, n_stride = conv_param["pad"], conv_param["stride"]
    if n_pad > 0:
        data = np.zeros((N, C, H+2*n_pad, W+2*n_pad))
        data[:,:,n_pad:H+n_pad, n_pad:W+n_pad] = x
    else:
        data = x

    _, _, rH, rW = dout.shape
    db = np.zeros_like(b)
    dx = np.zeros_like(data)
    for iF in range(F):
        for iH in range(rH):
            for iW in range(rW):
                dw[iF, :, :, :] += np.sum(data[:,:,iH*n_stride:iH*n_stride+HH, iW*n_stride:iW*n_stride+WW] * dout[:, iF, iH, iW].reshape((N,1,1,1)), axis = 0)
                db[iF] += np.sum(dout[:, iF, iH, iW], axis=0)
    
    for iF in range(F):
        for iH in range(rH):
            for iW in range(rW):
                for iN in range(N):
                    dx[iN, :, iH*n_stride:iH*n_stride+HH, iW*n_stride:iW*n_stride+WW] += dout[iN, iF, iH, iW]*w[iF,:,:,:]

    dx = dx[:, :, n_pad:n_pad+H, n_pad:n_pad+W]

    return dx, dw, db
3. Max pooling layer: forward pass

The max_pool_forward_naive() function in layers.py:

def max_pool_forward_naive(x, pool_param):
    out = None
    # TODO: Implement the max pooling forward pass
    pool_height, pool_width, n_stride = pool_param["pool_height"], pool_param["pool_width"], pool_param["stride"]
    N, F, H, W = x.shape
    rH, rW = 1 + (H - pool_height)//n_stride, 1 + (W - pool_width) // n_stride
    out = np.zeros((N, F, rH, rW))
    for iH in range(rH):
        for iW in range(rW):
            for iF in range(F):
                for iN in range(N):
                    out[iN, iF, iH, iW] = x[iN, iF, iH*n_stride:iH*n_stride+pool_height, iW*n_stride:iW*n_stride+pool_width].max()

    cache = (x, pool_param)
    return out, cache
4. Max pooling layer: backward pass

The max_pool_backward_naive() function in layers.py:

def max_pool_backward_naive(dout, cache):
    dx = None

    # TODO: Implement the max pooling backward pass
    x, pool_param = cache
    pool_height, pool_width, n_stride = pool_param["pool_height"], pool_param["pool_width"], pool_param["stride"]
    dx = np.zeros_like(x)
    N, F, H, W = x.shape
    _, _, rH, rW = dout.shape
    for iH in range(rH):
        for iW in range(rW):
            for iF in range(F):
                for iN in range(N):
                    window = x[iN, iF, iH*n_stride:iH*n_stride+pool_height, iW*n_stride:iW*n_stride+pool_width]
                    # accumulate with += so overlapping pooling windows are handled correctly
                    dx[iN, iF, iH*n_stride:iH*n_stride+pool_height, iW*n_stride:iW*n_stride+pool_width] += dout[iN, iF, iH, iW] * (window == window.max())

    return dx
5. A three-layer convolutional network

The ThreeLayerConvNet class in cnn.py:

class ThreeLayerConvNet(object):

    def __init__(self, input_dim=(3, 32, 32), num_filters=32, filter_size=7,
                 hidden_dim=100, num_classes=10, weight_scale=1e-3, reg=0.0,
                 dtype=np.float32):
        # Initialize a new network.
        self.params = {}
        self.reg = reg
        self.dtype = dtype

        # TODO: Initialize weights and biases for the three-layer convolutional network.
        channel_no, img_height, img_width = input_dim
        self.params["W1"] = weight_scale * np.random.randn(num_filters, channel_no, filter_size, filter_size)
        self.params["b1"] = np.zeros(num_filters)
        # The conv layer preserves the spatial size (stride 1, pad = (filter_size-1)//2),
        # and the 2x2 max pool halves both height and width.
        n_features = num_filters * (img_height // 2) * (img_width // 2)
        self.params["W2"] = weight_scale * np.random.randn(n_features, hidden_dim)
        self.params["b2"] = np.zeros(hidden_dim)
        self.params["W3"] = weight_scale * np.random.randn(hidden_dim, num_classes)
        self.params["b3"] = np.zeros(num_classes)

        for k, v in self.params.items():
            self.params[k] = v.astype(dtype)


    def loss(self, X, y=None):
        W1, b1 = self.params['W1'], self.params['b1']
        W2, b2 = self.params['W2'], self.params['b2']
        W3, b3 = self.params['W3'], self.params['b3']

        # pass conv_param to the forward pass for the convolutional layer
        filter_size = W1.shape[2]
        conv_param = {'stride': 1, 'pad': (filter_size - 1) // 2}

        # pass pool_param to the forward pass for the max-pooling layer
        pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}

        scores = None
        # TODO: Implement the forward pass for the three-layer convolutional net

        # conv - relu - 2x2 max pool - affine - relu - affine - softmax
        conv_relu_pool_out, conv_relu_pool_cache = conv_relu_pool_forward(X, W1, b1, conv_param, pool_param)
        layer2_affine_relu_out, layer2_affine_relu_cache = affine_relu_forward(conv_relu_pool_out, W2, b2)
        final_affine_out, final_affine_cache = affine_forward(layer2_affine_relu_out, W3, b3)
        scores = final_affine_out

        if y is None:
            return scores

        loss, grads = 0, {}
        # TODO: Implement the backward pass for the three-layer convolutional net
        loss, dloss = softmax_loss(final_affine_out, y)
        loss += 0.5 * self.reg * (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3)))
        dfinal_affine_out, dW3, db3 = affine_backward(dloss, final_affine_cache)
        dlayer2_affine_relu_out, dW2, db2 = affine_relu_backward(dfinal_affine_out, layer2_affine_relu_cache)
        _, dW1, db1 = conv_relu_pool_backward(dlayer2_affine_relu_out, conv_relu_pool_cache)

        grads["W1"] = dW1 + self.reg * W1
        grads["b1"] = db1
        grads["W2"] = dW2 + self.reg * W2
        grads["b2"] = db2
        grads["W3"] = dW3 + self.reg * W3
        grads["b3"] = db3

        return loss, grads
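The conv_relu_pool_forward()/conv_relu_pool_backward() sandwich used above is provided in layer_utils.py (in the assignment it is wired to the fast conv/pool implementations); a sketch of the same wrapper built from the naive layers in this post:

def conv_relu_pool_forward(x, w, b, conv_param, pool_param):
    # conv -> ReLU -> 2x2 max pool, caching each stage
    a, conv_cache = conv_forward_naive(x, w, b, conv_param)
    s, relu_cache = relu_forward(a)
    out, pool_cache = max_pool_forward_naive(s, pool_param)
    return out, (conv_cache, relu_cache, pool_cache)

def conv_relu_pool_backward(dout, cache):
    # reverse order: pool -> ReLU -> conv
    conv_cache, relu_cache, pool_cache = cache
    ds = max_pool_backward_naive(dout, pool_cache)
    da = relu_backward(ds, relu_cache)
    dx, dw, db = conv_backward_naive(da, conv_cache)
    return dx, dw, db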
6. Training the three-layer ConvNet

The following settings reach roughly 50% accuracy on the validation set:

model = ThreeLayerConvNet(weight_scale=0.001, hidden_dim=500, reg=0.001)

solver = Solver(model, data, num_epochs=1, batch_size=50,
                update_rule='adam', optim_config={ 'learning_rate': 1e-3, },
                verbose=True, print_every=20)
solver.train()
7. Spatial batch normalization: forward and backward passes

Spatial BN reuses vanilla batch normalization by reshaping (N, C, H, W) to (N*H*W, C), so that statistics are computed per channel over both the batch and the spatial locations, and then reshaping back.
The spatial_batchnorm_forward() function in layers.py:

def spatial_batchnorm_forward(x, gamma, beta, bn_param):
    out, cache = None, None

    # TODO: Implement the forward pass for spatial batch normalization.
    N, C, H, W = x.shape
    x = x.transpose(0, 2, 3, 1).reshape(N * H * W, C)
    out, cache = batchnorm_forward(x, gamma, beta, bn_param)
    out = out.reshape(N, H, W, C).transpose(0, 3, 1, 2)

    return out, cache

The spatial_batchnorm_backward() function in layers.py:

def spatial_batchnorm_backward(dout, cache):
    dx, dgamma, dbeta = None, None, None

    # TODO: Implement the backward pass for spatial batch normalization.
    N, C, H, W = dout.shape
    dout = dout.transpose(0, 2, 3, 1).reshape(N * H * W, C)
    dx, dgamma, dbeta = batchnorm_backward_alt(dout, cache)
    dx = dx.reshape(N, H, W, C).transpose(0, 3, 1, 2)

    return dx, dgamma, dbeta

Source: blog.csdn.net/suredied/article/details/83833000