The problem
After the inputs are standardized, the following can happen: a layer in the middle of the network may have learned that its feature data is best distributed toward the two saturated ends of the sigmoid activation function, but standardization forces the input back to mean 0 and standard deviation 1. That transforms the data into the near-linear middle region of the sigmoid, destroying the feature distribution that the intermediate layer had learned.
Solution
Transform and reconstruct: introduce learnable parameters γ and β, initialized to 1 and 0 respectively, with every neuron in every layer having its own γ and β. The forward pass uses each neuron's own γ and β to scale and shift its normalized value. The backward pass computes the gradients of all γ and all β, i.e. dγ and dβ. After backpropagation finishes, each γ and β is updated by an optimizer (such as SGD or Adam) using its corresponding gradient dγ or dβ.
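The scale-and-shift reconstruction can be sketched in a few lines of NumPy (the toy data, feature means/scales, and eps value below are illustrative assumptions; γ and β start at their identity initialization of 1 and 0):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy batch: 4 samples, 3 features, each feature with its own mean and scale.
x = rng.normal(loc=[1.0, -2.0, 5.0], scale=[0.5, 2.0, 1.0], size=(4, 3))

# Step 1: standardize each feature to zero mean, unit variance.
x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + 1e-5)

# Step 2: scale and shift with learnable per-feature gamma and beta.
# Initialized to 1 and 0, the transform starts out as the identity on x_hat;
# training can then move gamma and beta to restore any distribution the
# layer finds useful, including one back in the sigmoid's saturated regions.
gamma = np.ones(3)
beta = np.zeros(3)
out = gamma * x_hat + beta
```

With this initialization, `out` has (approximately) zero mean and unit standard deviation per feature, exactly the standardized values.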
Forward propagation
import numpy as np

def batchnorm_forward(x, gamma, beta, bn_param):
    r"""
    Args:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - bn_param: Dictionary with the following keys:
      - mode: 'train' or 'test'
      - eps: Constant for numeric stability (default 1e-5)
      - momentum: Constant for the running mean/variance (default 0.9)
      - running_mean: Array of shape (D,)
      - running_var: Array of shape (D,)

    Returns:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    mode = bn_param['mode']
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)

    N, D = x.shape
    running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))

    out, cache = None, None
    if mode == 'train':
        sample_mean = np.mean(x, axis=0)
        sample_var = np.var(x, axis=0)
        out_ = (x - sample_mean) / np.sqrt(sample_var + eps)

        # Exponential moving averages of the batch statistics, used at test time.
        running_mean = momentum * running_mean + (1 - momentum) * sample_mean
        running_var = momentum * running_var + (1 - momentum) * sample_var

        out = gamma * out_ + beta
        cache = (out_, x, sample_var, sample_mean, eps, gamma, beta)
    elif mode == 'test':
        # Normalize with the running statistics instead of the batch statistics.
        scale = gamma / np.sqrt(running_var + eps)
        out = x * scale + (beta - running_mean * scale)
    else:
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

    # Store the updated running statistics back into bn_param.
    bn_param['running_mean'] = running_mean
    bn_param['running_var'] = running_var

    return out, cache
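A small sketch of why the test branch works: with momentum 0.9, the exponential moving averages of the per-batch statistics converge toward the population mean and variance, so test-time normalization does not depend on the test batch. The distribution parameters and step count below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
momentum = 0.9
running_mean = np.zeros(2)
running_var = np.zeros(2)

# Simulate many training steps: each batch is drawn from a fixed population
# (mean 3.0, std 2.0 per feature); update the running statistics exactly as
# in the train branch above.
for _ in range(500):
    batch = rng.normal(3.0, 2.0, size=(32, 2))
    running_mean = momentum * running_mean + (1 - momentum) * batch.mean(axis=0)
    running_var = momentum * running_var + (1 - momentum) * batch.var(axis=0)

# running_mean approaches 3.0 and running_var approaches 4.0 (= 2.0 ** 2).
```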
Backpropagation
import numpy as np

def batchnorm_backward(dout, cache):
    r"""
    Args:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Tuple of intermediates from batchnorm_forward.

    Returns:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    out_, x, sample_var, sample_mean, eps, gamma, beta = cache
    N = x.shape[0]

    # Gradient through the scale: out = gamma * out_ + beta, so d(out)/d(out_) = gamma.
    dout_ = gamma * dout
    # Gradient with respect to the sample variance.
    dvar = np.sum(dout_ * (x - sample_mean) * -0.5 * (sample_var + eps) ** -1.5, axis=0)
    # Direct path of x through the normalization, plus the path through the variance.
    dx_ = 1 / np.sqrt(sample_var + eps)
    dvar_ = 2 * (x - sample_mean) / N
    di = dout_ * dx_ + dvar * dvar_
    # Path through the sample mean (the dvar contribution sums to zero here,
    # since sum(x - sample_mean) = 0).
    dmean = -1 * np.sum(di, axis=0)
    dmean_ = np.ones_like(x) / N
    dx = di + dmean * dmean_
    # Gradients of the learnable parameters.
    dgamma = np.sum(dout * out_, axis=0)
    dbeta = np.sum(dout, axis=0)

    return dx, dgamma, dbeta
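As a sanity check on the parameter gradients, dgamma can be compared against a central-difference numerical gradient. The sketch below inlines a compact train-mode forward pass and uses the scalar loss sum(out * dout); the array shapes, seed, and step size h are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, eps = 8, 3, 1e-5
x = rng.normal(size=(N, D))
gamma = rng.normal(size=D)
beta = rng.normal(size=D)
dout = rng.normal(size=(N, D))

def bn(x, gamma, beta):
    # Compact train-mode batchnorm forward.
    mu, var = x.mean(axis=0), x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Analytic gradients, as computed in batchnorm_backward.
x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
dgamma = np.sum(dout * x_hat, axis=0)
dbeta = np.sum(dout, axis=0)

# Numerical gradient of the loss L = sum(out * dout) with respect to gamma,
# one coordinate at a time via central differences.
h = 1e-5
num_dgamma = np.zeros(D)
for j in range(D):
    g_plus, g_minus = gamma.copy(), gamma.copy()
    g_plus[j] += h
    g_minus[j] -= h
    num_dgamma[j] = (np.sum(bn(x, g_plus, beta) * dout)
                     - np.sum(bn(x, g_minus, beta) * dout)) / (2 * h)
```

The same central-difference pattern can be applied to beta and to x to check dbeta and dx.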