The BN layer parameters γ and β: forward and backward propagation


The problem

After the input is standardized, the following situation can arise: a layer in the middle of the network may have learned features whose values lie toward the two saturated ends of the sigmoid activation function. Standardization forces the input mean to 0 and the standard deviation to 1, which pushes the data into the middle, near-linear region of the sigmoid and thereby destroys the feature distribution that the layer has learned.
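
To make the saturation argument concrete, here is a minimal numpy sketch (the shifted mean of 4 and the sample size are illustrative assumptions, not from the original text): pre-activations sitting in a sigmoid tail have a tiny average derivative, while the same data after standardization lands in the middle region where the derivative is large.

import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
# Hypothetical pre-activations a middle layer has pushed into a
# saturated tail of the sigmoid (mean 4, std 1; illustrative values)
z = rng.normal(loc=4.0, scale=1.0, size=10000)
z_std = (z - z.mean()) / z.std()  # standardization: mean 0, std 1

# sigmoid'(z) = s(z) * (1 - s(z))
grad = sigmoid(z) * (1 - sigmoid(z))
grad_std = sigmoid(z_std) * (1 - sigmoid(z_std))
print(grad.mean())      # small: inputs sit in a saturated tail
print(grad_std.mean())  # much larger: the middle, near-linear region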

Solution

Transform and reconstruct: introduce the learnable parameters γ and β. All γ are initialized to 1 and all β to 0, and each neuron in each layer has its own γ and β. During forward propagation, each neuron's own γ and β are applied. During backpropagation, the gradients of all γ and all β are computed, i.e. dγ and dβ. After backpropagation finishes, each γ and β is updated by an optimizer (such as SGD or Adam) using its corresponding gradient dγ or dβ, as in the sketch below.
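
A minimal sketch of this reconstruction step, assuming a layer with D neurons and a plain SGD update (the layer width, learning rate, and stand-in gradient values are illustrative assumptions):

import numpy as np

D = 4     # number of neurons in the layer (illustrative)
lr = 0.1  # SGD learning rate (illustrative)

# Each neuron owns one learnable pair: gamma starts at 1 and beta at 0,
# so y = gamma * x_hat + beta is initially the identity on the
# standardized input x_hat.
gamma = np.ones(D)
beta = np.zeros(D)

# Once backpropagation has produced dgamma and dbeta (stand-in values
# here), an SGD step updates each parameter with its own gradient:
dgamma = np.full(D, 0.01)
dbeta = np.full(D, -0.02)
gamma -= lr * dgamma
beta -= lr * dbeta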

Forward propagation

import numpy as np


def batchnorm_forward(x, gamma, beta, bn_param):
    r"""
    args:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift paremeter of shape (D,)
    - bn_param: Dictionary with the following keys:
    Returns:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    mode = bn_param['mode']
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)
    N, D = x.shape
    running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))
    out, cache = None, None
    
    if mode == 'train':
        # Per-feature mean and variance over the current mini-batch
        sample_mean = np.mean(x, axis=0)
        sample_var = np.var(x, axis=0)
        # Standardize: zero mean, unit variance per feature
        out_ = (x - sample_mean) / np.sqrt(sample_var + eps)
        # Exponential moving averages, used later at test time
        running_mean = momentum * running_mean + (1 - momentum) * sample_mean
        running_var = momentum * running_var + (1 - momentum) * sample_var
        # Scale and shift with the learnable parameters
        out = gamma * out_ + beta
        cache = (out_, x, sample_var, sample_mean, eps, gamma, beta)

    elif mode == 'test':
        # Normalize with the running statistics, folded into one affine op
        scale = gamma / np.sqrt(running_var + eps)
        out = x * scale + (beta - running_mean * scale)

    bn_param['running_mean'] = running_mean
    bn_param['running_var'] = running_var

    return out, cache
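
A short usage sketch of batchnorm_forward as defined above (the shapes and random inputs are illustrative). In train mode the output is standardized per feature, and the running statistics accumulate inside bn_param; switching mode to 'test' then reuses them (in practice they are accumulated over many training steps before inference):

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=3.0, size=(8, 3))  # N=8 samples, D=3 features
gamma, beta = np.ones(3), np.zeros(3)
bn_param = {'mode': 'train'}

out, cache = batchnorm_forward(x, gamma, beta, bn_param)
print(out.mean(axis=0))  # ~0 per feature
print(out.std(axis=0))   # ~1 per feature

# The dict now carries running_mean / running_var for inference:
bn_param['mode'] = 'test'
out_test, _ = batchnorm_forward(x, gamma, beta, bn_param)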

Backpropagation

import numpy as np


def batchnorm_backward(dout, cache):
    r"""
    args:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from batchnorm_forward.
    Returns:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    dx, dgamma, dbeta = None, None, None
    out_, x, sample_var, sample_mean, eps, gamma, beta = cache
    N = x.shape[0]
    dout_ = gamma * dout  # gradient w.r.t. the standardized input x_hat
    # Gradient w.r.t. the batch variance
    dvar = np.sum(dout_ * (x - sample_mean) * -0.5 * (sample_var + eps) ** -1.5, axis=0)
    dx_ = 1 / np.sqrt(sample_var + eps)  # d x_hat / d x along the direct path
    dvar_ = 2 * (x - sample_mean) / N    # d var / d x

    # Direct path plus the path through the variance
    di = dout_ * dx_ + dvar * dvar_
    # Gradient w.r.t. the batch mean; the variance contribution vanishes
    # because (x - sample_mean) sums to zero over the batch
    dmean = -1 * np.sum(di, axis=0)
    dmean_ = np.ones_like(x) / N         # d mean / d x

    dx = di + dmean * dmean_
    dgamma = np.sum(dout * out_, axis=0)
    dbeta = np.sum(dout, axis=0)

    return dx, dgamma, dbeta
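
A quick way to sanity-check batchnorm_backward is a numerical gradient check. This sketch compares the analytic dgamma against a central-difference estimate for the scalar loss sum(out * dout); all inputs are random illustrative values:

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(5, 4))
gamma = rng.normal(size=4)
beta = rng.normal(size=4)
dout = rng.normal(size=(5, 4))

out, cache = batchnorm_forward(x, gamma, beta, {'mode': 'train'})
dx, dgamma, dbeta = batchnorm_backward(dout, cache)

# Central-difference estimate of d(sum(out * dout)) / d gamma[0]
h = 1e-5
g = gamma.copy()
g[0] += h
out_plus, _ = batchnorm_forward(x, g, beta, {'mode': 'train'})
g[0] -= 2 * h
out_minus, _ = batchnorm_forward(x, g, beta, {'mode': 'train'})
numeric = np.sum((out_plus - out_minus) * dout) / (2 * h)
print(numeric, dgamma[0])  # the two values should agree closely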

