Batch Norm Backward Pass: Formula Derivation

Copyright notice: https://blog.csdn.net/z0n1l2 https://blog.csdn.net/z0n1l2/article/details/85858338

Input

$X_i=(x_{i0},x_{i1},\dots,x_{i(n-1)})$, $i \in [0,m-1]$; the batch size is $m$ and the feature dimension is $n$.

Output

$Y_i=(y_{i0},y_{i1},\dots,y_{i(n-1)})$, $i \in [0,m-1]$, with the same dimensions as the input $X$.

Forward Computation

  1. Mean
    $\mu = (\mu_0,\mu_1,\dots,\mu_{n-1})$, where
    $\mu_p = \frac{1}{m}\sum_i x_{ip}$

  2. Variance
    $\sigma^2 = (\sigma_0^2,\sigma_1^2,\dots,\sigma_{n-1}^2)$, where
    $\sigma_p^2 = \frac{1}{m}\sum_i (x_{ip}-\mu_p)^2$

  3. Intermediate result
    $\overline x_{ip}=\frac{x_{ip}-\mu_p}{\sqrt{\sigma_p^2+\epsilon}}$

  4. Result
    $y_{ip}=\gamma_p \overline x_{ip}+\beta_p$, where the parameters
    $\gamma = (\gamma_0, \gamma_1,\dots,\gamma_{n-1})$ and
    $\beta = (\beta_0,\beta_1,\dots,\beta_{n-1})$
    are learnable parameters.
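The four forward steps above can be sketched in NumPy (a minimal sketch; the function name and the default `eps` value are illustrative, not from the original post):

```python
import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-5):
    """X: (m, n) batch; gamma, beta: (n,) learnable parameters."""
    mu = X.mean(axis=0)                    # step 1: per-feature mean mu_p
    var = X.var(axis=0)                    # step 2: per-feature variance sigma_p^2
    x_bar = (X - mu) / np.sqrt(var + eps)  # step 3: normalized intermediate
    Y = gamma * x_bar + beta               # step 4: scale and shift
    return Y, (x_bar, var)                 # cache x_bar and var for the backward pass
```

With $\gamma = 1$ and $\beta = 0$, each output feature has zero mean and (up to $\epsilon$) unit variance, which is the whole point of the normalization.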

Backward Computation

Let $O$ denote the scalar loss. By the chain rule (note that $y_{kl}$ depends on $x_{ij}$ only through $\overline x_{kl}$):

$$\frac{\partial O}{\partial x_{ij}}=\sum_{kl}\frac{\partial O}{\partial y_{kl}}\frac{\partial y_{kl}}{\partial x_{ij}}=\sum_{kl}\frac{\partial O}{\partial y_{kl}}\frac{\partial y_{kl}}{\partial \overline x_{kl}}\frac{\partial \overline x_{kl}}{\partial x_{ij}}=\sum_{kl}\frac{\partial O}{\partial y_{kl}}\,\gamma_l\,\frac{\partial \overline x_{kl}}{\partial x_{ij}} \quad (1)$$

By the quotient rule,

$$\frac{\partial \overline x_{kl}}{\partial x_{ij}} = \frac{\frac{\partial (x_{kl}-\mu_l)}{\partial x_{ij}}\sqrt{\sigma_l^2+\epsilon} - \frac{\partial \sqrt{\sigma_l^2+\epsilon}}{\partial x_{ij}}\,(x_{kl}-\mu_l)}{\sigma_l^2+\epsilon} \quad (2)$$

$$\frac{\partial (x_{kl}-\mu_l)}{\partial x_{ij}} = \delta_{ki}\delta_{lj} - \delta_{lj}\frac{1}{m} \quad (3)$$
where
$$\delta_{pq}=\begin{cases} 1 & p=q \\ 0 & \text{otherwise} \end{cases}$$
This Kronecker delta replaces the if-else branching in the derivation; whenever it meets a summation sign, the sum collapses and the delta is eliminated.
$$\frac{\partial \sqrt{\sigma_l^2+\epsilon}}{\partial x_{ij}} = \frac{1}{m}\frac{1}{\sqrt{\sigma_l^2+\epsilon}}\,\delta_{lj}\,(x_{il}-\mu_l) \quad (4)$$
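Equations (3) and (4) are easy to sanity-check numerically. The snippet below (an illustrative check, not from the original post) compares (4) against a central finite difference for one chosen entry $(i,j)$; for $l \neq j$ the derivative is zero because $\sqrt{\sigma_l^2+\epsilon}$ does not depend on column $j$:

```python
import numpy as np

np.random.seed(1)
m, n, eps, h = 4, 3, 1e-5, 1e-6
X = np.random.randn(m, n)

def sqrt_var(X):
    # sqrt(sigma_l^2 + eps) for every feature l, using the 1/m variance
    return np.sqrt(X.var(axis=0) + eps)

i, j = 2, 1
Xp, Xm = X.copy(), X.copy()
Xp[i, j] += h
Xm[i, j] -= h
numeric = (sqrt_var(Xp) - sqrt_var(Xm)) / (2 * h)  # entry l of the numeric gradient

# eq. (4): nonzero only at l = j because of delta_lj
mu = X.mean(axis=0)
analytic = np.zeros(n)
analytic[j] = (X[i, j] - mu[j]) / (m * sqrt_var(X)[j])

assert np.allclose(numeric, analytic, atol=1e-4)
```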
Substituting (3) and (4) into (2) gives
$$\frac{\partial \overline x_{kl}}{\partial x_{ij}} = \delta_{lj}\,\frac{(\delta_{ki}-\frac{1}{m})\sqrt{\sigma_l^2+\epsilon} - \frac{1}{m\sqrt{\sigma_l^2+\epsilon}}(x_{kl}-\mu_l)(x_{il}-\mu_l)}{\sigma_l^2+\epsilon}$$
Substituting this into equation (1), the $\delta_{lj}$ collapses the sum over $l$ to $l=j$, and we obtain
$$\frac{\partial O}{\partial x_{ij}} = \frac{\gamma_j}{m\sqrt{\sigma_j^2+\epsilon}\,(\sigma_j^2+\epsilon)}\left((\sigma_j^2+\epsilon)\left(m\frac{\partial O}{\partial y_{ij}} - \sum_k\frac{\partial O}{\partial y_{kj}}\right) - (x_{ij}-\mu_j)\sum_k (x_{kj}-\mu_j)\frac{\partial O}{\partial y_{kj}}\right) \quad (\mathrm{done})$$
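Dividing through by $\sigma_j^2+\epsilon$, the final formula reduces to the compact form $\frac{\partial O}{\partial x_{ij}} = \frac{\gamma_j}{m\sqrt{\sigma_j^2+\epsilon}}\bigl(m\frac{\partial O}{\partial y_{ij}} - \sum_k\frac{\partial O}{\partial y_{kj}} - \overline x_{ij}\sum_k \overline x_{kj}\frac{\partial O}{\partial y_{kj}}\bigr)$, which is straightforward to vectorize. A minimal sketch (function names are illustrative; the forward pass is repeated so the snippet is self-contained):

```python
import numpy as np

def bn_forward(X, gamma, beta, eps=1e-5):
    mu, var = X.mean(axis=0), X.var(axis=0)
    x_bar = (X - mu) / np.sqrt(var + eps)
    return gamma * x_bar + beta, (x_bar, var, gamma, eps)

def bn_backward(dY, cache):
    """dY[i, j] = dO/dy_ij; returns dO/dx_ij per the (done) formula."""
    x_bar, var, gamma, eps = cache
    m = dY.shape[0]
    return (gamma / (m * np.sqrt(var + eps))) * (
        m * dY                              # m * dO/dy_ij
        - dY.sum(axis=0)                    # - sum_k dO/dy_kj
        - x_bar * (dY * x_bar).sum(axis=0)  # - x_bar_ij * sum_k x_bar_kj * dO/dy_kj
    )
```

A gradient check against central finite differences of a scalar loss (e.g. $O=\sum_{ij} W_{ij} y_{ij}$ for a fixed $W$, so that $\frac{\partial O}{\partial y_{ij}} = W_{ij}$) confirms the formula.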
