[Hands-on Deep Learning v2 Li Mu] Study Notes 09: Numerical Stability, Model Initialization, Activation Function

Review of the previous article: dropout

1. Numerical stability

1.1 Introduction

Numerical stability is an important topic in deep learning: when a network has many layers, the values involved in training can easily become unstable.

Gradients of Neural Networks

  • Consider the following neural network with $d$ layers:
    $$\vec h^t = f_t(\vec h^{t-1}) \qquad \text{and} \qquad y = \ell \circ f_d \circ \cdots \circ f_1(\vec x)$$
  • Compute the gradient of the loss $\ell$ with respect to the parameter $W^t$:
    $$\frac{\partial \ell}{\partial W^t} = \frac{\partial \ell}{\partial \vec h^d} \underbrace{\frac{\partial \vec h^d}{\partial \vec h^{d-1}} \cdots \frac{\partial \vec h^{t+1}}{\partial \vec h^t}}_{d-t \text{ matrix multiplications}} \frac{\partial \vec h^t}{\partial W^t}$$
    Since the derivative of a vector with respect to a vector is a matrix, this chain requires $d-t$ matrix multiplications, and too many matrix multiplications can cause numerical stability problems.

1.2 Two common problems

  • The two most common numerical stability problems are exploding gradients and vanishing gradients, for example:
    $$1.5^{100} \approx 4 \times 10^{17} \qquad\qquad 0.8^{100} \approx 2 \times 10^{-10}$$

  • For example: MLP
    Suppose the following MLP (bias omitted for simplicity). The output of layer $t$ equals the input of layer $t$ multiplied by the weight matrix and passed through the activation function:
    $$f_t(\vec h^{t-1}) = \sigma(W^t \vec h^{t-1}) \qquad \sigma \text{ is the activation function}$$
    The derivative of layer $t$'s output with respect to its input is the (elementwise) derivative of the activation function multiplied by the transpose of the weight matrix:
    $$\frac{\partial \vec h^t}{\partial \vec h^{t-1}} = \mathrm{diag}\big(\sigma'(W^t \vec h^{t-1})\big)\,(W^t)^T \qquad \sigma' \text{ is the derivative of } \sigma$$
    Therefore the derivative of the layer-$d$ output with respect to the layer-$t$ output is a product of such derivative terms and transposed weight matrices (a numerical sketch of this product follows below):
    $$\prod_{i=t}^{d-1} \frac{\partial \vec h^{i+1}}{\partial \vec h^i} = \prod_{i=t}^{d-1} \mathrm{diag}\big(\sigma'(W^i \vec h^{i-1})\big)\,(W^i)^T$$
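A minimal numerical sketch (assuming PyTorch; the depth, width, and weight scales are illustrative choices, not from the lecture) of how a long chain of matrix products blows up or dies out:

```python
import torch

torch.manual_seed(0)

def chain_norm(scale, depth=50, width=100):
    """Multiply `depth` random Gaussian matrices and return the norm of the product."""
    M = torch.eye(width)
    for _ in range(depth):
        W = torch.randn(width, width) * scale
        M = M @ W
    return M.norm().item()

# Slightly too-large weights -> the product explodes; too-small weights -> it vanishes.
print(chain_norm(scale=0.2))   # grows to an enormous value
print(chain_norm(scale=0.01))  # shrinks towards 0
```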

1.3 Exploding gradients

1.3.1 Causes

  • Suppose we use ReLU as the activation function:
    $$\sigma(x) = \max(0, x) \qquad \text{and} \qquad \sigma'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$$
  • Then some elements of $\prod_{i=t}^{d-1} \frac{\partial \vec h^{i+1}}{\partial \vec h^i} = \prod_{i=t}^{d-1} \mathrm{diag}\big(\sigma'(W^i \vec h^{i-1})\big)\,(W^i)^T$ come directly from $\prod_{i=t}^{d-1} (W^i)^T$, because the diagonal factors only zero out some entries.
    • If $d-t$ is large, the values of this product become very large (see the sketch after this list).
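A hedged sketch (assuming PyTorch; the depth, width, and the N(0, 0.3²) initialization are arbitrary illustrative choices) showing that in a deep ReLU MLP the gradient of the bottom layer can grow by orders of magnitude as the network gets deeper:

```python
import torch
from torch import nn

torch.manual_seed(0)

def first_layer_grad_norm(depth, width=100, std=0.3):
    """Build a deep ReLU MLP with N(0, std^2) weights and return the
    gradient norm of the first layer's weights for a toy input."""
    layers = []
    for _ in range(depth):
        linear = nn.Linear(width, width, bias=False)
        nn.init.normal_(linear.weight, mean=0.0, std=std)
        layers += [linear, nn.ReLU()]
    net = nn.Sequential(*layers)

    x = torch.randn(1, width)
    net(x).sum().backward()            # toy "loss": just the sum of the outputs
    return net[0].weight.grad.norm().item()

print(first_layer_grad_norm(depth=5))   # moderate gradient norm
print(first_layer_grad_norm(depth=30))  # many orders of magnitude larger
```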

1.3.2 Problems caused

  • Values go out of range (become infinity).
    • This is especially serious for 16-bit floating point numbers, whose representable range is roughly $(6 \times 10^{-5}, 6 \times 10^{4})$; a short float16 sketch follows this list.
  • Training becomes sensitive to the learning rate.
    • If the learning rate is too large, the parameter values become large, which in turn produces even larger gradients.
    • If the learning rate is too small, training makes no progress.
    • We may therefore need to keep adjusting the learning rate during training.
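A tiny sketch (assuming PyTorch; the numbers are arbitrary) of how easily float16 overflows to infinity:

```python
import torch

x = torch.tensor(300.0, dtype=torch.float16)
print(x * x)      # 90000 > 65504, the float16 maximum -> inf
print(x * x * x)  # stays inf once it has overflowed
```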

1.4 Vanishing gradients

1.4.1 Causes

  • Suppose we use the sigmoid function as the activation function:
    $$\sigma(x) = \frac{1}{1 + e^{-x}} \qquad \sigma'(x) = \sigma(x)\big(1 - \sigma(x)\big)$$
  • When the absolute value of the input is large, the derivative is small (close to 0); see the sketch after this list.
  • The elements of $\prod_{i=t}^{d-1} \frac{\partial \vec h^{i+1}}{\partial \vec h^i} = \prod_{i=t}^{d-1} \mathrm{diag}\big(\sigma'(W^i \vec h^{i-1})\big)\,(W^i)^T$ are then products of $d-t$ small values.
    • For example, $0.8^{100} \approx 2 \times 10^{-10}$.
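A short sketch (assuming PyTorch) of how quickly the sigmoid derivative $\sigma'(x) = \sigma(x)(1-\sigma(x))$ decays as $|x|$ grows, which is what drives the product above towards 0:

```python
import torch

x = torch.tensor([0.0, 2.0, 5.0, 10.0])
s = torch.sigmoid(x)
print(s * (1 - s))  # roughly 0.25, 0.105, 0.0066, 4.5e-5 -- already tiny at x = 10
```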

1.4.2 Problems caused

  • The gradient value becomes 0.
    • This is especially true for 16-bit floating point numbers.
  • Training is not progressing.
    • Regardless of how the learning rate is chosen.
  • The effect is especially severe for the bottom layers.
    • Only the layers near the top train well.
    • This prevents the network from being made deeper.

1.5 Summary

  • When gradient values become too large or too small, numerical problems arise.
  • This happens most often in deep models, because the gradient there is a product of many terms.

2. Make training more stable

2.1 Overview

  • Goal: keep the gradient values within a reasonable range, for example $[10^{-6}, 10^{3}]$.
  • Methods:
    • Turn multiplication into addition, for example: ResNet, LSTM
    • Normalization
      • Gradient normalization, gradient clipping
    • Reasonable weight initialization and activation functions

This article mainly explains how to properly initialize the weights and choose an appropriate activation function.

2.2 Weight initialization

2.2.1 Goal: keep the variance of each layer constant

  • Treat the output and gradient of each layer as random variables.
  • Keep their mean and variance the same.
    • Forward: $\mathbb{E}[h_i^t] = 0$ and $\mathrm{Var}[h_i^t] = a \quad \forall i, t$, where $a$ is a constant.
    • Backward: $\mathbb{E}\Big[\frac{\partial \ell}{\partial h_i^t}\Big] = 0$ and $\mathrm{Var}\Big[\frac{\partial \ell}{\partial h_i^t}\Big] = b \quad \forall i, t$, where $b$ is a constant.

2.2.2 Problem Analysis

  • Randomize the initial parameters within a range of reasonable values.
  • Training is more prone to numerical instability at the beginning:
    • The loss surface can be complex far away from the optimal solution.
    • The surface near the optimal solution is relatively flat.
  • Initializing with $\mathcal{N}(0, 0.01)$ may be fine for small networks, but it is not guaranteed to work for deep neural networks (see the sketch after this list).
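A hedged sketch (assuming PyTorch; the depth, width, batch size, and the use of purely linear layers are illustrative choices mirroring the no-activation-function assumption used below) comparing how the per-layer output variance behaves under a small normal initialization (std 0.01) versus Xavier:

```python
import torch
from torch import nn

torch.manual_seed(0)

def layer_variances(init_fn, depth=20, width=100):
    """Push standard-normal inputs through `depth` linear layers and
    record the variance of each layer's output."""
    h = torch.randn(1000, width)
    variances = []
    for _ in range(depth):
        linear = nn.Linear(width, width, bias=False)
        init_fn(linear.weight)
        h = linear(h)
        variances.append(h.var().item())
    return variances

small = layer_variances(lambda w: nn.init.normal_(w, std=0.01))
xavier = layer_variances(nn.init.xavier_uniform_)
print(small[0], small[-1])    # variance collapses towards 0 with depth
print(xavier[0], xavier[-1])  # variance stays roughly constant
```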

How can we satisfy the goal above, i.e. keep both the mean and the variance constant across layers?
Let's first look at an example: MLP

  • Assume:
    • The weights $w^t_{i,j}$ are i.i.d. (independent and identically distributed), with $\mathbb{E}[w^t_{i,j}] = 0$ and $\mathrm{Var}[w^t_{i,j}] = \gamma_t$.
    • $h^{t-1}_i$ is independent of $w^t_{i,j}$.
  • Assume there is no activation function, so $\vec h^t = W^t \vec h^{t-1}$, where $W^t \in \mathbb{R}^{n_t \times n_{t-1}}$.
    • Forward mean:
      $$\mathbb{E}[h^t_i] = \mathbb{E}\Big[\sum_j w^t_{i,j} h^{t-1}_j\Big] = \sum_j \mathbb{E}[w^t_{i,j}]\,\mathbb{E}[h^{t-1}_j] = 0$$
    • Forward variance:
      $$\begin{aligned} \mathrm{Var}[h^t_i] &= \mathbb{E}[(h^t_i)^2] - (\mathbb{E}[h^t_i])^2 = \mathbb{E}\Big[\Big(\sum_j w^t_{i,j} h^{t-1}_j\Big)^2\Big] \\ &= \mathbb{E}\Big[\sum_j (w^t_{i,j})^2 (h^{t-1}_j)^2 + \underbrace{\sum_{j \neq k} w^t_{i,j} w^t_{i,k} h^{t-1}_j h^{t-1}_k}_{\text{i.i.d., so this term's expectation is } 0}\Big] \\ &= \sum_j \mathbb{E}\big[(w^t_{i,j})^2\big]\,\mathbb{E}\big[(h^{t-1}_j)^2\big] = \sum_j \mathrm{Var}[w^t_{i,j}]\,\mathrm{Var}[h^{t-1}_j] \\ &= n_{t-1}\,\gamma_t\,\mathrm{Var}[h^{t-1}_j] \end{aligned}$$
      So to keep the forward variances the same across layers we need $n_{t-1}\gamma_t = 1$.
    • Backward mean (similar to the forward case):
      $$\frac{\partial \ell}{\partial \vec h^{t-1}} = \frac{\partial \ell}{\partial \vec h^t} W^t \qquad \Rightarrow \qquad \Big(\frac{\partial \ell}{\partial \vec h^{t-1}}\Big)^T = (W^t)^T \Big(\frac{\partial \ell}{\partial \vec h^t}\Big)^T \qquad \Rightarrow \qquad \mathbb{E}\Big[\frac{\partial \ell}{\partial h^{t-1}_i}\Big] = 0$$
    • Backward variance:
      $$\mathrm{Var}\Big[\frac{\partial \ell}{\partial h^{t-1}_i}\Big] = n_t\,\gamma_t\,\mathrm{Var}\Big[\frac{\partial \ell}{\partial h^t_j}\Big]$$
      As in the forward case, to keep the backward variances the same we need $n_t\gamma_t = 1$.
    • Both conditions can be handled with Xavier initialization; see "2.2.3 Xavier initialization" below.

2.2.3 Xavier initialization

  • It is difficult to satisfy $n_{t-1}\gamma_t = 1$ and $n_t\gamma_t = 1$ at the same time (unless $n_{t-1} = n_t$).
  • Xavier initialization compromises by requiring $\gamma_t(n_{t-1} + n_t)/2 = 1$, i.e. $\gamma_t = 2/(n_{t-1} + n_t)$.
    • Normal distribution: $\mathcal{N}\big(0, \sqrt{2/(n_{t-1}+n_t)}\big)$, i.e. standard deviation $\sqrt{2/(n_{t-1}+n_t)}$.
    • Uniform distribution: $\mathcal{U}\big(-\sqrt{6/(n_{t-1}+n_t)},\; \sqrt{6/(n_{t-1}+n_t)}\big)$
      • The variance of $\mathcal{U}[-a, a]$ is $a^2/3$, so setting $a^2/3 = \gamma_t$ gives $a = \sqrt{6/(n_{t-1}+n_t)}$.
  • Xavier thus adapts to the shape of the weight matrix, in particular to the layer widths $n_{t-1}$ and $n_t$; a PyTorch sketch follows this list.
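In PyTorch, these two variants are available as nn.init.xavier_uniform_ and nn.init.xavier_normal_. A minimal sketch (the layer sizes are arbitrary):

```python
import torch
from torch import nn

layer = nn.Linear(in_features=256, out_features=128)

# Uniform variant: U(-a, a) with a = sqrt(6 / (n_in + n_out))
nn.init.xavier_uniform_(layer.weight)

# Normal variant: N(0, sigma^2) with sigma^2 = 2 / (n_in + n_out)
# nn.init.xavier_normal_(layer.weight)

nn.init.zeros_(layer.bias)

# The empirical weight variance should be close to gamma_t = 2 / (256 + 128)
print(layer.weight.var().item(), 2 / (256 + 128))
```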

2.3 Activation function

2.3.1 Problem Analysis

Continuing with the assumptions from "2.2.2 Problem Analysis" above:

  • Assume a linear activation function $\sigma(x) = \alpha x + \beta$, with
    $$\vec h' = W^t \vec h^{t-1} \qquad \text{and} \qquad \vec h^t = \sigma(\vec h')$$
    • Forward mean: $\mathbb{E}[h^t_i] = \mathbb{E}[\alpha h_i' + \beta] = \beta$, so to make the mean 0 we need $\beta = 0$.
    • Forward variance:
      $$\begin{aligned} \mathrm{Var}[h^t_i] &= \mathbb{E}[(h^t_i)^2] - (\mathbb{E}[h^t_i])^2 \\ &= \mathbb{E}[(\alpha h_i' + \beta)^2] - \beta^2 \\ &= \mathbb{E}[\alpha^2 (h_i')^2 + 2\alpha\beta h_i' + \beta^2] - \beta^2 \\ &= \alpha^2\,\mathrm{Var}[h_i'] \end{aligned}$$
      To keep the variances the same we need $\alpha = 1$.
    • Backward mean:
      $$\frac{\partial \ell}{\partial \vec h'} = \frac{\partial \ell}{\partial \vec h^t}(W^t)^T \qquad \text{and} \qquad \frac{\partial \ell}{\partial \vec h^{t-1}} = \alpha \frac{\partial \ell}{\partial \vec h'}$$
      Making $\mathbb{E}\Big[\frac{\partial \ell}{\partial h^{t-1}_i}\Big] = 0$ again requires $\beta = 0$.
    • Backward variance:
      $$\mathrm{Var}\Big[\frac{\partial \ell}{\partial h^{t-1}_i}\Big] = \alpha^2\,\mathrm{Var}\Big[\frac{\partial \ell}{\partial h_j'}\Big]$$
      To keep the variances the same we need $\alpha = 1$.
    • This means the activation function would have to be the identity, $\sigma(x) = x$.

2.3.2 Check common activation functions

  • Using Taylor expansions around zero:
    $$\mathrm{sigmoid}(x) = \frac{1}{2} + \frac{x}{4} - \frac{x^3}{48} + O(x^5)$$
    $$\tanh(x) = 0 + x - \frac{x^3}{3} + O(x^5)$$
    $$\mathrm{relu}(x) = 0 + x \qquad \text{for } x \geq 0$$
    We find that near zero, $\tanh(x)$ and $\mathrm{relu}(x)$ can be approximated by the linear function $y = x$, whereas $\mathrm{sigmoid}(x)$ cannot.
  • Therefore the sigmoid can be rescaled as
    $$4 \times \mathrm{sigmoid}(x) - 2$$
    The adjusted sigmoid can also be approximated by $y = x$ near zero (a numerical check follows below).
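A small sketch (assuming PyTorch) that checks this claim numerically: near zero, tanh, relu, and the rescaled sigmoid $4 \times \mathrm{sigmoid}(x) - 2$ all track $y = x$, while the raw sigmoid does not:

```python
import torch

x = torch.linspace(-0.1, 0.1, 5)
print(x)                         # the reference line y = x
print(torch.tanh(x))             # approximately x near zero
print(torch.relu(x))             # x for x >= 0, 0 otherwise
print(4 * torch.sigmoid(x) - 2)  # approximately x near zero after rescaling
print(torch.sigmoid(x))          # approximately 0.5 + x/4, not y = x
```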

2.5 Summary

  • Reasonable choices of initial weight values and activation functions can improve numerical stability.
