Optimization algorithms (Week 2)

  1. Batch vs. mini-batch gradient descent
    (1) The training set can be split into 5000 mini-batches of 1000 examples each, and one gradient step is taken per mini-batch (a partitioning sketch follows item 3 below).
    for t = 1, ..., 5000:
    Forward prop on $X^{\{t\}}$:
    $Z^{[1]} = W^{[1]} X^{\{t\}} + b^{[1]}$
    $A^{[1]} = g^{[1]}(Z^{[1]})$
    ...
    $A^{[L]} = g^{[L]}(Z^{[L]})$
    (2) Compute the cost (for a mini-batch of 1000 examples):
    $J^{\{t\}} = \frac{1}{1000} \sum_{i=1}^{1000} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2 \cdot 1000} \sum_{l} \|W^{[l]}\|_F^2$
    (3) Backprop to compute the gradients of $J^{\{t\}}$ (using $X^{\{t\}}, Y^{\{t\}}$), then update every layer $l$:
    $W^{[l]} := W^{[l]} - \alpha \, dW^{[l]}$
    $b^{[l]} := b^{[l]} - \alpha \, db^{[l]}$
  2. Choosing mini-batch size
    (1) If mini-batch size = m, the size of the training set: this is batch gradient descent. (Each iteration takes a very long time when the training set is large.)
    (2) If mini-batch size = 1: this is stochastic gradient descent; every example is its own mini-batch. (Very noisy, and it ends up oscillating around the minimum instead of converging.)
    (3) In practice, choose something in between (a mini-batch size that is neither too big nor too small).
  3. Some guidelines about choosing your mini-batch size:
    (1) If the training set is small ($m \le 2000$): use batch gradient descent.
    (2) Typical mini-batch sizes: 64, 128, 256, 512 (powers of 2 are said to make the code run faster).
    (3) Make sure each mini-batch $X^{\{t\}}, Y^{\{t\}}$ fits in CPU/GPU memory (see the sketch below).
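
A minimal NumPy sketch of the splitting described in items 1-3, assuming the column-per-example convention of these notes ($X$ of shape $(n_x, m)$, $Y$ of shape $(1, m)$); the function name `make_mini_batches` and the example shapes are illustrative assumptions, not code from the course.

```python
import numpy as np

def make_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the m training examples and split them into mini-batches.

    Assumes X has shape (n_x, m) and Y has shape (1, m),
    i.e. one column per example.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)                 # random shuffle of the columns
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]

    mini_batches = []
    for start in range(0, m, batch_size):
        end = start + batch_size              # the last batch may be smaller
        mini_batches.append((X_shuf[:, start:end], Y_shuf[:, start:end]))
    return mini_batches

# Example usage (shapes are made up):
# X = np.random.randn(100, 3200); Y = np.random.randint(0, 2, size=(1, 3200))
# batches = make_mini_batches(X, Y, batch_size=64)   # 50 mini-batches of 64
```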
  4. Exponentially weighted moving averages
    $V_t = \beta V_{t-1} + (1 - \beta)\,\theta_t$
    $\beta = 0.9$: averages the temperature over roughly $\frac{1}{1-\beta} = 10$ days.
    $\beta = 0.98$: averages the temperature over roughly $\frac{1}{1-\beta} = 50$ days.
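
As a quick illustration of $V_t = \beta V_{t-1} + (1-\beta)\,\theta_t$, here is a minimal sketch; the `temps` series in the comment is made up.

```python
import numpy as np

def ewma(thetas, beta=0.9):
    """V_t = beta * V_{t-1} + (1 - beta) * theta_t, starting from V_0 = 0."""
    v = 0.0
    out = []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return np.array(out)

# beta = 0.9  -> averages over roughly 1 / (1 - 0.9)  = 10 days
# beta = 0.98 -> averages over roughly 1 / (1 - 0.98) = 50 days
# temps = np.array([4.0, 9.0, 6.0, 7.0, 12.0])
# ewma(temps, beta=0.9)
```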
  5. Bias correction in exponentially weighted averages
    $V_t = \beta V_{t-1} + (1 - \beta)\,\theta_t$
    Bias-corrected estimate: $\dfrac{V_t}{1 - \beta^t}$
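
The same loop with the correction $V_t / (1 - \beta^t)$ applied, which removes the low initial estimates caused by starting from $V_0 = 0$ (a sketch under the same assumptions as above):

```python
def ewma_bias_corrected(thetas, beta=0.9):
    """Exponentially weighted average with bias correction V_t / (1 - beta^t)."""
    v = 0.0
    out = []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t))   # bias-corrected estimate
    return out
```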
  6. Gradient descent with momentum:
    (1) Compute $dW$, $db$ on the current mini-batch, then take exponentially weighted averages of the gradients (the same form $V_\theta = \beta V_\theta + (1-\beta)\,\theta_t$ as above, applied to $dW$ and $db$):
    $V_{dW} = \beta V_{dW} + (1 - \beta)\,dW$
    $V_{db} = \beta V_{db} + (1 - \beta)\,db$
    (2) Update $W$, $b$:
    $W := W - \alpha V_{dW}$
    $b := b - \alpha V_{db}$
    This shrinks the oscillations of gradient descent in the vertical direction and speeds up progress in the horizontal direction, so gradient descent converges faster.
    (3) Implementation details (see the sketch below):
    Initialize $V_{dW} = 0$, $V_{db} = 0$.
    On iteration t:
    Compute $dW$, $db$ on the current mini-batch.
    $V_{dW} = \beta V_{dW} + (1 - \beta)\,dW$
    $V_{db} = \beta V_{db} + (1 - \beta)\,db$
    $W := W - \alpha V_{dW}$, $b := b - \alpha V_{db}$
    Hyperparameters: $\alpha$, $\beta$ (commonly $\beta = 0.9$).
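
A minimal sketch of one momentum update matching the implementation details above; the caller is assumed to supply $dW$, $db$ from backprop, and the defaults are the common choices mentioned in the notes.

```python
def momentum_step(W, b, dW, db, vdW, vdb, alpha=0.01, beta=0.9):
    """One gradient-descent-with-momentum update.

    vdW and vdb start at zero arrays with the same shapes as W and b.
    """
    vdW = beta * vdW + (1 - beta) * dW   # exponentially weighted average of dW
    vdb = beta * vdb + (1 - beta) * db   # exponentially weighted average of db
    W = W - alpha * vdW
    b = b - alpha * vdb
    return W, b, vdW, vdb
```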
    (4) RMSprop (root mean square prop)
    On iteration t:
    Compute $dW$, $db$ on the current mini-batch.
    $S_{dW} = \beta S_{dW} + (1 - \beta)\,(dW)^2$  (the square $(dW)^2$ is element-wise)
    $S_{db} = \beta S_{db} + (1 - \beta)\,(db)^2$
    Update:
    $W := W - \alpha \dfrac{dW}{\sqrt{S_{dW}} + \varepsilon}$
    $b := b - \alpha \dfrac{db}{\sqrt{S_{db}} + \varepsilon}$
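
A corresponding sketch of one RMSprop update; the default `beta` and `eps` values are typical choices, not values prescribed by the notes above.

```python
import numpy as np

def rmsprop_step(W, b, dW, db, sdW, sdb, alpha=0.01, beta=0.999, eps=1e-8):
    """One RMSprop update; (dW)**2 and (db)**2 are element-wise squares.

    sdW and sdb start at zero arrays with the same shapes as W and b.
    """
    sdW = beta * sdW + (1 - beta) * dW ** 2
    sdb = beta * sdb + (1 - beta) * db ** 2
    W = W - alpha * dW / (np.sqrt(sdW) + eps)
    b = b - alpha * db / (np.sqrt(sdb) + eps)
    return W, b, sdW, sdb
```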
    (5) Adam optimization algorithm
    Initialize $V_{dW} = 0$, $S_{dW} = 0$, $V_{db} = 0$, $S_{db} = 0$.
    On iteration t:
    Compute $dW$, $db$ using the current mini-batch (the mini-batch gradient).
    "Momentum":
    $V_{dW} = \beta_1 V_{dW} + (1 - \beta_1)\,dW$, $V_{db} = \beta_1 V_{db} + (1 - \beta_1)\,db$
    "RMSprop":
    $S_{dW} = \beta_2 S_{dW} + (1 - \beta_2)\,(dW)^2$, $S_{db} = \beta_2 S_{db} + (1 - \beta_2)\,(db)^2$
    Bias correction:
    $V_{dW}^{\text{corrected}} = \dfrac{V_{dW}}{1 - \beta_1^t}$, $V_{db}^{\text{corrected}} = \dfrac{V_{db}}{1 - \beta_1^t}$
    $S_{dW}^{\text{corrected}} = \dfrac{S_{dW}}{1 - \beta_2^t}$, $S_{db}^{\text{corrected}} = \dfrac{S_{db}}{1 - \beta_2^t}$
    Update:
    $W := W - \alpha \dfrac{V_{dW}^{\text{corrected}}}{\sqrt{S_{dW}^{\text{corrected}}} + \varepsilon}$
    $b := b - \alpha \dfrac{V_{db}^{\text{corrected}}}{\sqrt{S_{db}^{\text{corrected}}} + \varepsilon}$
    Hyperparameter choices:
    $\alpha$: needs to be tuned
    $\beta_1$: 0.9 (for the $dW$ averages)
    $\beta_2$: 0.999 (for the element-wise $(dW)^2$ averages)
    $\varepsilon$: $10^{-8}$
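
A minimal sketch of one Adam step for a single parameter array (apply the same update to $b$); it combines the momentum, RMSprop, and bias-correction formulas above with the hyperparameter defaults listed, and leaves $\alpha$ as a required argument since it needs to be tuned.

```python
import numpy as np

def adam_step(W, dW, vdW, sdW, t, alpha,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter array W at iteration t (t starts at 1)."""
    vdW = beta1 * vdW + (1 - beta1) * dW          # "momentum" term
    sdW = beta2 * sdW + (1 - beta2) * dW ** 2     # "RMSprop" term, element-wise square
    v_corr = vdW / (1 - beta1 ** t)               # bias correction
    s_corr = sdW / (1 - beta2 ** t)
    W = W - alpha * v_corr / (np.sqrt(s_corr) + eps)
    return W, vdW, sdW
```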
  7. Learning rate decay
    epoch: one full pass through the training set
    $\alpha = \dfrac{1}{1 + \text{decay\_rate} \times \text{epoch\_num}} \, \alpha_0$
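
The decay formula above as a one-liner, with a small worked example; the numbers $\alpha_0 = 0.2$ and decay_rate $= 1$ are chosen only for illustration.

```python
def decayed_lr(alpha0, decay_rate, epoch_num):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)

# Example: alpha0 = 0.2, decay_rate = 1
# epoch 1 -> 0.1, epoch 2 -> 0.067, epoch 3 -> 0.05, ...
```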
  8. Local optima in neural networks

Reposted from blog.csdn.net/qq_31805127/article/details/79788571