Regularization in Neural Networks


Adding regularization will often help to prevent overfitting (the high-variance problem).

1. Logistic regression

Recall the optimization objective used during training:

$$\min\limits_{w,b}J\left(w,b\right),\quad w\in\mathbb{R}^{n_x},\ b\in\mathbb{R} \tag{1-1}$$

where

$$J\left(w,b\right)=\frac{1}{m}\sum_{i=1}^{m}L\left(\hat y^{(i)},y^{(i)}\right) \tag{1-2}$$

$L_2$ regularization (most commonly used):

$$J\left(w,b\right)=\frac{1}{m}\sum_{i=1}^{m}L\left(\hat y^{(i)},y^{(i)}\right)+\frac{\lambda}{2m}\left\lVert w \right\rVert_2^2 \tag{1-3}$$

where

$$\left\lVert w \right\rVert_2^2=\sum_{j=1}^{n_x}w_j^2=w^Tw \tag{1-4}$$

Why do we regularize just the parameter w? Because w is usually a high-dimensional parameter vector while b is a scalar: almost all of the parameters are in w rather than b.
$L_1$ regularization:

$$J\left(w,b\right)=\frac{1}{m}\sum_{i=1}^{m}L\left(\hat y^{(i)},y^{(i)}\right)+\frac{\lambda}{m}\left\lVert w \right\rVert_1 \tag{1-5}$$

where

$$\left\lVert w \right\rVert_1=\sum_{j=1}^{n_x}\left\lvert w_j \right\rvert \tag{1-6}$$

With $L_1$ regularization, w will end up being sparse; in other words, the w vector will have a lot of zeros in it, which can help compress the model a little.
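As a concrete illustration, here is a minimal NumPy sketch of the regularized costs in (1-3) and (1-5). It is not from the original post; the function name, the `lambd` argument (named to avoid Python's reserved word `lambda`), and the `sigmoid` helper are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost(w, b, X, Y, lambd, norm="l2"):
    """Cross-entropy cost with an L2 penalty (1-3) or an L1 penalty (1-5) on w.
    X: (n_x, m) inputs, Y: (1, m) labels, w: (n_x, 1), b: scalar."""
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)                          # predictions y_hat, shape (1, m)
    cross_entropy = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    if norm == "l2":
        penalty = (lambd / (2 * m)) * np.sum(np.square(w))   # (lambda / 2m) * ||w||_2^2
    else:
        penalty = (lambd / m) * np.sum(np.abs(w))            # (lambda / m)  * ||w||_1
    return cross_entropy + penalty
```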

2. Neural network "Frobenius norm"

For a neural network with L layers, the regularized cost function is

$$J\left(w^{[1]},b^{[1]},\dots,w^{[L]},b^{[L]}\right)=\frac{1}{m}\sum_{i=1}^{m}L\left(\hat y^{(i)},y^{(i)}\right)+\frac{\lambda}{2m}\sum_{l=1}^{L}\left\lVert w^{[l]} \right\rVert_F^2 \tag{2-1}$$

where

$$\left\lVert w^{[l]} \right\rVert_F^2=\sum_{i=1}^{n^{[l-1]}}\sum_{j=1}^{n^{[l]}}\left(w_{ij}^{[l]}\right)^2 \tag{2-2}$$
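A minimal sketch of how the penalty term in (2-1)/(2-2) could be computed over all layers; the `parameters` dictionary layout and key names are assumptions, not from the post.

```python
import numpy as np

def frobenius_penalty(parameters, lambd, m, L):
    """Regularization term of (2-1): (lambda / 2m) * sum_l ||w[l]||_F^2, using (2-2)."""
    total = 0.0
    for l in range(1, L + 1):
        W = parameters["w" + str(l)]      # w[l], shape (n[l], n[l-1])
        total += np.sum(np.square(W))     # ||w[l]||_F^2, equation (2-2)
    return (lambd / (2 * m)) * total
```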

$L_2$ regularization is also called weight decay:

$$\begin{aligned} dw^{[l]}&=\left(\text{from backprop}\right)+\frac{\lambda}{m}w^{[l]}\\ w^{[l]}&:=w^{[l]}-\alpha\, dw^{[l]}\\ &=\left(1-\frac{\alpha\lambda}{m}\right)w^{[l]}-\alpha\left(\text{from backprop}\right) \end{aligned}\tag{2-3}$$

This keeps the weights $w$ from becoming too large, which helps prevent overfitting.
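A minimal sketch of the weight-decay update in (2-3); the function and variable names are assumptions.

```python
import numpy as np

def update_with_weight_decay(w, dw_from_backprop, alpha, lambd, m):
    """One gradient step with the L2 term added, as in (2-3)."""
    dw = dw_from_backprop + (lambd / m) * w   # regularization gradient added to the backprop gradient
    return w - alpha * dw                     # equals (1 - alpha*lambd/m) * w - alpha * dw_from_backprop

# Quick check that the two forms in (2-3) agree:
w = np.random.randn(4, 3)
dw_bp = np.random.randn(4, 3)
alpha, lambd, m = 0.1, 0.7, 100
assert np.allclose(update_with_weight_decay(w, dw_bp, alpha, lambd, m),
                   (1 - alpha * lambd / m) * w - alpha * dw_bp)
```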

3. Inverted dropout

For each training example, a different random subset of nodes can be dropped.
Inverted dropout (dropout is applied in both the forward and backward passes):

$$\begin{aligned} d^{[3]}&=np.random.rand(a^{[3]}.shape[0],\,a^{[3]}.shape[1]) < keep\_prob\\ a^{[3]}&=np.multiply(a^{[3]},d^{[3]})\quad \#\ \text{element-wise multiplication } a3*d3\\ a^{[3]}&=a^{[3]}\,/\,keep\_prob\quad \#\ \text{inverted dropout: keeps the expected value of }a^{[3]}\text{ unchanged}\\ z^{[4]}&=w^{[4]}a^{[3]}+b^{[4]} \end{aligned}\tag{3-1}$$

By dividing by keep_prob, the inverted dropout technique ensures that the expected value of a3 remains the same. This also makes test time easier, because there is no scaling problem left to correct: dropout is not applied at test time.
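A runnable sketch of the inverted-dropout step in (3-1), with a `train` flag to show that no dropout (and no rescaling) is applied at test time; the layer shapes, the keep_prob value, and the function name are assumptions.

```python
import numpy as np

np.random.seed(1)
keep_prob = 0.8                               # probability of keeping a unit

a3 = np.random.randn(5, 10)                   # activations of layer 3, shape (n3, m)
w4 = np.random.randn(2, 5)
b4 = np.zeros((2, 1))

def layer4_forward(a3, train=True):
    if train:
        d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob  # dropout mask
        a3 = np.multiply(a3, d3)              # zero out ~(1 - keep_prob) of the units
        a3 /= keep_prob                       # inverted dropout: keep E[a3] unchanged
        # In the backward pass, da3 would be multiplied by the same mask d3
        # and divided by keep_prob as well.
    # At test time (train=False) no mask and no rescaling are needed.
    return np.dot(w4, a3) + b4                # z4

z4_train = layer4_forward(a3, train=True)
z4_test = layer4_forward(a3, train=False)
```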


Reposted from juejin.im/post/7109128137614721032