Improving Deep Neural Networks (1): Practical Aspects of Deep Learning

1. Regularization of an L-layer neural network:

(1) L2 regularization:
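The L2-regularized cost adds a Frobenius-norm penalty on the weight matrices: J = (1/m) ∑ L(ŷ(i), y(i)) + (λ/2m) ∑_l ||W[l]||²_F, and during backprop each dW[l] gains an extra (λ/m)·W[l] term (weight decay). Below is a minimal numpy sketch of the penalty term, assuming (for illustration only) that the weights are stored in a dict `parameters` keyed "W1", ..., "WL", "b1", ..., "bL":

import numpy as np

def l2_regularized_cost(cross_entropy_cost, parameters, lambd, m):
    # Add (lambda / 2m) * sum_l ||W[l]||_F^2 to the unregularized cost.
    L = len(parameters) // 2   # one W and one b per layer
    l2_penalty = sum(np.sum(np.square(parameters["W" + str(l)])) for l in range(1, L + 1))
    return cross_entropy_cost + (lambd / (2 * m)) * l2_penalty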

 

(2) Why does regularization reduce overfitting?

When λ is large enough, minimizing J pushes the weight matrices w close to zero, which effectively prunes many hidden units and moves the simplified network toward a high-bias regime:

With a large λ, w becomes small, so z = w·a + b is also small. Taking the tanh activation as an example:

When z stays in this small range, g(z) is approximately linear. If every layer is roughly linear, the whole network behaves like a linear model, which is far less prone to overfitting.
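A quick numerical illustration of this point (a sketch added here, not part of the original notes): near z = 0, tanh(z) is almost indistinguishable from z, so a layer whose pre-activations stay small behaves almost linearly.

import numpy as np

z_small = np.linspace(-0.1, 0.1, 5)
z_large = np.linspace(-3.0, 3.0, 5)
print(np.max(np.abs(np.tanh(z_small) - z_small)))   # ~3e-4: essentially linear
print(np.max(np.abs(np.tanh(z_large) - z_large)))   # ~2.0: clearly non-linear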

 

(3) Dropout regularization (random deactivation):

Each node in the network is deactivated (dropped) with some probability, as shown below:

 

Dropping units removes their connections as well, leaving a smaller, simpler network:

The code is as follows:

Here we apply dropout to the third layer with keep_prob = 0.8 (the probability that a hidden unit is kept, i.e., the probability that it is dropped is 0.2); keep_prob may differ from layer to layer.

d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob   # boolean mask: keep each unit with probability keep_prob

a3 = np.multiply(a3, d3)   # zero out the dropped units

a3 = a3 / keep_prob        # scale up so the expected value of a3 stays unchanged (inverted dropout)
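For reference, the same inverted-dropout step wrapped in a small helper (a sketch with an illustrative name, not code from the original notes). At test time dropout is not applied and no extra scaling is needed, because the division by keep_prob during training already kept the expected activations unchanged.

import numpy as np

def apply_inverted_dropout(a, keep_prob, training=True):
    # Randomly deactivate units of the activation matrix `a` during training only.
    if not training or keep_prob >= 1.0:
        return a                                   # no dropout at test time
    mask = np.random.rand(*a.shape) < keep_prob    # keep each unit with probability keep_prob
    return (a * mask) / keep_prob                  # rescale to preserve the expected value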

 

(4) Other regularization methods:

① Expand the data set (data augmentation);

② Early stopping (terminate training once the dev-set error stops improving):
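A minimal sketch of the early-stopping loop (the helpers `train_one_epoch` and `dev_error` are hypothetical placeholders, not from the original notes): train as usual, track the dev-set error, and stop once it has not improved for a few epochs, keeping the best parameters seen so far.

def early_stopping_train(params, train_one_epoch, dev_error, max_epochs=100, patience=5):
    # Stop once the dev error has not improved for `patience` consecutive epochs.
    best_err, best_params, bad_epochs = float("inf"), params, 0
    for _ in range(max_epochs):
        params = train_one_epoch(params)   # hypothetical: one pass over the training set
        err = dev_error(params)            # hypothetical: error on the dev set
        if err < best_err:
            best_err, best_params, bad_epochs = err, params, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_params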

 

 

(5) Normalizing the inputs:

① Zero-centering (subtract the mean):

μ = 1/m * ∑ x(i)

x = x - μ

 

② variance normalization:

σ² = 1/m * ∑ (x(i))²    (element-wise square, computed after zero-centering)

x = x / σ²

③ Why normalize the inputs?

Without normalization, the contours of the cost function can be very elongated when features have very different scales, e.g. x1 ranging up to 1000 while x2 only ranges from 0 to 1. After normalizing the inputs, the cost function looks much more symmetric.

As the figure shows, gradient descent on the unnormalized cost follows a tortuous path, while on the normalized cost it descends rapidly.
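A minimal numpy sketch of both steps, assuming X has shape (n_features, m) with one example per column (here dividing by the standard deviation √σ², the usual convention; the note above follows the course slide and writes x / σ²). The same μ and σ² computed on the training set must be reused to normalize the dev and test sets.

import numpy as np

def normalize_inputs(X):
    # Zero-center and variance-normalize X of shape (n_features, m).
    mu = np.mean(X, axis=1, keepdims=True)            # per-feature mean
    X = X - mu
    sigma2 = np.mean(X ** 2, axis=1, keepdims=True)   # per-feature variance (after centering)
    X = X / np.sqrt(sigma2 + 1e-8)                    # small epsilon avoids division by zero
    return X, mu, sigma2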

 

2. Vanishing / exploding gradients:

(1) An illustrative example:

 

Assume g(z) = z (a linear activation) and b[l] = 0. Then:

y = w[L]w[L-1]w[L-2] ... w[2]w[1]x

If each w[l] is slightly larger than the identity, the activations and gradients grow exponentially with the depth L (exploding gradients); if each w[l] is slightly smaller than the identity, they shrink exponentially toward zero (vanishing gradients).
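A quick numerical check of this behaviour (a sketch added here, not from the original notes): with w[l] = 1.5·I the output norm explodes after 100 layers, while with w[l] = 0.5·I it collapses toward zero.

import numpy as np

L, n = 100, 4
x = np.ones((n, 1))
for scale in (1.5, 0.5):
    a = x
    for _ in range(L):
        a = (scale * np.eye(n)) @ a       # w[l] = scale * I, g(z) = z, b[l] = 0
    print(scale, np.linalg.norm(a))       # 1.5 -> ~8e17, 0.5 -> ~1.6e-30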

 

(2) A partial solution: careful weight initialization

Since z = w1x1 + w2x2 + ... + wnxn,

as the number of inputs n grows, we want each weight to be smaller, so we set Var(w[l]) = 1/n (or 2/n, which works better with ReLU), i.e.:

w[l] = np.random.randn(shape) * np.sqrt(2/n[l-1])
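A minimal sketch of applying this rule to every layer, assuming (for illustration) that `layer_dims = [n[0], n[1], ..., n[L]]` lists the number of units per layer; this is He initialization, the 2/n variant that pairs well with ReLU:

import numpy as np

def initialize_parameters_he(layer_dims):
    # He initialization: Var(W[l]) = 2 / n[l-1], biases start at zero.
    params = {}
    for l in range(1, len(layer_dims)):
        params["W" + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l - 1])
                                * np.sqrt(2.0 / layer_dims[l - 1]))
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params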

 

3. Gradient checking:

(1) Numerical approximation of the gradient:

The two-sided difference formula, g(θ) ≈ (f(θ + ε) - f(θ - ε)) / (2ε), is more accurate than the one-sided version and can be used to check whether g(θ) really implements the derivative of f.
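A tiny numerical example (added here for illustration): for f(θ) = θ³ the true derivative at θ = 1 is 3, and the two-sided estimate is accurate to O(ε²) while the one-sided estimate is only accurate to O(ε).

theta, eps = 1.0, 1e-2
f = lambda t: t ** 3
two_sided = (f(theta + eps) - f(theta - eps)) / (2 * eps)   # 3.0001 (error ~1e-4)
one_sided = (f(theta + eps) - f(theta)) / eps               # 3.0301 (error ~3e-2)
print(two_sided - 3.0, one_sided - 3.0)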

 

(2) Gradient checking for a neural network:

① Reshape W[1], b[1], ..., W[L], b[L] from matrices into a single vector θ;

② Reshape dW[1], db[1], ..., dW[L], db[L] into a single vector dθ;

③ Treat the cost as J = J(θ1, θ2, ..., θi, ...), then:

for each i:

  dθapprox[i] = (J(θ1, θ2, ..., θi + ε, ...) - J(θ1, θ2, ..., θi - ε, ...)) / (2 * ε)

then check dθapprox ≈ dθ by computing || dθapprox - dθ ||₂ / (|| dθapprox ||₂ + || dθ ||₂) and verifying that it is below 10⁻⁷ (or some other error threshold).
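A sketch of the whole check in numpy (the interface is assumed for illustration: `cost(theta)` evaluates J from the flattened parameter vector, and `dtheta` is the backprop gradient flattened the same way):

import numpy as np

def gradient_check(cost, theta, dtheta, eps=1e-7):
    # Compare the analytic gradient dtheta with a two-sided numerical estimate.
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += eps
        theta_minus[i] -= eps
        dtheta_approx[i] = (cost(theta_plus) - cost(theta_minus)) / (2 * eps)
    diff = (np.linalg.norm(dtheta_approx - dtheta)
            / (np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)))
    return diff   # roughly < 1e-7 suggests backprop is correct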

 

(3) Notes on gradient checking:

① Turn gradient checking off once debugging is done (do not run it during training);

② If you use regularization, remember to include the regularization term in J and its gradients;

③ It does not work with dropout (set keep_prob = 1 while checking);

④ Run the check at random initialization (and perhaps again after some training).


Origin www.cnblogs.com/orangecyh/p/11810840.html