1. Regularization in neural networks:
(1) L2 regularization:
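As a minimal sketch, L2 regularization adds the penalty (λ / 2m) · Σ_l ||W[l]||_F² to the cost. The helper name and the numbers below are illustrative, not from the course:

```python
import numpy as np

def l2_regularized_cost(cross_entropy_cost, weights, lambd, m):
    """Add the L2 penalty (lambd / (2*m)) * sum of ||W[l]||_F^2 to the base cost."""
    l2_term = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy_cost + l2_term

# Hypothetical example: two weight matrices, base cost 0.5
W1 = np.ones((2, 2))   # ||W1||_F^2 = 4
W2 = np.ones((1, 2))   # ||W2||_F^2 = 2
cost = l2_regularized_cost(0.5, [W1, W2], lambd=0.1, m=10)
# l2_term = (0.1 / 20) * 6 = 0.03, so cost = 0.53
```

Because the penalty grows with the Frobenius norms of the weights, gradient descent is pushed toward smaller weights ("weight decay").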
(2) Why does regularization prevent over-fitting?
When lambda is large enough, minimizing J drives the weight matrices w close to zero, which simplifies the network and pushes it toward a high-bias state.
With a large lambda, w is small, so z = w * a + b is also small. Taking tanh as an example:
when z stays in the small range, g(z) is approximately linear. If every layer is approximately linear, the whole network behaves like a linear network, and such a network cannot over-fit.
(3) dropout regularization (random inactivation):
Each node of the network is dropped (inactivated) with some probability, as follows:
Dropping nodes simplifies the connections and leaves a smaller network:
The code is as follows:
Apply dropout to the third layer with keep_prob = 0.8 (the probability that a hidden unit is retained, i.e. the probability that it is eliminated is 0.2); keep_prob may differ from layer to layer.
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
a3 = np.multiply(a3, d3)  # zero out the dropped (inactivated) nodes
a3 = a3 / keep_prob  # scale up to compensate for the ~20% dropped, so the expected value of a3 is unchanged
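The three lines above can be wrapped into a small self-contained sketch (the function name and the all-ones activations are mine, for illustration); the rescaling keeps the mean of the activations roughly unchanged:

```python
import numpy as np

def inverted_dropout(a, keep_prob, rng):
    """Zero out each unit with probability 1 - keep_prob, then rescale."""
    d = rng.random(a.shape) < keep_prob   # boolean keep-mask
    a = np.multiply(a, d)                 # zero out the dropped units
    a = a / keep_prob                     # rescale so the expected value of a is unchanged
    return a

rng = np.random.default_rng(0)
a3 = np.ones((50, 1000))
a3_dropped = inverted_dropout(a3, keep_prob=0.8, rng=rng)
# the mean of a3_dropped stays close to 1 thanks to the 1/keep_prob rescaling
```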
(4) Other regularization methods:
① enlarge the data set (data augmentation);
② early stopping of the iteration:
(5) Input normalization:
① zero-mean:
μ = (1/m) * ∑ x(i)
x = x - μ
② variance normalization:
σ² = (1/m) * ∑ (x(i))²
x = x / σ²
③ Why normalize the inputs?
Without normalization the contours of the cost function can be elongated: for example, x1 may range up to 1000 while x2 only ranges over 0–1. After normalization the cost function looks more symmetric.
As the figure shows, gradient descent on the unnormalized cost takes a more tortuous path, while on the normalized cost it descends quickly.
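The two normalization steps can be sketched as follows (the function name and sample values are mine; here each feature is divided by its standard deviation, which is what yields unit variance):

```python
import numpy as np

def normalize_inputs(X):
    """Zero-mean, then variance-normalize each feature (rows = features, columns = examples)."""
    mu = X.mean(axis=1, keepdims=True)             # mu = (1/m) * sum x(i)
    X = X - mu                                     # zero-mean
    sigma2 = (X ** 2).mean(axis=1, keepdims=True)  # sigma^2 = (1/m) * sum (x(i))^2
    X = X / np.sqrt(sigma2)                        # each feature now has unit variance
    return X

# One feature on the scale of thousands, one on the scale of 0-1
X = np.array([[1000.0, 2000.0, 3000.0],
              [0.1, 0.5, 0.9]])
Xn = normalize_inputs(X)
# After normalization, both features have mean 0 and variance 1
```

Apply the same μ and σ² computed on the training set to the test set, so both go through the identical transformation.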
2. Vanishing / exploding gradients:
(1) An illustrative example:
Assume g(z) = z and b[l] = 0. Then:
y = w[L]w[L-1]w[L-2] ... w[2]w[1]x
If the entries of each w[l] are slightly larger than 1, this product grows exponentially with the depth L (exploding); if slightly smaller than 1, it shrinks exponentially (vanishing).
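The exponential growth/decay can be seen numerically. In this sketch (my own toy setup) every layer uses the same matrix scale * I, so the output is exactly scale**depth times the input:

```python
import numpy as np

def deep_linear_output(scale, depth, n=2):
    """y = w[L] ... w[1] x with every w[l] = scale * I, b[l] = 0, g(z) = z."""
    x = np.ones((n, 1))
    W = scale * np.eye(n)
    y = x
    for _ in range(depth):
        y = W @ y        # one linear layer
    return y

exploding = deep_linear_output(1.5, depth=50)   # grows like 1.5**50 ~ 6e8
vanishing = deep_linear_output(0.5, depth=50)   # shrinks like 0.5**50 ~ 9e-16
```

Even a modest deviation from 1 per layer, compounded over 50 layers, changes the output (and likewise the gradients) by many orders of magnitude.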
(2) Solution: careful weight initialization.
Since z = w1x1 + w2x2 + ... + wnxn,
the larger n is, the smaller we want each w[l] to be, so we set Var(w[l]) = 1/n, or 2/n (which works better, e.g. with ReLU), i.e.:
w[l] = np.random.randn(shape) * np.sqrt(2/n[l-1])
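A quick sanity check of this initialization (the shapes are arbitrary examples of mine): scaling standard-normal draws by sqrt(2/n[l-1]) gives weights whose empirical variance is close to 2/n[l-1]:

```python
import numpy as np

rng = np.random.default_rng(1)
n_prev = 1000                 # fan-in, i.e. n[l-1]
n_curr = 500                  # units in layer l

# He initialization: Var(w) = 2 / n[l-1]
W = rng.standard_normal((n_curr, n_prev)) * np.sqrt(2 / n_prev)
# empirical variance of W should be close to 2 / 1000 = 0.002
```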
3. Gradient checking:
(1) Numerical approximation of the gradient:
The two-sided difference formula is more accurate, and can be used to check whether g(θ) implements the partial derivative of the function f.
(2) Gradient checking for a neural network:
① reshape W[1], b[1], ..., W[L], b[L] from matrices into a single vector θ;
② reshape dW[1], db[1], ..., dW[L], db[L] from matrices into a single vector dθ;
③ J = J(θ1, θ2, ..., θi, ...)
for each i:
dθapprox[i] = (J(θ1, θ2, ..., θi + ε, ...) - J(θ1, θ2, ..., θi - ε, ...)) / (2 * ε)
check dθapprox ≈ dθ by computing || dθapprox - dθ ||₂ / (|| dθapprox ||₂ + || dθ ||₂) < 10⁻⁷ (or another error threshold)
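The loop above can be sketched as a small function. The quadratic cost used to exercise it is my own toy example; for J(θ) = Σ θ², the analytic gradient is 2θ, so the relative difference should come out far below the 1e-7 threshold:

```python
import numpy as np

def grad_check(J, theta, dtheta, eps=1e-7):
    """Compare the analytic gradient dtheta against the two-sided numerical gradient of J."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps                    # J(..., theta_i + eps, ...)
        minus[i] -= eps                   # J(..., theta_i - eps, ...)
        approx[i] = (J(plus) - J(minus)) / (2 * eps)
    num = np.linalg.norm(approx - dtheta)
    den = np.linalg.norm(approx) + np.linalg.norm(dtheta)
    return num / den                      # relative difference to compare against 1e-7

# Toy check: J(theta) = sum(theta**2), so dJ/dtheta = 2 * theta
theta = np.array([1.0, -2.0, 3.0])
diff = grad_check(lambda t: np.sum(t ** 2), theta, 2 * theta)
```

If the analytic gradient had a bug (say, a dropped factor of 2), diff would jump to around 0.3 instead, which is how the check flags an incorrect backward pass.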
(3) Notes on gradient checking:
① turn gradient checking off once debugging is done;
② if the cost includes a regularization term, include it in the check as well;
③ it does not work together with dropout;
④ run the check after random initialization (and possibly again after some training).