DeepLearning.ai Code Notes 2: Hyperparameter Tuning, Regularization, and Optimization

1. L2 Regularization

Cost function without regularization:

$$J = -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log\left(a^{[L](i)}\right) + (1 - y^{(i)}) \log\left(1 - a^{[L](i)}\right) \right) \tag{1}$$

Cost function with L2 regularization:
$$J_{regularized} = \underbrace{-\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log\left(a^{[L](i)}\right) + (1 - y^{(i)}) \log\left(1 - a^{[L](i)}\right) \right)}_{\text{cross-entropy cost}} + \underbrace{\frac{1}{m}\frac{\lambda}{2} \sum_{l}\sum_{k}\sum_{j} \left(W_{k,j}^{[l]}\right)^2}_{\text{L2 regularization cost}} \tag{2}$$

cross_entropy_cost = compute_cost(A3, Y) # This gives you the cross-entropy part of the cost

L2_regularization_cost = 1./m * lambd/2 * (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3)))

cost = cross_entropy_cost + L2_regularization_cost
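
The backward pass below then adds the gradient of the L2 penalty to each dW. Differentiating the regularization part of cost (2) with respect to a single weight explains where the extra lambd / m * W term comes from:

$$\frac{\partial}{\partial W_{k,j}^{[l]}} \left( \frac{1}{m}\frac{\lambda}{2} \sum_{l'}\sum_{k'}\sum_{j'} \left(W_{k',j'}^{[l']}\right)^2 \right) = \frac{\lambda}{m} W_{k,j}^{[l]}$$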
dZ3 = A3 - Y
dW3 = 1./m * np.dot(dZ3, A2.T) + lambd / m * W3   # extra term from the L2 penalty
db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
dA2 = np.dot(W3.T, dZ3)
dZ2 = np.multiply(dA2, np.int64(A2 > 0))
dW2 = 1./m * np.dot(dZ2, A1.T) + lambd / m * W2
db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
dA1 = np.dot(W2.T, dZ2)
dZ1 = np.multiply(dA1, np.int64(A1 > 0))
dW1 = 1./m * np.dot(dZ1, X.T) + lambd / m * W1
db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
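
These gradients then feed an ordinary gradient-descent step. A minimal sketch of such an update (the function name update_parameters and the dictionary layout are illustrative assumptions, not part of the original snippet):

# Minimal sketch: plain gradient descent using the regularized gradients above.
def update_parameters(parameters, grads, learning_rate):
    L = len(parameters) // 2                 # number of layers
    for l in range(1, L + 1):
        parameters["W" + str(l)] -= learning_rate * grads["dW" + str(l)]
        parameters["b" + str(l)] -= learning_rate * grads["db" + str(l)]
    return parameters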

2. He Initialization

parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2.0/layers_dims[l-1])
parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
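
These two lines sit inside a loop over the layers. A minimal sketch of the surrounding initializer (the function name initialize_parameters_he and the example layers_dims are assumptions for illustration):

import numpy as np

def initialize_parameters_he(layers_dims):
    # He initialization: weights scaled by sqrt(2 / fan_in), biases set to zero.
    parameters = {}
    L = len(layers_dims) - 1                 # number of layers
    for l in range(1, L + 1):
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2.0 / layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters

parameters = initialize_parameters_he([2, 4, 1])   # example: 2 inputs, 4 hidden units, 1 output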

3. Adam Gradient Descent

How does Adam work?

  • It calculates an exponentially weighted average of past gradients, and stores it in variables v (before bias correction) and v^corrected (with bias correction).
  • It calculates an exponentially weighted average of the squares of the past gradients, and stores it in variables s (before bias correction) and s^corrected (with bias correction).
  • It updates parameters in a direction based on combining information from “1” and “2”.
    The update rule is, for l = 1, ..., L:

$$\begin{cases}
v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1 - \beta_1) \dfrac{\partial J}{\partial W^{[l]}} \\[4pt]
v^{corrected}_{dW^{[l]}} = \dfrac{v_{dW^{[l]}}}{1 - (\beta_1)^t} \\[4pt]
s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1 - \beta_2) \left(\dfrac{\partial J}{\partial W^{[l]}}\right)^2 \\[4pt]
s^{corrected}_{dW^{[l]}} = \dfrac{s_{dW^{[l]}}}{1 - (\beta_2)^t} \\[4pt]
W^{[l]} = W^{[l]} - \alpha \dfrac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}} + \varepsilon}
\end{cases}$$

where:

t counts the number of steps taken by Adam
L is the number of layers
β1 and β2 are hyperparameters that control the two exponentially weighted averages.
α is the learning rate
ε is a very small number to avoid dividing by zero

In other words: the first and second steps are momentum gradient descent and RMSprop, respectively. Both use the idea of exponentially weighted averages to tie the current update to past gradients, so the update either keeps some of its previous momentum (trend) or is smoothed toward an average value instead of overcorrecting. "corrected" refers to the bias correction applied to these averages.
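
The loop below assumes v and s have already been initialized as zero-valued dictionaries matching the parameter shapes. A minimal sketch of that setup (the helper name initialize_adam is an assumption for illustration):

import numpy as np

def initialize_adam(parameters):
    # Zero-valued first and second moment estimates, one entry per gradient.
    L = len(parameters) // 2                 # number of layers
    v, s = {}, {}
    for l in range(L):
        v["dW" + str(l+1)] = np.zeros_like(parameters["W" + str(l+1)])
        v["db" + str(l+1)] = np.zeros_like(parameters["b" + str(l+1)])
        s["dW" + str(l+1)] = np.zeros_like(parameters["W" + str(l+1)])
        s["db" + str(l+1)] = np.zeros_like(parameters["b" + str(l+1)])
    return v, s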

for l in range(L):
     # Moving average of the gradients. Inputs: "v, grads, beta1". Output: "v".
     v["dW" + str(l+1)] = beta1 * v["dW" + str(l+1)] + (1-beta1) * grads['dW' + str(l+1)]
     v["db" + str(l+1)] = beta1 * v["db" + str(l+1)] + (1-beta1) * grads['db' + str(l+1)]

     # Compute bias-corrected first moment estimate. Inputs: "v, beta1, t". Output: "v_corrected".
     v_corrected["dW" + str(l+1)] = v["dW" + str(l+1)] / (1-np.power(beta1,t))
     v_corrected["db" + str(l+1)] = v["db" + str(l+1)] / (1-np.power(beta1,t))

     # Moving average of the squared gradients. Inputs: "s, grads, beta2". Output: "s".
     s["dW" + str(l+1)] = beta2 * s["dW" + str(l+1)] + (1-beta2) * grads['dW' + str(l+1)]**2
     s["db" + str(l+1)] = beta2 * s["db" + str(l+1)] + (1-beta2) * grads['db' + str(l+1)]**2

     # Compute bias-corrected second raw moment estimate. Inputs: "s, beta2, t". Output: "s_corrected".
     s_corrected["dW" + str(l+1)] = s["dW" + str(l+1)] / (1-np.power(beta2,t))
     s_corrected["db" + str(l+1)] = s["db" + str(l+1)] / (1-np.power(beta2,t))

Reposted from blog.csdn.net/dod_jdi/article/details/79805530