DL_C1_week4-1(Build Deep Neural Network)

Building your Deep Neural Network: Step by Step

Following the previous post, where we trained a two-layer network (with one hidden layer), this time we will build a deep neural network with multiple hidden layers.

  • First, we need to write some helper functions to implement this model.
  • In the next post, we will reuse these helper functions to build a deep neural network for image classification.

    After reading this post you will be able to:

  • Use non-linear units like ReLU (rectified linear unit) to improve your model

  • Build a deeper neural network (with more than one hidden layer)
  • Implement an easy-to-use neural network class

Notation:

  • Superscript $[l]$ denotes a quantity associated with the $l^{th}$ layer.
    • Example: $a^{[L]}$ is the activation output of the $L^{th}$ layer of the network.
    • Example: $W^{[L]}$ and $b^{[L]}$ are the parameters of the $L^{th}$ layer.
  • Superscript $(i)$ denotes a quantity associated with the $i^{th}$ example.
    • Example: $x^{(i)}$ is the $i^{th}$ training example.
  • Subscript $i$ denotes the $i^{th}$ entry of a vector.
    • Example: $a^{[l]}_i$ denotes the $i^{th}$ entry of the $l^{th}$ layer's activations (vector).

Let’s get started!

1 - Packages

Let’s first import all the packages that you will need during this assignment.
- numpy is the main package for scientific computing with Python.
- matplotlib is a library to plot graphs in Python.
- np.random.seed(1) is used to keep all the random function calls consistent. It will help us grade your work. Please don’t change the seed.

import numpy as np
import h5py
import matplotlib.pyplot as plt
from testCases_v2 import *

%matplotlib inline
plt.rcParams['figure.figsize'] = (5.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

%load_ext autoreload
%autoreload 2

np.random.seed(1)

* Helper functions

  • Compute the non-linear outputs of sigmoid() and relu(), and the derivatives of these activation functions for use in backpropagation
def sigmoid(Z):
    """
    Returns:
    A -- output of sigmoid(z), same shape as Z
    cache -- returns Z as well, useful during backpropagation
    """
    A = 1/(1+np.exp(-Z))
    cache = Z
    return A, cache

def relu(Z):
    """
    Returns:
    A -- Post-activation parameter, of the same shape as Z
    cache -- returns Z as well, useful during backpropagation
    """
    A = np.maximum(0,Z)
    assert(A.shape == Z.shape)
    cache = Z 
    return A, cache

def relu_backward(dA, cache):
    """
    Arguments:
    dA -- post-activation gradient, of any shape
    cache -- 'Z' where we store for computing backward propagation efficiently
    Returns:
    dZ -- Gradient of the cost with respect to Z
    """
    Z = cache
    dZ = np.array(dA, copy=True) # just converting dz to a correct object.
    # When z <= 0, you should set dz to 0 as well. 
    dZ[Z <= 0] = 0
    assert (dZ.shape == Z.shape)
    return dZ

def sigmoid_backward(dA, cache):
    """
    Arguments:
    dA -- post-activation gradient, of any shape
    cache -- 'Z' where we store for computing backward propagation efficiently
    Returns:
    dZ -- Gradient of the cost with respect to Z
    """
    Z = cache
    s = 1/(1+np.exp(-Z))
    dZ = dA * s * (1-s)
    assert (dZ.shape == Z.shape)
    return dZ
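
As a quick sanity check of these helpers (a minimal sketch with a made-up input array, not part of the graded code):

Z_demo = np.array([[-1.0, 0.0, 2.5]])   # hypothetical pre-activation values
A_sig, _ = sigmoid(Z_demo)              # approx. [[0.269, 0.5, 0.924]]
A_rel, _ = relu(Z_demo)                 # [[0., 0., 2.5]]
print(A_sig, A_rel)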

2 - Outline of the Assignment

  • Initialize the parameters of an L-layer DNN
  • Forward propagation
    • The LINEAR part of a layer's forward step, $WX + b = Z$ (resulting in $Z^{[l]}$).
    • The ACTIVATION function (relu/sigmoid).
    • Combine the two previous steps into a [LINEAR->ACTIVATION] forward function.
    • In the forward pass, the first (L-1) layers use the [LINEAR->RELU] combination, and the last layer uses [LINEAR->SIGMOID].
  • Compute the cost.
  • Backward propagation
    • Gradients of the LINEAR part
    • Derivatives of the activations (relu_backward/sigmoid_backward)
    • Use the chain rule to obtain the gradients of every layer's parameters
  • Finally, update the parameters.

    Note:
    During forward propagation, some of the intermediate results have to be saved, because they are needed later in backward propagation to compute the gradients of the parameters. So in the forward computations we store these intermediate values in a 'cache' (in this assignment each cache is a Python tuple, and all the caches are collected in a list).

3 - Initialization

Parameter initialization:

  • First, write a function that initializes a 2-layer network, as a warm-up
  • Then write a function that can initialize a deeper, L-layer network

3.1 - 2-layer Neural Network

Exercise: Write a function initialize_parameters() that initializes the parameters of a two-layer network.
Instructions:
- The model's structure is: LINEAR -> RELU -> LINEAR -> SIGMOID.
- Randomly initialize the weight matrices with np.random.randn(shape)*0.01.
- Initialize the biases to zeros with np.zeros(shape).

# GRADED FUNCTION: initialize_parameters

def initialize_parameters(n_x, n_h, n_y):
    """
    n_x -- size of the input layer
    n_h -- size of the hidden layer
    n_y -- size of the output layer
    Returns:
    parameters -- python dictionary containing your parameters:
                    W1 -- weight matrix of shape (n_h, n_x)
                    b1 -- bias vector of shape (n_h, 1)
                    W2 -- weight matrix of shape (n_y, n_h)
                    b2 -- bias vector of shape (n_y, 1)
    """
    np.random.seed(1)
    W1 = np.random.randn(n_h, n_x)*0.01
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h)*0.01
    b2 = np.zeros((n_y, 1))
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}
    return parameters    
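
For example (a small sketch with arbitrarily chosen layer sizes, not one of the official test cases), a 3-2-1 network gives these shapes:

parameters = initialize_parameters(3, 2, 1)
print(parameters["W1"].shape, parameters["b1"].shape)   # (2, 3) (2, 1)
print(parameters["W2"].shape, parameters["b2"].shape)   # (1, 2) (1, 1)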

3.2 - L-layer Neural Network

Initializing the parameters of an $L$-layer network is more complicated than for a shallow one, because there are many more weight matrices and bias vectors. Still, it follows a pattern: once you complete initialize_parameters_deep() below, you can initialize a fully-connected network with any number of layers.
$n^{[l]}$ is the number of units in layer $l$.
For example, if the input $X$ has shape $(12288, 209)$ (with $m = 209$ examples), then:

| | **Shape of W** | **Shape of b** | **Activation** | **Shape of Activation** |
|---|---|---|---|---|
| **Layer 1** | $(n^{[1]}, 12288)$ | $(n^{[1]}, 1)$ | $Z^{[1]} = W^{[1]} X + b^{[1]}$ | $(n^{[1]}, 209)$ |
| **Layer 2** | $(n^{[2]}, n^{[1]})$ | $(n^{[2]}, 1)$ | $Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$ | $(n^{[2]}, 209)$ |
| **Layer L-1** | $(n^{[L-1]}, n^{[L-2]})$ | $(n^{[L-1]}, 1)$ | $Z^{[L-1]} = W^{[L-1]} A^{[L-2]} + b^{[L-1]}$ | $(n^{[L-1]}, 209)$ |
| **Layer L** | $(n^{[L]}, n^{[L-1]})$ | $(n^{[L]}, 1)$ | $Z^{[L]} = W^{[L]} A^{[L-1]} + b^{[L]}$ | $(n^{[L]}, 209)$ |

Remember that when we compute $WX + b$ in Python, broadcasting is carried out. For example, if:

$$W = \begin{bmatrix} j & k & l \\ m & n & o \\ p & q & r \end{bmatrix} \quad X = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} \quad b = \begin{bmatrix} s \\ t \\ u \end{bmatrix} \tag{2}$$

Then $WX + b$ will be:

$$WX + b = \begin{bmatrix} (ja + kd + lg) + s & (jb + ke + lh) + s & (jc + kf + li) + s \\ (ma + nd + og) + t & (mb + ne + oh) + t & (mc + nf + oi) + t \\ (pa + qd + rg) + u & (pb + qe + rh) + u & (pc + qf + ri) + u \end{bmatrix} \tag{3}$$
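
To see this broadcasting concretely, here is a tiny numpy sketch (all values made up) in which the column vector b is added to every column of WX:

W_demo = np.arange(9).reshape(3, 3)        # hypothetical 3x3 weight matrix
X_demo = np.ones((3, 4))                   # 4 hypothetical examples with 3 features each
b_demo = np.array([[1.], [2.], [3.]])      # bias as a column vector of shape (3, 1)
Z_demo = np.dot(W_demo, X_demo) + b_demo   # b_demo is broadcast across all 4 columns
print(Z_demo.shape)                        # (3, 4)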

Exercise: Implement initialization for an L-layer network.

Instructions:
- The model's structure is [LINEAR -> RELU] × (L-1) -> LINEAR -> SIGMOID. That is, the first L-1 layers use the ReLU activation function and the output layer uses the sigmoid activation function.
- Randomly initialize the weight matrices. Use np.random.randn(shape) * 0.01.
- Initialize the biases to zeros. Use np.zeros(shape).
- Store $n^{[l]}$, the number of units in each layer, in the variable layer_dims, a Python list. For example, layer_dims = [2,4,1] describes a two-layer model with 2 input units, a hidden layer of 4 units, and an output layer of 1 unit.

# GRADED FUNCTION: initialize_parameters_deep

def initialize_parameters_deep(layer_dims):
    """
    layer_dims -- python array (list) containing the number of units in each layer of the network
    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    Wl -- weight matrix of shape (layer_dims[l], layer_dims[l-1])
                    bl -- bias vector of shape (layer_dims[l], 1)
    """
    np.random.seed(3)
    parameters = {}
    L = len(layer_dims)            # number of layers in the network
    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1])*0.01
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))

    return parameters
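
A quick shape check (layer sizes chosen arbitrarily for illustration):

parameters = initialize_parameters_deep([5, 4, 3])
for l in range(1, 3):
    print(parameters['W' + str(l)].shape, parameters['b' + str(l)].shape)
# expected: (4, 5) (4, 1) then (3, 4) (3, 1)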

4 - Forward propagation module

4.1 - Linear Forward

The linear part of forward propagation computes the following equation:

$$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]} \tag{4}$$

where $A^{[0]} = X$.

Exercise: Implement linear_forward(), the linear part of forward propagation.

# GRADED FUNCTION: linear_forward

def linear_forward(A, W, b):
    """
    A -- activations from the previous layer (or the input data): (size of previous layer, number of examples)
    W -- weight matrix: (size of current layer, size of previous layer)
    b -- bias vector: (size of the current layer, 1)
    Returns:
    Z -- the input of the activation function, also called the pre-activation parameter
    cache -- a python tuple containing "A", "W" and "b";
    stored for computing the backward pass efficiently
    """
    Z = np.dot(W, A)+b
    cache = (A, W, b)
    return Z, cache
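
A minimal shape check of linear_forward, with random placeholder inputs (3 units in the previous layer, 2 examples, 1 unit in the current layer):

A_demo = np.random.randn(3, 2)     # activations of a hypothetical previous layer
W_demo = np.random.randn(1, 3)     # weights of the current layer
b_demo = np.random.randn(1, 1)     # bias of the current layer
Z_demo, linear_cache = linear_forward(A_demo, W_demo, b_demo)
print(Z_demo.shape)                # (1, 2): one row per unit, one column per example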

4.2 - Linear-Activation Forward

Two activation functions are used in this DNN:

  • Sigmoid: $\sigma(Z) = \sigma(WA + b) = \frac{1}{1 + e^{-(WA + b)}}$.
    This function was already implemented in the helper-function section above as sigmoid(). It returns two values: the activation value "A" and a "cache" that contains "Z" (the function's input).
A, activation_cache = sigmoid(Z)
  • ReLU: $A = \mathrm{ReLU}(Z) = \max(0, Z)$.
    This was also provided in the helper-function section as relu(). It likewise returns two values: the activation value "A" and a "cache" containing "Z".
A, activation_cache = relu(Z)

Next, combine the linear and activation steps above into a single function,
(LINEAR -> ACTIVATION) = linear_activation_forward().

Mathematically:
$$A^{[l]} = g(Z^{[l]}) = g(W^{[l]} A^{[l-1]} + b^{[l]})$$
where the activation function "g" can be sigmoid() or relu().

# GRADED FUNCTION: linear_activation_forward

def linear_activation_forward(A_prev, W, b, activation):
    """
    A_prev -- activations from the previous layer (or the input data): (size of previous layer, number of examples)
    W -- weights matrix:(size of current layer, size of previous layer)
    b -- bias vector:(size of the current layer, 1)
    activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"
    Returns:
    A -- the output of the activation function (post-activation value)
    cache -- a python tuple containing "linear_cache" and "activation_cache";
    """
    # Inputs: "A_prev, W, b". Outputs: "A, activation_cache".
    Z, linear_cache = linear_forward(A_prev, W, b)

    if activation == "sigmoid":
        A, activation_cache = sigmoid(Z)
    elif activation == "relu":
        A, activation_cache = relu(Z)
    cache = (linear_cache,activation_cache)
    return A, cache
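
A small sketch with arbitrary shapes, just to show how the activation is selected by the string argument:

A_prev_demo = np.random.randn(3, 2)
W_demo = np.random.randn(1, 3)
b_demo = np.random.randn(1, 1)
A_sig, _ = linear_activation_forward(A_prev_demo, W_demo, b_demo, activation="sigmoid")
A_rel, _ = linear_activation_forward(A_prev_demo, W_demo, b_demo, activation="relu")
print(A_sig.shape, A_rel.shape)    # both (1, 2)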

4.3 - L-Layer Model

Putting it all together: to implement an $L$-layer neural net, call linear_activation_forward with RELU $L-1$ times, then call linear_activation_forward with SIGMOID once for the output layer.

Exercise: Implement L_model_forward(X, parameters).

Instruction: In the code below, the variable AL denotes $A^{[L]} = \sigma(Z^{[L]}) = \sigma(W^{[L]} A^{[L-1]} + b^{[L]})$.
(This is sometimes also written Yhat, i.e. $\hat{Y}$.)

Tips:
- Call linear_activation_forward (L-1) times in a loop
- Don't forget to keep track of the intermediate results in the "caches" list

# GRADED FUNCTION: L_model_forward

def L_model_forward(X, parameters):
    """
    Implement forward propagation for the [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID computation
    X -- data, numpy array of shape (input size, number of examples)
    parameters -- output of initialize_parameters_deep()
    Returns:
    AL -- last post-activation value
    caches -- list of caches containing:
                every cache of linear_relu_forward() (there are L-1 of them, indexed from 0 to L-2)
                the cache of linear_sigmoid_forward() (there is one, indexed L-1)
    """
    caches = []
    A = X
    L = len(parameters) // 2                  # number of layers in the neural network

    # Implement [LINEAR -> RELU]*(L-1). Add "cache" to the "caches" list.
    for l in range(1, L):
        A_prev = A 
        W = parameters['W'+str(l)]
        b = parameters['b'+str(l)]
        A, cache = linear_activation_forward(A_prev, W, b, 'relu')
        caches.append(cache)
    # Implement LINEAR -> SIGMOID. Add "cache" to the "caches" list.
    AL, cache = linear_activation_forward(A, parameters['W'+str(L)],
                                         parameters['b'+str(L)], 'sigmoid')
    caches.append(cache)

    assert(AL.shape == (1,X.shape[1]))

    return AL, caches
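
Tying the forward pieces together, here is a hedged end-to-end shape check with a made-up 4-3-1 network and 2 examples (the seed and sizes are arbitrary):

np.random.seed(6)                                  # arbitrary seed for this sketch
X_demo = np.random.randn(4, 2)                     # 4 features, 2 examples
params_demo = initialize_parameters_deep([4, 3, 1])
AL_demo, caches_demo = L_model_forward(X_demo, params_demo)
print(AL_demo.shape, len(caches_demo))             # (1, 2) and 2 caches, one per layer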

Great! We have now implemented the full forward pass: starting from the input $X$, it outputs the final activation vector $A^{[L]}$. Next, we use $A^{[L]}$ to compute the cost.

5 - Cost function

Before moving on to backward propagation, we first compute the cost.

Exercise: Compute the cross-entropy cost $J$, using the following formula:

$$J = -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log\left(a^{[L](i)}\right) + (1 - y^{(i)}) \log\left(1 - a^{[L](i)}\right) \right) \tag{7}$$

# GRADED FUNCTION: compute_cost

def compute_cost(AL, Y):
    """
    Implement the cost function defined by equation (7).

    Arguments:
    AL -- probability vector corresponding to the label predictions, shape (1, number of examples)
    Y -- true "label" vector, shape (1, number of examples)

    Returns:
    cost -- cross-entropy cost
    """

    m = Y.shape[1]

    # Compute loss from aL and y.
    cost = -1*(np.dot(Y, np.log(AL.T))+np.dot(np.log(1-AL), (1-Y).T))/m

    cost = np.squeeze(cost)      # To make sure your cost's shape is what we expect (e.g. this turns [[17]] into 17).
    assert(cost.shape == ())

    return cost
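
A tiny worked example (labels and predictions invented for illustration):

Y_demo = np.array([[1, 1, 0]])
AL_demo = np.array([[0.8, 0.9, 0.4]])
print(compute_cost(AL_demo, Y_demo))   # about 0.2798, i.e. -(log(0.8)+log(0.9)+log(0.6))/3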

6 - Backward propagation module

Backward propagation computes the gradient of the loss function with respect to the parameters.

Reminder: in a two-layer network, the chain rule for the derivative of the loss with respect to $z^{[1]}$ is:

$$\frac{\partial \mathcal{L}(a^{[2]}, y)}{\partial z^{[1]}} = \frac{\partial \mathcal{L}(a^{[2]}, y)}{\partial a^{[2]}} \cdot \frac{\partial a^{[2]}}{\partial z^{[2]}} \cdot \frac{\partial z^{[2]}}{\partial a^{[1]}} \cdot \frac{\partial a^{[1]}}{\partial z^{[1]}}$$

Ultimately, the parameters we want to update are $W$ and $b$, so one more step of the chain rule is needed:

  • $dW^{[1]} = \frac{\partial \mathcal{L}}{\partial W^{[1]}} = dz^{[1]} \cdot \frac{\partial z^{[1]}}{\partial W^{[1]}}$
  • $db^{[1]} = \frac{\partial \mathcal{L}}{\partial b^{[1]}} = dz^{[1]} \cdot \frac{\partial z^{[1]}}{\partial b^{[1]}}$

Backpropagation is built in three steps:
- LINEAR backward
- LINEAR -> ACTIVATION backward, where ACTIVATION computes the derivative of either the ReLU or sigmoid activation
- [LINEAR -> RELU] × (L-1) -> LINEAR -> SIGMOID backward (whole model)

6.1 - Linear backward

For layer $l$, the linear part is: $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$ (followed by an activation).

Suppose we have already computed the derivative $dZ^{[l]} = \frac{\partial \mathcal{L}}{\partial Z^{[l]}}$. We now want to compute $(dW^{[l]}, db^{[l]}, dA^{[l-1]})$.

The three outputs $(dW^{[l]}, db^{[l]}, dA^{[l-1]})$ are computed from the input $dZ^{[l]}$ using the following formulas:

$$dW^{[l]} = \frac{\partial \mathcal{L}}{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l-1]T} \tag{8}$$

$$db^{[l]} = \frac{\partial \mathcal{L}}{\partial b^{[l]}} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[l](i)} \tag{9}$$

$$dA^{[l-1]} = \frac{\partial \mathcal{L}}{\partial A^{[l-1]}} = W^{[l]T} dZ^{[l]} \tag{10}$$

Exercise: Use the 3 formulas above to implement linear_backward().

# GRADED FUNCTION: linear_backward

def linear_backward(dZ, cache):
    """
    Arguments:
    dZ -- Gradient of the cost with respect to the linear output (of current layer l)
    cache -- tuple of values (A_prev, W, b) coming from the forward propagation in the current layer
    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    A_prev, W, b = cache
    m = A_prev.shape[1]

    dW = np.dot(dZ, A_prev.T)/m
    db = np.sum(dZ, axis=1,keepdims=True)/m
    dA_prev = np.dot(W.T, dZ)
    
    assert (dA_prev.shape == A_prev.shape)
    assert (dW.shape == W.shape)
    assert (db.shape == b.shape)

    return dA_prev, dW, db
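
A quick shape check with random placeholder values (1 unit in the current layer, 3 units in the previous layer, 2 examples):

dZ_demo = np.random.randn(1, 2)
linear_cache_demo = (np.random.randn(3, 2),   # A_prev
                     np.random.randn(1, 3),   # W
                     np.random.randn(1, 1))   # b
dA_prev_demo, dW_demo, db_demo = linear_backward(dZ_demo, linear_cache_demo)
print(dA_prev_demo.shape, dW_demo.shape, db_demo.shape)   # (3, 2) (1, 3) (1, 1)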

6.2 - Linear-Activation backward

Next, implement linear_activation_backward, which merges the linear backward step with the backward step of the activation.

Implementing it requires the two helper functions written earlier:

  • sigmoid_backward: Implements the backward propagation for SIGMOID unit.  
dZ = sigmoid_backward(dA, activation_cache)
  • relu_backward: Implements the backward propagation for RELU unit. 
dZ = relu_backward(dA, activation_cache)

If $g(\cdot)$ is the activation function,
sigmoid_backward and relu_backward compute

$$dZ^{[l]} = dA^{[l]} * g'(Z^{[l]}) \tag{11}$$

Exercise: Implement the backpropagation for the LINEAR->ACTIVATION layer.

# GRADED FUNCTION: linear_activation_backward

def linear_activation_backward(dA, cache, activation):
    """
    Implement the backward propagation for the LINEAR->ACTIVATION layer.
    Arguments:
    dA -- post-activation gradient for current layer l 
    cache -- tuple of values (linear_cache, activation_cache) we store for computing backward propagation efficiently
    activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"

    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    linear_cache, activation_cache = cache

    if activation == "relu":

        dZ = relu_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
    elif activation == "sigmoid":
         
        dZ = sigmoid_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
    return dA_prev, dW, db
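
The same kind of sketch works here; the only difference is that the cache is now a (linear_cache, activation_cache) tuple, so we build a made-up one by hand:

dA_demo = np.random.randn(1, 2)
cache_demo = ((np.random.randn(3, 2), np.random.randn(1, 3), np.random.randn(1, 1)),  # linear_cache: (A_prev, W, b)
              np.random.randn(1, 2))                                                  # activation_cache: Z
dA_prev_demo, dW_demo, db_demo = linear_activation_backward(dA_demo, cache_demo, activation="relu")
print(dA_prev_demo.shape, dW_demo.shape, db_demo.shape)   # (3, 2) (1, 3) (1, 1)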

6.3 - L-Model Backward

Now we implement backward propagation for the whole network. Recall that in L_model_forward, at every layer we stored a cache containing (X (or A), W, b, and Z). In the backward module we use these values, iterating backwards from the last layer $L$, to compute the gradient of every layer's parameters; each iteration of learning then uses these gradients to update the parameters.

To start backpropagation we need $dA^{[L]}$, the gradient of the cost with respect to the last activation:

  • $A^{[L]} = \sigma(Z^{[L]})$.
  • $dA^{[L]} = \frac{\partial \mathcal{L}}{\partial A^{[L]}}$. For the cross-entropy cost this is:
dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL)) # derivative of cost with respect to AL

Each gradient is then stored in the grads dictionary:

$$grads["dW" + str(l)] = dW^{[l]} \tag{15}$$

For example, for $l = 3$ this would store $dW^{[3]}$ in grads["dW3"].

# GRADED FUNCTION: L_model_backward

def L_model_backward(AL, Y, caches):
    """
    Implement the backward propagation for the [LINEAR->RELU] * (L-1) -> LINEAR -> SIGMOID group
    Arguments:
    AL -- probability vector, output of the forward propagation (L_model_forward())
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat)
    caches -- list of caches containing:
                every cache of linear_activation_forward() with "relu" (it's caches[l], for l in range(L-1) i.e l = 0...L-2)
                the cache of linear_activation_forward() with "sigmoid" (it's caches[L-1])

    Returns:
    grads -- A dictionary with the gradients
             grads["dA" + str(l)] = ...
             grads["dW" + str(l)] = ...
             grads["db" + str(l)] = ...
    """
    grads = {}
    L = len(caches) # the number of layers
    m = AL.shape[1]
    Y = Y.reshape(AL.shape) # after this line, Y is the same shape as AL

    # Initializing the backpropagation

    dAL = - (np.divide(Y, AL)- np.divide(1-Y, 1-AL))


    # Lth layer (SIGMOID -> LINEAR) gradients. Inputs: "dAL, current_cache". Outputs: "grads["dA" + str(L-1)], grads["dW" + str(L)], grads["db" + str(L)]"

    current_cache = caches[L-1]
    grads["dA" + str(L-1)], grads["dW" + str(L)], grads["db" + str(L)] = linear_activation_backward(dAL, current_cache, 'sigmoid')


    # Loop from l=L-2 down to l=0
    for l in reversed(range(L - 1)):
        # lth layer: (RELU -> LINEAR) gradients.
        # Inputs: "grads["dA" + str(l + 1)], current_cache". Outputs: "grads["dA" + str(l)], grads["dW" + str(l + 1)], grads["db" + str(l + 1)]"
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = linear_activation_backward(grads['dA' + str(l + 1)], current_cache, 'relu')
        # dA_prev_temp is the gradient w.r.t. the *previous* layer's activation, so store it under index l, not l+1
        grads["dA" + str(l)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp

    return grads
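
A hedged end-to-end check, reusing the hypothetical forward pass from the sketch in section 4 (all values are arbitrary; only the shapes and dictionary keys matter):

np.random.seed(6)
X_demo = np.random.randn(4, 2)
Y_demo = np.array([[1, 0]])
params_demo = initialize_parameters_deep([4, 3, 1])
AL_demo, caches_demo = L_model_forward(X_demo, params_demo)
grads_demo = L_model_backward(AL_demo, Y_demo, caches_demo)
print(sorted(grads_demo.keys()))   # ['dA0', 'dA1', 'dW1', 'dW2', 'db1', 'db2']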

6.4 - Update Parameters

In this section you will update the parameters of the model, using gradient descent:

$$W^{[l]} = W^{[l]} - \alpha \; dW^{[l]} \tag{16}$$

$$b^{[l]} = b^{[l]} - \alpha \; db^{[l]} \tag{17}$$

where α is the learning rate. After computing the updated parameters, store them in the parameters dictionary.

Exercise: Implement update_parameters() to update your parameters using gradient descent.

Instructions:
Update parameters using gradient descent on every $W^{[l]}$ and $b^{[l]}$ for $l = 1, 2, \ldots, L$.

# GRADED FUNCTION: update_parameters

def update_parameters(parameters, grads, learning_rate):
    """
    Update parameters using gradient descent

    Arguments:
    parameters -- python dictionary containing your parameters 
    grads -- python dictionary containing your gradients, output of L_model_backward
    learning_rate -- learning rate used in the gradient descent update rule

    Returns:
    parameters -- python dictionary containing your updated parameters 
                  parameters["W" + str(l)] = ... 
                  parameters["b" + str(l)] = ...
    """

    L = len(parameters) // 2 # number of layers in the neural network

    # Update rule for each parameter. Use a for loop.

    for l in range(L):
        parameters["W" + str(l+1)] = parameters["W"+str(l+1)]-learning_rate*grads['dW'+str(l+1)]
        parameters["b" + str(l+1)] = parameters["b"+str(l+1)]-learning_rate*grads['db'+str(l+1)]


    return parameters
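
To see how all the pieces fit together, here is a hedged sketch of a single gradient-descent iteration on made-up data (the real training loop, with many iterations and a real dataset, is the subject of the next assignment):

np.random.seed(6)
X_demo = np.random.randn(4, 2)                                 # 4 features, 2 hypothetical examples
Y_demo = np.array([[1, 0]])
params_demo = initialize_parameters_deep([4, 3, 1])
AL_demo, caches_demo = L_model_forward(X_demo, params_demo)    # forward pass
cost_demo = compute_cost(AL_demo, Y_demo)                      # cross-entropy cost
grads_demo = L_model_backward(AL_demo, Y_demo, caches_demo)    # backward pass
params_demo = update_parameters(params_demo, grads_demo, learning_rate=0.01)
print(cost_demo)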

7 - Conclusion

Congrats on implementing all the functions required for building a deep neural network!

We know it was a long assignment but going forward it will only get better. The next part of the assignment is easier.

In the next assignment you will put all these together to build two models:
- A two-layer neural network
- An L-layer neural network

You will in fact use these models to classify cat vs non-cat images!


Reposted from blog.csdn.net/u014281392/article/details/80343408