DL-C1-week3-1 (build a neural network with one hidden layer): a simple multilayer perceptron implementation

Build a neural network with one hidden layer


The previous post covered how to implement a logistic regression classifier. A neural network is actually very similar to LR: you can think of a neural network as several LR units stacked on top of each other. Once you understand logistic regression, neural networks are not hard to understand.

What this post covers

  • Implement a binary-classification neural network with a single hidden layer
  • Use the tanh function as the nonlinear activation of the hidden units
  • Compute the cross-entropy loss
  • Implement forward and backward propagation
  • Update the parameters
  • Choose the hyperparameters

1 - Packages

  • numpy
  • sklearn:scikit-learn
  • matplotlib
     
# Package imports
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.datasets
import sklearn.linear_model

%matplotlib inline

np.random.seed(1) # set a seed so that the results are consistent

2. Helper functions

# Generate the training data
def create_dataset(m = 400, D = 2):
    """
    m : number of examples
    D : number of features
    N : number of points per class
    X : data matrix, each row is a single example
    Y : label vector
    """
    np.random.seed(1)
    N = int(m/2)  
    X = np.zeros((m,D))
    Y = np.zeros((m,1), dtype='uint8') # (0 for red, 1 for blue)
    a = 4 # maximum radius of the flower
    for j in range(2):
        ix = range(N*j,N*(j+1))
        t = np.linspace(j*3.12,(j+1)*3.12,N) + np.random.randn(N)*0.2 # theta
        r = a*np.sin(4*t) + np.random.randn(N)*0.2 # radius
        X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
        Y[ix] = j

    # X:shape(m, D)
    # Y:shape(m, 1)
    return X, Y 
# Plot the model's decision boundary
def plot_decision_boundary(model, X, y):
    # Set min and max values and give it some padding
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    h = 0.01
    # Generate a grid of points with distance h between them
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    # Predict the function value for the whole grid
    Z = model(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot the contour and training examples
    plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
    plt.ylabel('x2')
    plt.xlabel('x1')
    plt.scatter(X[:,0], X[:,1], c=y[:,0], cmap=plt.cm.Spectral)

3. Create and inspect the dataset

  • data-generation function: create_dataset()
  • randomly generates training data belonging to two classes

3.1 generate data

X, Y = create_dataset(400, 2)

print ('The shape of X is: ' + str(X.shape))
print ('The shape of Y is: ' + str(Y.shape))
print ('We have m = %d training examples!' % (X.shape[0]))
The shape of X is: (400, 2)
The shape of Y is: (400, 1)
We have m = 400 training examples!

3.2 visualize dataset

  • Goal: build a model that fits this data
# Visualize the data:
plt.scatter(X[:, 0], X[:, 1], c=Y[:,0], s=30, cmap=plt.cm.Spectral);

(Figure: scatter plot of the flower-shaped dataset, red and blue points)

The training dataset:

  • a numpy array (matrix) X containing the features (x1, x2)
  • a numpy array (vector) Y containing the labels (red: 0, blue: 1).

4. Simple Logistic Regression

  Before implementing the fully connected network, let's first fit the data with a logistic regression classifier to see how LR performs on this problem. With sklearn, training a logistic regression model takes just two lines of code.

4.1 train logistic regression classifier

# Train the logistic regression classifier
clf = sklearn.linear_model.LogisticRegressionCV()
clf.fit(X, Y.reshape(X.shape[0],))
LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
           refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0)

4.2 plot decision boundary

# Plot the decision boundary for logistic regression
plot_decision_boundary(lambda x: clf.predict(x), X, Y)
plt.title("Logistic Regression")

# Print accuracy
LR_predictions = clf.predict(X)
print ('Accuracy of logistic regression: %d ' % float((np.dot(Y[:,0],LR_predictions) + np.dot(1-Y[:,0],1-LR_predictions))/float(Y.size)*100) +
       '% ' + "(percentage of correctly labelled datapoints)")
Accuracy of logistic regression: 47 % (percentage of correctly labelled datapoints)

(Figure: decision boundary learned by logistic regression)

Output:

Accuracy 47%

The classification accuracy is only 47%: logistic regression does not fit this data well, because the dataset is not linearly separable. Next we classify the data with a neural network. Let's try this now!

5 - Neural Network model

We will build a neural network with a single hidden layer.

Here is our model:
(Figure: the model, a 2-unit input layer, one hidden layer of tanh units, and a sigmoid output unit)

Mathematically:

For one example $x^{(i)}$:

$$z^{[1](i)} = W^{[1]} x^{(i)} + b^{[1]} \tag{1}$$

$$a^{[1](i)} = \tanh(z^{[1](i)}) \tag{2}$$

$$z^{[2](i)} = W^{[2]} a^{[1](i)} + b^{[2]} \tag{3}$$

$$\hat{y}^{(i)} = a^{[2](i)} = \sigma(z^{[2](i)}) \tag{4}$$

$$y^{(i)}_{prediction} = \begin{cases} 1 & \text{if } a^{[2](i)} > 0.5 \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

Over all $m$ examples, the cost $J$ is:

$$J = -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log\left(a^{[2](i)}\right) + (1 - y^{(i)}) \log\left(1 - a^{[2](i)}\right) \right) \tag{6}$$
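To make the notation concrete, here is a minimal numpy sketch of equations (1)-(6) for a single example. It is illustration only; the *_demo names are made up for this sketch and are not part of the assignment code that follows.

# Illustration only: equations (1)-(6) for a single example x of shape (n_x, 1).
# The *_demo names are hypothetical; the real implementation follows below.
import numpy as np

n_x_demo, n_h_demo, n_y_demo = 2, 4, 1
x_demo = np.random.randn(n_x_demo, 1)              # one example as a column vector
y_demo = 1                                         # its true label

W1_demo = np.random.randn(n_h_demo, n_x_demo) * 0.01
b1_demo = np.zeros((n_h_demo, 1))
W2_demo = np.random.randn(n_y_demo, n_h_demo) * 0.01
b2_demo = np.zeros((n_y_demo, 1))

z1_demo = np.dot(W1_demo, x_demo) + b1_demo        # (1)
a1_demo = np.tanh(z1_demo)                         # (2)
z2_demo = np.dot(W2_demo, a1_demo) + b2_demo       # (3)
a2_demo = 1. / (1. + np.exp(-z2_demo))             # (4) sigmoid
a2_value = a2_demo.item()                          # the single predicted probability
y_hat_demo = 1 if a2_value > 0.5 else 0            # (5)
loss_demo = -(y_demo * np.log(a2_value)
              + (1 - y_demo) * np.log(1 - a2_value))  # one term of the sum in (6)
print(a2_value, y_hat_demo, loss_demo)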

Reminder: the steps to build a neural network:

  1. Define the network structure (# of input units, # of hidden units, etc.).
  2. Initialize the model's parameters.
  3. Loop:

    • forward propagation
    • compute the loss
    • backward propagation to obtain the gradients
    • update the parameters (gradient descent)

We implement helper functions for steps 1-3, then combine them in nn_model(), and finally train the model, learn the parameters, and make predictions on new data.

5.1 - Defining the neural network structure

Exercise: Define three variables:

  • n_x: the size of the input layer
  • n_h: the size of the hidden layer (set this to 4)
  • n_y: the size of the output layer
# GRADED FUNCTION: layer_sizes

def layer_sizes(X, Y):
    """
    Arguments:
    X -- input dataset of shape (number of examples, number of features)
    Y -- labels of shape (number of examples, 1)
    Returns:
    n_x -- the size of the input layer
    n_h -- the size of the hidden layer
    n_y -- the size of the output layer
    """
    n_x = X.shape[1]
    n_h = 4 # hard code
    n_y = Y.shape[1]
     
    return (n_x, n_h, n_y)

Test the function:

(n_x, n_h, n_y) = layer_sizes(X, Y)
print("The size of the input layer is: n_x = " + str(n_x))
print("The size of the hidden layer is: n_h = " + str(n_h))
print("The size of the output layer is: n_y = " + str(n_y))
The size of the input layer is: n_x = 2
The size of the hidden layer is: n_h = 4
The size of the output layer is: n_y = 1

5.2 - Initialize the model’s parameters

Initialize the parameters with the function initialize_parameters().

Initialization methods:

  • Random initialization
    • Use np.random.randn(a,b) * 0.01 to randomly initialize a matrix of shape (a,b).
  • All-zero initialization
    • Use np.zeros((a,b)) to initialize a matrix of shape (a,b) with zeros.
  • Try both initialization methods and observe their effect on the model.
# GRADED FUNCTION: initialize_parameters
# Two initialization schemes are supported via the flag argument
def initialize_parameters(n_x, n_h, n_y, flag=0):
    """
    Argument:
    n_x -- size of the input layer
    n_h -- size of the hidden layer
    n_y -- size of the output layer
    flag -- 0: random initialization, 1: all-zero initialization
    Returns:
    params -- python dictionary containing your parameters:
                    W1 -- weight matrix of shape (n_h, n_x)
                    b1 -- bias vector of shape (n_h, 1)
                    W2 -- weight matrix of shape (n_y, n_h)
                    b2 -- bias vector of shape (n_y, 1)
    """

    np.random.seed(2) # set a seed so the random initialization is reproducible
    if flag:
        W1 = np.zeros((n_h, n_x))
        b1 = np.zeros((n_h, 1))
        W2 = np.zeros((n_y, n_h))
        b2 = np.zeros((n_y, 1))
    else : 
        W1 = np.random.randn(n_h, n_x)*0.01
        b1 = np.zeros((n_h, 1))
        W2 = np.random.randn(n_y, n_h)*0.01
        b2 = np.zeros((n_y, 1))

    assert (W1.shape == (n_h, n_x))
    assert (b1.shape == (n_h, 1))
    assert (W2.shape == (n_y, n_h))
    assert (b2.shape == (n_y, 1))

    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}
    return parameters

Test initialize_parameters():

  • random initialization
  • all-zero initialization
parameters = initialize_parameters(n_x, n_h, n_y)
print("W1 : shape " + str(parameters["W1"].shape))
print("b1 : shape " + str(parameters["b1"].shape))
print("W2 : shape " + str(parameters["W2"].shape))
print("b2 : shape " + str(parameters["b2"].shape))
print('------------')
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))
W1 : shape (4, 2)
b1 : shape (4, 1)
W2 : shape (1, 4)
b2 : shape (1, 1)
------------
W1 = [[-0.00416758 -0.00056267]
 [-0.02136196  0.01640271]
 [-0.01793436 -0.00841747]
 [ 0.00502881 -0.01245288]]
b1 = [[ 0.]
 [ 0.]
 [ 0.]
 [ 0.]]
W2 = [[-0.01057952 -0.00909008  0.00551454  0.02292208]]
b2 = [[ 0.]]
parameters_0 = initialize_parameters(n_x, n_h, n_y, 1)
print("W1 = " + str(parameters_0["W1"]))
print("b1 = " + str(parameters_0["b1"]))
print("W2 = " + str(parameters_0["W2"]))
print("b2 = " + str(parameters_0["b2"]))
W1 = [[ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]]
b1 = [[ 0.]
 [ 0.]
 [ 0.]
 [ 0.]]
W2 = [[ 0.  0.  0.  0.]]
b2 = [[ 0.]]

5.3 - The Loop

Forward propagation: forward_propagation()

Activation functions used and values to compute:

  • sigmoid(), which we implement ourselves
  • np.tanh(), provided by numpy
  • $Z^{[1]}$, $A^{[1]}$, $Z^{[2]}$ and $A^{[2]}$ ($A^{[2]}$ contains the predictions for all examples)
  • these intermediate results are cached because backpropagation needs them

5.3.1 forward propagation

# Function: sigmoid()

def sigmoid(z):
    return 1./(1 + np.exp(-z))
# GRADED FUNCTION: forward_propagation

def forward_propagation(X, parameters):
    """
    Argument:
    X -- input data of size (m, n_x)
    parameters -- python dictionary containing parameters  
    Returns:
    A2 -- The sigmoid output of the second activation
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2"
    """
    W1 = parameters['W1'] # (4,2)
    b1 = parameters['b1'] # (4,1)
    W2 = parameters['W2'] # (1,4)
    b2 = parameters['b2'] # (1,1)

    Z1 = np.dot(W1, X.T) + b1  #(n_h, X.shape[0])
    A1 = np.tanh(Z1)           
    Z2 = np.dot(W2, A1) + b2   #(n_y, X.shape[0])
    A2 = sigmoid(Z2)

    assert(A2.shape == (1, X.shape[0]))

    cache = {"Z1": Z1,
             "A1": A1,
             "Z2": Z2,
             "A2": A2}

    return A2, cache

Test forward_propagation() with some test data:

# Test data
X_assess = np.random.randn(3, 2)

parameters = {'W1': np.array([[-0.00416758, -0.00056267],
                              [-0.02136196,  0.01640271],
                              [-0.01793436, -0.00841747],
                              [ 0.00502881, -0.01245288]]),
              'W2': np.array([[-0.01057952, -0.00909008,  0.00551454,  0.02292208]]),
              'b1': np.array([[ 0.],
                              [ 0.],
                              [ 0.],
                              [ 0.]]),
              'b2': np.array([[ 0.]])}
A2, cache = forward_propagation(X_assess, parameters)
print('Z1 ,shape = '+str(cache['Z1'].shape))
print('A1 ,shape = '+str(cache['A1'].shape))
print('Z2 ,shape = '+str(cache['Z2'].shape))
print('A2 ,shape = '+str(cache['A2'].shape))
Z1 ,shape = (4, 3)
A1 ,shape = (4, 3)
Z2 ,shape = (1, 3)
A2 ,shape = (1, 3)

5.3.2 cost function

$A^{[2]}$ (the Python variable "A2") holds one value $a^{[2](i)}$ per example: the model's predicted output for example $i$.

  • The cost function is:

$$J = -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log\left(a^{[2](i)}\right) + (1 - y^{(i)}) \log\left(1 - a^{[2](i)}\right) \right)$$

  • compute_cost() computes the cost $J$.

  • One way to compute $\sum_{i=1}^{m} y^{(i)} \log(a^{[2](i)})$ with numpy:

logprobs = np.multiply(np.log(A2),Y)
cost = - np.sum(logprobs)                # no need to use a for loop!
  • Equivalently, np.dot(np.log(A2), Y) computes the same sum as a matrix product.
# GRADED FUNCTION: compute_cost

def compute_cost(A2, Y, parameters):
    """
    Computes the cross-entropy cost given in equation (6)

    Arguments:
    A2 -- The sigmoid output of shape (1, number of examples)
    Y -- "true" labels vector of shape (number of examples,1)
    parameters -- python dictionary containing your parameters W1, b1, W2 and b2

    Returns:
    cost -- cross-entropy cost given by equation (6)
    """

    m = Y.shape[0] # number of examples

    # Compute the cross-entropy cost
    # logprobs = np.multiply(np.log(A2), Y.T) + np.multiply((1-Y.T),np.log(1-A2))
    # cost = -1*np.sum(logprobs)/m
    cost = -1*(np.dot(np.log(A2), Y) + np.dot(np.log(1-A2), (1-Y)))/m
    cost = np.squeeze(cost)     
    # makes sure cost is the dimension we expect. 
    # E.g., turns [[17]] into 17 
    #assert(isinstance(cost, float))

    return cost
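One caveat about compute_cost(): if A2 ever reaches exactly 0 or 1, np.log returns -inf. A clipped variant keeps the cost finite; this is a sketch, not part of the original assignment, and it drops the unused parameters argument.

# Sketch of a numerically safer cost: clip A2 away from 0 and 1 before the log.
def compute_cost_safe(A2, Y, eps=1e-12):
    m = Y.shape[0]
    A2 = np.clip(A2, eps, 1 - eps)
    cost = -(np.dot(np.log(A2), Y) + np.dot(np.log(1 - A2), 1 - Y)) / m
    return float(np.squeeze(cost))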

Test the cost function

  • test data: A2, Y_assess, parameters
Y_assess = np.random.randn(3, 1)
parameters = {'W1': np.array([[-0.00416758, -0.00056267],
                              [-0.02136196,  0.01640271],
                              [-0.01793436, -0.00841747],
                              [ 0.00502881, -0.01245288]]),
              'W2': np.array([[-0.01057952, -0.00909008,  0.00551454,  0.02292208]]),
              'b1': np.array([[ 0.],[ 0.],[ 0.],[ 0.]]),
              'b2': np.array([[ 0.]])}
A2 = (np.array([[ 0.5002307 ,  0.49985831,  0.50023963]]))
print("cost = " + str(compute_cost(A2, Y_assess, parameters)))
cost = 0.6934522895013014

5.3.3 backward propagation.

Backward propagation: backward_propagation()

Implement the six equations of backpropagation.

  • The superscript $(i)$ denotes the $i$-th example.

$$\frac{\partial J}{\partial z_2^{(i)}} = a^{[2](i)} - y^{(i)} \tag{1}$$

$$\frac{\partial J}{\partial W_2} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial J}{\partial z_2^{(i)}} \, a^{[1](i)T} \tag{2}$$

$$\frac{\partial J}{\partial b_2} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial J}{\partial z_2^{(i)}} \tag{3}$$

  • $*$ denotes the element-wise product of two vectors, which returns a vector of the same size.
  • $\tanh$: if $a = \tanh(z)$, then $\tanh'(z) = 1 - a^2$.

$$\frac{\partial J}{\partial z_1^{(i)}} = W_2^T \frac{\partial J}{\partial z_2^{(i)}} * (1 - a^{[1](i)2}) \tag{4}$$

$$\frac{\partial J}{\partial W_1} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial J}{\partial z_1^{(i)}} \, x^{(i)T} \tag{5}$$

$$\frac{\partial J}{\partial b_1} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial J}{\partial z_1^{(i)}} \tag{6}$$

  • The same six equations in vectorized (matrix) form (here $X$ has shape $(m, n_x)$, so equation (5) needs no transpose):

$$dZ^{[2]} = A^{[2]} - Y \tag{1}$$

$$dW^{[2]} = \frac{1}{m} dZ^{[2]} A^{[1]T} \tag{2}$$

$$db^{[2]} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[2](i)} \tag{3}$$

$$dZ^{[1]} = W^{[2]T} dZ^{[2]} * (1 - A^{[1]2}) \tag{4}$$

$$dW^{[1]} = \frac{1}{m} dZ^{[1]} X \tag{5}$$

$$db^{[1]} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[1](i)} \tag{6}$$

  • The notation used in the code is common in deep learning:
    • dW1 = $\frac{\partial J}{\partial W_1}$
    • db1 = $\frac{\partial J}{\partial b_1}$
    • dW2 = $\frac{\partial J}{\partial W_2}$
    • db2 = $\frac{\partial J}{\partial b_2}$
# GRADED FUNCTION: backward_propagation

def backward_propagation(parameters, cache, X, Y):
    """
    Arguments:
    parameters -- python dictionary containing our parameters 
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2".
    X -- input data of shape (number of examples, 2)
    Y -- "true" labels vector of shape (number of examples, 1)
    Returns:
    grads -- python dictionary containing your gradients with respect to different parameters
    """
    # X.shape = (400, 2)
    # Y.shape = (400, 1)
    m = X.shape[0]
    W1 = parameters['W1'] # W1.shape = (4, n_x)
    W2 = parameters['W2'] # W2.shape = (n_y, 4)

    A1 = cache['A1']      # A1.shape = (4, m)
    A2 = cache['A2']      # A2.shape = (n_y, m)
    # Backward propagation: calculate dW1, db1, dW2, db2. 
    dZ2 = A2 - Y.T                                              # dZ2.shape = (n_y, m)
    dW2 = np.dot(dZ2, A1.T)/m                                   # dW2.shape = (n_y, n_h)
    db2 = np.sum(dZ2, axis=1, keepdims=True)/m                  # db2.shape = (n_y, 1)
    dZ1 = np.multiply(np.dot(W2.T, dZ2), (1 - np.power(A1, 2))) # dZ1.shape = (4, m)
    dW1 = np.dot(dZ1, X)/m                                      # dW1.shape = (n_h, n_x)
    db1 = np.sum(dZ1, axis=1, keepdims=True)/m                  # db1.shape = (n_h, 1)

    grads = {"dW1": dW1,
             "db1": db1,
             "dW2": dW2,
             "db2": db2}

    return grads

Test the backward_propagation function.
Test data:

  • parameters (same as above)
  • cache
  • X_assess
  • Y_assess
X_assess = np.random.randn(3, 2)
Y_assess = np.random.randn(3, 1)
cache = {'A1': np.array([[-0.00616578,  0.0020626 ,  0.00349619],
                         [-0.05225116,  0.02725659, -0.02646251],
                         [-0.02009721,  0.0036869 ,  0.02883756],
                         [ 0.02152675, -0.01385234,  0.02599885]]),
         'A2': np.array([[ 0.5002307 ,  0.49985831,  0.50023963]]),
         'Z1': np.array([[-0.00616586,  0.0020626 ,  0.0034962 ],
                         [-0.05229879,  0.02726335, -0.02646869],
                         [-0.02009991,  0.00368692,  0.02884556],
                         [ 0.02153007, -0.01385322,  0.02600471]]),
         'Z2': np.array([[ 0.00092281, -0.00056678,  0.00095853]])}
grads = backward_propagation(parameters, cache, X_assess, Y_assess)
print ("dW1.shape = "+ str(grads["dW1"].shape))
print ("db1.shape = "+ str(grads["db1"].shape))
print ("dW2.shape = "+ str(grads["dW2"].shape))
print ("db2.shape = "+ str(grads["db2"].shape))
dW1.shape = (4, 2)
db1.shape = (4, 1)
dW2.shape = (1, 4)
db2.shape = (1, 1)
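As an extra sanity check (not part of the original assignment), the analytic gradients can be compared against centered finite differences of the cost, computed with the forward_propagation() and compute_cost() functions defined above; the two should agree closely for every parameter. A minimal sketch:

# Gradient check (sketch): compare analytic gradients with finite differences.
def numerical_gradient(name, parameters, X, Y, eps=1e-6):
    """Centered finite-difference estimate of dJ/d(parameters[name])."""
    grad = np.zeros_like(parameters[name])
    it = np.nditer(parameters[name], flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        original = parameters[name][idx]
        parameters[name][idx] = original + eps
        cost_plus = compute_cost(forward_propagation(X, parameters)[0], Y, parameters)
        parameters[name][idx] = original - eps
        cost_minus = compute_cost(forward_propagation(X, parameters)[0], Y, parameters)
        parameters[name][idx] = original   # restore the original value
        grad[idx] = (cost_plus - cost_minus) / (2 * eps)
        it.iternext()
    return grad

A2_check, cache_check = forward_propagation(X_assess, parameters)
grads_check = backward_propagation(parameters, cache_check, X_assess, Y_assess)
for name in ['W1', 'b1', 'W2', 'b2']:
    num = numerical_gradient(name, parameters, X_assess, Y_assess)
    print(name, 'max |analytic - numerical| =', np.max(np.abs(grads_check['d' + name] - num)))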

5.3.4 update parameters

Question: use the gradients (dW1, db1, dW2, db2) to update the parameters (W1, b1, W2, b2).

Gradient descent update rule:

  • $\theta = \theta - \alpha \frac{\partial J}{\partial \theta}$
  • $\alpha$ is the learning rate (a hyperparameter) and $\theta$ stands for any of the parameters.
    Choosing $\alpha$ well matters: a good learning rate lets the model converge to good weights much faster.
    (Figures: gradient descent with a well-chosen learning rate vs. a learning rate that is too large)
# GRADED FUNCTION: update_parameters

def update_parameters(parameters, grads, lr = 1.2):
    # lr : learning_rate
    """
    Arguments:
    parameters -- python dictionary containing your parameters 
    grads -- python dictionary containing your gradients 
    Returns:
    parameters -- python dictionary containing your updated parameters 
    """
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']

    dW1 = grads['dW1']
    db1 = grads['db1']
    dW2 = grads['dW2']
    db2 = grads['db2']

    W1 -= lr*dW1
    b1 -= lr*db1
    W2 -= lr*dW2
    b2 -= lr*db2

    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}

    return parameters

5.4 - Integrate parts 5.1, 5.2 and 5.3 in nn_model()

# GRADED FUNCTION: nn_model

def nn_model(X, Y, n_h, num_iterations = 10000, print_cost=500, flag=0, lr=1.2):
    """
    Arguments:
    X -- dataset of shape (number of examples, 2)
    Y -- labels of shape (number of examples, 1)
    n_h -- size of the hidden layer
    num_iterations -- Number of iterations in gradient descent loop
    print_cost -- print the cost every print_cost iterations (0 disables printing)
    flag -- parameter initialization: 0 for random, 1 for all zeros
    lr -- learning rate
    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """
    np.random.seed(3)
    n_x = layer_sizes(X, Y)[0]
    n_y = layer_sizes(X, Y)[2]

    parameters = initialize_parameters(n_x, n_h, n_y, flag)
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']
    costs = []
    # Loop (gradient descent)
    for i in range(0, num_iterations):
        # Forward propagation. Inputs: "X, parameters". Outputs: "A2, cache".
        A2, cache = forward_propagation(X, parameters)
        # Cost function. Inputs: "A2, Y, parameters". Outputs: "cost".
        cost = compute_cost(A2, Y, parameters)
        # Backpropagation. Inputs: "parameters, cache, X, Y". Outputs: "grads".
        grads = backward_propagation(parameters, cache, X, Y)
        # Gradient descent parameter update. Inputs: "parameters, grads". Outputs: "parameters".
        parameters = update_parameters(parameters, grads, lr)
        costs.append(cost)
        # Print the cost every print_cost iterations
        if print_cost and i % print_cost == 0:
            print ("Cost after iteration %i: %f" %(i, cost))

    return parameters,costs

5.5 Predictions

Predictions:

  • Use forward propagation to compute the predictions.

  • predictions = $y_{prediction} = \mathbb{1}\{activation > 0.5\} = \begin{cases} 1 & \text{if } activation > 0.5 \\ 0 & \text{otherwise} \end{cases}$

# GRADED FUNCTION: predict

def predict(parameters, X):
    """
    parameters -- python dictionary containing your parameters 
    X -- input data of size (m, n_x)

    Returns
    predictions -- vector of predictions of our model (red: 0 / blue: 1)
    """
    A2, cache = forward_propagation(X, parameters)
    predictions = (A2 > 0.5)*1.

    return predictions

6. Training the model (all-zero initialization)

# Build a model with a n_h-dimensional hidden layer
parameters,cost = nn_model(X, Y, n_h = 4, num_iterations=20000, print_cost=1000,lr=1.0,flag=1)
# Plot the decision boundary
plot_decision_boundary(lambda x: predict(parameters, x), X, Y)
plt.title("Decision Boundary for hidden layer size " + str(4))
Cost after iteration 0: 0.693147
Cost after iteration 1000: 0.693147
Cost after iteration 2000: 0.693147
Cost after iteration 3000: 0.693147
Cost after iteration 4000: 0.693147
Cost after iteration 5000: 0.693147
Cost after iteration 6000: 0.693147
Cost after iteration 7000: 0.693147
Cost after iteration 8000: 0.693147
Cost after iteration 9000: 0.693147
Cost after iteration 10000: 0.693147
Cost after iteration 11000: 0.693147
Cost after iteration 12000: 0.693147
Cost after iteration 13000: 0.693147
Cost after iteration 14000: 0.693147
Cost after iteration 15000: 0.693147
Cost after iteration 16000: 0.693147
Cost after iteration 17000: 0.693147
Cost after iteration 18000: 0.693147
Cost after iteration 19000: 0.693147
Text(0.5,1,'Decision Boundary for hidden layer size 4')

(Figure: decision boundary with all-zero initialization)

plt.figure(figsize=(14,6))
plt.title('Cost curve')
plt.grid()
plt.plot(cost)

(Figure: cost curve with all-zero initialization; the curve stays flat at about 0.693)

With all-zero initialization the cost never decreases: all hidden units start out identical and, as the quick check below shows, the gradients for W1, b1 and W2 are exactly zero, so the symmetry is never broken and the model learns nothing.
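A quick check (a sketch, not in the original post) makes this concrete: with all-zero parameters, the gradients for W1, b1 and W2 come out exactly zero, so gradient descent can never break the symmetry.

# With all-zero parameters: A1 = tanh(0) = 0, so dW2 = dZ2·A1.T/m = 0,
# and W2 = 0, so dZ1 = W2.T·dZ2 * (1 - A1**2) = 0, hence dW1 = db1 = 0.
zero_params = initialize_parameters(n_x, n_h, n_y, flag=1)
A2_zero, cache_zero = forward_propagation(X, zero_params)
grads_zero = backward_propagation(zero_params, cache_zero, X, Y)
for name in ['dW1', 'db1', 'dW2', 'db2']:
    print(name, 'max |value| =', np.abs(grads_zero[name]).max())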

6.1 Training the model (random initialization)

parameters,cost = nn_model(X, Y, n_h = 4, num_iterations=20000, print_cost=1000,lr=1.0)
# Plot the decision boundary
plot_decision_boundary(lambda x: predict(parameters, x), X, Y)
plt.title("Decision Boundary for hidden layer size " + str(4))
Cost after iteration 0: 0.693048
Cost after iteration 1000: 0.513391
Cost after iteration 2000: 0.517645
Cost after iteration 3000: 0.516130
Cost after iteration 4000: 0.515024
Cost after iteration 5000: 0.514154
Cost after iteration 6000: 0.513449
Cost after iteration 7000: 0.512871
Cost after iteration 8000: 0.512389
Cost after iteration 9000: 0.511984
Cost after iteration 10000: 0.511640
Cost after iteration 11000: 0.511343
Cost after iteration 12000: 0.511086
Cost after iteration 13000: 0.510861
Cost after iteration 14000: 0.510661
Cost after iteration 15000: 0.510484
Cost after iteration 16000: 0.510325
Cost after iteration 17000: 0.510182
Cost after iteration 18000: 0.510052
Cost after iteration 19000: 0.509933

Text(0.5,1,'Decision Boundary for hidden layer size 4')

plt.figure(figsize=(14,6))
plt.title('Cost curve')
plt.grid()
plt.plot(cost)

(Figure: cost curve with random initialization)

# Print accuracy
predictions = predict(parameters, X)
accuracy = float((np.dot(predictions,Y) + np.dot(1-predictions,1-Y))/float(Y.shape[0])*100)
print ('Accuracy: %d '%accuracy+'%')
Accuracy: 68 %

The model's accuracy is not very high. Three things can be tuned (a grid search over them follows below):

  • increase the number of training iterations
  • increase the number of hidden units
  • find a suitable learning rate
# 49 combinations of (n_h, lr)
n_hs = [3, 5, 7, 9, 10, 20 , 30]
lrs  = [1.5, 2.0, 2.5, 3.0, 3.3, 3.5, 4.0]
params = []
for n_h in n_hs:
    for lr in lrs:
        params.append((n_h, lr))
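The nested loops above can also be written with itertools.product, which yields the same (n_h, lr) pairs in the same order:

# Equivalent construction of the 49 (n_h, lr) combinations with itertools.
import itertools
params = list(itertools.product(n_hs, lrs))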

Train a model for each combination and plot its decision boundary:

# This may take about 2 minutes to run
plt.figure(figsize=(70,70))
Costs = []
for i, (n_h,lr) in enumerate(params):
    plt.subplot(7, 7, i+1)
    plt.title('n_h : %d, lr : %f' % (n_h,lr))
    parameters,costs = nn_model(X, Y, n_h, num_iterations = 10000, lr=lr,print_cost=0)
    Costs.append(costs)
    plot_decision_boundary(lambda x: predict(parameters, x), X, Y)
    predictions = predict(parameters, X)
    accuracy = float((np.dot(predictions,Y) + np.dot(1-predictions,1-Y))/float(Y.shape[0])*100)
    print ("Accuracy for {} hidden units, learning rate: {},{}%".format(n_h,lr, accuracy))
Accuracy for 3 hidden units, learning rate: 1.5,68.5%
Accuracy for 3 hidden units, learning rate: 2.0,67.0%
Accuracy for 3 hidden units, learning rate: 2.5,67.0%
Accuracy for 3 hidden units, learning rate: 3.0,67.0%
Accuracy for 3 hidden units, learning rate: 3.3,68.25%
Accuracy for 3 hidden units, learning rate: 3.5,67.25%
Accuracy for 3 hidden units, learning rate: 4.0,78.0%
Accuracy for 5 hidden units, learning rate: 1.5,67.25%
Accuracy for 5 hidden units, learning rate: 2.0,74.0%
Accuracy for 5 hidden units, learning rate: 2.5,68.75%
Accuracy for 5 hidden units, learning rate: 3.0,92.25%
Accuracy for 5 hidden units, learning rate: 3.3,91.75%
Accuracy for 5 hidden units, learning rate: 3.5,92.5%
Accuracy for 5 hidden units, learning rate: 4.0,92.5%
Accuracy for 7 hidden units, learning rate: 1.5,92.0%
Accuracy for 7 hidden units, learning rate: 2.0,92.75%
Accuracy for 7 hidden units, learning rate: 2.5,92.75%
Accuracy for 7 hidden units, learning rate: 3.0,92.75%
Accuracy for 7 hidden units, learning rate: 3.3,92.5%
Accuracy for 7 hidden units, learning rate: 3.5,92.25%
Accuracy for 7 hidden units, learning rate: 4.0,92.0%
Accuracy for 9 hidden units, learning rate: 1.5,92.0%
Accuracy for 9 hidden units, learning rate: 2.0,92.75%
Accuracy for 9 hidden units, learning rate: 2.5,92.75%
Accuracy for 9 hidden units, learning rate: 3.0,92.75%
Accuracy for 9 hidden units, learning rate: 3.3,92.5%
Accuracy for 9 hidden units, learning rate: 3.5,92.5%
Accuracy for 9 hidden units, learning rate: 4.0,92.75%
Accuracy for 10 hidden units, learning rate: 1.5,92.75%
Accuracy for 10 hidden units, learning rate: 2.0,92.75%
Accuracy for 10 hidden units, learning rate: 2.5,92.75%
Accuracy for 10 hidden units, learning rate: 3.0,92.75%
Accuracy for 10 hidden units, learning rate: 3.3,92.75%
Accuracy for 10 hidden units, learning rate: 3.5,92.75%
Accuracy for 10 hidden units, learning rate: 4.0,93.0%
Accuracy for 20 hidden units, learning rate: 1.5,93.75%
Accuracy for 20 hidden units, learning rate: 2.0,92.75%
Accuracy for 20 hidden units, learning rate: 2.5,93.0%
Accuracy for 20 hidden units, learning rate: 3.0,92.75%
Accuracy for 20 hidden units, learning rate: 3.3,91.0%
Accuracy for 20 hidden units, learning rate: 3.5,91.25%
Accuracy for 20 hidden units, learning rate: 4.0,89.5%
Accuracy for 30 hidden units, learning rate: 1.5,92.75%
Accuracy for 30 hidden units, learning rate: 2.0,93.75%
Accuracy for 30 hidden units, learning rate: 2.5,91.75%
Accuracy for 30 hidden units, learning rate: 3.0,91.5%
Accuracy for 30 hidden units, learning rate: 3.3,92.75%
Accuracy for 30 hidden units, learning rate: 3.5,92.0%
Accuracy for 30 hidden units, learning rate: 4.0,92.5%

In the grid of decision boundaries below, rows correspond to n_hs and columns to lrs:

  • n_hs = [3, 5, 7, 9, 10, 20, 30]
  • lrs = [1.5, 2.0, 2.5, 3.0, 3.3, 3.5, 4.0]
    (Figure: 7×7 grid of decision boundaries, one per (n_h, lr) combination)

Cost curves

plt.figure(figsize=(140,140))
for i, (n_h,lr) in enumerate(params):
    plt.subplot(7, 7, i+1)
    plt.grid()
    plt.ylim(0,1)
    plt.title('n_h : %d, lr : %f' % (n_h,lr))
    plt.plot(Costs[i],c='green')

(Figure: 7×7 grid of cost curves, one per (n_h, lr) combination)

The figure above shows the cost curves of all 49 combinations; rows correspond to n_hs and columns to lrs:

  • n_hs = [3, 5, 7, 9, 10, 20, 30]
  • lrs = [1.5, 2.0, 2.5, 3.0, 3.3, 3.5, 4.0]
plt.figure(figsize=(14,56))
for i,n_h,n in zip(np.arange(0,49,7), n_hs, range(7)):
    plt.subplot(7,1, n+1)
    plt.ylim(0,0.7)
    plt.grid()
    plt.title('Hidden Layer of size %d' % n_h)
    for lr,cost in zip(lrs,Costs[i: i+7]):
        plt.plot(cost, label='lr=%.2f'%lr)
    plt.legend()

(Figure: cost curves grouped by hidden-layer size, one subplot per n_h with one curve per learning rate)

From the plots above, a few promising (n_h, lr) combinations stand out:

  • (7, 2.5)
  • (9, 2.5)
  • (10, 3.5)
  • (20, 2.0)

7. Find the best combination

Visualize the cost curves of the four selected parameter combinations.

plt.figure(figsize=(14,8))
plt.grid()
plt.ylim(0.1,0.25)
plt.title('Find best combination')
for n_h, lr in [(7,2.5),(9,2.5),(10,3.5),(20,2.0)]:
    i = n_hs.index(n_h)
    j = lrs.index(lr)
    plt.plot(Costs[i*7:(i+1)*7][j], label='n_h:%d,lr=%.2f'%(n_h,lr))
plt.legend()

(Figure: cost curves of the four selected (n_h, lr) combinations)

The yellow curve looks good: after about 5,500 iterations it is smooth and it also decreases quickly. Next, let's increase the number of iterations and see whether the yellow curve starts to oscillate again.

plt.figure(figsize=(14,8))
plt.grid()
plt.ylim(0.1,0.25)
for i, (n_h,lr) in enumerate([(7,2.0),(10,4.0),(20,1.5),(30,2.0)]):
    parameters,costs = nn_model(X, Y, n_h, num_iterations = 30000, lr=lr,print_cost=0)
    plt.plot(costs, label='n_h=%d,lr=%.2f'%(n_h,lr))
plt.legend()

(Figure: cost curves over 30,000 iterations for the re-trained combinations)
  

After increasing the number of iterations, only the red curve still oscillates, and its period keeps growing; the other curves are essentially stable after about 15,000 iterations. So (9, 2.5) may be the best combination: it fits the data well, although it may start to overfit. Overfitting is beyond the scope of this post. The chosen settings are listed below, followed by a short training sketch.

  • n_h = 9
  • lr = 2.5
  • num_iterations: 15000
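Finally, here is a sketch of training one more model with the hyperparameters chosen above; the exact accuracy it reaches is not quoted here, since it depends on the run.

# Train a final model with the chosen hyper-parameters (sketch only).
parameters_best, costs_best = nn_model(X, Y, n_h=9, num_iterations=15000,
                                       print_cost=5000, lr=2.5)
predictions_best = predict(parameters_best, X)
accuracy_best = float((np.dot(predictions_best, Y) +
                       np.dot(1 - predictions_best, 1 - Y)) / float(Y.shape[0]) * 100)
print('Training accuracy: %.2f %%' % accuracy_best)

plot_decision_boundary(lambda x: predict(parameters_best, x), X, Y)
plt.title('Decision boundary: n_h = 9, lr = 2.5')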
