Andrew Ng Deep Learning Course 2, Week 1, Assignment 2: Regularization


I. deeplearning-assignment

The focus of this assignment is understanding how each regularization method works and what its pros and cons are, rather than the fine details of the implementation.

Problem statement: you are asked to train a model on a dataset that recommends positions from which the French goalkeeper should kick the ball, so that players on the French team can head it. France's past 10 games give the following 2D dataset:

Each dot corresponds to a position on the football field where a player hit the ball with his/her head after the French goalkeeper kicked it from the left side of the field.

  • If the dot is blue, it means a French player managed to hit the ball with his/her head
  • If the dot is red, it means a player from the other team hit the ball with their head

Your goal: use a deep learning model to find the positions on the field from which the goalkeeper should kick the ball.

Analyzing the dataset: the data is a bit noisy, but a diagonal line separating the upper-left (blue) points from the lower-right (red) points looks like it would work reasonably well.

In this assignment you will first try a non-regularized model, then learn how to regularize it and decide which model to use to solve the French Football Corporation's problem.
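
Before training, the dataset is loaded with the helper shipped with the assignment. A minimal sketch, assuming the course's reg_utils module provides a load_2D_dataset function as in the original notebook:

from reg_utils import load_2D_dataset  # assumed assignment helper

train_X, train_Y, test_X, test_Y = load_2D_dataset()  # 2D points with blue (1) / red (0) labels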


II. Algorithm code

1. Non-regularized model

import numpy as np
import matplotlib.pyplot as plt
# initialize_parameters, forward_propagation, compute_cost, backward_propagation and
# update_parameters are helper functions provided with the assignment (reg_utils);
# the regularized/dropout variants are defined later in this post

def model(X, Y, learning_rate=0.3, num_iterations=30000, print_cost=True, lambd=0, keep_prob=1):
    """

    :param X:input data, of shape (input size, number of examples)
    :param Y:true "label" vector (1 for blue dot / 0 for red dot), of shape (output size, number of examples)
    :param learning_rate:learning rate of the optimization
    :param num_iterations:number of iterations of the optimization loop
    :param print_cost:If True, print the cost every 10000 iterations
    :param lambd:regularization hyperparameter, scalar
    :param keep_prob:probability of keeping a neuron active during drop-out, scalar.
    :return:parameters -- parameters learned by the model. They can then be used to predict.

    """

    grads = {}
    costs = []  # to keep track of the cost
    m = X.shape[1]  # number of examples is 211
    layers_dims = [X.shape[0], 20, 3, 1]

    parameters = initialize_parameters(layers_dims)

    for i in range(0, num_iterations):
        if keep_prob == 1:
            a3, cache = forward_propagation(X, parameters)
        elif keep_prob < 1:
            a3, cache = forward_propagation_with_dropout(X, parameters, keep_prob)

        if lambd == 0:
            cost = compute_cost(a3, Y)
        else:
            cost = compute_cost_with_regularization(a3, Y, parameters, lambd)

        assert (lambd == 0 or keep_prob == 1)

        if lambd == 0 and keep_prob == 1:
            grads = backward_propagation(X, Y, cache)
        elif lambd != 0:
            grads = backward_propagation_with_regularization(X, Y, cache, lambd)
        elif keep_prob < 1:
            grads = backward_propagation_with_dropout(X, Y, cache, keep_prob)

        parameters = update_parameters(parameters, grads, learning_rate)

        if print_cost and i % 10000 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
        if print_cost and i % 1000 == 0:
            costs.append(cost)

    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('iterations (x1,000)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()

    return parameters

From the results above we can see that the non-regularized model overfits the training set: it is fitting the noisy points. Let's now look at two techniques that can reduce overfitting.
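
A minimal sketch of training and evaluating the baseline, assuming the assignment also provides a predict helper as in the original notebook (the exact accuracies depend on your run):

parameters = model(train_X, train_Y)
print("On the training set:")
predictions_train = predict(train_X, train_Y, parameters)
print("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)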

2. L2 regularization

To avoid overfitting the dataset, the cost is computed with a regularized version of the cost function (shown below), which reduces the impact of high variance.
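
Concretely, the regularized cost adds the squared Frobenius norms of all weight matrices to the usual cross-entropy cost:

$$ J_{\text{regularized}} = \underbrace{-\frac{1}{m}\sum_{i=1}^{m}\Big(y^{(i)}\log a^{[L](i)} + (1-y^{(i)})\log\big(1-a^{[L](i)}\big)\Big)}_{\text{cross-entropy cost}} \;+\; \underbrace{\frac{\lambda}{2m}\sum_{l}\big\|W^{[l]}\big\|_F^2}_{\text{L2 regularization cost}} $$

This is exactly what compute_cost_with_regularization below computes, with lambd playing the role of λ.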

def compute_cost_with_regularization(A3, Y, parameters, lambd):
    """

    :param A3:post-activation, output of forward propagation, of shape (output size, number of examples)
    :param Y:"true" labels vector, of shape (output size, number of examples)
    :param parameters:python dictionary containing parameters of the model
    :param lambd:regularization hyperparameter, scalar
    :return:cost - value of the regularized loss function

    """

    m = Y.shape[1]
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]

    cross_entropy_cost = compute_cost(A3, Y)

    L2_regularization_cost = lambd / 2 * 1 / m * (np.sum(np.square(W1)) +
                                                  np.sum(np.square(W2)) + np.sum(np.square(W3)))

    cost = cross_entropy_cost + L2_regularization_cost

    return cost
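
The backward pass below changes only the weight gradients: differentiating the L2 penalty adds a term proportional to each weight matrix,

$$ \frac{\partial}{\partial W^{[l]}}\left(\frac{\lambda}{2m}\big\|W^{[l]}\big\|_F^2\right) = \frac{\lambda}{m}\,W^{[l]} $$

which shows up as the extra "+ lambd / m * W" term added to dW1, dW2 and dW3 in the code.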


def backward_propagation_with_regularization(X, Y, cache, lambd):
    """

    :param X:input dataset, of shape (input size, number of examples)
    :param Y:"true" labels vector, of shape (output size, number of examples)
    :param cache:cache output from forward_propagation()
    :param lambd:regularization hyperparameter, scalar
    :return:gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables

    """

    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y

    dW3 = 1. / m * np.dot(dZ3, A2.T) + lambd / m * W3
    db3 = 1. / m * np.sum(dZ3, axis=1, keepdims=True)

    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))

    dW2 = 1. / m * np.dot(dZ2, A1.T) + lambd / m * W2
    db2 = 1. / m * np.sum(dZ2, axis=1, keepdims=True)

    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))

    dW1 = 1. / m * np.dot(dZ1, X.T) + lambd / m * W1
    db1 = 1. / m * np.sum(dZ1, axis=1, keepdims=True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}

    return gradients

From the resulting plot we can see that the L2-regularized model gives a much smoother decision boundary. The cost function gains a regularization term: the larger lambd is, the smaller the weights W become, which reduces the effect of overfitting.
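
A minimal sketch of training with L2 regularization enabled (lambd = 0.7 is just an example value to tune; the predict helper is assumed as above):

parameters = model(train_X, train_Y, lambd=0.7)
predictions_train = predict(train_X, train_Y, parameters)
predictions_test = predict(test_X, test_Y, parameters)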

3. Dropout regularization

Dropout randomly shuts down some of the units in each hidden layer at every training iteration, which makes the network simpler: intuitively, at each iteration the original network behaves like a smaller network with fewer units in its hidden layers.

def forward_propagation_with_dropout(X, parameters, keep_prob=0.5):
    """

    :param X:input dataset, of shape (2, number of examples)
    :param parameters:python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
            W1 -- weight matrix of shape (20, 2)
            b1 -- bias vector of shape (20, 1)
            W2 -- weight matrix of shape (3, 20)
            b2 -- bias vector of shape (3, 1)
            W3 -- weight matrix of shape (1, 3)
            b3 -- bias vector of shape (1, 1)
    :param keep_prob: probability of keeping a neuron active during drop-out, scalar
    :return:A3 -- last activation value, output of the forward propagation, of shape (1,1)
    cache -- tuple, information stored for computing the backward propagation

    """

    np.random.seed(1)

    # retrieve parameters
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]

    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)

    D1 = np.random.rand(A1.shape[0], A1.shape[1])  # Step 1: initialize matrix D1 with random values in [0, 1)
    D1 = D1 < keep_prob  # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold)
    A1 = np.multiply(D1, A1)  # Step 3: shut down some neurons of A1
    A1 = A1 / keep_prob  # Step 4: scale the value of neurons that haven't been shut down

    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)

    D2 = np.random.rand(A2.shape[0], A2.shape[1])  # Step 1: initialize matrix D2 with random values in [0, 1)
    D2 = D2 < keep_prob  # Step 2: convert entries of D2 to 0 or 1 (using keep_prob as the threshold)
    A2 = np.multiply(D2, A2)  # Step 3: shut down some neurons of A2
    A2 = A2 / keep_prob  # Step 4: scale the value of neurons that haven't been shut down

    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)

    cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)

    return A3, cache


def backward_propagation_with_dropout(X, Y, cache, keep_prob):
    """

    :param X:input dataset, of shape (2, number of examples)
    :param Y:"true" labels vector, of shape (output size, number of examples)
    :param cache:cache output from forward_propagation_with_dropout()
    :param keep_prob:probability of keeping a neuron active during drop-out, scalar
    :return:gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables

    """

    m = X.shape[1]
    (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y
    dW3 = 1. / m * np.dot(dZ3, A2.T)
    db3 = 1. / m * np.sum(dZ3, axis=1, keepdims=True)
    dA2 = np.dot(W3.T, dZ3)

    dA2 = np.multiply(dA2, D2)  # Step 1: Apply mask D2 to shut down the same neurons as during the forward propagation
    dA2 = dA2 / keep_prob  # Step 2: Scale the value of neurons that haven't been shut down

    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1. / m * np.dot(dZ2, A1.T)
    db2 = 1. / m * np.sum(dZ2, axis=1, keepdims=True)

    dA1 = np.dot(W2.T, dZ2)

    dA1 = np.multiply(dA1, D1)  # Step 1: Apply mask D1 to shut down the same neurons as during the forward propagation
    dA1 = dA1 / keep_prob  # Step 2: Scale the value of neurons that haven't been shut down

    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1. / m * np.dot(dZ1, X.T)
    db1 = 1. / m * np.sum(dZ1, axis=1, keepdims=True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}

    return gradients

Note that dropout is applied only during training; at test time no neurons should be dropped, so the standard forward pass (without the masks) is used for evaluation.
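
A minimal sketch of training with dropout and then evaluating without it (keep_prob = 0.86 is just an example value; predict is assumed to call the plain forward_propagation, so no neurons are dropped at test time):

parameters = model(train_X, train_Y, keep_prob=0.86, learning_rate=0.3)
predictions_train = predict(train_X, train_Y, parameters)  # evaluated with the full network
predictions_test = predict(test_X, test_Y, parameters)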

As the code above shows, during training each dropout layer divides its activations by keep_prob to keep the same expected value (this is "inverted dropout"). For example, if keep_prob is 0.5, on average half of the neurons are shut down, so the surviving outputs are divided by 0.5, which is the same as multiplying them by 2. The layer's output therefore has the same expected value as it would without dropout.
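
A quick numerical check of this scaling argument (a standalone sketch, independent of the assignment code):

import numpy as np

np.random.seed(0)
keep_prob = 0.5
A = np.random.rand(1000, 1000)              # stand-in activations, mean ~0.5
D = np.random.rand(*A.shape) < keep_prob    # dropout mask: roughly half the entries are kept
A_drop = (A * D) / keep_prob                # inverted dropout: rescale the surviving neurons

print(A.mean())       # ~0.5
print(A_drop.mean())  # also ~0.5: the expected value is preserved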


III. Summary

The table below summarizes the training and test accuracy of the three models:

We can see that regularization affects the accuracy on the training set as well as the accuracy on the test set: it limits how much the model overfits the training set, which improves the model's ability to generalize and therefore its accuracy on the test set.

Through this week's exercises, we have learned that:

  1. Regularization helps reduce overfitting.
  2. Regularization drives the weights W to smaller values.
  3. L2 regularization and dropout are two effective regularization methods.
