Deep Learning Notes by Andrew Ng (2)

Since Ng left Baidu and struck out on his own, he has been very active online recently, and his various courses have become popular as well. This series contains my notes on Ng's deep learning and neural networks course released on Coursera. I am sharing them with students who need them, and writing them up also helps me organize my own knowledge. If there are any omissions or errors, I hope the CSDN experts will point them out.

WEEK2

Binary Classification:

First of all, the course covers binary classification, the familiar yes/no problem: the goal is to classify inputs as 0 or 1. Andrew's course uses logistic regression to deal with this problem, with the running example of deciding whether a picture is a picture of a kitten.

Determining the input data:

The first thing we need is the input data. Setting aside the detail of the image's color channels, we simply take every pixel out of the image and reshape them into a single vector that serves as the input. Suppose we have m images as the training set; then the entire training set is {(X(1),Y(1)),(X(2),Y(2)),...,(X(m),Y(m))}. The structure of the input data is:
[Figure: structure of the input data X, one reshaped image per column]
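As a rough sketch of what this reshaping looks like in NumPy (the image size 64×64×3 and the random placeholder data are assumptions for illustration, not the course's actual dataset):

```python
import numpy as np

# Assume m training images, each 64x64 pixels with 3 color channels (RGB).
m = 100
images = np.random.rand(m, 64, 64, 3)   # placeholder data standing in for real images

# Flatten every image into a column vector of length 64*64*3 = 12288,
# then stack the columns so that X has shape (n_x, m): one column per example.
X = images.reshape(m, -1).T
print(X.shape)   # (12288, 100)

# Labels: 1 if the picture is a kitten, 0 otherwise, stored as a (1, m) row vector.
Y = np.random.randint(0, 2, size=(1, m))
```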

Logistic regression:
The linear regression equation we are familiar with cannot be used directly for classification, because the prediction we need is a probability p: for each possible output, p is the probability that the example belongs to that class, and we assign the example to the most likely class. So what we need here is a classification function, and for that we use logistic regression.

[Figure: the sigmoid function curve]

First, look at the picture above, an image of the sigmoid function that I found on Baidu. Logistic regression uses the sigmoid function. Put simply, for binary classification we stipulate that when the sigmoid output is greater than 0.5 the input is classified as 1, and when it is less than 0.5 the input is classified as 0. The sigmoid function also serves as the activation function of a standard neural network, so it will appear many times in later lessons. For an in-depth look at the sigmoid function, here is a link:
http://blog.jobbole.com/88521/
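A minimal sketch of the sigmoid function and the 0.5 threshold rule described above (the input values z are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    """Map any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
p = sigmoid(z)
labels = (p > 0.5).astype(int)   # classify as 1 when sigmoid(z) > 0.5, else 0
print(p)       # probabilities between 0 and 1
print(labels)  # [0 0 0 1 1]
```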

Loss function (cost function):
The loss function (cost function) is the focus of the first week; in fact, in any introductory machine learning course, the cost function is a top priority. The loss function measures how good the current hypothesis h is: the smaller the loss, the better h is. Machine learning is the process of learning a general method h for handling data of the same type, and h is continuously improved during that learning process. Throughout this process, we use the loss function to adjust h.

The formula of the loss function is given below:
J(θ0, θ1) = 1/(2m) · Σ (ŷ_i − y_i)²   (sum over i = 1..m)
Here ŷ_i is the predicted value, y_i is the true value, and J is the loss function. J is the average of the squared differences between the predicted and true values (with a conventional factor of 1/2), and the smaller J is, as mentioned above, the more accurate the current function's predictions are.
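A small sketch of this squared-error cost, using made-up prediction and label vectors:

```python
import numpy as np

def squared_error_cost(y_hat, y):
    """Average of the squared differences between predictions and labels.
    The 1/2 factor is the usual convention; it only rescales J."""
    m = y.shape[0]
    return np.sum((y_hat - y) ** 2) / (2 * m)

y     = np.array([1.0, 0.0, 1.0, 1.0])   # true values (made up)
y_hat = np.array([0.9, 0.2, 0.8, 0.4])   # predicted values (made up)
print(squared_error_cost(y_hat, y))      # smaller J means a better h
```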

Andrew then gave two examples; I will pick the more representative one. Since we are currently discussing linear regression, this part uses the slope problem from junior high school mathematics.

A contour plot is introduced here. The right side of the figure below is a contour plot, and the three green points in it represent three regression lines with different slopes but the same loss value.
[Figure: regression lines (left) and the contour plot of the loss with three green points of equal loss value (right)]
By computing and comparing the loss values, we can see that the regression equation whose loss sits at the center of the contour plot is the optimal solution to the current problem, as shown below:
[Figure: the regression line whose loss value lies at the center of the contour plot]

Gradient descent:

Okay, now that we have covered the loss function, let's talk about how to use it to make our hypothesis h gradually more reasonable and closer to the optimal solution.

Now assume h = θ0 + θ1·x, so the loss function J can be constructed as a function of θ0 and θ1. Then, as shown in the figure below, we just initialize θ0 and θ1 at random, find the slope at that point, and, following the principle of gradient descent, step toward the optimal solution with a step size α (that is, the learning rate).
[Figure: gradient descent on the surface of J(θ0, θ1)]
However, in this picture you can see two positions in the heat map marked with red arrows, both of which are blue minima, and the blue region on the left is clearly darker than the one on the right, which means the loss on the left is smaller. Yet from the current random starting point we reach the local optimum J2, not the global optimum J. This is normal: one of the shortcomings of gradient descent is that it easily gets stuck in local optima. One remedy is simulated annealing; I guess Andrew will mention this in later lessons, and I will also write about it in later posts. This question still needs to be clarified, but for now you can keep reading with it in mind.

The following is the equation of gradient descent:
θj := θj − α · ∂J(θ0, θ1)/∂θj   (updating θ0 and θ1 simultaneously)
By continuously adjusting θj, we eventually arrive at an optimal θj.
Here Andrew emphasized the choice of the step size α. If α is too large, we may end up in an endless loop, bouncing back and forth across the optimal solution without ever reaching it; if α is too small, it takes too long to reach the optimum. A good solution is to let α decrease automatically, so that θj approaches the optimal solution as quickly as possible without running into the first problem.
[Figure: effect of the step size α on gradient descent]
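A minimal sketch of this update rule for h = θ0 + θ1·x, with a tiny made-up dataset and a fixed learning rate α (all values here are assumptions for illustration):

```python
import numpy as np

# Tiny made-up dataset roughly following y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
m = x.shape[0]

theta0, theta1 = 0.0, 0.0   # arbitrary starting point
alpha = 0.1                 # learning rate (step size)

for _ in range(1000):
    y_hat = theta0 + theta1 * x
    # Partial derivatives of J = 1/(2m) * sum((y_hat - y)^2)
    grad0 = np.sum(y_hat - y) / m
    grad1 = np.sum((y_hat - y) * x) / m
    # Simultaneous update of both parameters
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(theta0, theta1)   # should end up close to 1 and 2
```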

Computation Graph:
What is a computation graph structure?
(1) A computation graph can be regarded as a graphical language for functions: the nodes in the graph represent the inputs of the function, and the edges in the graph represent the function's operations.
(2) Given an example of a computation graph, we can easily write down its corresponding function.

The computation graph can represent the operation performed at each step of a function, which naturally leads to forward propagation and backpropagation.
[Figure: computation graph with inputs b and c, intermediate node u = b·c, and final output J = 3·v]
In the forward direction, we multiply b and c to get u, so u is obtained from the two input nodes b and c; pushing on to the last step, the final output is J = 3·v. That is forward propagation. So how is backpropagation represented in this picture?
Still using the simplest case u = b·c: how much does a change in b affect u? You only need to take the partial derivative of u with respect to b, which here is 2 (the value of c). In the same way, if we want to know how much a change in some input affects the output J, we take the partial derivative of the output with respect to that input.
Calculation formula:
dJ/da = (dJ/dv) · (dv/da)
This chain-rule expression describes how the partial derivative of J with respect to a is calculated.
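A tiny sketch of this forward and backward pass, assuming the usual structure of this example, J = 3·v with v = a + u and u = b·c, and the input values a = 5, b = 3, c = 2 (these specific values are assumptions based on the course's standard example):

```python
# Forward pass through the small computation graph J = 3*v, v = a + u, u = b*c.
a, b, c = 5.0, 3.0, 2.0      # input values (assumed)

u = b * c                    # u = 6
v = a + u                    # v = 11
J = 3 * v                    # J = 33

# Backward pass: propagate derivatives from the output back to each input
# using the chain rule.
dJ_dv = 3.0                  # dJ/dv
dJ_da = dJ_dv * 1.0          # dv/da = 1, so dJ/da = 3
dJ_du = dJ_dv * 1.0          # dv/du = 1
dJ_db = dJ_du * c            # du/db = c = 2, so dJ/db = 6
dJ_dc = dJ_du * b            # du/dc = b = 3, so dJ/dc = 9

print(J, dJ_da, dJ_db, dJ_dc)   # 33.0 3.0 6.0 9.0
```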

The last thing I want to cover is how logistic regression and backpropagation are combined:
[Figure: computation graph of logistic regression with inputs x1, w1, x2, w2, and b]
The picture above is the computation graph of logistic regression. As an exercise, think about the partial derivatives of the loss L with respect to the five input parameters x1, w1, x2, w2, and b.
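As a sketch of one way to work through that exercise, assuming the cross-entropy loss used in Ng's course, L = −(y·log a + (1−y)·log(1−a)) with a = sigmoid(z) and z = w1·x1 + w2·x2 + b, the backward pass simplifies to dL/dz = a − y, and each input's derivative follows by the chain rule (the numeric values below are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up values for a single training example.
x1, x2 = 1.0, 2.0
w1, w2, b = 0.5, -0.3, 0.1
y = 1.0

# Forward pass through the logistic-regression computation graph.
z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)
L = -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Backward pass (chain rule). For this loss, dL/dz simplifies to a - y.
dz  = a - y
dw1 = x1 * dz        # dL/dw1
dw2 = x2 * dz        # dL/dw2
db  = dz             # dL/db
dx1 = w1 * dz        # dL/dx1
dx2 = w2 * dz        # dL/dx2

print(dw1, dw2, db, dx1, dx2)
```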

WEEK2 END


Origin blog.csdn.net/jxsdq/article/details/78129706