Notes on Andrew Ng's Machine Learning course (1)

Since Ng left Baidu to start his own venture, he has become very popular online, and his courses have gained popularity as well. In this series of blog posts I will summarize the weekly lectures and assignments from Ng's Machine Learning course on Coursera, both to share them with students who want a Chinese-language reference and to help me organize my own knowledge. If there are any omissions or errors, I hope the CSDN experts will point them out.

Now let's get into the content.

WEEK 1:

First of all,

What is machine learning?

Andrew gave two explanations. The first, Arthur Samuel's, is the definition from when the field was born: machine learning is the field of study that gives computers the ability to learn without being explicitly programmed. Andrew prefers a more modern definition, Tom Mitchell's: a computer program is said to learn from experience E with respect to some task T and performance measure P if its performance on T, as measured by P, improves with experience E.

Supervised learning vs. unsupervised learning

Mainly, let's talk about the key difference between the two. The input data for supervised learning must be labeled, and the algorithm learns from the existing labels in order to predict the labels of new data; common supervised learning methods are regression and classification algorithms. The input data for unsupervised learning has no labels, and the computer has to find structure in the data on its own; common unsupervised learning methods are clustering algorithms. The quality of a clustering result is not easy to measure, so for this type of algorithm the difficulty often lies in evaluating accuracy rather than in designing the algorithm. Okay, back to the topic.

Example of supervised learning: given data {actual market prices of houses of different sizes}, use the data to predict the price of a house of a given size and measure the prediction accuracy. Example of unsupervised learning: given data {1,000,000 different genes}, let the computer automatically group these genes into several different categories.
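To make the contrast concrete, here is a minimal Python sketch using scikit-learn; the house sizes/prices and the random "gene" features below are made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

# --- Supervised: labeled data (size -> price), learn to predict the label ---
sizes = np.array([[50.0], [80.0], [100.0], [120.0]])   # inputs x (m^2), toy values
prices = np.array([150.0, 240.0, 310.0, 360.0])        # labels y, toy values
reg = LinearRegression().fit(sizes, prices)
print("predicted price for a 90 m^2 house:", reg.predict([[90.0]])[0])

# --- Unsupervised: unlabeled data, let the algorithm group it on its own ---
genes = np.random.rand(1000, 5)                        # 1000 "genes" with 5 features each
labels = KMeans(n_clusters=3, n_init=10).fit_predict(genes)
print("cluster sizes:", np.bincount(labels))
```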

For a visual illustration, the following figure shows the model representation used in machine learning.
[Figure: model representation — input x → hypothesis h → predicted output y; h is produced by the learning algorithm]
Here x represents the input data, y represents the predicted output, and the h in the middle is short for hypothesis — the function that maps the input x to a prediction. h is what the machine learning algorithm learns from the data.
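In code, h is just a function from an input to a prediction. A minimal sketch (the parameter names theta0 and theta1 are my own, anticipating the linear hypothesis used later this week):

```python
# h maps an input x to a predicted output y.
# For this week's linear regression it is a straight line: h(x) = theta0 + theta1 * x.
def h(x, theta0, theta1):
    return theta0 + theta1 * x
```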

Cost function

The cost function (also called the loss function) is the focus of the first week — in fact, in any introductory machine learning course the cost function is a top priority. The cost function measures how good the current h is: the smaller its value, the better h fits the data. Machine learning is the process of learning a general method for handling data of the same type, and h is continuously improved during that process; the cost function is what we use to guide the adjustment of h.

The formula of the cost function is given below:
J(θ0, θ1) = (1 / (2m)) · Σ (ŷi − yi)², summed over the m training examples, where ŷi = h(xi)
Here ŷi is the predicted value, yi is the true value, m is the number of training examples, and J is the cost function. J is (half of) the mean of the squared differences between the predicted and true values, and the smaller J is, as mentioned above, the more accurate the current hypothesis will be.
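A minimal NumPy sketch of this cost function (the function and variable names are my own):

```python
import numpy as np

def compute_cost(x, y, theta0, theta1):
    """Half the mean squared error between predictions h(x_i) and targets y_i."""
    m = len(y)
    predictions = theta0 + theta1 * x      # h(x_i) for every training example
    return np.sum((predictions - y) ** 2) / (2 * m)
```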

Andrew then gave two examples; I'll pick the more representative one. Since we are currently discussing linear regression, this part only involves the slope of a straight line — junior-high mathematics.

A contour plot is introduced here; the right side of the figure below is a contour plot. The three green points in the figure represent three regression lines with different slopes but the same cost value.
[Figure: data with three candidate regression lines (left) and the corresponding contour plot of J (right)]
By computing and comparing cost values, we can see that the regression equation whose cost sits at the center of the contours (the smallest cost) is the optimal solution to the current problem, as shown below:
[Figure: the best-fit regression line and the corresponding point at the center of the contour plot]
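As a tiny illustration of "pick the line with the smallest cost", here is a sketch that reuses the compute_cost function above on made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.9])          # toy data that is roughly y = x

# Try a few candidate slopes with the intercept fixed at 0 and compare their costs,
# which is what reading the contour plot does visually.
for theta1 in (0.5, 1.0, 1.5):
    print(theta1, compute_cost(x, y, 0.0, theta1))
# The slope with the smallest cost (here theta1 = 1.0) is the best of the candidates.
```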

Gradient descent:
Okay, now that we have covered the cost function, let's talk about how to use it to make our hypothesis h gradually more reasonable and closer to the optimal solution.

Now assume our hypothesis is h(x) = θ0 + θ1·x, so the cost function J can be written as a function of θ0 and θ1. Then, as shown in the figure below, we only need to start from a randomly chosen θ0 and θ1, find the slope (tangent) at that point, and, following the principle of gradient descent, move toward the optimal solution step by step with a step size α (the learning rate).
[Figure: surface plot of J(θ0, θ1) with two gradient descent paths that start from nearby points and end at different minima]
However, in this picture you can see two red-arrow positions in the heat map, both pointing to blue (low-cost) regions; the blue region on the left is clearly darker than the one on the right, which means its cost is smaller. Yet starting from the current random point we arrive at a local optimum J2, not the global optimum J. This is a normal phenomenon: one of the shortcomings of gradient descent is that it can easily get stuck in a local optimum. One remedy is simulated annealing; I guess Andrew will mention this in later lectures, and I will also write about it in later posts. This question still needs to be cleared up, but for now you can keep reading with it in mind.

The update equation of gradient descent is given below:
θj := θj − α · ∂J(θ0, θ1)/∂θj   (for j = 0 and j = 1, updated simultaneously)
By repeatedly applying this update to each θj, we eventually converge to an optimal θ.
Here Andrew emphasized the choice of the step size α. If α is too large, gradient descent may overshoot, bouncing back and forth across the optimal solution without ever reaching it (or even diverging); if α is too small, it will take too long to reach the optimum. A helpful property is that the steps shrink on their own as we approach the minimum, because the derivative term becomes smaller, so θj can approach the optimum quickly without us having to keep shrinking α by hand (one can also simply decay α over time).
[Figure: effect of the learning rate α — too large overshoots or diverges, too small converges very slowly]
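Below is a minimal sketch of batch gradient descent for this univariate case; the function name, the fixed iteration count, and the default α are my own choices, not from the course:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, n_iters=1000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent on the cost J."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0                  # start from an arbitrary point
    for _ in range(n_iters):
        error = (theta0 + theta1 * x) - y      # h(x_i) - y_i for every example
        grad0 = np.sum(error) / m              # dJ/dtheta0
        grad1 = np.sum(error * x) / m          # dJ/dtheta1
        # simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1
```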

So at the end of this week's lessons, we obtain the optimal linear regression equation for the example by applying gradient descent to the cost function above.
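Putting the sketches above together on the same toy data (again, the numbers are invented):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.9])

theta0, theta1 = gradient_descent(x, y, alpha=0.05, n_iters=5000)
print(f"h(x) = {theta0:.3f} + {theta1:.3f} * x")   # should come out close to y = x
print("final cost:", compute_cost(x, y, theta0, theta1))
```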

OK, WEEK 1 COMPLETED!


Original post: blog.csdn.net/jxsdq/article/details/78091197