Introduction to Machine Learning: Linear Regression and Gradient Descent

Reprinted from: https://blog.csdn.net/xiazdong/article/details/7950084
 
This article will cover:

(1) The definition of linear regression
(2) Univariate linear regression
(3) Cost function: a way to evaluate how well a linear regression fits the training set
(4) Gradient descent: one of the methods for solving linear regression
(5) Feature scaling: a way to speed up the execution of gradient descent
(6) Multivariate linear regression


Linear Regression
 
 
One thing to note up front: Feature Scaling must be done before running multivariate linear regression!

Method: Linear regression belongs to supervised learning, so the procedure is the same as for other supervised learning: first take a training set, learn a linear function from that training set, then test how well the function has been learned (that is, how well it fits the training set data), and select the best function (the one with the minimum cost function);
Notice:
(1) Because this is linear regression, the learned function is a linear function, that is, the function of a straight line;
(2) Because it is univariate, there is only one input variable x;

We can write the model for univariate linear regression as:

h(x) = theta0 + theta1 * x

We often call x the feature and h(x) the hypothesis;

Given the "method" above, a natural question arises: how can we tell whether the linear function fits well?
We need the cost function: the smaller the cost function, the better the linear regression (the better it fits the training set); the minimum value is 0, which means a perfect fit;

As a practical example:

We want to predict the price of a house based on the size of the house, given the following dataset:

 

 

Plotting the above data set on a graph gives the following figure:

We need to fit a straight line through these points that minimizes the cost function;


Although we have not yet said what the cost function looks like, our goal is: given the input vector x, the output vector y, and the theta vector, output the cost value;

Above we have described the general process of univariate linear regression;


Cost Function


Purpose of the cost function: to evaluate the hypothesis function. The smaller the cost function, the better the fit to the training data;
The following figure illustrates the role of the cost function when it is treated as a black box;
 

But we naturally wonder: what is the internal structure of the cost function? So we give the formula below:

J(theta0, theta1) = 1/(2m) * sum from i=1 to m of ( h(x^(i)) - y^(i) )^2

where:
x^(i) represents the ith element of the vector x;
y^(i) represents the ith element of the vector y;
h is the known hypothesis function;
m is the number of training examples;

For example, given the data points (1,1), (2,2), (3,3),
then x = [1;2;3], y = [1;2;3] (the syntax here is Octave syntax, representing a 3*1 matrix).
If we predict theta0 = 0, theta1 = 1, then h(x) = x, and the cost function is:
J(0,1) = 1/(2*3) * [(h(1)-1)^2 + (h(2)-2)^2 + (h(3)-3)^2] = 0;
If we predict theta0 = 0, theta1 = 0.5, then h(x) = 0.5x, and the cost function is:
J(0,0.5) = 1/(2*3) * [(h(1)-1)^2 + (h(2)-2)^2 + (h(3)-3)^2] = 3.5/6, approximately 0.58;
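To double-check the two values above, here is a small Python sketch (NumPy and the helper name cost are my own additions for illustration; the article's own fragments use Octave notation):

import numpy as np

def cost(theta0, theta1, x, y):
    # J(theta0, theta1) = 1/(2m) * sum((h(x_i) - y_i)^2), with h(x) = theta0 + theta1*x
    m = len(x)
    h = theta0 + theta1 * x
    return np.sum((h - y) ** 2) / (2 * m)

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

print(cost(0.0, 1.0, x, y))   # 0.0  (perfect fit)
print(cost(0.0, 0.5, x, y))   # 0.5833...  (approximately 0.58)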


If theta0 is fixed at 0, the plot of J as a function of theta1 looks like this:

If neither theta0 nor theta1 is fixed, the plot of J as a function of theta0 and theta1 looks like this:
 

Of course, we can also represent it with a two-dimensional plot, that is, a contour plot;
 


Note: For linear regression, the cost function J as a function of theta0 and theta1 must be bowl-shaped (convex), that is, it has only one minimum point;

Above we explained the definition and formula of cost function;


Gradient Descent


This raises another question: although, given a function, the cost function tells us how well it fits, there are infinitely many candidate functions, so we cannot try them one by one, right?
This leads us to gradient descent: a way to find the minimum of the cost function;
The idea behind gradient descent: think of the function as a mountain. Standing somewhere on the hillside, we look around and take a small step in whichever direction lets us descend the fastest;

Of course, gradient descent is only one way to solve the problem; another method is the Normal Equation;

Method:
(1) First decide the size of each step we take, which we call the learning rate alpha;
(2) Start from arbitrarily chosen initial values of theta0 and theta1;
(3) Determine a downhill direction, take a step of the chosen size, and update theta0 and theta1;
(4) When the decrease in height is smaller than some predefined value, stop descending;
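A minimal generic Python sketch of these four steps (the function being minimized here, f(x) = (x - 3)^2, and all names are just illustrations, not the article's cost function):

def gradient_descent_1d(df, x0, alpha=0.1, tol=1e-6, max_iters=10000):
    # df is the derivative of the function we want to minimize.
    x = x0
    for _ in range(max_iters):
        step = alpha * df(x)      # (1) the learning rate alpha fixes the step size
        x = x - step              # (3) move downhill and update x
        if abs(step) < tol:       # (4) stop when the change becomes tiny
            break
    return x

# (2) start from an arbitrary initial value, here x0 = 0; the minimum of (x - 3)^2 is at x = 3
print(gradient_descent_1d(lambda x: 2 * (x - 3), x0=0.0))   # converges to approximately 3.0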

Algorithm:

repeat until convergence {
    theta_j := theta_j - alpha * d/d(theta_j) J(theta0, theta1)     (for j = 0 and j = 1, updating both simultaneously)
}

Features:
(1) Different initial points can lead to different minima, so gradient descent only finds a local minimum;
(2) The closer we get to the minimum, the slower the descent;

Question: if the initial value is already at a local minimum, what happens?
Answer: since theta is already at the local minimum, the derivative there must be 0, so theta will not change;

If a suitable learning rate alpha is chosen, the cost function should keep getting smaller;
Question: how do we choose the value of alpha?
Answer: watch the cost function: if it keeps decreasing, the current alpha is fine; otherwise, use a smaller alpha;

The following figure illustrates the process of gradient descent in detail:

 
As the figure above shows: different initial points lead to different minima, so gradient descent only finds a local minimum;

Note: the step size of the descent is very important: if it is too small, finding the minimum is very slow; if it is too large, we may overshoot the minimum;

The following figure shows the overshoot phenomenon:

 
If the value of the cost function J is found to increase, the learning rate needs to be reduced;


Integrating Gradient Descent & Linear Regression


Gradient descent can find the minimum value of a function;
Linear regression needs to find the theta0 and theta1 that minimize the cost function;

Therefore, we can apply gradient descent to the cost function. This combines gradient descent with linear regression and, after working out the derivatives, gives the following update rule:

repeat until convergence {
    theta0 := theta0 - alpha * (1/m) * sum from i=1 to m of ( h(x^(i)) - y^(i) )
    theta1 := theta1 - alpha * (1/m) * sum from i=1 to m of ( h(x^(i)) - y^(i) ) * x^(i)
}   (theta0 and theta1 are updated simultaneously)
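A minimal Python sketch of these update rules (the variable names, the fixed iteration budget, and the tiny dataset are my own choices for illustration, not the article's):

import numpy as np

def gradient_descent(x, y, alpha=0.1, num_iters=2000):
    # Fit h(x) = theta0 + theta1*x by repeating the two simultaneous updates above.
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        h = theta0 + theta1 * x              # current hypothesis h(x)
        grad0 = np.sum(h - y) / m            # (1/m) * sum(h(x^(i)) - y^(i))
        grad1 = np.sum((h - y) * x) / m      # (1/m) * sum((h(x^(i)) - y^(i)) * x^(i))
        theta0 -= alpha * grad0              # both parameters updated simultaneously
        theta1 -= alpha * grad1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
print(gradient_descent(x, y))   # approaches (0, 1), i.e. h(x) = x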
Gradient descent works by repeated iteration, so we care about the number of iterations, since it determines how fast gradient descent runs. Feature Scaling is introduced to reduce the number of iterations;


Feature Scaling


This method is applied to gradient descent in order to speed it up;
Idea: normalize the value of each feature so that its range is roughly -1 <= x <= 1;

The commonly used method is Mean Normalization, that is

x := ( x - mean(x) ) / ( max(x) - min(x) )

or:
[X - mean(X)] / std(X);


As a practical example,

There are two features:
(1) size, with values in the range 0~2000;
(2) #bedrooms, with values in the range 0~5;
Then after feature scaling,
x1 = size / 2000 and x2 = #bedrooms / 5, so both features fall roughly in the range 0 <= x <= 1;
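A small Python sketch of both scalings (the sample values below are made up for illustration; the article only gives the ranges 0~2000 and 0~5):

import numpy as np

def mean_normalize(x):
    # (x - mean(x)) / (max(x) - min(x)): values end up roughly in [-1, 1]
    return (x - x.mean()) / (x.max() - x.min())

size = np.array([1800.0, 1200.0, 900.0, 400.0])   # hypothetical house sizes
bedrooms = np.array([4.0, 3.0, 2.0, 1.0])         # hypothetical bedroom counts

print(size / 2000, bedrooms / 5)                  # simple scaling by the maximum of each range
print(mean_normalize(size), mean_normalize(bedrooms))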



Practice Question

We want to predict final exam grades from midterm exam grades, and the model we want to fit is:

h(x) = theta0 + theta1 * x1 + theta2 * x2, where x1 = midterm score and x2 = (midterm score)^2
Given the following training set:

midterm exam    (midterm exam)^2    final exam
89 7921 96
72 5184 74
94 8836 87
69 4761 78
We want to perform feature scaling on the (midterm exam)^2 feature; what is the scaled value for the fourth example (4761)?

max = 8836, min = 4761, mean = 6675.5, so x = (4761 - 6675.5) / (8836 - 4761) = -0.47;
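The same calculation in a few lines of Python (the values are taken directly from the table above):

x2 = [7921.0, 5184.0, 8836.0, 4761.0]            # the (midterm exam)^2 column
mean = sum(x2) / len(x2)                          # 6675.5
scaled = (4761 - mean) / (max(x2) - min(x2))      # mean normalization of the fourth example
print(round(scaled, 2))                           # -0.47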


Multivariate Linear Regression


Earlier we only introduced univariate linear regression, that is, regression with a single input variable; the real world is rarely that simple, so here we introduce multivariate linear regression;

For example:
House prices are actually determined by many factors, such as size, number of bedrooms, number of floors, age of home, and so on. Here we assume the house price is determined by 4 factors, as shown in the figure below:



We defined the model of univariate linear regression earlier:

h(x) = theta0 + theta1 * x
Here we can define the model of multivariate linear regression:

h(x) = theta0 + theta1 * x1 + theta2 * x2 + ... + theta_n * x_n    (with x0 = 1)
The cost function is as follows:

J(theta0, theta1, ..., theta_n) = 1/(2m) * sum from i=1 to m of ( h(x^(i)) - y^(i) )^2

If we want to use gradient descent to solve multivariate linear regression, we can still use the usual gradient descent algorithm:

repeat until convergence {
    theta_j := theta_j - alpha * (1/m) * sum from i=1 to m of ( h(x^(i)) - y^(i) ) * x_j^(i)     (for j = 0, 1, ..., n, updated simultaneously)
}
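A vectorized Python sketch of this update (the design matrix X carries a first column of ones for x0 = 1; the feature values and y are made up, since the original table only appears as an image):

import numpy as np

def gradient_descent_multi(X, y, alpha=0.1, num_iters=5000):
    # X is an m x (n+1) design matrix whose first column is all ones (x0 = 1).
    m, ncols = X.shape
    theta = np.zeros(ncols)
    for _ in range(num_iters):
        h = X @ theta                          # h(x) for every training example at once
        theta -= alpha * (X.T @ (h - y)) / m   # theta_j := theta_j - alpha*(1/m)*sum((h(x^(i)) - y^(i)) * x_j^(i))
    return theta

# hypothetical, already-scaled features: [1, size_scaled, bedrooms_scaled]
X = np.array([[1.0,  0.5, -0.30],
              [1.0, -0.4,  0.20],
              [1.0,  0.1,  0.45],
              [1.0, -0.2, -0.35]])
y = np.array([460.0, 232.0, 315.0, 178.0])
print(gradient_descent_multi(X, y))            # the learned theta0, theta1, theta2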

Overall practice question:

 


1. We want to predict a student's second-year grades from their first-year grades: x is the number of A's obtained in the first year, y is the number of A's obtained in the second year. Given the following dataset:

x y
3 4
2 1
4 3
0 1
(1) How many training examples are there? 4;
(2) What is the value of J(0,1)?
J(0,1) = 1/(2*4)*[(3-4)^2+(2-1)^2+(4-3)^2+(0-1)^2] = 1/8*(1+1+1+1) = 1/2 = 0.5;

We can also compute J(0,1) quickly using vectorization:

J(theta) = 1/(2m) * (X*theta - y)' * (X*theta - y), where X is the m*2 matrix whose first column is all ones and whose second column is x
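A sketch of that vectorized computation in Python (the matrix and variable names are mine; the data is exactly the four (x, y) pairs above):

import numpy as np

X = np.array([[1.0, 3.0],
              [1.0, 2.0],
              [1.0, 4.0],
              [1.0, 0.0]])          # first column is x0 = 1, second column is x
y = np.array([4.0, 1.0, 3.0, 1.0])
theta = np.array([0.0, 1.0])        # theta0 = 0, theta1 = 1

errors = X @ theta - y
J = (errors @ errors) / (2 * len(y))  # 1/(2m) * (X*theta - y)' * (X*theta - y)
print(J)                              # 0.5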
