Linear regression (linear regression model)
The linear regression model
Regression: predicts a number as its output
Linear regression is a special kind of supervised learning model
Example: by fitting a curve to known house prices, the price of a house with a given square footage can be predicted
Distinguishing regression from classification: the output of classification is generally discrete (one of a small, fixed set of possible categories), while regression outputs a continuous number
- Terminology
Training set (train set):
Input variable (feature): x (e.g., x = 2014)
Output (target) variable: y (e.g., y = 232)
Total number of training examples: m
A single training example: (x, y)
$(x^{(i)}, y^{(i)})$: the superscript $i$ refers to the $i$-th row (example) of the training set
- Learning a model
A function f is learned from the training set; given an input x, it produces a prediction for y
$\hat{y}$ (y-hat): denotes the predicted value of y
- Representing f
If we assume f is a straight line:
$$f_{w,b}(x) = wx + b$$
(Figure: the line $f_{w,b}(x)$ plotted over the training data.)
This is the equation of linear regression with a single input variable (univariate linear regression)
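As a quick sketch of this model (plain Python; the function name and the sample values below are illustrative, not from the notes):

```python
def predict(x, w, b):
    """Univariate linear regression model: f_{w,b}(x) = w*x + b."""
    return w * x + b

# Illustrative parameters: a house of size 1.2 (thousand sq. ft.)
# is predicted to cost 200 * 1.2 + 100 = 340 (thousand dollars).
print(predict(1.2, w=200, b=100))  # 340.0
```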
Cost function formula
To implement the regression algorithm, the first step is to construct the cost function
w and b are the model's parameters (w weights the input x, and b is the intercept):
Different values of w and b give different functions; we want the function that fits the training data as well as possible
Cost function: Measures how well a line fits the training data
m: the number of samples in the training set
The factor 1/(2m): dividing by m averages the squared error over the training set, and the extra 1/2 cancels the 2 that appears when differentiating later
$$J(w,b) = \frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$$ — the most commonly used (squared-error) cost function
Since $\hat{y}^{(i)} = f_{w,b}(x^{(i)})$, it can equivalently be written as:
$$J(w,b) = \frac{1}{2m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2$$
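A minimal NumPy sketch of this cost function (the helper name and the toy data are illustrative):

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared-error cost J(w,b) = (1/2m) * sum_i (f_{w,b}(x^(i)) - y^(i))^2."""
    m = len(x)
    f = w * x + b                      # predictions for all m examples at once
    return np.sum((f - y) ** 2) / (2 * m)

# Toy training set: prices that lie exactly on the line y = 200x + 100
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([300.0, 500.0, 700.0])
print(compute_cost(x_train, y_train, w=200.0, b=100.0))  # 0.0 -- a perfect fit
```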
Understanding the cost function
First, set b = 0 and observe how w affects f(x) and J(w)
f(x): for a given w, the variable is x
(Figures: plots of f(x) for w = 1 and for w = 0.5.)
J(w): the variable is w
Compute J(w) for different values of w from the plots above
…
Then plot J(w) against those values
From the resulting plot: the smaller J(w) is, the better the function fits the data
So the goal of linear regression is to find parameters w and b that make J(w, b) as small as possible
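To make this concrete, the sketch below (reusing compute_cost from above; the data are illustrative) fixes b = 0, sweeps w over a range, and picks the w with the smallest J(w):

```python
import numpy as np

x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([1.0, 2.0, 3.0])   # data that the line y = x fits exactly

ws = np.linspace(-0.5, 2.5, 61)       # candidate values of w
costs = [compute_cost(x_train, y_train, w, b=0.0) for w in ws]

best_w = ws[int(np.argmin(costs))]
print(best_w)  # 1.0 -- J(w) is smallest at w = 1, matching the w = 1 plot above
```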
Visualizing the cost function
Observe how w and b together affect $f_{w,b}(x)$ and J(w, b)
f(x): for a given w and b, the variable is x
(Figure: the plot of f(x) for w = 0.06, b = 50.)
Then plot J(w, b) against the parameter values:
x-axis: b, y-axis: w, z-axis: J (a 3D surface)
This surface can be flattened into a two-dimensional contour map, with the center point being the minimum:
For example, taking b = 800 and w = −0.15, the cost function is large,
so the line fitted with those values fits the data poorly
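A minimal matplotlib sketch of such a contour map (the grid ranges and training data are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([250.0, 480.0, 690.0])  # illustrative house prices

ws = np.linspace(-100, 500, 200)
bs = np.linspace(-400, 600, 200)
W, B = np.meshgrid(ws, bs)

# Evaluate J(w, b) over the whole parameter grid
J = np.zeros_like(W)
for xi, yi in zip(x_train, y_train):
    J += (W * xi + B - yi) ** 2
J /= 2 * len(x_train)

plt.contour(B, W, J, levels=30)  # x-axis: b, y-axis: w, as in the figure above
plt.xlabel("b")
plt.ylabel("w")
plt.show()
```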
From the above, we see that the cost depends on w and b. Next: how do we find the w and b that minimize the cost function? The method: the gradient descent algorithm
Gradient descent
Gradient descent is an algorithm that can minimize an arbitrary function by repeatedly stepping in the direction in which the function decreases fastest
For example, in the figure below: imagine a person at a high point on the surface of J (the cost function), walking down to the lowest point step by step
Choosing different starting points may lead to different local minima
Implementing gradient descent
- Update formula (the same formula applies to b)
α: the learning rate (between 0 and 1), which controls the size of each descent step
$$w = w - \alpha \frac{\partial J(w,b)}{\partial w}$$
Each update fine-tunes the current w
$\frac{\partial J(w,b)}{\partial w}$: determines the direction in which to descend
Note: the updates of w and b must be performed simultaneously, as sketched below
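A sketch of one correct update step (the helper name is illustrative; dJ_dw and dJ_db stand for functions computing the two partial derivatives):

```python
def gradient_step(w, b, alpha, dJ_dw, dJ_db):
    """One simultaneous update: both partial derivatives are evaluated at the
    CURRENT (w, b) before either parameter is overwritten."""
    tmp_w = w - alpha * dJ_dw(w, b)
    tmp_b = b - alpha * dJ_db(w, b)
    return tmp_w, tmp_b

# Incorrect (non-simultaneous) version, for contrast:
#   w = w - alpha * dJ_dw(w, b)
#   b = b - alpha * dJ_db(w, b)   # bug: this evaluates dJ_db at the NEW w
```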
Understanding the gradient descent algorithm
Set b to 0; the cost function is then a curve in w alone:
$\frac{\partial J(w,b)}{\partial w}$: the slope of the tangent line
If the slope is positive, the update decreases w, moving it left toward the minimum in the middle; if the slope is negative, w increases and moves right toward the minimum
Understanding the learning rate α
- If the learning rate is too small, it takes many steps to reach the minimum
- If the learning rate is too large, each step may overshoot the minimum, and the algorithm may fail to converge or even diverge
Also, if gradient descent has reached a local minimum, it stays there,
because the slope there is already 0
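A toy illustration of both effects, using the one-parameter cost J(w) = w² (so dJ/dw = 2w; the starting point and learning rates are arbitrary):

```python
def run(alpha, steps=10, w=1.0):
    """Run gradient descent on J(w) = w^2, whose minimum is at w = 0."""
    for _ in range(steps):
        w = w - alpha * 2 * w  # update with dJ/dw = 2w; at w = 0 the step is 0
    return w

print(run(alpha=0.01))  # ~0.82: too small -- barely moved toward the minimum
print(run(alpha=0.1))   # ~0.11: reasonable -- shrinking steadily toward 0
print(run(alpha=1.1))   # ~6.19: too large -- |w| grows each step (diverges)
```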
Gradient Descent for Linear Regression
The model for linear regression: $f_{w,b}(x) = wx + b$
Substituting it into J(w, b) and taking the partial derivatives gives:

$$\frac{\partial J(w,b)}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)}$$

$$\frac{\partial J(w,b)}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$$

These derivatives are then substituted into the gradient descent update rule and applied repeatedly until convergence
(Figure: a simulated run of gradient descent on the cost surface.)
Given any starting values of w and b, this method finds a **local minimum of the cost function**; for linear regression with the squared-error cost, the surface is convex (bowl-shaped), so that local minimum is also the global minimum, yielding the best-fitting line
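Putting the pieces together, here is a minimal batch gradient descent loop for univariate linear regression (the data, learning rate, and iteration count are illustrative):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iters=5000):
    """Batch gradient descent for f_{w,b}(x) = w*x + b with squared-error cost."""
    w, b = 0.0, 0.0
    m = len(x)
    for _ in range(iters):
        err = w * x + b - y                  # f_{w,b}(x^(i)) - y^(i), for all i
        dJ_dw = np.dot(err, x) / m           # dJ/dw from the formula above
        dJ_db = np.sum(err) / m              # dJ/db from the formula above
        w, b = w - alpha * dJ_dw, b - alpha * dJ_db  # simultaneous update
    return w, b

x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([300.0, 500.0, 700.0])    # lie exactly on y = 200x + 100
print(gradient_descent(x_train, y_train))    # ~ (200.0, 100.0)
```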
- Glossary
- Batch gradient descent: every update step uses all of the examples in the training set