ML Learning 2: Univariate Linear Regression

2-1 Model description

We want to use a data set containing housing prices in Portland, Oregon. The data set records the sale price for houses of different sizes. For example, if your friend's house is 1,250 square feet, you want to tell them how much the house might sell for.

One thing you can do is build a model, perhaps a straight line. From this model, you might tell your friend that the house can sell for about 220,000 (USD). This is an example of a supervised learning algorithm.


I will use lowercase m for the number of training samples throughout the course.

Taking the previous housing transaction problem as an example, suppose we have the training set for this problem shown in the table below.


The labels we will use to describe this regression problem are as follows:

m represents the number of instances in the training set

x represents the feature / input variable

y represents the target variable / output variable

(x, y) represents an instance in the training set

(x(i), y(i)) represents the i-th observation instance

h represents the solution or function produced by the learning algorithm, also called the hypothesis

 

We can see that our training set contains house prices. We feed it to our learning algorithm; the learning algorithm does its work and then outputs a function, usually denoted by a lowercase h.

h stands for hypothesis; it is a function whose input is the size of a house, such as the house your friend wants to sell, and which derives the y value from the input x value, where y corresponds to the price of the house. Therefore, h is a function mapping from x to y.

So, how do we express h for our house price prediction problem?

One possible way to express it is: hθ(x) = θ0 + θ1x. Because it contains only one feature / input variable (the area of the house), such a problem is called a univariate linear regression problem.

2-2 Cost function

We will define the concept of the cost function, which helps us figure out how to fit the best possible straight line to our data.


In linear regression we have a training set like this, where m represents the number of training samples, for example m = 47. Our hypothesis function, the form of the function used to make predictions, is hθ(x) = θ0 + θ1x.


We will introduce some terminology. What we have to do now is choose appropriate parameters θ0 and θ1 for our model. In the housing price problem, θ1 is the slope of the straight line and θ0 is the intercept on the y-axis.

Modeling error: the parameters we choose determine how accurately our straight line fits the training set; the modeling error is the gap between the value predicted by the model and the actual value in the training set (the blue lines in the figure below).

Our goal is to select the model parameters that minimize the sum of squared modeling errors, that is, to minimize the cost function J(θ0, θ1) = (1/(2m)) Σ (hθ(x(i)) − y(i))².

We draw a contour map whose three coordinates are θ0, θ1, and J(θ0, θ1):


It can be seen that there is a minimum point in three-dimensional space.

The cost function is also called the squared error function, or sometimes the squared error cost function. We use the sum of squared errors because the squared error cost function is a reasonable choice for most problems, especially regression problems.
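The squared error cost J(θ0, θ1) = (1/(2m)) Σ (hθ(x(i)) − y(i))² can be sketched directly in Python; the tiny training set below is made up for illustration:

```python
def compute_cost(theta0, theta1, xs, ys):
    """Squared error cost J(theta0, theta1):
    the average of squared modeling errors, divided by 2."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2
               for x, y in zip(xs, ys)) / (2 * m)

# Tiny made-up training set lying exactly on y = x:
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]
print(compute_cost(0.0, 1.0, xs, ys))  # 0.0 -- a perfect fit has zero cost
print(compute_cost(0.0, 0.5, xs, ys))  # positive -- a worse fit costs more
```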

2-3 Understanding the cost function (1)

Let's build some intuition through examples and see what the cost function is doing.


Next, we analyze the simplified case where θ0 = 0, so the hypothesis becomes hθ(x) = θ1x and the cost function depends on θ1 alone.
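A quick sketch of this simplified case: with θ0 fixed at 0, J is a function of θ1 alone, and evaluating it at a few values shows the bowl shape around the best fit. The training set here is made up for illustration:

```python
def J(theta1, xs, ys):
    """Cost as a function of theta1 alone (theta0 fixed at 0)."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Made-up training set lying on y = x, so the minimum is at theta1 = 1.
xs, ys = [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]
for t1 in [0.0, 0.5, 1.0, 1.5, 2.0]:
    # Cost falls as theta1 approaches 1, then rises again symmetrically.
    print(t1, J(t1, xs, ys))
```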


2-4 Understanding the cost function (2)

The contour plot of the cost function shows that there is a point in three-dimensional space that minimizes J(θ0, θ1).

Suppose the straight line we fit is as shown in the figure. Now we have two parameters θ0 and θ1, and its cost function is as follows:


The figure below shows a contour map. Each point on the contour map corresponds to a particular slope and intercept. The closer the point is to the center, the better the straight line fits.


Through these graphics, I hope you can better understand what the values of the cost function J correspond to, which hypothesis each point represents, and which hypotheses are closer to the minimum of the cost function J.

2-5 Gradient descent

Gradient descent is an algorithm for finding the minimum of a function. We will use the gradient descent algorithm to find the minimum of the cost function J(θ0, θ1).

The idea behind gradient descent is: we start by choosing a random combination of parameters (θ0, θ1), compute the cost function, and then look for the next parameter combination that makes the cost function value decrease the most. We repeat this until we reach a local minimum. Because we have not tried all parameter combinations, we cannot be sure that the local minimum we find is the global minimum; choosing different initial parameter combinations may lead to different local minima.

Imagine that you are standing at a point on a mountain, on a red hill in the park you are picturing. In the gradient descent algorithm, what we do is turn 360 degrees, look all around us, and ask: in which direction should I take a small step to go downhill as quickly as possible? We consider each step this way until we approach a local lowest point.

The formula of the batch gradient descent algorithm is:

θj := θj − α · (∂/∂θj) J(θ0, θ1)   (simultaneously for j = 0 and j = 1)

Here α is the learning rate, which determines how big a step we take downhill in the direction in which the cost function decreases the most. The larger α is, the bigger the steps we take going down the mountain. In batch gradient descent, on every iteration we simultaneously update all parameters by subtracting the learning rate times the partial derivative of the cost function.


In the gradient descent algorithm, the correct way to achieve a simultaneous update is to compute both new parameter values into temporary variables first, and only then assign both at once.
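A sketch of one such step for univariate linear regression, with the simultaneous update made explicit through temporary variables (the data set passed in at the end is made up):

```python
def gradient_step(theta0, theta1, xs, ys, alpha):
    """One batch gradient descent step with a correct simultaneous update."""
    m = len(xs)
    # Both partial derivatives are computed from the OLD parameters...
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    temp0 = theta0 - alpha * sum(errors) / m
    temp1 = theta1 - alpha * sum(e * x for e, x in zip(errors, xs)) / m
    # ...and only then are both parameters replaced together.
    return temp0, temp1

# One step from (0, 0) on a tiny illustrative data set.
t0, t1 = gradient_step(0.0, 0.0, [1.0, 2.0], [1.0, 2.0], alpha=0.1)
print(t0, t1)  # 0.15 0.25
```

If θ0 were overwritten before computing the derivative for θ1, the second update would use a partially new, partially old parameter vector, which is the incorrect sequential update the lecture warns against.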

2-6 Summary of knowledge points of gradient descent


We assign θ a value that makes J(θ) move in the direction of fastest descent, and iterate until we finally reach a local minimum. The learning rate α determines how big a step we take in the direction that makes the cost function decrease the most.


Now, suppose the line has a positive slope, meaning the derivative is positive. The updated θ1 then equals the old θ1 minus α times a positive number, so θ1 decreases, moving toward the minimum.

Let's see what happens if α is too small or too large:

If α is too small, that is, the learning rate is too small, the result is that we can only inch toward the lowest point in tiny baby steps, so it takes many steps to reach it.


If α is too large, the gradient descent method may overshoot the lowest point and may even fail to converge. Each iteration takes such a big step that it crosses the lowest point again and again, getting farther and farther from it. So if α is too large, gradient descent may fail to converge, or even diverge.


If your parameter is already at a local lowest point, the slope there is 0, so the gradient descent update does nothing; it does not change the parameter's value. This also explains why gradient descent can converge to a local lowest point even with the learning rate α held fixed.


In the gradient descent method, as we approach a local minimum, the method automatically takes smaller steps. This is because the derivative is zero at the local minimum, so as we get close to it the derivative value automatically becomes smaller and smaller, and with it the update step α times the derivative. This is how gradient descent works.
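This shrinking-step behavior can be sketched on a toy cost J(θ) = θ², whose derivative is 2θ (a made-up one-dimensional example, not the housing cost function):

```python
# Minimize J(theta) = theta**2 with a FIXED learning rate.
# The step size alpha * derivative shrinks on its own as theta nears 0.
alpha = 0.1
theta = 1.0
steps = []
for _ in range(5):
    step = alpha * 2 * theta  # derivative of theta**2 is 2*theta
    steps.append(step)
    theta -= step
# Each step is smaller than the last, even though alpha never changed.
print(steps)
```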

2-7 Gradient descent of linear regression

The comparison between the gradient descent algorithm and the linear regression algorithm is as follows:


The key to applying the gradient descent method to our previous linear regression problem is to find the partial derivatives of the cost function, namely:

(∂/∂θ0) J(θ0, θ1) = (1/m) Σ (hθ(x(i)) − y(i))

(∂/∂θ1) J(θ0, θ1) = (1/m) Σ (hθ(x(i)) − y(i)) · x(i)

The algorithm we just used is sometimes called batch gradient descent. This means that in each step of gradient descent we use all m training samples: when computing the derivative term, we perform a summation over all m training samples on every single iteration.
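Putting the pieces together, batch gradient descent for univariate linear regression can be sketched as below. The synthetic data set lies on the line y = 2x + 1, so the descent should recover parameters near θ0 = 1, θ1 = 2 (the data, α, and iteration count are illustrative):

```python
def batch_gradient_descent(xs, ys, alpha=0.01, iterations=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x.
    Every iteration sums the error over all m training samples."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                               # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m    # dJ/dtheta1
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Synthetic data on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
t0, t1 = batch_gradient_descent(xs, ys, alpha=0.1, iterations=2000)
print(round(t0, 3), round(t1, 3))  # close to 1.0 and 2.0
```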

Now that we have mastered gradient descent, we can use the gradient descent method in different environments, and we will also use it extensively in different machine learning problems.


Origin www.cnblogs.com/lmr7/p/12684547.html