2. Learning regression - predicting clicks based on advertising costs


1. Problem setup

We will learn regression by taking the relationship between web advertising costs and clicks as an example.

Premise: the more money invested in advertising, the more clicks the ad receives.

From past data, the following figure can be plotted:

Now suppose we want to invest 200 yuan in advertising; how many clicks will that bring?

From the figure, about 500 clicks.

This is machine learning: using known data to learn, and then making predictions.

2. Define the model

If we know the form of the function that the points in the graph follow, we can determine the number of clicks from the advertising cost.

Note: "hits contain noise", so it is impossible for the function to pass through all points.

The data can be modeled with a linear function:

y = θ0 + θ1·x

Here θ0 is the intercept and θ1 is the slope.

In this function, x is the advertising cost and y is the number of clicks.

This linear function can be thought of as a model that predicts the number of clicks from the advertising cost, with θ0 and θ1 as the model's parameters. The quality of the model (that is, the accuracy of its predictions) depends directly on these parameter values.

We don't yet know which values of these two parameters are optimal for this model.

Following the usual mathematical trick of making a hypothesis when you don't know, let's simply guess first. Assume θ0 = 1 and θ1 = 2; the linear function above then becomes:

y = 1 + 2x

According to this hypothetical model, if we invest x = 100 yuan in advertising, the predicted number of clicks is:

y = 1 + 2 × 100 = 201

Compare this with the actual data from before:

Clearly the assumed values θ0 = 1 and θ1 = 2 are wrong: a model built on these two parameters cannot produce correct results.
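As a minimal sketch of this hypothetical model (pure illustration; the parameter values are just the guesses above):

```python
# Hypothetical model with the guessed parameters theta0 = 1, theta1 = 2.
def f(x, theta0=1.0, theta1=2.0):
    """Predict clicks from advertising cost x with a linear model."""
    return theta0 + theta1 * x

print(f(100))  # 201.0 -- far from the actual click counts, so the guess is poor
```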

Next, we are going to use machine learning to find the correct θ0 and θ1.

3. Least squares method

First, rewrite the function in a different notation:

fθ(x) = θ0 + θ1·x

Written this way, you can see at a glance that this is a function of the variable x with parameter θ.

We already have actual data relating advertising cost to click volume, as follows:

Represented in the figure:

Using the parameters we guessed at random, we get fθ(x) = 1 + 2x.

Substituting the actual advertising costs gives the predicted click volumes:

The predictions computed from these randomly chosen parameters differ greatly from the actual situation: there is a clear deviation between the calculated values and the true ones.

Ideally, the predicted value agrees with the actual value, i.e. y − fθ(x) = 0; the error between y and fθ(x) is zero.

However, it is impossible to make every error zero, so what we can do is make the sum of the errors over all points as small as possible.

Expressed as a formula:

E(θ) = (1/2) · Σᵢ₌₁ⁿ ( y(i) − fθ(x(i)) )²

This expression is called the objective function; the E in E(θ) is the first letter of Error. Problems of this kind are called optimization problems.

Important points:

1. The i in x(i) and y(i) is not an exponent; it denotes the i-th training example.

2. The errors are squared so that negative errors do not cancel positive ones.

3. The factor 1/2 makes the expression easier to differentiate and does not change where the function attains its minimum.

Substituting the original advertising cost and click data into this expression gives:

The error is large, so what we need to do is make it smaller, bringing the predictions closer to the actual values. Finding the parameters that minimize this sum of squared errors is called the least squares method.
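As a sketch, the objective function could be computed like this; the data arrays below are made-up placeholders standing in for the article's table:

```python
import numpy as np

# Hypothetical training data: advertising cost x and observed clicks y.
x = np.array([58, 70, 81, 84, 118, 135, 151, 170, 178, 185], dtype=float)
y = np.array([374, 385, 375, 401, 404, 442, 482, 504, 528, 533], dtype=float)

def f(x, theta0, theta1):
    """Linear model f_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def E(theta0, theta1):
    """Objective function: half the sum of squared errors."""
    return 0.5 * np.sum((y - f(x, theta0, theta1)) ** 2)

print(E(1.0, 2.0))  # a large error -> the guessed parameters fit poorly
```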

To make E(θ) smaller and smaller, we could keep changing the value of the parameter θ, comparing the result with the actual values and modifying it again and again, but that is far too tedious. The better way is to use differentiation.

Example:

Consider the quadratic function g(x) = (x − 1)².

First differentiate g(x):

g'(x) = 2(x − 1) = 2x − 2

From the sign of g'(x) we get its increase/decrease table: for x < 1, g'(x) < 0 and g(x) decreases; for x > 1, g'(x) > 0 and g(x) increases; the minimum is at x = 1.

At x = 3, to make the value of g(x) smaller we need to move x to the left, that is, decrease x.

At x = −1, on the other side, to make g(x) smaller we need to move x to the right, that is, increase x.

In other words, the direction in which to move x is determined by the sign of the derivative: moving opposite to the sign of the derivative takes g(x) toward its minimum.

Summarizing the above as an expression:

x := x − η · g'(x)

This is the method of steepest descent, or gradient descent.

The notation A := B means that A is defined (updated) in terms of B. In simple terms, the previous x is used to define the new x.

η is called the learning rate (the Greek letter, pronounced "eta"). Depending on the learning rate, the number of updates needed to reach the minimum changes; in other words, the convergence speed differs. Sometimes the method even fails to converge and diverges indefinitely.

So the value of η is very important. Take η = 1, starting from x = 3:

x := 3 − 1·(2·3 − 2) = 3 − 4 = −1
x := −1 − 1·(2·(−1) − 2) = −1 + 4 = 3
x := 3 − 4 = −1
…

x jumps between 3 and −1 forever: an endless loop.

Now let η = 0.1, again starting from x = 3:

x := 3 − 0.1·(2·3 − 2) = 2.6
x := 2.6 − 0.1·(2·2.6 − 2) = 2.28
x := 2.28 − 0.1·(2·2.28 − 2) = 2.024
…

x moves steadily toward the minimum at x = 1.

If η is large, x := x − η(2x − 2) jumps back and forth between values and may even move away from the minimum; this is the divergent state. When η is small, each move is small and the number of updates grows, but the value does head steadily toward convergence.
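A minimal sketch of this update rule for g(x) = (x − 1)², showing the effect of the learning rate (the values match the walk-through above):

```python
def gradient_descent(x, eta, steps):
    """Repeatedly apply x := x - eta * g'(x), where g(x) = (x - 1)^2."""
    for _ in range(steps):
        x = x - eta * (2 * x - 2)  # g'(x) = 2x - 2
    return x

print(gradient_descent(3.0, eta=1.0, steps=11))  # -1.0: stuck jumping between 3 and -1
print(gradient_descent(3.0, eta=0.1, steps=50))  # ~1.0: converges toward the minimum
```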

Now let's return to the objective function for advertising cost and click volume:

E(θ) = (1/2) · Σᵢ₌₁ⁿ ( y(i) − fθ(x(i)) )²

This objective function contains fθ(x), and fθ(x) has two parameters, θ0 and θ1. In other words, E is a function of the two variables θ0 and θ1, so ordinary differentiation is not enough; we need partial differentiation.

We compute the partial derivative of E(θ) with respect to θ0.

E(θ) contains fθ(x), and fθ(x) contains θ0, so this is a composite function. Let u = E(θ) and v = fθ(x); then we can differentiate step by step:

∂u/∂θ0 = (∂u/∂v) · (∂v/∂θ0)

Start by differentiating u with respect to v:

∂u/∂v = Σᵢ₌₁ⁿ ( v − y(i) )

Then differentiate v with respect to θ0 (recall v = θ0 + θ1x):

∂v/∂θ0 = 1

Substituting v = fθ(x) back in:

∂u/∂θ0 = Σᵢ₌₁ⁿ ( fθ(x(i)) − y(i) )

Similarly, differentiating v with respect to θ1 gives ∂v/∂θ1 = x, so:

∂u/∂θ1 = Σᵢ₌₁ⁿ ( fθ(x(i)) − y(i) ) x(i)

Finally, the update expressions for the parameters θ0 and θ1 are obtained:

θ0 := θ0 − η Σᵢ₌₁ⁿ ( fθ(x(i)) − y(i) )

θ1 := θ1 − η Σᵢ₌₁ⁿ ( fθ(x(i)) − y(i) ) x(i)

As long as θ0 and θ1 are updated according to these expressions, the correct linear function fθ(x) can be found.

Using this method to find the correct fθ(x), we can then input any advertising cost and obtain the corresponding number of clicks. In this way, we can predict clicks from advertising costs.
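Putting the update expressions together, here is a sketch of the full training loop. The data is the same hypothetical placeholder as before, and standardizing x is an added convenience (not from the article) that helps convergence:

```python
import numpy as np

# Hypothetical training data (advertising cost -> clicks).
x = np.array([58, 70, 81, 84, 118, 135, 151, 170, 178, 185], dtype=float)
y = np.array([374, 385, 375, 401, 404, 442, 482, 504, 528, 533], dtype=float)

x = (x - x.mean()) / x.std()  # standardize x so gradient descent behaves well

theta0, theta1 = np.random.rand(2)  # random initial parameters
eta = 1e-3                          # learning rate

for _ in range(10000):
    f = theta0 + theta1 * x  # predictions with the current parameters
    # Update both parameters simultaneously, using the expressions above.
    theta0 = theta0 - eta * np.sum(f - y)
    theta1 = theta1 - eta * np.sum((f - y) * x)

print(theta0, theta1)  # fitted intercept and slope
```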

4. Polynomial regression

Previously we fitted the data with a straight line, but in fact a curve fits these points better than a straight line.

The curve corresponds to a quadratic function:

fθ(x) = θ0 + θ1x + θ2x²

An expression of even higher degree is also possible:

fθ(x) = θ0 + θ1x + θ2x² + … + θnxⁿ

However, a higher degree does not automatically mean a better fit: raising the degree too far leads to the problem of overfitting.

The resulting parameter update expressions (for the quadratic case):

θ0 := θ0 − η Σᵢ₌₁ⁿ ( fθ(x(i)) − y(i) )
θ1 := θ1 − η Σᵢ₌₁ⁿ ( fθ(x(i)) − y(i) ) x(i)
θ2 := θ2 − η Σᵢ₌₁ⁿ ( fθ(x(i)) − y(i) ) x(i)²

Increasing the degree of the polynomial used in the function like this, and then fitting it, is called polynomial regression.
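A sketch of the quadratic case, reusing the same placeholder setup; the only change is an extra x² column in the design matrix:

```python
import numpy as np

# Same hypothetical data as before, standardized.
x = np.array([58, 70, 81, 84, 118, 135, 151, 170, 178, 185], dtype=float)
y = np.array([374, 385, 375, 401, 404, 442, 482, 504, 528, 533], dtype=float)
x = (x - x.mean()) / x.std()

# Design matrix with columns [1, x, x^2] for f(x) = theta0 + theta1*x + theta2*x^2.
X = np.vstack([np.ones_like(x), x, x ** 2]).T
theta = np.random.rand(3)
eta = 1e-3

for _ in range(10000):
    f = X @ theta                        # predictions for all examples at once
    theta = theta - eta * X.T @ (f - y)  # one gradient step for every theta_j together

print(theta)
```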

5. Multiple regression

So far we have predicted the number of clicks from a single variable (the advertising cost x), but in practice many problems involve two or more variables.

For example, besides the advertising cost, the number of clicks is also affected by factors such as the ad's position and the size of the ad space; that is, there are multiple variables x.

To keep the problem as simple as possible, this time we consider only the size of the ad space. Let the advertising cost be x1, the width of the ad column be x2, and the height of the ad column be x3; then fθ can be written as follows:

fθ(x1, x2, x3) = θ0 + θ1x1 + θ2x2 + θ3x3

We could differentiate this function as before, but first it is worth simplifying the notation.

Simplification: treat the parameters θ and the variables x as vectors:

θ = [θ0, θ1, θ2, θ3]ᵀ,  x = [x1, x2, x3]ᵀ

For ease of calculation, make the two vectors the same length by introducing a dummy variable x0 = 1:

x = [x0, x1, x2, x3]ᵀ,  x0 = 1

Transposing θ and multiplying by x then reproduces the function:

fθ(x) = θᵀx = θ0x0 + θ1x1 + θ2x2 + θ3x3

Therefore, in actual programming, it can be expressed with an ordinary one-dimensional array:
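For example, with NumPy (a sketch; the numbers are arbitrary):

```python
import numpy as np

theta = np.array([1.0, 2.0, 3.0, 4.0])  # parameters theta0..theta3
x = np.array([1.0, 100.0, 20.0, 10.0])  # x0 = 1 (dummy), then x1..x3

f = np.dot(theta, x)  # f_theta(x) = theta^T x
print(f)              # 1 + 2*100 + 3*20 + 4*10 = 301.0
```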

The differentiation is the same as before, so we only need the derivative of v = fθ(x) with respect to θj:

∂v/∂θj = xj

The update expression for the j-th parameter is then:

θj := θj − η Σᵢ₌₁ⁿ ( fθ(x(i)) − y(i) ) xj(i)

Regression involving multiple variables like this is called multiple regression.

6. Stochastic gradient descent

The gradient descent algorithm described above has two shortcomings: each update uses all of the training data, so repeated updates take a long time, and it can easily fall into a local optimum.

For example, the following function:

Depending on where the initial x is placed, the method can end up trapped in a local optimum.

Stochastic gradient descent is based on the steepest descent method.

The parameter update expression for the steepest descent method:

θj := θj − η Σᵢ₌₁ⁿ ( fθ(x(i)) − y(i) ) xj(i)

This expression uses the errors of all the training data. In stochastic gradient descent, by contrast, one training example is selected at random and used to update the parameters:

θj := θj − η ( fθ(x(k)) − y(k) ) xj(k)

Here k is the index of the randomly selected example. In the time the steepest descent method takes for a single update (which touches all n examples), stochastic gradient descent can perform n updates. Moreover, because the training example is chosen at random and the gradient used in each update depends on that choice, stochastic gradient descent is less likely to fall into a local optimum of the objective function.
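A sketch of stochastic gradient descent for the simple linear case, under the same placeholder data (the design matrix X carries a leading column of ones for θ0):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data, standardized, with a leading 1s column for theta0.
x = np.array([58, 70, 81, 84, 118, 135, 151, 170, 178, 185], dtype=float)
y = np.array([374, 385, 375, 401, 404, 442, 482, 504, 528, 533], dtype=float)
x = (x - x.mean()) / x.std()
X = np.vstack([np.ones_like(x), x]).T  # rows are [1, x(i)]

theta = rng.random(2)
eta = 1e-2

for _ in range(10000):
    k = rng.integers(len(y))                   # pick one example at random
    f_k = X[k] @ theta                         # prediction for that example only
    theta = theta - eta * (f_k - y[k]) * X[k]  # single-example gradient step

print(theta)
```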

There is also a method that randomly selects m training examples to update the parameters.

Let K be the index set of the m randomly selected training examples; the update becomes:

θj := θj − η Σ(k∈K) ( fθ(x(k)) − y(k) ) xj(k)

Suppose there are 100 training examples; with m = 10, we create a set of 10 random indexes, e.g. K = {61, 53, 59, 16, 30, 21, 85, 31, 51, 10}, and then repeatedly update the parameters using it.

This method is called mini-batch gradient descent.
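The mini-batch variant only changes how the indices are drawn; a sketch under the same assumptions, with a hypothetical batch size m = 4:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same hypothetical setup as in the stochastic sketch above.
x = np.array([58, 70, 81, 84, 118, 135, 151, 170, 178, 185], dtype=float)
y = np.array([374, 385, 375, 401, 404, 442, 482, 504, 528, 533], dtype=float)
x = (x - x.mean()) / x.std()
X = np.vstack([np.ones_like(x), x]).T

theta = rng.random(2)
eta, m = 1e-2, 4  # learning rate and mini-batch size (hypothetical choices)

for _ in range(5000):
    K = rng.choice(len(y), size=m, replace=False)  # random index set K of size m
    f_K = X[K] @ theta                             # predictions for the mini-batch
    theta = theta - eta * X[K].T @ (f_K - y[K])    # gradient summed over k in K

print(theta)
```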
