01 Gradient descent, learning rate, loss function

Concept introduction

For an independent variable x, such as time, we can observe a corresponding value y, such as temperature. By observing repeatedly, we obtain a series of true (time, temperature) pairs, namely (x1, y1), (x2, y2), ..., (xn, yn).

Suppose we know that temperature is directly proportional to time, that is, y = k*x. We do not know what value k should take, but we can make a guess.

Assume k=2, that is, we guess the relationship between temperature and time is y = 2*x. From the true pairs above, we can then compute a series of predicted pairs (time, predicted temperature), namely (x1, 2*x1), (x2, 2*x2), ..., (xn, 2*xn).

With k=2 we can now compute a loss, i.e., the error between the predicted values and the true values. Every observation point contributes an error, and each one should be counted, so the total error is \sum_{i=1}^{n}(y_i - 2x_i)^{2} and the average error is \frac{1}{n}\sum_{i=1}^{n}(y_i - 2x_i)^{2}. The square is there to eliminate the sign of each error: if, say, y_1 - 2x_1 were negative and we added it to the total directly, it would actually make the total error smaller, which is clearly not acceptable.
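To make this concrete, here is a minimal sketch in Python. The observation data below are made up for illustration (the post gives no concrete numbers); it simply computes the total and average error for the guess k = 2:

```python
# Hypothetical (time, temperature) observations; in this made-up data the true relationship is y = 3.5*x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]     # times
ys = [3.5, 7.0, 10.5, 14.0, 17.5]  # temperatures

k = 2.0  # our guess

predictions = [k * x for x in xs]                        # (x1, 2*x1), (x2, 2*x2), ...
squared_errors = [(y - p) ** 2 for y, p in zip(ys, predictions)]

total_error = sum(squared_errors)        # sum over i of (y_i - 2*x_i)^2
average_error = total_error / len(xs)    # divided by n

print(total_error, average_error)
```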

Above we assumed a value of k and obtained a loss. But why did we have to assume it? Because we know temperature = k*time but do not know k, so guessing is all we can do. The real question is: how do we make our guess approach the true value of k? First, it is clear that the true k would make the error exactly 0 (in this idealized, noise-free setting), while every other value gives an error greater than 0 (see the small check below). So our goal becomes clear: keep reducing the error toward zero. Of course, 0 is the ideal state and is rarely actually reached, so as long as the loss is small enough, the estimated value of k will meet our needs.
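As a quick check, one can evaluate the average error at a few candidate values of k using the same hypothetical data as above (where the true k is 3.5, an assumption of the sketch, not a value from the post):

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.5, 7.0, 10.5, 14.0, 17.5]

for k in [1.0, 2.0, 3.0, 3.5, 4.0]:
    avg = sum((y - k * x) ** 2 for x, y in zip(xs, ys)) / len(xs)
    print(k, avg)  # the average error is 0 only at the true k = 3.5, positive everywhere else
```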

Gradient descent, learning rate, loss function

Let's look at the average error function, or loss function: averageLoss = \frac{1}{n}\sum_{i=1}^{n}(y_i - k x_i)^{2}. Our goal is to make averageLoss smaller and smaller. Notice that this function has only one unknown, k, so it is simply a quadratic function of one variable! How do we find its minimum? Seeing a one-variable quadratic, you may instantly recall the formula from junior high school, but that method cannot handle the minimum of a quadratic in several variables, whereas taking gradients can. What is the gradient? It's simple: for a function of one variable it is just the derivative (the slope of the tangent line), and for a multivariate function we obtain it from the partial derivatives. Anyone who has studied this before should find it familiar.
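Here is a minimal sketch of gradient descent on this one-parameter loss. The derivative of averageLoss with respect to k is -\frac{2}{n}\sum_{i=1}^{n} x_i (y_i - k x_i), and each step moves k a small amount in the opposite direction of this derivative, scaled by a learning rate. The data, learning rate, and iteration count below are illustrative assumptions, not values from the original post:

```python
# Gradient descent on averageLoss(k) = mean((y - k*x)^2) for a single parameter k.
# Data, learning rate, and step count are illustrative assumptions.

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.5 * x for x in xs]   # synthetic observations with "true" k = 3.5 (unknown to the algorithm)

def average_loss(k, xs, ys):
    """Mean squared error between predictions k*x and true values y."""
    n = len(xs)
    return sum((y - k * x) ** 2 for x, y in zip(xs, ys)) / n

def gradient(k, xs, ys):
    """d(average_loss)/dk = -(2/n) * sum(x_i * (y_i - k*x_i))."""
    n = len(xs)
    return -2.0 / n * sum(x * (y - k * x) for x, y in zip(xs, ys))

k = 2.0               # initial guess, as in the text
learning_rate = 0.01  # step size; too large can diverge, too small converges slowly
for step in range(200):
    k -= learning_rate * gradient(k, xs, ys)

print(k)                        # should be close to 3.5
print(average_loss(k, xs, ys))  # should be close to 0
```

The learning rate controls how far each update moves k along the negative gradient; it is the knob that trades off speed of convergence against stability.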

Next is my personal thinking:



Origin blog.csdn.net/qq_40923413/article/details/108172580