The difference between gradient descent and the least squares method (detailed explanation)

1. Common points:

(1) They are essentially the same. Given known data (independent variables and the dependent variable), both methods fit a hypothesis function that predicts the dependent variable, and then solve for the optimal parameters of that hypothesis function.

(2) The goal is the same. Within the framework of the known data, both methods make the sum of squared differences between the estimated values and the actual values as small as possible. (In practice, gradient descent tends to use the mean squared error, i.e., the sum of squared errors divided by the number of samples, to avoid an excessively large loss value. Since the point that minimizes the mean squared error is also the point that minimizes the sum of squared errors, this rescaling does not affect the optimization of the loss function.)
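As a small sketch of this shared objective (my own illustration, not from the original post; the names X, y, and theta are assumptions), the two losses below differ only by the constant factor 1/m, so they are minimized by the same θ:

```python
import numpy as np

def sum_squared_errors(X, y, theta):
    """Sum of squared differences between the estimates X @ theta and the actual values y."""
    residuals = X @ theta - y
    return residuals @ residuals

def mean_squared_error(X, y, theta):
    """Sum of squared errors divided by the number of samples m."""
    m = len(y)
    return sum_squared_errors(X, y, theta) / m

# Dividing by m only rescales the loss surface; the theta that minimizes
# one loss also minimizes the other.
```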

2. Differences:

(1) The implementation methods are different. Least squares sets the derivative of the loss function to zero and, through algebraic manipulation of the independent and dependent variables, reaches the lowest point directly without any iteration (the optimal θ is obtained in closed form, without having to supply an initial value of θ). Gradient descent, by contrast, starts from an initial guess of the parameters and repeatedly adjusts them in the direction opposite to the gradient, iterating until it reaches the lowest point (an initial value of θ is given, and the optimal θ is approached step by step). Both approaches are sketched in the code after this list.

(2) The results are different. Least squares is an all-or-nothing problem: either a solution is found (the "1" case) or the matrix is not invertible and there is no closed-form solution (the "0" case). Gradient descent is a "0.x" problem: it produces an approximate solution that gradually approaches the exact one.

(3) The applicability is different. Least squares only works for problems where setting the partial derivatives of the loss function with respect to the regression coefficients to zero yields an analytical solution through algebraic manipulation, such as linear regression. Gradient descent is more widely applicable: as long as the partial derivatives of the loss function can be computed, even just numerically, it can be used.

(4) The solution obtained by least squares is the global optimum; the solution obtained by gradient descent may only be a local optimum. However, if the loss function is convex, as it is for linear regression with squared error, the solution found by gradient descent is also the global optimum.
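To make differences (1) and (4) concrete, here is a minimal sketch (my own illustration, not from the original post) that fits the same simulated linear-regression data in both ways: the normal equation (XᵀX)θ = Xᵀy for least squares, and iterative gradient-descent updates on the mean squared error. The data, learning rate, and iteration count are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: y = 2 + 3x plus a little noise, with a bias column added to X.
m = 100
x = rng.uniform(0, 1, size=m)
y = 2 + 3 * x + rng.normal(0, 0.1, size=m)
X = np.column_stack([np.ones(m), x])          # shape (m, 2)

# Least squares: solve the normal equation (X^T X) theta = X^T y directly.
# No initial theta and no iteration are needed; it fails only if X^T X is singular.
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent: start from an initial guess and repeatedly step
# opposite to the gradient of the mean squared error.
theta_gd = np.zeros(2)                        # initial value of theta
learning_rate = 0.5
for _ in range(5000):
    gradient = (2 / m) * X.T @ (X @ theta_gd - y)
    theta_gd -= learning_rate * gradient

print("least squares   :", theta_ls)   # exact closed-form solution
print("gradient descent:", theta_gd)   # approximation converging to the same values
```

Because the squared-error loss of linear regression is convex, the iterative estimate ends up at (essentially) the same θ as the closed-form solution, which is the point made in (4).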

Origin blog.csdn.net/hu_666666/article/details/127204192