Why does the loss function of linear regression use least squares instead of the likelihood function?

Simply put, the squared form embodies the idea of "least squares": "squares" refers to using squared differences to measure the distance between each observed point and the corresponding estimated point, and "least" refers to choosing the parameter values so that the sum of these squared distances over all observation points is minimized.

The least squares method uses the sum of squared differences between the estimated values and the observed values as the loss function. Under the premise that the error follows a normal distribution, its idea is essentially the same as that of maximum likelihood estimation: we usually assume the error term ε obeys a normal distribution, and expanding the maximum likelihood formula yields exactly the least squares formula.
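To make this concrete, here is a minimal sketch (assuming NumPy/SciPy and a made-up linear model y = 2x + 1 with Gaussian noise, both hypothetical) that fits the same data twice: once by minimizing the sum of squared residuals, and once by maximizing the Gaussian log-likelihood. The two coefficient estimates coincide up to optimizer tolerance.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic data: y = 2x + 1 + Gaussian noise (hypothetical parameters for illustration).
n = 200
x = rng.uniform(-3, 3, n)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, n)
X = np.column_stack([np.ones(n), x])  # design matrix with an intercept column

# 1) Least squares: minimize the sum of squared residuals (closed form via lstsq).
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# 2) Maximum likelihood under Gaussian noise: minimize the negative log-likelihood.
def neg_log_likelihood(params):
    w, log_sigma = params[:2], params[2]
    sigma2 = np.exp(2.0 * log_sigma)  # parametrize by log(sigma) to keep sigma > 0
    resid = y - X @ w
    return 0.5 * n * np.log(2.0 * np.pi * sigma2) + np.sum(resid**2) / (2.0 * sigma2)

w_mle = minimize(neg_log_likelihood, x0=np.zeros(3)).x[:2]

print(w_ls)   # roughly [1.0, 2.0]
print(w_mle)  # same coefficients, up to optimizer tolerance
```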

In a practical task, we learn a model f(x) from a dataset {(Xi, Yi)}, i = 1, 2, ..., n. The dataset can be thought of as sampled from an ideal model F(x) with Gaussian noise added, i.e. Yi = F(Xi) + ε with ε ~ N(0, σ²).
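A minimal sketch of this sampling assumption, using a hypothetical ideal model F(x) = sin(x) and a fixed noise level:

```python
import numpy as np

rng = np.random.default_rng(0)

def f_ideal(x):
    # Hypothetical ideal model F(x); any fixed function works for illustration.
    return np.sin(x)

n, sigma = 100, 0.3                          # sample size and fixed noise std. dev.
x = rng.uniform(0.0, 2.0 * np.pi, n)
y = f_ideal(x) + rng.normal(0.0, sigma, n)   # Yi = F(Xi) + Gaussian noise
```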

From this point of view, under a candidate model f, each label Yi follows a Gaussian distribution with mean f(Xi) and a fixed variance σ². The probability of the data point (Xi, Yi) is therefore:

$$p(Y_i \mid X_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(Y_i - f(X_i))^2}{2\sigma^2}\right)$$

To judge whether a model is close enough to the ideal model, we can compare how probable the dataset is under the current model, which is exactly the familiar maximum likelihood estimation. Our goal is therefore to maximize the log-likelihood of the dataset. The derivation below is just the standard maximum likelihood procedure; after simplification, maximizing the log-likelihood of the dataset turns out to be equivalent to minimizing the sum of squared differences between the labels Yi and the model predictions f(Xi).
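Written out, the standard steps are as follows (assuming the n samples are independent and using the Gaussian density above):

$$\ln L = \sum_{i=1}^{n} \ln p(Y_i \mid X_i) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} \left(Y_i - f(X_i)\right)^2$$

The first term and the factor 1/(2σ²) do not depend on f, so maximizing ln L over f is exactly minimizing Σi (Yi − f(Xi))², the least squares criterion.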

As this mathematical background shows, the choice of the sum-of-squares error function is in fact grounded in the maximum likelihood method. Maximum likelihood is inherently prone to overfitting, because it rewards any improvement in fit to the training data, noise included. This is the essential reason why a model that chases accuracy on the training set during the training phase tends to overfit. Controlling the complexity of the model, tuning hyperparameters, and so on are ways to trade off overfitting against training accuracy, as in the regularized sketch below.
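One common way to control complexity is to add an L2 penalty on the weights to the squared error (ridge regression). A minimal sketch, assuming a hypothetical degree-9 polynomial fit to noisy sine data:

```python
import numpy as np

rng = np.random.default_rng(0)

# A few noisy samples from a hypothetical ideal model F(x) = sin(x).
n = 15
x = rng.uniform(0.0, 2.0 * np.pi, n)
y = np.sin(x) + rng.normal(0.0, 0.2, n)

degree, lam = 9, 1e-3
X = np.vander(x, degree + 1)  # polynomial design matrix, columns x^9 ... x^0

# Plain least squares: argmin ||Xw - y||^2 (prone to overfitting at high degree).
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge: argmin ||Xw - y||^2 + lam * ||w||^2, closed form (X'X + lam*I)^(-1) X'y.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)

print(np.abs(w_ols).max(), np.abs(w_ridge).max())  # ridge weights are much smaller
```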
