A comprehensive explanation of the method of least squares

We won't spend much time on ordinary least squares itself. The following introduces some more advanced variants of the least squares method.

  • Regularized least squares

When ordinary least squares is used for regression analysis, it often suffers from overfitting: the model performs well on the training set but poorly on the test set. In that case a regularization term is added to the least squares objective. There are two common types of regularization.

L2 regularization (Ridge regression):

\arg\min_{W\in D}L(W)=\sum_{i=1}^{n}(y_i-Wx_i)^2+\lambda\sum_{j=1}^{d}W_j^2=\|y-Wx\|_2^2+\lambda\|W\|_2^2
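
As a concrete illustration, setting the gradient of the L2-penalized objective to zero gives the closed-form solution W=(X^TX+\lambda I)^{-1}X^Ty, where X is the design matrix stacking the x_i. Below is a minimal numpy sketch of this solution; the synthetic data and the helper name ridge_fit are invented for the example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution W = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# toy usage with synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=50)
print(ridge_fit(X, y, lam=1.0))  # close to w_true, slightly shrunk toward zero
```

As \lambda grows, the coefficients are shrunk toward zero, but with the L2 penalty they never become exactly zero.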

L1 regularization (Lasso regression):

\arg\min_{W\in D}L(W)=\sum_{i=1}^{n}(y_i-Wx_i)^2+\lambda\sum_{j=1}^{d}|W_j|=\|y-Wx\|_2^2+\lambda\|W\|_1
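
Unlike ridge regression, the L1-penalized objective has no closed-form solution. One standard approach is coordinate descent with soft-thresholding; the sketch below is a minimal, unoptimized version of that idea for the objective above (the helper names and toy data are invented for illustration).

```python
import numpy as np

def soft_threshold(a, t):
    """Soft-thresholding operator S(a, t) = sign(a) * max(|a| - t, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for ||y - X w||_2^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        for j in range(d):
            # partial residual with feature j removed from the fit
            r_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r_j
            z = X[:, j] @ X[:, j]
            w[j] = soft_threshold(rho, lam / 2.0) / z
    return w

# toy usage: sparse ground truth, noisy observations
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
w_true = np.array([1.0, 0.0, 0.0, -2.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=50)
print(lasso_cd(X, y, lam=5.0))  # coefficients of irrelevant features tend to be driven to zero
```

The soft-thresholding step is what can set coefficients exactly to zero, which is why Lasso performs feature selection while ridge only shrinks.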

Regularization can also be explained from a probabilistic perspective: it is equivalent to placing a prior distribution on the parameters W. If the prior is a zero-mean Gaussian distribution, we obtain L2 regularization; if it is a zero-mean Laplace distribution, we obtain L1 regularization. Adding regularization restricts the parameter space and controls the complexity of the model, thereby preventing overfitting.
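
For example, assume Gaussian noise y_i=Wx_i+\varepsilon_i with \varepsilon_i\sim N(0,\sigma^2) and a zero-mean Gaussian prior W_j\sim N(0,\tau^2). Maximizing the posterior p(W\mid y)\propto p(y\mid W)\,p(W) is the same as minimizing its negative logarithm:

-\log p(W\mid y)=\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-Wx_i)^2+\frac{1}{2\tau^2}\sum_{j=1}^{d}W_j^2+\text{const}

which is the L2-regularized objective with \lambda=\sigma^2/\tau^2. Replacing the Gaussian prior with a zero-mean Laplace prior p(W_j)\propto e^{-|W_j|/b} turns the second term into an L1 penalty in exactly the same way.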

  • Damped least squares method (Levenberg–Marquardt algorithm, LMA)

The least squares method we commonly use fits linear models y=Wx. For nonlinear functions we instead use the damped least squares method, which is essentially an iterative solution process: the basic idea is to linearize the nonlinear function with a Taylor expansion at each step.

Let the model be y=f(x;c), where x is the variable and c is the vector of parameters to be fitted. We want to find a set of parameters c such that:

\arg\min_{c\in D}L(c)=\sum_{i=1}^{n}(y_i-f(x_i;c))^2
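
As a running example, take a hypothetical exponential decay model f(x;c)=c_0 e^{-c_1 x} (chosen purely for illustration); the objective for a given parameter guess can then be evaluated directly:

```python
import numpy as np

def f(x, c):
    """Hypothetical nonlinear model f(x; c) = c0 * exp(-c1 * x)."""
    return c[0] * np.exp(-c[1] * x)

def loss(c, x, y):
    """Sum of squared residuals L(c) = sum_i (y_i - f(x_i; c))^2."""
    return np.sum((y - f(x, c)) ** 2)

# synthetic data generated from c_true = (2.0, 0.5), plus noise
rng = np.random.default_rng(1)
x = np.linspace(0, 5, 40)
y = 2.0 * np.exp(-0.5 * x) + 0.05 * rng.normal(size=x.size)
print(loss(np.array([1.0, 1.0]), x, y))  # loss at an initial guess
```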

Expanding f(x;c) in a Taylor series around the current estimate c_0 and keeping only the first-order term, we get:

Y\approx f(x;c_0)+J\Delta c=F+J\Delta c

where J is the n\times m Jacobian matrix of f with respect to the m parameters, evaluated at c_0:

J=\begin{pmatrix} \frac{\partial f(x_1)}{\partial c_1} & \frac{\partial f(x_1)}{\partial c_2} & \cdots & \frac{\partial f(x_1)}{\partial c_m} \\ \frac{\partial f(x_2)}{\partial c_1} & \frac{\partial f(x_2)}{\partial c_2} & \cdots & \frac{\partial f(x_2)}{\partial c_m} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f(x_n)}{\partial c_1} & \frac{\partial f(x_n)}{\partial c_2} & \cdots & \frac{\partial f(x_n)}{\partial c_m} \end{pmatrix}

This gives J\Delta c=Y-F, which can be solved in the least-squares sense as \Delta c=(J^TJ)^{-1}J^T(Y-F); the parameters are then updated iteratively, c\leftarrow c+\Delta c, until \|\Delta c\|<\xi. The Levenberg–Marquardt damping adds \lambda I to the normal equations, \Delta c=(J^TJ+\lambda I)^{-1}J^T(Y-F), which interpolates between this Gauss–Newton step (small \lambda) and a gradient-descent step (large \lambda).
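
Putting the pieces together, the following is a minimal numpy sketch of the damped iteration on the exponential example model from above. The finite-difference Jacobian and the simple accept/reject rule for adjusting the damping factor are illustrative choices, not the only possible ones.

```python
import numpy as np

def f(x, c):
    """Hypothetical nonlinear model f(x; c) = c0 * exp(-c1 * x)."""
    return c[0] * np.exp(-c[1] * x)

def jacobian(x, c, eps=1e-6):
    """Finite-difference Jacobian J[i, j] = d f(x_i; c) / d c_j."""
    J = np.empty((x.size, c.size))
    for j in range(c.size):
        dc = np.zeros_like(c)
        dc[j] = eps
        J[:, j] = (f(x, c + dc) - f(x, c - dc)) / (2 * eps)
    return J

def lm_fit(x, y, c0, lam=1e-3, tol=1e-8, max_iter=100):
    """Damped least squares: solve (J^T J + lam*I) dc = J^T (Y - F) each step."""
    c = np.array(c0, dtype=float)
    for _ in range(max_iter):
        r = y - f(x, c)                      # residuals Y - F
        J = jacobian(x, c)
        A = J.T @ J + lam * np.eye(c.size)   # damped normal equations
        dc = np.linalg.solve(A, J.T @ r)
        if np.sum((y - f(x, c + dc)) ** 2) < np.sum(r ** 2):
            c = c + dc                       # step accepted: reduce damping
            lam *= 0.5
            if np.linalg.norm(dc) < tol:     # stop once the update is tiny
                break
        else:
            lam *= 2.0                       # step rejected: increase damping
    return c

# synthetic data generated from c_true = (2.0, 0.5)
rng = np.random.default_rng(2)
x = np.linspace(0, 5, 40)
y = 2.0 * np.exp(-0.5 * x) + 0.05 * rng.normal(size=x.size)
print(lm_fit(x, y, c0=[1.0, 1.0]))  # should recover roughly (2.0, 0.5)
```

In practice, library routines such as scipy.optimize.least_squares (with method='lm') or scipy.optimize.curve_fit implement the same idea with more refined damping and convergence tests.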

 

Origin: blog.csdn.net/bulletstart/article/details/132155852