Thorough understanding of regularization (Regularization)

First, we need to understand the loss-function visualization. In the contour map of the loss function over the parameter space, there are infinitely many parameter combinations that reach the same loss value.

The general form of a loss function with a regularization term added:
$$L=\sum_{i=1}^{n}\Big[y_i-\sum_{j=1}^{p} w_j x_{ij}-b\Big]^2+\lambda\sum_{j=1}^{p}|w_j|^q$$
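As a quick sanity check of this objective, here is a minimal NumPy sketch that evaluates it for a given weight vector and exponent $q$ on toy data. The names (`regularized_loss`, `lam`, the toy `X` and `y`) are illustrative, not from the original post:

```python
import numpy as np

def regularized_loss(w, b, X, y, lam=0.1, q=1):
    """Squared residuals plus lam * sum_j |w_j|^q."""
    residuals = y - X @ w - b
    data_term = np.sum(residuals ** 2)        # data-fitting part of L
    penalty = lam * np.sum(np.abs(w) ** q)    # regularization term
    return data_term + penalty

# Tiny synthetic example: n = 4 samples, p = 2 features.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([1.0, 2.0])

print(regularized_loss(w, b=0.0, X=X, y=y, q=1))  # q = 1: L1-style penalty
print(regularized_loss(w, b=0.0, X=X, y=y, q=2))  # q = 2: L2-style penalty
```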

We plot the image of $\sum_{j=1}^p|w_j|^q$ for different values of $q$:
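The original figure is not reproduced here; the following matplotlib sketch (assuming the two-parameter case $p=2$) draws the boundary $|w_1|^q+|w_2|^q=1$ for a few values of $q$, which is the kind of shape that figure shows:

```python
import numpy as np
import matplotlib.pyplot as plt

# Boundary |w1|^q + |w2|^q = 1 for several q (a diamond at q=1, a circle at q=2).
w1, w2 = np.meshgrid(np.linspace(-1.5, 1.5, 400), np.linspace(-1.5, 1.5, 400))

fig, ax = plt.subplots(figsize=(5, 5))
for q in [0.5, 1, 2, 4]:
    penalty = np.abs(w1) ** q + np.abs(w2) ** q
    ax.contour(w1, w2, penalty, levels=[1.0])   # the level set where the penalty equals 1
ax.set_xlabel("$w_1$")
ax.set_ylabel("$w_2$")
ax.set_title(r"$|w_1|^q + |w_2|^q = 1$ for $q \in \{0.5, 1, 2, 4\}$")
ax.set_aspect("equal")
plt.show()
```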

In the loss-function visualization we drew the loss term $\sum_{i=1}^n\big[y_i-\sum_{j=1}^p w_j x_{ij}-b\big]^2$ as contour lines. If we put the contour plot of the un-regularized loss and the image of the regularization term together:

Take the outermost contour in the left figure. When the loss function reaches the value of that contour, there are infinitely many pairs $(w_1, w_2)$ that achieve it. Adding the $l_1$ regularization term means choosing one solution out of these infinitely many: the one with the smallest $w_1+w_2$. If a straight line of constant $w_1+w_2$ intersects several contour lines, then $w_1+w_2$ is the same at every one of those intersection points, and the solution lying on the smallest contour is selected, shown as point 5 in the figure.

The picture on the right works the same way for the $l_2$ regularizer: the point where a contour first becomes tangent to the circle of the $l_2$ term corresponds, among the infinitely many solutions $(w_1, w_2)$, to the one with the smallest $w_1^2+w_2^2$.

So we can conclude that the optimal solution is the first point where a loss contour touches the boundary of the regularization term. The regularization term narrows the region of parameter space over which we solve.
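A minimal sketch of this tangency picture, using an assumed toy quadratic loss (so the contours are ellipses) with the $l_1$ diamond and $l_2$ circle overlaid; the numbers are purely illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy quadratic loss with its unconstrained optimum at (2, 1.5); purely illustrative.
w1, w2 = np.meshgrid(np.linspace(-2, 3, 400), np.linspace(-2, 3, 400))
loss = (w1 - 2.0) ** 2 + 2.0 * (w2 - 1.5) ** 2

fig, ax = plt.subplots(figsize=(5, 5))
ax.contour(w1, w2, loss, levels=10, colors="gray")                        # loss contours
ax.contour(w1, w2, np.abs(w1) + np.abs(w2), levels=[1.0], colors="red")   # l1 diamond
ax.contour(w1, w2, w1 ** 2 + w2 ** 2, levels=[1.0], colors="blue")        # l2 circle
ax.set_xlabel("$w_1$")
ax.set_ylabel("$w_2$")
ax.set_aspect("equal")
plt.show()
```

The $l_1$ diamond is typically first touched at one of its corners, where one coordinate is exactly zero, while the $l_2$ circle is usually touched at a point where both coordinates are non-zero.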

https://blog.csdn.net/zandaoguang/article/details/107970123
http://freemind.pluskid.org/machine-learning/sparsity-and-some-basics-of-l1-regularization/#ed61992b37932e208ae114be75e42a3e6dc34cb3

In-depth understanding of regularization from a Bayesian perspective (must read)

Why not use L0 as the regularization term?
Theoretically, L0 is indeed the best regularizer for obtaining sparse solutions. However, the feature dimension in machine learning is often very large (in other words, there are many coefficients), and minimizing the L0 norm is an NP-hard problem, so it is of very limited use in practical engineering and generally not feasible.
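To see where the combinatorial blow-up comes from, note that exact L0 regularization amounts to best-subset selection: every possible support set has to be tried, which means on the order of $2^p$ least-squares fits. A small illustrative sketch (the function and variable names are my own, not from the post):

```python
import itertools
import numpy as np

def best_subset(X, y, k_max):
    """Brute-force L0-style selection: least squares on every support set up to size k_max."""
    n, p = X.shape
    best = (np.inf, None)
    for k in range(k_max + 1):
        for support in itertools.combinations(range(p), k):
            cols = list(support)
            if cols:
                w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
                resid = y - X[:, cols] @ w
            else:
                resid = y                      # empty model: predict zero
            rss = float(resid @ resid)
            if rss < best[0]:
                best = (rss, support)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, [1, 4]] @ np.array([2.0, -3.0]) + 0.1 * rng.normal(size=50)
print(best_subset(X, y, k_max=2))   # fine at p = 10; at large p the number of subsets explodes
```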

Why do we want a sparse solution at all?
This is not absolute. Statistically speaking, a sparse solution can alleviate overfitting, because it reduces the complexity of the model: some coefficients become exactly zero, so the corresponding features no longer take effect.
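A minimal sketch on synthetic data, using scikit-learn's `Lasso` ($l_1$) and `Ridge` ($l_2$), of the point above: the $l_1$ penalty drives most coefficients exactly to zero, while the $l_2$ penalty only shrinks them. The data and hyperparameters are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
true_w = np.zeros(20)
true_w[:3] = [3.0, -2.0, 1.5]                  # only 3 of 20 features actually matter
y = X @ true_w + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)             # l1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)             # l2 penalty

print("non-zero Lasso coefficients:", int(np.sum(lasso.coef_ != 0)))   # close to 3
print("non-zero Ridge coefficients:", int(np.sum(ridge.coef_ != 0)))   # typically all 20
```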

Original article: blog.csdn.net/weixin_38052918/article/details/107814978