A simple understanding of the principles and roles of the L0, L1, and L2 norms

This article is reprinted for convenience; the original is at https://blog.csdn.net/qq_42109740/article/details/104779538. That post gives a very accessible explanation of the principles of the L0, L1, and L2 norms and their role in machine learning. What follows is my own understanding after digesting it, and I believe it will help readers grasp what these norms actually do. If you are new to norms in machine learning, it is worth reading the more systematic introduction at the link above first and then coming back here; if you already understand how norms are applied in machine learning, you can read this article directly.

1. Put the conclusion first


To state the definitions of the L0, L1, and L2 norms and their roles in machine learning up front:

(1) The L0 norm is the number of non-zero elements in a vector. Penalizing it promotes sparsity in the model parameters, but the L0 norm is difficult to optimize and solve (the problem is combinatorial).

(2) The L1 norm is the sum of the absolute values of the elements of a vector. Penalizing it also promotes sparsity in the model parameters; the effect is weaker than the L0 norm's, but it is much easier to solve, so it is far more commonly used.

(3) The L2 norm is the square root of the sum of the squares of the elements of a vector. Penalizing it shrinks all of the model's parameters, which helps prevent overfitting, and it is also very commonly used.
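As a quick sanity check of these three definitions, here is a minimal NumPy sketch (the vector `w` is just an arbitrary example) that computes all three quantities:

```python
import numpy as np

w = np.array([0.0, 3.0, 0.0, -4.0, 0.0])

l0 = np.count_nonzero(w)         # number of non-zero entries: 2
l1 = np.sum(np.abs(w))           # sum of absolute values: 7.0
l2 = np.sqrt(np.sum(w ** 2))     # square root of sum of squares: 5.0

print(l0, l1, l2)  # 2 7.0 5.0
```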

2. Foreword


First, let's pin down the concept of sparsity. Intuitively: given a vector of data (say $x_1, x_2, x_3, \ldots, x_{1000}$), if only a small subset of its entries, say ten of them such as $x_{100}, x_{200}, \ldots, x_{1000}$, take values around 1 while all the others are 0 or close to 0, then the data are sparse. Why should we care about sparsity? Many readers will think of "compressed sensing", but here is a more down-to-earth example: suppose there are 100 indicators for judging whether a patient has a certain disease, of which only 5 really matter. Asking a doctor to weigh all 100 indicators is a huge workload, and the effort spent on the other 95 is wasted; if we can restrict attention to just those 5 indicators, we have performed a sparsification of the data. Norms are exactly the tool for this. Below I explain in detail my understanding of the L0, L1, and L2 norms, mainly from the angle of sparsity.
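To make the doctor example concrete, here is a small sketch (the data are synthetic and the chosen indicator positions are invented for illustration) of a sparse weight vector that keeps only 5 of 100 indicators:

```python
import numpy as np

rng = np.random.default_rng(0)
indicators = rng.normal(size=100)      # 100 measurements for one patient

# A sparse weight vector: only 5 of the 100 entries are non-zero,
# so the resulting score depends on just those 5 indicators.
w = np.zeros(100)
w[[3, 17, 42, 58, 90]] = [0.8, -1.2, 0.5, 2.0, -0.7]

score = indicators @ w                 # the other 95 indicators are ignored
print(np.count_nonzero(w), score)
```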

3. Analysis


Consider a simple one-variable linear regression problem:

For this kind of problem we want to find a line $y = wx + b$ that fits these points. The method is simple, least squares: $\min\{\sum_i (y_i - \hat{y}_i)^2\}$, i.e. $\min\{\sum_i (y_i - (w x_i + b))^2\}$. For this simple problem a definite solution $w_0$ and $b_0$ can be obtained; this is the case of a single parameter $w$. Now suppose we instead set the model to $y = w_1 x + w_2 x + b$, i.e. split the weight $w$ into $w_1$ and $w_2$. Clearly any least-squares solution will satisfy $w_0 = w_1 + w_2$. It is very important to remember this identity!

Now, what if I also require $w_1$ and $w_2$ to be sparse? That is, ideally one of them is 0 and the other equals $w_0$. Let's try plain least squares again: $\min\{\sum_i (y_i - (w_1 x_i + w_2 x_i + b))^2\}$. The result still only satisfies $w_1 + w_2 = w_0$; how much $w_1$ and $w_2$ each equal cannot be determined, and sparsity cannot be guaranteed.
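A quick numeric check of this degeneracy, on synthetic noiseless data with a known slope (here $w_0 = 3$, $b_0 = 1$): any split of $w_0$ between $w_1$ and $w_2$ gives exactly the same squared error.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 3.0 * x + 1.0                      # true model: w0 = 3, b0 = 1

def sse(w1, w2, b):
    """Sum of squared errors of the model y = w1*x + w2*x + b."""
    return np.sum((y - (w1 * x + w2 * x + b)) ** 2)

# Every split with w1 + w2 = 3 fits the data equally well (SSE = 0 here).
print(sse(3.0, 0.0, 1.0))    # sparse split
print(sse(1.5, 1.5, 1.0))    # even split
print(sse(10.0, -7.0, 1.0))  # wild split, still w1 + w2 = 3
```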

3.1 L0 norm


Now add an L0 constraint: $\min\{\sum_i (y_i - (w_1 x_i + w_2 x_i + b))^2 + \lambda \|w\|_0\}$, where $\|w\|_0$ is the L0 norm of $w$ and $\lambda$ is the penalty coefficient. That is, we solve $\min\{\sum_i (y_i - (w_1 x_i + w_2 x_i + b))^2 + \lambda(\text{the number of non-zeros among } w_1 \text{ and } w_2)\}$. To make the whole expression small, both terms need to be small. For the second term, the best outcome is that one of the parameters is 0, i.e. $w_1$ or $w_2$ is 0, and the other equals $w_0$ (the L0 penalty does not depend on the magnitude of the non-zero parameter, so here it is exactly $w_0$; for the L1 and L2 penalties below, the optimum is actually slightly smaller than $w_0$, but we will treat it as equal to $w_0$ throughout, which does not affect the analysis). So the L0 norm does achieve parameter sparsity.
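Under the equal-fit observation above, comparing candidate splits shows why the L0 penalty prefers the sparse one (a sketch with arbitrary values $w_0 = 3$ and $\lambda = 0.1$):

```python
import numpy as np

w0, lam = 3.0, 0.1   # arbitrary values for illustration

def l0_penalty(w1, w2):
    # L0 "norm": how many of w1, w2 are non-zero.
    return lam * np.count_nonzero([w1, w2])

# All of these satisfy w1 + w2 = w0, so the data term is identical;
# only the penalty differs, and the sparse split wins.
print(l0_penalty(w0, 0.0))         # 0.1
print(l0_penalty(w0 / 2, w0 / 2))  # 0.2
print(l0_penalty(2 * w0, -w0))     # 0.2
```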

3.2 L1 norm


In the same way, the least-squares objective with an L1 penalty is: $\min\{\sum_i (y_i - (w_1 x_i + w_2 x_i + b))^2 + \lambda(|w_1| + |w_2|)\}$. Two cases cover the possibilities (the original post illustrates them with a figure): (a) $w_1$ and $w_2$ have the same sign, say both positive; (b) they have opposite signs.

In case (a), both $w_1$ and $w_2$ are positive, and the objective becomes $\min\{\sum_i (y_i - (w_1 x_i + w_2 x_i + b))^2 + \lambda(w_1 + w_2)\}$. Since $w_0 = w_1 + w_2$, this is $\min\{\sum_i (y_i - (w_1 x_i + w_2 x_i + b))^2 + \lambda w_0\}$, where $w_0$ is a fixed value, so the penalty does not distinguish between same-sign splits and plays no sparsifying role there. In case (b), however, one parameter is positive and the other negative, which makes $|w_1| + |w_2|$ strictly larger than $w_0$; to minimize the objective, the solution is pushed to $w_1 = 0$ and $w_2 = w_0$. So the L1 norm also has a sparsifying effect.
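In practice, solvers tend to break the same-sign tie as well. Here is a sketch using scikit-learn's Lasso (coordinate descent) on the duplicated-feature setup; in my experience it typically ends up zeroing one of the two coefficients, though with exactly duplicated columns the L1 solution is not unique, so which one is zeroed is an artifact of the solver:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(100, 1))
y = 3.0 * x[:, 0] + 1.0 + rng.normal(scale=0.1, size=100)

X = np.hstack([x, x])                  # duplicated feature: y = w1*x + w2*x + b
model = Lasso(alpha=0.01).fit(X, y)

print(model.coef_, model.intercept_)   # e.g. one coefficient near 3, the other 0.0
```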

3.3 L2 norm


Similarly, after introducing the L2 norm the objective is: $\min\{\sum_i (y_i - (w_1 x_i + w_2 x_i + b))^2 + \lambda(w_1^2 + w_2^2)^{1/2}\}$. Using $w_1^2 + w_2^2 = (w_1 + w_2)^2 - 2 w_1 w_2$, this becomes $\min\{\sum_i (y_i - (w_1 x_i + w_2 x_i + b))^2 + \lambda((w_1 + w_2)^2 - 2 w_1 w_2)^{1/2}\}$, and substituting $w_0 = w_1 + w_2$ gives $\min\{\sum_i (y_i - (w_1 x_i + w_2 x_i + b))^2 + \lambda(w_0^2 - 2 w_1 w_2)^{1/2}\}$. To make the second term $\lambda(w_0^2 - 2 w_1 w_2)^{1/2}$ as small as possible, $w_0^2 - 2 w_1 w_2$ must be minimal, i.e. $-w_1 w_2$ must be minimal. Substituting $w_2 = w_0 - w_1$ gives $-w_1 w_2 = w_1^2 - w_1 w_0$, an upward-opening parabola in $w_1$ (the original post plots it) whose vertex is at $w_1 = w_0/2$.

So the minimum is attained at $w_1 = w_0/2$ (and hence $w_2 = w_0/2$): the main effect of the L2 norm is to spread the value $w_0$ evenly across $w_1$ and $w_2$, making each parameter smaller. When there are many parameters, this is what other bloggers describe as the weights tending toward 0 without ever actually equalling 0.
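The even split is easy to observe with scikit-learn's Ridge on the same duplicated-feature data (a sketch; note Ridge penalizes $\lambda(w_1^2 + w_2^2)$, the squared L2 norm, rather than the norm itself, but the even-split behavior is the same):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(100, 1))
y = 3.0 * x[:, 0] + 1.0 + rng.normal(scale=0.1, size=100)

X = np.hstack([x, x])                 # duplicated feature again
model = Ridge(alpha=0.1).fit(X, y)

print(model.coef_)                    # e.g. both coefficients near 1.5 (= w0/2 each)
```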

One more thing before the summary: why do you only ever see analyses of the L0, L1, and L2 norms, and never the L3 norm? Apply the same method. The objective is $\min\{\sum_i (y_i - (w_1 x_i + w_2 x_i + b))^2 + \lambda(w_1^3 + w_2^3)^{1/3}\}$ (taking $w_1, w_2 \ge 0$ so the absolute values can be dropped). Using $w_1^3 + w_2^3 = (w_1 + w_2)^3 - 3 w_1 w_2 (w_1 + w_2)$, the second term becomes $\lambda(w_0^3 - 3 w_1 w_2 w_0)^{1/3}$. Minimizing it amounts to minimizing $-3 w_1 w_2 w_0$, i.e. minimizing $-w_1 w_2$, which is exactly the L2 problem again. So the L3 norm adds nothing new.
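A grid check over the constraint line $w_1 + w_2 = w_0$ (with an arbitrary $w_0 = 3$) confirms the whole analysis: the L1 penalty is flat across same-sign splits, while the L2 and L3 penalties are both minimized at the even split $w_1 = w_0/2$:

```python
import numpy as np

w0 = 3.0
w1 = np.linspace(0.0, w0, 301)        # same-sign splits along w1 + w2 = w0
w2 = w0 - w1

l1 = np.abs(w1) + np.abs(w2)                          # constant: always w0
l2 = (w1 ** 2 + w2 ** 2) ** (1 / 2)                   # minimized at w1 = w0/2
l3 = (np.abs(w1) ** 3 + np.abs(w2) ** 3) ** (1 / 3)   # also minimized at w0/2

for name, p in [("L1", l1), ("L2", l2), ("L3", l3)]:
    print(name, "min at w1 =", w1[np.argmin(p)], "value =", p.min())
```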

4. Summary


After the above derivations, the picture is clear: L0 directly reduces the number of non-zero parameters. L1 penalizes splits whose parameters have opposite signs, which drives some parameters to exactly 0. Both the L0 and L1 norms therefore make the model parameters sparse. L2 cannot set parameters to 0, but it makes their overall values smaller.
 
