Support Vector Machine (SVM) (1)

Background

For a binary classification problem: given the data, how do we separate the two classes with a linear function?
[Figure: two classes of sample points to be separated]
Option 1 (look only at the black dividing line; ignore the red for now)
[Figure: option 1, a candidate black dividing line with red margin lines]
Option 2 (look only at the black dividing line; ignore the red for now)
[Figure: option 2, a candidate black dividing line with a visibly wider red region]
Which one is better?
The second one is better: it separates the two classes more decisively, as shown below.
[Figure: the wider separation achieved by option 2]

Modeling

Now that we have a goal, we need to describe mathematically what makes the boundary above "better": we want a larger margin. The margin is the width of the red region.
Mathematical description:
Let the equation of the dividing line be:
$$w^T x + b = 0$$
The distance from a sample point $x$ to the dividing line is (by the point-to-line distance formula):
$$\frac{|w^T x + b|}{\|w\|}$$
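As a quick sanity check of this formula, here is a toy sketch (the numbers are assumed, purely for illustration):

```python
# Distance from x to the line {z : w^T z + b = 0} is |w^T x + b| / ||w||.
import numpy as np

w, b = np.array([3.0, 4.0]), -5.0
x = np.array([1.0, 1.0])
print(abs(w @ x + b) / np.linalg.norm(w))  # |3 + 4 - 5| / 5 = 0.4
```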
Is the chosen line a good one? Look at its margin:
$$\text{margin}(w, b) = \min_i \frac{|w^T x_i + b|}{\|w\|}$$
A small clarification: the margin is usually taken to be twice the quantity above, i.e. the full width of the red region. But since our goal is simply "the larger the margin, the better", making half the margin as large as possible is equivalent, so we may drop the factor of 2. Nothing is affected.

Finally, having defined the margin, we want to choose the line (determined by $w, b$) whose margin is largest. That is:
$$\max_{w,b} \; \min_i \frac{|w^T x_i + b|}{\|w\|}$$
An intuitive explanation: we first pick some $w, b$ (fixing a line), look at its margin and see how large it is, then try the next pair $w, b$, and so on until we find the largest margin, returning the corresponding $w, b$. That would complete the solution.
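To make that intuition concrete, here is a toy sketch (the data and the two candidate lines are assumed; a real solver does not enumerate like this):

```python
# Score each candidate line (w, b) by margin(w, b) = min_i |w^T x_i + b| / ||w||
# and keep the best. Note: this naive score ignores whether the line actually
# classifies the samples correctly -- exactly the gap discussed next.
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])

def margin(w, b, X):
    # Smallest point-to-line distance over all samples.
    return np.min(np.abs(X @ w + b)) / np.linalg.norm(w)

candidates = [(np.array([1.0, 1.0]), 0.0), (np.array([1.0, 0.5]), -0.5)]
best = max(candidates, key=lambda wb: margin(*wb, X))
print("best (w, b):", best)
```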
[Figure: a line with a large "margin" that fails to separate the two classes]
Not so fast: we forgot the constraints. We were so focused on maximizing the margin that we forgot to check whether the line achieving the largest margin actually classifies our samples correctly.
To make the mathematics convenient, we preprocess every point into a pair $(x_i, y_i)$, where $y_i$ is the label, preprocessed to be $+1$ or $-1$: one value represents the positive class, the other the negative class. The requirement that the line classifies every sample correctly is then:
$$\begin{cases} w^T x_i + b \geq 0, & y_i = +1 \\ w^T x_i + b \leq 0, & y_i = -1 \end{cases}$$
Combined into one condition:
$$y_i (w^T x_i + b) \geq 0$$
Finally we get:
$$\max_{w,b} \; \min_i \frac{|w^T x_i + b|}{\|w\|} \quad \text{s.t.} \quad y_i (w^T x_i + b) \geq 0, \quad i = 1, \dots, N$$
Now consider changing the constraint above into:
$$y_i (w^T x_i + b) \geq 1$$

Some people will ask: how can the 0 be changed into a 1? The answer: it can, and in the linearly separable case the two formulations are equivalent.
Explanation: first, in the two conditions above we can drop the equality and use strict inequalities, because in the completely linearly separable case the two classes lie strictly on either side of the line. No sample lies on the line itself, so $w^T x_i + b$ is never 0 and equality is never attained.
Next, suppose the optimal parameters of the first formulation are $w, b$, with $\min_i y_i (w^T x_i + b) = m > 0$. Then we can find a sufficiently large positive number $k$, set $W = kw$, $B = kb$, and prove that $(W, B)$ is still an optimal solution of the first formulation.
The proof is very simple.
First, multiplying the coefficients by a constant does not change the distance: the factor appears in both numerator and denominator and cancels, so if the original pair achieved the maximum margin, the scaled pair still does:
$$\frac{|k w^T x + k b|}{\|k w\|} = \frac{|w^T x + b|}{\|w\|}$$
Second, because we multiplied by a sufficiently large coefficient, we have $y_i (W^T x_i + B) = k \, y_i (w^T x_i + b) \geq km \geq 1$, so the constraint of the second formulation is satisfied.
Conclusion: after transforming the constraint $y_i (w^T x_i + b) \geq 0$ into $y_i (w^T x_i + b) \geq 1$, the optimal solutions of the two problems are closely related, and the dividing lines they describe are the same. (Consider $x + 2 = 0$ and $2x + 4 = 0$: scaling changes neither the line nor the margin, but a suitable multiple makes the scaled parameters satisfy $y_i (w^T x_i + b) \geq 1$.) So the problem becomes:
$$\max_{w,b} \; \min_i \frac{|w^T x_i + b|}{\|w\|} \quad \text{s.t.} \quad y_i (w^T x_i + b) \geq 1, \quad i = 1, \dots, N$$
From now on, we will use only this second form.
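A quick numeric check of the scale invariance used above (toy numbers, assumed for illustration):

```python
# Scaling (w, b) -> (k*w, k*b) leaves every point-to-line distance,
# and hence the margin, unchanged.
import numpy as np

w, b, k = np.array([1.0, 2.0]), -0.5, 3.0
x = np.array([2.0, 1.0])

d1 = abs(w @ x + b) / np.linalg.norm(w)
d2 = abs((k * w) @ x + k * b) / np.linalg.norm(k * w)
print(d1, d2)  # identical
```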

Continuing the derivation, for every sample the constraint gives:
$$\frac{|w^T x_i + b|}{\|w\|} = \frac{y_i (w^T x_i + b)}{\|w\|} \geq \frac{1}{\|w\|}$$
Now comes the key moment: if we make $\|w\|$ very small, the right-hand side above becomes very large, so plugging in any sample makes the left-hand side very large. The smallest of all these large distances, which is exactly the margin, naturally becomes very large too. Based on this idea, we change the objective function.
$$\min_{w,b} \; \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i (w^T x_i + b) \geq 1, \quad i = 1, \dots, N$$
Of course, there is also a geometric explanation of the above; please see below.
[Figure: the dividing line $w^T x + b = 0$ flanked by the two margin boundaries $w^T x + b = 1$ and $w^T x + b = -1$]
We must choose $w, b$ satisfying the constraints; that is, as in the figure above, the line has to classify every sample correctly.
[Figure: a line that classifies all samples correctly]
The line in the figure above classifies the samples correctly, but many lines can do that; among them we want the one with the largest margin, and that is our goal. We find that the distance between the two lines $w^T x + b = 1$ and $w^T x + b = -1$ (by the distance formula for two parallel lines) is:
$$\frac{|1 - (-1)|}{\|w\|} = \frac{2}{\|w\|}$$
That is to say: this distance is exactly the margin, so we want to maximize it, which is the same as minimizing the following:
$$\frac{1}{2} \|w\|^2$$
So we are back to the same problem!
In addition, let us introduce a concept: the support vector, meaning the samples that sit right at the front on either side of the dividing line. For example, the 3 circled points in the figure below are the support vectors.
The meaning: you can delete all the other data points and train the SVM using only these 3 points, and you will find that the dividing line is still the same one. That is where the name "support vector" comes from. Deleting any one of the support vectors, however, may change the dividing line.

[Figure: the 3 circled support vectors lying on the margin boundaries]
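This property is easy to see in code. The sketch below uses scikit-learn (an assumption; the original post does not use it, and a large C only approximates the hard margin): train on all points, then retrain on the support vectors alone and compare the lines.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

# Train on everything, then keep only the support vectors.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
sv, sv_y = clf.support_vectors_, y[clf.support_]

# Retrain on the support vectors alone: the line is (nearly) unchanged.
clf2 = SVC(kernel="linear", C=1e6).fit(sv, sv_y)
print(clf.coef_, clf.intercept_)
print(clf2.coef_, clf2.intercept_)
```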

Solving

$$\min_{w,b} \; \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i (w^T x_i + b) \geq 1, \quad i = 1, \dots, N$$
Can this be solved? Yes: it is a quadratic program, which we can hand over to standard mathematical solvers. For general knowledge, the standard form of a quadratic program is given here.
$$\min_x \; \frac{1}{2} x^T Q x + c^T x \quad \text{s.t.} \quad A x \leq b$$
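As an illustration, here is a minimal sketch (assuming the cvxopt package and toy data, neither of which is from the original post) that casts the hard-margin problem into this QP form over the variable $z = (w, b)$ and solves it:

```python
import numpy as np
from cvxopt import matrix, solvers

# Toy linearly separable data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, d = X.shape

# Objective (1/2)||w||^2: quadratic in w, with no term for b.
P = matrix(np.diag([1.0] * d + [0.0]))
q = matrix(np.zeros(d + 1))

# y_i (w^T x_i + b) >= 1  rewritten as  -y_i (w^T x_i + b) <= -1.
G = matrix(-y[:, None] * np.hstack([X, np.ones((n, 1))]))
h = matrix(-np.ones(n))

sol = solvers.qp(P, q, G, h)
w = np.array(sol["x"][:d]).ravel()
b = float(sol["x"][d])
print("w =", w, "b =", b)
```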

Small summary

If the data are linearly separable, everything above solves the problem completely. The model above is generally called the hard-margin support vector machine (hard-margin SVM): it requires the data to be linearly separable and can do nothing when they are not. It is also very sensitive to outliers, because an outlier may become a support vector. For example, in the picture below we would want line 2 as the dividing line, not line 1:
[Figure: an outlier forces the hard margin to choose line 1 instead of the preferable line 2]

Soft-margin SVM

The figure below shows linearly inseparable sample points. A hard-margin SVM can do nothing with them: no feasible solution exists.
[Figure: linearly inseparable sample points]
Therefore, let us change the model a little: we now give every sample some slack. Specifically, the constraint becomes:
$$y_i (w^T x_i + b) \geq 1 - \varepsilon_i$$
Originally a strict $\geq 1$ was required; now it is not. For the abnormal points, errors are allowed: the value $y_i (w^T x_i + b)$ may even be less than 0, in which case the corresponding $\varepsilon_i$ is a positive number greater than 1.
However, this alone is not enough: we must constrain all the $\varepsilon_i$ appropriately, otherwise we could take every $\varepsilon_i$ to be infinite, all the inequalities would hold trivially, and the optimization would be meaningless.
We want the $\varepsilon_i$ in the objective function to be as small as possible. That is, the objective becomes the following, to be minimized:
$$\min_{w,b,\varepsilon} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \varepsilon_i \quad \text{s.t.} \quad y_i (w^T x_i + b) \geq 1 - \varepsilon_i, \quad i = 1, \dots, N$$
At the same time, we must constrain $\varepsilon_i \geq 0$, because a sample that is classified well should not enter the penalty at all: such a sample already satisfies $y_i (w^T x_i + b) \geq 1$. Suppose a well-classified sample gives the value 100; the model could then choose $\varepsilon_i = -99$, making the loss instantly small, and this would badly hurt the classification of the other samples, because the model might sacrifice them in exchange for pushing one sample's value to 1000 and driving the loss down enormously.
Obviously that is not what we want, so we require $\varepsilon_i \geq 0$. And when every $\varepsilon_i = 0$, isn't this exactly the hard margin from before?
We observe that each $\varepsilon_i$ above is the loss of one sample, and it is 0 when the sample is classified correctly. This gives the picture below.
[Figure: the hinge loss, $\varepsilon_i = \max(0, \; 1 - y_i (w^T x_i + b))$]
The figure above shows that when a sample is plugged in and gives a value of at least 1, the loss is 0; otherwise the loss grows the further the value falls below 1, i.e. the worse the violation.
A supplement: the connection with, and difference from, the 0-1 loss:
[Figure: the 0-1 loss and the hinge loss plotted together]
Explanation: the 0-1 loss assigns penalty 0 when a sample is classified correctly and penalty 1 when it is wrong, while the hinge loss assigns a penalty that keeps growing the worse the sample is misclassified.
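A tiny numeric illustration of the two losses (the scores are assumed, purely for illustration):

```python
# Per-sample losses, where score_i = y_i * (w^T x_i + b).
import numpy as np

scores = np.array([3.0, 1.0, 0.5, -2.0])

zero_one = (scores <= 0).astype(float)  # 1 only when misclassified
hinge = np.maximum(0.0, 1.0 - scores)   # grows with the violation
print(zero_one)  # [0. 0. 0. 1.]
print(hinge)     # [0.  0.  0.5 3. ]
```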
So, back to solving: is this problem still a quadratic programming problem?
We find that we now have $N$ extra variables $\varepsilon_i$, but that is fine: we simply treat them as optimization variables as well. The objective is still quadratic and the inequalities are still linear (move $\varepsilon_i$ to the left-hand side and regard it as a variable being optimized), so this is an ordinary QP, no different from before.
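For completeness, a minimal sketch (again assuming cvxopt and toy data) of the soft-margin problem as a QP over the stacked variable $z = (w, b, \varepsilon)$:

```python
import numpy as np
from cvxopt import matrix, solvers

# Toy data: the last point sits amid the positive class, so slack is needed.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [2.5, 2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, d = X.shape
C = 1.0  # penalty weight (an assumed value)

# Objective (1/2)||w||^2 + C * sum(eps): quadratic in w, linear in eps.
P = matrix(np.diag([1.0] * d + [0.0] * (1 + n)))
q = matrix(np.hstack([np.zeros(d + 1), C * np.ones(n)]))

# y_i (w^T x_i + b) >= 1 - eps_i  ->  -y_i (w^T x_i + b) - eps_i <= -1
G1 = np.hstack([-y[:, None] * np.hstack([X, np.ones((n, 1))]), -np.eye(n)])
# eps_i >= 0  ->  -eps_i <= 0
G2 = np.hstack([np.zeros((n, d + 1)), -np.eye(n)])
G = matrix(np.vstack([G1, G2]))
h = matrix(np.hstack([-np.ones(n), np.zeros(n)]))

sol = solvers.qp(P, q, G, h)
z = np.array(sol["x"]).ravel()
print("w =", z[:d], "b =", z[d], "eps =", z[d + 1:])
```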

Origin: blog.csdn.net/qq_43391414/article/details/111698371