Background
For a two-class problem: given the data, how do we classify it with a linear function?
Option 1 (for now, look only at the black dividing line; ignore the red region)
Option 2 (again, look only at the black dividing line; ignore the red region)
Which one is better?
The second one is better: it separates the two classes of data with more room to spare, i.e. a wider red region between them.
Modeling
Now that we have a goal, we need to describe this better boundary mathematically: we want a larger margin, where the margin is the width of the red region.
Mathematical description:
Let the dividing line be: $w^T x + b = 0$.
The distance from a sample point $x$ to the line is (by the point-to-line distance formula): $\dfrac{|w^T x + b|}{\|w\|}$.
Is the chosen line a good one? Look at the margin.
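For concreteness, the distance formula can be checked numerically; the line coefficients and the test point below are made up for illustration:

```python
import numpy as np

# Hypothetical line w^T x + b = 0 with w = (1, 1), b = -1.
w = np.array([1.0, 1.0])
b = -1.0

def distance(x, w, b):
    """Point-to-line distance |w^T x + b| / ||w||."""
    return abs(w @ x + b) / np.linalg.norm(w)

x = np.array([2.0, 2.0])
print(distance(x, w, b))  # |1*2 + 1*2 - 1| / sqrt(2) = 3/sqrt(2)
```

A point lying exactly on the line, such as $(1, 0)$ here, gets distance 0.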
A small note: the margin is usually taken to be twice the distance above, i.e. the full width of the red region. But our goal is simply "the larger the better", and maximizing half the margin is the same as maximizing the margin, so we can drop the factor of 2; it changes nothing.
Finally, having computed the margin, we want to choose the line (determined by $w, b$) whose margin is the largest. That is:

$$\max_{w,b}\ \min_i \frac{|w^T x_i + b|}{\|w\|}$$

In plain terms: enumerate a $w, b$ (i.e. pick a line), measure its margin and see how large it is, then enumerate the next $w, b$, and so on until the largest margin is found; finally return the corresponding $w, b$. With that, the problem is solved.
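The enumeration idea can be sketched as a (very inefficient) grid search over lines; the data, the angle sweep, and the offset range below are all made up for illustration:

```python
import numpy as np

# Toy linearly separable data (hypothetical): labels y in {+1, -1}.
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.0],
              [6.0, 5.0], [7.0, 7.5], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

def margin(w, b):
    """Smallest signed distance y_i (w^T x_i + b) / ||w||; negative if some point is misclassified."""
    return np.min(y * (X @ w + b) / np.linalg.norm(w))

# Naive enumeration: sweep line directions (unit w) and offsets b, keep the best.
best = (-np.inf, None, None)
for theta in np.linspace(0, np.pi, 180, endpoint=False):
    w = np.array([np.cos(theta), np.sin(theta)])
    for b in np.linspace(-15, 5, 400):
        m = margin(w, b)
        if m > best[0]:
            best = (m, w, b)

best_margin, w_star, b_star = best
print(best_margin)  # half-width of the widest separating slab found by the sweep
```

Real solvers replace this brute force with quadratic programming, as derived below, but the objective being searched is the same.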
Far from it: we forgot the constraints. We focused on making the margin largest and forgot to check whether the line achieving that margin actually classifies our samples correctly.
To make the mathematics convenient, we preprocess every point into $(x_i, y_i)$, where $y_i$ is a label taking the value $+1$ or $-1$: one represents the positive class, the other the negative class. The requirement that the line classifies every sample correctly then becomes: $y_i(w^T x_i + b) > 0$.
Putting the objective and the constraint together, we finally get:

$$\max_{w,b}\ \min_i \frac{y_i(w^T x_i + b)}{\|w\|} \quad \text{s.t.}\quad y_i(w^T x_i + b) > 0,\ i = 1,\dots,N$$

(under the constraint, $|w^T x_i + b| = y_i(w^T x_i + b)$, which removes the absolute value).
Consider changing the above constraint into: $y_i(w^T x_i + b) \ge 1$.
Some will ask: how can this be changed to 1? The answer: it can, and in the linearly separable case it is equivalent.
Explanation: in the two formulations above we can ignore the equality case and treat the constraints as strict inequalities, because when the data are completely linearly separable the two classes lie strictly on either side of the line. No sample lies on the line itself, so the expression is never 0 and equality is never attained.
Now suppose the best parameters for the first formulation are $w, b$, with $\min_i y_i(w^T x_i + b) = m > 0$. Then we can pick a sufficiently large positive number $k$ and set $W = kw$, $B = kb$, and we can prove that $(W, B)$ is still an optimal solution.
The proof is very simple.
First, multiplying the coefficients by a constant does not change the line or its margin: the factor appears in both the numerator and the denominator of the distance and cancels, so what was the maximum margin remains the maximum margin.
Second, because we multiplied by a sufficiently large factor (any $k \ge 1/m$ will do), we have $y_i(W^T x_i + B) = k\,y_i(w^T x_i + b) \ge km \ge 1$, which satisfies the constraint of the second formulation.
Conclusion: after the constraint of the first formulation is transformed into that of the second, the optimal solutions obtained are closely related, and their corresponding lines are the same. Consider $x + 2 = 0$ and $2x + 4 = 0$: they describe the same line, and rescaling does not change the margin, but the rescaled version can be made to satisfy $y_i(w^T x_i + b) \ge 1$.
So from now on we will use only the second formulation.
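The rescaling argument can be checked numerically; the data and the initial line $(w, b)$ below are made up for illustration:

```python
import numpy as np

# Hypothetical separable data and a separating line (w, b).
X = np.array([[0.0, 1.0], [1.0, 2.0], [3.0, 0.0], [4.0, 1.0]])
y = np.array([-1, -1, 1, 1])
w, b = np.array([1.0, -1.0]), -1.5

m = np.min(y * (X @ w + b))          # m > 0 for a separating line
geo = m / np.linalg.norm(w)          # geometric margin of this line

# Rescale by k = 1/m: W = k w, B = k b. Same line, same geometric margin,
# but now min_i y_i (W^T x_i + B) = 1, matching the normalized constraint.
k = 1.0 / m
W, B = k * w, k * b
print(np.min(y * (X @ W + B)))                 # -> 1.0
print(np.isclose(geo, 1 / np.linalg.norm(W)))  # margin unchanged: 1/||W||
```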
Continuing the derivation, now comes the nice part. Under the constraint $y_i(w^T x_i + b) \ge 1$, the distance from any sample to the line satisfies

$$\frac{y_i(w^T x_i + b)}{\|w\|} \ge \frac{1}{\|w\|}.$$

So if we make $\|w\|$ very small, the right-hand side becomes very large, and hence every sample's distance is large. The smallest of these distances, i.e. the margin, naturally becomes large as well. Based on this idea, we change the objective function.
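In symbols, the chain of reformulations this reasoning leads to is the standard hard-margin problem:

```latex
\max_{w,b}\ \min_i \frac{y_i(w^T x_i + b)}{\|w\|}
\quad\Longrightarrow\quad
\max_{w,b}\ \frac{1}{\|w\|}
\quad\Longrightarrow\quad
\min_{w,b}\ \frac{1}{2}\|w\|^2
\quad\text{s.t.}\quad y_i(w^T x_i + b) \ge 1,\ i = 1,\dots,N
```

The square and the factor $\tfrac{1}{2}$ are cosmetic: they keep the objective differentiable everywhere and tidy up the derivative, without changing the minimizer.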
Of course, there is a geometric explanation for the above as well; please see below.
We must choose $w, b$ satisfying the constraints. That is, as shown in the figure above, the line must classify every sample correctly. Many lines do so; among them, the one with the largest margin is what we want, and that is our goal. Notice that the distance between the two lines $w^T x + b = 1$ and $w^T x + b = -1$ (by the distance formula for two parallel lines) is:

$$\frac{2}{\|w\|}$$
That is to say: this is the margin, so we want to maximize it, which is the same as minimizing $\frac{1}{2}\|w\|^2$.
So we arrive back at the same formulation!
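As a numerical sanity check of the parallel-lines formula (with made-up $w$ and $b$), the width between the two margin lines is $2/\|w\|$:

```python
import numpy as np

# Hypothetical w, b: the two margin lines are w^T x + b = +1 and w^T x + b = -1.
w, b = np.array([3.0, 4.0]), 2.0   # ||w|| = 5

# Distance between parallel lines w^T x + (b-1) = 0 and w^T x + (b+1) = 0:
# |(b+1) - (b-1)| / ||w|| = 2 / ||w||
width = 2 / np.linalg.norm(w)
print(width)  # -> 0.4

# Cross-check geometrically: take one point on each line and measure the gap.
p1 = (1 - b) * w / (w @ w)    # satisfies w^T p1 + b = 1
p2 = (-1 - b) * w / (w @ w)   # satisfies w^T p2 + b = -1
print(np.linalg.norm(p1 - p2))  # -> 0.4 again
```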
In addition, we introduce a concept: the support vector. Support vectors are the sample points that lie right on the margin boundaries, the frontmost points on either side of the dividing line. For example, the 3 circled points in the figure below are the support vectors.
Meaning: you can remove all the other data points and train the SVM on these 3 points alone, and you will find that the dividing line is still the same one. This is the origin of the name "support vector": the solution is supported by these points, and deleting any one of them may change the dividing line.
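This property can be checked with a library SVM. The sketch below assumes scikit-learn is installed, uses made-up data, and approximates the hard margin with a large C:

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is installed

# Hypothetical separable data.
X = np.array([[1.0, 1.0], [2.0, 0.5], [1.5, 2.0],
              [5.0, 5.0], [6.0, 4.5], [5.5, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# Large C approximates the hard-margin SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
print(clf.support_vectors_)   # the few points that "support" the line

# Refit using only the support vectors: the dividing line should not move.
sv = clf.support_             # indices of the support vectors
clf2 = SVC(kernel="linear", C=1e6).fit(X[sv], y[sv])
print(np.allclose(clf.coef_, clf2.coef_, atol=1e-2),
      np.allclose(clf.intercept_, clf2.intercept_, atol=1e-2))
```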
Solving
Can this problem be solved? Yes: it is a quadratic program (QP), which we can hand off to standard mathematical machinery. For reference, the general form of a quadratic program is given below.
Small summary
If the data are linearly separable, everything above solves the problem completely. The model above is generally called the hard-margin support vector machine (hard-margin SVM): it requires the data to be linearly separable and can do nothing when they are not. It is also very sensitive to outliers, because an outlier may become a support vector. For example, in the figure below we want line 2 as the dividing line, not line 1:
Soft-margin SVM
The figure below shows sample points that are not linearly separable. Classifying them with a hard-margin SVM gets us nowhere: no solution exists.
So let us modify the model a little. We now give every sample some slack: where we originally strictly required $y_i(w^T x_i + b) \ge 1$, we now only require $y_i(w^T x_i + b) \ge 1 - \varepsilon_i$. For abnormal points, error is allowed: the value may even drop below 0 (the point is misclassified), which corresponds to a slack $\varepsilon_i$ greater than 1.
But this alone is not enough: we must also constrain all the $\varepsilon_i$ appropriately, otherwise we could take every $\varepsilon_i$ to infinity, the inequalities would hold trivially, and the optimization would be meaningless. We want the $\varepsilon_i$ in the objective to be as small as possible, so the objective becomes the following, to be minimized:

$$\min_{w,b,\varepsilon}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\varepsilon_i$$

where $C > 0$ trades off margin width against total slack.
At the same time we must constrain $\varepsilon_i \ge 0$: only samples that are not classified well should incur a penalty, and a well-classified sample already satisfies $y_i(w^T x_i + b) \ge 1$. Suppose a well-classified sample plugs in to give 100; without the constraint, the model could set $\varepsilon_i = -99$ and make the loss instantly small. That would badly hurt the classification of the other samples, because the model could sacrifice them to push one sample up to 1000 and make the loss tiny. Obviously this is not what we want, so we require $\varepsilon_i \ge 0$. And when every $\varepsilon_i = 0$, we recover exactly the hard margin from before.
We find that the optimal $\varepsilon_i$ is exactly each sample's loss, and it is 0 when the sample is classified well: $\varepsilon_i = \max(0,\ 1 - y_i(w^T x_i + b))$. This is the figure below.
The figure shows that after plugging in a sample, if $y_i(w^T x_i + b) \ge 1$ the loss is 0; otherwise, the more badly the sample is misclassified, the greater the loss.
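The per-sample loss $\max(0,\ 1 - y_i(w^T x_i + b))$ is the hinge loss; a minimal sketch with made-up scores:

```python
import numpy as np

def hinge_loss(scores, y):
    """Per-sample slack/loss: max(0, 1 - y_i * score_i)."""
    return np.maximum(0.0, 1.0 - y * scores)

y = np.array([1.0, 1.0, 1.0, -1.0])
scores = np.array([2.0, 0.5, -1.0, -3.0])  # w^T x + b for each sample
print(hinge_loss(scores, y))  # losses: 0, 0.5, 2, 0
```

The first and last samples sit beyond their margin boundary and cost nothing; the second is inside the margin; the third is outright misclassified and pays the most.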
One supplement: the connection and difference with the 0-1 loss.
Explanation: the 0-1 loss penalizes 0 for a correct classification and 1 for a wrong one, no matter how wrong; the hinge loss grows with how badly the sample is misclassified.
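The contrast is easy to tabulate over a range of margins $m = y\,(w^T x + b)$ (the on-the-line convention $m \le 0$ counting as wrong is an assumption here):

```python
import numpy as np

margins = np.linspace(-2, 2, 9)          # m = y * (w^T x + b)
zero_one = (margins <= 0).astype(float)  # 0-1 loss: flat penalty of 1 when wrong
hinge = np.maximum(0.0, 1.0 - margins)   # hinge: penalty grows with how wrong

for m, z, h in zip(margins, zero_one, hinge):
    print(f"m={m:+.1f}  0-1={z:.0f}  hinge={h:.1f}")
```

Note the hinge loss also penalizes correctly classified points with $0 < m < 1$, which is what pushes samples out of the margin, not merely onto the right side.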
So, back to solving: is this problem still a quadratic programming problem?
We now have N more variables $\varepsilon_i$, but that is fine: we can treat them as optimization variables too. The objective is still quadratic (the slack terms are linear) and the constraints are still linear (move each $\varepsilon_i$ to the left-hand side of its inequality), so this is an ordinary quadratic program, no different from before.
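A soft-margin sketch along the same lines as the earlier QP example, with the slacks packed in as extra variables; the data (one deliberate outlier) and $C$ are made up, and SciPy is assumed:

```python
import numpy as np
from scipy.optimize import minimize  # assumes SciPy is installed

# Hypothetical NON-separable data: the last -1 point sits inside the +1 cluster.
X = np.array([[0.0, 0.0], [1.0, 0.5], [4.0, 4.0], [5.0, 4.5], [4.5, 4.2]])
y = np.array([-1.0, -1.0, 1.0, 1.0, -1.0])   # last point is an outlier
C, N = 1.0, len(X)

# Variables packed as z = (w1, w2, b, eps_1..eps_N).
def objective(z):
    w, eps = z[:2], z[3:]
    return 0.5 * w @ w + C * eps.sum()

constraints = (
    # y_i (w^T x_i + b) >= 1 - eps_i
    [{"type": "ineq", "fun": lambda z, i=i: y[i] * (X[i] @ z[:2] + z[2]) - 1.0 + z[3 + i]}
     for i in range(N)]
    # eps_i >= 0
    + [{"type": "ineq", "fun": lambda z, i=i: z[3 + i]} for i in range(N)]
)
res = minimize(objective, x0=np.zeros(3 + N), constraints=constraints, method="SLSQP")
w, b, eps = res.x[:2], res.x[2], res.x[3:]
print(eps.round(3))   # only the outlier should need substantial slack
```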