Lin Xuantian's Machine Learning Techniques Notes (1)
Dual Support Vector Machine
P6 2.1
Lecture 1 covered the linear support vector machine; Lecture 2 covers the dual support vector machine.
The previous lecture showed how to get a non-linear SVM: transform the data into z-space and solve a QP there. That QP has $\tilde{d}+1$ variables and $N$ constraints, so when $\tilde{d}$ is very large (even infinite) it becomes hard to solve. The goal of this lecture is to make the SVM not depend on $\tilde{d}$:
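For reference, the z-space primal problem being dualized here is the standard hard-margin SVM:

```latex
\min_{b,\,w}\ \frac{1}{2}w^Tw
\qquad \text{s.t.}\quad y_n(w^Tz_n+b)\ \ge\ 1,\quad n=1,\dots,N
```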
We can convert the original SVM into an equivalent SVM.
This is the dual problem:
Following the earlier treatment of regularization, we introduce Lagrange multipliers $\lambda$ to turn the constrained problem into an unconstrained one; there are $N$ multipliers, one per constraint. These define the Lagrangian function. The SVM literature usually writes $\lambda$ as $\alpha$, and the SVM becomes the formula on the right:
If some constraint cannot be satisfied by $(b,w)$, then $1-y_n(w^Tz_n+b)$ is positive for that $n$; taking the max over $\alpha_n \ge 0$ then drives the Lagrangian to infinity, and the outer min will never pick such a $(b,w)$. The constraint-violating $(b,w)$ are filtered out automatically.
If all constraints are satisfied, every term $1-y_n(w^Tz_n+b)$ is non-positive. Since there is a max over $\alpha_n \ge 0$, the best value of $\sum_n \alpha_n\bigl(1-y_n(w^Tz_n+b)\bigr)$ is $0$ (each summand is at most $0$, so the sum is maximized by making every term $0$), and the formula reduces to $\frac{1}{2}w^Tw$.
In this way the min-max effectively screens out the $(b,w)$ that violate the constraints and minimizes $\frac{1}{2}w^Tw$ over the rest.
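A tiny numeric check of this filtering argument, on a made-up 1-D toy dataset (all numbers here are hypothetical, just to exercise the formula):

```python
import numpy as np

# Toy data: two separable points in a 1-D "z-space" (hypothetical example).
Z = np.array([[2.0], [-2.0]])   # z_1, z_2
y = np.array([1.0, -1.0])

def lagrangian(b, w, alpha):
    """L(b, w, alpha) = 1/2 w^T w + sum_n alpha_n * (1 - y_n (w^T z_n + b))."""
    slack = 1.0 - y * (Z @ w + b)
    return 0.5 * (w @ w) + alpha @ slack

# Feasible (b, w): every 1 - y_n(w^T z_n + b) <= 0, so the best alpha >= 0
# is alpha = 0 and max_alpha L equals 1/2 w^T w.
w, b = np.array([1.0]), 0.0
assert np.all(1.0 - y * (Z @ w + b) <= 0)
print(lagrangian(b, w, np.zeros(2)))        # 0.5, i.e. 1/2 w^T w

# Infeasible (b, w): some slack term is positive, so cranking up that
# alpha_n drives L toward infinity -- the outer min never picks such (b, w).
w_bad = np.array([0.0])
print(lagrangian(b, w_bad, np.array([1e6, 1e6])))   # huge
```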
P7 2.2
The previous section transformed the SVM into a min-max of the Lagrangian, so how do we find a lower bound for it? For any fixed $\alpha'$ with every $\alpha'_n \ge 0$, we have:
Because this holds for any such $\alpha'$, it remains true after taking the max over $\alpha'$ on the right-hand side:
The right-hand side is the Lagrangian dual problem; solving it gives a lower bound on the original SVM.
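Written out, the lower-bound chain from the two steps above is:

```latex
\min_{b,w}\,\max_{\alpha_n\ge 0} L(b,w,\alpha)
\;\ge\; \min_{b,w}\, L(b,w,\alpha') \quad (\text{any fixed } \alpha'\ge 0)
\;\;\Longrightarrow\;\;
\min_{b,w}\,\max_{\alpha_n\ge 0} L(b,w,\alpha)
\;\ge\; \max_{\alpha'_n\ge 0}\,\min_{b,w}\, L(b,w,\alpha')
```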
Because the three conditions in green are met (the problem is convex, a feasible solution exists since the data are separable, and the constraints are linear), strong duality holds for this QP, so the two sides are exactly equal, and there exists a $(b,w,\alpha)$ that is optimal for both sides. The inner problem now has no constraints on $(b,w)$, so we can start solving it:
Since the inner problem is a min over $b$, the optimum requires $\partial L/\partial b = 0$, which gives:
So we can add this restriction on $\alpha$ without changing the optimum, and simplify the formula:
The last term is then $b \cdot \sum_n y_n\alpha_n = b \cdot 0$, so it drops out and the problem becomes:
Similarly, because of the min over $w$, we set the partial derivative $\partial L/\partial w = 0$, which fixes $w = \sum_{n=1}^N \alpha_n y_n z_n$; substituting this in and simplifying, the inner min can be dropped. After the max adopts the constraints above, neither $b$ nor $w$ appears in the formula any more, and only $\alpha$ remains to be optimized.
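Carrying out both substitutions, the problem collapses to an optimization over $\alpha$ alone:

```latex
\frac{\partial L}{\partial b}=0 \;\Rightarrow\; \sum_{n=1}^N y_n\alpha_n = 0,
\qquad
\frac{\partial L}{\partial w}=0 \;\Rightarrow\; w = \sum_{n=1}^N \alpha_n y_n z_n,

\max_{\substack{\alpha_n\ge 0,\;\; \sum_n y_n\alpha_n=0,\\ w=\sum_n \alpha_n y_n z_n}}
\;-\frac{1}{2}\Bigl\|\sum_{n=1}^N \alpha_n y_n z_n\Bigr\|^2 \;+\; \sum_{n=1}^N \alpha_n
```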
Finally, the four conditions satisfied at the optimum are the KKT conditions. A note on the fourth point (complementary slackness -- like Harry Potter and Voldemort, one of the two factors must "die", i.e. be zero): if $y_n(w^Tz_n+b)=1$, the point lies exactly on the boundary (the points with $\alpha_n > 0$ are the support vectors) and the product is naturally $0$; if $y_n(w^Tz_n+b) > 1$, then by the min-max argument in the last figure of 2.1, $\alpha_n$ can only be $0$, so the product is $0$ in that case too.
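For the record, the four KKT conditions at the primal-dual optimum $(b,w,\alpha)$ are:

```latex
\begin{aligned}
&\text{primal feasibility:} && y_n(w^Tz_n+b)\ \ge\ 1 \\
&\text{dual feasibility:}   && \alpha_n\ \ge\ 0 \\
&\text{dual-inner optimality:} && \textstyle\sum_n y_n\alpha_n = 0,\quad w=\textstyle\sum_n \alpha_n y_n z_n \\
&\text{complementary slackness:} && \alpha_n\bigl(1-y_n(w^Tz_n+b)\bigr) = 0
\end{aligned}
```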
Finally there is a small fun-time exercise to consolidate, which is quite interesting. For ②, look back at the definition of $L(b,w,\alpha)$ to see which $y_n$ and $z_n$ are involved, and then $w = \sum_n \alpha_n y_n z_n$ falls out. ③ holds because each term of the sum must be $0$ under KKT. For the $\alpha_2(w-3)$ term, the specific values of $w$, $y_n$, $z_n$ can be ignored; the whole product just has to be $0$.
P8 2.3
Simplify the formula from the previous section: turn the max into a min by negating the objective, writing $w^Tw$ as the square of $\sum_n \alpha_n y_n z_n$. The constraint $w = \ldots$ need not be included, because the focus now is entirely on the $\alpha_n$. The result is a convex QP with $N$ variables (the $\alpha_n$) and $N+1$ constraints ($N$ constraints $\alpha_n \ge 0$, plus $\sum_{n=1}^N y_n\alpha_n = 0$), so we can set it up as a QP.
Note: when entering the QP, you generally don't need to split the "=" constraint into two inequalities; most solvers accept equality constraints directly, and the bounds $\alpha_n \ge 0$ can be written directly as range bounds.
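As a concrete sketch of the QP setup, here scipy's generic SLSQP solver stands in for a dedicated QP package, on a tiny made-up 1-D dataset (both the data and the use of SLSQP are illustrative assumptions, not the course's solver):

```python
import numpy as np
from scipy.optimize import minimize

# Toy data (hypothetical): two separable 1-D points in z-space.
Z = np.array([[2.0], [-2.0]])
y = np.array([1.0, -1.0])
N = len(y)

# Dual QP:  min_alpha  1/2 alpha^T Q alpha - 1^T alpha
#           s.t.       sum_n y_n alpha_n = 0,  alpha_n >= 0
# with q_{n,m} = y_n y_m z_n^T z_m.
Q = np.outer(y, y) * (Z @ Z.T)

res = minimize(
    lambda a: 0.5 * a @ Q @ a - a.sum(),
    x0=np.zeros(N),
    bounds=[(0, None)] * N,                                # alpha_n >= 0
    constraints=[{"type": "eq", "fun": lambda a: y @ a}],  # sum y_n alpha_n = 0
    method="SLSQP",
)
alpha = res.x
print(alpha)   # both points end up as SVs here, alpha ~ [0.125, 0.125]
```

For this toy problem $Q$ is the all-fours matrix, and the unique feasible optimum is $\alpha_1=\alpha_2=1/8$.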
However, note that $Q$ is a dense matrix: most of its entries $q_{n,m} = y_n y_m z_n^T z_m$ are non-zero, so both the computation and the storage are expensive. For this reason, solvers specially designed for SVM are usually used.
Through the four KKT conditions we can recover $w$ and $b$. In particular, whenever $\alpha_n > 0$, complementary slackness forces $1 - y_n(w^Tz_n+b) = 0$, i.e. $y_n(w^Tz_n+b) = 1$, and so $b = y_n - w^Tz_n$. The equality $y_n(w^Tz_n+b)=1$ means exactly that the point sits on the fat boundary of the SVM; as for why, it helps to look back at the hyperplane geometry again.
P9 2.4
From the previous section we know that when $\alpha_n > 0$, the point lies on the boundary. However, a point on the classification boundary is not necessarily a support vector (it may have $\alpha_n = 0$), so from now on exactly the points with $\alpha_n > 0$ are called support vectors (SV), and only these SVs are studied, which narrows the scope a little.
Therefore, both $w$ and $b$ can be computed from the SVs alone: the non-SV points have $\alpha_n = 0$ and contribute nothing.
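A minimal sketch of this recovery step, assuming the dual variables $\alpha$ have already been found by some QP solver (the data and $\alpha$ values below are hypothetical toy numbers):

```python
import numpy as np

# Toy separable data in 1-D z-space, with an assumed optimal dual solution.
Z = np.array([[2.0], [-2.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.125, 0.125])   # assumed output of the dual QP

sv = alpha > 1e-8                  # support vectors: alpha_n > 0
w = (alpha[sv] * y[sv]) @ Z[sv]    # w = sum over SVs of alpha_n y_n z_n
b = y[sv][0] - w @ Z[sv][0]        # b = y_s - w^T z_s, using any one SV

print(w, b)                        # [0.5] 0.0
# Every SV sits exactly on the fat boundary: y_n (w^T z_n + b) = 1.
print(y[sv] * (Z[sv] @ w + b))     # [1. 1.]
```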
The formulas for $w$ in SVM and PLA are very similar: both are linear combinations of the $y_n z_n$, so in both cases $w$ is "represented by the data". The $w$ of SVM is represented by the SVs alone, while that of PLA is represented by the points where mistakes occurred. Philosophically, the question is what data we choose to represent our $w$.
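Side by side, with $\beta_n$ counting how many times PLA corrected on point $n$ (notation introduced here just for the comparison):

```latex
w_{\mathrm{SVM}} = \sum_{n:\ \alpha_n>0} \alpha_n\, y_n z_n,
\qquad
w_{\mathrm{PLA}} = \sum_{n=1}^N \beta_n\, y_n z_n
```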
Comparing the two formulations of SVM, primal and dual: hard-margin means the o's and x's must be classified strictly, with no mistakes allowed. In general, the dual SVM is the one used.
Finally: although the dual SVM appears to depend only on $N$, in fact $\tilde{d}$ is hidden inside $Q$ (computing each entry $q_{n,m} = y_n y_m z_n^T z_m$ takes $O(\tilde{d})$ time). The next lecture explains how to avoid this $\tilde{d}$.
The final summary: