This article mainly follows the SVM explanation by Abella, an uploader on Bilibili, and organizes my notes from learning SVM. The "whiteboard derivation" series on Bilibili is also recommended; it explains the mathematical formulas in more detail.
Optimization problem
Under given constraints, find the maximum or minimum value of a function.
Lagrange multiplier method
The Lagrange multiplier model is as follows, where k is the number of constraints and hk(x) is the k-th constraint.
The extremum is attained where the objective and the constraint are tangent. At that point the gradients of the two functions are collinear, so after differentiation the gradient of the objective is proportional to the gradient of the constraint.
This turns the constrained problem into an unconstrained one, which can then be solved directly by differentiation.
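The model just described can be sketched in math form (notation assumed: f is the objective, hk the k-th equality constraint, λk its multiplier):

```latex
% Constrained problem
\min_x f(x) \quad \text{s.t.}\ h_k(x) = 0,\quad k = 1, \dots, K

% Lagrangian: constrained -> unconstrained
L(x, \lambda) = f(x) + \sum_{k=1}^{K} \lambda_k h_k(x)

% Stationarity: the gradients are collinear
\nabla_x L = \nabla f(x) + \sum_{k} \lambda_k \nabla h_k(x) = 0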
KKT conditions
The mathematical model when the constraints include inequalities.
The Lagrangian is constructed, and the KKT conditions must hold.
The specific KKT conditions are as follows:
1. The derivative of the constructed Lagrangian with respect to x equals 0 (stationarity).
2. The multiplier in front of each inequality constraint is greater than or equal to 0.
3. Complementary slackness: the product of each multiplier and its inequality constraint equals 0, so a multiplier can be nonzero only when its constraint is active.
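For reference, the inequality-constrained model and the full set of KKT conditions can be written out (notation assumed: gj are the inequality constraints, μj their multipliers):

```latex
\min_x f(x) \quad \text{s.t.}\ g_j(x) \le 0

L(x, \mu) = f(x) + \sum_j \mu_j g_j(x)

% KKT conditions
\nabla_x L = 0        % stationarity (condition 1)
\mu_j \ge 0           % dual feasibility (condition 2)
\mu_j\, g_j(x) = 0    % complementary slackness
g_j(x) \le 0          % primal feasibility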
Dual problem
If the original optimization problem is not easy to solve, the dual problem can be solved instead by swapping the order of the maximization and minimization. When the KKT conditions are satisfied, the dual optimum equals the minimum of the original f(x).
Proof of why the two are equivalent.
Definition of duality. The primal problem first maximizes over a and b, then minimizes over w; the dual problem reverses the order, first minimizing over w and then maximizing over a and b.
In general min(max(·)) >= max(min(·)). House-price analogy: min(max(·)) finds the lowest among a set of high prices, like the most expensive houses in Beijing; max(min(·)) finds the highest among a set of low prices, like the cheapest houses in a remote county town. The former must be >= the latter. The rigorous statement is d <= p (weak duality), and the two are equal only when the KKT conditions hold (strong duality).
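Weak duality max(min(·)) <= min(max(·)) can be checked numerically. A minimal sketch on a toy table of invented numbers (rows indexed by one variable, columns by the other):

```python
# Toy table L[i][j]; weak duality says:
#   max over i of (min over j)  <=  min over j of (max over i)
table = [
    [1, 3],
    [4, 2],
]

# "best of the row worsts": minimize over j first, then maximize over i
max_min = max(min(row) for row in table)
# "worst of the column bests": maximize over i first, then minimize over j
min_max = min(max(row[j] for row in table) for j in range(2))

print(max_min, min_max)    # 2 3
assert max_min <= min_max  # weak duality holds
```

The gap between the two values (here 2 vs 3) is the duality gap; it closes only under strong duality.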
Support Vector Machines
SVM is generally used for binary classification. In the left-hand figure, countless lines can separate the samples, but the dashed line easily misclassifies as more samples are added; the solid line performs much better.
The support vector machine finds the optimal separating solid line by maximizing the distance to the support vectors on both sides (i.e., the margin between the two classes is maximized).
SVM mathematical model
The purpose is to find the solid-line function, i.e., w and b in wx + b = 0.
The boundary lines of the two classes are wx + b = 1 and wx + b = -1. The value 1 on the right-hand side is arbitrary: if it were 100, dividing both sides by 100 would recover 1, and w and b would only be rescaled linearly, which does not affect the result.
Subtract the two boundary equations to express the margin to be maximized.
Using the law of cosines, the expression is converted via the inner-product formula of two vectors.
This gives the margin d1 + d2 = 2/||w||, which is to be maximized.
Since w appears in the denominator, maximizing the margin is equivalent to minimizing ||w||.
The constraint is that the true label and the prediction have the same sign, i.e., the prediction is correct.
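The steps above lead to the standard hard-margin primal (notation assumed: yi in {+1, -1} are the labels, xi the samples):

```latex
% maximizing the margin 2/||w||  <=>  minimizing ||w||
\min_{w, b}\ \frac{1}{2} \lVert w \rVert^2
\quad \text{s.t.}\ y_i \left( w \cdot x_i + b \right) \ge 1,\quad i = 1, \dots, N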
Use Lagrange multipliers to turn this into an unconstrained problem.
Then convert it to the dual problem for computation.
In the final computation only the sample points on the margin (the support vectors) have nonzero multipliers; the other samples do not participate.
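The resulting dual problem, written out for reference (standard form; αi are the Lagrange multipliers):

```latex
\max_{\alpha}\ \sum_i \alpha_i
  - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j)
\quad \text{s.t.}\ \alpha_i \ge 0,\quad \sum_i \alpha_i y_i = 0

% only the support vectors have alpha_i > 0; they determine w:
w = \sum_i \alpha_i y_i x_i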
Mathematical model with slack variables
The SVM above handles completely separable data and does not account for abnormal points. Adding slack variables lets the model tolerate outliers.
When one class contains points from the other class, the constraints must be relaxed.
C is the penalty on the slack variables. The larger C is, the smaller the slack variables and the narrower the margin between the dashed lines; the smaller C is, the larger the slack variables and the wider the margin.
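The soft-margin primal with slack variables ξi and penalty C, written out (standard form):

```latex
\min_{w, b, \xi}\ \frac{1}{2} \lVert w \rVert^2 + C \sum_i \xi_i
\quad \text{s.t.}\ y_i \left( w \cdot x_i + b \right) \ge 1 - \xi_i,\quad \xi_i \ge 0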
Differentiate with respect to each variable and set the derivatives to 0.
Substitute the solutions back into the original formula.
This yields the dual form and the corresponding KKT conditions.
Kernel function
For linearly inseparable samples, a kernel function is used.
Mapping directly to a high-dimensional space raises a problem: the computation grows with the dimension.
A kernel is another way to measure the similarity of two data points: it computes the high-dimensional inner product without performing the mapping explicitly.
Add the kernel function to the dual problem in place of the inner product.
A necessary and sufficient condition for a valid kernel (the kernel matrix must be symmetric positive semi-definite):
Common kernels:
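A minimal sketch of common kernels as plain functions (standard definitions; the gamma, degree, and c defaults are illustrative, not from the lecture):

```python
import math

def linear(x, y):
    # plain inner product
    return sum(a * b for a, b in zip(x, y))

def polynomial(x, y, degree=2, c=1.0):
    # (x . y + c)^degree
    return (linear(x, y) + c) ** degree

def rbf(x, y, gamma=0.5):
    # exp(-gamma * ||x - y||^2), a similarity in (0, 1]
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

x, y = [1.0, 2.0], [2.0, 0.0]
print(linear(x, y))      # 2.0
print(polynomial(x, y))  # (2 + 1)^2 = 9.0
print(rbf(x, x))         # 1.0: a point is maximally similar to itself
```

Each function replaces the dot product `x_i . x_j` in the dual, which is how nonlinearity enters without an explicit mapping.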
SMO algorithm for SVM
Like coordinate descent: optimize along the y-axis first, then the x-axis, then the y-axis again... fix one variable and solve for the other.
Because the constraint sum(alpha_i * y_i) = 0 links the variables, SMO selects two variables at a time and optimizes them while the remaining variables do not move.
Since the other variables are fixed, they can be treated as constants.
Proof of the upper and lower bounds on the two parameters:
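The bounds for the two selected multipliers can be sketched as follows (standard SMO clipping formulas; the example values are made up):

```python
def alpha_bounds(alpha_i, alpha_j, y_i, y_j, C):
    """Feasible interval [low, high] for the updated alpha_j, keeping
    0 <= alpha <= C and the constraint sum(alpha_k * y_k) = 0 intact."""
    if y_i != y_j:
        return max(0.0, alpha_j - alpha_i), min(C, C + alpha_j - alpha_i)
    return max(0.0, alpha_i + alpha_j - C), min(C, alpha_i + alpha_j)

def clip(value, low, high):
    # project the unconstrained optimum back into [low, high]
    return min(high, max(low, value))

low, high = alpha_bounds(0.25, 0.5, y_i=1, y_j=-1, C=1.0)
print(low, high)             # 0.25 1.0
print(clip(1.4, low, high))  # 1.0
```

Each SMO step solves a one-dimensional quadratic for alpha_j analytically and then clips the result into this box.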
SVM multi-category sample classification
OVR
Treat the samples as two classes: one classifier for A vs not-A, one for B vs not-B, and so on. With K categories, K SVMs must be built; the sample is assigned to the category whose classifier gives the highest probability. This approach is generally used.
OVO
Train pairwise classifiers A vs B, B vs C, A vs C. Use the counting (voting) method: the category that the pairwise classifiers pick most often is the prediction.
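The voting rule can be sketched with hypothetical pairwise classifiers; the threshold rules below are made-up stand-ins for trained SVMs:

```python
# Each pairwise "classifier" returns one of its two class labels.
pairwise = {
    ("A", "B"): lambda x: "A" if x < 5 else "B",
    ("B", "C"): lambda x: "B" if x < 8 else "C",
    ("A", "C"): lambda x: "A" if x < 6 else "C",
}

def predict_ovo(x):
    votes = {}
    for classify in pairwise.values():
        label = classify(x)
        votes[label] = votes.get(label, 0) + 1
    # the label with the most pairwise wins is the prediction
    return max(votes, key=votes.get)

print(predict_ovo(3))  # A  (votes: A=2, B=1)
print(predict_ovo(9))  # C  (votes: B=1, C=2)
```

With K categories this needs K*(K-1)/2 classifiers, versus K for one-vs-rest.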
Steps to use SVM
2. Standardize the data: since SVM is measured by distance, the scales of the features should be unified.
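The standardization step can be sketched per feature (a minimal sketch with invented numbers):

```python
import math

def standardize(column):
    # zero mean, unit variance: (v - mean) / std
    mean = sum(column) / len(column)
    variance = sum((v - mean) ** 2 for v in column) / len(column)
    return [(v - mean) / math.sqrt(variance) for v in column]

heights_cm = [160.0, 170.0, 180.0]
scaled = standardize(heights_cm)
print(scaled)  # centered at 0; distances now comparable across features
```

Apply the same mean and std computed on the training set to the test set, otherwise the margins shift between the two.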
SVM parameters.
SVM advantages and disadvantages
Another disadvantage is that SVM hyperparameters strongly influence the results, because tuning them directly affects which support vectors are selected. Generally C is larger than gamma: C is usually set to a value above 1, and gamma to a value below 1.
Information Reference
https://www.bilibili.com/video/BV1ZE411p73x?p=7