Data mining: model selection - SVM

This article is mainly based on the SVM lectures by Abella, an uploader on Bilibili, and organizes my notes from learning SVM. The "whiteboard derivation" series on Bilibili is also recommended for a more detailed walkthrough of the mathematical formulas.

Optimization problem

An optimization problem asks for the maximum or minimum value of a function, possibly subject to certain conditions (constraints).

Lagrange multiplier method

The Lagrange multiplier method handles equality constraints: k is the number of constraints and h_k(x) is the k-th constraint function. The extreme value is attained where the level curve of the objective is tangent to the constraint surface; at that point the gradients of the two functions are collinear, so after taking derivatives the two gradients are proportional. The method turns the constrained problem into an unconstrained one, which is then solved directly by differentiation.
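For reference, the standard formulation (which the original figures presumably showed) is:

$$\min_x f(x) \quad \text{s.t.} \quad h_k(x)=0,\; k=1,\dots,K$$

$$L(x,\lambda)=f(x)+\sum_{k=1}^{K}\lambda_k h_k(x), \qquad \nabla_x L = 0,\quad \nabla_\lambda L = 0.$$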

KKT conditions

This is the mathematical model for handling inequality constraints. A generalized Lagrangian function is constructed, and the KKT conditions characterize its optimum:
1. Stationarity: the derivative of the constructed Lagrangian with respect to x equals 0.
2. Dual feasibility: the multiplier in front of each inequality constraint is greater than or equal to 0.
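For completeness, the full set of KKT conditions for min f(x) subject to g_j(x) <= 0 and h_k(x) = 0 (standard form) is:

$$L(x,\mu,\lambda)=f(x)+\sum_j \mu_j\, g_j(x)+\sum_k \lambda_k\, h_k(x)$$

$$\nabla_x L = 0,\qquad g_j(x)\le 0,\qquad h_k(x)=0,\qquad \mu_j \ge 0,\qquad \mu_j\, g_j(x)=0.$$

The last condition (complementary slackness) is what later forces the multipliers of non-support-vector points to zero in the SVM.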

Dual problem

If the original (primal) optimization problem is not easy to solve, the dual problem can be solved instead. When the KKT conditions are satisfied, the dual optimum equals the primal optimum, so solving the dual is equivalent to finding the minimum of the original f(x).

Definition of duality: the primal problem is a min-max, first maximizing over the multipliers a and b and then minimizing over w. The dual problem is just the other way around: first minimize over w, then maximize over a and b.

In general, min(max(...)) >= max(min(...)). A housing-price analogy: min(max(...)) picks the cheapest among a set of expensive cities, such as Beijing, while max(min(...)) picks the most expensive among a set of cheap remote county towns; the former must be >= the latter. Formally this is weak duality, d* <= p*, and the two are equal (strong duality) only when the KKT conditions hold.
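A quick numeric sanity check of the min-max >= max-min inequality (my own toy example, not from the original post):

```python
import numpy as np

# Toy table L[x, y]: weak duality says
# min over x of (max over y)  >=  max over y of (min over x).
L = np.array([[1, 4],
              [3, 2]])

min_max = L.max(axis=1).min()   # min_x max_y L = min(4, 3) = 3
max_min = L.min(axis=0).max()   # max_y min_x L = max(1, 2) = 2
print(min_max, max_min)         # 3 2  ->  3 >= 2
```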

Support Vector Machines

SVMs are generally used for binary classification. In the original left-hand figure, countless lines can separate the two classes, but a line like the dotted one, which passes close to the samples, easily misclassifies new points as more samples arrive; the solid line performs much better.
The support vector machine finds this optimal solid line by maximizing the distance to the support vectors on both sides, i.e. maximizing the margin between the two classes.

SVM mathematical model

The goal is to find the solid-line function, i.e. the w and b of the hyperplane wx + b = 0.
The boundary lines of the two classes are wx + b = 1 and wx + b = -1. The constant 1 on the right-hand side is arbitrary: if it were 100, dividing both sides by 100 would give 1 again, and w and b would only be rescaled linearly, which does not affect the result.
Subtracting the two boundary equations and converting the inner product of the two vectors with the cosine formula gives the quantity to maximize: the total distance d1 + d2 between the boundaries, which works out to 2/||w||.
Previously w was in the denominator of the quantity being maximized; the problem is now flipped to finding the minimum of ||w|| (conventionally (1/2)||w||^2).
The constraint is that the true label and the predicted value have the same sign, i.e. every sample is predicted correctly.
The Lagrange multiplier method then turns this into an unconstrained problem.
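In standard notation (reconstructing what the figures showed), the primal problem and its Lagrangian are:

$$\min_{w,b}\ \tfrac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i\,(w\cdot x_i+b)\ \ge\ 1,\quad i=1,\dots,n$$

$$L(w,b,\alpha)=\tfrac{1}{2}\|w\|^2-\sum_{i=1}^{n}\alpha_i\big[y_i\,(w\cdot x_i+b)-1\big],\qquad \alpha_i\ge 0.$$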
This is then turned into the corresponding dual problem.
In the final calculation, only the sample points on the support vectors matter; all other points have zero multipliers and do not participate.
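A minimal scikit-learn sketch (my own illustration, not from the original post): a linear SVM with a very large C approximates the hard-margin model, and support_vectors_ confirms that only a few boundary points define the solution.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated blobs (synthetic data for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)),
               rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# A very large C approximates the hard-margin SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("w =", clf.coef_[0], "b =", clf.intercept_[0])
print("support vectors:", len(clf.support_vectors_), "out of", len(X))
```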

Mathematical model with slack variables

The SVM above assumes completely separable data and does not account for abnormal points. Adding slack variables tolerates such outliers: when samples of one class fall inside the region of the other class, the constraints must be relaxed.
C is the penalty on the slack variables. The larger C is, the smaller the slack variables and the narrower the margin between the dashed lines; the smaller C is, the larger the slack variables and the wider the margin.
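The standard soft-margin objective (again a reconstruction of the standard form) is:

$$\min_{w,b,\xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.}\quad y_i\,(w\cdot x_i+b)\ \ge\ 1-\xi_i,\quad \xi_i\ \ge\ 0.$$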
Differentiate the Lagrangian with respect to w, b, and the slack variables separately, setting each derivative to 0.
Substituting these solutions back into the original formula yields the dual form together with its KKT conditions.
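A quick illustration of the effect of C (my own sketch with synthetic overlapping data): a smaller C permits more slack, so more points become support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Overlapping classes: perfect separation is impossible,
# so the slack variables must absorb the outliers.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1.0, (50, 2)),
               rng.normal(1, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {len(clf.support_vectors_)} support vectors")
```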

Kernel function

For linearly inseparable samples, a kernel function is used.
Extending the problem directly to a high dimension means explicitly mapping every sample and computing inner products there, which is expensive. A kernel is another way to measure the similarity of two data points: it evaluates the high-dimensional inner product directly from the original features, and is plugged into the dual problem in place of the plain inner product.
The necessary and sufficient condition for a valid kernel is that the kernel matrix it induces is symmetric and positive semidefinite (Mercer's condition).
Common kernels include the linear, polynomial, and Gaussian (RBF) kernels.
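A tiny numeric check of the kernel trick (my own example): the polynomial kernel (x.z)^2 returns the same value as an explicit inner product in the mapped feature space, without ever constructing that space.

```python
import numpy as np

# K(x, z) = (x . z)**2 equals the inner product of the explicit
# degree-2 feature map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2).
def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

print(np.dot(x, z) ** 2)        # kernel value: (1*3 + 2*4)^2 = 121
print(np.dot(phi(x), phi(z)))   # same value via the explicit mapping
```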

SMO algorithm for SVM

SMO works like coordinate ascent: fix all variables except a chosen few and solve for those, alternating, first along one axis, then the other, then back again.
SMO selects two multipliers at a time and optimizes them while the remaining variables stay fixed. (Two rather than one, because the constraint sum(alpha_i * y_i) = 0 means a single multiplier is fully determined by the rest.) Since the other variables are fixed, they can be treated as constants, which reduces each step to a one-dimensional quadratic problem with a closed-form solution.
The updated multiplier must also be clipped to its feasible interval; the original figures gave the proof of the upper and lower bounds of that interval.
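Below is a compact sketch of the simplified SMO variant (random choice of the second multiplier rather than the full heuristic; labels assumed to be +1/-1; the function name and defaults are my own):

```python
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=5):
    """Very simplified SMO sketch (linear kernel): pick a pair (i, j),
    optimize alpha_j analytically, clip it to [L, H], update alpha_i."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    n = len(y)
    alpha, b = np.zeros(n), 0.0
    K = X @ X.T                               # kernel matrix (linear kernel)
    f = lambda k: (alpha * y) @ K[:, k] + b   # decision value for sample k
    rng = np.random.default_rng(0)
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            E_i = f(i) - y[i]
            if (y[i] * E_i < -tol and alpha[i] < C) or (y[i] * E_i > tol and alpha[i] > 0):
                j = rng.choice([k for k in range(n) if k != i])
                E_j = f(j) - y[j]
                # Bounds keep the pair on the constraint sum(alpha * y) = const.
                if y[i] != y[j]:
                    L, H = max(0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
                else:
                    L, H = max(0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if L == H or eta >= 0:
                    continue
                a_i_old, a_j_old = alpha[i], alpha[j]
                alpha[j] = np.clip(a_j_old - y[j] * (E_i - E_j) / eta, L, H)
                if abs(alpha[j] - a_j_old) < 1e-5:
                    continue
                alpha[i] += y[i] * y[j] * (a_j_old - alpha[j])
                # Update b from whichever multiplier is strictly inside (0, C).
                b1 = b - E_i - y[i]*(alpha[i]-a_i_old)*K[i, i] - y[j]*(alpha[j]-a_j_old)*K[i, j]
                b2 = b - E_j - y[i]*(alpha[i]-a_i_old)*K[i, j] - y[j]*(alpha[j]-a_j_old)*K[j, j]
                b = b1 if 0 < alpha[i] < C else b2 if 0 < alpha[j] < C else (b1 + b2) / 2
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b
```

At the solution, the entries of alpha that are strictly positive correspond to the support vectors, matching the earlier observation that only those points participate in the final model.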

SVM multi-class classification

OVR (One-vs-Rest)

Treat the overall sample as two classes at a time: class A versus not-A, class B versus not-B, and so on. With K categories, K SVMs need to be built, and the category with the highest score wins. This is the approach generally used.

OVO (One-vs-One)

Train a classifier for every pair of classes (A vs B, B vs C, A vs C), which requires K(K-1)/2 classifiers for K categories. Then use a counting (voting) method: the category a data point is assigned to most often across the pairwise classifiers is the final prediction.
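A brief scikit-learn illustration (my own sketch using the iris dataset): SVC implements OVO internally, and OneVsRestClassifier wraps any estimator as OVR.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 3 classes

ovr = OneVsRestClassifier(SVC()).fit(X, y)          # K binary SVMs
ovo = SVC(decision_function_shape="ovo").fit(X, y)  # K(K-1)/2 pairwise SVMs

print(len(ovr.estimators_))                # 3 one-vs-rest classifiers
print(ovo.decision_function(X[:1]).shape)  # (1, 3): one score per pair
```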

Steps to use SVM

Standardize the data first: because SVM is measured by distance, the features should be on a unified scale.
The main parameters to set are the penalty C and the kernel with its parameter gamma; their tuning is discussed in the next section.
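A minimal usage sketch (my own example on the iris dataset): putting the scaler in a pipeline ensures it is fit only on the training folds.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Scale features before the distance-based SVM.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
print(cross_val_score(model, X, y, cv=5).mean())
```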

SVM advantages and disadvantages

A further disadvantage is that SVM hyperparameters have a great influence on the results, because tuning them directly changes which points are selected as support vectors. As a rule of thumb from the source video, C is larger than gamma: C usually takes a value of 1 or above, and gamma a value below 1.
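Since tuning matters so much, here is an illustrative grid search over C and gamma (my own sketch; the ranges follow the rule of thumb above):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# C at 1 and above, gamma below 1.
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [1, 10, 100], "gamma": [0.001, 0.01, 0.1]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```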

References

https://www.bilibili.com/video/BV1ZE411p73x?p=7

Origin: blog.csdn.net/AvenueCyy/article/details/105324988