02-33 Non-linear support vector machine

For more machine learning content, along with Python, Go, data structures and algorithms, web crawlers, and artificial intelligence tutorials, visit: https://www.cnblogs.com/nickchen121/

Non-linear support vector machine

SVM comes in three forms. The linearly separable support vector machine and the linear support vector machine both deal with linearly separable data; the linear support vector machine only copes with outliers and does not truly handle data that cannot be linearly separated. This section introduces the third form, the non-linear support vector machine, which does.

I. Non-linear support vector machine learning objectives

  1. The kernel trick
  2. Positive definite kernel functions
  3. The four commonly used kernel functions
  4. The procedure of the non-linear support vector machine

II. Non-linear support vector machine in detail

2.1 Polynomial regression and the non-linear support vector machine

Polynomial regression was already mentioned in the linear regression articles, so here is only a brief review.

Assume that a house's price is related to the side length, the floor area, and the volume of the (cubic) house, but only the side length \(x_1\) is available as a feature. Using just this one feature to predict the price is very likely to underfit, so the area and the volume are added as two more features, i.e.
\[\hat{y} = \omega_1x_1 + \omega_2{x_1}^2 + \omega_3{x_1}^3 + b\]
Letting \(x_1 = x_1, x_2 = {x_1}^2, x_3 = {x_1}^3\), the polynomial model becomes the linear model
\[\hat{y} = \omega_1x_1 + \omega_2x_2 + \omega_3x_3 + b\]
After this transformation, polynomial regression turns back into linear regression: the one-dimensional data, which is not linear, is mapped into three dimensions as described above, where it becomes linear.
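Below is a minimal sketch of this feature-expansion idea, assuming a made-up one-dimensional "side length" feature with toy prices generated as \(x_1 + {x_1}^2 + {x_1}^3\); the data and names are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

side = np.array([[1.0], [2.0], [3.0], [4.0]])   # x_1: side length of the house (toy data)
price = np.array([3.0, 14.0, 39.0, 84.0])       # toy prices, generated as x_1 + x_1^2 + x_1^3

# Map x_1 -> (x_1, x_1^2, x_1^3), i.e. add the "area" and "volume" features
X = PolynomialFeatures(degree=3, include_bias=False).fit_transform(side)

# Plain linear regression in the mapped 3-D space fits the cubic relationship
model = LinearRegression().fit(X, price)
print(model.coef_, model.intercept_)            # coefficients close to (1, 1, 1), intercept close to 0
```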

The non-linear support vector machine uses exactly this idea: it maps low-dimensional features into a high-dimensional space so that the data becomes linearly separable, and the problem then reduces to classifying linearly separable data.

2.2 The kernel trick

2.2.1 Why kernels are introduced

First, recall the optimization objective function of the linear support vector machine:
\[\begin{align} & \underbrace{\min}_{\alpha} {{\frac{1}{2}}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_j(x_i\cdot{x_j}) - \sum_{i=1}^m\alpha_i} \\ & s.t. \quad \sum_{i=1}^m\alpha_iy_i = 0 \\ & \quad\quad 0\leq\alpha_i\leq{C} \end{align}\]
From this formula it can be seen that the features enter the objective function of the linear SVM only through the inner product \(x_i\cdot{x_j}\). If a mapping function \(\phi(x)\) from the low-dimensional feature space to a high-dimensional feature space is defined, and all features are mapped into the higher dimension so that the data becomes linearly separable, then this objective function can be handled exactly as in the linear support vector machine, yielding the separating hyperplane and the classification decision function. That is, the optimization objective of the linear SVM becomes
\[\begin{align} & \underbrace{\min}_{\alpha} {{\frac{1}{2}}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_j(\phi(x_i)\cdot\phi(x_j)) - \sum_{i=1}^m\alpha_i} \\ & s.t. \quad \sum_{i=1}^m\alpha_iy_i = 0 \\ & \quad\quad 0\leq\alpha_i\leq{C} \end{align}\]
This method looks like a perfect solution to the problem, since it only changes the feature dimension of the data. However, all of the data used so far has been test data; in production a sample may have not one or two features but thousands, and after the mapping the feature dimension grows rapidly, so the computational cost grows with it. If the mapped features are infinite-dimensional, the computation becomes impossible altogether, so this approach by itself is not really workable. The kernel function is precisely what solves this computational problem.

2.2.2 The kernel function

Let \(X\) be the input space (a subset of the low-dimensional Euclidean space \(R^n\), or a discrete set) and let \(H\) be the high-dimensional feature space (a Hilbert space). If there exists a mapping \(\phi(x): X\rightarrow{H}\) from \(X\) to \(H\) such that for all \(x, z\in{X}\) the function \(K(x,z)\) satisfies
\[K(x,z) = \phi(x)\cdot\phi(z)\]
then \(K(x,z)\) is called a kernel function and \(\phi(x)\) the mapping function, where \(\phi(x)\cdot\phi(z)\) is the inner product of \(\phi(x)\) and \(\phi(z)\).

Because \(x, z\in{X}\), \(K(x,z)\) is computed directly in the low-dimensional input space \(R^n\) rather than through \(\phi(x)\cdot\phi(z)\), since \(\phi\) maps the input space \(R^n\) into the feature space \(H\), which is generally high-dimensional or even infinite-dimensional. Moreover, for a given kernel \(K(x,z)\) the feature space and the mapping function are not unique: there can be several different feature spaces, and even within the same feature space different mappings can be chosen.

In short, the benefit of the kernel function is that the computation is carried out in the low-dimensional space, while the classification effect is essentially the same as computing in the high-dimensional space, thus avoiding complex computation directly in high dimensions.

Using a kernel function \(K(x,z)\), the separating hyperplane of the non-linear support vector machine is
\[\sum_{i=1}^m{\alpha_i}^*y_iK(x,x_i) + b^* = 0\]
and the classification decision function is
\[f(x) = sign(\sum_{i=1}^m{\alpha_i}^*y_iK(x,x_i) + b^*)\]
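To make the formula concrete, here is a toy sketch of the kernelized decision function with hypothetical support vectors, labels, multipliers \({\alpha_i}^*\) and bias \(b^*\) (the numbers are made up, and a Gaussian kernel is chosen only as an example):

```python
import numpy as np

support_vectors = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])  # hypothetical x_i
y = np.array([1, -1, 1])                                          # hypothetical y_i
alpha_star = np.array([0.5, 0.7, 0.2])                            # hypothetical alpha_i^*
b_star = 0.1                                                      # hypothetical b^*
gamma = 0.5

def K(x, z):
    # Gaussian kernel chosen as an illustrative K(x, z)
    return np.exp(-gamma * np.sum((x - z) ** 2))

def f(x):
    # f(x) = sign( sum_i alpha_i^* y_i K(x, x_i) + b^* )
    s = sum(a * yi * K(x, xi) for a, yi, xi in zip(alpha_star, y, support_vectors))
    return np.sign(s + b_star)

print(f(np.array([0.9, 0.9])))
```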

2.2.3 An example of a kernel function

Assume the input space is \(R^2\) and the kernel function is \(K(x,z) = (x\cdot{z})^2\). Using the definitions above, a feature space \(H\) and a mapping \(\phi(x): R^2\rightarrow{H}\) can be found.

Take the feature space \(H = R^3\). Since the input space is two-dimensional, write \(x = (x_1,x_2)^T\) and \(z = (z_1,z_2)^T\). Because
\[(x\cdot{z})^2 = (x_1z_1+x_2z_2)^2 = (x_1z_1)^2 + 2x_1z_1x_2z_2 + (x_2z_2)^2\]
one possible mapping is
\[\phi(x) = ((x_1)^2, \sqrt{2}x_1x_2, (x_2)^2)^T\]
and it is easy to verify that
\[\phi(x)\cdot\phi(z) = (x\cdot{z})^2 = K(x,z)\]
Still with the feature space \(H = R^3\), another possible mapping is
\[\phi(x) = {\frac{1}{\sqrt{2}}}((x_1)^2-(x_2)^2, 2x_1x_2, (x_1)^2+(x_2)^2)^T\]
which also easily verifies
\[\phi(x)\cdot\phi(z) = (x\cdot{z})^2 = K(x,z)\]
If the feature space is \(H = R^4\), a possible mapping is
\[\phi(x) = ((x_1)^2, x_1x_2, x_1x_2, (x_2)^2)^T\]
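The three mappings above can be checked numerically; the following sketch (with arbitrary sample points) verifies that each of them reproduces \(K(x,z) = (x\cdot{z})^2\):

```python
import numpy as np

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

K = (x @ z) ** 2   # kernel evaluated directly in the 2-D input space: 11^2 = 121

phi_a = lambda v: np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])
phi_b = lambda v: (1 / np.sqrt(2)) * np.array([v[0]**2 - v[1]**2, 2 * v[0] * v[1], v[0]**2 + v[1]**2])
phi_c = lambda v: np.array([v[0]**2, v[0] * v[1], v[0] * v[1], v[1]**2])

# all four numbers are equal (121.0)
print(K, phi_a(x) @ phi_a(z), phi_b(x) @ phi_b(z), phi_c(x) @ phi_c(z))
```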

2.3 Positive definite kernel functions

If the mapping \(\phi(x)\) is known, the kernel can be determined through the inner product of \(\phi(x)\) and \(\phi(z)\). But can we decide whether a given function is a kernel function without constructing the mapping directly? The following describes the conditions a function must satisfy in order to be a kernel function.

The kernel functions in common use are generally positive definite kernel functions (positive definite kernel function); here the condition for a positive definite kernel is stated directly. For a function to be a positive definite kernel, the Gram matrix it forms on any finite set of points must be positive semi-definite. That is, for any \(x_i\in{X}, \quad i=1,2,\ldots,m\), if the Gram matrix \(K = [K(x_i,x_j)]_{m*m}\) corresponding to \(K(x_i,x_j)\) is a positive semi-definite matrix, then \(K(x,z)\) is a positive definite kernel function.
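As a quick numerical illustration of the Gram-matrix condition (not a proof), the sketch below builds the Gram matrix of the Gaussian kernel on random points and checks that its eigenvalues are non-negative up to floating-point error; the data and the \(\gamma\) value are arbitrary:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))          # 20 arbitrary sample points

K = rbf_kernel(X, gamma=0.5)          # 20 x 20 Gram matrix [K(x_i, x_j)]
eigvals = np.linalg.eigvalsh(K)       # real eigenvalues of the symmetric Gram matrix
print(eigvals.min() >= -1e-10)        # True: the Gram matrix is positive semi-definite
```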

Since checking whether a function is a kernel is quite difficult, it will not be covered in detail here. Researchers have already found many kernel functions, but in practical problems only a few are in common use; the next sections introduce the handful of kernel functions commonly used in industry, which are also essentially the kernel functions available in the sklearn library.

2.4 Linear kernel function

The linear kernel function (linear kernel) is essentially the linear support vector machine: the linear support vector machine can be regarded as a non-linear support vector machine that uses the linear kernel. The expression of the linear kernel is
\[K(x,z) = x\cdot{z}\]
and the classification decision function is then
\[f(x) = sign(\sum_{i=1}^m{\alpha_i}^*y_i(x\cdot{x_i}) + b^*)\]
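A minimal sklearn sketch, assuming a toy roughly separable dataset from make_blobs; SVC with kernel='linear' behaves like the linear support vector machine described earlier:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=42)   # toy, roughly separable data
clf = SVC(kernel='linear', C=1.0).fit(X, y)
print(clf.score(X, y), clf.support_vectors_.shape)
```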

2.5 Polynomial kernel function

The polynomial kernel function (polynomial kernel) is one of the kernel functions commonly used with the non-linear support vector machine. Its expression is
\[K(x,z) = (\gamma{x\cdot{z}}+r)^d\]
where \(\gamma, r, d\) are hyperparameters.

The classification decision function is then
\[f(x) = sign(\sum_{i=1}^m{\alpha_i}^*y_i((\gamma{x\cdot{x_i}}+r)^d) + b^*)\]
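The sketch below checks the formula \((\gamma{x\cdot{z}}+r)^d\) against sklearn's polynomial_kernel on random data; the \(\gamma, r, d\) values are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

rng = np.random.default_rng(1)
X, Z = rng.normal(size=(5, 3)), rng.normal(size=(4, 3))
gamma, r, d = 0.5, 1.0, 3

manual = (gamma * X @ Z.T + r) ** d   # the formula above, for every pair (x, z)
print(np.allclose(manual, polynomial_kernel(X, Z, degree=d, gamma=gamma, coef0=r)))  # True
```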

2.6 Gaussian kernel function

The Gaussian kernel function (Gaussian kernel) corresponds to the radial basis function (radial basis function, RBF) classifier in support vector machines and is the mainstream kernel for the non-linear support vector machine. Its expression is
\[K(x,z) = e^{-\gamma{||x-z||}^2}, \quad \gamma>0\]
where \(\gamma\) is a hyperparameter.

The classification decision function is then
\[f(x) = sign(\sum_{i=1}^m{\alpha_i}^*y_i(e^{-\gamma{||x-x_i||}^2}) + b^*)\]
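Likewise, the Gaussian kernel formula \(e^{-\gamma{||x-z||}^2}\) can be checked against sklearn's rbf_kernel on random data; the \(\gamma\) value here is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(2)
X, Z = rng.normal(size=(5, 3)), rng.normal(size=(4, 3))
gamma = 0.8

sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)   # ||x_i - z_j||^2 for every pair
manual = np.exp(-gamma * sq_dists)
print(np.allclose(manual, rbf_kernel(X, Z, gamma=gamma)))        # True
```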

2.7 Sigmoid kernel function

The Sigmoid kernel function (sigmoid kernel) is also one of the kernel functions commonly used with the non-linear support vector machine. Its expression is
\[K(x,z) = \tanh(\gamma{x\cdot{z}}+r)\]
where \(\gamma, r\) are hyperparameters and \(\tanh()\) is the hyperbolic tangent function (note: its graph is similar to that of the Sigmoid function; as for why it is called the Sigmoid kernel, you would have to ask the sklearn authors).

The classification decision function is then
\[f(x) = sign(\sum_{i=1}^m{\alpha_i}^*y_i(\tanh(\gamma{x\cdot{x_i}}+r)) + b^*)\]
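And the sigmoid kernel \(\tanh(\gamma{x\cdot{z}}+r)\) can be checked against sklearn's sigmoid_kernel in the same way; \(\gamma\) and \(r\) are arbitrary illustrative values:

```python
import numpy as np
from sklearn.metrics.pairwise import sigmoid_kernel

rng = np.random.default_rng(3)
X, Z = rng.normal(size=(5, 3)), rng.normal(size=(4, 3))
gamma, r = 0.2, 0.5

manual = np.tanh(gamma * X @ Z.T + r)                                   # the formula above
print(np.allclose(manual, sigmoid_kernel(X, Z, gamma=gamma, coef0=r)))  # True
```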

2.7.1 The tanh() function

# Illustration of the tanh() function
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

x = np.linspace(-5, 5, 666)  # 666 evenly spaced points in [-5, 5]
y = np.tanh(x)

plt.plot(x, y, label='tanh()')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()

(Figure: plot of the tanh() curve on \([-5, 5]\))

2.8 Choosing a kernel function

  1. If the number of features is large, on the same order as the number of samples, choose LR or an SVM with the linear kernel.
  2. If the number of features is fairly small and the number of samples is moderate (neither too large nor too small), choose an SVM with the Gaussian kernel.
  3. If the number of features is fairly small and the number of samples is very large, manually add some features to turn it into the first case.
  4. If you cannot decide which kernel to use, try all of the kernel functions a few times and then pick whichever fits best (see the sketch after this list).
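Here is a small sketch of rule 4, trying several kernels with cross-validation and keeping the best one; the dataset (make_moons) and the parameter grid are arbitrary illustrative choices:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)    # toy non-linear dataset
param_grid = {'kernel': ['linear', 'poly', 'rbf', 'sigmoid'], 'C': [0.1, 1, 10]}

search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)        # 5-fold cross-validation
print(search.best_params_, search.best_score_)
```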

III. The non-linear support vector machine procedure

3.1 Input

A training set of \(m\) samples \(T=\{(x_1,y_1),(x_2,y_2),\cdots,(x_m,y_m)\}\), where \(x_i\) is an \(n\)-dimensional feature vector and \(y_i\) is a binary output taking the value \(1\) or \(-1\).

3.2 Output

The separating hyperplane parameters \(w^*\) and \(b^*\) and the classification decision function.

3.3 Process

  1. Choose an appropriate kernel function \(K(x,z)\) and a penalty coefficient \(C>0\), and construct the constrained optimization problem
    \[\begin{align} & \underbrace{\min}_{\alpha}{\frac{1}{2}}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_jK(x_i,x_j) - \sum_{i=1}^m\alpha_i \\ & s.t. \quad \sum_{i=1}^m\alpha_iy_i=0 \\ & \quad\quad 0\leq\alpha_i\leq{C}, \quad i=1,2,\ldots,m \end{align}\]
  2. Use the SMO algorithm to solve this problem for the minimizing \(\alpha^*\).
  3. Compute \(w^*\) as
    \[w^* = \sum_{i=1}^m\alpha_i^*y_i\phi(x_i)\]
  4. Find all \(S\) support vectors, i.e. the samples \((x_s,y_s)\) satisfying \(0<{\alpha_s}^*<C\). For each support vector, compute the corresponding \(b_s^*\) from \(y_s(\sum_{i=1}^S{\alpha_i}^*y_iK(x_i,x_s)+b_s^*)-1=0\), and take the average of all these values as the final \(b^* = {\frac{1}{S}}\sum_{s=1}^Sb_s^*\).
  5. Determine the separating hyperplane
    \[\sum_{i=1}^m{\alpha_i}^*y_iK(x,x_i) + b^* = 0\]
  6. Determine the classification decision function
    \[f(x) = sign(\sum_{i=1}^m{\alpha_i}^*y_iK(x,x_i) + b^*)\]
The steps of the linear support vector machine are basically the same as those of the linearly separable support vector machine; since the linear support vector machine uses a slack factor, the biggest difference between the two lies in how the value of \(b^*\) is determined. A small sketch of this procedure using sklearn is given below.
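This sketch uses sklearn's SVC, which solves the dual problem with an SMO-type algorithm internally; the dataset (make_circles) and the hyper-parameters are arbitrary illustrative choices. The attribute dual_coef_ stores \({\alpha_i}^*y_i\) for the support vectors and intercept_ stores \(b^*\), so the decision function can be reconstructed from the formula above:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)  # not linearly separable
gamma = 0.5
clf = SVC(kernel='rbf', C=1.0, gamma=gamma).fit(X, y)

# Reconstruct sum_i alpha_i^* y_i K(x, x_i) + b^* from the fitted attributes
K = rbf_kernel(X[:5], clf.support_vectors_, gamma=gamma)
manual = K @ clf.dual_coef_[0] + clf.intercept_[0]

print(np.allclose(manual, clf.decision_function(X[:5])))   # True
print(np.sign(manual))                                     # the classification decisions
```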

IV. Advantages and disadvantages of the non-linear support vector machine

4.1 Advantages

  1. It can classify data that is not linearly separable.

4.2 Disadvantages

  1. It only supports binary classification; multi-class problems require auxiliary methods such as OvR.
  2. It supports classification but does not support regression.

V. Summary

The non-linear support vector machine borrows the idea of polynomial regression: the data is mapped into a high-dimensional space so that data which is not linearly separable becomes linearly separable. Moreover, because a kernel function is used, there is no need to map the dataset into the high-dimensional space before computing the relationships between features; these relationships can be computed before the mapping, which is exactly where the kernel function is clever.

But the support vector machine still has one notable limitation, and people always strive for perfection: wouldn't it be wonderful if the algorithm could also support regression? Such an algorithm does exist, namely SVR (support vector regression).
