Machine Learning Practical Tutorial (11): Support Vector Machine SVM

What is SVM?

The full English name of SVM is Support Vector Machines, which we call support vector machine. A support vector machine is an algorithm used for classification. Let us start our SVM journey with a short story.

On Valentine's Day a long time ago, a hero went to save his lover, but the devil in the sky played a game with him.

The devil placed balls of two colors on the table in a regular pattern and said: "Use a stick to separate them, with one requirement: the stick should still work after I put down more balls." So the hero placed the stick like this. Did he do a good job?
Then the devil placed more balls on the table, and one ball seemed to be on the wrong side. Obviously, the hero needs to make adjustments to the stick.
SVM tries to place the stick in the best position, so that there is as much space as possible on both sides of it. This gap is the distance from the nearest balls to the stick.
Now, even if the devil puts down more balls, the stick is still a good dividing line.
Now there is no stick that can help the hero separate the two kinds of balls. What should he do? Of course, as in all martial arts movies, he slaps the table and the balls fly into the air. Then, relying on his lightness skill, he grabs a piece of paper and inserts it between the two kinds of balls.
Now, looking at these balls from the devil's perspective up in the air, they appear to be separated by a curve.
Later, boring adults called these balls "data", the stick a "classifier", the trick of finding the maximum gap "optimization", slapping the table "kernelling", and the piece of paper a "hyperplane".

To give an overview:

When the problem is a classification problem and the data is linearly separable, that is, when the two kinds of balls can be separated with a stick, we only need to place the stick in the position that maximizes the distance between the balls and the stick; the process of finding this maximum margin is called optimization. However, reality is often cruel: data is generally linearly inseparable, that is, no stick can classify the two kinds of balls well. In that case we need to act like the hero, send the balls into the air, and use a piece of paper instead of a stick to separate them. What makes the data "fly" is the kernel function (kernel), and the paper used to separate the balls is a hyperplane.

Mathematical modeling

Support Vector Machine (SVM) is a kind of generalized linear classifier that classifies data by supervised learning. Its decision boundary is the maximum-margin hyperplane solved from the training samples. SVM can also perform nonlinear classification through kernel functions and is one of the common kernel learning methods.

A support vector machine finds a hyperplane to split the samples. The splitting principle is to maximize the margin, and the problem is finally transformed into a convex quadratic programming problem to solve. SVM can handle both linearly separable and linearly inseparable data, and can be used for both classification and regression.

Maximizing the margin

When the training samples are linearly separable, hard margin maximization is used (Hard Margin SVM); when they are approximately linearly separable, soft margin maximization is used (Soft Margin SVM); when they are linearly inseparable, a kernel function together with soft margin maximization is used.
In practical problems, the decision boundary is often not unique; this is an ill-posed problem. Given the training sample set

$$D = \{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\}, \quad y_i \in \{-1, 1\}$$

the basic idea of the classification algorithm is to find a separating hyperplane in the sample space based on the training set. However, there may be many hyperplanes that separate the training samples, so which one should we look for?

The line found by SVM should be as far as possible from the nearest red point and the nearest blue point, which guarantees the generalization ability of the model. SVM tries to find the optimal decision boundary, the one farthest from the nearest samples of the two categories. The points that lie exactly at this closest distance from the decision boundary are called support vectors. The distance between the two straight lines parallel to the decision boundary and passing through the support vectors is the margin. SVM maximizes the margin, so the problem is transformed into an optimization problem.

Optimization problem

Classification margin equation

SVM maximizes the margin, where margin = 2d, so the goal is to maximize d, the distance from the nearest points to the line. Recall from analytic geometry that the distance from a point $(x, y)$ in two-dimensional space to the line $Ax + By + C = 0$ is:

$$\frac{|Ax + By + C|}{\sqrt{A^2 + B^2}}$$

Extending this to n dimensions, the hyperplane can be written as $\theta^T \cdot x_b = 0$, where $\theta$ contains both the intercept and the coefficients, and $x_b$ is the sample $x$ with an extra constant feature 1 appended, the same trick used earlier in linear regression.

If the intercept is pulled out, the hyperplane becomes $w^T x + b = 0$, where $w$ assigns a weight to each feature of the sample. These are two different ways of writing the equation of the same hyperplane. From this we obtain the new point-to-hyperplane distance formula:

$$\frac{|w^T x + b|}{\|w\|}, \quad \text{where} \quad \|w\| = \sqrt{w_1^2 + w_2^2 + \dots + w_n^2}$$
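As a quick sanity check of this formula, here is a minimal NumPy sketch (the hyperplane coefficients and the sample point are made-up values, used only for illustration):

import numpy as np

# hypothetical hyperplane w^T x + b = 0 and a sample point, chosen only for illustration
w = np.array([3.0, 4.0])
b = -2.0
x = np.array([1.0, 2.0])

# distance = |w^T x + b| / ||w||
distance = np.abs(w @ x + b) / np.linalg.norm(w)
print(distance)  # |3*1 + 4*2 - 2| / 5 = 9 / 5 = 1.8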

Constraints

It seems we have successfully obtained the mathematical form of the objective function. But in order to maximize d, we have to face the following problems:

  1. How do we tell whether the hyperplane correctly classifies the sample points?
  2. We know we need to maximize the distance d, and d is measured to the points on the support vectors. How do we pick out the support vectors among the many sample points?
    The problems above are the constraint conditions, which means the range of the variable d we are optimizing is limited and constrained. In fact, constraints have always been the most troublesome part of optimization problems.
    Now that we know these constraints exist, we have to describe them in mathematical language. However, the SVM algorithm uses a clever trick to merge these constraints into a single inequality.

There are two kinds of points on this two-dimensional plane, and we mark them respectively:

The red dots are labeled 1, which we artificially designate as positive samples;
the blue five-pointed stars are labeled -1, which we artificially designate as negative samples.
We add a category label $y_i$ to each sample point $x_i$.

Trick

The line in the middle is the decision boundary $w^T x + b = 0$, and the upper and lower lines mean that the distance must be at least d, so:

$$\begin{cases} \dfrac{w^T x^{(i)} + b}{\|w\|} \geq d & \forall\, y^{(i)} = 1 \\ \dfrac{w^T x^{(i)} + b}{\|w\|} \leq -d & \forall\, y^{(i)} = -1 \end{cases}$$

Here the two classes of samples are labeled 1 and -1 respectively. Dividing both sides of the formulas above by d gives:

$$\begin{cases} \dfrac{w^T x^{(i)} + b}{\|w\|\, d} \geq 1 & \forall\, y^{(i)} = 1 \\ \dfrac{w^T x^{(i)} + b}{\|w\|\, d} \leq -1 & \forall\, y^{(i)} = -1 \end{cases}$$

Since $\|w\|$ and d are both positive scalars, the denominator can be absorbed into w and b by rescaling them (and keeping the same names), which eliminates the denominator:

$$\begin{cases} w^T x^{(i)} + b \geq 1 & \forall\, y^{(i)} = 1 \\ w^T x^{(i)} + b \leq -1 & \forall\, y^{(i)} = -1 \end{cases}$$

At this point we have two inequalities, and a small trick merges them into one:

$$y_i \left( w^T x^{(i)} + b \right) \geq 1$$

Maximizing the objective function

Now let's put our thoughts together. We have the objective function:

$$\max \frac{|w^T x + b|}{\|w\|}$$

Our optimization goal is to maximize d, and we said that d is determined by the sample points on the support vectors. What characterizes the sample points on the support vectors? They satisfy

$$|w^T x + b| = 1$$

So the problem becomes maximizing $\dfrac{1}{\|w\|}$, that is, minimizing $\|w\|$. For convenience of differentiation this is usually written as

$$\min \frac{1}{2} \|w\|^2$$

This is an optimization problem with constraints: it must satisfy

$$s.t. \quad y_i \left( w^T x^{(i)} + b \right) \geq 1, \quad i = 1, 2, \dots, m$$

where m is the total number of sample points and the abbreviation "s.t." stands for "subject to", meaning "under the condition that". The formula above describes a typical quadratic optimization problem under inequality constraints, and it is also the basic mathematical model of the support vector machine.

Solving a constrained optimization problem like this requires the Lagrange multiplier method to obtain the dual problem. The Lagrangian is:

$$L(w, b, \alpha) = \frac{1}{2} \|w\|^2 + \sum_{i=1}^{m} \alpha_i \left( 1 - y_i (w^T x_i + b) \right)$$

where the $\alpha_i$ are the Lagrange multipliers. This is the Hard Margin SVM.
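To make the optimization problem concrete, here is a small sketch that solves the hard-margin primal $\min \frac{1}{2}\|w\|^2$ subject to $y_i(w^T x^{(i)} + b) \geq 1$ directly with SciPy's SLSQP solver on a tiny hand-made, linearly separable data set (the data and starting point are arbitrary choices for illustration; real SVM libraries use specialized solvers, typically on the dual problem):

import numpy as np
from scipy.optimize import minimize

# tiny hand-made linearly separable data set (two features, labels in {-1, 1})
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
              [4.0, 1.0], [5.0, 2.0], [6.0, 1.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# optimization variables packed as v = [w1, w2, b]
def objective(v):
    w = v[:2]
    return 0.5 * np.dot(w, w)  # (1/2) * ||w||^2

# one inequality constraint per sample: y_i * (w^T x_i + b) - 1 >= 0
constraints = [
    {'type': 'ineq', 'fun': lambda v, xi=xi, yi=yi: yi * (np.dot(v[:2], xi) + v[2]) - 1.0}
    for xi, yi in zip(X, y)
]

res = minimize(objective, x0=np.zeros(3), method='SLSQP', constraints=constraints)
w, b = res.x[:2], res.x[2]
print(w, b)

# the samples whose margin y_i * (w^T x_i + b) is numerically closest to 1 are the support vectors
print(np.round(y * (X @ w + b), 3))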

Soft Margin and SVM regularization

In practice, a blue point may appear near the red points: the two classes are otherwise clearly separated, but this single point is very different from the rest of the blue points. It can be regarded as a special point or a mislabeled outlier. Such a point is misleading and causes the final hard-margin decision boundary to be line 1, whose generalization ability is questionable. Normally the extremely special blue point should be ignored, as with line 2, so that most of the data is as far from the line as possible; that is probably the better decision boundary, meaning higher generalization ability. This also indirectly shows that chasing perfect training accuracy may lead to overfitting.

A more extreme example is when a blue point is mixed in among the red points, which makes the data set linearly inseparable: no straight line can separate the two classes at all. In this case the issue with Hard Margin is no longer weak generalization ability; there simply is no separating line.

Therefore, in either of the two situations above, the SVM model should be given some fault tolerance. This leads to the Soft Margin SVM.
$$\min \frac{1}{2} \|w\|^2, \quad s.t. \quad y_i \left( w^T x^{(i)} + b \right) \geq 1 - \zeta_i, \quad \zeta_i \geq 0, \quad i = 1, 2, \dots, m$$

Compared with the Hard Margin SVM, this relaxes the condition. Intuitively, in the figure it is equivalent to relaxing the upper line to a dashed line. More importantly, $\zeta_i$ is not a fixed value: each sample $x_i$ has its own $\zeta_i$. But $\zeta_i$ must also be restricted, which means the condition cannot be relaxed without limit; this fault-tolerance space cannot be too large, so it needs to be constrained.
L1 regularization:

$$\min \left( \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \zeta_i \right), \quad s.t. \quad y_i \left( w^T x^{(i)} + b \right) \geq 1 - \zeta_i, \quad \zeta_i \geq 0, \quad i = 1, 2, \dots, m$$

A new hyperparameter C is introduced to balance the fault tolerance against the original objective. This can also be understood as adding an L1 regularization term, which prevents the model from developing in an extreme direction, makes it less sensitive to extreme data points, and gives it better generalization ability on unknown data. This so-called L1 regularization differs from the L1 regularization seen earlier in that it has no absolute value, which is reasonable because $\zeta_i \geq 0$ is already required. The larger C is, the closer the model is to a Hard Margin SVM; the smaller C is, the more room there is for fault tolerance. The regularization in SVM also differs from the regularization in linear regression in that the position of C is different (here C multiplies the error term rather than the regularization term); I will update this part after I understand the specific reasons more deeply.
Corresponding to L1 regularization there is also L2 regularization, as follows:

$$\min \left( \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \zeta_i^2 \right), \quad s.t. \quad y_i \left( w^T x^{(i)} + b \right) \geq 1 - \zeta_i, \quad \zeta_i \geq 0, \quad i = 1, 2, \dots, m$$
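For reference, in scikit-learn's LinearSVC these two slack penalties roughly correspond to the loss parameter: loss='hinge' penalizes $\sum \zeta_i$ (the L1-style term above) and loss='squared_hinge' (the default) penalizes $\sum \zeta_i^2$ (the L2-style term). A brief sketch:

from sklearn.svm import LinearSVC

# soft-margin SVM with the L1-style slack penalty C * sum(zeta_i)
svc_l1_slack = LinearSVC(C=1.0, loss='hinge')

# soft-margin SVM with the L2-style slack penalty C * sum(zeta_i ** 2); this loss is the default
svc_l2_slack = LinearSVC(C=1.0, loss='squared_hinge')

# both are then fitted as usual with .fit(X, y)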

SVM in sklearn

When actually using SVM, you need to standardize the data, just like with KNN, because both involve distances. When the scales of the features differ greatly, for example one feature ranging over 0-1 and another over 0-10000, the distance is dominated by the large-scale feature, so standardizing first is necessary.

The first step is to prepare a simple binary classification data set:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
"""
load_iris是一个经典的机器学习数据集,它包含了150个样本
这个数据集中的四个特征分别是花萼长度(sepal length)、花萼宽度(sepal width)、花瓣长度(petal length)和花瓣宽度(petal width),
它们都是以厘米(cm)为单位测量的。目标变量是鸢尾花的种类,
有三种不同的种类:Setosa、Versicolour和Virginica。
它们的中文名分别是山鸢尾、杂色鸢尾和维吉尼亚鸢尾。
"""
iris = datasets.load_iris()

x = iris.data
y = iris.target
# We only do a simple binary classification: keep the Setosa and Versicolour samples and just the first 2 features
x = x[y<2, :2]
y = y[y<2]
# Plot the points of class 0 and class 1 separately; the two scatters get different colors
plt.scatter(x[y==0, 0], x[y==0, 1])
plt.scatter(x[y==1, 0], x[y==1, 1])
plt.show()

To train the SVM, first use a relatively large C.


# Standardize the data
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
# Data standardization: fit the scaler, then transform
standardscaler = StandardScaler()
standardscaler.fit(x)
x_standard = standardscaler.transform(x)
svc = LinearSVC(C=1e9)
svc.fit(x_standard, y)

def plot_decision_boundary(model, axis):
    x0, x1 = np.meshgrid(np.linspace(axis[0], axis[1], int((axis[1] - axis[0])*100)).reshape(1, -1),
                         np.linspace(axis[2], axis[3], int((axis[3] - axis[2])*100)).reshape(1, -1),)
    x_new = np.c_[x0.ravel(), x1.ravel()]
    y_predict = model.predict(x_new)
    zz = y_predict.reshape(x0.shape)

    from matplotlib.colors import ListedColormap
    custom_cmap = ListedColormap(['#EF9A9A', '#FFF59D', '#90CAF9'])

    plt.contourf(x0, x1, zz, cmap=custom_cmap)
    w = model.coef_[0]
    b = model.intercept_[0]
    # w0*x0 + w1*x1 + b = 0
    # x1 = -w0/w1 * x0 - b/w1
    plot_x = np.linspace(axis[0], axis[1], 200)
    up_y = -w[0]/w[1] * plot_x - b/w[1] + 1/w[1]
    down_y = -w[0]/w[1] * plot_x - b/w[1] - 1/w[1]
    
    up_index = (up_y >= axis[2]) & (up_y <= axis[3])
    down_index = (down_y >= axis[2]) & (down_y <= axis[3])
    
    plt.plot(plot_x[up_index], up_y[up_index], color='black')
    plt.plot(plot_x[down_index], down_y[down_index], color='black')
plot_decision_boundary(svc, axis=[-3, 3, -3, 3])
plt.scatter(x_standard[y==0, 0], x_standard[y==0, 1], color='red')
plt.scatter(x_standard[y==1, 0], x_standard[y==1, 1], color='blue')
plt.show()

Now use a smaller C and compare the effect of different values of C.

svc2 = LinearSVC(C=0.01)
svc2.fit(x_standard, y)

plot_decision_boundary(svc2, axis=[-3, 3, -3, 3])
plt.scatter(x_standard[y==0, 0], x_standard[y==0, 1], color='red')
plt.scatter(x_standard[y==1, 0], x_standard[y==1, 1], color='blue')
plt.show()

Comparing the two plots, we can see that when C is smaller, one red point is misclassified as blue. This again verifies that a smaller C means more room for error.
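This can also be checked numerically on the training set itself, reusing the svc and svc2 models fitted above (a small sketch):

# count training points that fall on the wrong side of each decision boundary
print((svc.predict(x_standard) != y).sum())   # with the very large C, this should be 0 here
print((svc2.predict(x_standard) != y).sum())  # with C=0.01, at least one point should be misclassified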

Using polynomial features in SVM

What we have discussed so far is linear SVM, but SVM can also solve nonlinear problems. By analogy with extending linear regression to polynomial regression, we first try adding polynomial features.

First generate the dataset:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

x, y = datasets.make_moons()
x.shape
# (100, 2)
y.shape
# (100,)
plt.scatter(x[y==0, 0], x[y==0, 1])
plt.scatter(x[y==1, 0], x[y==1, 1])
plt.show()

Next add some random noise to the data:

x, y = datasets.make_moons(noise=0.15, random_state=666)
plt.scatter(x[y==0, 0], x[y==0, 1])
plt.scatter(x[y==1, 0], x[y==1, 1])
plt.show()

Use polynomial features, standardization, and a linear SVM, combined in a pipeline:

from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

def PolynomiaSVC(degree, C=1.0):
    return Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('std_scale', StandardScaler()),
        ('linear_svc', LinearSVC(C=C))
    ])

poly_svc = PolynomiaSVC(degree=3)
poly_svc.fit(x, y)

def plot_decision_boundary(model, axis):
    x0, x1 = np.meshgrid(np.linspace(axis[0], axis[1], int((axis[1] - axis[0])*100)).reshape(1, -1),
                         np.linspace(axis[2], axis[3], int((axis[3] - axis[2])*100)).reshape(1, -1),)
    x_new = np.c_[x0.ravel(), x1.ravel()]
    y_predict = model.predict(x_new)
    zz = y_predict.reshape(x0.shape)

    from matplotlib.colors import ListedColormap
    custom_cmap = ListedColormap(['#EF9A9A', '#FFF59D', '#90CAF9'])

    plt.contourf(x0, x1, zz, cmap=custom_cmap)

plot_decision_boundary(poly_svc, axis=[-1.5, 2.5, -1.0, 1.5])
plt.scatter(x[y==0, 0], x[y==0, 1])
plt.scatter(x[y==1, 0], x[y==1, 1])
plt.show()

Besides adding polynomial features explicitly and then feeding them into a linear SVC, there is another way to achieve a similar effect.


from sklearn.svm import SVC

# The training process here is not simply standardizing the data first and then applying LinearSVC;
# the polynomial kernel is applied inside SVC. The default C in SVC is 1.0.
def PolynomiaKernelSVC(degree, C=1.0):
    return Pipeline([
        ('std_scale', StandardScaler()),
        ('kernel_svc', SVC(kernel='poly', degree=degree, C=C))
    ])

poly_kernel_svc = PolynomiaKernelSVC(degree=3)
poly_kernel_svc.fit(x, y)
# Pipeline(memory=None,
#     steps=[('std_scale', StandardScaler(copy=True, with_mean=True, with_std=True)),
#  ('kernel_svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
#   decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
#   kernel='poly', max_iter=-1, probability=False, random_state=None,
#   shrinking=True, tol=0.001, verbose=False))])

plot_decision_boundary(poly_kernel_svc, axis=[-1.5, 2.5, -1.0, 1.5])
plt.scatter(x[y==0, 0], x[y==0, 1])
plt.scatter(x[y==1, 0], x[y==1, 1])
plt.show()

This approach uses the kernel function in SVM. Next, the kernel function is explained in detail.

What is a kernel function?

In real tasks, the original sample space may not contain a hyperplane that correctly separates the two classes. For such a problem, the samples can be mapped from the original space to a higher-dimensional feature space so that they become linearly separable in that feature space. The role of the kernel function, therefore, is to make originally linearly inseparable data linearly separable. A typical example is the polynomial kernel mapping.
After such a mapping, the data becomes linearly separable.
Commonly used kernel functions:

1. Linear kernel: $k(x_i, x_j) = x_i^T x_j + c$
2. Polynomial kernel: $k(x_i, x_j) = (x_i^T x_j + c)^d$; when d = 1 it degenerates into the linear kernel.
3. Gaussian (RBF) kernel: $k(x_i, x_j) = \exp\left(-\dfrac{\|x_i - x_j\|^2}{2\sigma^2}\right), \; \sigma > 0$, where $\sigma$ is the bandwidth of the Gaussian kernel (RBF stands for Radial Basis Function).
4. Laplacian kernel: $k(x_i, x_j) = \exp\left(-\dfrac{\|x_i - x_j\|}{\sigma}\right), \; \sigma > 0$
5. Sigmoid kernel: $k(x_i, x_j) = \tanh(\beta x_i^T x_j + \theta)$
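A minimal NumPy sketch of these kernel formulas for a single pair of sample vectors (the parameter values c, d, sigma, beta and theta below are arbitrary illustrations, not recommendations):

import numpy as np

def linear_kernel(xi, xj, c=0.0):
    return xi @ xj + c

def polynomial_kernel(xi, xj, c=1.0, d=3):
    return (xi @ xj + c) ** d

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def laplacian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.linalg.norm(xi - xj) / sigma)

def sigmoid_kernel(xi, xj, beta=1.0, theta=-1.0):
    return np.tanh(beta * (xi @ xj) + theta)

xi = np.array([1.0, 2.0])
xj = np.array([2.0, 0.5])
print(gaussian_kernel(xi, xj))  # a similarity score in (0, 1]: near 1 for close points, near 0 for distant ones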

Gaussian kernel function

The Gaussian kernel function is a commonly used kernel, typically applied in machine learning algorithms such as the support vector machine (SVM). It can map data from the original space to a higher-dimensional space, so that originally inseparable samples become separable in the new space.

In layman's terms, the Gaussian kernel function is like a "similarity measure" that can calculate the similarity between two samples. When using the Gaussian kernel function, we first select a center point, then calculate the distance between each sample point and the center point, and use the distance as a measure of similarity. This distance is usually weighted by a Gaussian distribution function, which is where the name Gaussian kernel function comes from.
More specifically, the Gaussian kernel maps two sample points $x_i$ and $x_j$ to a higher-dimensional space and computes their inner product in the new space, which gives the following formula:

$$K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$$

Here $\gamma$ is a parameter that controls the width of the Gaussian, and $\|x_i - x_j\|^2$ is the squared Euclidean distance between $x_i$ and $x_j$. When $\gamma$ is large, the Gaussian peak becomes narrower and the similarity measure only counts points that are very close to each other as similar; when $\gamma$ is small, the Gaussian peak becomes wider and the similarity measure is smoother, reflecting the overall similarity between the two points.

Overall, the Gaussian kernel function is a very flexible and powerful similarity measure that can be used in many machine learning algorithms, especially when it comes to nonlinear classification problems.
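To see this effect of $\gamma$ numerically, here is a short sketch using sklearn.metrics.pairwise.rbf_kernel on one made-up pair of points:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

a = np.array([[0.0, 0.0]])
b = np.array([[1.0, 1.0]])  # squared distance between a and b is 2

# the same pair of points looks similar or dissimilar depending on gamma
print(rbf_kernel(a, b, gamma=0.1))   # exp(-0.2) ≈ 0.82: wide kernel, high similarity
print(rbf_kernel(a, b, gamma=10.0))  # exp(-20) ≈ 2e-9: narrow kernel, near-zero similarity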

Next, we can understand the entire mapping process more intuitively through Gaussian kernel function mapping.

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(-4, 5, 1)
# array([-4, -3, -2, -1,  0,  1,  2,  3,  4])
y = np.array((x >= -2) & (x <= 2), dtype='int')
# array([0, 0, 1, 1, 1, 1, 1, 0, 0])
plt.scatter(x[y==0], [0] * len(x[y==0]))
plt.scatter(x[y==1], [0] * len(x[y==1]))
plt.show()

After mapping with the Gaussian kernel:

def gaussian(x, l):
    gamma = 1.0
    return np.exp(-gamma *(x-l)**2)

l1, l2 = -1, 1

x_new = np.empty((len(x), 2))
for i,data in enumerate(x):
    x_new[i, 0] = gaussian(data, l1)
    x_new[i, 1] = gaussian(data, l2)

plt.scatter(x_new[y==0, 0], x_new[y==0, 1])
plt.scatter(x_new[y==1, 0], x_new[y==1, 1])
plt.show()

In fact, the implementation of the real Gaussian kernel does not use fixed landmarks like l1 and l2 above; every data point acts as a landmark (here gamma = 1.0 was simply fixed to set the width). The Gaussian kernel gets its name from the Gaussian (normal) distribution:

$$g(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$$

where $\mu$ determines the center of the function and $\sigma$ determines how concentrated the curve is: the smaller $\sigma$, the taller and more concentrated the curve; the larger $\sigma$, the more spread out it is. The Gaussian kernel itself is

$$K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}$$

The larger $\gamma$ is, the narrower the Gaussian;

the smaller $\gamma$ is, the wider the Gaussian.
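The relationship between $\gamma$ and the width of the curve can be visualized with a quick sketch (the gamma values are arbitrary examples):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 200)
for gamma in [0.1, 1.0, 10.0]:
    # e^{-gamma * x^2}: a larger gamma gives a narrower bump
    plt.plot(x, np.exp(-gamma * x ** 2), label='gamma={}'.format(gamma))
plt.legend()
plt.show()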

Next, use the Gaussian kernel function encapsulated in sklearn:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

x, y = datasets.make_moons(noise=0.15, random_state=666)

from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

def RBFKernelSVC(gamma=1.0):
    return Pipeline([
        ('std_scale', StandardScaler()),
        ('svc', SVC(kernel='rbf', gamma=gamma))
    ])

svc = RBFKernelSVC(gamma=1.0)
svc.fit(x, y)

def plot_decision_boundary(model, axis):
    x0, x1 = np.meshgrid(np.linspace(axis[0], axis[1], int((axis[1] - axis[0])*100)).reshape(1, -1),
                         np.linspace(axis[2], axis[3], int((axis[3] - axis[2])*100)).reshape(1, -1),)
    x_new = np.c_[x0.ravel(), x1.ravel()]
    y_predict = model.predict(x_new)
    zz = y_predict.reshape(x0.shape)

    from matplotlib.colors import ListedColormap
    custom_cmap = ListedColormap(['#EF9A9A', '#FFF59D', '#90CAF9'])

    plt.contourf(x0, x1, zz, cmap=custom_cmap)

plot_decision_boundary(svc, axis=[-1.5, 2.5, -1.0, 1.5])
plt.scatter(x[y==0, 0], x[y==0, 1])
plt.scatter(x[y==1, 0], x[y==1, 1])
plt.show()


svc_gamma100 = RBFKernelSVC(gamma=100)
svc_gamma100.fit(x, y)

plot_decision_boundary(svc_gamma100, axis=[-1.5, 2.5, -1.0, 1.5])
plt.scatter(x[y==0, 0], x[y==0, 1])
plt.scatter(x[y==1, 0], x[y==1, 1])
plt.show()


svc_gamma10 = RBFKernelSVC(gamma=10)
svc_gamma10.fit(x, y)

plot_decision_boundary(svc_gamma10, axis=[-1.5, 2.5, -1.0, 1.5])
plt.scatter(x[y==0, 0], x[y==0, 1])
plt.scatter(x[y==1, 0], x[y==1, 1])
plt.show()


svc_gamma01 = RBFKernelSVC(gamma=0.1)
svc_gamma01.fit(x, y)

plot_decision_boundary(svc_gamma01, axis=[-1.5, 2.5, -1.0, 1.5])
plt.scatter(x[y==0, 0], x[y==0, 1])
plt.scatter(x[y==1, 0], x[y==1, 1])
plt.show()

Gamma effectively adjusts the complexity of the model: the smaller gamma is, the lower the model complexity; the larger gamma is, the higher the model complexity. Therefore the hyperparameter gamma needs to be tuned to balance overfitting and underfitting.
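In practice this tuning is usually done with a cross-validated grid search; a brief sketch reusing the moons data x, y from above (the parameter grid is an arbitrary example):

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([
    ('std_scale', StandardScaler()),
    ('svc', SVC(kernel='rbf'))
])
param_grid = {
    'svc__gamma': [0.01, 0.1, 1.0, 10.0, 100.0],
    'svc__C': [0.01, 0.1, 1.0, 10.0]
}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(x, y)
print(grid.best_params_, grid.best_score_)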

Source of text and examples for this article:

  1. https://cuijiahua.com/blog/2017/11/ml_8_svm_1.html
  2. https://zhuanlan.zhihu.com/p/79679104
