Support vector machine (SVM) algorithm classification practice


Introduction to the Algorithm

SVM

Previously we built predictive models with a number of linear algorithms, such as Logistic Regression, Lasso, and Ridge Regression. In real life, however, many problems are not linearly separable (that is, the classes cannot be separated by drawing a straight line). SVM handles such cases by turning classification into the problem of finding a separating hyperplane. Each data item is treated as a point in n-dimensional space (where n is the number of features), with each feature value being the value of a particular coordinate. We then classify by finding the hyperplane that best distinguishes the two classes.

We use a graph to illustrate this point:
[Figure: two classes of points with several candidate hyperplanes, including the blue hyperplane "A"]
In practice we do not need to understand the full theory behind finding the optimal separating plane; it is enough to remember a rule of thumb for picking the right hyperplane: "choose the hyperplane that better separates the two classes." In this scene, the blue hyperplane "A" does that job well.

Data Sources

Fetal Health Classification: https://www.kaggle.com/andrewmvd/fetal-health-classification


The data contains features such as the fetal heart rate, fetal movements, and uterine contractions, and our task is to classify fetal health (the fetal_health column) from these feature values.


The data set contains 2126 records of features extracted from cardiotocogram exams, which were classified by three expert obstetricians into 3 categories, represented by numbers: 1 - normal, 2 - suspect, 3 - pathological.

Data mining

1. Import third-party libraries and read files

import pandas as pd
import numpy as np
import winreg
from sklearn.model_selection import train_test_split  # split the data into training and test sets
from sklearn import svm  # import the SVM algorithm module
from sklearn.metrics import accuracy_score  # import the scoring module
###################
real_address = winreg.OpenKey(winreg.HKEY_CURRENT_USER, r'Software\Microsoft\Windows\CurrentVersion\Explorer\Shell Folders',)
file_address = winreg.QueryValueEx(real_address, "Desktop")[0]
file_address += '\\'
file_origin = file_address + "源数据-分析\\fetal_health.csv"
health = pd.read_csv(file_origin)
# Resolve the absolute path of the desktop and read the source data file from there,
# so the data only needs to be downloaded to the desktop and we don't have to hunt for it
###################

As usual, we first import the modules needed for modeling and then read in the data file.
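Optionally, a quick sanity check (not part of the original post) confirms the number of records and the three class labels described above:

print(health.shape)                           # number of rows should match the 2126 records
print(health["fetal_health"].value_counts())  # counts for the labels 1.0, 2.0 and 3.0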

2. Clean the data

Find missing values:
[Figure: per-column missing-value counts, all zero]
From the above results, there are no missing values in the data.
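For reference, a check along these lines (assuming the health DataFrame from step 1) produces the per-column counts shown above:

print(health.isnull().sum())  # number of missing values in each column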

3. Modeling

train = health.drop(["fetal_health"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(train, health["fetal_health"], random_state=1)
### Since other operations may follow, a random seed is fixed so that train and test always contain the same rows

The columns are split into features and the target, and the data is divided into a training set and a test set.

svm_linear = svm.SVC(C=10, kernel="linear", decision_function_shape="ovr")  # the parameters are explained below
svm_linear.fit(X_train, y_train)
print("SVM training accuracy: " + str(accuracy_score(y_train, svm_linear.predict(X_train))))
print("SVM test accuracy: " + str(accuracy_score(y_test, svm_linear.predict(X_test))))

We instantiate the SVM classifier, set its parameters in turn (see the parameter section below), fit the model, and score its accuracy on the training and test sets. The results are as follows:
[Figure: training and test accuracy printout]
It can be seen that the accuracy of the model on the test set is about 89%.

4. Parameters

Here we explain only a few important parameters; readers can explore the others on their own.

sklearn.svm.SVC(C,kernel,degree,gamma,coef0,shrinking,probability,tol,cache_size,class_weight,verbose,max_iter,decision_function_shape,random_state)

1. C : Penalty parameter, which defaults to 1. The larger C is, the less classification error is tolerated, but a very large C may cause overfitting and poor generalization. The smaller C is, the stronger the regularization; the classifier then cares less about whether points are classified correctly and only tries to make the margin as large as possible, at which point the classification becomes meaningless. C therefore needs some tuning.
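As a rough illustration (a sketch that is not in the original post; it reuses the X_train/X_test split from step 3), one can loop over a few values of C and watch the training and test accuracy move:

for c in (0.1, 1, 10, 100):
    clf_c = svm.SVC(C=c, kernel="linear", decision_function_shape="ovr")
    clf_c.fit(X_train, y_train)
    print("C=%-6s train=%.3f test=%.3f"
          % (c, accuracy_score(y_train, clf_c.predict(X_train)),
             accuracy_score(y_test, clf_c.predict(X_test))))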

2. kernel (kernel function) : The kernel function is introduced to deal with linearly inseparable data: after the points are mapped into a higher-dimensional space, the problem becomes one that can be separated linearly.

We use a table to illustrate several parameters of the kernel function:

[Table: kernel options for SVC and their associated parameters, shown as an image in the original post]
The most commonly used kernel functions are Linear and RBF :

(1) Linear kernel : mainly used for linearly separable cases; it has few parameters and is fast, and for ordinary data its classification results are already satisfactory.

(2) RBF kernel : mainly used when the data is not linearly separable, which is a prominent advantage over purely linear algorithms. The RBF kernel is applicable whether the sample is small or large and whether the data is high- or low-dimensional.

Next, we use a decision-boundary plot to show the difference between the two clearly:

import numpy as np
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use("fivethirtyeight")
np.random.seed(0)
X, y = make_moons(200, noise=0.20)
plt.scatter(X[:,0], X[:,1], s=40, c=y, cmap=plt.cm.Spectral)
plt.show()  # generate a random set of 2-D points (two interleaving half-moons) and plot them
def plot_decision_boundary(pred_func):
    # plot a classifier's decision regions over a fine grid covering the data
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    h = 0.01
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = pred_func(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral)

The code above generates a random 2-D point distribution (two interleaving half-moons) and defines a helper that plots a classifier's decision boundary.

from sklearn import svm
clf = svm.SVC(C=1,kernel="linear",decision_function_shape="ovo")
clf.fit(X, y)
plot_decision_boundary(lambda x: clf.predict(x))
plt.title("SVM-Linearly")
plt.show()

Let's first look at how the linear kernel classifies this data:
[Figure: decision boundary learned with the linear kernel]
Now let's look at the rbf kernel on the same nonlinear data:

clf = svm.SVC(C=1,kernel="rbf",decision_function_shape="ovo")
clf.fit(X, y)
plot_decision_boundary(lambda x: clf.predict(x))
plt.title("SVM-Nonlinearity")
plt.show()

The results are as follows:
[Figure: decision boundary learned with the rbf kernel]
The difference between the linear kernel and the rbf kernel is easy to see from the two figures above, and on this data the rbf kernel clearly fits the classes better. However, while running the examples one can feel that the rbf kernel is somewhat slower than the linear kernel, and its running time keeps growing as the amount of data increases.
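A rough way to check this on your own data is to time the fits (a sketch that is not in the original post; it reuses the X_train/y_train split from step 3, and actual timings depend on the machine and the data size):

import time

for kernel in ("linear", "rbf"):
    clf_k = svm.SVC(C=10, kernel=kernel, decision_function_shape="ovr")
    start = time.perf_counter()
    clf_k.fit(X_train, y_train)
    print(kernel, "fit time: %.3f s" % (time.perf_counter() - start))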

3. decision_function_shape : The original SVM only handles binary classification; to extend it to multi-class problems, a fusion strategy has to be adopted. 'ovo' (one-vs-one) trains a classifier between every pair of classes, while 'ovr' (one-vs-rest) separates each class from all the other classes.
Here we focus on one-vs-rest : a binary model is learned for each class that separates that class from all the others as well as possible, so as many binary models are produced as there are classes. To make a prediction, all binary classifiers are run on the test point, and the class whose classifier gives the highest score wins; its label is returned as the prediction.
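To see the difference concretely, here is a small sketch (not from the original post) on a toy 4-class problem: with 'ovr' the decision function returns one score per class, while with 'ovo' it returns one score per pair of classes.

from sklearn import svm
from sklearn.datasets import make_blobs

X4, y4 = make_blobs(n_samples=200, centers=4, random_state=0)  # 4 classes

clf_ovr = svm.SVC(kernel="linear", decision_function_shape="ovr").fit(X4, y4)
clf_ovo = svm.SVC(kernel="linear", decision_function_shape="ovo").fit(X4, y4)

print(clf_ovr.decision_function(X4[:1]).shape)  # (1, 4): one score per class
print(clf_ovo.decision_function(X4[:1]).shape)  # (1, 6): one score per pair of classes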

Summary

SVM is fundamentally a binary classification model, and the data it handles falls into three cases:

1. Linearly separable: learn a linear classifier by maximizing the hard margin, corresponding to a straight line in the plane.
2. Approximately linearly separable: learn a linear classifier by maximizing the soft margin.
3. Linearly inseparable: learn a nonlinear classifier through the kernel function and soft-margin maximization, corresponding to a curve in the plane.

One way in which SVM improves on linear models is that it can handle nonlinearly separable data, even data that only becomes separable in a higher-dimensional space, such as the classification shown below:
[Figure: example of nonlinearly separable data classified by SVM]

There are still many things here that could be done better; suggestions from readers are welcome, and I hope to meet some friends to discuss these topics with.


Origin blog.csdn.net/weixin_43580339/article/details/115350097