Step 29 Machine Learning Classification in Action: Support Vector Machine (SVM) Modeling


Foreword

This step covers support vector machine (SVM) modeling.


1. Data preprocessing

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Load the data: column 0 is the label, columns 1-13 are the features
dataset = pd.read_csv('X disease code fs.csv')
X = dataset.iloc[:, 1:14].values
Y = dataset.iloc[:, 0].values

# 70/30 train-test split, then standardize the features
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.30, random_state = 666)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

2. SVM parameter tuning strategy

First, review the parameters (see the scikit-learn SVC documentation). The ones that need tuning are:
① kernel: one of {'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'} (or a callable); the default is 'rbf'.
② C: floating-point number, default 1.0. Regularization parameter; the strength of the regularization is inversely proportional to C, and it must be strictly positive. A common choice is 10^t for t from -4 to 4, i.e. 0.0001 to 10000 (see the snippet after this list).
③ gamma: kernel coefficient, only relevant for the 'rbf', 'poly' and 'sigmoid' kernels. One option is to use the reciprocals of the following numbers: 0.1, 0.2, 0.4, 0.6, 0.8, 1.6, 3.2, 6.4, 12.8.
④ degree: integer, default 3. Degree of the polynomial kernel ('poly'); ignored by the other kernels.
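
For reference, the C candidates described in ② can be generated as powers of ten; a minimal sketch with NumPy:

import numpy as np

# 10^t for t = -4, ..., 4, i.e. 0.0001 to 10000
C_candidates = np.logspace(-4, 4, num=9)
print(C_candidates)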

3. SVM tuning demonstration

(A) Start with the default parameters:

from sklearn.svm import SVC
classifier = SVC(random_state = 0, probability=True)  # defaults: kernel='rbf', C=1.0
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
y_testprba = classifier.predict_proba(X_test)[:,1]    # predicted probability of the positive class
y_trainpred = classifier.predict(X_train)
y_trainprba = classifier.predict_proba(X_train)[:,1]
from sklearn.metrics import confusion_matrix
cm_test = confusion_matrix(y_test, y_pred)
cm_train = confusion_matrix(y_train, y_trainpred)
print(cm_train)
print(cm_test)
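
The snippet above only prints the confusion matrices. If you also want accuracy and AUC, here is a minimal sketch reusing the variables defined above (assuming a binary outcome, as the [:, 1] probability slicing implies):

from sklearn.metrics import accuracy_score, roc_auc_score
print('Train accuracy:', accuracy_score(y_train, y_trainpred))
print('Test accuracy:', accuracy_score(y_test, y_pred))
print('Train AUC:', roc_auc_score(y_train, y_trainprba))
print('Test AUC:', roc_auc_score(y_test, y_testprba))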

As the saying goes, even a lean camel is bigger than a horse: SVM performs well even with the default parameters:
[screenshots: training and test confusion matrices]
Now let's tune the parameters to see whether SVM can catch up with those ensemble models, tuning separately for each kernel:


(B) kernel='rbf', only need to adjust gamma:

from sklearn.svm import SVC
param_grid = [{
    'gamma': [10, 5, 2.5, 1.5, 1.25, 0.625, 0.3125, 0.15, 0.05, 0.025, 0.0125],
}]
boost = SVC(kernel='rbf', random_state = 0, probability=True)
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(boost, param_grid, n_jobs = -1, verbose = 1, cv=10)      
grid_search.fit(X_train, y_train)    
classifier = grid_search.best_estimator_  
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
y_testprba = classifier.predict_proba(X_test)[:,1] 
y_trainpred = classifier.predict(X_train)
y_trainprba = classifier.predict_proba(X_train)[:,1]
from sklearn.metrics import confusion_matrix
cm_test = confusion_matrix(y_test, y_pred)
cm_train = confusion_matrix(y_train, y_trainpred)
print(cm_train)
print(cm_test)
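
To see which gamma the 10-fold grid search picked, you can inspect the fitted GridSearchCV object (the exact values depend on your data):

print(grid_search.best_params_)     # e.g. {'gamma': ...}
print(grid_search.best_score_)      # mean cross-validated score of the best gamma
print(grid_search.best_estimator_)  # the refitted SVC with the winning gamma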

Look at the results, not bad:
[screenshots: training and test confusion matrices]


(C) kernel='linear': no kernel-specific parameters to tune:

from sklearn.svm import SVC
classifier = SVC(kernel='linear', random_state = 0, probability=True)
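
The remaining steps (fit, predict, confusion matrices) are identical to (A); a condensed sketch for completeness:

classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
y_trainpred = classifier.predict(X_train)
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_train, y_trainpred))
print(confusion_matrix(y_test, y_pred))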

Looking at the results, it's not bad:
[screenshots: training and test confusion matrices]


(D) kernel='sigmoid', adjust parameters C (capital C) and gamma:

from sklearn.svm import SVC
param_grid = [{
    'gamma': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000],
}]
boost = SVC(kernel='sigmoid', random_state = 0, probability=True)
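
The grid search and evaluation are run exactly as in (B); a condensed sketch:

from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(boost, param_grid, n_jobs = -1, verbose = 1, cv=10)
grid_search.fit(X_train, y_train)
classifier = grid_search.best_estimator_
print(classifier)  # this is the "optimal model" shown below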

Results, slightly worse:
[screenshots: training and test confusion matrices]
Optimal model:
SVC(C=0.01, gamma=4, kernel='sigmoid', probability=True, random_state=0)


(E) kernel='poly', tune C (capital C), gamma and degree:
One more thing: here is how to see which parameters each kernel exposes:

from sklearn.svm import SVC
classifier = SVC(kernel='poly', random_state = 0, probability=True)
classifier.get_params().keys()

This prints all the parameters available for kernel='poly':

dict_keys(['C', 'break_ties', 'cache_size', 'class_weight', 'coef0', 'decision_function_shape', 'degree', 'gamma', 'kernel', 'max_iter', 'probability', 'random_state', 'shrinking', 'tol', 'verbose'])

(a) First, tune C (capital C) and gamma:

from sklearn.svm import SVC
param_grid = [{
    'gamma': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000],
}]
boost = SVC(kernel='poly', random_state = 0, probability=True)
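
As before, feed boost and param_grid into the same GridSearchCV pipeline as in (B) and evaluate the best estimator; a condensed sketch:

from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(boost, param_grid, n_jobs = -1, verbose = 1, cv=10)
grid_search.fit(X_train, y_train)
print(grid_search.best_estimator_)  # best C and gamma found for the poly kernel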

Look at the results:
[screenshots: training and test confusion matrices]
(b) Then tune degree:

from sklearn.svm import SVC
param_grid = [{
    'degree': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
}]
boost = SVC(C=0.0001, gamma=2, kernel='poly', probability=True, random_state=0)
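
Here C and gamma are fixed at the values found in (a), and only degree is searched; run it through the same grid search (a short sketch):

from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(boost, param_grid, n_jobs = -1, verbose = 1, cv=10)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)  # reports the selected degree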

In the end, the degree is still 3.

Summary

Overall, the rbf and linear kernels seem to perform best. SVM remains a strong choice: at least on this data set, parameter tuning is quick and easy, and the performance holds up.

Origin blog.csdn.net/qq_30452897/article/details/130403337