Choosing between the linear kernel and the Gaussian kernel in SVM

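The kernels most commonly used with SVM are the linear kernel and the Gaussian (RBF) kernel. Taking the 'linear' and 'rbf' options of SVC in sklearn as the example, for a raw dataset of shape [n, m] (n samples, m features) the usual rules of thumb are:

m is large relative to n. For example m ≥ n, m = 10000, n = 10~1000, i.e. (m/n) ≥ 10.
Consider 'linear'.
m is small and n is moderate. For example m = 1~1000, n = 10~10000, i.e. (m/n) in [0.0001, 100].
Consider 'rbf'.
m is small and n is very large. For example m = 1~1000, n = 50000+, i.e. (m/n) ≤ 0.02.
Add more features (increase m), then consider 'linear'.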
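The rules of thumb above can be written down as a small helper. This is only a sketch: the function name `suggest_kernel` and the exact cut-offs (m ≥ n, m ≤ 1000, n ≤ 10000, n ≥ 50000) are assumptions read off the example ranges, not tuned thresholds.

```python
def suggest_kernel(n_samples, n_features):
    """Return a kernel suggestion from the shape [n, m] of a dataset.

    Hypothetical helper: the cut-offs mirror the example ranges in the
    rules of thumb and are assumptions, not tuned values.
    """
    n, m = n_samples, n_features
    if m >= n:
        # Many features relative to samples: a linear boundary is usually enough.
        return 'linear'
    if m <= 1000 and n >= 50000:
        # Few features, very many samples: add features first, then go linear.
        return 'linear (after adding features)'
    if m <= 1000 and n <= 10000:
        # Few features, moderate number of samples.
        return 'rbf'
    # Outside the example ranges the rules give no clear answer.
    return 'compare both'


# Shapes used later in this post: the raw training set and the PCA-reduced one.
print(suggest_kernel(966, 1850))
print(suggest_kernel(966, 150))
```

A shape that falls between the example ranges (say 20000 samples and 500 features) returns 'compare both', which is a reminder that the rules are a starting point rather than a decision procedure.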

Note: logistic regression can be chosen in roughly the same situations as 'linear'.
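To illustrate why logistic regression and a linear-kernel SVM are often interchangeable, here is a minimal sketch that fits both on the same data. The synthetic dataset from `make_classification` and the parameter values are assumptions chosen for illustration; they are not the LFW faces used below.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data (an assumption for illustration).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           n_redundant=0, n_clusters_per_class=1,
                           class_sep=2.0, flip_y=0.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# Both models learn a linear decision boundary; only the loss differs
# (hinge loss for the SVM, log loss for logistic regression).
models = {
    'svm_linear': SVC(kernel='linear', C=1.0),
    'logistic': LogisticRegression(max_iter=1000),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print('%s accuracy: %.3f' % (name, scores[name]))
```

On data like this the two scores are usually close, which is the sense in which the two choices are "roughly equivalent"; it does not mean they always agree.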

#!/usr/bin/python
# encoding: utf-8


"""
@author : jack_lu
@contact : my@google
@File : SVM
@time : 2018/12/12 12:12
"""

# dataset used for practice
from sklearn.datasets import fetch_lfw_people

# feature extraction
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# feature transformation
from sklearn.preprocessing import StandardScaler

# sklearn model-selection utilities
from sklearn.model_selection import train_test_split

# metrics
from sklearn.metrics import accuracy_score

# machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

1. Basic data preparation

print('#'*50 + '  1. Basic data preparation  ' + '#'*50)
lfw_people2 = fetch_lfw_people(min_faces_per_person=70, resize=0.4)  # downloaded on first use (the author needed a proxy to fetch it); cached under C:\Users\Administrator\scikit_learn_data\lfw_home\joblib\sklearn\datasets\lfw\_fetch_lfw_people

##################################################  1. Basic data preparation  ##################################################

n_samples, h, w = lfw_people2.images.shape
X = lfw_people2.data
y = lfw_people2.target
n_features = X.shape[1]

target_names = lfw_people2.target_names
n_class = target_names.shape[0]

print('#'*20,'Dataset overview','#'*20)
print('**Number of samples**: %d' %(X.shape[0]))
print('**Feature dimension**: %d' %(X.shape[1]))
print('**Number of target classes**: %d' %(n_class))
print('#'*20,'Dataset overview','#'*20)

#################### Dataset overview ####################
**Number of samples**: 1288
**Feature dimension**: 1850
**Number of target classes**: 7
#################### Dataset overview ####################

X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.25,random_state=1)

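print('#'*20,'Training set overview','#'*20)
print('**Number of training samples**: %d' %(X_train.shape[0]))
print('**Training feature dimension**: %d' %(X_train.shape[1]))
print('**Number of target classes**: %d' %(n_class))
print('#'*20,'Training set overview','#'*20)

#################### Training set overview ####################
**Number of training samples**: 966
**Training feature dimension**: 1850
**Number of target classes**: 7
#################### Training set overview ####################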

2. Comparison of the cases

print('#'*50 + '  2. Modeling and comparison  ' + '#'*50)

##################################################  2. Modeling and comparison  ##################################################

1. SVM(kernel='linear'): uses the dataset [966,1850] directly

svm_origin = SVC(kernel='linear', C=1000, decision_function_shape='ovo')  # SVC always trains one-vs-one internally for multiclass; decision_function_shape only changes the shape of decision_function, not the predictions
svm_origin.fit(X_train, y_train)
y_pred = svm_origin.predict(X_test)
print('**Case 1 - linear accuracy**: %s' %(accuracy_score(y_pred=y_pred, y_true=y_test)))

**Case 1 - linear accuracy**: 0.832298136646

2. SVM(kernel='rbf'): uses the dataset [966,1850] directly

svm_rbf = SVC(kernel='rbf', C=1000, decision_function_shape='ovo')  # gamma is left at its default; as in case 1, decision_function_shape does not affect the predictions
svm_rbf.fit(X_train, y_train)
y_pred = svm_rbf.predict(X_test)
print('**Case 2 - rbf accuracy**: %s' %(accuracy_score(y_pred=y_pred, y_true=y_test)))

**Case 2 - rbf accuracy**: 0.44099378882
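One possible contributor to the low 'rbf' score on the raw data, which this post does not verify, is that the RBF kernel is computed from Euclidean distances and is therefore sensitive to the feature scale and to gamma. The sketch below uses synthetic data with deliberately inflated feature values (an assumption for illustration, not the LFW faces) and compares an unscaled SVC with the same SVC behind a StandardScaler.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data; the factor 100 inflates the feature scale on purpose.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           n_redundant=0, n_clusters_per_class=1,
                           class_sep=2.0, flip_y=0.0, random_state=0)
X = X * 100.0
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# gamma='auto' is 1 / n_features, so it does not adapt to the feature scale.
rbf_raw = SVC(kernel='rbf', C=1000, gamma='auto')
rbf_scaled = Pipeline([('scale', StandardScaler()),
                       ('svc', SVC(kernel='rbf', C=1000, gamma='auto'))])

rbf_raw.fit(X_train, y_train)
rbf_scaled.fit(X_train, y_train)

acc_raw = accuracy_score(y_test, rbf_raw.predict(X_test))
acc_scaled = accuracy_score(y_test, rbf_scaled.predict(X_test))
print('rbf without scaling: %.3f' % acc_raw)
print('rbf with scaling:    %.3f' % acc_scaled)
```

With large feature values and a fixed gamma, the kernel value between almost any two distinct points is close to zero, so the unscaled model has little to generalize from. Whether this is what happens on the face data would need to be checked there.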

3. LR: uses the dataset [966,1850] directly

lr_origin = LogisticRegression()  # multiclass handling is left at the library default, which depends on the scikit-learn version
lr_origin.fit(X_train, y_train)
y_pred = lr_origin.predict(X_test)
print('**Case 3 - LR accuracy**: %s' %(accuracy_score(y_pred=y_pred, y_true=y_test)))

**Case 3 - LR accuracy**: 0.826086956522

4. After dimensionality reduction

print('#'*20,'After reducing the dimension from 1850 to 150','#'*20)

#################### After reducing the dimension from 1850 to 150 ####################

def namestr(obj, namespace):
    return [name for name in namespace if namespace[name] is obj]
print(namestr(lr_origin,globals()),'\n',
namestr(lr_origin,globals())[0])

['lr_origin', 'model'] 
 lr_origin

def small_feature_model(model,X_train=X_train,y_train=y_train,X_test=X_test, y_test=y_test):
    # standardize, then project onto 150 whitened principal components
    pca = PCA(n_components=150,random_state=0,whiten=True)
    pipeline = Pipeline([('scale',StandardScaler()),('pca',pca)])
    # fit the preprocessing on the training set only, then apply it to both sets
    processing = pipeline.fit(X_train)
    X_train = processing.transform(X_train)
    X_test = processing.transform(X_test)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # print(namestr(model,globals()))
    print('**small-%s accuracy**: %.3f' %(namestr(model, globals())[0],accuracy_score(y_pred=y_pred, y_true=y_test)))

for model in [svm_origin, svm_rbf, lr_origin]:
    small_feature_model(model)

**small-svm_origin accuracy**: 0.789
**small-svm_rbf accuracy**: 0.811
**small-lr_origin accuracy**: 0.835

print('#'*50 + '  Done  ' + '#'*50)

##################################################  Done  ##################################################

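3. Summary

From the results:

After reducing the dimension to 150, kernel='rbf' (0.811) outperforms 'linear' (0.789); on the raw 1850-dimensional data the order is reversed (0.441 vs 0.832).
Without any tuning, LR already performs well (0.826 on the raw data, 0.835 after reduction), so it is worth trying first when starting to model.

Of course, the runs above use fixed parameters (C=1000, default gamma), and the reduced pipeline also standardizes and whitens the features, so the dimension is not the only thing that changed. The main purpose is to compare how the two SVM kernels behave at different ratios of n to m. For a concrete problem it is still advisable, as far as the available compute allows, to pick the best kernel by comparing them with grid search.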
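The grid search recommended above could look like the following minimal sketch. The synthetic data, the step names 'scale' and 'svc', and the candidate values for C and gamma are all assumptions for illustration; on a real problem the grid should be adapted to the data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (an assumption for illustration, not the LFW faces).
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# Scaling lives inside the pipeline so that it is refit on every CV fold.
pipeline = Pipeline([('scale', StandardScaler()), ('svc', SVC())])

# One sub-grid per kernel, so that gamma is only searched for 'rbf'.
param_grid = [
    {'svc__kernel': ['linear'], 'svc__C': [0.1, 1, 10]},
    {'svc__kernel': ['rbf'], 'svc__C': [0.1, 1, 10],
     'svc__gamma': ['scale', 0.01]},
]
search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy')
search.fit(X_train, y_train)

print('best params:', search.best_params_)
print('best CV accuracy: %.3f' % search.best_score_)
print('test accuracy: %.3f' % search.score(X_test, y_test))
```

Treating the kernel as just another hyperparameter means the comparison between 'linear' and 'rbf' is made under cross-validation with each kernel's own best C (and gamma), rather than at one fixed setting.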
