[四]机器学习之支持向量机SVM

4.1 实验数据

本数据集来源于UCI的Adult数据集，并对其进行处理得到的。数据集下载地址：http://archive.ics.uci.edu/ml/datasets/Adult。本实验使用LIBSVM包对该数据进行分类。

原始数据集每条数据有14个特征，分别为age,workclass,fnlwgt(final weight),education,education-num,marital-status,occupation,relationship,race,sex,captital-gain,captital-loss,hours-per-week和native-country。其中有6个特征是连续值，包括age,fnlwgt.education-num,captital-gain,captital-loss,hours-per-week;其它8个特征是离散的。本数据首先要做的处理是：将连续特征离散化，将有M个类别的离散特征转换为M个二进制特征。

本数据集共有48842条数据，每条数据从原始特征的14个转换成123个，并以2：1的比例分为训练集和测试集，其中a9a为训练集，用来训练分类器模型；a9a-t是测试集，用来预测模型的分类效果。它共有两个类别，标签分别用-1和1表示，标签的含义是一个人一年的薪资是否超过50K，1表示超过50K，-1表示不超过50K。

变换后的数据下载地址：https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#a9a

每个特征转换方式如下：

（1）age：连续值，拓展为5位，即第1-5维，采用one-hot方式，划分标准如下

1.age<=25,第1维为1；

2.26<=age<=32,第2维为1；

3.33<=age<=40,第3维为1；

4.41<=age<=49,第4维为1；

5.age>=50,第5维为1；

（2）workclass：离散值，取值为Private,Self-emp-not-inc,Self-emp-inc,Federal-gov,Local-gov,State-gov,Without-pay,Never-worked,共8个取值，扩展为8位，即6-13维

（3）fnlwgt：连续值，扩展为5位，即14-18维，划分标准如下

1.fnlwgt<=110000,第14维为1；

2.110000<=fnlwgt<=159999,第15维为1；

2.160000<=fnlwgt<=196335,第16维为1；

2.196336<=fnlwgt<=259865,第17维为1；

2.fnlwgt>=259866,第18维为1；

（4）education：离散值，取值有：Bachelors，Some-college，11th，HS-grad，Prof-school，Assoc-acdm，Assoc-voc，9th，7th-8th，12th，Masters，1st-4th，10th，5-6th，Preschool共16个，扩展为16位，即19-34维。

（5）education-num：连续值，扩展为5位，即35-39维，划分标准如下

1.11th，9th，7-8th，12th，1st-4th，10th，5th-6th，Preschool：第35维为1；

2.HS-grad：第36维为1；

3.Some-college：第37维为1；

4.Assoc-acdm，Assoc-voc：第38维为1；

5.Bachelors，Prof-school，Masters，Doctorate：第39维为1。

（6）marital-status：离散值，取值有：Married-civ-spouse，Divorced，Never-married，Separated，Wideowed，Married-spouse-absent，Married-AF-spouse，扩展为7位，即40-46维。

（7）occupation：离散值，取值有：Tech-support，Craft-repair，Other-service，Sales，Exec-managerial，Prof-specialty，Handlers-cleaners，Machine-op-inspct，Adm-clerical，Farming-fishing，Transport-moving，Priv-house-serv，Protective-serv，Armed-Forces共14个，扩展为14位，即47-60维。

（8）relationship：离散值，取值为Wife，Own-Child，Husband，Not-in-family，Other-relative，Unmarrie共6个，扩展为6位，即61-66维。

（9）race：离散值，取值有：White，Asian-Pac-Islander，Amer-Indian-Eskimo，Other，Black共5个，扩展为5位，即67-71维。

（10）sex：离散值，取值有Female，Male共2个，扩展为2位，即72-73维。

（11）captital-gain：连续值，扩展为2位，即74-75维，划分标准如下

1.captital-gain=0：第74维为1；

2.captital-gain≠0：第75维为1.

（12）captital-loss：连续值，扩展为两位，即76-77维，划分标准如下

1.captital-loss=0：第76维为1；

2.captital-loss≠0：第77维为1

（13）hours-per-week：连续值，扩展为5位，即78-82维，划分标准如下

1.hours-per-week<=34：第78维为1；

2.35<=hours-per-week<=39：第79维为1；

3.hours-per-week=40：第80维为1；

4.41<=hours-per-week<=47：第81维为1；

5.hours-per-week>=48：第82维为1；

（14）native-country：离散值，取值有：United-States，Cambodia，England，Puerto-Rico，Canada，Germany，Outlying-US(Guam-USVI-etc)，India，Japan，Greece，South，China，Cuba，Iran，Honduras，Philippines，Italy，Poland，Jamaica，Vietnam，Mexico，Portugal，Ireland，France，Dominican-Republic，Laos，Ecuador，Taiwan，Haiti，Columbia，Hungary，Guatemala，Nicaragua，Scotland，Thailand，Yugoslavia，EI-Salvador，Trinidad&Tobago，Peru，Hong，Holand-Netherlands共41个，扩展为41位，即83-123维。

4.2 LIBSVM简介

LIBSVM是台湾大学林智仁教授等开发的一个简单、易于使用和快速有效的SVM模式识别与回归软件包，它是一个开源库，能够对SVM模型进行训练，给出预测，并利用数据集对预测结果进行测试。LIBSVM还提供了针对径向基函数和许多其他类型的核函数的支持。

LIBSVM下载地址：https://www.csie.ntu.edu.tw/~cjlin/libsvm/

4.3 LIBSVM调用过程

（1）确定本机python版本：

（2）下载zip或tar.gz压缩文件

（3）解压文件

（4）调用LIBSVM来训练分类器模型

import os
os.chdir('C:\Users\Administrator\Desktop\机器学习\libsvm-3.23\python')
from svmutil import *
y,x = svm_read_problem('a9a')
m = svm_train(y,x,'-c 5')

参数选项：

options:
-s svm_type : set type of SVM (default 0)
	0 -- C-SVC
	1 -- nu-SVC
	2 -- one-class SVM
	3 -- epsilon-SVR
	4 -- nu-SVR
-t kernel_type : set type of kernel function (default 2)
	0 -- linear: u'*v
	1 -- polynomial: (gamma*u'*v + coef0)^degree
	2 -- radial basis function: exp(-gamma*|u-v|^2)
	3 -- sigmoid: tanh(gamma*u'*v + coef0)
-d degree : set degree in kernel function (default 3)
-g gamma : set gamma in kernel function (default 1/num_features)
-r coef0 : set coef0 in kernel function (default 0)
-c cost : set the parameter C of C-SVC, epsilon-SVR, and nu-SVR (default 1)
-n nu : set the parameter nu of nu-SVC, one-class SVM, and nu-SVR (default 0.5)
-p epsilon : set the epsilon in loss function of epsilon-SVR (default 0.1)
-m cachesize : set cache memory size in MB (default 100)
-e epsilon : set tolerance of termination criterion (default 0.001)
-h shrinking: whether to use the shrinking heuristics, 0 or 1 (default 1)
-b probability_estimates: whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)
-wi weight: set the parameter C of class i to weight*C, for C-SVC (default 1)

The k in the -g option means the number of attributes in the input data.

调用LIBSVM来测试分类器模型的好坏：

test_y,test_x = svm_read_problem('a9a.t')
p_label,p_acc,p_val = svm_predict(test_y,test_x,m)

4.4 实验效果分析

从上面结果可知，利用LIBSVM和a9a的训练集训练得到的分类器模型在a9a测试集上的分类准确率约为84.97%。

[四]机器学习之支持向量机SVM

猜你喜欢