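Installation

Download libsvm from the official website or from GitHub and unpack it.
Run make in the top-level directory to build three executables: svm-train, svm-predict and svm-scale. The tools subdirectory also contains the helper scripts grid.py, subset.py and checkdata.py.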

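General steps for using libsvm

1) Prepare the data set in the format required by the LIBSVM package.
2) Apply simple scaling to the data.
3) Consider the RBF kernel first.
4) Use cross-validation to select the best parameters C and g (gamma).
5) Train on the whole training set with the best C and g to obtain the SVM model.
6) Use the resulting model for testing and prediction.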
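
The steps above map onto a short sequence of command lines. The sketch below only assembles those commands as strings, using the flags documented in these notes; the function name and the output file names (`train.model`, `test.predict`) are assumptions made for illustration, and C and g are expected to come from the grid search in step 4.

```python
def pipeline_commands(train, test, c, g):
    """Return shell commands for steps 2 to 6 as plain strings."""
    scaled_train = f"{train}.scale"
    scaled_test = f"{test}.scale"
    model = f"{train}.model"  # hypothetical output name
    return [
        # step 2: scale the training data and remember the scaling parameters
        f"./svm-scale -l -1 -u 1 -s range {train} > {scaled_train}",
        # step 2: scale the test data with the same parameters
        f"./svm-scale -r range {test} > {scaled_test}",
        # steps 3-4: grid search over C and g for the RBF kernel
        f"python grid.py -svmtrain ./svm-train {scaled_train}",
        # step 5: train on the whole training set with the chosen C and g
        f"./svm-train -c {c} -g {g} {scaled_train} {model}",
        # step 6: predict on the scaled test data
        f"./svm-predict {scaled_test} {model} {test}.predict",
    ]


for command in pipeline_commands("train", "test", 8, 0.5):
    print(command)
```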

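Example output

> ./svm-train -c 5 heart.scale heart.scale.model
optimization finished, #iter = 433
nu = 0.340308
obj = -385.016663, rho = 0.669878
nSV = 121, nBSV = 68
Total nSV = 121
// #iter is the number of iterations
// nu is the parameter of the equivalent nu-SVM formulation that corresponds to the chosen C (it is not a kernel parameter)
// obj is the optimal objective value of the dual quadratic program that the SVM problem is converted to
// rho is the bias term of the decision function sgn(w^T x - rho), i.e. b = -rho
// nSV is the number of support vectors (a[i] > 0)
// nBSV is the number of bounded support vectors (a[i] = c)
// Total nSV is the total number of support vectors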

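The trained model is saved to the model file (heart.scale.model in the command above). A model file looks like the following; this listing reports 132 support vectors, so it comes from a run with different parameters than the -c 5 output above:

svm_type c_svc      % SVM type used for training
kernel_type rbf     % kernel type used for training
gamma 0.0769231     % gamma of the kernel; the default is 1/k, where k is the number of features
nr_class 2          % number of classes
total_sv 132        % total number of support vectors
rho 0.424462        % constant term of the decision function
label 1 -1          % class labels
nr_sv 64 68         % number of support vectors for each class label
SV                  % the support vectors follow; each line starts with its coefficient
1 1:0.166667 2:1 3:-0.333333 4:-0.433962 5:-0.383562 6:-1 7:-1 8:0.0687023 9:-1 10:-0.903226 11:-1 12:-1 13:1
0.5104832128985164 1:0.125 2:1 3:0.333333 4:-0.320755 5:-0.406393 6:1 7:1 8:0.0839695 9:1 10:-0.806452 12:-0.333333 13:0.5
...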
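
To make the role of these fields concrete, here is a minimal sketch of how a two-class RBF model like the one listed could be evaluated: the decision value is the coefficient-weighted sum of kernel values against the support vectors, minus rho, and a positive value selects the first entry of the label line. The function names and the tiny two-vector model are made up for illustration; this is not libsvm's own code.

```python
import math


def rbf(u, v, gamma):
    # u and v are sparse vectors: dicts mapping feature index -> value,
    # with missing indices treated as 0.
    keys = set(u) | set(v)
    squared_distance = sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys)
    return math.exp(-gamma * squared_distance)


def decision_value(x, support_vectors, gamma, rho):
    # support_vectors: list of (coefficient, sparse_vector) pairs,
    # one pair per line after "SV" in the model file.
    return sum(coef * rbf(sv, x, gamma) for coef, sv in support_vectors) - rho


def predict(x, support_vectors, gamma, rho, labels):
    # Two-class case: a positive decision value maps to the first label.
    value = decision_value(x, support_vectors, gamma, rho)
    return labels[0] if value > 0 else labels[1]


# Toy model (hypothetical values, not taken from heart_scale).
toy_svs = [(1.0, {1: 0.0}), (-1.0, {1: 2.0})]
print(predict({1: 0.1}, toy_svs, gamma=1.0, rho=0.0, labels=[1, -1]))
```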

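Data format and data scaling

Data format

Each line of the data file has the form: <label> <index1>:<value1> <index2>:<value2> ...
1. Every line ends with a newline character, including the last one.
2. About label. For classification, label must be an integer; for regression, label can be any real number; for one-class SVM, label is not used and can be any number, but at test time, if the true outliers are known, label must be +1/-1 so that the result can be evaluated.
3. index starts at 1 and must appear in ascending order within a line; features with value 0 can be omitted.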
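
As an illustration of the format, the sketch below reads one line into a label and a sparse feature dictionary and writes it back out. The helper names are made up for this example, and the `:g` formatting keeps only about six significant digits, which is an assumption that may not suit every data set.

```python
def parse_line(line):
    """Split '<label> <index>:<value> ...' into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


def format_line(label, features):
    """Write the features in ascending index order, ending with a newline."""
    pairs = " ".join(f"{i}:{features[i]:g}" for i in sorted(features))
    return f"{label:g} {pairs}\n"


label, features = parse_line("1 1:0.125 2:1 10:-0.806452 12:-0.333333 13:0.5\n")
print(label, features)
print(format_line(-1, {2: 1.0, 1: 0.125}), end="")
```

Index 11 is absent from the parsed line, which is how a zero-valued feature appears in the sparse format.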

Scale the data with `svm-scale`:

Usage: ./svm-scale [options] data_filename
options:
-l lower : x scaling lower limit (default -1)
-u upper : x scaling upper limit (default +1)
-y y_lower y_upper : y scaling limits (default: no y scaling)
-s save_filename : save scaling parameters to save_filename
-r restore_filename : restore scaling parameters from restore_filename

> ./svm-scale -l -1 -u 1 -s range train > train.scale
> ./svm-scale -r range test > test.scale
The first command scales the data to [-1, 1], saves the scaling parameters to the file range, and redirects the scaled data to train.scale.
The second command scales the test data with the parameters stored in range and redirects the result to test.scale, so that training and test data are scaled in the same way.
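
The idea behind the two commands can be sketched as a per-feature linear map that is fitted on the training data and then reused unchanged on the test data. This is a simplified illustration rather than svm-scale's actual implementation: it assumes dense rows (every feature present in every row) and that every test feature was seen in training, and the function names are hypothetical.

```python
def fit_ranges(rows):
    """rows: list of {index: value}. Returns {index: (min, max)} per feature."""
    ranges = {}
    for row in rows:
        for i, v in row.items():
            lo, hi = ranges.get(i, (v, v))
            ranges[i] = (min(lo, v), max(hi, v))
    return ranges


def scale_row(row, ranges, lower=-1.0, upper=1.0):
    """Map each value linearly so that the training min/max hit lower/upper."""
    scaled = {}
    for i, v in row.items():
        lo, hi = ranges[i]
        if hi == lo:
            continue  # constant feature: dropped in this sketch
        scaled[i] = lower + (upper - lower) * (v - lo) / (hi - lo)
    return scaled


train = [{1: 0.0, 2: 10.0}, {1: 4.0, 2: 30.0}]
ranges = fit_ranges(train)                       # plays the role of the range file
print(scale_row({1: 2.0, 2: 30.0}, ranges))      # a training-like row
print(scale_row({1: 6.0, 2: 10.0}, ranges))      # a test row outside the training range
```

Because the test data reuses the training ranges, a test value can land outside [-1, 1], as feature 1 does in the last row.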

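Check the data format with `checkdata.py`:

python checkdata.py filename

Take a subset of the data with `subset.py`:

When the data set is large, it can help to experiment on a subset first.

python subset.py [-s 0 | 1] dataset subset_size [subset] [the rest]

python subset.py -s 0 train 1000 subset rest
     0 -- stratified selection (classification only)
     1 -- random selection
This uses method 0 to select 1000 samples from train and write them to the file subset; the remaining samples are written to the file rest.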
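
Stratified selection means that each class contributes to the subset in roughly the same proportion as in the full data set. The following is a small sketch of that idea under simple assumptions (per-class quotas obtained by rounding, so the result may differ from the requested size by a few samples); it is not a reproduction of the algorithm inside subset.py, and the function name is made up.

```python
import random
from collections import defaultdict


def stratified_subset(labels, subset_size, seed=0):
    """Return (subset_indices, rest_indices), keeping class proportions roughly intact."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for members in by_class.values():
        quota = round(subset_size * len(members) / len(labels))
        chosen.extend(rng.sample(members, min(quota, len(members))))
    chosen_set = set(chosen)
    rest = [i for i in range(len(labels)) if i not in chosen_set]
    return sorted(chosen), rest


labels = [1] * 80 + [-1] * 20
subset, rest = stratified_subset(labels, 10)
print(len(subset), len(rest), sum(1 for i in subset if labels[i] == 1))
```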

Training with `svm-train`

Usage: svm-train [options] training_set_file [model_file]
options:
-s svm_type : set type of SVM (default 0)
    0 -- C-SVC      (multi-class classification)
    1 -- nu-SVC     (multi-class classification)    nu lies in (0, 1]
    2 -- one-class SVM                  estimates the support of the data distribution; used for outlier detection
    3 -- epsilon-SVR    (regression)            epsilon-insensitive loss: each sample has a region in which deviations contribute no loss to the objective
    4 -- nu-SVR     (regression)            epsilon-SVR needs epsilon to be fixed in advance, which is not always easy; nu-SVR determines it automatically
-t kernel_type : set type of kernel function (default 2)
    0 -- linear: u'*v                       linear kernel
    1 -- polynomial: (gamma*u'*v + coef0)^degree            polynomial kernel
    2 -- radial basis function: exp(-gamma*|u-v|^2)         Gaussian kernel
    3 -- sigmoid: tanh(gamma*u'*v + coef0)              sigmoid kernel
    4 -- precomputed kernel (kernel values in training_set_file)    user-supplied kernel
-d degree : set degree in kernel function (default 3)                   for POLY
-g gamma : set gamma in kernel function (default 1/num_features)            for POLY/RBF/SIGMOID
-r coef0 : set coef0 in kernel function (default 0)                 for POLY/SIGMOID
-c cost : set the parameter C of C-SVC, epsilon-SVR, and nu-SVR (default 1)     for C_SVC/ E_SVR/ NU_SVR
-n nu : set the parameter nu of nu-SVC, one-class SVM, and nu-SVR (default 0.5)     for NU_SVC/ ONE_CLASS/ NU_SVR
-p epsilon : set the epsilon in loss function of epsilon-SVR (default 0.1)      for E_SVR
-m cachesize : set cache memory size in MB (default 100)                cache size
-e epsilon : set tolerance of termination criterion (default 0.001)         stopping tolerance
-h shrinking : whether to use the shrinking heuristics, 0 or 1 (default 1)      can speed up training in some cases
-b probability_estimates : whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)
-wi weight : set the parameter C of class i to weight*C, for C-SVC (default 1)      per-class weight that scales the penalty C
-v n: n-fold cross validation mode      number of folds
-q : quiet mode (no outputs)            suppresses the training output

Examples:

> svm-train -s 0 -c 5 -t 2 -g 0.5 -e 0.1 data_file
Train a classifier with RBF kernel exp(-0.5|u-v|^2), C=5, and
stopping tolerance 0.1.

> svm-train -s 3 -p 0.1 -t 0 data_file
Solve SVM regression with linear kernel u'v and epsilon=0.1
in the loss function.

> svm-train -s 0 -c 100 -g 0.1 -v 5 data_file
Do five-fold cross validation for the classifier using
the parameters C = 100 and gamma = 0.1

Prediction with `svm-predict`

Usage: svm-predict [options] test_file model_file output_file
options:
-b probability_estimates: 
    whether to predict probability estimates, 0 or 1(default 0);
    for one-class SVM only 0 is supported

Examples:

> svm-train -s 0 -b 1 data_file
> svm-predict -b 1 test_file data_file.model output_file
The -b option: train a model that carries probability information, then predict the test data with probability estimates.

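Some tips

Apply simple scaling to the data, to [0, 1] or [-1, +1].
For C_SVC, consider using grid.py in tools to select the best gamma and C by cross-validation.
nu in nu-SVC/one-class-SVM/nu-SVR approximates the fraction of training
errors and support vectors.
If the classes are unbalanced, try the -wi option to give the classes different penalty weights.
Specify a larger cache size (-m) for larger problems.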

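SVM model types `-s`

Definition in the source: enum { C_SVC, NU_SVC, ONE_CLASS, EPSILON_SVR, NU_SVR };
The first two are for classification, the last two for regression, and the middle one is unsupervised.
C_SVC and NU_SVC are both classification SVMs and differ in how they are parameterized: C_SVC uses the penalty C, which ranges over (0, +inf), while NU_SVC uses nu, which ranges over (0, 1].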

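C_SVC

C enters the objective function when the slack variables $\xi$ are introduced:

$\min \frac{1}{2}||\omega||^2+C\sum_{i=1}^{l}\xi_i$
C expresses how much weight is put on outliers, i.e. samples that violate the margin: the larger C is, the more heavily violations are penalized and the less willing the model is to give those samples up (a larger C is stricter).
Some samples can be misclassified at little cost and deserve a small C, while others are important and must not be misclassified, so they deserve a large C. libsvm offers this per class (not per sample) through the -wi option.
- Select this model by setting -s to 0 (the default).
- Set the penalty parameter with -c.
- Optionally use -wi to weight the penalty per class, which helps with unbalanced data (give the smaller class a larger weight).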
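
To see how C trades margin width against violations, the objective can be evaluated by hand for a fixed linear separator. The sketch below assumes the standard soft-margin definition of the slack, $\xi_i = \max(0, 1 - y_i(\omega^T x_i + b))$, and a linear kernel; the function name and the three toy samples are made up for illustration.

```python
def primal_objective(w, b, C, samples):
    """samples: list of (x, y) with x a list of floats and y in {+1, -1}."""
    total_slack = 0.0
    for x, y in samples:
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        total_slack += max(0.0, 1.0 - margin)  # slack is 0 when the margin is at least 1
    return 0.5 * sum(wi * wi for wi in w) + C * total_slack


samples = [([2.0, 0.0], +1), ([-2.0, 0.0], -1), ([0.5, 0.0], -1)]  # last one is misclassified
for C in (1, 10):
    print(C, primal_objective([1.0, 0.0], 0.0, C, samples))
```

The single misclassified sample has slack 1.5, so raising C from 1 to 10 raises its share of the objective from 1.5 to 15 while the margin term stays at 0.5.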

> ./svm-train -s 0 -c 10 data_file data_file.csvc
> ./svm-train -c 10 -w1 1 -w-2 5 -w4 2 data_file data_file.csvc
The penalty weights are 1 for class 1, 5 for class -2 and 2 for class 4. With C=10, the effective penalties for classes 1, -2 and 4 are therefore 10, 50 and 20.
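
The arithmetic in this example is simply weight times C for each class. A short sketch that reads the `-wi` options from an option list and reproduces the numbers; the helper names are hypothetical, and the parsing assumes every option is followed by exactly one value.

```python
def parse_weights(args):
    """Collect {class_label: weight} from options such as '-w1 1' or '-w-2 5'."""
    weights = {}
    for flag, value in zip(args[::2], args[1::2]):
        if flag.startswith("-w"):
            weights[int(flag[2:])] = float(value)  # text after '-w' is the class label
    return weights


def effective_penalties(C, weights):
    """Per-class penalty is weight * C."""
    return {label: C * weight for label, weight in weights.items()}


options = "-c 10 -w1 1 -w-2 5 -w4 2".split()
print(effective_penalties(10, parse_weights(options)))
```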

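NU_SVC

The value of C has no direct interpretation, which makes it hard to choose; NU_SVC was proposed to address this.
Meaning of nu:
1. An upper bound on the fraction of training samples that are margin errors
2. A lower bound on the fraction of training samples that are support vectors

Set -s to 1.
Set nu with -n (default 0.5); it must lie in (0, 1]. Only nu-SVC, one-class SVM and nu-SVR have the nu parameter.

> ./svm-train -s 1 -n 0.7 data_file data_file.nusvc
This selects -s 1, sets nu to 0.7 and saves the model as data_file.nusvc.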

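ONE_CLASS

Needs no class labels. It estimates the support of the data distribution and is used for density support estimation and outlier detection.

EPSILON_SVR

Uses the epsilon-insensitive loss: around each sample there is a region (its width is set with -p) in which deviations contribute no loss to the objective.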
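
Written out, the epsilon-insensitive loss is zero while the prediction stays within epsilon of the target and grows linearly beyond that. A minimal sketch, with the default epsilon taken from the `-p` option (0.1); the function name is made up.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Loss is 0 inside the epsilon tube and |error| - epsilon outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


for prediction in (1.05, 1.1, 1.5):
    print(prediction, epsilon_insensitive_loss(1.0, prediction))
```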

NU_SVR

EPSILON_SVR requires epsilon to be fixed in advance, and choosing a suitable value is not always easy. NU_SVR determines epsilon automatically; nu is set instead.

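Choosing the SVM kernel `-t`

Definition in the source: enum { LINEAR, POLY, RBF, SIGMOID, PRECOMPUTED };
The last one is a user-supplied (precomputed) kernel.
The default is -t 2, the RBF (Gaussian) kernel.
The RBF kernel is also the usual recommendation.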

LINEAR `-t 0`

Linear kernel: the data is not mapped to a higher-dimensional space, so this is the linear SVM.

$K(x_i, x_j) = x_i^T x_j$
No kernel-specific parameters need to be set.
./svm-train -s 0 -t 0 -c 4 data data.model

POLY `-t 1`

Polynomial kernel.

$K(x_i, x_j) = (\gamma x_i^T x_j + coef0)^{degree}$
Additional parameters:

-d degree: degree of the polynomial, default = 3
-g gamma: gamma of the polynomial, default = 1 / num_features
-r coef0: constant term of the polynomial, default = 0

./svm-train -s 0 -c 5 -t 1 -d 2 -g 0.5 -r 0.01 data data.model
./svm-train -s 0 -c 5 -t 1 -d 2 data data.model
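
For a feel of what the three parameters do, the kernel value can be computed directly. This sketch works on dense Python lists and uses the vector length as num_features for the default gamma, which is an assumption of the example; the function name is made up.

```python
def poly_kernel(u, v, gamma=None, coef0=0.0, degree=3):
    """(gamma * u.v + coef0) ** degree with the defaults of -g, -r and -d."""
    if gamma is None:
        gamma = 1.0 / len(u)  # default gamma = 1 / num_features
    dot = sum(a * b for a, b in zip(u, v))
    return (gamma * dot + coef0) ** degree


u, v = [1.0, 2.0], [3.0, 4.0]
print(poly_kernel(u, v, gamma=0.5, coef0=0.01, degree=2))  # parameters of the first command
print(poly_kernel(u, v, degree=2))                         # defaults for gamma and coef0
print(poly_kernel(u, v, gamma=1.0, degree=1))              # reduces to the linear kernel
```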

RBF `-t 2`

Gaussian (radial basis function) kernel

$K(x_i, x_j) = \exp(-\gamma ||x_i-x_j||^2)$
Additional parameter:
-g gamma, default = 1 / num_features

./svm-train -s 0 -t 2 -g 0.5 data data.model
./svm-train data data.model
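
The effect of gamma is easiest to see numerically: the kernel equals 1 for identical points and decays toward 0 with distance, faster for larger gamma. A small sketch on dense lists, again assuming the vector length stands in for num_features; the function name is made up.

```python
import math


def rbf_kernel(u, v, gamma=None):
    """exp(-gamma * ||u - v||^2); note the minus sign in the exponent."""
    if gamma is None:
        gamma = 1.0 / len(u)  # default gamma = 1 / num_features
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


u, v = [0.0, 0.0], [1.0, 1.0]
for gamma in (0.1, 0.5, 2.0):
    print(gamma, rbf_kernel(u, v, gamma))
```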

SIGMOID `-t 3`

$K(x_i, x_j) = \tanh(\gamma x_i^T x_j + coef0)$
Parameters to set:
-g gamma
-r coef0, the constant term

./svm-train -s 1 -n 0.1 -t 3 -g 2 -r 0.04 data data.model
./svm-train -t 3 data data.model
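
Because tanh is bounded, sigmoid kernel values always lie in [-1, 1] and saturate quickly when gamma times the dot product is large, which is one reason scaled inputs matter here. A minimal sketch with a made-up function name, using the vector length as num_features for the default gamma.

```python
import math


def sigmoid_kernel(u, v, gamma=None, coef0=0.0):
    """tanh(gamma * u.v + coef0)."""
    if gamma is None:
        gamma = 1.0 / len(u)  # default gamma = 1 / num_features
    dot = sum(a * b for a, b in zip(u, v))
    return math.tanh(gamma * dot + coef0)


print(sigmoid_kernel([0.1, 0.2], [0.3, 0.4], gamma=2, coef0=0.04))  # moderate value
print(sigmoid_kernel([1.0, 2.0], [3.0, 4.0], gamma=2, coef0=0.04))  # saturated near 1
```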

Parameter tuning

For a C_SVC model with the RBF kernel, tools/grid.py can select the best parameters automatically. The tool evaluates the cross-validation accuracy of every parameter combination; this is known as grid search and is essentially brute-force search.

python grid.py [grid_options] [svm_options] dataset

grid_options :
-log2c {begin,end,step | "null"} : set the range of c (default -5,15,2)
    begin,end,step -- c_range = 2^{begin,...,begin+k*step,...,end}
    "null"         -- do not grid with c
-log2g {begin,end,step | "null"} : set the range of g (default 3,-15,-2)
    begin,end,step -- g_range = 2^{begin,...,begin+k*step,...,end}
    "null"         -- do not grid with g
-v n : n-fold cross validation (default 5)
-svmtrain pathname : path to the svm-train executable
-gnuplot {pathname | "null"} : path to gnuplot; "null" disables plotting
-out {pathname | "null"} : output file (default dataset.out); "null" disables output
-resume [pathname] : resume the grid search from an existing output file (default pathname is dataset.out); this only works when the parameters are the same

svm_options : additional options for svm-train

python grid.py -svmtrain ./svm-train -log2c -5,5,1 -log2g -4,0,1 -v 5 -m 30000 -out grid.out train.scale
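
The begin,end,step notation expands into a list of exponents, and every (C, gamma) pair is then scored by cross-validation. The sketch below enumerates the grid and picks the best pair given a scoring function. It illustrates the brute-force idea only: grid.py's actual evaluation order and internals are not reproduced, and `cv_accuracy` is a hypothetical callable that would wrap a run of svm-train with -v.

```python
from itertools import product


def log2_range(begin, end, step):
    """Exponents begin, begin+step, ... up to and including end (step may be negative)."""
    values = []
    k = begin
    while (step > 0 and k <= end) or (step < 0 and k >= end):
        values.append(k)
        k += step
    return values


def parameter_grid(log2c, log2g):
    """All (C, gamma) pairs, each a power of two."""
    exponents = product(log2_range(*log2c), log2_range(*log2g))
    return [(2.0 ** c, 2.0 ** g) for c, g in exponents]


def grid_search(log2c, log2g, cv_accuracy):
    """Return the (C, gamma) pair with the highest score."""
    return max(parameter_grid(log2c, log2g), key=lambda cg: cv_accuracy(*cg))


grid = parameter_grid((-5, 5, 1), (-4, 0, 1))      # ranges from the command above
print(len(grid))                                   # number of (C, gamma) pairs to evaluate
print(len(parameter_grid((-5, 15, 2), (3, -15, -2))))  # size of the default grid
```

With 5-fold cross-validation each pair costs five training runs, so narrowing the ranges or first searching on a subset can save a lot of time.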
