Machine Learning - support vector machine (SVM)

SVM principle

SVM confidential problem is actually finding the optimal classification boundary. And maximize the spacing support vector by a straight line or plane, the partition partitioning hyperplane.

Based liter dimensional transform kernel function

Characterized by a kernel function called conversion, adding new features, such that the linear low-dimensional space can not be separated to a high dimensional space the problem of linearly separable problem.

 

Linear kernel : Linear , not by nuclear dimension to enhance the function, only seek linear classification boundary in the original dimensions of space.

SVM classification based on linear kernel related API:

import sklearn.svm as svm
model = svm.SVC(kernel='linear')
model.fit(train_x, train_y)

Case: The multiple2.txt data classification.

import numpy as np
import sklearn.model_selection as ms
import sklearn.svm as svm
import sklearn.metrics as sm
import matplotlib.pyplot as mp
x, y = [], []
data = np.loadtxt('../data/multiple2.txt', delimiter=',', dtype='f8')
x = data[:, :-1]
y = data[:, -1]
train_x, test_x, train_y, test_y = \
    ms.train_test_split(x, y, test_size=0.25, random_state=5)
 # Based on the linear kernel SVM classifier 
Model = svm.SVC (Kernel = ' Linear ' ) 
model.fit (train_x, train_y) 
n- = 500 
L, R & lt = X [0 :,] .min () - . 1, X [0 :,] .max () +. 1 
B, T = X [:,. 1] .min () -. 1, X [:,. 1] .max () +. 1 
grid_x = np.meshgrid (NP .linspace (L, R & lt, n-), 
                     np.linspace (B, T, n-)) 
flat_x = np.column_stack ((grid_x [0] .ravel (), grid_x [. 1 ] .ravel ()))     
flat_y = Model .predict (flat_x) 
grid_y = flat_y.reshape (grid_x [0] .shape) 
pred_test_y = model.predict (test_x) 
Cr= sm.classification_report(test_y, pred_test_y)
print(cr)
mp.figure('SVM Linear Classification', facecolor='lightgray')
mp.title('SVM Linear Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y, cmap='gray')
mp.scatter(test_x[:, 0], test_x[:, 1], c=test_y, cmap='brg', s=80)
mp.show()

Polynomial kernel: poly , characterized by increasing the original sample power high power polynomial function

$$y = x_1+x_2 \\
y = x_1^2 + 2x_1x_2 + x_2^2 \\
y = x_1^3 + 3x_1^2x_2 + 3x_1x_2^2 + x_2^3$$

Case, the sample data based on kernel function training sample2.txt polynomial.

# Based on the linear kernel SVM classifier 
Model = svm.SVC (Kernel = ' poly ' , Degree =. 3 ) 
model.fit (train_x, train_y)

: Kernel function rbf probability distribution, increasing the original samples by a Gaussian distribution function wherein

Case, the sample data based on the kernel function of the training sample2.txt.

# Support Vector Machine classifier radial basis kernel function 
# C: Regular Strength 
# Gamma: standard deviation of the normal distribution curve 
Model = svm.SVC (Kernel = ' RBF ' , C = 600, Gamma = 0.01 ) 
Model. fit (train_x, train_y)

Sample category equalization

By category weight equalization, the smaller the proportion of samples with higher weights, and a larger proportion of sample weights lower, this average contribution of different types of sample classification model to improve model performance.

Sample category equalization related API:

model = svm.SVC(kernel='linear', class_weight='balanced')
model.fit(train_x, train_y)

Case: Case modify SVM linear kernel function, based on a sample category equalization read imbalance.txt training model.

... ...
... ...
data = np.loadtxt('../data/imbalance.txt', delimiter=',', dtype='f8')
x = data[:, :-1]
y = data[:, -1]
train_x, test_x, train_y, test_y = \
    ms.train_test_split(x, y, test_size=0.25, random_state=5)
# 基于线性核函数的支持向量机分类器
model = svm.SVC(kernel='linear', class_weight='balanced')
model.fit(train_x, train_y)
... ...
... ...

置信概率

根据样本与分类边界的距离远近,对其预测类别的可信程度进行量化,离边界越近的样本,置信概率越低,反之,离边界越远的样本,置信概率高。

获取每个样本的置信概率相关API:

# 在获取模型时,给出超参数probability=True
model = svm.SVC(kernel='rbf', C=600, gamma=0.01, probability=True)
预测结果 = model.predict(输入样本矩阵)
# 调用model.predict_proba(样本矩阵)可以获取每个样本的置信概率矩阵
置信概率矩阵 = model.predict_proba(输入样本矩阵)

置信概率矩阵格式如下:

  类别1 类别2
样本1 0.8 0.2
样本2 0.9 0.1
样本3 0.5 0.5

案例:修改基于径向基核函数的SVM案例,新增测试样本,输出每个测试样本的执行概率,并给出标注。

# 新增样本
prob_x = np.array([[2, 1.5], [8, 9], [4.8, 5.2], [4, 4], [2.5, 7], [7.6, 2], [5.4, 5.9]])
pred_prob_y = model.predict(prob_x)
probs = model.predict_proba(prob_x)
print(probs)
# [[3.00000090e-14 1.00000000e+00]
#  [3.00000090e-14 1.00000000e+00]
#  [9.73038186e-01 2.69618143e-02]
#  [5.65786038e-01 4.34213962e-01]
#  [2.77725531e-03 9.97222745e-01]
#  [2.91704904e-11 1.00000000e+00]
#  [9.43796673e-01 5.62033274e-02]]

# 绘制分类边界线
n = 500
l, r = x[:, 0].min() - 1, x[:, 0].max() + 1
b, t = x[:, 1].min() - 1, x[:, 1].max() + 1
grid_x = np.meshgrid(np.linspace(l, r, n), np.linspace(b, t, n))
flat_x = np.column_stack((grid_x[0].ravel(), grid_x[1].ravel()))
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)

mp.figure('Probability', facecolor='lightgray')
mp.title('Probability', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y, cmap='gray')
mp.scatter(test_x[:, 0], test_x[:, 1], c=test_y, cmap='brg', s=80)
mp.scatter(prob_x[:, 0], prob_x[:, 1], c=pred_prob_y, cmap='jet_r', s=80, marker='D')
# 绘制每个测试样本,并给出标注
for i in range(len(probs)):
    mp.annotate(
        '{}% {}%'.format(
            round(probs[i, 0] * 100, 2),
            round(probs[i, 1] * 100, 2)),
        xy=(prob_x[i, 0], prob_x[i, 1]),
        xytext=(12, -12),
        textcoords='offset points',
        horizontalalignment='left',
        verticalalignment='top',
        fontsize=9,
        bbox={'boxstyle': 'round,pad=0.6', 'fc': 'orange', 'alpha': 0.8})
mp.show()

网格搜索

获取一个最优超参数的方式可以绘制验证曲线,但是验证曲线只能每次获取一个最优超参数。如果多个超参数有很多排列组合的话,就可以使用网格搜索寻求最优超参数组合。

针对超参数组合列表中的每一个超参数组合,实例化给定的模型,做cv次交叉验证,将其中平均f1得分最高的超参数组合作为最佳选择,实例化模型对象。

网格搜索相关API:

import sklearn.model_selection as ms
model = ms.GridSearchCV(模型, 超参数组合列表, cv=折叠数)
model.fit(输入集,输出集)
# 获取网格搜索每个参数组合
model.cv_results_['params']
# 获取网格搜索每个参数组合所对应的平均测试分值
model.cv_results_['mean_test_score']
# 获取最好的参数
model.best_params_   # 最优超参数组合
model.best_score_    # 最优得分
model.best_estimator_  # 最优模型对象

案例:修改置信概率案例,基于网格搜索得到最优超参数。

import numpy as np
import sklearn.model_selection as ms
import sklearn.svm as svm
import sklearn.metrics as sm
import matplotlib.pyplot as plt

data = np.loadtxt('../machine_learning_date/multiple2.txt', delimiter=',', dtype='f8')
x = data[:, :-1]
y = data[:, -1]
# 选择svm做分类
train_x, test_x, train_y, test_y = ms.train_test_split(x, y, test_size=0.25, random_state=5)
model = svm.SVC(probability=True)
# 根据网格搜索选择最优模型
# 整理网格搜索所需要的超参数列表
params = [{'kernel': ['linear'], 'C': [1, 10, 100, 1000]},
          {'kernel': ['poly'], 'C': [1], 'degree': [2, 3]},
          {'kernel': ['rbf'], 'C': [1, 10, 100, 1000], 'gamma': [1, 0.1, 0.01, 0.001]}]
model = ms.GridSearchCV(model, params, cv=5)
model.fit(train_x, train_y)

# 获取得分最优的的超参数信息
print(model.best_params_)  # {'C': 1, 'gamma': 1, 'kernel': 'rbf'}
# 获取最优得分

print(model.best_score_)  # 0.96

# 获取最优模型的信息
print(model.best_estimator_)
# SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
#   decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf',
#   max_iter=-1, probability=True, random_state=None, shrinking=True,
#   tol=0.001, verbose=False)

# 输出每个超参数组合信息及其得分
for param, score in zip(model.cv_results_['params'], model.cv_results_['mean_test_score']):
    print(param, '->', score)
    # {'C': 1, 'kernel': 'linear'} -> 0.5911111111111111
    # {'C': 10, 'kernel': 'linear'} -> 0.5911111111111111
    # ...
    # ...
    # {'C': 1000, 'gamma': 0.01, 'kernel': 'rbf'} -> 0.9555555555555556
    # {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'} -> 0.92


pred_test_y = model.predict(test_x)
print(sm.classification_report(test_y, pred_test_y))
#              precision    recall  f1-score   support
#         0.0       0.95      0.93      0.94        45
#         1.0       0.90      0.93      0.92        30
# avg / total       0.93      0.93      0.93        75

# 新增样本
prob_x = np.array([[2, 1.5], [8, 9], [4.8, 5.2], [4, 4], [2.5, 7], [7.6, 2], [5.4, 5.9]])
pred_prob_y = model.predict(prob_x)
probs = model.predict_proba(prob_x)     # 获取每个样本的置信概率矩阵
print(probs)

# 绘制分类边界线
n = 500
l, r = x[:, 0].min() - 1, x[:, 0].max() + 1
b, t = x[:, 1].min() - 1, x[:, 1].max() + 1
grid_x = np.meshgrid(np.linspace(l, r, n), np.linspace(b, t, n))
flat_x = np.column_stack((grid_x[0].ravel(), grid_x[1].ravel()))
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)

plt.figure('Probability')
plt.title('Probability')
plt.xlabel('x', fontsize=14)
plt.ylabel('y', fontsize=14)
plt.tick_params(labelsize=10)
plt.pcolormesh(grid_x[0], grid_x[1], grid_y, cmap='gray')
plt.scatter(test_x[:, 0], test_x[:, 1], c=test_y, cmap='brg', s=80)
plt.scatter(prob_x[:, 0], prob_x[:, 1], c=pred_prob_y, cmap='jet_r', s=80, marker='D')
for i in range(len(probs)):
    plt.annotate('{}% {}%'.format(
        round(probs[i, 0] * 100, 2),
        round(probs[i, 1] * 100, 2)),
        xy=(prob_x[i, 0], prob_x[i, 1]),
        xytext=(12, -12),
        textcoords='offset points',
        horizontalalignment='left',
        verticalalignment='top',
        fontsize=9,
        bbox={'boxstyle': 'round,pad=0.6', 'fc': 'orange', 'alpha': 0.8})
plt.show()

 

事件预测

加载event.txt,预测某个时间段是否会出现特殊事件。

import numpy as np
import sklearn.preprocessing as sp
import sklearn.model_selection as ms
import sklearn.svm as svm
import sklearn.metrics as sm


class DigitEncoder:
    # 模拟LabelEncoder编写的数字编码器
    # 非数字字符串的特征需要做标签编码,
    # 数字字符串的特征需要做转换编码

    def fit_transform(self, y):
        return y.astype('i4')

    def transform(self, y):
        return y.astype('i4')

    def inverse_transform(self, y):
        return y.astype('str')


# 加载并整理数据集
# data = np.load('../machine_learning_date/events.txt', delimiter=",", dtype='U15')

data = []
with open('../machine_learning_date/events.txt', 'r') as f:
    for line in f.readlines():
        data.append(line.split(','))
data = np.array(data)
data = np.delete(data, 1, axis=1)
cols = data.shape[1]  # 获取一共有多少列
x, y = [], []
encoders = []
for i in range(cols):
    col = data[:, i]
    # 判断当前列是否是数字字符串
    if col[0].isdigit():
        encoder = DigitEncoder()
    else:
        encoder = sp.LabelEncoder()
    # 使用编码器对数据进行编码
    if i < cols - 1:
        x.append(encoder.fit_transform(col))
    else:
        y = encoder.fit_transform(col)
    encoders.append(encoder)

x = np.array(x).T   # (5040,4)
y = np.array(y)     # (5040,)

# 拆分测试集与训练集
train_x, test_x, train_y, test_y = ms.train_test_split(x, y, test_size=0.25, random_state=7)

# 构建模型
model = svm.SVC(kernel='rbf', class_weight='balanced')
model.fit(train_x, train_y)
# 测试
pred_test_y = model.predict(test_x)
print(sm.classification_report(test_y, pred_test_y))

# 业务应用
data = [['Tuesday', '13:30:00', '21', '23']]
data = np.array(data).T
x = []
for row in range(len(data)):
    encoder = encoders[row]
    x.append(encoder.transform(data[row]))
x = np.array(x).T
pred_y = model.predict(x)
print(encoders[-1].inverse_transform(pred_y))   # ['eventA\n']

交通流量预测(回归)

加载traffic.txt,预测在某个时间段某个交通路口的车流量。

"""车流量预测"""
import numpy as np
import sklearn.preprocessing as sp
import sklearn.model_selection as ms
import sklearn.svm as svm
import sklearn.metrics as sm


class DigitEncoder:
    def fit_transform(self, y):
        return y.astype(int)

    def transform(self, y):
        return y.astype(int)

    def inverse_transform(self, y):
        return y.astype(str)


data = []
# 回归
data = np.loadtxt('../machine_learning_date/traffic.txt', delimiter=',', dtype='U20')
data = data.T
encoders, x = [], []
for row in range(len(data)):
    if data[row][0].isdigit():
        encoder = DigitEncoder()
    else:
        encoder = sp.LabelEncoder()
    if row < len(data) - 1:
        x.append(encoder.fit_transform(data[row]))
    else:
        y = encoder.fit_transform(data[row])
    encoders.append(encoder)
x = np.array(x).T
train_x, test_x, train_y, test_y = \
    ms.train_test_split(x, y, test_size=0.25, random_state=5)
# 支持向量机回归器
model = svm.SVR(kernel='rbf', C=10, epsilon=0.2)
model.fit(train_x, train_y)
pred_test_y = model.predict(test_x)
print(sm.r2_score(test_y, pred_test_y))     # 0.6379517119380995

# 业务应用
data = [['Tuesday', '13:35', 'San Francisco', 'yes']]
data = np.array(data).T
x = []
for row in range(len(data)):
    encoder = encoders[row]
    x.append(encoder.transform(data[row]))
x = np.array(x).T
pred_y = model.predict(x)
print(int(pred_y))      # 27

回归:线性回归、岭回归、多项式回归、决策树、正向激励、随机森林、SVR。

分类:逻辑分类、朴素贝叶斯、决策树、随机森林、SVC。

Guess you like

Origin www.cnblogs.com/LXP-Never/p/11426527.html