机器学习中SVM、决策树、逻辑回归、K近邻算法python代码实现

上篇文章咱已经叙述了如何从excel表中读取保存的数据特征及类别标签，这篇文章主要介绍机器学习中常用到的算法python代码实现方式，主要包括SVM、决策树、逻辑回归和K近邻算法。
在这里插入图片描述
这是上篇文章中我们使用到的数据特征，前六列是特征向量，第七列是数据标签。这篇文章默认已经把数据读出来了（如果不会读取数据的小伙伴可以去看我的上一篇文章）。

首先我们观察到x1-x6的特征相差太大，x1的单位是0.000几，而x6的单位较大有1000以上，所以这里我们首先对数据进行标准化。

数据标准化

这里我们主要使用python的 sklearn 依赖库。这先说一下sklearn库的安装方法，在cmd中直接输入以下命令，等待一会就可以了。

pip install scikit-learn

数据标准化的代码主要用到sklearn库中的preprocessing模块。

from sklearn import preprocessing
features = preprocessing.scale(features)

preprocessing.scale( ) 函数主要是对数据调整为均值为0，方差为1的正态分布，也就是高斯分布，属于数据标准化。

features = preprocessing.normalize(features, norm='l1')   # L1正则化
features = preprocessing.normalize(features, norm='l2')   # L2正则化

preprocessing.normalize( ) 函数可以实现L1正则化和L2正则化，也是很常用的方法。

详细的介绍可以参考这篇文章，写的很好：https://blog.csdn.net/weixin_40807247/article/details/82793220

机器学习代码实现

这里使用了五折交叉验证方法，具体就是先把数据评分为五份，取其中四份作为训练集，另一份作为验证集。测试五次最后求平均值。

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

curr_score = 0
ids = 0
kf = KFold(n_splits=5, random_state=None, shuffle=False)    # 五折交叉验证方法

for train_index, test_index in kf.split(features):          # 数据处理，得到数组中数据的下标
    ids = ids + 1
    if model == 'svm':
        clf = SVC(kernel='rbf', C=5, gamma=0.05, probability=True)     # SVM算法
    elif model == 'decision_tree':
        clf = DecisionTreeClassifier(random_state=3)				   # 决策树算法
    elif model == 'logistic':
        clf = LogisticRegression()									   # 逻辑回归算法
    elif model == 'KN':
        clf = KNeighborsClassifier(n_neighbors=5)					   # K紧邻算法
    else:
        print("We don't support this model!")
        
    clf.fit(features[train_index], labels[train_index])      # 训练
    preds = clf.predict(features[test_index])                # 预测结果类别, 比如：1
    preds_proba = clf.predict_proba(features[test_index])    # 预测结果概率, 比如：[0.92, 0.03, 0.05]

    cm = confusion_matrix(labels[test_index], preds)		 # 这里是混淆矩阵

    now_score = clf.score(features[test_index], labels[test_index])   # 五折交叉验证，每次测试结果
    print("第 %d 折的预测准确率是：" % ids, now_score)
    curr_score = curr_score + now_score

avg_score = curr_score/5   # 最后的测试结果
print("最后的平均预测准确率是：", avg_score)

其中KFold()函数可实现交叉验证数据划分方法，非常好用。

train_test_split（）函数实现普通的训练集和测试集划分。

以上就是机器学习算法的python代码实现方法，比起深度学习来说简单太多了，主要代码就几行。

SVM算法

下面对SVM算法进行额外的一些介绍，SVM算法有两个很重要的超参数：“C” 和 “gamma”。

这里是对“C” 和 “gamma”值的一些分析，可以助大家对惩罚项参数多一些了解。
在这里插入图片描述

这里介绍一种简单找较优值的方法（尝试法），参考周志华老师的视频讲解。代码如下，可以求出两个超参数的值。

train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.2)  # 常用的数据划分训练集、测试集方法
# 设置可选超参数
C_value = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100]        # C值序列
gamma_value = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100]	   # gamma值序列
# 初始化变量
best_score = 0
best_params = {
    
    'C':None, 'gamma':None}

time_0 = time.time()
print("Start training ...")
for C in C_value:
    for gamma in gamma_value:
        svm = SVC(C=C, gamma=gamma)
        svm.fit(train_features, train_labels)
        score = svm.score(test_features, test_labels)
        if score > best_score:
            best_score = score
            best_params['C'] = C
            best_params['gamma'] = gamma
print(best_score, best_params)

这样就可以让电脑自动尝试较好的“C” 和 “gamma”值了。

日常学习记录，一起交流讨论吧！侵权联系~