Using sklearn: cross-validation

Reference tutorial: 莫凡's sklearn tutorial

1. Basic sklearn usage

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

'a simple cross validation'
__author__ = 'xuchao'

from sklearn.datasets import load_iris   # load the data
from sklearn.model_selection import train_test_split  # split the data into train and test sets
from sklearn.neighbors import KNeighborsClassifier   # use KNN to classify the data

# 1 prepare data
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=4)  # split the data
knn = KNeighborsClassifier(n_neighbors=5)  # define a basic KNN model with k=5 neighbors
knn.fit(X_train, y_train)   # fit the model
accuracy = knn.score(X_test, y_test)  # score() returns the prediction accuracy on the test set
print(accuracy)

Output:

0.98
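
The 0.98 above comes from a single train/test split, so it depends on how random_state happened to divide the data. A minimal sketch (same data and model as above; the seed values are arbitrary) that re-runs the split with different seeds shows this variance, which is what motivates cross-validation:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
for seed in (0, 1, 2, 3, 4):  # arbitrary seeds, for illustration only
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.33, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    print(seed, knn.score(X_test, y_test))  # test accuracy varies with the split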

2.1 K-fold cross-validation in sklearn

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

'a simple cross validation'
__author__ = 'xuchao'

from sklearn.datasets import load_iris   # load the data
from sklearn.model_selection import train_test_split  # split the data into train and test sets
from sklearn.neighbors import KNeighborsClassifier   # use KNN to classify the data
from sklearn.model_selection import cross_val_score  # cross-validation scorer
# 1 prepare data
iris = load_iris()
X = iris.data
y = iris.target
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=4)  # split the data
knn = KNeighborsClassifier(n_neighbors=5)  # KNN model with k=5 neighbors
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')  # run K-fold cross-validation; cv is the number of folds
print(scores)

Output:

[ 1.          0.93333333  1.          1.          0.86666667  0.93333333
  0.93333333  1.          1.          1.        ]
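
cross_val_score returns one score per fold, so the usual summary is the mean across the folds, optionally with the standard deviation as a spread estimate. Continuing from the scores above:

print(scores.mean())  # average accuracy over the 10 folds
print(scores.std())   # fold-to-fold spread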

2.2 Visualizing K-fold cross-validation for model selection

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

'a simple cross validation'
__author__ = 'xuchao'

from sklearn.datasets import load_iris   # load the data
from sklearn.model_selection import train_test_split  # split the data into train and test sets
from sklearn.neighbors import KNeighborsClassifier   # use KNN to classify the data
from sklearn.model_selection import cross_val_score  # cross-validation scorer
import matplotlib.pyplot as plt
import numpy as np
# 1 prepare data
iris = load_iris()
X = iris.data
y = iris.target
# 2 train the model
k_range = np.arange(1, 31)  # candidate values of n_neighbors
result = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)  # KNN model with k neighbors
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')  # 10-fold cross-validation, scored by accuracy
    result.append(scores.mean())  # mean accuracy across the 10 folds
# 3 visualize the result
plt.figure()
plt.title('K-fold cross-validation')
plt.xlabel('n_neighbors')
plt.ylabel('accuracy')
plt.plot(k_range, result, 'b')
plt.show()

[Figure: mean 10-fold cross-validation accuracy as a function of n_neighbors]

What is K-fold cross-validation?

In machine learning, a dataset A is normally split into a training set B and a test set C. When the sample size is limited, to make full use of the data for evaluating an algorithm, dataset A is instead randomly partitioned into k folds; each fold in turn serves as the test set while the remaining k-1 folds are used for training, as in the sketch below.
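
A minimal sketch of this procedure using sklearn's KFold directly (cross_val_score performs the same loop internally; the fold count and shuffle settings here are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target
kf = KFold(n_splits=5, shuffle=True, random_state=0)  # partition into 5 folds
for train_idx, test_idx in kf.split(X):
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X[train_idx], y[train_idx])          # train on the other k-1 folds
    print(knn.score(X[test_idx], y[test_idx]))   # test on the held-out fold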

3.1 Learning curves

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

'learning curve'
__author__ = 'xuchao'

from sklearn.datasets import load_digits   # load the data
from sklearn.svm import SVC    # use SVC to classify the digits dataset
from sklearn.model_selection import learning_curve  # compute learning curves over different training-set sizes
import matplotlib.pyplot as plt
import numpy as np
# 1 prepare data
digits = load_digits()
X = digits.data
y = digits.target
# 2 train the model
train_sizes, train_loss, test_loss = learning_curve(SVC(gamma=0.01), X, y, cv=10, scoring='neg_mean_squared_error', train_sizes=np.linspace(0.1, 1.0, 5))  # train_sizes is the key argument: train on 10%, 32.5%, 55%, 77.5%, and 100% of the data
# 3 average the loss over the folds
train_loss_mean = -np.mean(train_loss, axis=1)  # neg_mean_squared_error is negative, so negate it for plotting
test_loss_mean = -np.mean(test_loss, axis=1)   # shape is (n_train_sizes, n_folds), so average over the folds (axis=1)
# 4 plot the curves
plt.figure()
plt.plot(train_sizes, train_loss_mean, 'o-', color="r", label="train loss")  # 'o-' draws both markers and a line
plt.plot(train_sizes, test_loss_mean, 'o-', color="b", label="cross-validation")
plt.legend(loc='best')  # show the legend
plt.xlabel('train sizes')
plt.ylabel('loss')
plt.show()

[Figure: training loss and cross-validation loss vs. training-set size]
From the figure we can see that learning_curve evaluates the model at increasing fractions of the data: the training loss stays at essentially zero while the validation loss keeps falling, so adding more data would further reduce overfitting. Learning curves are a convenient way to watch for overfitting; a sketch that also shows the per-fold spread follows below.
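
The per-fold spread is often informative too. A small extension of the script above (it assumes train_sizes, the loss arrays, and their means are already defined as computed there) shades one standard deviation around each mean curve:

train_std = np.std(train_loss, axis=1)  # fold-to-fold spread at each training size
test_std = np.std(test_loss, axis=1)
plt.figure()
plt.fill_between(train_sizes, train_loss_mean - train_std, train_loss_mean + train_std, color='r', alpha=0.2)
plt.fill_between(train_sizes, test_loss_mean - test_std, test_loss_mean + test_std, color='b', alpha=0.2)
plt.plot(train_sizes, train_loss_mean, 'o-', color='r', label='train loss')
plt.plot(train_sizes, test_loss_mean, 'o-', color='b', label='cross-validation')
plt.legend(loc='best')
plt.show()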

3.2 Using validation_curve to study how a parameter controls overfitting

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

'validation curve'
__author__ = 'xuchao'

from sklearn.datasets import load_digits   # load the data
from sklearn.svm import SVC    # use SVC to classify the digits dataset
from sklearn.model_selection import validation_curve  # score the model over a range of one hyperparameter
import matplotlib.pyplot as plt
import numpy as np
# 1 prepare data
digits = load_digits()
X = digits.data
y = digits.target
# 2 train the model
param_range = np.logspace(-6, -2.3, 5)  # gamma values to test
train_loss, test_loss = validation_curve(SVC(), X, y, param_name='gamma', param_range=param_range, cv=10, scoring='neg_mean_squared_error')
# note: param_name must be an actual parameter of the estimator (here SVC's gamma), and param_range is the list of values to test
# 3 average the loss over the folds
train_loss_mean = -np.mean(train_loss, axis=1)  # neg_mean_squared_error is negative, so negate it for plotting
test_loss_mean = -np.mean(test_loss, axis=1)   # shape is (n_param_values, n_folds), so average over the folds (axis=1)
# 4 plot the curves
plt.figure()
plt.plot(param_range, train_loss_mean, 'o-', color="r", label="train loss")  # 'o-' draws both markers and a line
plt.plot(param_range, test_loss_mean, 'o-', color="b", label="cross-validation")
plt.legend(loc='best')  # show the legend
plt.xlabel('gamma')
plt.ylabel('loss')
plt.show()

[Figure: training loss and cross-validation loss as a function of gamma]
From the figure, once gamma grows past about 0.001 the training loss drops to zero while the validation loss keeps rising, which indicates the model has started to overfit; a tuning sketch follows below.
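
Once the sensitive range of a parameter is known from a validation curve, a grid search can pick the best value automatically. A minimal sketch with GridSearchCV (the gamma range simply reuses the one from the script above, and accuracy is used as an illustrative metric):

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
import numpy as np

digits = load_digits()
param_grid = {'gamma': np.logspace(-6, -2.3, 5)}  # same range as above
search = GridSearchCV(SVC(), param_grid, cv=10, scoring='accuracy')
search.fit(digits.data, digits.target)
print(search.best_params_)  # gamma with the best mean CV accuracy
print(search.best_score_)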

Note:

The IPython notebook is available on my GitHub site.

Reposted from blog.csdn.net/alxe_made/article/details/80494029