Using sklearn (cross-validation)
Reference tutorial: Morvan (莫烦Python) sklearn tutorial
1. Basic sklearn usage
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
'a simple train/test split'
__author__ = 'xuchao'
from sklearn.datasets import load_iris # load the data
from sklearn.model_selection import train_test_split # split the data into training and test sets
from sklearn.neighbors import KNeighborsClassifier # k-nearest-neighbors classifier
# 1 prepare data
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=4) # hold out 33% of the samples for testing
knn = KNeighborsClassifier(n_neighbors=5) # define a basic model that votes among the 5 nearest neighbors (not 5 classes)
knn.fit(X_train, y_train) # fit the model
accuracy = knn.score(X_test, y_test) # mean accuracy on the test set (not the error)
print(accuracy)
Output:
0.98
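A single train/test split gives an estimate that depends on which samples happen to land in the test set, which is the motivation for cross-validation in the next section. The following is a minimal sketch that repeats the split above with several values of random_state; the list of seeds is an arbitrary choice for illustration and does not come from the original tutorial.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Arbitrary seeds, chosen only for illustration (an assumption of this sketch).
seeds = [0, 1, 2, 3, 4]
accuracies = []
for seed in seeds:
    # Same 67%/33% split as above, but with a different shuffle each time.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    accuracies.append(knn.score(X_test, y_test))

for seed, acc in zip(seeds, accuracies):
    print(f"random_state={seed}: accuracy={acc:.3f}")
```

If the printed accuracies differ from seed to seed, that spread is the uncertainty a single split hides.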
2.1 K-fold cross-validation in sklearn
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
'a simple cross validation'
__author__ = 'xuchao'
from sklearn.datasets import load_iris # load the data
from sklearn.neighbors import KNeighborsClassifier # k-nearest-neighbors classifier
from sklearn.model_selection import cross_val_score # runs the cross-validation
# 1 prepare data
iris = load_iris()
X = iris.data
y = iris.target
# no manual train_test_split here: cross_val_score does the splitting itself
knn = KNeighborsClassifier(n_neighbors=5) # vote among the 5 nearest neighbors
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy') # K-fold cross-validation; cv is the number of folds
print(scores)
Output:
[ 1. 0.93333333 1. 1. 0.86666667 0.93333333
0.93333333 1. 1. 1. ]
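The ten per-fold scores are usually condensed into a single number. A minimal sketch, reusing the same data and model as above, that reports the mean and standard deviation across folds (the names mean_acc and std_acc are introduced here only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

# With an integer cv and a classifier, sklearn uses stratified folds,
# so every fold keeps roughly the same class proportions.
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')

mean_acc = scores.mean()  # overall estimate of the accuracy
std_acc = scores.std()    # spread of the estimate across folds
print(f"accuracy: {mean_acc:.3f} +/- {std_acc:.3f} over {len(scores)} folds")
```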
2.2 Visualizing parameter selection with K-fold cross-validation in sklearn
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
'a simple cross validation'
__author__ = 'xuchao'
from sklearn.datasets import load_iris # load the data
from sklearn.neighbors import KNeighborsClassifier # k-nearest-neighbors classifier
from sklearn.model_selection import cross_val_score # runs the cross-validation
import matplotlib.pyplot as plt
import numpy as np
# 1 prepare data
iris = load_iris()
X = iris.data
y = iris.target
# 2 train the model
k_range = np.arange(1, 31) # candidate values of n_neighbors: 1..30
result = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k) # vote among the k nearest neighbors
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy') # 10-fold cross-validation, scored by accuracy
    result.append(scores.mean()) # mean accuracy over the 10 folds
# 3 visualize the result
plt.figure()
plt.title('K-fold cross validation')
plt.xlabel('n_neighbors')
plt.ylabel('accuracy')
plt.plot(k_range, result, 'b')
plt.show()
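What is K-fold cross-validation?
In machine learning, a dataset A is normally split into a training set B and a test set C. When there are not enough samples, we want to use the whole dataset to evaluate the algorithm, so A is randomly divided into k folds. Each time, one fold serves as the test set and the remaining k-1 folds serve as the training set. This is repeated k times so that every fold is used as the test set exactly once, and the k scores are averaged.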
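The procedure just described can be written out by hand. The following is a minimal sketch, not what cross_val_score does internally: it uses KFold with shuffle=True to match the "randomly divided" description, and n_splits=5 and random_state=0 are arbitrary choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# shuffle=True matters here because the iris samples are ordered by class.
# n_splits and random_state are assumptions of this sketch.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    # One fold is held out for testing, the other k-1 folds are used for training.
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X[train_idx], y[train_idx])
    fold_scores.append(knn.score(X[test_idx], y[test_idx]))

print(fold_scores)           # one accuracy per fold
print(np.mean(fold_scores))  # averaged cross-validation accuracy
```

Every sample appears in exactly one test fold, so the averaged score uses all of the data for evaluation without ever testing on a sample the model was trained on.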
3.1 Learning curve
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
'learning curve'
__author__ = 'xuchao'
from sklearn.datasets import load_digits # load the data
from sklearn.svm import SVC # support vector classifier for the digits dataset
from sklearn.model_selection import learning_curve # scores the model on training sets of different sizes
import matplotlib.pyplot as plt
import numpy as np
# 1 prepare data
digits = load_digits()
X = digits.data
y = digits.target
# 2 train the model
# train_sizes is the key argument: the model is trained on 10%, 32.5%, 55%, 77.5% and 100% of the available training data
# note: the MSE on class labels is kept from the tutorial; 'accuracy' is the more usual scoring for classification
train_sizes, train_loss, test_loss = learning_curve(SVC(gamma=0.01), X, y, cv=10, scoring='neg_mean_squared_error', train_sizes=np.linspace(0.1, 1.0, 5))
# 3 average the scores
train_loss_mean = - np.mean(train_loss, axis=1) # the scores are negated MSE values, so flip the sign to plot a positive loss
test_loss_mean = - np.mean(test_loss, axis=1) # 2-D array: one row per training size, one column per fold, so average each row
# 4 visualize the data
plt.figure()
plt.plot(train_sizes, train_loss_mean, 'o-', color="r", label="train loss") # 'o-' draws both the points and the connecting line
plt.plot(train_sizes, test_loss_mean, 'o-', color="b", label="cross validation")
plt.legend(loc='best') # show the legend
plt.xlabel('train sizes')
plt.ylabel('loss')
plt.show()
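From the above we can see that learning_curve trains the model on increasing fractions of the data. The training loss stays at 0 while the validation loss keeps decreasing, so in this case adding more data reduces overfitting. The gap between the two curves is what a learning curve lets us observe: a large gap between training and validation loss is a sign of overfitting.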
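To make the return values of learning_curve concrete, here is a minimal sketch that prints them instead of plotting. It differs from the listing above in two ways that are choices of this sketch: cv=5 keeps it quick, and scoring='accuracy' replaces the negated MSE so that the numbers read as accuracies.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# cv=5 and scoring='accuracy' are assumptions of this sketch.
train_sizes, train_scores, test_scores = learning_curve(
    SVC(gamma=0.01), X, y, cv=5, scoring='accuracy',
    train_sizes=np.linspace(0.1, 1.0, 5))

print(train_sizes)         # absolute numbers of training samples, not fractions
print(train_scores.shape)  # (number of training sizes, number of folds)

# Average over the folds (axis=1) to get one value per training size.
for n, tr, te in zip(train_sizes,
                     train_scores.mean(axis=1),
                     test_scores.mean(axis=1)):
    print(f"{int(n):5d} samples: train accuracy={tr:.3f}, validation accuracy={te:.3f}")
```

The fractions passed in train_sizes come back as absolute sample counts, which is why they can be used directly as the x-axis of the plot.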
3.2 Using validation_curve to see how a parameter controls overfitting
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
'validation curve'
__author__ = 'xuchao'
from sklearn.datasets import load_digits # load the data
from sklearn.svm import SVC # support vector classifier for the digits dataset
from sklearn.model_selection import validation_curve # scores the model for different values of one parameter
import matplotlib.pyplot as plt
import numpy as np
# 1 prepare data
digits = load_digits()
X = digits.data
y = digits.target
# 2 train the model
param_range = np.logspace(-6, -2.3, 5)
train_loss, test_loss = validation_curve(SVC(), X, y, param_name='gamma', param_range=param_range, cv=10, scoring='neg_mean_squared_error')
# note: param_name must be the name of a parameter of the estimator, not an arbitrary string; param_range is the list of values to try
# 3 average the scores
train_loss_mean = - np.mean(train_loss, axis=1) # the scores are negated MSE values, so flip the sign to plot a positive loss
test_loss_mean = - np.mean(test_loss, axis=1) # 2-D array: one row per gamma value, one column per fold, so average each row
# 4 visualize the data
plt.figure()
plt.plot(param_range, train_loss_mean, 'o-', color="r", label="train loss") # 'o-' draws both the points and the connecting line
plt.plot(param_range, test_loss_mean, 'o-', color="b", label="cross validation")
plt.legend(loc='best') # show the legend
plt.xlabel('gamma')
plt.ylabel('loss')
plt.show()
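From the plot we can see that once gamma goes beyond about 0.001, the training loss is 0 but the validation loss keeps rising, which means the model has started to overfit.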
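Instead of reading the turning point off the plot, the value of gamma with the best mean validation score can be picked programmatically. A minimal sketch under the following assumptions: cv=5 and scoring='accuracy' are choices made here for speed and readability, and best_gamma is a name introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
param_range = np.logspace(-6, -2.3, 5)

# cv=5 and scoring='accuracy' are assumptions of this sketch.
train_scores, test_scores = validation_curve(
    SVC(), X, y, param_name='gamma', param_range=param_range,
    cv=5, scoring='accuracy')

# One row per gamma value, one column per fold: average over the folds.
mean_train = train_scores.mean(axis=1)
mean_test = test_scores.mean(axis=1)
for g, tr, te in zip(param_range, mean_train, mean_test):
    print(f"gamma={g:.2e}: train accuracy={tr:.3f}, validation accuracy={te:.3f}")

# The gamma with the highest mean validation accuracy.
best_gamma = param_range[np.argmax(mean_test)]
print("best gamma:", best_gamma)
```

A gamma whose training accuracy is near 1 while its validation accuracy is clearly lower is in the overfitting regime described above.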
Note:
The IPython notebooks for these examples are available on my GitHub site.