Machine Learning: Dataset Splitting (Including Cross-Validation)

1. Hold-out method

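Split the original data into a training set, a validation set, and a test set while keeping the data distribution consistent across them; shuffling before the split helps.
Disadvantage: only a single split is made, so the result depends partly on chance.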

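from sklearn.model_selection import train_test_split
'''
X, Y: feature matrix and labels, assumed to be defined already.
(1) random_state=None (the default) gives a different split on every run;
    any fixed integer, including 0, makes the split reproducible.
(2) shuffle controls whether the data is shuffled before splitting (default True).
'''
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=42, shuffle=True)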
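
The call above produces only a training set and a test set. One possible way to obtain the three sets mentioned earlier is to call train_test_split twice. The sketch below assumes toy data made up for illustration and uses the stratify parameter to keep the class ratio the same in every subset; the 60/20/20 proportions are an arbitrary choice, not a recommendation from the original post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 samples, 4 features, classes in a 70/30 ratio
rng = np.random.default_rng(0)
X = rng.random((100, 4))
Y = np.array([0] * 70 + [1] * 30)

# Step 1: hold out 20% as the test set; stratify keeps the class ratio
X_rest, X_test, Y_rest, Y_test = train_test_split(
    X, Y, test_size=0.20, random_state=42, stratify=Y)

# Step 2: split the remaining 80% into 75/25,
# which gives 60% training and 20% validation overall
X_train, X_val, Y_train, Y_val = train_test_split(
    X_rest, Y_rest, test_size=0.25, random_state=42, stratify=Y_rest)

print(len(X_train), len(X_val), len(X_test))
print(Y_train.mean(), Y_val.mean(), Y_test.mean())
```

The second print shows the share of class 1 in each subset, which is a quick way to check that the distribution stayed consistent.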

2. Cross-validation

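Use a smaller k when the data set is large and a larger k when it is small.
Advantages: reduces the chance effects of a single random split, gives a more reliable estimate of generalization performance, and uses the data more efficiently.
Disadvantage: the folds can line up with the classes. Suppose the data set has 5 classes and is sorted by class, so the first fold contains only class 0, the second only class 1, and so on. The model then never sees the class it is tested on during training, so its score is very low, possibly 0.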

from sklearn.model_selection import KFold
kf = KFold(n_splits=2)
for train_index, test_index in kf.split(X):
    print('X_train:%s ' % X[train_index])
    print('X_test: %s ' % X[test_index])
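
The disadvantage described above can be reproduced on a small example. The sketch below assumes toy labels sorted by class (made up for illustration) and a hypothetical helper classes_per_test_fold; it compares KFold with and without shuffling.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy labels (hypothetical): 5 classes, 4 samples each, sorted by class
y = np.repeat(np.arange(5), 4)
X = np.arange(len(y)).reshape(-1, 1)

def classes_per_test_fold(splitter):
    """Return the set of classes that appear in each test fold."""
    return [set(y[test_idx].tolist()) for _, test_idx in splitter.split(X)]

# Without shuffling the folds are contiguous blocks: one class per fold
sorted_folds = classes_per_test_fold(KFold(n_splits=5))
print(sorted_folds)

# Shuffling before splitting mixes the classes across the folds
shuffled_folds = classes_per_test_fold(
    KFold(n_splits=5, shuffle=True, random_state=0))
print(shuffled_folds)
```

Shuffling reduces the problem but does not guarantee balanced folds; stratified splitting is the option that enforces the class proportions.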

3. Cross-validation: leave-one-out

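That is, k equals the number of samples: each sample serves as the test set exactly once.
Advantage: the split involves no randomness, and each training set is almost the full data set, so its distribution barely differs from the original.
Disadvantage: time-consuming, since one model is trained per sample.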

from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
for train_index, test_index in loo.split(X):
    print('X_train:%s ' % X[train_index])
    print('X_test: %s ' % X[test_index])
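
To illustrate that leave-one-out is k-fold with k set to the number of samples, the sketch below compares the two splitters on toy data made up for illustration. The variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(12).reshape(6, 2)  # toy data (hypothetical): 6 samples

# Collect the (train, test) index lists produced by each splitter
loo_splits = [(tr.tolist(), te.tolist()) for tr, te in LeaveOneOut().split(X)]
kf_splits = [(tr.tolist(), te.tolist())
             for tr, te in KFold(n_splits=len(X)).split(X)]

print(len(loo_splits))          # one split per sample
print(loo_splits == kf_splits)  # both splitters yield the same index pairs
```

The number of splits, and therefore the number of models to train, grows linearly with the sample count, which is where the cost comes from.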

4. Stratified cross-validation

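from sklearn.model_selection import StratifiedKFold
# random_state only takes effect when shuffle=True; recent scikit-learn versions
# raise an error if it is set while shuffle=False
kf = StratifiedKFold(n_splits=4, shuffle=False)
# The labels y are required so that each fold keeps the class proportions
for train_index, test_index in kf.split(X, y):
    print("Stratified X_train:", X[train_index])
    print("Stratified X_test:", X[test_index])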
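
To see what stratification does, the sketch below counts the classes in each test fold. It assumes imbalanced toy labels made up for illustration (12 samples of class 0 and 4 of class 1).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (hypothetical): 12 samples of class 0 followed by 4 of class 1
y = np.array([0] * 12 + [1] * 4)
X = np.arange(len(y)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=4)
fold_counts = []
for train_index, test_index in skf.split(X, y):
    # Count how many samples of each class land in this test fold
    fold_counts.append(np.bincount(y[test_index], minlength=2).tolist())

print(fold_counts)
```

Every test fold keeps the 3:1 class ratio of the full data set, even though the labels are sorted and no shuffling is applied.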
```python
# Cross-validation scores; model is an estimator assumed to be defined already
from sklearn.model_selection import cross_val_score
score = cross_val_score(model, X, y, cv=kf)
```


from sklearn.model_selection import ShuffleSplit,cross_val_score
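# n_splits is the number of random splits; because train_size + test_size < 1,
# some samples end up in neither the training set nor the test set of a split
kf = ShuffleSplit(train_size=0.5, test_size=0.4, n_splits=8)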
scores = cross_val_score(model,data,target,cv=kf)
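
The snippets above leave model, data, and target undefined. A self-contained version could look like the sketch below. It assumes the iris data set and a LogisticRegression classifier, neither of which comes from the original post; any scikit-learn estimator and labeled data set would do.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (ShuffleSplit, StratifiedKFold,
                                     cross_val_score)

# Assumed example data and model (not from the original post)
data, target = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Stratified 4-fold: one accuracy score per fold
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
skf_scores = cross_val_score(model, data, target, cv=skf)

# 8 random splits using 50% for training and 40% for testing
ss = ShuffleSplit(n_splits=8, train_size=0.5, test_size=0.4, random_state=0)
ss_scores = cross_val_score(model, data, target, cv=ss)

# In each ShuffleSplit iteration the remaining 10% of samples is unused
train_idx, test_idx = next(ss.split(data))
print(len(train_idx), len(test_idx), len(data))
print(skf_scores.mean(), ss_scores.mean())
```

cross_val_score returns one score per split, so averaging the array gives a single summary figure for each splitting strategy.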

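5. Bootstrap
Given a data set D with n samples, draw n samples from D with replacement to form the training set S; the samples that are never drawn serve as the test set.

(1) Advantages:

Useful when the data set is small and hard to split.
Produces a different S from D on each draw, which benefits ensemble learning and similar methods.

(2) Disadvantages:

S does not follow the same distribution as D, which introduces estimation bias.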

import numpy as np
import pandas as pd
import random
data = pd.DataFrame(np.random.rand(10,4),columns=list('ABCD'))
data['y'] = [random.choice([0,1]) for i in range(10)]
train = data.sample(frac=1.0, replace=True)  # random sampling with replacement
test = data.loc[data.index.difference(train.index)].copy()  # samples never drawn form the test set
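
How large is the test set obtained this way? Each sample is missed by all n draws with probability (1 - 1/n)^n, which tends to 1/e (about 0.368) as n grows, so roughly a third of the data ends up out of the bootstrap sample. The sketch below checks this numerically; the sample size and variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # number of samples in D (hypothetical)

# Draw n indices with replacement: one bootstrap sample S
sample_idx = rng.integers(0, n, size=n)

# Fraction of D that was never drawn (the out-of-bag share)
oob_fraction = 1 - len(np.unique(sample_idx)) / n

print(oob_fraction)      # empirical out-of-bag share
print((1 - 1 / n) ** n)  # expected share for this n
print(np.exp(-1))        # limit as n grows
```

With only 10 rows, as in the DataFrame example above, the test set size varies a lot from run to run; the one-third figure is an average that becomes stable only for larger n.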

c.x.y.07.30
