Splitting a training set in Python: K-fold cross-validation

First, generate a toy training set: 100 image filenames, each with a random label from 0 to 4.

import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold

filename_label = {'filename':[str(i)+'.jpg' for i in range(100)], 'label':[np.random.randint(0,5) for i in range(100)]}
train = pd.DataFrame(filename_label)

print(train['label'].value_counts())
'''
2    23
1    23
0    20
4    18
3    16
Name: label, dtype: int64
'''
train.head(10)


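Next, use sklearn.model_selection.StratifiedKFold to split this table into 2K CSV files: K training sets plus K test sets. Stratification keeps the label proportions in each fold close to those of the full set.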

n_splits = 5  # K
x = train['filename'].values
y = train['label'].values
skf = StratifiedKFold(n_splits=n_splits, random_state=42, shuffle=True)

for index, (train_index, test_index) in enumerate(skf.split(x, y), start=1):
    # Rows selected for training in this fold
    res_train = pd.DataFrame()
    res_train['filename'] = train['filename'].iloc[train_index]
    res_train['label'] = train['label'].iloc[train_index]
    res_train.to_csv("train_{}.csv".format(index), index=False)

    # Rows held out for testing in this fold
    res_test = pd.DataFrame()
    res_test['filename'] = train['filename'].iloc[test_index]
    res_test['label'] = train['label'].iloc[test_index]
    res_test.to_csv("test_{}.csv".format(index), index=False)

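Because this is 5-fold cross-validation, the ratio of training rows to test rows in each split is 4:1 (80 training rows to 20 test rows for the 100 samples here).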
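To check the 4:1 ratio and the stratification without writing any files, something like the following sketch could be used. The fixed seed and the names `fold_sizes` and `max_deviation` are assumptions added for illustration; they are not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Assumption: a fixed seed so the toy labels are reproducible.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "filename": [f"{i}.jpg" for i in range(100)],
    "label": rng.integers(0, 5, size=100),
})

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
overall = train["label"].value_counts().sort_index()

fold_sizes = []      # (n_train, n_test) for each fold
max_deviation = 0.0  # largest gap between a test fold's class count and overall / 5

for train_index, test_index in skf.split(train["filename"], train["label"]):
    fold_sizes.append((len(train_index), len(test_index)))
    # Per-class counts in this test fold, aligned to the overall label index
    test_counts = (
        train["label"].iloc[test_index].value_counts()
        .reindex(overall.index, fill_value=0)
    )
    deviation = (test_counts - overall / 5).abs().max()
    max_deviation = max(max_deviation, float(deviation))

print(fold_sizes)
print(max_deviation)
```

Each sample lands in exactly one test fold, so the test sizes add up to 100, and the per-class count in any test fold should differ from one fifth of that class's total by at most one sample.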


Reposted from blog.csdn.net/itnerd/article/details/104307606