Learning sklearn: train_test_split

train_test_split is a sklearn utility for randomly splitting a dataset into train and test subsets.

sklearn.model_selection.train_test_split(*arrays, **options)

The parameters worth noting are:

1. test_size

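test_size accepts several types:
If float, it must be between 0.0 and 1.0 and gives the proportion of the dataset to put in the test split;
If int, it gives the absolute number of test samples;
If None, it is set to the complement of train_size.

If train_size is also None, test_size defaults to 0.25.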
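The three cases above can be sketched as follows. This is a minimal illustration rather than part of the official example: the toy array X and the variable names are made up for the demonstration, and random_state=0 is set only so the run is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each
X = np.arange(20).reshape((10, 2))

# float: fraction of the samples that goes to the test set
X_train_f, X_test_f = train_test_split(X, test_size=0.2, random_state=0)

# int: absolute number of test samples
X_train_i, X_test_i = train_test_split(X, test_size=4, random_state=0)

# neither test_size nor train_size given: test_size falls back to 0.25
X_train_d, X_test_d = train_test_split(X, random_state=0)

print('float:', len(X_train_f), len(X_test_f))
print('int:', len(X_train_i), len(X_test_i))
print('default:', len(X_train_d), len(X_test_d))
```

With a float test_size the number of test samples is rounded up, so the default of 0.25 on 10 samples yields 3 test samples and 7 training samples.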

2. stratify

Type of stratify: array-like or None, default is None
If not None, the data is split in a stratified fashion, using this array as the class labels.
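A small sketch of what stratified splitting does, assuming a made-up imbalanced label array (75% class 0, 25% class 1). The data and variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 15 samples of class 0, 5 samples of class 1
X = np.arange(40).reshape((20, 2))
y = np.array([0] * 15 + [1] * 5)

# Passing the labels as stratify keeps the class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Count how many samples of each class ended up in each subset
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print('train class counts:', train_counts)
print('test class counts:', test_counts)
```

Both subsets keep the 3:1 class ratio of the full label array. Without stratify, a small test set could by chance contain no samples of the minority class at all.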

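3. random_state

Type of random_state: int, RandomState instance or None, optional, default is None
If int, random_state is the seed used by the random number generator;
If a RandomState instance, random_state is the random number generator itself;
If None, the random number generator is the RandomState instance used by np.random.

Fixing random_state matters because, when it is left unset, the split can differ on every run. Setting it to a fixed value ensures that when you change other variables, differences in the results are not caused by a different random split.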
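The effect of fixing the seed can be checked with a short sketch like the one below. The array y and the seed value 42 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.arange(10)  # hypothetical toy labels

# Same integer seed: both calls return exactly the same split
train_a, test_a = train_test_split(y, test_size=3, random_state=42)
train_b, test_b = train_test_split(y, test_size=3, random_state=42)
same_split = np.array_equal(train_a, train_b) and np.array_equal(test_a, test_b)
print('identical with fixed seed:', same_split)

# random_state=None (the default): the split may change from call to call
train_c, test_c = train_test_split(y, test_size=3)
print('unseeded test set:', test_c)
```

The unseeded call still returns 3 test samples, but which samples they are is not guaranteed to repeat between runs.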

official example:

import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)
print('x:',X,'\n')

print('list(y):',list(y),'\n')

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

print('x_train:',X_train,'\n')
print('y_train:',y_train,'\n')
print('X_test:',X_test,'\n')
print('y_test:',y_test,'\n')

# with shuffle=False the data is split in its original order, without shuffling
y_new_train,y_new_test = train_test_split(y, shuffle=False)
print('y_new_train',y_new_train,'\n')
print('y_new_test',y_new_test,'\n')

output:
x: [[0 1]
 [2 3]
 [4 5]
 [6 7]
 [8 9]]

list(y): [0, 1, 2, 3, 4]

x_train: [[4 5]
 [0 1]
 [6 7]]

y_train: [2, 0, 3]

X_test: [[2 3]
 [8 9]]

y_test: [1, 4]

y_new_train [0, 1, 2]

y_new_test [3, 4]


official document:
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html


Reposted from blog.csdn.net/ninnyyan/article/details/80567099