Detailed explanation of Sklearn train_test_split parameters

1 important parameters

train_test_split function:
Splits arrays or matrices into random training and test subsets

sklearn.model_selection.train_test_split
(*arrays, test_size=None, 
train_size=None, random_state=None, 
shuffle=True, stratify=None)
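***Parameters***
# *arrays: sequence of indexables with same length / shape[0]
# Indexable sequences with the same number of rows (lists, numpy arrays, scipy-sparse matrices or pandas dataframes)
# test_size: float or int, default=None
# Size of the test set. A float must be between 0 and 1 and gives the proportion of the data placed in the test set; an int gives the number of test samples; if None, the value is derived from train_size, and if train_size is also None it defaults to 0.25
# train_size: float or int, default=None
# Size of the training set. A float must be between 0 and 1 and gives the proportion of the data placed in the training set; an int gives the number of training samples; if None, the value is derived from test_size
# random_state: int, RandomState instance or None, default=None
# Random seed: controls how the data is shuffled before it is split. Pass an int to get reproducible output across multiple function calls
# shuffle: bool, default=True
# Whether to shuffle the data before splitting. If shuffle=False then stratify must be None.
# stratify: array-like, default=None
# Stratification, not often used. If not None, the data is split in a stratified fashion, using this array as the class labels. Read more in the User Guide.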
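
To make the size and stratify rules above concrete, here is a minimal sketch. The toy arrays (10 samples, with an 8:2 class imbalance) are assumptions chosen for illustration and are not from the original article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape((10, 2))  # 10 samples, 2 features (toy data)

# Neither size given: test_size falls back to 0.25, i.e. ceil(10 * 0.25) = 3 test samples
X_train, X_test = train_test_split(X, random_state=0)
default_sizes = (len(X_train), len(X_test))      # (7, 3)

# Float test_size: proportion of the dataset
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)
float_sizes = (len(X_train), len(X_test))        # (8, 2)

# Integer test_size: absolute number of test samples
X_train, X_test = train_test_split(X, test_size=4, random_state=0)
int_sizes = (len(X_train), len(X_test))          # (6, 4)

# Integer train_size: absolute number of training samples, the rest goes to the test set
X_train, X_test = train_test_split(X, train_size=7, random_state=0)
train_int_sizes = (len(X_train), len(X_test))    # (7, 3)

# stratify: keep the class proportions of y in both subsets
y = np.array([0] * 8 + [1] * 2)                  # 80% class 0, 20% class 1
_, _, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)
ones_in_train, ones_in_test = int(y_train.sum()), int(y_test.sum())  # 1 and 1
```

With stratify=y, each half of the data receives four samples of class 0 and one of class 1, so the 80/20 ratio is preserved in both subsets.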

random_state parameter: determines the random number generator used for shuffling

  1. None (default)
    Use the global random state instance from numpy.random. Calling the function multiple times will reuse the same instance and will produce different results.
  2. An integer
    Use a new random number generator seeded by the given integer. **Using an int will produce the same result across different calls.** However, it is worth checking that your results are stable across many different random seeds. Popular integer random seeds are 0 and 42. Integer values must be in the range [0, 2^32-1].

The random number generator used by Sklearn is NumPy's RandomState, which implements the Mersenne Twister pseudo-random number generator.

  3. A numpy.random.RandomState instance
    Use the provided random state, which only affects other users of the same random state instance. Calling the function multiple times will reuse the same instance and will produce different results.
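
The difference between an integer seed and a RandomState instance can be sketched as follows. The seed 42 and the array `data` are arbitrary illustrative choices, not values prescribed by the library.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10)  # toy data

# Integer seed: every call builds a fresh generator, so the split repeats exactly
a_train, a_test = train_test_split(data, random_state=42)
b_train, b_test = train_test_split(data, random_state=42)
same_with_int = np.array_equal(a_test, b_test)

# RandomState instance: each call consumes and advances the instance's state,
# so successive calls will generally give different splits. Two instances
# created with the same seed replay the same sequence of splits.
rng1 = np.random.RandomState(42)
rng2 = np.random.RandomState(42)
first_1 = train_test_split(data, random_state=rng1)[1]
second_1 = train_test_split(data, random_state=rng1)[1]
first_2 = train_test_split(data, random_state=rng2)[1]
second_2 = train_test_split(data, random_state=rng2)[1]

print(same_with_int)                      # True
print(np.array_equal(first_1, first_2))   # True
```

In practice an integer is the simplest way to make a single split reproducible, while a shared instance is useful when a whole sequence of random operations should be reproducible as a unit.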

return value

# splitting: list, length=2 * len(arrays)
# List containing the train-test split of each input array
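
As a small sketch of the return value, the call below passes three arrays and gets six back, ordered as train/test pairs. The array `w` of per-sample weights is a hypothetical third input added only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(12).reshape((6, 2))   # row i is [2i, 2i + 1]
y = np.arange(6)                    # label i belongs to row i
w = np.linspace(0.0, 1.0, 6)        # hypothetical per-sample weights

splits = train_test_split(X, y, w, test_size=2, random_state=0)
print(len(splits))                  # 6, i.e. 2 * len(arrays)

# The outputs come in (train, test) pairs, in the order the arrays were passed
X_train, X_test, y_train, y_test, w_train, w_test = splits

# The same rows are selected from every array, so the pairing is preserved
rows_match = np.array_equal(X_test[:, 0] // 2, y_test)
```

Because every array is indexed with the same shuffled indices, features, labels and any extra per-sample arrays stay aligned after the split.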

2 examples

import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

train_test_split(y, shuffle=False)
[[0, 1, 2], [3, 4]]

3 notes

  1. When shuffle=True and random_state is an integer, the data is shuffled before it is split. As long as random_state stays the same, every run produces the same split.
  2. When shuffle=True and random_state=None, the data is shuffled before it is split, and the split changes from run to run.
  3. When shuffle=False, random_state has no effect on the result: the data is split in its original order, and the split is the same on every run.
  4. To shuffle the data while keeping the split consistent across experiments, it is enough to set random_state to a fixed integer (any value in [0, 2^32-1], for example 0 or 42), because shuffle already defaults to True. (Note: different random_state values produce different splits, which can change the measured accuracy of the model.)
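
The notes above can be checked with a short sketch. The array `data` and the seeds used are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10)  # toy data, already in sorted order

# Notes 1 and 2: shuffle=True (the default) reorders the samples, but every
# sample still lands in exactly one of the two subsets
train, test = train_test_split(data, test_size=3, random_state=0)
covers_all = sorted(np.concatenate([train, test])) == list(range(10))

# Note 3: shuffle=False keeps the original order, whatever random_state is
seq_a = train_test_split(data, test_size=3, shuffle=False, random_state=1)
seq_b = train_test_split(data, test_size=3, shuffle=False, random_state=2)
print(seq_a[0], seq_a[1])  # [0 1 2 3 4 5 6] [7 8 9]
```

A sequential split like this is usually what you want for ordered data such as time series, where shuffling would leak later samples into the training set.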

Origin blog.csdn.net/weixin_45913084/article/details/130062441