1 Important parameters
train_test_split
Function:
Split arrays or matrices into random training and test subsets.
sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)
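***Parameters***
# *arrays: sequence of indexables with same length / shape[0]
# Indexable sequences with the same number of rows (lists, numpy arrays, scipy-sparse matrices, or pandas dataframes).
# test_size: float or int, default=None
# Size of the test set. A float must be between 0 and 1 and gives the proportion of samples in the test set; an int gives the absolute number of test samples; if None, the value is the complement of train_size. If train_size is also None, it defaults to 0.25.
# train_size: float or int, default=None
# Size of the training set. A float must be between 0 and 1 and gives the proportion of samples in the training set; an int gives the absolute number of training samples; if None, the value is the complement of test_size.
# random_state: int, RandomState instance or None, default=None
# Controls the shuffling applied to the data before the split. Pass an int for reproducible output across multiple function calls.
# shuffle: bool, default=True
# Whether to shuffle the data before splitting. If shuffle=False then stratify must be None.
# stratify: array-like, default=None
# Stratification (less commonly used). If not None, the data is split in a stratified fashion, using this array as the class labels. See the User Guide for more details.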
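The size and stratification parameters described above can be illustrated with a small sketch. The toy arrays and variable names below are made up for illustration and are not part of the original notes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 2 features, balanced binary labels.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 10 + [1] * 10)

# test_size as a float: a fraction of the samples (0.25 * 20 = 5 test samples).
X_tr_f, X_te_f, y_tr_f, y_te_f = train_test_split(X, y, test_size=0.25, random_state=0)
print(len(X_tr_f), len(X_te_f))  # 15 5

# test_size as an int: an absolute number of test samples.
X_tr_i, X_te_i, y_tr_i, y_te_i = train_test_split(X, y, test_size=4, random_state=0)
print(len(X_tr_i), len(X_te_i))  # 16 4

# stratify=y preserves the class proportions in both subsets.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(np.bincount(y_tr_s), np.bincount(y_te_s))  # [8 8] [2 2]
```

Stratification tends to matter most when classes are imbalanced or the dataset is small, since a purely random split may leave a rare class underrepresented in the test set.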
random_state
Parameter: provides the random number generator to use. Accepted values:
- None (default)
Use the global random state instance from numpy.random. Calling the function multiple times reuses the same instance and produces different results.
- An integer
Use a new random number generator seeded by the given integer. **Using an int will produce the same result across different calls.** However, it is worth checking that your results are stable across many different random seeds. Popular integer random seeds are 0 and 42. Integer values must be in the range [0, 2^32 - 1].
The random number generator used by scikit-learn here is NumPy's RandomState, which implements the Mersenne Twister pseudo-random number generator.
- A numpy.random.RandomState instance
Use the provided random state, which affects only other users of that same random state instance. Calling the function multiple times reuses the same instance and produces different results.
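The difference between an integer seed and a RandomState instance can be seen in a short sketch. The data and variable names here are illustrative assumptions, not taken from the original notes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = list(range(10))  # hypothetical toy labels

# Integer seed: each call builds a fresh generator, so the splits are identical.
a_train, a_test = train_test_split(y, random_state=42)
b_train, b_test = train_test_split(y, random_state=42)
print(a_test == b_test)  # True

# RandomState instance: both calls consume the same generator, so the
# second call continues where the first stopped and usually differs.
rng = np.random.RandomState(42)
c_train, c_test = train_test_split(y, random_state=rng)
d_train, d_test = train_test_split(y, random_state=rng)
print(c_test, d_test)

# The first call on a fresh RandomState(42) matches the integer seed 42.
print(a_test == c_test)  # True
```

Passing an integer is usually the simpler choice when a single split must be reproducible; a shared RandomState instance is useful when several randomized steps should draw from one stream.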
Return value
# splitting: list, length=2 * len(arrays)
# List containing the train-test split of each input array.
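As a minimal sketch of the return value, passing three arrays yields a list of six pieces, ordered as a train/test pair for each input. The array `w` (per-sample weights) is a hypothetical third input added only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(12).reshape(6, 2)  # row i is [2*i, 2*i + 1]
y = np.arange(6)
w = np.ones(6)  # hypothetical per-sample weights

splits = train_test_split(X, y, w, test_size=2, random_state=0)
print(len(splits))  # 6

# The list unpacks as one (train, test) pair per input array, in input order.
X_train, X_test, y_train, y_test, w_train, w_test = splits
print(X_train.shape, X_test.shape)  # (4, 2) (2, 2)

# All arrays are split with the same indices, so rows stay aligned.
print(np.array_equal(X_train[:, 0] // 2, y_train))  # True
```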
2 Examples
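import numpy as np
from sklearn.model_selection import train_test_split
# Toy data: 5 samples with 2 features each, and 5 labels
X, y = np.arange(10).reshape((5, 2)), range(5)
# Shuffle, then hold out 33% of the samples (2 of 5) for testing; the seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Without shuffling, the first samples form the training set and the last ones the test set
train_test_split(y, shuffle=False)
# Output: [[0, 1, 2], [3, 4]]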
3 Notes
- When shuffle=True and random_state is an integer, the resulting subsets are shuffled. As long as random_state stays the same, every run produces the same split.
- When shuffle=True and random_state=None, the resulting subsets are shuffled, and the split changes from run to run.
- When shuffle=False, random_state has no effect on the result: the data is split in its original order, and the split is the same on every run.
- To shuffle the data while keeping the split consistent across experiments, it is enough to set random_state to a fixed integer (any value in [0, 2^32 - 1]; 0 and 42 are common choices), since shuffle already defaults to True. Note: different random_state values produce different splits, so the measured accuracy of the model can vary with the seed.
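The notes above can be checked with a rough sketch. The first part shows that random_state is ignored when shuffle=False; the second part estimates how much a test score moves across seeds. The iris dataset and the k-nearest-neighbors classifier are arbitrary choices made for illustration and are not part of the original notes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

y_toy = list(range(10))

# shuffle=False: the split follows the original order, whatever the seed.
s1 = train_test_split(y_toy, shuffle=False, random_state=0)
s2 = train_test_split(y_toy, shuffle=False, random_state=42)
print(s1)        # [[0, 1, 2, 3, 4, 5, 6], [7, 8, 9]]
print(s1 == s2)  # True

# Seed sensitivity: repeat the split with several seeds and compare scores.
X, y = load_iris(return_X_y=True)
scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = KNeighborsClassifier().fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

# A large spread suggests that a single split is not a reliable estimate.
print(np.mean(scores), np.std(scores))
```

If the spread across seeds is large relative to the differences being compared, averaging over several splits (or using cross-validation) is likely a safer basis for conclusions than any single seed.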