An Introduction to Cross-Validation in Machine Learning

1. What is cross-validation?

        Cross-validation is a technique we turn to when the data available for an experiment is limited but we still want to train a good model. The idea behind cross-validation is to reuse the data: split the given dataset, combine the resulting pieces into training and test sets, and on that basis repeatedly train, test, and select models. Below I introduce the two cross-validation methods I have used. These methods come mainly from the sklearn library, so we can call them directly; the work lies in setting the parameters correctly and matching the method to your application scenario.
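For instance, sklearn can run this whole split-train-evaluate loop in a single call. A minimal sketch, assuming the iris dataset and a LogisticRegression estimator (both illustrative choices, not part of the original text):

# A minimal sketch: cross_val_score handles splitting, repeated training,
# and scoring in one call. The dataset and estimator are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)  # 5-fold CV
print(scores.mean())  # average validation accuracy over the 5 folds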

2. Common cross-validation methods

        (1) Simple cross-validation (hold-out):

           Randomly split the dataset into two parts, one used as the training set and the other as the test set. Then train models on the training set under various conditions (for example, different numbers of parameters) to obtain a set of candidate models, evaluate each model's test error on the test set, and select the model with the smallest test error; a sketch of this selection loop follows the usage example below. For example, take 70% of the data as the training set and 30% as the test set. In Python this is done as follows:

sklearn.model_selection.train_test_split(*arrays, **options)

Parameters:
          *arrays : sequence of indexables with same length / shape[0]

                    Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.

          test_size : float, int, None, optional

                    If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. By default, the value is set to 0.25. The default will change in version 0.21. It will remain 0.25 only if train_size is unspecified, otherwise it will complement the specified train_size.

          train_size : float, int, or None, default None

                    If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

          random_state : int, RandomState instance or None, optional (default=None)

                    If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

          shuffle : boolean, optional (default=True)

                    Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.

          stratify : array-like or None (default is None)

                    If not None, data is split in a stratified fashion, using this as the class labels.

Returns:
          splitting : list, length=2 * len(arrays)

                    List containing train-test split of inputs.

                    New in version 0.16: If the input is sparse, the output will be a scipy.sparse.csr_matrix. Else, output type is the same as the input type.

In everyday use, the typical pattern is:

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]

        (2) K-fold cross-validation (K-Fold)

            Principle: first, randomly split the given data into K mutually disjoint subsets of equal size; then train the model on K-1 of the subsets and test it on the remaining subset; repeat this process for each of the K possible choices of held-out subset; finally, select the model with the smallest average test error over the K evaluations (a usage sketch follows the source listing below). Here is sklearn's implementation of this cross-validator; pay attention to the parameter settings when using it.

class KFold(_BaseKFold):
    """K-Folds cross-validator
    Provides train/test indices to split data in train/test sets. Split
    dataset into k consecutive folds (without shuffling by default).
    Each fold is then used once as a validation while the k - 1 remaining
    folds form the training set.
    Read more in the :ref:`User Guide <cross_validation>`.
    Parameters
    ----------
    n_splits : int, default=3  # number of folds; more is not always better. Around 10 is a common choice; set it according to your application scenario
        Number of folds. Must be at least 2.
    shuffle : boolean, optional  # optional
        Whether to shuffle the data before splitting into batches.
    random_state : int, RandomState instance or None, optional, default=None  # random state
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`. Used when ``shuffle`` == True.
    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import KFold
    >>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
    >>> y = np.array([1, 2, 3, 4])
    >>> kf = KFold(n_splits=2)
    >>> kf.get_n_splits(X)
    2
    >>> print(kf)  # doctest: +NORMALIZE_WHITESPACE
    KFold(n_splits=2, random_state=None, shuffle=False)
    >>> for train_index, test_index in kf.split(X):
    ...    print("TRAIN:", train_index, "TEST:", test_index)
    ...    X_train, X_test = X[train_index], X[test_index]
    ...    y_train, y_test = y[train_index], y[test_index]
    TRAIN: [2 3] TEST: [0 1]
    TRAIN: [0 1] TEST: [2 3]
    Notes
    -----
    The first ``n_samples % n_splits`` folds have size
    ``n_samples // n_splits + 1``, other folds have size
    ``n_samples // n_splits``, where ``n_samples`` is the number of samples.
    Randomized CV splitters may return different results for each call of
    split. You can make the results identical by setting ``random_state``
    to an integer.
    See also
    --------
    StratifiedKFold
        Takes group information into account to avoid building folds with
        imbalanced class distributions (for binary or multiclass
        classification tasks).
    GroupKFold: K-fold iterator variant with non-overlapping groups.
    RepeatedKFold: Repeats K-Fold n times.
    """

    def __init__(self, n_splits=3, shuffle=False,
                 random_state=None):
        super(KFold, self).__init__(n_splits, shuffle, random_state)

    def _iter_test_indices(self, X, y=None, groups=None):
        n_samples = _num_samples(X)
        indices = np.arange(n_samples)
        if self.shuffle:
            check_random_state(self.random_state).shuffle(indices)

        n_splits = self.n_splits
        fold_sizes = (n_samples // n_splits) * np.ones(n_splits, dtype=int)
        fold_sizes[:n_samples % n_splits] += 1
        current = 0
        for fold_size in fold_sizes:
            start, stop = current, current + fold_size
            yield indices[start:stop]
            current = stop
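
To complement the source above, here is a minimal sketch of the K-fold procedure itself: each fold serves once as the validation set while the remaining K-1 folds form the training set, and the K scores are averaged. The iris dataset and LogisticRegression model are illustrative assumptions, not mandated by the original text.

# A minimal sketch of K-fold evaluation: train on K-1 folds, validate on
# the held-out fold, and average the K scores. Dataset and model are
# illustrative assumptions.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_index, test_index in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_index], y[train_index])                  # train on K-1 folds
    scores.append(model.score(X[test_index], y[test_index]))   # validate on the held-out fold
print("mean accuracy over %d folds: %.3f" % (kf.get_n_splits(X), np.mean(scores)))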

Summary:

        This post briefly introduced the two cross-validation methods I use day to day. There are other cross-validation methods I have not yet used; I will add them as my learning continues.





Reposted from blog.csdn.net/xiao2cai3niao/article/details/79660164