Basic Methods for Missing Data Imputation (2): Random Forest (MissForest) Imputation

Installation:

pip install missingpy
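
Note that missingpy is no longer actively maintained and imports the module path sklearn.neighbors.base, which newer scikit-learn releases renamed to sklearn.neighbors._base. If `from missingpy import MissForest` later fails with a ModuleNotFoundError, a commonly used workaround (an environment-specific shim, not part of either library's documented API) is:

import sys
import sklearn.neighbors._base

# missingpy still imports sklearn.neighbors.base; alias the renamed
# module under the old path before importing missingpy.
sys.modules["sklearn.neighbors.base"] = sklearn.neighbors._base

from missingpy import MissForest  # should now import cleanly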

1. Introduction to MissForest

MissForest imputes missing values using Random Forests in an iterative fashion [1]. By default, the imputer starts with the column (that is, the variable) that has the smallest number of missing values; we call this the candidate column.

The first step is to fill in all missing values of the remaining, non-candidate columns with an initial guess: the column mean for columns representing numerical variables and the column mode for columns representing categorical variables. Note that categorical variables need to be explicitly identified during the imputer's fit() method call.

The imputer then fits a random forest model with the candidate column as the outcome variable and the remaining columns as predictors, using all rows in which the candidate column is not missing. After the fit, the missing rows of the candidate column are imputed with predictions from the fitted Random Forest; the corresponding rows of the non-candidate columns serve as the model's input data.
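
To make this per-column step concrete, here is a minimal sketch of a single candidate-column update built on scikit-learn's RandomForestRegressor. It illustrates the procedure for a numerical candidate column only; it is not missingpy's actual implementation, and the function name is my own:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def impute_candidate_column(X, col, seed=0):
    """One update for a numerical candidate column of a float array X:
    fit a forest on the rows where `col` is observed, then predict the
    rows where it is missing. Assumes every other column of X has
    already been filled in (e.g., with the initial guesses)."""
    missing = np.isnan(X[:, col])              # rows to impute
    predictors = np.delete(X, col, axis=1)     # all non-candidate columns
    rf = RandomForestRegressor(n_estimators=100, random_state=seed)
    rf.fit(predictors[~missing], X[~missing, col])  # candidate column = outcome
    X[missing, col] = rf.predict(predictors[missing])
    return X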

Next, the imputer moves on to the next candidate column, namely the one with the second-smallest number of missing values among the non-candidate columns of the first round. The process repeats for every column with missing values, possibly over multiple iterations per column, until a stopping criterion is met. The stopping criterion is governed by the "difference" between the imputed arrays of successive iterations.

For numerical variables (num_vars_), the difference is defined as:

sum((X_new[:, num_vars_] - X_old[:, num_vars_]) ** 2) /
    sum((X_new[:, num_vars_]) ** 2)

For categorical variables (cat_vars_), the difference is defined as:

sum(X_new[:, cat_vars_] != X_old[:, cat_vars_]) / n_cat_missing

Here, X_new is the newly imputed array, X_old is the array imputed in the previous round, n_cat_missing is the total number of missing categorical values, and sum() runs over both rows and columns. Following [1], the stopping criterion is considered met the first time the difference between X_new and X_old increases for both variable types (where available).
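
Both quantities translate directly into NumPy. A minimal sketch of the two difference measures (the function names are mine; num_vars_ and cat_vars_ are arrays of column indices, as in the formulas above):

import numpy as np

def numerical_difference(X_new, X_old, num_vars_):
    """Relative squared change over the numerical columns."""
    delta = np.sum((X_new[:, num_vars_] - X_old[:, num_vars_]) ** 2)
    return delta / np.sum(X_new[:, num_vars_] ** 2)

def categorical_difference(X_new, X_old, cat_vars_, n_cat_missing):
    """Fraction of missing categorical entries that changed between rounds."""
    return np.sum(X_new[:, cat_vars_] != X_old[:, cat_vars_]) / n_cat_missing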

Note: Categorical variables need to be one-hot encoded (also known as dummy encoded), and they need to be explicitly identified during the imputer's fit() method call.

2. Code Example

A minimal example on a small numerical array:

>>> from missingpy import MissForest
>>> nan = float("NaN")
>>> X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
>>> imputer = MissForest(random_state=1337)
>>> imputer.fit_transform(X)
Iteration: 0
Iteration: 1
Iteration: 2
array([[1.  , 2.  , 3.92],
       [3.  , 4.  , 3.  ],
       [2.71, 6.  , 5.  ],
       [8.  , 8.  , 7.  ]])
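
For mixed-type data, the categorical columns must be identified explicitly through the cat_vars argument of fit() (documented below), so that the imputer uses a Random Forest Classifier and the column mode for those columns. A small sketch, assuming column 2 holds an encoded categorical variable and that fit_transform() forwards keyword arguments to fit(), as its **fit_params signature suggests:

import numpy as np
from missingpy import MissForest

nan = np.nan
# Columns 0-1 are numerical; column 2 is an encoded categorical variable.
X = np.array([[1.0, 2.0, 0],
              [3.0, 4.0, 1],
              [nan, 6.0, 1],
              [8.0, nan, 0],
              [5.0, 7.0, nan]])

imputer = MissForest(random_state=1337)
# cat_vars flags the categorical column indices.
X_imputed = imputer.fit_transform(X, cat_vars=[2])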

3. API Reference

The MissForest API is as follows:

MissForest(max_iter=10, decreasing=False, missing_values=np.nan,
             copy=True, n_estimators=100, criterion=('mse', 'gini'),
             max_depth=None, min_samples_split=2, min_samples_leaf=1,
             min_weight_fraction_leaf=0.0, max_features='auto',
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             bootstrap=True, oob_score=False, n_jobs=-1, random_state=None,
             verbose=0, warm_start=False, class_weight=None)
             
Parameters
----------
NOTE: Most parameter definitions below are taken verbatim from the
Scikit-Learn documentation at [2] and [3].

max_iter : int, optional (default = 10)
    The maximum iterations of the imputation process. Each column with a
    missing value is imputed exactly once in a given iteration.

decreasing : boolean, optional (default = False)
    If set to True, columns are sorted according to decreasing number of
    missing values. In other words, imputation will move from the columns
    with the largest number of missing values to the columns with the
    fewest.

missing_values : np.nan, integer, optional (default = np.nan)
    The placeholder for the missing values. All occurrences of
    `missing_values` will be imputed.

copy : boolean, optional (default = True)
    If True, a copy of X will be created. If False, imputation will
    be done in-place whenever possible.

criterion : tuple, optional (default = ('mse', 'gini'))
    The function to measure the quality of a split. The first element of
    the tuple is for the Random Forest Regressor (for imputing numerical
    variables) while the second element is for the Random Forest
    Classifier (for imputing categorical variables).

n_estimators : integer, optional (default=100)
    The number of trees in the forest.

max_depth : integer or None, optional (default=None)
    The maximum depth of the tree. If None, then nodes are expanded until
    all leaves are pure or until all leaves contain less than
    min_samples_split samples.

min_samples_split : int, float, optional (default=2)
    The minimum number of samples required to split an internal node:
    - If int, then consider `min_samples_split` as the minimum number.
    - If float, then `min_samples_split` is a fraction and
      `ceil(min_samples_split * n_samples)` are the minimum
      number of samples for each split.

min_samples_leaf : int, float, optional (default=1)
    The minimum number of samples required to be at a leaf node.
    A split point at any depth will only be considered if it leaves at
    least ``min_samples_leaf`` training samples in each of the left and
    right branches.  This may have the effect of smoothing the model,
    especially in regression.
    - If int, then consider `min_samples_leaf` as the minimum number.
    - If float, then `min_samples_leaf` is a fraction and
      `ceil(min_samples_leaf * n_samples)` are the minimum
      number of samples for each node.

min_weight_fraction_leaf : float, optional (default=0.)
    The minimum weighted fraction of the sum total of weights (of all
    the input samples) required to be at a leaf node. Samples have
    equal weight when sample_weight is not provided.

max_features : int, float, string or None, optional (default="auto")
    The number of features to consider when looking for the best split:
    - If int, then consider `max_features` features at each split.
    - If float, then `max_features` is a fraction and
      `int(max_features * n_features)` features are considered at each
      split.
    - If "auto", then `max_features=sqrt(n_features)`.
    - If "sqrt", then `max_features=sqrt(n_features)` (same as "auto").
    - If "log2", then `max_features=log2(n_features)`.
    - If None, then `max_features=n_features`.
    Note: the search for a split does not stop until at least one
    valid partition of the node samples is found, even if it requires to
    effectively inspect more than ``max_features`` features.

max_leaf_nodes : int or None, optional (default=None)
    Grow trees with ``max_leaf_nodes`` in best-first fashion.
    Best nodes are defined as relative reduction in impurity.
    If None then unlimited number of leaf nodes.

min_impurity_decrease : float, optional (default=0.)
    A node will be split if this split induces a decrease of the impurity
    greater than or equal to this value.
    The weighted impurity decrease equation is the following::
        N_t / N * (impurity - N_t_R / N_t * right_impurity
                            - N_t_L / N_t * left_impurity)
    where ``N`` is the total number of samples, ``N_t`` is the number of
    samples at the current node, ``N_t_L`` is the number of samples in the
    left child, and ``N_t_R`` is the number of samples in the right child.
    ``N``, ``N_t``, ``N_t_R`` and ``N_t_L`` all refer to the weighted sum,
    if ``sample_weight`` is passed.

bootstrap : boolean, optional (default=True)
    Whether bootstrap samples are used when building trees.

oob_score : bool (default=False)
    Whether to use out-of-bag samples to estimate
    the generalization accuracy.

n_jobs : int or None, optional (default=-1)
    The number of jobs to run in parallel for both `fit` and `predict`.
    ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
    ``-1`` means using all processors.

random_state : int, RandomState instance or None, optional (default=None)
    If int, random_state is the seed used by the random number generator;
    If RandomState instance, random_state is the random number generator;
    If None, the random number generator is the RandomState instance used
    by `np.random`.

verbose : int, optional (default=0)
    Controls the verbosity when fitting and predicting.

warm_start : bool, optional (default=False)
    When set to ``True``, reuse the solution of the previous call to fit
    and add more estimators to the ensemble, otherwise, just fit a whole
    new forest. See :term:`the Glossary <warm_start>`.

class_weight : dict, list of dicts, "balanced", "balanced_subsample" or \
None, optional (default=None)
    Weights associated with classes in the form ``{class_label: weight}``.
    If not given, all classes are supposed to have weight one. For
    multi-output problems, a list of dicts can be provided in the same
    order as the columns of y.
    Note that for multioutput (including multilabel) weights should be
    defined for each class of every column in its own dict. For example,
    for four-class multilabel classification weights should be
    [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of
    [{1:1}, {2:5}, {3:1}, {4:1}].
    The "balanced" mode uses the values of y to automatically adjust
    weights inversely proportional to class frequencies in the input data
    as ``n_samples / (n_classes * np.bincount(y))``
    The "balanced_subsample" mode is the same as "balanced" except that
    weights are computed based on the bootstrap sample for every tree
    grown.
    For multi-output, the weights of each column of y will be multiplied.
    Note that these weights will be multiplied with sample_weight (passed
    through the fit method) if sample_weight is specified.
    NOTE: This parameter is only applicable for Random Forest Classifier
    objects (i.e., for categorical variables).

Attributes
----------
statistics_ : Dictionary of length two
    The first element is an array with the mean of each numerical feature
    being imputed while the second element is an array of modes of
    categorical features being imputed (if available, otherwise it
    will be None).
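
For example, after fitting, these initial-guess statistics can be inspected directly (a minimal sketch; the exact structure is as documented above):

import numpy as np
from missingpy import MissForest

X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, np.nan]])
imputer = MissForest(random_state=0)
imputer.fit(X)
print(imputer.statistics_)  # initial-guess column means (no categorical columns here)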

Methods
-------
fit(self, X, y=None, cat_vars=None):
    Fit the imputer on X.

    Parameters
    ----------
    X : {array-like}, shape (n_samples, n_features)
        Input data, where ``n_samples`` is the number of samples and
        ``n_features`` is the number of features.

    cat_vars : int or array of ints, optional (default = None)
        An int or an array containing column indices of categorical
        variable(s)/feature(s) present in the dataset X.
        ``None`` if there are no categorical variables in the dataset.

    Returns
    -------
    self : object
        Returns self.
    
        
transform(X):
    Impute all missing values in X.

    Parameters
    ----------
    X : {array-like}, shape = [n_samples, n_features]
        The input data to complete.

    Returns
    -------
    X : {array-like}, shape = [n_samples, n_features]
        The imputed dataset.
    

fit_transform(X, y=None, **fit_params):
    Fit MissForest and impute all missing values in X.

    Parameters
    ----------
    X : {array-like}, shape (n_samples, n_features)
        Input data, where ``n_samples`` is the number of samples and
        ``n_features`` is the number of features.

    Returns
    -------
    X : {array-like}, shape (n_samples, n_features)
        Returns imputed dataset.
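
Because fit() and transform() are separate, an imputer fitted on training data can be reused to impute new data, mirroring the usual scikit-learn transformer workflow. A minimal sketch:

import numpy as np
from missingpy import MissForest

X_train = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 6.0], [7.0, 8.0]])
X_test = np.array([[np.nan, 4.0], [5.0, np.nan]])

imputer = MissForest(random_state=0)
imputer.fit(X_train)                        # fit the imputer on training data
X_test_imputed = imputer.transform(X_test)  # impute the test data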

References:

  1. Stekhoven, Daniel J., and Peter Bühlmann. "MissForest—non-parametric missing value imputation for mixed-type data." Bioinformatics 28.1 (2012): 112-118.
  2. Scikit-Learn RandomForestClassifier documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
  3. Scikit-Learn RandomForestRegressor documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

Source: blog.csdn.net/didi_ya/article/details/125167932