Practice | How to deal with missing values in random forests



Except for some over-cleaned datasets found online, missing values are everywhere. In fact, the more complex and larger the dataset, the greater the chance of missing values. Missing values are a fascinating area of statistical research, but in practice they are often a nuisance.

If you are dealing with a prediction problem where you want to predict a variable Y from p-dimensional covariates X=(X_1,…,X_p) and you are faced with missing values in X, tree-based methods offer an interesting solution. The method is actually quite old, but appears to perform very well across a wide variety of datasets. I am talking about "missingness incorporated in attributes" (MIA; [1]). While there are many good articles on missing values (such as this one), this powerful approach seems somewhat underused. In particular, there is no need to impute, remove, or predict the missing values in any way; instead, you simply run the prediction as if the data were fully observed.

I'll quickly explain how the method itself works, and then give an example using the Distributional Random Forest (DRF) explained here. I chose DRF because it is a very general version of Random Forest (in particular, it can also be used to predict a random vector Y), and because I'm a bit biased here. MIA is actually implemented for Generalized Random Forests (GRF), which covers a wide range of forest implementations. In particular, since the implementation of DRF on CRAN is based on GRF, the MIA method is also available there after a slight modification.

Of course, note that this is a quick fix that (as far as I know) comes with no theoretical guarantees. Depending on the missingness mechanism, the analysis could be severely biased. On the other hand, the most common methods of dealing with missing values do not have any theoretical guarantees either, or are known to bias the analysis, whereas MIA seems to work well, at least empirically.

How it works

Recall that in a RF, splits of the form X_j < S or X_j ≥ S are constructed for the dimensions j=1,…,p. To find a split value S, some criterion on Y, such as the CART criterion, is optimized. The observations are thus recursively partitioned by decision rules that depend on X.
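
As a small illustration (my own sketch, not the actual drf/grf implementation), the CART criterion for a single candidate split X_j < S with a univariate Y is simply the total within-node sum of squares after the split:

# Sketch only: CART loss of a candidate split X_j < S (smaller is better)
cart_split_loss <- function(y, xj, S) {
  yl <- y[xj < S]    # node 1
  yr <- y[xj >= S]   # node 2
  sum((yl - mean(yl))^2) + sum((yr - mean(yr))^2)
}

# The best split value minimizes this loss over all candidates, e.g.:
# S_best <- candidates[which.min(sapply(candidates, function(S) cart_split_loss(y, xj, S)))]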


The original paper's explanation is a bit confusing, but as far as I understand, MIA works as follows: consider a sample (Y_1, X_1), …, (Y_n, X_n) with X_i = (X_i1, …, X_ip).

A split without missing values simply means searching for a value S as above and then throwing all Y_i with X_ij < S into node 1 and all Y_i with X_ij ≥ S into node 2. Computing the objective criterion, e.g. CART, for each candidate value S, we can choose the best one. With missing values, there are 3 options to consider for each candidate split value S:

  • Use the usual rule for all observations i for which X_ij is observed, and send i to node 1 if X_ij is missing.
  • Use the usual rule for all observations i for which X_ij is observed, and send i to node 2 if X_ij is missing.
  • Ignore the usual rule: send i to node 1 if X_ij is missing and to node 2 if X_ij is observed.

Which of these rules to follow is again decided by the criterion computed on the Y_i, just as in the fully observed case.

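To make this concrete, here is a minimal sketch of how the three options could be scored for one candidate split value S with a CART-type criterion (again my own illustration, not the actual grf/drf code):

# Sketch only: CART loss of the three MIA options for a candidate split value S
node_loss <- function(y, in_node1) {
  y1 <- y[in_node1]; y2 <- y[!in_node1]
  sum((y1 - mean(y1))^2) + sum((y2 - mean(y2))^2)
}

mia_losses <- function(y, xj, S) {
  obs <- !is.na(xj)
  c(opt1 = node_loss(y, ifelse(obs, xj < S, TRUE)),   # missing values go to node 1
    opt2 = node_loss(y, ifelse(obs, xj < S, FALSE)),  # missing values go to node 2
    opt3 = node_loss(y, !obs))                        # missing vs. observed
}

# The forest picks the (S, option) combination with the smallest loss,
# exactly as it would pick the best S in the fully observed case.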

Example

It should be noted that the drf package on CRAN has not yet been updated with this latest methodology. At some point in the future all of this will be implemented in a single package on CRAN. For now, however, there are two versions:

If you want a fast drf implementation that handles missing values (without confidence intervals), you can use the "drfown" function attached at the end of this article. This code is adapted from

lorismichel/drf: Distributional Random Forests (Cevid et al., 2020) (github.com)

On the other hand, if you want confidence intervals for the parameters, use this (slower) code:

drfinference/drf-foo.R at main · JeffNaef/drfinference (github.com)

In particular, drf-foo.R contains everything needed for the latter case.
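
For example, after cloning the JeffNaef/drfinference repository, the helpers used below (drfCI and predictdrf) can be loaded roughly as follows (the local path is an assumption; adjust it to wherever you cloned the repository):

library(drf)                      # CRAN drf package
source("drfinference/drf-foo.R")  # assumed path to the cloned repository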

We will focus on the slower code with confidence intervals, as described in this article, and consider the same example as in that article:

library(MASS)  # needed for mvrnorm below

set.seed(2)

n<-2000
beta1<-1
beta2<--1.8


# Model Simulation
X<-mvrnorm(n = n, mu=c(0,0), Sigma=matrix(c(1,0.7,0.7,1), nrow=2,ncol=2))
u<-rnorm(n=n, sd = sqrt(exp(X[,1])))
Y<- matrix(beta1*X[,1] + beta2*X[,2] + u, ncol=1)

Note that this is a heteroskedastic linear model with p=2, where the variance of the error term depends on the value of X_1. We now also add missing values to X_1 in a missing at random (MAR) fashion:

prob_na <- 0.3
X[, 1] <- ifelse(X[, 2] <= -0.2 & runif(n) < prob_na, NA, X[, 1]) 

This means that X_1 is missing with probability 0.3 whenever the value of X_2 is smaller than -0.2. So the probability that X_1 is missing depends on X_2, which is what is called "missing at random" (MAR). This is already a complicated situation, in which information can be gained by looking at the pattern of the missing values. That is, the missingness is not "missing completely at random" (MCAR), because the missingness of X_1 depends on the value of X_2. This in turn means that the distribution of X_2 we draw from differs depending on whether X_1 is missing or not. In particular, it means that dropping rows with missing values may seriously bias the analysis.
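
A quick look at the simulated data illustrates this dependence (an illustrative check of my own; the exact numbers depend on the seed):

# X_2 tends to be smaller on rows where X_1 is missing (MAR, not MCAR)
mean(X[is.na(X[, 1]), 2])    # mean of X_2 where X_1 is missing
mean(X[!is.na(X[, 1]), 2])   # mean of X_2 where X_1 is observed
mean(is.na(X[, 1]))          # overall fraction of missing values in X_1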

We now fix x and estimate the conditional expectation and variance given X=x, exactly as in the previous article.

# Choose an x that is not too far out
x<-matrix(c(1,1),ncol=2)

# Choose alpha for CIs
alpha<-0.05

We then fit DRF and predict the weights for the test point x (which correspond to predicting the conditional distribution of Y|X=x):

## Fit the new DRF framework
drf_fit <- drfCI(X=X, Y=Y, min.node.size = 5, splitting.rule='FourierMMD', num.features=10, B=100)

## predict weights
DRF = predictdrf(drf_fit, x=x)
weights <- DRF$weights[1,]

Conditional expectation

We first estimate the conditional expectation of Y|X=x.

# Estimate the conditional expectation at x:
condexpest<- sum(weights*Y)

# Use the distribution of weights, see below
distofcondexpest<-unlist(lapply(DRF$weightsb, function(wb)  sum(wb[1,]*Y)  ))

# Can either use the above directly to build confidence interval, or can use the normal approximation.
# We will use the latter
varest<-var(distofcondexpest-condexpest)

# build 95%-CI
lower<-condexpest - qnorm(1-alpha/2)*sqrt(varest)
upper<-condexpest + qnorm(1-alpha/2)*sqrt(varest)
round(c(lower, condexpest, upper),2)

# without NAs: (-1.00, -0.69, -0.37)
# with NAs: (-1.15, -0.67, -0.19)

Remarkably, the values obtained with NAs are very close to those from the first analysis without NAs in the previous article! This genuinely surprised me, because this missingness mechanism is not easy to handle. Interestingly, the estimated variance of the estimator also roughly doubles, from about 0.025 without missing values to about 0.06 with missing values.

The truth is given by E(Y | X=x) = beta1·1 + beta2·1 = 1 - 1.8 = -0.8.
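
For reference, this true value can also be computed directly from the simulation model above, and we can check that the interval covers it (this check is my addition, not part of the original code):

# Truth at x = (1,1), derived from the simulation model (my addition)
truemean <- beta1*x[1] + beta2*x[2]      # = 1 - 1.8 = -0.8
c(lower <= truemean, truemean <= upper)  # does the 95%-CI cover the truth?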

So there is a slight estimation error, but the confidence intervals contain the truth, as they should.

For more complicated targets the results look similar, for example the conditional variance:

# Estimate the conditional variance at x:
condvarest<- sum(weights*Y^2) - condexpest^2

distofcondvarest<-unlist(lapply(DRF$weightsb, function(wb)  {
  sum(wb[1,]*Y^2) - sum(wb[1,]*Y)^2
} ))

# Can either use the above directly to build confidence interval, or can use the normal approximation.
# We will use the latter
varest<-var(distofcondvarest-condvarest)

# build 95%-CI
lower<-condvarest - qnorm(1-alpha/2)*sqrt(varest)
upper<-condvarest + qnorm(1-alpha/2)*sqrt(varest)

c(lower, condvarest, upper)

# without NAs: (1.89, 2.65, 3.42)
# with NAs: (1.79, 2.74, 3.69)

Here the difference between the estimates is a bit larger. The truth in this case is Var(Y | X=x) = exp(x_1) = exp(1) ≈ 2.72.

The estimate with NAs is thus even slightly more accurate (though of course this may just be randomness). Again, the estimated variance of the (variance) estimator increases with missing values, from 0.15 without missing values to 0.23.

Conclusion

In this article we discussed MIA [1], an adaptation of the splitting rule in random forests that handles missing values. Since it is implemented in GRF and DRF, it can be used widely, and the small example we looked at suggests that it works remarkably well.

However, I'd like to point out again that, even for a large number of data points, there are no consistency results or guarantees that the confidence intervals are meaningful. There are many reasons for missing values, and great care must be taken not to bias the analysis by handling them carelessly. The MIA approach is by no means a well-understood fix for this problem. At the moment, however, it seems like a reasonable quick solution that appears able to exploit the patterns in the missing data. If anyone has done a more extensive simulation analysis, I'd be curious about the results.

Code

require(drf)            
             
drfown <-               function(X, Y,
                              num.trees = 500,
                              splitting.rule = "FourierMMD",
                              num.features = 10,
                              bandwidth = NULL,
                              response.scaling = TRUE,
                              node.scaling = FALSE,
                              sample.weights = NULL,
                              sample.fraction = 0.5,
                              mtry = min(ceiling(sqrt(ncol(X)) + 20), ncol(X)),
                              min.node.size = 15,
                              honesty = TRUE,
                              honesty.fraction = 0.5,
                              honesty.prune.leaves = TRUE,
                              alpha = 0.05,
                              imbalance.penalty = 0,
                              compute.oob.predictions = TRUE,
                              num.threads = NULL,
                              seed = stats::runif(1, 0, .Machine$integer.max),
                              compute.variable.importance = FALSE) {
  
  # initial checks for X and Y
  if (is.data.frame(X)) {
    
    if (is.null(names(X))) {
      stop("the regressor should be named if provided under data.frame format.")
    }
    
    if (any(apply(X, 2, class) %in% c("factor", "character"))) {
      any.factor.or.character <- TRUE
      X.mat <- as.matrix(fastDummies::dummy_cols(X, remove_selected_columns = TRUE))
    } else {
      any.factor.or.character <- FALSE
      X.mat <- as.matrix(X)
    }
    
    mat.col.names.df <- names(X)
    mat.col.names <- colnames(X.mat)
  } else {
    X.mat <- X
    mat.col.names <- NULL
    mat.col.names.df <- NULL
    any.factor.or.character <- FALSE
  }
  
  if (is.data.frame(Y)) {
    
    if (any(apply(Y, 2, class) %in% c("factor", "character"))) {
      stop("Y should only contain numeric variables.")
    }
    Y <- as.matrix(Y)
  }
  
  if (is.vector(Y)) {
    Y <- matrix(Y,ncol=1)
  }
  
  
  #validate_X(X.mat)
  
  if (inherits(X, "Matrix") && !(inherits(X, "dgCMatrix"))) {
        stop("Currently only sparse data of class 'dgCMatrix' is supported.")
    }
  
  drf:::validate_sample_weights(sample.weights, X.mat)
  #Y <- validate_observations(Y, X)
  
  # set legacy GRF parameters
  clusters <- vector(mode = "numeric", length = 0)
  samples.per.cluster <- 0
  equalize.cluster.weights <- FALSE
  ci.group.size <- 1
  
  num.threads <- drf:::validate_num_threads(num.threads)
  
  all.tunable.params <- c("sample.fraction", "mtry", "min.node.size", "honesty.fraction",
                          "honesty.prune.leaves", "alpha", "imbalance.penalty")
  
  # should we scale or not the data
  if (response.scaling) {
    Y.transformed <- scale(Y)
  } else {
    Y.transformed <- Y
  }
  
  data <- drf:::create_data_matrices(X.mat, outcome = Y.transformed, sample.weights = sample.weights)
  
  # bandwidth using median heuristic by default
  if (is.null(bandwidth)) {
    bandwidth <- drf:::medianHeuristic(Y.transformed)
  }
  
  
  args <- list(num.trees = num.trees,
               clusters = clusters,
               samples.per.cluster = samples.per.cluster,
               sample.fraction = sample.fraction,
               mtry = mtry,
               min.node.size = min.node.size,
               honesty = honesty,
               honesty.fraction = honesty.fraction,
               honesty.prune.leaves = honesty.prune.leaves,
               alpha = alpha,
               imbalance.penalty = imbalance.penalty,
               ci.group.size = ci.group.size,
               compute.oob.predictions = compute.oob.predictions,
               num.threads = num.threads,
               seed = seed,
               num_features = num.features,
               bandwidth = bandwidth,
               node_scaling = ifelse(node.scaling, 1, 0))
  
  if (splitting.rule == "CART") {
    ##forest <- do.call(gini_train, c(data, args))
    forest <- drf:::do.call.rcpp(drf:::gini_train, c(data, args))
    ##forest <- do.call(gini_train, c(data, args))
  } else if (splitting.rule == "FourierMMD") {
    forest <- drf:::do.call.rcpp(drf:::fourier_train, c(data, args))
  } else {
    stop("splitting rule not available.")
  }
  
  class(forest) <- c("drf")
  forest[["ci.group.size"]] <- ci.group.size
  forest[["X.orig"]] <- X.mat
  forest[["is.df.X"]] <- is.data.frame(X)
  forest[["Y.orig"]] <- Y
  forest[["sample.weights"]] <- sample.weights
  forest[["clusters"]] <- clusters
  forest[["equalize.cluster.weights"]] <- equalize.cluster.weights
  forest[["tunable.params"]] <- args[all.tunable.params]
  forest[["mat.col.names"]] <- mat.col.names
  forest[["mat.col.names.df"]] <- mat.col.names.df
  forest[["any.factor.or.character"]] <- any.factor.or.character
  
  if (compute.variable.importance) {
    forest[['variable.importance']] <- variableImportance(forest, h = bandwidth)
  }
  
  forest
}
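
As a hedged usage sketch (assuming the predict method of the CRAN drf package works unchanged on the returned object), drfown can be applied directly to the simulated data with NAs from above:

# Sketch only: fit drfown on X containing NAs and predict weights at x
fit <- drfown(X = X, Y = Y, min.node.size = 5)
w   <- predict(fit, newdata = x)$weights[1, ]  # assumes predict.drf from the CRAN drf package
sum(w * Y)  # point estimate of E(Y | X = x), no confidence interval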

Reference

[1] Twala, B. E. T. H., Jones, M. C., & Hand, D. J. (2008). Good methods for coping with missing data in decision trees. Pattern Recognition Letters, 29(7), 950–956.

Source: https://towardsdatascience.com/random-forests-and-missing-values-3daaea103db0

