Drawing a random forest decision curve manually in R

DCA (Decision Curve Analysis) is a method for evaluating the clinical usefulness of diagnostic and prediction models, introduced by Dr. Andrew Vickers in 2006. When judging a disease we usually rely on the AUC of the ROC curve to assess a model's accuracy, but the ROC curve only reflects specificity and sensitivity. In actual clinical practice we must also consider the impact of false positives and false negatives on patients, which is why the DCA curve introduces the concepts of threshold probability and net benefit.
In previous articles we showed how to draw decision curves for logistic regression, Cox regression, and competing-risks regression models in R. Many readers have asked how to draw the decision curve of a random forest, so today we demonstrate how to draw it manually. The same approach applies to the decision curves of other models as well.
What is a decision curve?
[Figure: an example decision curve]
A decision curve is in fact three lines, so drawing it amounts to drawing three lines. The X axis represents the threshold probability and the Y axis represents the net benefit. Clearly, computing the net benefit is the key step, and three net benefits must be computed: "all", "none", and the net benefit of our model.
What is the net benefit?
Net benefit = TP/N − FP/N × pt/(1 − pt), where TP and FP are the numbers of true and false positives, N is the total number of subjects, and pt is the threshold probability.
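To make the formula concrete, here is a tiny helper written as a sketch; the function name net_benefit is mine, not from any package:

```r
## Net benefit at threshold probability pt, given the count of true
## positives (tp), false positives (fp) and the total sample size (n):
## NB = tp/n - fp/n * pt/(1 - pt)
net_benefit <- function(tp, fp, n, pt) {
  tp / n - fp / n * (pt / (1 - pt))
}

## Example: 36 true positives and 20 false positives among 100 people,
## at a threshold probability of 0.10
net_benefit(tp = 36, fp = 20, n = 100, pt = 0.10)  ## about 0.3378
```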
To compute the net benefit we need the number of true positives and false positives, so we must obtain two key things: the predicted probability and the actual outcome. Let's demonstrate with an example, using the data from the article "Decision Tree and Random Forest Based on R Language (2)". First we load the data and the R packages.

library(randomForest)
library(pROC)
library(foreign)
bc <- read.spss("E:/r/test/bankloan_cs.sav",
                use.value.labels=F, to.data.frame=T)

Delete some of the redundant variables to get the following data: age; employ, years with the current employer; address, years at the current address; income; debtinc, debt-to-income ratio; creddebt, credit card debt; othdebt, other debt; and finally default, our outcome indicator, i.e. whether the customer is high risk. Reply "bank customer data" to the official account to get this data set.

bc<-bc[,c(-1:-3,-13:-15,-5)]
head(bc,6)

##   age employ address income debtinc creddebt   othdebt default
## 1  28      7       2     44    17.7 2.990592  4.797408       0
## 2  64     34      17    116    14.7 5.047392 12.004608       0
## 3  40     20      12     61     4.8 1.042368  1.885632       0
## 4  30     11       3     27    34.5 1.751220  7.563780       0
## 5  25      2       2     30    22.4 0.759360  5.960640       1
## 6  35      2       9     38    10.9 1.462126  2.679874       1
bc$default<-as.factor(bc$default)

Divide the data into a training set and a test set (one for modeling, one for validation). Set a seed first so that the split is reproducible.
### Split into training and test sets

set.seed(1)
index <- sample(2,nrow(bc),replace = TRUE,prob=c(0.7,0.3))
traindata <- bc[index==1,]
testdata <- bc[index==2,]

### Fit the random forest model; for classification, the default mtry is the square root of the number of predictors

def_ntree <- randomForest(default ~ age + employ + address + income + debtinc +
                            creddebt + othdebt, data = traindata,
                          ntree = 500, importance = TRUE, proximity = TRUE)

Use the test set to generate predicted probabilities (shown here for demonstration; you could equally use the training set). predict() returns two columns of probabilities, for class 0 and class 1; we only need the probability of class 1, so we extract it.

def_pred <- predict(def_ntree, newdata = testdata, type = "prob")  ## generate probabilities
def_pred<-as.data.frame(def_pred)
testdata$def_pred<-def_pred$`1`

First we need the number of people in the test set and the actual event rate, i.e. the number of events (the 1s) divided by the total number of people; both will be used in the calculations later.

N=dim(testdata)[1]
testdata$default<-as.numeric(testdata$default)-1
event.rate=mean(testdata$default)

The non-event rate is

1-event.rate
## [1] 0.6391982

Next we build a data frame of net benefits. Its threshold column will become the X axis, so it must cover the range 0 to 1; the step size is up to us, usually 0.01.

nb=data.frame(threshold=seq(from=0, to=1, by=0.01))
head(nb,6)
##   threshold
## 1      0.00
## 2      0.01
## 3      0.02
## 4      0.03
## 5      0.04
## 6      0.05

According to the net benefit formula, the "all" and "none" strategies can be computed directly, so that nb now has three columns

nb["all"]=event.rate - (1-event.rate)*nb$threshold/(1-nb$threshold)
nb["none"]=0
head(nb,6)
##   threshold       all none
## 1      0.00 0.3608018    0
## 2      0.01 0.3543452    0
## 3      0.02 0.3477569    0
## 4      0.03 0.3410328    0
## 5      0.04 0.3341685    0
## 6      0.05 0.3271598    0

The next step is to calculate the numbers of true positives and false positives (tp and fp). Since these must be computed at every threshold, we write a loop over the thresholds. tp is the event rate among the subjects whose predicted probability is at least the threshold, multiplied by the number of such subjects; fp is the complement: the non-event rate in that group multiplied by the same count.

for (t in 1:length(nb$threshold)) {
  tp = mean(testdata[testdata["def_pred"] >= nb$threshold[t], "default"]) *
       sum(testdata["def_pred"] >= nb$threshold[t])
  fp = (1 - mean(testdata[testdata["def_pred"] >= nb$threshold[t], "default"])) *
       sum(testdata["def_pred"] >= nb$threshold[t])
  if (sum(testdata["def_pred"] >= nb$threshold[t]) == 0) {
    tp = 0
    fp = 0
  }
  nb[t, "def_pred"] = tp/N - fp/N * (nb$threshold[t]/(1 - nb$threshold[t]))  # net benefit formula
}
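The same calculation can also be written without an explicit for loop, e.g. with sapply. The sketch below is self-contained so it runs on its own: outcome and pred are synthetic stand-ins for testdata$default and testdata$def_pred.

```r
set.seed(1)
outcome <- rbinom(200, 1, 0.35)              # synthetic 0/1 outcome
pred <- pmin(outcome * 0.3 + runif(200), 1)  # synthetic predicted probabilities
n <- length(outcome)
thresholds <- seq(0, 0.99, by = 0.01)

nb_model <- sapply(thresholds, function(pt) {
  sel <- pred >= pt                  # subjects classified positive at this threshold
  if (sum(sel) == 0) return(0)       # nobody treated: net benefit is 0
  tp <- sum(outcome[sel] == 1)       # true positives
  fp <- sum(outcome[sel] == 0)       # false positives
  tp / n - fp / n * (pt / (1 - pt))  # net benefit formula
})
head(nb_model)
```

Note that at threshold 0 everyone is treated, so the net benefit equals the event rate, just as in the first row of nb above.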

After the loop runs, we have the net benefit of the model at each threshold and can draw the plot.

head(nb,6)
##   threshold       all none  def_pred
## 1      0.00 0.3608018    0 0.3608018
## 2      0.01 0.3543452    0 0.3546827
## 3      0.02 0.3477569    0 0.3487569
## 4      0.03 0.3410328    0 0.3433747
## 5      0.04 0.3341685    0 0.3376949
## 6      0.05 0.3271598    0 0.3323174

We draw the plot with ggplot2, which requires reshaping the data first.

library(reshape2)
library(ggplot2)

Convert the wide data to long format

plotdat <- melt(nb,id="threshold",measure=c("def_pred","all","none"))
head(plotdat,6)
##   threshold variable     value
## 1      0.00 def_pred 0.3608018
## 2      0.01 def_pred 0.3546827
## 3      0.02 def_pred 0.3487569
## 4      0.03 def_pred 0.3433747
## 5      0.04 def_pred 0.3376949
## 6      0.05 def_pred 0.3323174

Draw the plot

ggplot(plotdat, aes(x = threshold, y = value, colour = variable)) + geom_line() +
  coord_cartesian(xlim = c(0, 1), ylim = c(-0.05, 0.5)) +
  labs(x = "Threshold probability", y = "Net benefit") +
  scale_color_discrete(name = "Model", labels = c("RF", "all", "none")) +
  theme_bw()+
  theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
        legend.title=element_blank())

[Figure: decision curves drawn with ggplot2]
That is the whole plot — just a few steps, none of them difficult. This method applies to all kinds of models, including lasso regression and other machine-learning models, as long as you can obtain predicted probabilities. It actually has little to do with the specific model; the main task is computing the predicted probability. If your foundation is shaky, or you simply want convenience, you can also use functions written by others: both dca.r and stdca.r are available.
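As a sketch of how little the model matters, here is a logistic regression on the built-in mtcars data (predicting the binary am) — purely a hypothetical stand-in for the bank-loan data; its fitted probabilities could be plugged into the same tp/fp loop exactly as def_pred was:

```r
## Any model that yields predicted probabilities will do. Logistic
## regression on a built-in data set, as a stand-in example:
fit <- glm(am ~ wt, data = mtcars, family = binomial)
prob <- predict(fit, type = "response")  # predicted probabilities in [0, 1]
range(prob)
```

These prob values play exactly the role def_pred played above.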
Here is a brief demonstration with dca.r. The earlier steps are the same; we still need the predicted probabilities first.

set.seed(1)
index <- sample(2,nrow(bc),replace = TRUE,prob=c(0.7,0.3))
traindata <- bc[index==1,]
testdata <- bc[index==2,]

### Fit the random forest model; for classification, the default mtry is the square root of the number of predictors

def_ntree <- randomForest(default ~ age + employ + address + income + debtinc +
                            creddebt + othdebt, data = traindata,
                          ntree = 500, importance = TRUE, proximity = TRUE)
def_pred <- predict(def_ntree, newdata = testdata, type = "prob")  ## generate probabilities
def_pred <- as.data.frame(def_pred)
testdata$def_pred <- def_pred$`1`

Here the outcome variable needs to be converted to numeric, otherwise an error is reported.

testdata$default <- as.numeric(testdata$default) - 1  ## the outcome must be numeric 0/1

For drawing, we need to specify the outcome variable and the predicted probability; here the outcome is default and the predicted probability is def_pred.

source("dca.r")  ## load the dca() function first
out <- dca(data = testdata, outcome = "default",
           predictors = c("def_pred"),
           xstart = 0, xstop = 1, ymin = -0.05)

[Figure: decision curve produced by dca()]
The data can also be extracted and plotted manually

out$net.benefit
##     threshold           all none      def_pred
## 1        0.00   0.360801782    0  0.3608017817
## 2        0.01   0.354345234    0  0.3546826843
## 3        0.02   0.347756920    0  0.3487568747
## 4        0.03   0.341032765    0  0.3433747388

Drawing with ggplot2 was already demonstrated above, so I won't repeat it here. I have packaged dca.r and stdca.r; if you need these two functions, forward this article from the official account to your Moments, collect 10 likes, and send me the screenshot. If that is too much trouble, a 5-yuan reward works too.

Origin blog.csdn.net/dege857/article/details/130592910