Decision Trees and Random Forests in R (Part 2)

In the last post we covered decision trees; today we talk about random forests. Random forests have been very popular over the past two years. I just searched the literature: mention random forests and you can easily fill a few paragraphs, and for a paper aimed at a Chinese core (or dual-core indexed) journal, random forest is well worth trying right now. Time waits for no one, so what are you waiting for?
Random forest was proposed by Breiman in 2001. It avoids the collinearity problem of logistic regression and includes an algorithm for estimating missing values, so it can maintain reasonable accuracy even when part of the data is missing. Because the classification trees inside a random forest naturally capture interactions between variables, there is no need to test whether interactions or nonlinear terms are significant. In most cases the default parameter settings give optimal or near-optimal results.
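As an aside, the randomForest package exposes this missing-value machinery directly through na.roughfix() and rfImpute(). A minimal sketch, assuming a hypothetical data frame dat whose factor outcome default has no missing values itself:

library(randomForest)
### na.roughfix() fills NAs crudely: numeric columns get the column median,
### factor columns get the most frequent level
dat_rough <- na.roughfix(dat)
### rfImpute() starts from that rough fix, then iteratively refines the imputed
### values using the proximity matrix of a forest grown on the completed data
dat_imputed <- rfImpute(default ~ ., data = dat, iter = 5, ntree = 300)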
A random forest can be understood simply as a large collection of decision trees that classify by voting. The principle is roughly as follows: draw a random sample, with replacement, from the training set; the resulting cases form a new (bootstrap) training set. From this new training set, randomly select p features and train an unpruned decision tree on that subset. Repeat this process until n decision trees have been trained. To classify a test sample, run it through every tree, tally each tree's classification, and take the class chosen by the most trees as the final result.
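To make that loop concrete, here is a minimal illustrative sketch built on rpart (grow_forest and vote are made-up helper names, and it draws the feature subset once per tree, whereas a true random forest re-draws it at every split). The actual modeling below is done by the randomForest package, which handles all of this internally:

library(rpart)
### grow n_trees unpruned trees, each on a bootstrap sample and a random
### subset of p predictors; assumes dat has a factor outcome column y
grow_forest <- function(dat, n_trees = 100, p = floor(sqrt(ncol(dat) - 1))) {
  lapply(seq_len(n_trees), function(i) {
    boot  <- dat[sample(nrow(dat), replace = TRUE), ]     # sampling with replacement
    feats <- sample(setdiff(names(dat), "y"), p)          # random feature subset
    rpart(y ~ ., data = boot[, c(feats, "y")],
          control = rpart.control(cp = 0, minsplit = 2))  # no pruning
  })
}
### classify new observations by majority vote across all trees
vote <- function(forest, newdata) {
  preds <- do.call(cbind, lapply(forest, function(tr)
    as.character(predict(tr, newdata, type = "class"))))
  apply(preds, 1, function(row) names(which.max(table(row))))
}

Calling vote(grow_forest(dat), newdata) then returns the majority-vote class for each row of newdata.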
We continue with the data from the last post for the random forest. We need the randomForest, pROC, and foreign packages. First import the data and view the variables:

library(randomForest)
library(pROC)
library(foreign)
bc <- read.spss("E:/r/test/bankloan_cs.sav",
                use.value.labels=F, to.data.frame=T)
names(bc)

[Screenshot: output of names(bc) listing the variable names]
Delete some of the redundant variables:

bc <- bc[, c(-1:-3, -13:-15, -5)]

This gives the following data:
Here age is the customer's age, employ the years with the current employer, address the years lived at the current address, income the household income, debtinc the debt-to-income ratio, creddebt the credit card debt, and othdebt other debts. The last variable, default, is our outcome indicator, i.e., whether the customer is high-risk.
[Screenshot: preview of the cleaned data]
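You can also check the remaining columns in the console:

head(bc)  ## first rows of the cleaned data
str(bc)   ## variable types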
Divide the data into a training set and a test set (one for modeling, one for validation). A seed must be set first so that the split is reproducible.

### set up the training and test sets
set.seed(1)
bc$default <- as.factor(bc$default)  ## make the outcome a factor so randomForest does classification
index <- sample(2, nrow(bc), replace = TRUE, prob = c(0.7, 0.3))
traindata <- bc[index == 1, ]
testdata <- bc[index == 2, ]
### fit the random forest model; the default mtry is sqrt(p) for classification and p/3 for regression
def_ntree <- randomForest(default ~ age + employ + address + income + debtinc + creddebt
                          + othdebt, data = traindata,
                          ntree = 500, importance = TRUE, proximity = TRUE)
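If you want to check whether the default mtry really is near-optimal for this data, the package's tuneRF() searches over mtry using the OOB error. A quick sketch with the predictors used above:

### step mtry up and down from its default until the OOB error stops improving
set.seed(1)
predictors <- c("age", "employ", "address", "income", "debtinc", "creddebt", "othdebt")
tuneRF(x = traindata[, predictors], y = traindata$default,
       ntreeTry = 500, stepFactor = 1.5, improve = 0.01)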

Printing the model, I feel the error is still quite large:
[Screenshot: printed model summary with the OOB error rate]

plot(def_ntree)  ## plot the error against the number of trees

[Figure: error rate versus number of trees]
The plot shows that by 500 trees the error has essentially stopped changing. You can also use the code below to find the number of trees with the smallest error, but I feel it is more reliable to read the plot.

which.min(def_ntree$err.rate[, 1])  ### number of trees with the smallest OOB error
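If you do want to use that number of trees, you can refit with it (a sketch; def_best is a made-up name):

best_ntree <- which.min(def_ntree$err.rate[, 1])
def_best <- randomForest(default ~ age + employ + address + income + debtinc + creddebt
                         + othdebt, data = traindata,
                         ntree = best_ntree, importance = TRUE, proximity = TRUE)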

Next we create one simulated customer record and let the model judge it:

newdata1<-data.frame(age=30,employ=5,address=2,income=100,
                     debtinc=5.2,creddebt=0.3,othdebt=0.2)
predict(def_ntree,newdata1)

[Screenshot: predicted class for the simulated customer]
The model judges this to be a high-risk customer. If the outcome has three or more classes, you can also use the following syntax to inspect the class probabilities, i.e., the trees' voting proportions:

predict(def_ntree, newdata1, type = "prob")
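Relatedly, predict.randomForest can also return the trees' raw votes instead of proportions:

predict(def_ntree, newdata1, type = "vote", norm.votes = FALSE)  ## raw vote counts per class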

Now look at the variable importance of the model. The scores and the plot show which predictors matter most: the debt-to-income ratio is the most important, the same conclusion the decision tree reached.

importance(def_ntree)
varImpPlot(def_ntree)

[Screenshot: importance() scores]
[Figure: varImpPlot() variable importance plot]
If we want to know how the most influential predictor, the debt-to-income ratio, affects the outcome, the partial dependence plot below shows that customers with a debt-to-income ratio above about 30 are easily judged high-risk.

partialPlot(def_ntree, traindata, debtinc, "0", xlab = "debtinc", ylab = "Variable effect")

[Figure: partial dependence of the outcome on debtinc]
Finally, validate the model on the test set:

def_pred <- predict(def_ntree, newdata = testdata)  ## predicted classes for the test set
roc <- multiclass.roc(as.ordered(testdata$default), as.ordered(def_pred))  ## fit the ROC (multi-class form)
roc1 <- roc(as.ordered(testdata$default), as.ordered(def_pred))
round(auc(roc1), 3)  ## AUC
round(ci(roc1), 3)   ## 95% CI
plot(roc1)
plot(roc1, print.auc = T, auc.polygon = T, grid = c(0.1, 0.2), grid.col = c("green", "red"),
     max.auc.polygon = T, auc.polygon.col = "skyblue", print.thres = T)
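Besides the ROC curve, a simple confusion matrix on the test set is also worth a look:

table(predicted = def_pred, actual = testdata$default)  ## confusion matrix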


[Figure: ROC curve with AUC and best threshold]
The ROC curve can also be drawn by hand:

plot(1 - roc1$specificities, roc1$sensitivities, col = "red",
     lty = 1, lwd = 2, type = "l", xlab = "1 - specificity", ylab = "sensitivity")
abline(0, 1)
legend(0.7, 0.3, c("AUC=0.82"), lty = c(1), lwd = c(2), col = "red", bty = "n")

[Figure: manually drawn ROC curve]
