Decision Trees and Random Forests in R (3)

In the previous installment we introduced the basics of random forests and used R to build a random forest model for a binary outcome. Today we continue with a random forest model for a continuous outcome. Without further ado, let's get started. The data set is the atmospheric ozone data that ships with SPSS, which records the daily ozone concentration together with several related atmospheric indicators. Because some of the relationships in the data are non-linear, an ordinary regression model is not well suited, and a random forest model can be used instead.
We need the randomForest, pROC, foreign, Metrics, and ggplot2 packages; install them first if necessary. Let's import the data and take a look.

library(randomForest)
library(pROC)
library(foreign)
library(Metrics)
library(ggplot2)
bc <- read.spss("E:/r/test/ozone.sav",
                use.value.labels=F, to.data.frame=T)
names(bc)

There are seven variables in the data. ozon (daily ozone level) is the outcome variable; the predictors are ibh (inversion base height), dpg (pressure gradient, mm Hg), vis (visibility, miles), temp (temperature, °F), doy (day of the year), and vh (not documented in the data set, but used here as a predictor). All the variables are continuous.
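To make the structure concrete, here is a one-row data frame with the same seven columns in the same order the indexing below assumes (the values are hypothetical, purely to illustrate the layout):

```r
# One-row stand-in for the ozone data (hypothetical values, real column names)
bc_example <- data.frame(
  vh = 5710, ozon = 3, ibh = 2693, dpg = -25,
  vis = 250, temp = 40, doy = 3
)
names(bc_example)
```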
As with the binary-outcome model, we first split the data into a training set and a validation set.

set.seed(1)
index <- sample(2,nrow(bc),replace = TRUE,prob=c(0.7,0.3))
traindata <- bc[index==1,]
testdata <- bc[index==2,]
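Note that sample() assigns each row independently to group 1 or 2 with probabilities 0.7/0.3, so the split is only approximately 70/30. A self-contained sketch of the same idea on a dummy data frame:

```r
# Toy illustration of the sampling-based split (dummy data, not the ozone set)
set.seed(1)
toy <- data.frame(x = 1:1000)
idx <- sample(2, nrow(toy), replace = TRUE, prob = c(0.7, 0.3))
train <- toy[idx == 1, , drop = FALSE]
test  <- toy[idx == 2, , drop = FALSE]
nrow(train) + nrow(test) == nrow(toy)  # every row lands in exactly one set
```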

Build a random forest model

def_ntree <- randomForest(ozon ~ vh+ibh+dpg+vis+temp, data=traindata,
                          ntree=500, importance=TRUE, proximity=TRUE)
plot(def_ntree)

The plot shows the error stabilizing well before 500 trees, so the model is relatively stable. We can also search for a good mtry with the tuneRF function:

mte<-tuneRF(traindata[,c(1,3:6)],traindata[,2],stepFactor =2)
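For context, randomForest's default mtry in regression is floor(p/3), and tuneRF starts there and moves up and down by stepFactor while the out-of-bag error keeps improving. A small sketch of the default and the first candidates it would try for our five predictors:

```r
# Default mtry for a regression forest and the first grid tuneRF explores
p <- 5                                   # vh, ibh, dpg, vis, temp
mtry_default <- max(floor(p / 3), 1)     # randomForest's regression default
candidates <- unique(pmax(1, c(mtry_default %/% 2,
                               mtry_default,
                               mtry_default * 2)))
```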

To examine the relationship between temperature and ozone, we can draw a partial dependence plot; it shows that once the temperature reaches about 90 °F it has a strong effect on ozone.

partialPlot(def_ntree, traindata, temp, xlab = "temp", ylab = "Variable effect")

Next, generate predicted values for the test set and the training set.

ctree.predict = predict(def_ntree,testdata)
traindata$ctree.predict1 = predict(def_ntree,traindata)

Assess the model's fit on the test set by computing R², RMSE, and MAE.

r2<-cor(ctree.predict,testdata$ozon)^2
rmse<-rmse(ctree.predict,testdata$ozon)
mae<-mae(ctree.predict,testdata$ozon)
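The Metrics functions compute the standard formulas; here they are written out by hand on toy vectors (hypothetical numbers, just to show the definitions):

```r
# R2, RMSE and MAE by hand (same quantities Metrics::rmse/Metrics::mae return)
pred   <- c(3.2, 5.1, 4.0, 6.8)          # hypothetical predictions
actual <- c(3.0, 5.5, 3.8, 7.0)          # hypothetical observed values
r2_manual   <- cor(pred, actual)^2       # squared Pearson correlation
rmse_manual <- sqrt(mean((pred - actual)^2))
mae_manual  <- mean(abs(pred - actual))
```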

Generate a calibration scatter plot of predicted versus actual values. We can see that above ozone values of about 15, the predictions deteriorate and the residuals grow.

plot(ctree.predict, testdata$ozon, pch=19, col="gray25",
     xlab = "Predicted value", ylab = "Actual value")
abline(0,1,col="red")

Scoring the importance of the variables, we find that temperature and inversion base height are the variables that most affect ozone.

varImpPlot(def_ntree)


Finally, plot the actual and predicted ozone values against day of the year:

ggplot(traindata) +
  geom_line(aes(doy, ozon, col = "actual")) +
  geom_line(aes(doy, ctree.predict1, col = "predicted")) +
  theme(legend.background = element_blank(), legend.position = c(0.1, 0.8)) +
  scale_color_discrete(name = "Category", labels = c("Actual value", "Predicted value"))

We can also plug in new data to obtain predicted values, turn the problem into a classification one by binning the ozone concentration, and check the model's predictive performance on the validation set. I will not demonstrate each of these here; see the previous two articles for details.
For more articles, follow the official account: zero-based scientific research

Origin blog.csdn.net/dege857/article/details/114916294