In the previous section, we introduced random forests and used R to fit a random forest model for a binary outcome variable. Today we continue with a random forest model for a continuous outcome. Without further ado, let's get started. The data set is the atmospheric ozone data that ships with SPSS, which records daily ozone concentration together with several related atmospheric indicators. Because some of the relationships in the data are non-linear, logistic regression is not appropriate, and a random forest model can be used for the analysis.
We need the randomForest, pROC, foreign, Metrics, and ggplot2 packages, so install them first. Let's import the data and take a look:
library(randomForest)
library(pROC)
library(foreign)
library(Metrics)
library(ggplot2)
bc <- read.spss("E:/r/test/ozone.sav",
                use.value.labels = FALSE, to.data.frame = TRUE)
names(bc)
There are seven variables in the data. ozon (daily ozone level) is the outcome variable; the predictors are ibh (inversion base height), dpg (pressure gradient, mm Hg), vis (visibility, miles), temp (temperature, degrees Fahrenheit), doy (day of the year), and vh (another atmospheric parameter). All of the variables here are continuous.
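Before modeling, it is worth confirming the variable names and types after the SPSS import; a quick check in base R:

```r
# inspect the structure and summary statistics of the imported data frame
str(bc)
summary(bc)
head(bc)
```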
As with the binary-outcome model, we first split the data into a training set and a validation set:
set.seed(1)
index <- sample(2,nrow(bc),replace = TRUE,prob=c(0.7,0.3))
traindata <- bc[index==1,]
testdata <- bc[index==2,]
Build a random forest model
def_ntree <- randomForest(ozon ~ vh + ibh + dpg + vis + temp, data = traindata,
                          ntree = 500, importance = TRUE, proximity = TRUE)
plot(def_ntree)
The plot shows that the error stabilizes well before 500 trees, so the model is relatively stable. We can also tune mtry with the tuneRF function:
mte <- tuneRF(traindata[, c(1, 3:6)], traindata[, 2], stepFactor = 2)
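tuneRF returns a matrix of candidate mtry values and their out-of-bag errors (columns mtry and OOBError in the randomForest package). A minimal sketch of picking the best value and refitting:

```r
# pick the mtry with the lowest out-of-bag error
best_mtry <- mte[which.min(mte[, "OOBError"]), "mtry"]

# refit the forest with the tuned mtry
tuned_rf <- randomForest(ozon ~ vh + ibh + dpg + vis + temp, data = traindata,
                         ntree = 500, mtry = best_mtry,
                         importance = TRUE, proximity = TRUE)
```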
If we want to know the relationship between temperature and ozone concentration, a partial dependence plot shows that once the temperature reaches about 90 degrees Fahrenheit it has a strong effect on ozone:
partialPlot(def_ntree, traindata, temp, xlab = "temp", ylab = "Variable effect")
Next, generate predicted values for the training set and the validation set:
ctree.predict = predict(def_ntree,testdata)
traindata$ctree.predict1 = predict(def_ntree,traindata)
Assess the fit of the model by checking R², RMSE, and MAE:
r2<-cor(ctree.predict,testdata$ozon)^2
rmse<-rmse(ctree.predict,testdata$ozon)
mae<-mae(ctree.predict,testdata$ozon)
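For reference, the Metrics functions above compute the standard formulas; the same quantities can be checked by hand in base R:

```r
# residuals on the validation set
resid_v <- testdata$ozon - ctree.predict

rmse_manual <- sqrt(mean(resid_v^2))                  # root mean squared error
mae_manual  <- mean(abs(resid_v))                     # mean absolute error
r2_manual   <- cor(ctree.predict, testdata$ozon)^2    # squared Pearson correlation
```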
Generate a calibration scatter plot; it shows that above roughly 15, the predictions deteriorate and the residuals grow:
plot(ctree.predict, testdata$ozon, pch = 19, col = "gray25",
     xlab = "Predicted value", ylab = "Actual value")
abline(0,1,col="red")
Scoring variable importance shows that temperature and inversion base height are the key variables affecting ozone:
varImpPlot(def_ntree)
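Besides the plot, the numeric importance scores can be printed; for a regression forest fitted with importance = TRUE, the table contains the permutation measure (%IncMSE) and IncNodePurity:

```r
# numeric scores behind varImpPlot()
imp <- importance(def_ntree)
imp

# sort predictors by the permutation importance measure
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]
```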
ggplot(traindata) +
  geom_line(aes(doy, ozon, col = "actual")) +
  geom_line(aes(doy, ctree.predict1, col = "predicted")) +
  theme(legend.background = element_blank(), legend.position = c(0.1, 0.8)) +
  scale_color_discrete(name = "Type", labels = c("Actual", "Predicted"))
We can also feed new data into the model to obtain predicted values, or turn the problem into a classification by binning the ozone concentration and then verify the model's performance on the validation set. I will not demonstrate these one by one here; you can refer to the previous two articles.
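As a quick illustration of scoring new data, here is a minimal sketch; the predictor values below are made up purely for demonstration and are not from the ozone data set:

```r
# hypothetical new observation (values are illustrative only)
newobs <- data.frame(vh = 5800, ibh = 2000, dpg = 20, vis = 100, temp = 75)

# predicted ozone level for the new observation
predict(def_ntree, newobs)
```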