Advantages of bagging:
1. It effectively reduces the variance of the predictions.
2. It tends to have better predictive performance than a single tree.
3. It provides an internal (out-of-bag) estimate of predictive performance.
Limitations of bagging:
1. Higher computational cost.
2. Poor interpretability.
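The idea behind these points can be sketched directly: fit a tree on each bootstrap sample of the data, then aggregate the trees' predictions by majority vote. The function name `bag_predict` and the choice of `rpart` as the base learner below are illustrative, not from the text:

```r
library(rpart)

# Minimal sketch of bagging for classification:
# fit B trees on bootstrap samples, predict by majority vote
bag_predict <- function(x, y, newdata, B = 25) {
  votes <- sapply(1:B, function(b) {
    idx <- sample(nrow(x), replace = TRUE)              # bootstrap sample
    fit <- rpart(y ~ ., data = data.frame(x, y = y)[idx, ])
    as.character(predict(fit, newdata, type = "class"))
  })
  # majority vote across the B trees for each row of newdata
  apply(votes, 1, function(v) names(which.max(table(v))))
}
```

Averaging (here, voting) over many trees grown on perturbed versions of the data is exactly what reduces the variance, at the price of B times the computation and the loss of a single interpretable tree.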
How do we fit a bagged tree in R?
First extract the predictors and the response:
> library(caret)
> library(pROC)
> dat=read.csv("https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/SegData.csv")
> trainx=dat[,grep("Q",names(dat))]
> trainx$segment=dat$segment
> trainy=dat$gender
>
> set.seed(1000)
> bagTune=caret::train(trainx,trainy,method="treebag",nbagg=1000,metric="ROC",trControl=trainControl(method="cv",summaryFunction=twoClassSummary,classProbs=TRUE,savePredictions=TRUE))
> bagTune
Bagged CART
1000 samples
11 predictor
2 classes: 'Female', 'Male'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 900, 900, 900, 899, 900, 900, ...
Resampling results:
ROC Sens Spec
0.6987467 0.6535065 0.6619192
>
The optimal ROC is improved over the single tree fitted earlier.
The sensitivity (Sens) is 0.65 and the specificity (Spec) is 0.66.
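Since pROC is loaded above and the model was trained with `savePredictions = TRUE`, the cross-validated predictions stored in `bagTune$pred` can be turned into an ROC curve. To keep the sketch self-contained, the toy vectors `obs` and `prob` below merely stand in for `bagTune$pred$obs` and `bagTune$pred$Female`:

```r
library(pROC)

set.seed(1)
# stand-ins for bagTune$pred$obs and bagTune$pred$Female
obs  <- factor(sample(c("Female", "Male"), 200, replace = TRUE))
prob <- ifelse(obs == "Female", rnorm(200, 0.6, 0.2), rnorm(200, 0.4, 0.2))

# levels = c(control, case): treat "Male" as control, "Female" as case
bagRoc <- roc(response = obs, predictor = prob, levels = c("Male", "Female"))
plot(bagRoc)   # draw the ROC curve
auc(bagRoc)    # area under the curve
```

With the actual fitted model, replace `obs` and `prob` with `bagTune$pred$obs` and `bagTune$pred$Female`.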
Random forest
(Details are omitted here; each model is covered in a dedicated chapter.)
> mtryValues=c(1:5)
> set.seed(100)
> rfTune=train(x=trainx,y=trainy,method="rf",ntree=1000,tuneGrid=data.frame(.mtry=mtryValues),importance=TRUE,metric="ROC",trControl=trainControl(method="cv",summaryFunction=twoClassSummary,classProbs=TRUE,savePredictions=TRUE))
> rfTune
Random Forest
1000 samples
11 predictor
2 classes: 'Female', 'Male'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 901, 899, 900, 900, 901, 900, ...
Resampling results across tuning parameters:
mtry ROC Sens Spec
1 0.7140952 0.5485390 0.7981313
2 0.7160706 0.6405844 0.7174242
3 0.7190544 0.6460714 0.7018687
4 0.7161016 0.6514610 0.7086364
5 0.7171203 0.6550649 0.7019192
ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 3.
>
As shown above, the optimal area under the curve is higher than that of the bagged tree, though the improvement is modest.
Once the tuning parameter is obtained, you can also fit the random forest directly with the randomForest package:
> library(randomForest)
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
Attaching package: ‘randomForest’
The following object is masked from ‘package:ggplot2’:
margin
Warning message:
package ‘randomForest’ was built under R version 3.4.3
> rfit=randomForest(trainy~.,trainx,mtry=1,ntree=1000)
> importance(rfit)
MeanDecreaseGini
Q1 9.765156
Q2 8.376815
Q3 7.237283
Q4 11.838238
Q5 5.824810
Q6 9.209922
Q7 6.708015
Q8 7.892137
Q9 5.658564
Q10 4.212070
segment 11.839103
>
For classification, this importance measure (MeanDecreaseGini) is the decrease in the Gini index from splits on each variable, averaged over all the trees; the varImpPlot() function plots these importances.
As the plot shows, averaging over all the trees, the variables segment and Q4 are the most important for distinguishing user gender.
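The importance plot referenced above comes from varImpPlot(). As a self-contained sketch, the forest below is refit on the built-in iris data; with the session above you would simply call `varImpPlot(rfit)`:

```r
library(randomForest)

# illustrative refit on iris (stands in for rfit from the session above)
set.seed(100)
fit <- randomForest(Species ~ ., data = iris, ntree = 200)

importance(fit)                          # MeanDecreaseGini per variable
varImpPlot(fit, main = "Variable importance")  # dot chart, largest at top
```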