R Notes: Bagged Trees and Random Forests

Advantages of bagging:

1. Effectively reduces prediction variance
2. Better predictive performance
3. Provides a built-in estimate of predictive performance (via the out-of-bag samples)

Limitations of bagging:

1. Computational cost
2. Poor interpretability
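The bootstrap-aggregating idea behind these trade-offs can be hand-rolled in a few lines with rpart (which ships with R). This is only a sketch on the built-in iris data, not the caret workflow used below:

```r
# Hand-rolled bagged classification trees: fit B trees on bootstrap
# resamples, then predict by majority vote across the ensemble.
library(rpart)

bag_trees <- function(x, y, B = 25) {
  dat <- data.frame(x, y = y)
  lapply(seq_len(B), function(b) {
    idx <- sample(nrow(dat), replace = TRUE)   # bootstrap resample
    rpart(y ~ ., data = dat[idx, ], method = "class")
  })
}

predict_bag <- function(trees, newx) {
  votes <- sapply(trees, function(tr)
    as.character(predict(tr, newx, type = "class")))
  apply(votes, 1, function(v) names(which.max(table(v))))  # majority vote
}

set.seed(1)
trees <- bag_trees(iris[, 1:4], iris$Species)
pred  <- predict_bag(trees, iris[, 1:4])
mean(pred == iris$Species)   # in-sample accuracy of the ensemble
```

Each tree sees a different bootstrap resample, so averaging their votes smooths out the instability of any single tree; this is exactly what `method = "treebag"` automates below.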
How do we build a bagged tree in R? First, set up the predictors and the response:

> library(caret)
> library(pROC)
> dat=read.csv("https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/SegData.csv")
> trainx=dat[,grep("Q",names(dat))]
> trainx$segment=dat$segment
> trainy=dat$gender
> 
> set.seed(1000)
> bagTune = caret::train(trainx, trainy,
+   method = "treebag", nbagg = 1000, metric = "ROC",
+   trControl = trainControl(method = "cv",
+     summaryFunction = twoClassSummary,
+     classProbs = TRUE, savePredictions = TRUE))
> bagTune
Bagged CART 

1000 samples
  11 predictor
   2 classes: 'Female', 'Male' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 900, 900, 900, 899, 900, 900, ... 
Resampling results:

  ROC        Sens       Spec     
  0.6987467  0.6535065  0.6619192

> 

As shown, the optimal ROC (area under the curve) improves on the single tree fitted earlier. Sensitivity (Sens) is about 0.65 and specificity (Spec) about 0.66.
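caret's twoClassSummary treats the first factor level (here "Female") as the event of interest. A base-R sketch of how Sens and Spec fall out of a confusion table (toy labels for illustration, not the SegData results):

```r
# Sensitivity = events correctly predicted; specificity = non-events
# correctly predicted. Toy labels, with "Female" as the event class.
truth <- factor(c("Female", "Female", "Male", "Male", "Female", "Male"))
pred  <- factor(c("Female", "Male",   "Male", "Male", "Female", "Female"))
tab   <- table(pred, truth)
sens  <- tab["Female", "Female"] / sum(truth == "Female")
spec  <- tab["Male", "Male"]     / sum(truth == "Male")
c(sensitivity = sens, specificity = spec)   # both 2/3 here
```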

Random forests

(Detailed discussion is omitted here; each model gets its own chapter.)

> mtryValues=c(1:5)
> set.seed(100)
> rfTune = train(x = trainx, y = trainy,
+   method = "rf", ntree = 1000,
+   tuneGrid = data.frame(.mtry = mtryValues),
+   importance = TRUE, metric = "ROC",
+   trControl = trainControl(method = "cv",
+     summaryFunction = twoClassSummary,
+     classProbs = TRUE, savePredictions = TRUE))
> rfTune
Random Forest 

1000 samples
  11 predictor
   2 classes: 'Female', 'Male' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 901, 899, 900, 900, 901, 900, ... 
Resampling results across tuning parameters:

  mtry  ROC        Sens       Spec     
  1     0.7140952  0.5485390  0.7981313
  2     0.7160706  0.6405844  0.7174242
  3     0.7190544  0.6460714  0.7018687
  4     0.7161016  0.6514610  0.7086364
  5     0.7171203  0.6550649  0.7019192

ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 3.
> 

As shown above, the optimal area under the curve improves on the bagged tree, though not by much.
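The mtry being tuned here is what separates a random forest from plain bagging: at every split only mtry randomly chosen predictors compete, which decorrelates the trees. A base-R sketch of the candidate sets drawn at three successive splits (the predictor names simply mirror the Q1-Q10 and segment columns used above):

```r
# At each split the forest draws a fresh random subset of mtry
# predictors as split candidates; different splits see different sets.
set.seed(4)
p <- c(paste0("Q", 1:10), "segment")
candidates <- lapply(1:3, function(i) sort(sample(p, 3)))  # mtry = 3, the tuned value
candidates
```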
With the tuning done, you can also fit a random forest directly with the randomForest package (note the call below uses mtry = 1, while the caret tuning above preferred mtry = 3):

> library(randomForest)
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'

The following object is masked from ‘package:ggplot2’:

    margin

Warning message:
package 'randomForest' was built under R version 3.4.3 
> rfit=randomForest(trainy~.,trainx,mtry=1,ntree=1000)
> importance(rfit)
        MeanDecreaseGini
Q1              9.765156
Q2              8.376815
Q3              7.237283
Q4             11.838238
Q5              5.824810
Q6              9.209922
Q7              6.708015
Q8              7.892137
Q9              5.658564
Q10             4.212070
segment        11.839103
> 

For classification, the MeanDecreaseGini column measures importance as the decrease in the Gini index from splits on that variable, averaged over all trees in the forest; the varImpPlot() function plots these importances.

[Figure: varImpPlot() output, variables ranked by mean decrease in Gini]

Averaged over all the trees, the variables segment and Q4 matter most for distinguishing user gender.
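As a sanity check, ranking the printed MeanDecreaseGini values in base R reproduces that ordering (numbers copied from the importance() output above):

```r
# Sort the printed importance scores; the top two should be
# segment and Q4, as stated above.
imp <- c(Q1 = 9.765156, Q2 = 8.376815, Q3 = 7.237283, Q4 = 11.838238,
         Q5 = 5.824810, Q6 = 9.209922, Q7 = 6.708015, Q8 = 7.892137,
         Q9 = 5.658564, Q10 = 4.212070, segment = 11.839103)
names(sort(imp, decreasing = TRUE))[1:2]   # "segment" "Q4"
```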

Reprinted from blog.csdn.net/lulujiang1996/article/details/79048675