The backend of a WeChat official account records various reading metrics for each published article, including: content title, total readers, total reads, total sharers, total shares, followers gained after reading, delivered-reading rate, reads generated by sharing, first-share rate, reads brought by each share, and read-completion rate.
We use the random forest algorithm from machine learning to explore whether certain indicators, or combinations of indicators, can predict the number of followers gained after reading.
Data format and reading the data
The dataset includes 9 statistical indicators for 1588 articles.
Read Statistics Matrix: WeChatOfficialAccount.txt
Number of followers after reading:
WeChatOfficialAccountFollowers.txt
feature_file <- "data/WeChatOfficialAccount.txt"
metadata_file <- "data/WeChatOfficialAccountFollowers.txt"
feature_mat <- read.table(feature_file, row.names = 1, header = T, sep="\t", stringsAsFactors =T)
# Fix problematic feature names if necessary
# rownames(feature_mat) <- make.names(rownames(feature_mat))
metadata <- read.table(metadata_file, row.names=1, header=T, sep="\t", stringsAsFactors =T)
dim(feature_mat)
## [1] 1588 9
Reading statistics are represented as follows:
feature_mat[1:4,1:5]
## TotalReadingPeople TotalReadingCounts TotalSharingPeople TotalSharingCounts ReadingRate
## 1 8278 11732 937 1069 0.0847
## 2 8951 12043 828 929 0.0979
## 3 18682 22085 781 917 0.0608
## 4 4978 6166 525 628 0.0072
The metadata is represented as follows:
head(metadata)
## FollowersAfterReading
## 1 227
## 2 188
## 3 119
## 4 116
## 5 105
## 6 100
Sample filtering and ordering
Make sure the samples in the metadata table and the feature matrix appear in the same order; this alignment also needs to be guaranteed here.
feature_mat_sampleL <- rownames(feature_mat)
metadata_sampleL <- rownames(metadata)
common_sampleL <- intersect(feature_mat_sampleL, metadata_sampleL)
# Ensure the samples in the feature matrix and the metadata are identical in order and number
feature_mat <- feature_mat[common_sampleL,,drop=F]
metadata <- metadata[common_sampleL,,drop=F]
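The intersect-based alignment above can be exercised with a small toy example (the sample names here are hypothetical):

```r
# Toy demonstration of the intersect-based sample alignment used above
a <- data.frame(x = 1:3, row.names = c("s1", "s2", "s3"))
b <- data.frame(y = 4:5, row.names = c("s3", "s1"))

common <- intersect(rownames(a), rownames(b))  # "s1" "s3"
a <- a[common, , drop = FALSE]
b <- b[common, , drop = FALSE]

stopifnot(identical(rownames(a), rownames(b)))  # samples are now aligned
```

Note that `intersect` keeps the order of its first argument, so both tables end up in the feature matrix's sample order.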
Determine whether to do classification or regression
Since stringsAsFactors = T was already supplied when reading the data, this step could be skipped; it is kept here for completeness.
If the group column is numeric, convert it to numeric type and do regression.
If the group column holds group labels, convert it to a factor and do classification.
# Since R 4.0, strings are no longer read in as factors by default, so a conversion is needed
# devtools::install_github("Tong-Chen/ImageGP")
library(ImageGP)
# Change FollowersAfterReading here as needed
group = "FollowersAfterReading"
# If the group column is numeric, convert it to numeric type - do regression
# If the group column holds group labels, convert it to a factor - do classification
if(numCheck(metadata[[group]])){
if (!is.numeric(metadata[[group]])) {
metadata[[group]] <- mixedToFloat(metadata[[group]])
}
} else{
metadata[[group]] <- as.factor(metadata[[group]])
}
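If ImageGP is not available, the same decision can be sketched with base R alone. The helper below is a hypothetical, simplified stand-in for numCheck/mixedToFloat, assuming the column holds either plain numeric strings or group labels:

```r
# Simplified base-R stand-in for the ImageGP helpers (assumption: values are
# either plain numbers stored as strings/factors, or group labels)
to_numeric_or_factor <- function(x) {
  x_chr <- as.character(x)
  x_num <- suppressWarnings(as.numeric(x_chr))
  if (!anyNA(x_num)) {
    x_num           # everything parses as a number -> regression target
  } else {
    factor(x_chr)   # otherwise treat as group labels -> classification
  }
}

to_numeric_or_factor(c("227", "188", "119"))   # numeric vector
to_numeric_or_factor(c("high", "low", "low"))  # factor with 2 levels
```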
Preliminary Analysis of Random Forest
library(randomForest)
# Checking the parameters is a good habit
# With the overview given earlier, the meaning of each parameter is much clearer,
# and you know how to tune them
# Everyone's problem is different; usually you shouldn't just copy whatever parameters others use,
# especially in downstream analyses
# ?randomForest
# 查看源码
# randomForest:::randomForest.default
After loading the package, run the analysis directly with default settings, then adjust the parameters after seeing the results.
# Set the random seed; for details see https://mp.weixin.qq.com/s/6plxo-E8qCdlzCgN8E90zg
set.seed(304)
# Use the default parameters directly
rf <- randomForest(feature_mat, metadata[[group]])
From the preliminary results: the type of random forest was determined to be regression, 500 decision trees were built, 3 randomly selected indicators were tried at each split (mtry), the mean of squared residuals is 39.82736, and the percentage of variance explained (% Var explained) is 74.91. The result looks reasonable.
rf
##
## Call:
## randomForest(x = feature_mat, y = metadata[[group]])
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 39.82736
## % Var explained: 74.91
Observing the model's predictions on the training set, the agreement looks quite good.
library(ggplot2)
followerDF <- data.frame(Real_Follower=metadata[[group]], Predicted_Follower=predict(rf, newdata=feature_mat))
sp_scatterplot(followerDF, xvariable = "Real_Follower", yvariable = "Predicted_Follower",
smooth_method = "auto") + coord_fixed(1)
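sp_scatterplot is an ImageGP convenience wrapper; the equivalent comparison can be drawn with plain ggplot2. The snippet below uses simulated data so it runs standalone (the real followerDF built above would be used in practice):

```r
library(ggplot2)

# Simulated stand-in for followerDF so this snippet is self-contained
set.seed(1)
followerDF <- data.frame(Real_Follower = rpois(100, 50))
followerDF$Predicted_Follower <- followerDF$Real_Follower + rnorm(100, sd = 5)

# Scatter of predicted vs. real values with a fitted trend line
p <- ggplot(followerDF, aes(x = Real_Follower, y = Predicted_Follower)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x) +
  coord_fixed(1)
```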
Random Forest Standard Operating Procedure
Split training and test sets
library(caret)
seed <- 1
set.seed(seed)
train_index <- createDataPartition(metadata[[group]], p=0.75, list=F)
train_data <- feature_mat[train_index,]
train_data_group <- metadata[[group]][train_index]
test_data <- feature_mat[-train_index,]
test_data_group <- metadata[[group]][-train_index]
dim(train_data)
## [1] 1192 9
dim(test_data)
## [1] 396 9
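createDataPartition draws a stratified split on the response (which is why the training set has 1192 rows rather than exactly 75% of 1588). For comparison, a simpler unstratified split in base R would look like this:

```r
# Unstratified 75/25 split with base R (createDataPartition additionally
# stratifies on the response, which is usually preferable)
set.seed(1)
n <- 1588                                  # number of articles in the dataset
train_index <- sample(n, size = round(0.75 * n))

length(train_index)                # 1191 training samples
length(setdiff(1:n, train_index))  # 397 test samples
```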
Boruta feature selection to identify key feature variables
# install.packages("Boruta")
library(Boruta)
set.seed(1)
boruta <- Boruta(x=train_data, y=train_data_group, pValue=0.01, mcAdj=T,
maxRuns=300)
boruta
## Boruta performed 14 iterations in 5.917085 secs.
## 8 attributes confirmed important: AverageReadingCountsForEachSharing, FirstSharingRate,
## ReadingRate, TotalReadingCounts, TotalReadingCountsOfSharing and 3 more;
## 1 attributes confirmed unimportant: ReadingFinishRate;
Looking at the variable-importance results (already reflected in the output above): 8 important variables, 0 possibly important variables (tentative variables, whose importance scores do not differ statistically from the best shadow variable's score), and 1 unimportant variable.
table(boruta$finalDecision)
##
## Tentative Confirmed Rejected
## 0 8 1
Plot the importance of the identified variables. With only a few variables the default plot is adequate; with many variables it becomes unreadable, and you need to reorganize the data and plot it yourself.
Define a function to extract the importance value corresponding to each variable.
library(dplyr)
boruta.imp <- function(x){
imp <- reshape2::melt(x$ImpHistory, na.rm=T)[,-1]
colnames(imp) <- c("Variable","Importance")
imp <- imp[is.finite(imp$Importance),]
variableGrp <- data.frame(Variable=names(x$finalDecision),
finalDecision=x$finalDecision)
showGrp <- data.frame(Variable=c("shadowMax", "shadowMean", "shadowMin"),
finalDecision=c("shadowMax", "shadowMean", "shadowMin"))
variableGrp <- rbind(variableGrp, showGrp)
boruta.variable.imp <- merge(imp, variableGrp, all.x=T)
sortedVariable <- boruta.variable.imp %>% group_by(Variable) %>%
summarise(median=median(Importance)) %>% arrange(median)
sortedVariable <- as.vector(sortedVariable$Variable)
boruta.variable.imp$Variable <- factor(boruta.variable.imp$Variable, levels=sortedVariable)
invisible(boruta.variable.imp)
}
boruta.variable.imp <- boruta.imp(boruta)
head(boruta.variable.imp)
## Variable Importance finalDecision
## 1 AverageReadingCountsForEachSharing 4.861474 Confirmed
## 2 AverageReadingCountsForEachSharing 4.648540 Confirmed
## 3 AverageReadingCountsForEachSharing 6.098471 Confirmed
## 4 AverageReadingCountsForEachSharing 4.701201 Confirmed
## 5 AverageReadingCountsForEachSharing 3.852440 Confirmed
## 6 AverageReadingCountsForEachSharing 3.992969 Confirmed
Only the Confirmed variables are plotted. The figure shows that the top 4 variables by importance are all sharing-related (reads generated by sharing, total sharers, total shares, first-share rate): sharing articles matters greatly for gaining followers.
library(ImageGP)
sp_boxplot(boruta.variable.imp, melted=T, xvariable = "Variable", yvariable = "Importance",
legend_variable = "finalDecision", legend_variable_order = c("shadowMax", "shadowMean", "shadowMin", "Confirmed"),
xtics_angle = 90, coordinate_flip =T)
Extract important variables and potentially important variables
boruta.finalVarsWithTentative <- data.frame(Item=getSelectedAttributes(boruta, withTentative = T), Type="Boruta_with_tentative")
data <- cbind(feature_mat, metadata)
variableFactor <- rev(levels(boruta.variable.imp$Variable))
sp_scatterplot(data, xvariable = group, yvariable = variableFactor[1], smooth_method = "auto")
Because there are not many variables, you can also use ggpairs to see how all the variables relate to each other and to the response variable.
library(GGally)
ggpairs(data, progress = F)
Cross-validation to choose parameters and fit the model
Define a function that generates a series of candidate mtry values to test (each no greater than the total number of variables).
generateTestVariableSet <- function(num_total_variable){
  max_power <- ceiling(log10(num_total_variable))
  tmp_subset <- c(unlist(sapply(1:max_power, function(x) (1:10)^x, simplify = F)),
                  ceiling(max_power/3))
  base::unique(sort(tmp_subset[tmp_subset < num_total_variable]))
}
# generateTestVariableSet(78)
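As a quick sanity check (the definition is repeated so the snippet runs on its own): for the 8 Boruta-confirmed variables the function yields candidate mtry values 1 through 7, exactly the grid caret tunes over below; for larger variable counts it also adds squares (and cubes, and so on) to cover a wide range cheaply.

```r
# Definition repeated from above so this snippet is self-contained
generateTestVariableSet <- function(num_total_variable){
  max_power <- ceiling(log10(num_total_variable))
  tmp_subset <- c(unlist(sapply(1:max_power, function(x) (1:10)^x, simplify = F)),
                  ceiling(max_power/3))
  base::unique(sort(tmp_subset[tmp_subset < num_total_variable]))
}

generateTestVariableSet(8)   # 1 2 3 4 5 6 7
generateTestVariableSet(78)  # 1 ... 10 plus the squares 16 25 36 49 64
```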
Select the data corresponding to the key feature variables
# Extract the subset of feature variables for the training set
boruta_train_data <- train_data[, boruta.finalVarsWithTentative$Item]
boruta_mtry <- generateTestVariableSet(ncol(boruta_train_data))
Tuning and modeling with Caret
library(caret)
if(file.exists('rda/wechatRegression.rda')){
borutaConfirmed_rf_default <- readRDS("rda/wechatRegression.rda")
} else {
# Create model with default parameters
trControl <- trainControl(method="repeatedcv", number=10, repeats=5)
seed <- 1
set.seed(seed)
# Set some candidate parameters and values to search, based on experience or intuition
tuneGrid <- expand.grid(mtry=boruta_mtry)
borutaConfirmed_rf_default <- train(x=boruta_train_data, y=train_data_group, method="rf",
tuneGrid = tuneGrid, #
metric="RMSE", #metric='Kappa'
trControl=trControl)
saveRDS(borutaConfirmed_rf_default, "rda/wechatRegression.rda")
}
borutaConfirmed_rf_default
## Random Forest
##
## 1192 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 1073, 1073, 1073, 1072, 1073, 1073, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 1 6.441881 0.7020911 2.704873
## 2 6.422848 0.7050505 2.720557
## 3 6.418449 0.7052825 2.736505
## 4 6.431665 0.7039496 2.742612
## 5 6.453067 0.7013595 2.754239
## 6 6.470716 0.6998307 2.758901
## 7 6.445304 0.7020575 2.756523
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 3.
Plot model performance (RMSE) against the hyperparameter values
plot(borutaConfirmed_rf_default)
Plot the 20 variables with the highest contributions (the variable importance assessed by Boruta differs slightly from the importance assessed by the model itself)
dotPlot(varImp(borutaConfirmed_rf_default))
Extract the final selected model and evaluate its performance.
borutaConfirmed_rf_default_finalmodel <- borutaConfirmed_rf_default$finalModel
First, evaluate the model on the training set it was built from: RMSE = 3.1 and Rsquared = 0.944, which is quite good.
# Obtain the model evaluation metrics
predictions_train <- predict(borutaConfirmed_rf_default_finalmodel, newdata=train_data)
postResample(pred = predictions_train, obs = train_data_group)
## RMSE Rsquared MAE
## 3.1028533 0.9440182 1.1891391
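postResample reports three metrics; their definitions can be verified with a few lines of base R (note that caret's Rsquared is the squared Pearson correlation between predictions and observations, not 1 - SSE/SST):

```r
# Hand-rolled versions of the metrics postResample reports
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
mae  <- function(pred, obs) mean(abs(pred - obs))
r2   <- function(pred, obs) cor(pred, obs)^2   # squared correlation, as in caret

pred <- c(1, 2, 3); obs <- c(1, 2, 4)
rmse(pred, obs)  # sqrt(1/3), about 0.577
mae(pred, obs)   # 1/3
```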
Then use the test data to evaluate the model's predictive performance: RMSE = 6.2 and Rsquared = 0.825, acceptable. We will try other methods later to see whether this can be improved.
predictions_test <- predict(borutaConfirmed_rf_default_finalmodel, newdata=test_data)
postResample(pred = predictions_test, obs = test_data_group)
## RMSE Rsquared MAE
## 6.2219834 0.8251457 2.7212806
library(ggplot2)
testfollowerDF <- data.frame(Real_Follower=test_data_group, Predicted_Follower=predictions_test)
sp_scatterplot(testfollowerDF, xvariable = "Real_Follower", yvariable = "Predicted_Follower",
smooth_method = "auto") + coord_fixed(1)
Drawbacks of Random Forest Regression
The values predicted by a random forest regression model cannot exceed the range of the response variable in the training set, so the model cannot be used for extrapolation.
Regression-Enhanced Random Forests (RERFs) can be used as a solution.
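This limitation is easy to demonstrate: a regression forest predicts by averaging training-set responses in its leaves, so no prediction can exceed the training maximum. A small simulation (assuming the randomForest package is installed):

```r
library(randomForest)

# y = 2x on x in 1..100; ask the forest to extrapolate to x = 500
set.seed(1)
train_x <- data.frame(x = 1:100)
train_y <- 2 * (1:100)

rf_toy <- randomForest(train_x, train_y)
pred <- predict(rf_toy, newdata = data.frame(x = 500))

pred                  # close to max(train_y) = 200, nowhere near 1000
pred <= max(train_y)  # TRUE: predictions are averages of training responses
```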
References
https://medium.com/swlh/random-forest-and-its-implementation-71824ced454f
https://neptune.ai/blog/random-forest-regression-when-does-it-fail-and-why
https://levelup.gitconnected.com/random-forest-regression-209c0f354c84
https://rpubs.com/Isaac/caret_reg
Machine Learning Tutorial Series
Starting from random forests, this series works step by step through the concepts and practice of decision trees, random forests, ROC/AUC, datasets, and cross-validation.
Whatever can be explained clearly in words is explained in text; displays use figures; descriptions that remain unclear get formulas; and where a formula is still unclear, a short piece of code clarifies each step and concept.
The series then moves on to mature code applications, model tuning, model comparison, and model evaluation, covering the knowledge and skills needed across the whole of machine learning.
Machine Learning Algorithms - A Preliminary Study on Decision Trees of Random Forests (1)
Machine Learning Algorithm - Theoretical Overview of Random Forest
Machine Learning Algorithm - A Preliminary Study of Random Forest (1)
Machine Learning - Random Forest Manual 10-Fold Cross Validation
Machine learning model evaluation indicators - ROC curve and AUC value
One function unifies 238 machine learning R packages, which is amazing
General steps for random forest analysis based on Caret and RandomForest packages (1)
Interpretation of more parameters for Caret model training and parameter adjustment (2)
4 Ways to Randomly Adjust Parameters of Random Forest Based on Caret
Machine Learning Part 18 - Boruta Feature Variable Screening (2)
Machine Learning Part 20 - Building a Random Forest Based on Feature Variables Selected by Boruta
Machine Learning Part 21 - Feature Recursive Elimination RFE Algorithm Theory
Machine Learning Part 22 - The feature variables screened out by RFE are 4 times that of Boruta
Machine Learning Part 23 - More Feature Variables But Failed to Improve Random Forest Classification
This unified reference manual for 238 machine learning model R packages is recommended to you