Full code | Classic application of random forest in regression analysis

The backend of a WeChat Official Account records various reading metrics for each published article, including: article title, total number of readers, total number of reads, total number of sharers, total number of shares, number of new followers gained after reading, delivered-reading rate, number of reads generated by sharing, first-share rate, number of reads brought by each share, and reading-completion rate.

We use the random forest algorithm to test whether certain indicators, or combinations of indicators, can predict the number of new followers gained after reading.

Data format and reading the data

The dataset includes 9 statistical indicators for 1588 articles.

  • Reading statistics matrix: WeChatOfficialAccount.txt

  • Number of followers after reading: WeChatOfficialAccountFollowers.txt

feature_file <- "data/WeChatOfficialAccount.txt"
metadata_file <- "data/WeChatOfficialAccountFollowers.txt"

feature_mat <- read.table(feature_file, row.names = 1, header = T, sep = "\t", stringsAsFactors = T)

# Fix irregular feature names if needed
# rownames(feature_mat) <- make.names(rownames(feature_mat))

metadata <- read.table(metadata_file, row.names = 1, header = T, sep = "\t", stringsAsFactors = T)

dim(feature_mat)
## [1] 1588    9

Reading statistics are represented as follows:

feature_mat[1:4,1:5]
##   TotalReadingPeople TotalReadingCounts TotalSharingPeople TotalSharingCounts ReadingRate
## 1               8278              11732                937               1069      0.0847
## 2               8951              12043                828                929      0.0979
## 3              18682              22085                781                917      0.0608
## 4               4978               6166                525                628      0.0072

The metadata is organized as follows:

head(metadata)
##   FollowersAfterReading
## 1                   227
## 2                   188
## 3                   119
## 4                   116
## 5                   105
## 6                   100

Sample Screening and Ordering

As always, make sure the samples in the metadata table and the feature (expression) table are in the same order and in one-to-one correspondence.

feature_mat_sampleL <- rownames(feature_mat)
metadata_sampleL <- rownames(metadata)

common_sampleL <- intersect(feature_mat_sampleL, metadata_sampleL)

# Make sure the feature table and metadata contain exactly the same samples in the same order
feature_mat <- feature_mat[common_sampleL,,drop=F]
metadata <- metadata[common_sampleL,,drop=F]
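
A quick sanity check (an addition to the original workflow) confirms the two tables are now aligned:

# Both tables should have identical row names in identical order
stopifnot(identical(rownames(feature_mat), rownames(metadata)))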

Determine whether to do classification or regression

Since stringsAsFactors = T was already supplied when reading the data, this step can usually be skipped.

  • If the column corresponding to the group variable is numeric, convert it to numeric type - do regression

  • If the column corresponding to the group variable contains category labels, convert it to factor type - do classification

# Since R 4.0, strings are no longer read in as factors by default, so a conversion may be needed
# devtools::install_github("Tong-Chen/ImageGP")
library(ImageGP)

# Change FollowersAfterReading here as needed
group = "FollowersAfterReading"

# If the group column is numeric, convert to numeric type - do regression
# If the group column holds category labels, convert to factor type - do classification
if(numCheck(metadata[[group]])){
    if (!is.numeric(metadata[[group]])) {
      metadata[[group]] <- mixedToFloat(metadata[[group]])
    }
} else{
  metadata[[group]] <- as.factor(metadata[[group]])
}
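
numCheck and mixedToFloat come from the ImageGP package. If ImageGP is unavailable, a minimal base-R sketch of the same idea (an illustrative assumption, not the package's actual implementation) could be:

# Base-R fallback (sketch): treat the column as numeric when every
# non-NA value parses as a number; otherwise treat it as a factor
col <- as.character(metadata[[group]])
parsed <- suppressWarnings(as.numeric(col))
if (all(is.na(col) | !is.na(parsed))) {
  metadata[[group]] <- parsed
} else {
  metadata[[group]] <- as.factor(metadata[[group]])
}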

Preliminary Analysis of Random Forest 

library(randomForest)

# Checking the parameters is a good habit
# With the conceptual overview given earlier, the meaning of each parameter is much clearer,
# and so is how to tune them
# Everyone's problem is different; do not simply copy the parameters someone else used,
# especially in downstream analyses
# ?randomForest

# View the source code
# randomForest:::randomForest.default

After loading the package, run the analysis directly with the default settings first, then adjust the parameters based on the results.

# Set the random seed for reproducibility; see https://mp.weixin.qq.com/s/6plxo-E8qCdlzCgN8E90zg for details
set.seed(304)

# Use the default parameters directly
rf <- randomForest(feature_mat, metadata[[group]])

Looking at the preliminary result: the random forest type is judged to be regression, 500 trees were built, and 3 randomly selected indicators were tried at each split (mtry). The mean of squared residuals is 39.82736, and the variance explained (% Var explained) is 74.91. The result looks reasonable.

rf
## 
## Call:
##  randomForest(x = feature_mat, y = metadata[[group]]) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 39.82736
##                     % Var explained: 74.91
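
To see which indicators the default model leans on most, the built-in importance measures of randomForest can be inspected directly (for regression, the default importance measure is IncNodePurity):

# Variable importance of the default model
importance(rf)
# Dot chart of the same information
varImpPlot(rf)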

Judging by the model's predictions on the training set, the agreement looks quite good.

library(ggplot2)

followerDF <- data.frame(Real_Follower=metadata[[group]], Predicted_Follower=predict(rf, newdata=feature_mat))

sp_scatterplot(followerDF, xvariable = "Real_Follower", yvariable = "Predicted_Follower",
               smooth_method = "auto") + coord_fixed(1)

[Figure: real vs. predicted followers on the training set]
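
sp_scatterplot comes from the ImageGP package; if it is not installed, a plain ggplot2 sketch gives an equivalent view (geom_smooth stands in for smooth_method = "auto"):

ggplot(followerDF, aes(x = Real_Follower, y = Predicted_Follower)) +
  geom_point(alpha = 0.5) +
  geom_smooth() +
  coord_fixed(1)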

Random Forest Standard Operating Procedure

Split training and test sets

library(caret)
seed <- 1
set.seed(seed)
train_index <- createDataPartition(metadata[[group]], p=0.75, list=F)
train_data <- feature_mat[train_index,]
train_data_group <- metadata[[group]][train_index]

test_data <- feature_mat[-train_index,]
test_data_group <- metadata[[group]][-train_index]
dim(train_data)
## [1] 1192    9
dim(test_data)
## [1] 396   9
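
For a numeric response, createDataPartition samples within percentile groups, so the response distribution should be similar in both subsets; a quick check of that assumption:

# Train and test response distributions should be comparable
summary(train_data_group)
summary(test_data_group)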

Boruta feature selection to identify key characteristic variables

# install.packages("Boruta")
library(Boruta)
set.seed(1)

boruta <- Boruta(x=train_data, y=train_data_group, pValue=0.01, mcAdj=T, 
       maxRuns=300)

boruta
## Boruta performed 14 iterations in 5.917085 secs.
##  8 attributes confirmed important: AverageReadingCountsForEachSharing, FirstSharingRate,
## ReadingRate, TotalReadingCounts, TotalReadingCountsOfSharing and 3 more;
##  1 attributes confirmed unimportant: ReadingFinishRate;

Look at the variable-importance results (also reflected in the output above): 8 variables are confirmed important, 0 are tentative (a tentative variable is one whose importance score is not statistically distinguishable from the best shadow variable's score), and 1 variable is confirmed unimportant.

table(boruta$finalDecision)
## 
## Tentative Confirmed  Rejected 
##         0         8         1
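
No attribute is Tentative here, but if any had been, Boruta's TentativeRoughFix can force a decision by comparing each tentative attribute's median importance against the best shadow attribute (a sketch):

# Only needed when boruta$finalDecision contains "Tentative"
boruta_resolved <- TentativeRoughFix(boruta)
table(boruta_resolved$finalDecision)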

Plot the importance of the identified variables. With few variables, the default plot (one call, shown below) is sufficient; with many variables it becomes unreadable, and you need to reorganize the data and plot it yourself.
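
For reference, the default plot mentioned above is a single call (graphics parameters such as las and cex.axis are passed through to the underlying boxplot):

# Default Boruta importance plot with rotated, smaller axis labels
plot(boruta, las = 2, cex.axis = 0.7)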

Define a function to extract the importance value corresponding to each variable.

library(dplyr)
boruta.imp <- function(x){
  imp <- reshape2::melt(x$ImpHistory, na.rm=T)[,-1]
  colnames(imp) <- c("Variable","Importance")
  imp <- imp[is.finite(imp$Importance),]

  variableGrp <- data.frame(Variable=names(x$finalDecision), 
                            finalDecision=x$finalDecision)

  showGrp <- data.frame(Variable=c("shadowMax", "shadowMean", "shadowMin"),
                        finalDecision=c("shadowMax", "shadowMean", "shadowMin"))

  variableGrp <- rbind(variableGrp, showGrp)

  boruta.variable.imp <- merge(imp, variableGrp, all.x=T)

  sortedVariable <- boruta.variable.imp %>% group_by(Variable) %>% 
    summarise(median=median(Importance)) %>% arrange(median)
  sortedVariable <- as.vector(sortedVariable$Variable)


  boruta.variable.imp$Variable <- factor(boruta.variable.imp$Variable, levels=sortedVariable)

  invisible(boruta.variable.imp)
}
boruta.variable.imp <- boruta.imp(boruta)

head(boruta.variable.imp)
##                             Variable Importance finalDecision
## 1 AverageReadingCountsForEachSharing   4.861474     Confirmed
## 2 AverageReadingCountsForEachSharing   4.648540     Confirmed
## 3 AverageReadingCountsForEachSharing   6.098471     Confirmed
## 4 AverageReadingCountsForEachSharing   4.701201     Confirmed
## 5 AverageReadingCountsForEachSharing   3.852440     Confirmed
## 6 AverageReadingCountsForEachSharing   3.992969     Confirmed
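
Boruta also provides attStats, which condenses the importance history into one row per attribute (mean, median, min and max importance, the fraction of runs in which the attribute beat the best shadow attribute, and the final decision):

# One-row-per-attribute summary of the Boruta run
attStats(boruta)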

Only the Confirmed variables are plotted. The figure shows that the top 4 variables by importance are all related to "sharing" (the number of reads generated by sharing, the total number of sharers, the total number of shares, and the first-share rate): getting articles shared is very important for gaining followers.

library(ImageGP)

sp_boxplot(boruta.variable.imp, melted=T, xvariable = "Variable", yvariable = "Importance",
           legend_variable = "finalDecision", legend_variable_order = c("shadowMax", "shadowMean", "shadowMin", "Confirmed"),
           xtics_angle = 90, coordinate_flip =T)

[Figure: Boruta importance boxplot of confirmed variables and shadow variables]

Extract important variables and potentially important variables

boruta.finalVarsWithTentative <- data.frame(Item=getSelectedAttributes(boruta, withTentative = T), Type="Boruta_with_tentative")
data <- cbind(feature_mat, metadata)

variableFactor <- rev(levels(boruta.variable.imp$Variable))

sp_scatterplot(data, xvariable = group, yvariable = variableFactor[1], smooth_method = "auto")

[Figure: FollowersAfterReading vs. the top-ranked variable]

Because there are not many variables, ggpairs can also be used to see how all variables relate to one another and to the response variable.

library(GGally)

ggpairs(data, progress = F)

[Figure: ggpairs matrix of all variables]

Cross-validation to choose parameters and fit the model

Define a function that generates a series of candidate values for mtry (values not greater than the total number of variables).

generateTestVariableSet <- function(num_total_variable){
  max_power <- ceiling(log10(num_total_variable))
  tmp_subset <- c(unlist(sapply(1:max_power, function(x) (1:10)^x, simplify = F)), ceiling(max_power/3))
  #return(tmp_subset)
  base::unique(sort(tmp_subset[tmp_subset<num_total_variable]))
}
# generateTestVariableSet(78)
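
For the 8 Boruta-confirmed variables used below, the function yields candidate mtry values 1 through 7, which is exactly the set that appears in the caret tuning results later:

generateTestVariableSet(8)
## [1] 1 2 3 4 5 6 7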

Select data related to key characteristic variables

# Extract the subset of key feature variables from the training set
boruta_train_data <- train_data[, boruta.finalVarsWithTentative$Item]
boruta_mtry <- generateTestVariableSet(ncol(boruta_train_data))

Tuning and modeling with Caret

library(caret)

if(file.exists('rda/wechatRegression.rda')){
  borutaConfirmed_rf_default <- readRDS("rda/wechatRegression.rda")
} else {
  # Repeated cross-validation: 10 folds, 5 repeats
  trControl <- trainControl(method="repeatedcv", number=10, repeats=5)

  seed <- 1
  set.seed(seed)
  # Set candidate parameter values based on experience or intuition
  tuneGrid <- expand.grid(mtry=boruta_mtry)

  borutaConfirmed_rf_default <- train(x=boruta_train_data, y=train_data_group, method="rf",
                                      tuneGrid = tuneGrid,
                                      metric="RMSE", # metric='Kappa' for classification
                                      trControl=trControl)
  saveRDS(borutaConfirmed_rf_default, "rda/wechatRegression.rda")
}

borutaConfirmed_rf_default
## Random Forest 
## 
## 1192 samples
##    8 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 1073, 1073, 1073, 1072, 1073, 1073, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   1     6.441881  0.7020911  2.704873
##   2     6.422848  0.7050505  2.720557
##   3     6.418449  0.7052825  2.736505
##   4     6.431665  0.7039496  2.742612
##   5     6.453067  0.7013595  2.754239
##   6     6.470716  0.6998307  2.758901
##   7     6.445304  0.7020575  2.756523
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 3.
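
The selected hyperparameter can also be read programmatically from the caret object:

# mtry value chosen by cross-validation
borutaConfirmed_rf_default$bestTune
##   mtry
## 3    3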

Plotting Model Performance (RMSE) vs. the Hyperparameter

plot(borutaConfirmed_rf_default)

[Figure: cross-validated RMSE vs. mtry]

Plot the most important variables (at most 20 by default; here all 8 are shown). Note that the importance assessed by Boruta differs slightly from the importance assessed by the model itself.

dotPlot(varImp(borutaConfirmed_rf_default))

[Figure: dot plot of variable importance from the final model]

Extract the final selected model and evaluate its performance.

borutaConfirmed_rf_default_finalmodel <- borutaConfirmed_rf_default$finalModel

First, use the training set to evaluate the fit of the model: RMSE = 3.1, Rsquared = 0.944, which looks quite good.

# Obtain model evaluation metrics
predictions_train <- predict(borutaConfirmed_rf_default_finalmodel, newdata=train_data)
postResample(pred = predictions_train, obs = train_data_group)
##      RMSE  Rsquared       MAE 
## 3.1028533 0.9440182 1.1891391

Then use the test set to evaluate the model's predictive performance: RMSE = 6.2, Rsquared = 0.825, acceptable. We can follow up with other methods to see whether this can be improved.

predictions_test <- predict(borutaConfirmed_rf_default_finalmodel, newdata=test_data)
postResample(pred = predictions_test, obs = test_data_group)
##      RMSE  Rsquared       MAE 
## 6.2219834 0.8251457 2.7212806
library(ggplot2)

testfollowerDF <- data.frame(Real_Follower=test_data_group, Predicted_Follower=predictions_test)

sp_scatterplot(testfollowerDF, xvariable = "Real_Follower", yvariable = "Predicted_Follower",
               smooth_method = "auto") + coord_fixed(1)

[Figure: real vs. predicted followers on the test set]

Drawbacks of Random Forest Regression

The values predicted by a random forest regression model never exceed the range of the response variable observed in the training set, so the model cannot be used for extrapolation.

Regression-Enhanced Random Forests (RERFs) can be used as a solution.
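
A minimal toy sketch (hypothetical data, unrelated to this dataset) makes the limitation concrete: a forest trained on x in [0, 10] never predicts beyond the range of y it saw during training:

library(randomForest)
set.seed(1)

# Toy data: y = 2 * x on [0, 10], so the training responses top out at 20
toy <- data.frame(x = seq(0, 10, length.out = 200))
toy$y <- 2 * toy$x

rf_toy <- randomForest(y ~ x, data = toy)

# Within the range the prediction is reasonable; outside it saturates
# near 20, while the true values at x = 15 and x = 50 are 30 and 100
predict(rf_toy, data.frame(x = c(5, 15, 50)))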


Machine Learning Tutorial Series

Starting from random forests, this series works step by step through the concepts and practice of decision trees, random forests, ROC/AUC, datasets, and cross-validation.

Plain text is used where words suffice, figures where a picture shows it better, formulas where text is imprecise, and short code snippets where formulas are still unclear, clarifying each step and concept along the way.

The series then moves on to production-ready code, model tuning, model comparison, and model evaluation, covering the knowledge and skills needed for the whole machine-learning workflow.



Origin: blog.csdn.net/qazplm12_3/article/details/132680446