The backend of a WeChat official account records various reading metrics for each published article, including: content title, total readers, total reads, total sharers, total shares, followers gained after reading, delivered-reading rate, reads generated by sharing, first-share rate, reads brought by each share, and read-completion rate.
We use the random forest algorithm from machine learning to explore whether certain indicators, or combinations of indicators, can predict the number of followers gained after reading.
Data format and reading the data
The dataset includes 9 statistical indicators for 1588 articles.
Read Statistics Matrix: WeChatOfficialAccount.txt
Number of followers after reading:
WeChatOfficialAccountFollowers.txt
feature_file <- "data/WeChatOfficialAccount.txt"
metadata_file <- "data/WeChatOfficialAccountFollowers.txt"
feature_mat <- read.table(feature_file, row.names = 1, header = T, sep="\t", stringsAsFactors =T)
# Fix problematic feature names if necessary
# rownames(feature_mat) <- make.names(rownames(feature_mat))
metadata <- read.table(metadata_file, row.names=1, header=T, sep="\t", stringsAsFactors =T)
dim(feature_mat)
## [1] 1588 9
Reading statistics are represented as follows:
feature_mat[1:4,1:5]
## TotalReadingPeople TotalReadingCounts TotalSharingPeople TotalSharingCounts ReadingRate
## 1 8278 11732 937 1069 0.0847
## 2 8951 12043 828 929 0.0979
## 3 18682 22085 781 917 0.0608
## 4 4978 6166 525 628 0.0072
The metadata is represented as follows:
head(metadata)
## FollowersAfterReading
## 1 227
## 2 188
## 3 119
## 4 116
## 5 105
## 6 100
Sample filtering and ordering
Make sure the samples in the metadata table and the feature matrix appear in the same order; this alignment also needs to be guaranteed here.
feature_mat_sampleL <- rownames(feature_mat)
metadata_sampleL <- rownames(metadata)
common_sampleL <- intersect(feature_mat_sampleL, metadata_sampleL)
# Ensure the samples in the feature matrix and the metadata are identical in order and number
feature_mat <- feature_mat[common_sampleL,,drop=F]
metadata <- metadata[common_sampleL,,drop=F]
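The intersect-based alignment above can be exercised with a small toy example (the sample names here are hypothetical):

```r
# Toy demonstration of the intersect-based sample alignment used above
a <- data.frame(x = 1:3, row.names = c("s1", "s2", "s3"))
b <- data.frame(y = 4:5, row.names = c("s3", "s1"))

common <- intersect(rownames(a), rownames(b))  # "s1" "s3"
a <- a[common, , drop = FALSE]
b <- b[common, , drop = FALSE]

stopifnot(identical(rownames(a), rownames(b)))  # samples are now aligned
```

Note that `intersect` keeps the order of its first argument, so both tables end up in the feature matrix's sample order.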
Determine whether to do classification or regression
Since stringsAsFactors = T was already supplied when reading the data, this step could be skipped; it is kept here for completeness.
If the group column is numeric, convert it to numeric type and do regression.
If the group column holds group labels, convert it to a factor and do classification.
# Since R 4.0, strings are no longer read in as factors by default, so a conversion is needed
# devtools::install_github("Tong-Chen/ImageGP")
library(ImageGP)
# Change FollowersAfterReading here as needed
group = "FollowersAfterReading"
# If the group column is numeric, convert it to numeric type - do regression
# If the group column holds group labels, convert it to a factor - do classification
if(numCheck(metadata[[group]])){
if (!is.numeric(metadata[[group]])) {
metadata[[group]] <- mixedToFloat(metadata[[group]])
}
} else{
metadata[[group]] <- as.factor(metadata[[group]])
}
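If ImageGP is not available, the same decision can be sketched with base R alone. The helper below is a hypothetical, simplified stand-in for numCheck/mixedToFloat, assuming the column holds either plain numeric strings or group labels:

```r
# Simplified base-R stand-in for the ImageGP helpers (assumption: values are
# either plain numbers stored as strings/factors, or group labels)
to_numeric_or_factor <- function(x) {
  x_chr <- as.character(x)
  x_num <- suppressWarnings(as.numeric(x_chr))
  if (!anyNA(x_num)) {
    x_num           # everything parses as a number -> regression target
  } else {
    factor(x_chr)   # otherwise treat as group labels -> classification
  }
}

to_numeric_or_factor(c("227", "188", "119"))   # numeric vector
to_numeric_or_factor(c("high", "low", "low"))  # factor with 2 levels
```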
Preliminary Analysis of Random Forest
library(randomForest)
# Checking the parameters is a good habit
# With the overview given earlier, the meaning of each parameter is much clearer,
# and you know how to tune them
# Everyone's problem is different; usually you shouldn't just copy whatever parameters others use,
# especially in downstream analyses
# ?randomForest
# 查看源码
# randomForest:::randomForest.default
After loading the package, run the analysis directly with default settings, then adjust the parameters after seeing the results.
# Set the random seed; for details see https://mp.weixin.qq.com/s/6plxo-E8qCdlzCgN8E90zg
set.seed(304)
# Use the default parameters directly
rf <- randomForest(feature_mat, metadata[[group]])
From the preliminary results: the type of random forest was determined to be regression, 500 decision trees were built, 3 randomly selected indicators were tried at each split (mtry), the mean of squared residuals is 39.82736, and the percentage of variance explained (% Var explained) is 74.91. The result looks reasonable.
rf
##
## Call:
## randomForest(x = feature_mat, y = metadata[[group]])
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 39.82736
## % Var explained: 74.91
Observing the model's predictions on the training set, the agreement looks quite good.
library(ggplot2)
followerDF <- data.frame(Real_Follower=metadata[[group]], Predicted_Follower=predict(rf, newdata=feature_mat))
sp_scatterplot(followerDF, xvariable = "Real_Follower", yvariable = "Predicted_Follower",
smooth_method = "auto") + coord_fixed(1)
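sp_scatterplot is an ImageGP convenience wrapper; the equivalent comparison can be drawn with plain ggplot2. The snippet below uses simulated data so it runs standalone (the real followerDF built above would be used in practice):

```r
library(ggplot2)

# Simulated stand-in for followerDF so this snippet is self-contained
set.seed(1)
followerDF <- data.frame(Real_Follower = rpois(100, 50))
followerDF$Predicted_Follower <- followerDF$Real_Follower + rnorm(100, sd = 5)

# Scatter of predicted vs. real values with a fitted trend line
p <- ggplot(followerDF, aes(x = Real_Follower, y = Predicted_Follower)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x) +
  coord_fixed(1)
```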
Random Forest Standard Operating Procedure
Split training and test sets
library(caret)
seed <- 1
set.seed(seed)
train_index <- createDataPartition(metadata[[group]], p=0.75, list=F)
train_data <- feature_mat[train_index,]
train_data_group <- metadata[[group]][train_index]
test_data <- feature_mat[-train_index,]
test_data_group <- metadata[[group]][-train_index]
dim(train_data)
## [1] 1192 9
dim(test_data)
## [1] 396 9
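createDataPartition draws a stratified split on the response (which is why the training set has 1192 rows rather than exactly 75% of 1588). For comparison, a simpler unstratified split in base R would look like this:

```r
# Unstratified 75/25 split with base R (createDataPartition additionally
# stratifies on the response, which is usually preferable)
set.seed(1)
n <- 1588                                  # number of articles in the dataset
train_index <- sample(n, size = round(0.75 * n))

length(train_index)                # 1191 training samples
length(setdiff(1:n, train_index))  # 397 test samples
```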
Boruta feature selection to identify key feature variables
# install.packages("Boruta")
library(Boruta)
set.seed(1)
boruta <- Boruta(x=train_data, y=train_data_group, pValue=0.01, mcAdj=T,
maxRuns=300)
boruta
## Boruta performed 14 iterations in 5.917085 secs.
## 8 attributes confirmed important: AverageReadingCountsForEachSharing, FirstSharingRate,
## ReadingRate, TotalReadingCounts, TotalReadingCountsOfSharing and 3 more;
## 1 attributes confirmed unimportant: ReadingFinishRate;
Looking at the variable-importance results (already reflected in the output above): 8 important variables, 0 possibly important variables (tentative variables, whose importance scores do not differ statistically from the best shadow variable's score), and 1 unimportant variable.
table(boruta$finalDecision)
##
## Tentative Confirmed Rejected
## 0 8 1
Plot the importance of the identified variables. With only a few variables the default plot is adequate; with many variables it becomes unreadable, and you need to reorganize the data and plot it yourself.
Define a function to extract the importance value corresponding to each variable.
library(dplyr)
boruta.imp <- function(x){
imp <- reshape2::melt(x$ImpHistory, na.rm=T)[,-1]
colnames(imp) <- c("Variable","Importance")
imp <- imp[is.finite(imp$Importance),]
variableGrp <- data.frame(Variable=names(x$finalDecision),
finalDecision=x$finalDecision)
showGrp <- data.frame(Variable=c("shadowMax", "shadowMean", "shadowMin"),
finalDecision=c("shadowMax", "shadowMean", "shadowMin"))
variableGrp <- rbind(variableGrp, showGrp)
boruta.variable.imp <- merge(imp, variableGrp, all.x=T)
sortedVariable <- boruta.variable.imp %>% group_by(Variable) %>%
summarise(median=median(Importance)) %>% arrange(median)
sortedVariable <- as.vector(sortedVariable$Variable)
boruta.variable.imp$Variable <- factor(boruta.variable.imp$Variable, levels=sortedVariable)
invisible(boruta.variable.imp)
}
boruta.variable.imp <- boruta.imp(boruta)
head(boruta.variable.imp)
## Variable Importance finalDecision
## 1 AverageReadingCountsForEachSharing 4.861474 Confirmed
## 2 AverageReadingCountsForEachSharing 4.648540 Confirmed
## 3 AverageReadingCountsForEachSharing 6.098471 Confirmed
## 4 AverageReadingCountsForEachSharing 4.701201 Confirmed
## 5 AverageReadingCountsForEachSharing 3.852440 Confirmed
## 6 AverageReadingCountsForEachSharing 3.992969 Confirmed
Only the Confirmed variables are plotted. The figure shows that the top 4 variables by importance are all sharing-related (reads generated by sharing, total sharers, total shares, first-share rate): sharing articles matters greatly for gaining followers.
library(ImageGP)
sp_boxplot(boruta.variable.imp, melted=T, xvariable = "Variable", yvariable = "Importance",
legend_variable = "finalDecision", legend_variable_order = c("shadowMax", "shadowMean", "shadowMin", "Confirmed"),
xtics_angle = 90, coordinate_flip =T)
Extract important variables and potentially important variables
boruta.finalVarsWithTentative <- data.frame(Item=getSelectedAttributes(boruta, withTentative = T), Type="Boruta_with_tentative")
data <- cbind(feature_mat, metadata)
variableFactor <- rev(levels(boruta.variable.imp$Variable))
sp_scatterplot(data, xvariable = group, yvariable = variableFactor[1], smooth_method = "auto")
Because there are not many variables, you can also use ggpairs to see how all the variables relate to each other and to the response variable.
library(GGally)
ggpairs(data, progress = F)
Cross-validation to choose parameters and fit the model
Define a function that generates a series of candidate mtry values to test (each no greater than the total number of variables).
generateTestVariableSet <- function(num_total_variable){
  max_power <- ceiling(log10(num_total_variable))
  tmp_subset <- c(unlist(sapply(1:max_power, function(x) (1:10)^x, simplify = F)),
                  ceiling(max_power/3))
  base::unique(sort(tmp_subset[tmp_subset < num_total_variable]))
}
# generateTestVariableSet(78)
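As a quick sanity check (the definition is repeated so the snippet runs on its own): for the 8 Boruta-confirmed variables the function yields candidate mtry values 1 through 7, exactly the grid caret tunes over below; for larger variable counts it also adds squares (and cubes, and so on) to cover a wide range cheaply.

```r
# Definition repeated from above so this snippet is self-contained
generateTestVariableSet <- function(num_total_variable){
  max_power <- ceiling(log10(num_total_variable))
  tmp_subset <- c(unlist(sapply(1:max_power, function(x) (1:10)^x, simplify = F)),
                  ceiling(max_power/3))
  base::unique(sort(tmp_subset[tmp_subset < num_total_variable]))
}

generateTestVariableSet(8)   # 1 2 3 4 5 6 7
generateTestVariableSet(78)  # 1 ... 10 plus the squares 16 25 36 49 64
```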
Select the data corresponding to the key feature variables
# Extract the subset of feature variables for the training set
boruta_train_data <- train_data[, boruta.finalVarsWithTentative$Item]
boruta_mtry <- generateTestVariableSet(ncol(boruta_train_data))
Tuning and modeling with Caret
library(caret)
if(file.exists('rda/wechatRegression.rda')){
borutaConfirmed_rf_default <- readRDS("rda/wechatRegression.rda")
} else {
# Create model with default parameters
trControl <- trainControl(method="repeatedcv", number=10, repeats=5)
seed <- 1
set.seed(seed)
# Set some candidate parameters and values to search, based on experience or intuition
tuneGrid <- expand.grid(mtry=boruta_mtry)
borutaConfirmed_rf_default <- train(x=boruta_train_data, y=train_data_group, method="rf",
tuneGrid = tuneGrid, #
metric="RMSE", #metric='Kappa'
trControl=trControl)
saveRDS(borutaConfirmed_rf_default, "rda/wechatRegression.rda")
}
borutaConfirmed_rf_default
## Random Forest
##
## 1192 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 1073, 1073, 1073, 1072, 1073, 1073, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 1 6.441881 0.7020911 2.704873
## 2 6.422848 0.7050505 2.720557
## 3 6.418449 0.7052825 2.736505
## 4 6.431665 0.7039496 2.742612
## 5 6.453067 0.7013595 2.754239
## 6 6.470716 0.6998307 2.758901
## 7 6.445304 0.7020575 2.756523
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 3.
Plot model performance (RMSE) against the hyperparameter values
plot(borutaConfirmed_rf_default)
Plot the 20 variables with the highest contributions (the variable importance assessed by Boruta differs slightly from the importance assessed by the model itself)
dotPlot(varImp(borutaConfirmed_rf_default))
Extract the final selected model and evaluate its performance.
borutaConfirmed_rf_default_finalmodel <- borutaConfirmed_rf_default$finalModel
First, evaluate the model on the training set it was built from: RMSE = 3.1 and Rsquared = 0.944, which is quite good.
# Obtain the model evaluation metrics
predictions_train <- predict(borutaConfirmed_rf_default_finalmodel, newdata=train_data)
postResample(pred = predictions_train, obs = train_data_group)
## RMSE Rsquared MAE
## 3.1028533 0.9440182 1.1891391
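postResample reports three metrics; their definitions can be verified with a few lines of base R (note that caret's Rsquared is the squared Pearson correlation between predictions and observations, not 1 - SSE/SST):

```r
# Hand-rolled versions of the metrics postResample reports
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
mae  <- function(pred, obs) mean(abs(pred - obs))
r2   <- function(pred, obs) cor(pred, obs)^2   # squared correlation, as in caret

pred <- c(1, 2, 3); obs <- c(1, 2, 4)
rmse(pred, obs)  # sqrt(1/3), about 0.577
mae(pred, obs)   # 1/3
```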
Then use the test data to evaluate the model's predictive performance: RMSE = 6.2 and Rsquared = 0.825, acceptable. We will try other methods later to see whether this can be improved.
predictions_test <- predict(borutaConfirmed_rf_default_finalmodel, newdata=test_data)
postResample(pred = predictions_test, obs = test_data_group)
## RMSE Rsquared MAE
## 6.2219834 0.8251457 2.7212806
library(ggplot2)
testfollowerDF <- data.frame(Real_Follower=test_data_group, Predicted_Follower=predictions_test)
sp_scatterplot(testfollowerDF, xvariable = "Real_Follower", yvariable = "Predicted_Follower",
smooth_method = "auto") + coord_fixed(1)
Drawbacks of Random Forest Regression
The values predicted by a random forest regression model cannot exceed the range of the response variable in the training set, so the model cannot be used for extrapolation.
Regression-Enhanced Random Forests (RERFs) can be used as a solution.
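This limitation is easy to demonstrate: a regression forest predicts by averaging training-set responses in its leaves, so no prediction can exceed the training maximum. A small simulation (assuming the randomForest package is installed):

```r
library(randomForest)

# y = 2x on x in 1..100; ask the forest to extrapolate to x = 500
set.seed(1)
train_x <- data.frame(x = 1:100)
train_y <- 2 * (1:100)

rf_toy <- randomForest(train_x, train_y)
pred <- predict(rf_toy, newdata = data.frame(x = 500))

pred                  # close to max(train_y) = 200, nowhere near 1000
pred <= max(train_y)  # TRUE: predictions are averages of training responses
```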
References
https://medium.com/swlh/random-forest-and-its-implementation-71824ced454f
https://neptune.ai/blog/random-forest-regression-when-does-it-fail-and-why
https://levelup.gitconnected.com/random-forest-regression-209c0f354c84
https://rpubs.com/Isaac/caret_reg
Machine Learning Tutorial Series
Starting from random forests, this series works step by step through the concepts and practice of decision trees, random forests, ROC/AUC, datasets, and cross-validation.
Whatever can be explained clearly in words is explained in text; displays use figures; descriptions that remain unclear get formulas; and where a formula is still unclear, a short piece of code clarifies each step and concept.
The series then moves on to mature code applications, model tuning, model comparison, and model evaluation, covering the knowledge and skills needed across the whole of machine learning.
Machine Learning Algorithms - A Preliminary Study on Decision Trees of Random Forests (1)
Machine Learning Algorithm - Theoretical Overview of Random Forest
Machine Learning Algorithm - A Preliminary Study of Random Forest (1)
Machine Learning - Random Forest Manual 10-Fold Cross Validation
Machine learning model evaluation indicators - ROC curve and AUC value
One function unifies 238 machine learning R packages, which is amazing
General steps for random forest analysis based on Caret and RandomForest packages (1)
Interpretation of more parameters for Caret model training and parameter adjustment (2)
4 Ways to Randomly Adjust Parameters of Random Forest Based on Caret
Machine Learning Part 18 - Boruta Feature Variable Screening (2)
Machine Learning Part 20 - Building a Random Forest Based on Feature Variables Selected by Boruta
Machine Learning Part 21 - Feature Recursive Elimination RFE Algorithm Theory
Machine Learning Part 22 - The feature variables screened out by RFE are 4 times that of Boruta
Machine Learning Part 23 - More Feature Variables But Failed to Improve Random Forest Classification
This unified reference manual for 238 machine learning model R packages is recommended to you