Decision Tree, Random Forest, and Cluster Analysis

Background

Predict a new contractor's credit rating based on five indicators:
x1: market share
x2: customer complaint rate
x3: current-year gross revenue
x4: sales as a proportion of paid-up capital
x5: net profit
Using these five indicators and 821 historical samples with known rating results, we build a multi-class credit rating model and a clustering model, as follows:

(1) Point R's package repository at a domestic (Aliyun) mirror so that package downloads are reliable.

There are two ways to set the mirror, as follows:
A. Method 1: specify the mirror in code
local({r <- getOption("repos")
r["CRAN"] <- "http://mirrors.aliyun.com/CRAN/"
r["CRANextra"] <- "http://mirrors.aliyun.com/CRAN/"
options(repos=r)})
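
Either way, the setting can be verified before installing anything; a quick check (the package list below simply matches the packages loaded later in this post):

getOption("repos")  # confirm the repository now points at the Aliyun mirror
# install the packages used in the sections below
install.packages(c('rpart','rpart.plot','party','randomForest','factoextra'))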
B. Method 2: set the mirror manually in the IDE's settings
[Figure: setting the CRAN mirror manually]
(2) Use the na.omit() function to delete rows with missing data.
[Figure: data summary before and after na.omit()]
As the output shows, the data contain no missing values: the data after deletion are identical to the original, so we can proceed to the next step.
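
Since this step is only shown as an image, here is a minimal sketch; the working directory and the file name data1_rating.csv are assumptions, as the actual Annex 1 file name is not shown:

setwd("C:\\Users\\Administrator\\Desktop\\project")  # hypothetical project directory
df1 <- read.csv("data1_rating.csv")  # hypothetical file name for the Annex 1 data
sum(is.na(df1))       # count missing values in the data
df1 <- na.omit(df1)   # drop any rows containing missing values
nrow(df1)             # same row count as before, so there were no missing values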
(3) After reading the Annex 1 data set into R, use the factor() function to convert the rating-result column into an ordered factor.

[Figure: output of the ordered factor conversion]
After the conversion, the factor levels are ordered as A < B < C < D.
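
A minimal sketch of the conversion, assuming the rating column is named grade (the name used in the modeling code below) with levels A to D:

# convert the rating column into an ordered factor with A < B < C < D
df1$grade <- factor(df1$grade, levels = c('A','B','C','D'), ordered = TRUE)
str(df1$grade)  # Ord.factor w/ 4 levels "A"<"B"<"C"<"D"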
(4) Randomly split the converted df1 data frame into a development sample df1.train and a validation sample df1.validate.
[Figure: code for the random split]

A random seed is set, and rows are drawn according to the chosen sampling ratio.

[Figure: structure of the development and validation samples]
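
Since the split code is only shown as an image, here is a minimal sketch of the step; the 70/30 ratio is an assumption, as the actual proportion is not visible:

set.seed(123456)  # fix the random seed for reproducibility
# hypothetical 70/30 split; the actual ratio used is not shown in the text
train.idx <- sample(nrow(df1), round(0.7 * nrow(df1)))
df1.train    <- df1[train.idx, ]
df1.validate <- df1[-train.idx, ]
str(df1.train); str(df1.validate)  # inspect the two samples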

Classical Decision Tree Model

(5) Use df1.train to train the classic decision tree model, and use df1.validate to verify the accuracy of the model.

library(rpart)
set.seed(123456)
dtree <- rpart(grade~.,data = df1.train,method = 'class',
               parms = list(split = 'information'))
dtree$cptable  # prediction error for trees of different sizes
plotcp(dtree)  # plot cross-validation error against the complexity parameter
dtree.pruned <- prune(dtree,cp=0.015) # prune by complexity to control tree size
library(rpart.plot)
prp(dtree.pruned,type = 2, extra = 104,
    fallen.leaves = TRUE, main = 'Decision Tree') # draw the final decision tree
dtree.pred <- predict(dtree.pruned,df1.validate,type = 'class')
str(dtree.pred)
dtree.pref <- table(df1.validate$grade,dtree.pred,
                    dnn = c('Actual','Predicted'))
print(dtree.pref)

The results are as follows:
[Figure: cross-validation error vs. complexity parameter]
Based on the plot, we prune at cp = 0.015 to control the size of the tree. The resulting decision tree is:
[Figure: pruned decision tree]
The model is then checked against the validation set, giving the following confusion matrix:
[Figure: confusion matrix for the classic decision tree]
The number of correctly classified observations is 54 + 205 + 31 + 31 = 321, out of 737 validation observations in total, so the accuracy is 321/737 = 43.56%.
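
Instead of summing by hand, the accuracy can be read off the confusion matrix directly, since correct predictions lie on its diagonal; a minimal sketch:

# accuracy = correctly classified / total validation observations
sum(diag(dtree.pref)) / sum(dtree.pref)  # 321/737 = 0.4356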

Conditional Inference Tree Model

(6) Use df1.train to train a conditional inference tree model, and use df1.validate to verify the accuracy of the model.

#install.packages("party")
library(party) # load the required package
fit.ctree <- ctree(grade~.,data = df1.train)  # fit the conditional inference tree
plot(fit.ctree,main = 'Conditional Inference Tree') # draw the tree
ctree.pred <- predict(fit.ctree,df1.validate,type = 'response') # classify the validation set
ctree.pref <- table(df1.validate$grade,ctree.pred,
                    dnn = c('Actual','Predicted')) # confusion matrix for accuracy
print(ctree.pref)

The execution results are as follows:

[Figure: conditional inference tree]
The model is then checked against the validation set, and the final prediction results are:

[Figure: confusion matrix for the conditional inference tree]
The number of correctly classified observations is 0 + 304 + 0 + 31 = 335, out of 737 validation observations in total, so the accuracy is 335/737 = 45.45%.

Random forest model

(7) Use df1.train to train the random forest model, and use df1.validate to verify the accuracy of the model.

library(randomForest)
set.seed(123456)
ez.forest <- randomForest(grade~.,data = df1.train,
                          na.action = na.roughfix,  # replace missing values with the column median
                          importance = TRUE)  # grow the forest
print(ez.forest)
importance(ez.forest,type=2) 
# with type=2, a variable's relative importance is the total decrease in node impurity
# from splits on that variable, averaged over all trees in the forest
forest.pred <- predict(ez.forest,df1.validate)
forest.perf <- table(df1.validate$grade,forest.pred,
                     dnn = c('Actual','Predicted'))
print(forest.perf)

The execution results are as follows:
[Figure: random forest summary and variable importance]
The model is then checked against the validation set, and the final prediction results are:
[Figure: confusion matrix for the random forest]
The number of correctly classified observations is 77 + 224 + 29 + 35 = 365, out of 737 validation observations in total, so the accuracy is 365/737 = 49.53%.

Compute the prediction accuracy of the three models on the validation set and determine which model is most accurate, as shown below.
In summary, the classic decision tree model has an accuracy of 43.56%, the conditional inference tree model 45.45%, and the random forest model 49.53%. The random forest model, having the highest accuracy, is selected.
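
For reference, the same diagonal-sum calculation can be applied to all three confusion matrices at once; a minimal sketch using the tables built above:

# validation accuracy of the three models
sapply(list(classic = dtree.pref, conditional = ctree.pref, forest = forest.perf),
       function(tab) sum(diag(tab)) / sum(tab))
#     classic conditional      forest
#      0.4356      0.4545      0.4953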

Cluster analysis model

Use a k-means clustering model to perform cluster analysis on this batch of data.

setwd("C:\\Users\\Administrator\\Desktop\\project")
df2<-read.csv("data2_cluster.csv")
str(df2)
library(factoextra)
df2<-df2[,-1]  # drop the first (ID) column
b<-scale(df2)  # standardize the indicators
# set a random seed so the experiment can be reproduced
set.seed(1234)
# determine the optimal number of clusters via the within-cluster sum of squares (wss)
fviz_nbclust(b,kmeans,method="wss")+geom_vline(xintercept=4,linetype=2)
# run k-means with the chosen number of clusters
res<-kmeans(b,4)
# append the cluster assignments to the original data
res1<-cbind(df2,res$cluster)
# export the final result
write.csv(res1,file='res1.csv')
# view the final cluster plot
fviz_cluster(res,data=df2)

[Figure: judgment of the optimal number of clusters]

The wss plot intuitively gives the optimal number of clusters as 4, so the data are clustered into 4 groups. The clustering result plot is:
[Figure: clustering result]
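
To interpret the four clusters beyond the plot, their sizes and centers can be inspected; a minimal sketch using the objects created above:

res$size     # number of observations in each cluster
res$centers  # cluster centers on the standardized indicators
# mean of each original indicator within each cluster
aggregate(df2, by = list(cluster = res$cluster), FUN = mean)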

Source: blog.csdn.net/tandelin/article/details/107098405