R in Practice | Cluster Analysis


Cluster analysis

R provides a variety of cluster analysis functions. This article mainly introduces three approaches: hierarchical clustering, partitioning clustering, and model-based clustering.

Data preparation

Before cluster analysis, the data can be pre-processed, for example by handling missing values and standardizing the variables. The iris dataset (iris) is used as an example.

# Data preprocessing
mydata <- iris[,1:4]
mydata <- na.omit(mydata) # remove missing values
mydata <- scale(mydata) # standardize the data
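
As a quick sanity check (a minimal sketch, not in the original post), the standardized data can be inspected; after scale(), every column should have mean roughly 0 and standard deviation 1.

head(mydata) # first rows of the scaled data
round(colMeans(mydata), 3) # column means, all approximately 0
apply(mydata, 2, sd) # column standard deviations, all 1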

Partitioning

K-means is the most commonly used clustering algorithm based on Euclidean distance: the closer two observations are, the more similar they are considered to be. The analyst must first decide how many groups to divide the data into, that is, the number of clusters. Plotting the within-group sum of squares against the number of clusters extracted can help determine the appropriate number of clusters.

# Explore the optimal number of clusters
# Compute the within-cluster sum of squares for different numbers of clusters
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata,centers=i)$withinss)
plot(1:15, wss, type="b", 
     xlab="Number of Clusters",
     ylab="Within groups sum of squares")
[Figure: within-group sum of squares (WSS) vs. number of clusters]
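
Note that kmeans() starts from random centers, so the elbow curve can vary slightly between runs. A minimal reproducible sketch (the seed value and nstart = 25 are arbitrary choices, not from the original post):

set.seed(123) # fix the random starting centers
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i, nstart=25)$withinss)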

The optimal number of clusters is the point where the WSS stops dropping sharply and levels off. As shown in the figure, this is roughly satisfied at 4 clusters (3 to 5 look similar; there is no single, unique answer).

# K-means cluster analysis
fit <- kmeans(mydata, 4) # number of clusters = 4
# mean of each variable within each cluster
aggregate(mydata,by=list(fit$cluster),FUN=mean)
# append the cluster assignment as a new column
mydata_result <- data.frame(mydata, fit$cluster)
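
Because the true species labels are known for iris, a quick cross-tabulation (a sketch added here, not part of the original workflow) shows how the k-means clusters line up with them:

fit$size # number of observations in each cluster
table(iris$Species, fit$cluster) # clusters vs. known species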

Hierarchical clustering

R provides a wealth of hierarchical clustering functions; here is a brief introduction to hierarchical cluster analysis using Ward's method.

Hierarchical clustering, as the name suggests, builds clusters level by level, and the final result is a tree structure. Technically, hierarchical clustering creates a hierarchically nested clustering tree by computing the similarity between data points (and groups of points) of different categories.

The advantage of hierarchical clustering is that the number of clusters does not need to be specified in advance. What it produces is a tree; after clustering is complete, the tree can be cut horizontally at any level to obtain the desired number of clusters.

[Figure: example of obtaining a chosen number of clusters by cutting the tree]
# Ward hierarchical clustering
d <- dist(mydata, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward.D") # Ward's minimum-variance method (formerly "ward")
plot(fit) # dendrogram

groups <- cutree(fit, k=3) # cut the tree into 3 clusters
rect.hclust(fit, k=3, border="red") # outline the 3 clusters in red on the dendrogram
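As with k-means above, the assignments from the cut tree can be compared with the known species labels (again only a sketch, not from the original post):

table(iris$Species, groups) # 3 hierarchical clusters vs. species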

The pvclust() function in the pvclust package provides p-values for hierarchical clustering via multiscale bootstrap resampling. Clusters that are strongly supported by the data have large p-values. Note that pvclust() clusters the columns of the input matrix, so in this example it is the four iris measurement variables, not the 150 observations, that are clustered.

# Ward hierarchical clustering with bootstrapped p-values
library(pvclust)
fit <- pvclust(mydata, method.hclust="ward.D",
   method.dist="euclidean")
plot(fit) # dendrogram with p-values
# add rectangles around groups highly supported by the data
pvrect(fit, alpha=.95)
[Figure: pvclust dendrogram with bootstrap p-values]

Model-based clustering

Model-based clustering methods use maximum likelihood estimation and the Bayesian information criterion (BIC) to select the best clustering model, and hence the optimal number of clusters, from a large set of candidate models. For parameterized Gaussian finite mixture models, the Mclust() function in the mclust package fits each model by EM (initialized with model-based hierarchical clustering) and selects the optimal one according to BIC.

# Model Based Clustering
library(mclust)
fit <- Mclust(mydata)
plot(fit) # plot results
summary(fit) # display the best model
[Figure: Mclust diagnostic plots]
> summary(fit) # display the best model
---------------------------------------------------- 
Gaussian finite mixture model fitted by EM algorithm 
---------------------------------------------------- 

Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 2 components: 

 log-likelihood   n df       BIC       ICL
      -322.6936 150 29 -790.6956 -790.6969

Clustering table:
  1   2 
 50 100

From the above results, the best-fitting model has two components; that is, the data are best divided into two clusters.
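
To dig into the model-based result, the classification stored in the fitted object can be inspected directly; a minimal sketch (not part of the original post):

head(fit$classification) # cluster assignment of each observation
table(iris$Species, fit$classification) # 2 components vs. known species
plot(fit, what = "classification") # draw only the classification plot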

Cluster visualization

# K-means clustering with 3 clusters
fit <- kmeans(mydata, 3)

# vary parameters for most readable graph
library(cluster)
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,
   labels=2, lines=0)
[Figure: clusplot of the k-means clusters]
# Centroid plot against the first two discriminant functions
install.packages('fpc') # install the package if needed
library(fpc)
plotcluster(mydata, fit$cluster)
[Figure: discriminant-coordinate plot from fpc::plotcluster()]
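
As an alternative visualization, the factoextra package (assuming it is installed; it is not used in the original post) provides a ggplot2-based cluster plot on the first two principal components:

library(factoextra)
fviz_cluster(fit, data = mydata) # works directly with a kmeans fit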

Reference

  • Quick-R: Cluster Analysis (statmethods.net): https://www.statmethods.net/advstats/cluster.html
