R offers a variety of cluster analysis functions. This article mainly introduces three methods: hierarchical clustering, partitioning clustering, and model-based clustering.
Data preparation
Before cluster analysis, the data can be pre-processed, for example by handling missing values and standardizing variables. We use the iris dataset as an example.
# Data preprocessing
mydata <- iris[,1:4]
mydata <- na.omit(mydata) # remove missing values
mydata <- scale(mydata) # standardize the data
Partitioning
K-means
is the most commonly used clustering algorithm based on Euclidean distance: the closer two observations are, the more similar they are assumed to be. The analyst must first decide how many clusters to extract. Plotting the within-group sum of squares against the number of clusters can help determine an appropriate number.
# Explore the optimal number of clusters
# Compute the within-group sum of squares for each number of clusters
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata,centers=i)$withinss)
plot(1:15, wss, type="b",
xlab="Number of Clusters",
ylab="Within groups sum of squares")
(Figure: within-group sum of squares versus number of clusters)
The optimal number of clusters is the point where the curve stops decreasing sharply and levels off (the "elbow"). As shown in the figure, this occurs at about 4 clusters (3 to 5 give similar results; there is no single correct number of clusters).
# K-means cluster analysis
fit <- kmeans(mydata, 4) # 4 clusters
# Mean of each variable within each cluster
aggregate(mydata,by=list(fit$cluster),FUN=mean)
# Append the cluster assignments as a column
mydata_result <- data.frame(mydata, fit$cluster)
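Since iris also carries the true species labels, one quick sanity check (not part of the original tutorial; the seed value and 3-cluster choice here are illustrative assumptions) is to cross-tabulate the k-means assignments against Species:

```r
# Sanity check (sketch): cross-tabulate k-means clusters against the
# known iris species. set.seed makes the random initialization reproducible.
mydata <- scale(na.omit(iris[, 1:4]))
set.seed(42)
fit <- kmeans(mydata, centers = 3)  # 3 clusters, matching the 3 species
table(Species = iris$Species, Cluster = fit$cluster)
```

A near-diagonal table (after permuting cluster labels) indicates that the clusters recover the species reasonably well.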
Hierarchical clustering
R provides a wealth of hierarchical clustering functions; here we briefly introduce hierarchical clustering using Ward's method.
Hierarchical clustering, as the name suggests, clusters the data level by level, and the final result is a tree structure. Technically, it builds a hierarchically nested clustering tree by computing the similarity between data points in different groups.
An advantage of hierarchical clustering is that the number of clusters does not need to be specified in advance. The result is a tree, and once clustering is complete the tree can be cut horizontally at any level to obtain the desired number of clusters.
# Ward hierarchical clustering
d <- dist(mydata, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward.D2") # "ward" is deprecated in recent R; use "ward.D2"
plot(fit) # dendrogram
groups <- cutree(fit, k=3) # cut the tree into 3 clusters
rect.hclust(fit, k=3, border="red")
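After cutting the tree, `groups` is simply an integer vector of cluster labels, one per observation, so it can be inspected and summarized like any other grouping variable. A small self-contained sketch (it rebuilds `mydata` as above):

```r
# Inspect the result of cutting the dendrogram: 'groups' holds one
# integer cluster label per observation.
mydata <- scale(na.omit(iris[, 1:4]))
d <- dist(mydata, method = "euclidean")
fit <- hclust(d, method = "ward.D2")
groups <- cutree(fit, k = 3)
table(groups) # cluster sizes
# Per-cluster means of the (standardized) variables
aggregate(mydata, by = list(cluster = groups), FUN = mean)
```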
The pvclust() function in the pvclust package provides p-values for hierarchical clustering based on multiscale bootstrap resampling. Clusters that are strongly supported by the data have larger p-values. Note that pvclust clusters the columns (variables) of the data, not the rows; transpose the matrix if you want to cluster observations.
# Ward hierarchical clustering with bootstrapped p-values
library(pvclust)
fit <- pvclust(mydata, method.hclust="ward.D2",
method.dist="euclidean")
plot(fit) # dendrogram with p-values
# add rectangles around clusters highly supported by the data
pvrect(fit, alpha=.95)
Model-based clustering
Model-based clustering methods use maximum likelihood estimation and the Bayesian Information Criterion (BIC) to select the best clustering model and the optimal number of clusters among a set of candidate models. For parameterized Gaussian mixture models, the Mclust() function in the mclust package selects the optimal model according to the BIC of EM initialized by hierarchical clustering.
# Model Based Clustering
library(mclust)
fit <- Mclust(mydata)
plot(fit) # plot results
summary(fit) # display the best model
> summary(fit) # display the best model
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 2 components:
log-likelihood n df BIC ICL
-322.6936 150 29 -790.6956 -790.6969
Clustering table:
1 2
50 100
The above results indicate that a two-cluster solution fits the data best.
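As with k-means, the model-based classification can be compared against the known species labels (a sketch; it assumes the mclust package is installed and rebuilds `mydata` so it runs on its own):

```r
# Compare the Mclust classification with the true iris species.
library(mclust)
mydata <- scale(na.omit(iris[, 1:4]))
fit <- Mclust(mydata)
table(Species = iris$Species, Cluster = fit$classification)
```

With the two-component solution reported above, one cluster typically corresponds to setosa, while versicolor and virginica fall into the other.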
Cluster visualization
# K-means clustering with 3 clusters
fit <- kmeans(mydata, 3)
# vary parameters for most readable graph
library(cluster)
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,
labels=2, lines=0)
# Centroid Plot against 1st 2 discriminant functions
install.packages('fpc') # install if not already installed
library(fpc)
plotcluster(mydata, fit$cluster)
Reference
Quick-R: Cluster Analysis (statmethods.net): https://www.statmethods.net/advstats/cluster.html