Advanced Data Analysis in R: Cluster Analysis


About the Author

Yao Moumou 

Zhihu column: https://zhuanlan.zhihu.com/mydata


This section summarizes the main ideas of cluster analysis in data analysis.

Cluster analysis groups data objects based only on information found in the data that describes the objects and their relationships. The goal is for objects within a group to be similar (related) to one another, and for objects in different groups to be different (unrelated). The greater the similarity (homogeneity) within a group and the greater the difference between groups, the better the clustering.

This definition comes from "Introduction to Data Mining" and is already plain and easy to understand.

For example: classifying organisms into Kingdom, Phylum, Class, Order, Family, Genus, and Species.


0. General steps of cluster analysis

0.1. Steps

  • Choose appropriate variables: select the variables that may be important for identifying and understanding the different groups of observations in the data. (This step is critical; an advanced clustering method cannot compensate for a poor choice of clustering variables.)

  • Scale the data, generally by standardizing it, to prevent variables measured on different scales from dominating the result (a sketch follows this list).

  • Look for outliers. Outliers can strongly distort the results of cluster analysis, so they should be screened and, if necessary, removed.

  • Calculate distances (discussed in detail below). The distance measures how related observations are, which in turn drives the clustering result.

  • Select a clustering algorithm. Hierarchical clustering suits small samples; partitioning methods suit larger data volumes. Many other excellent algorithms exist; choose one appropriate to the situation.

  • Obtain the results: first, run one or more clustering methods; second, determine the final number of classes; third, extract the subgroups to form the final clustering scheme.

  • Visualize and interpret the results to demonstrate the meaning of the scheme.

  • Validate the results.
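
As a minimal sketch of the scaling and outlier-screening steps (the data frame name mydata and the |z| > 3 rule are illustrative assumptions, not from the original):

 df <- scale(mydata)                     # standardize each variable to mean 0, sd 1 (`mydata` is hypothetical)
 outliers <- apply(abs(df) > 3, 1, any)  # one simple rule: flag rows with any |z| > 3
 df <- df[!outliers, , drop=FALSE]       # drop the flagged observations before clustering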

0.2. Calculate the distance

Calculating distance means using a suitable distance measure to compute the distance between different observations; this distance quantifies the similarity or dissimilarity between them.

The available options include the Euclidean distance, Manhattan distance, Canberra distance, asymmetric binary distance, maximum distance, and Minkowski distance.

The most commonly used is the Euclidean distance. It computes the square root of the sum of the squares of the differences of all variables between two observations:

d(x, y) = sqrt( Σᵢ (xᵢ − yᵢ)² ),  summing over all p variables i = 1, …, p

It is the distance measure most commonly used for continuous data.
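
A quick check in R (the two toy observations are made up for illustration):

 x <- rbind(c(1, 4), c(2, 6))   # two observations, two variables each
 dist(x, method="euclidean")    # sqrt((1-2)^2 + (4-6)^2) = sqrt(5) ≈ 2.236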


1. Partition cluster analysis

The partitioning approach divides the observations into K groups and then reassigns them according to a given rule until the groups are as cohesive as possible.

There are two common variants: K-means and K-medoids (K-center point).

1.1. K-Means Clustering

K-means clustering is the most common partitioning method. It represents each class by its centroid and groups observations by Euclidean distance, which makes it sensitive to outliers.

1.1.1. Algorithm

1. Select K initial centroids (they can be chosen at random); each centroid starts a class.
2. Assign each observation to its nearest centroid, forming a class around each centroid.
3. Recalculate the centroid of each class. The centroid is the mean vector of all observations in the class (we call it a vector because each observation contains many variables, so an observation is treated as a multidimensional vector whose dimension equals the number of variables).
4. Repeat steps 2 and 3 until the centroids no longer change or the maximum number of iterations is reached (a toy R sketch follows this list).
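
A toy sketch of this loop, for illustration only; in practice use kmeans(), shown in the next section (the function name simple_kmeans and all implementation details are assumptions, and it assumes no class ends up empty):

 simple_kmeans <- function(x, k, max_iter=100) {
   centers <- x[sample(nrow(x), k), , drop=FALSE]         # step 1: random initial centroids
   cluster <- rep(1L, nrow(x))
   for (i in seq_len(max_iter)) {
     d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k] # distances from observations to centroids
     cluster <- max.col(-d)                               # step 2: each observation joins its nearest centroid
     new_centers <- apply(x, 2, function(v) tapply(v, factor(cluster, levels=1:k), mean))
     if (max(abs(new_centers - centers)) < 1e-8) break    # step 4: stop once the centroids are stable
     centers <- new_centers                               # step 3: centroids = mean vectors of each class
   }
   list(cluster=cluster, centers=centers)
 }

For example, simple_kmeans(scale(iris[, 1:4]), k=3) should find three groups that largely align with the three iris species.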

1.1.2. R language implementation

  • 1. Standardize the data

df <- scale(data[-1])  # drop the first column (assumed categorical; see step 4) and standardize the rest

  • 2. Determine the number of clusters

# Self-written function that draws a scree plot; judge the number of clusters from where the slope of the curve levels off

 wssplot(df)
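
 wssplot() is not part of base R; a sketch of one common definition (along the lines of the version popularized by R in Action):

 wssplot <- function(data, nc=15, seed=1234) {
   # for k = 1, the within-groups sum of squares is (n-1) * sum of the column variances
   wss <- (nrow(data) - 1) * sum(apply(data, 2, var))
   for (i in 2:nc) {
     set.seed(seed)
     wss[i] <- sum(kmeans(data, centers=i)$withinss)  # total within-class SSE for k = i
   }
   plot(1:nc, wss, type="b", xlab="Number of Clusters",
        ylab="Within groups sum of squares")
 }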

 # Use the 24 indicators in the NbClust package to determine the number of clusters

 library(NbClust)

 set.seed(1234)

 devAskNewPage(ask= TRUE)

 nc <- NbClust(df,method= "kmeans")

 table(nc$Best.n[1,])

 barplot(table(nc$Best.n[1,]))# Visualize the indicator distribution as a bar chart

  • 3. Perform K-means cluster analysis

# K-means with K = 3; try 25 random initial-center configurations and keep the best

 set.seed(1234)

 fit.km <- kmeans(df, 3, nstart=25)

 # View the number of observations in each cluster

 fit.km$size

 # View the centroids of the three clusters

 fit.km$centers

  • 4. Comparison with original categorical variables

# Use the randIndex function from the flexclust package

 library(flexclust)

 # cross-tabulate the original category (here assumed to be column 1 of `data`) against the clusters
 ct.km <- table(data[, 1], fit.km$cluster)
 randIndex(ct.km)

  • This step is needed when the clustering was performed on data from which a known categorical variable had been excluded: it measures the agreement between that variable and the recovered partition. The adjusted Rand index ranges from -1 (no agreement) to 1 (perfect agreement).

1.2. K-center point clustering

Unlike K-means clustering, a class is represented not by its centroid but by its most representative observation, called the center point (medoid).

1.2.1. Algorithm

1. Select K center points at random.
2. Calculate the distance or dissimilarity from every observation to each center point.
3. Assign each observation to its nearest center point.
4. Calculate the sum of the distances from each center point to the observations assigned to it (the total cost).
5. Pick a point that is not a center point and swap it with a center point (it becomes the center point, and the former center point becomes an ordinary observation).
6. Reassign each observation to its nearest center point, forming the classes again.
7. Calculate the total cost again (a toy sketch of this computation follows this list).
8. If this total cost is less than the total cost from step 4, keep the new point as the center point.
9. Repeat steps 5 to 8 until the center points no longer change.
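
The total cost of steps 4 and 7 is easy to express in R; a minimal sketch (the names x, medoids, and total_cost are illustrative assumptions):

 # total cost = sum over observations of the distance to the nearest center point
 # `x` is a numeric matrix, `medoids` a vector of row indices (both hypothetical)
 total_cost <- function(x, medoids) {
   d <- as.matrix(dist(x))[, medoids, drop=FALSE]  # distances from every observation to each medoid
   sum(apply(d, 1, min))                           # each observation joins its nearest medoid
 }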

1.2.2. R language implementation

# Use the pam() function from the cluster package

 pam(x, k, metric="euclidean", stand=FALSE)

 # x is a data matrix or data frame, k is the number of clusters, metric is the
 # similarity/dissimilarity measure to use, and stand indicates whether the variables
 # should be standardized before the metric is computed
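
A minimal usage sketch, reusing the data from section 1.1.2 (choosing k = 3 here mirrors the K-means example and is an assumption):

 library(cluster)

 set.seed(1234)

 fit.pam <- pam(data[-1], k=3, stand=TRUE)  # K-medoids on the raw variables, standardized internally
 fit.pam$medoids                            # the representative observation for each class
 clusplot(fit.pam)                          # project the classes onto the first two principal components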


2. Hierarchical Cluster Analysis

Hierarchical cluster analysis comes in two forms, agglomerative and divisive, which are exactly opposite processes. Agglomerative hierarchical clustering is by far the most commonly used, and this summary covers only that type.

The idea: treat every single observation as a class of its own, then compute the distances between classes (using the methods from section 0.2), pick the two closest classes, and merge them into one. Compute the distances among the resulting classes and again merge the closest pair. Repeat until only one class remains. Every merge along the way is recorded in a dendrogram, and this dendrogram contains all the information we need.

The steps can be abstracted as:

1. Compute the distance between every pair of classes and record it in a proximity matrix.
2. Merge the two closest classes into a new class.
3. Update the proximity matrix to reflect the new class.
4. Repeat steps 2 and 3 (a small worked example follows this list).
5. Stop when only one class remains.
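
A tiny worked example of this loop on five one-dimensional observations (the toy values are made up):

 x <- c(1, 2, 4, 7, 11)
 d <- dist(x)                     # the initial proximity matrix of pairwise distances
 fit <- hclust(d, method="average")
 fit$merge                        # which classes were merged at each step
 fit$height                       # the class distance at each merge
 plot(fit, hang=-1)               # the dendrogram records the whole process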

2.1. Defining the distance between classes

The distance between individual observations was covered in section 0.2: several measures are available, of which the Euclidean distance is the most common.

For the distance between classes, at the start it suffices to consider how the distance between observations is defined, because at that point every class contains exactly one observation. But after the first merge, some classes contain several observations. How should the distance between classes be defined then?

There are generally five definitions (below, the observations in a class are abstracted as points); see the sketch after this list for the corresponding hclust() options:

1. MIN (single linkage): compute the distances between all points in one class and all points in the other; the distance between the two closest points is taken as the distance between the classes.
2. MAX (complete linkage): compute the distances between all points in one class and all points in the other; the distance between the two farthest points is taken as the distance between the classes.
3. Group average (average linkage): compute the distances between all points in one class and all points in the other; the distance between the classes is the average of all these distances, i.e. their sum divided by the number of pairs.
4. Centroid: the distance between the centroids of the two classes is taken as the distance between the classes. The centroid is the mean vector of all observations in a class, as defined in section 1.1.1.
5. Ward's method: each class is again represented by its centroid, but the distance between two classes is measured by the increase in SSE caused by merging them.
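
These five definitions map directly onto the method argument of hclust(); a sketch, assuming the standardized df from section 1.1.2:

 d <- dist(df)                                 # Euclidean by default
 fit.single   <- hclust(d, method="single")    # 1. MIN / single linkage
 fit.complete <- hclust(d, method="complete")  # 2. MAX / complete linkage
 fit.average  <- hclust(d, method="average")   # 3. group average linkage
 fit.centroid <- hclust(d, method="centroid")  # 4. centroid distance (strictly, d should be squared Euclidean here)
 fit.ward     <- hclust(d, method="ward.D2")   # 5. Ward's minimum-variance method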

A quick aside: why is hierarchical clustering better suited to small samples? Because every additional observation adds more pairwise distance computations between classes, and the cost climbs steeply with sample size (the distance matrix alone grows quadratically), so the method is generally applied to small samples. That said, scalable approaches to cluster analysis have been developed; that is a story for another time.

2.2. R language implementation

d <- dist(x, method="euclidean")

 # choose how the distance between points is computed and generate the distance matrix
 # ("euclidean" is one example choice; see section 0.2 for the alternatives)


 fit <- hclust(d, method="average")

 # d is the distance matrix generated above; method selects the definition of the
 # distance between classes ("average" is one example choice from section 2.1)


 plot(fit, hang=-1)

 # draw the dendrogram


 library(NbClust)

 devAskNewPage(ask=TRUE)

 nc <- NbClust(x, distance="euclidean", method="average")  # match the choices used above

 table(nc$Best.n[1,])

 # used to choose the number of clusters: see which count is endorsed by the most criteria


 clusters <- cutree(fit, k=5)

 table(clusters)

 # show how many observations fall into each class of the 5-class solution

 

aggregate(x, by=list(cluster=clusters), median)

 # describe the clusters by their medians

 # x is the data used for clustering, or its standardized version


 plot(fit, hang=-1)

 rect.hclust(fit, k=5)

 # draw the dendrogram and overlay the 5-class solution as rectangles


3. Avoiding non-existent classes

The cubic clustering criterion (CCC) from the NbClust package can help reveal structure that is not really there.

plot(nc$All.index[,4], type="o", ylab="CCC")

 # nc is the result computed by the NbClust() function above

When the CCC values are negative and decreasing for two or more classes, the distribution is typically unimodal, which indicates that no natural classes exist.



