1.1 What is cluster analysis
Clustering:
- Clustering is the process of dividing a data set into several groups (classes or clusters) so that data objects in the same group are highly similar, while data objects in different groups are dissimilar.
- Similarity or dissimilarity is determined from the values of the objects' descriptive attributes, and is usually expressed as the distance between data objects.
- Cluster analysis is particularly suitable for exploring the relationships between samples in order to make a preliminary assessment of the sample structure.
The difference between clustering and classification:
Clustering is an unsupervised (teacher-less) learning method. Unlike classification, it does not rely on predefined classes or on a training set of samples labeled with class information.
Clustering is therefore learning by observation rather than learning from examples.
Applications of cluster analysis:
- Market analysis: helping analysts discover distinct customer groups within the customer base and characterize each group by its purchasing patterns;
- World Wide Web: clustering web log data to discover groups of users with similar access patterns;
- image processing;
- pattern recognition;
- outlier (isolated point) detection, etc.
1.2 Similarity measure between samples—distance
Euclidean distance is the most commonly used. Note that different distance formulas can lead to different clustering results.
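As a quick illustration (a Python/SciPy sketch, not part of the original notes; the point pair is hypothetical), the same two points can be assigned quite different distances by different formulas:

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, chebyshev

# Two sample points (hypothetical values, for illustration only)
p = np.array([0.0, 0.0])
q = np.array([3.0, 4.0])

print(euclidean(p, q))   # straight-line (Euclidean) distance: 5.0
print(cityblock(p, q))   # Manhattan (absolute-value) distance: 7.0
print(chebyshev(p, q))   # maximum coordinate difference: 4.0
```

Because the distances differ, which points end up "closest" to each other, and hence the clustering, can change with the chosen formula.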
1.3 Similarity measure between variables - similarity coefficient
Taking the data in Example 1 as an example, calculate the correlation coefficients and the cosines of the angles between the indicators (columns).
a=[7.9 39.77 8.49 12.94 19.27 11.05 2.04 13.29
7.68 50.37 11.35 13.3 19.25 14.59 2.75 14.87
9.42 27.93 8.2 8.14 16.17 9.42 1.55 9.76
9.16 27.98 9.01 9.32 15.99 9.1 1.82 11.35
10.06 28.64 10.52 10.05 16.18 8.39 1.96 10.81];
R=corrcoef(a) % correlation coefficients between the indicators (columns)
a1=normc(a) % normalize each column of a to a unit vector
J=a1'*a1 % cosines of the angles between the columns
Display of the running results (output omitted here).
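The same quantities can be sketched in Python with NumPy (an equivalent of the MATLAB code above, not part of the original notes): `np.corrcoef` with `rowvar=False` plays the role of `corrcoef`, and normalizing each column reproduces `normc`.

```python
import numpy as np

# Data matrix from Example 1: rows are samples, columns are indicators
a = np.array([
    [7.9,   39.77,  8.49, 12.94, 19.27, 11.05, 2.04, 13.29],
    [7.68,  50.37, 11.35, 13.30, 19.25, 14.59, 2.75, 14.87],
    [9.42,  27.93,  8.20,  8.14, 16.17,  9.42, 1.55,  9.76],
    [9.16,  27.98,  9.01,  9.32, 15.99,  9.10, 1.82, 11.35],
    [10.06, 28.64, 10.52, 10.05, 16.18,  8.39, 1.96, 10.81],
])

R = np.corrcoef(a, rowvar=False)    # correlation coefficients between columns
a1 = a / np.linalg.norm(a, axis=0)  # normalize each column to a unit vector (normc)
J = a1.T @ a1                       # cosines of the angles between columns

print(np.round(R, 4))
print(np.round(J, 4))
```

Both R and J are symmetric with ones on the diagonal; since all the data values here are positive, every angle cosine in J is positive as well.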
1.4 Distance between classes
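Common choices for the distance between two classes G1 and G2 include the shortest distance (single linkage), the longest distance (complete linkage), and the average of all pairwise distances. A minimal Python sketch, using two hypothetical point sets chosen only for illustration:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical classes (sets of 2-D points)
G1 = np.array([[0.0, 0.0], [0.0, 1.0]])
G2 = np.array([[3.0, 0.0], [4.0, 0.0]])

D = cdist(G1, G2)            # all pairwise Euclidean distances between the classes

d_single   = D.min()         # shortest distance between the classes
d_complete = D.max()         # longest distance
d_average  = D.mean()        # average of all pairwise distances

print(d_single, d_complete, d_average)
```

These are exactly the between-class distances that the linkage options in the next section ('single' by default, 'complete', 'average') plug into the hierarchical clustering procedure.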
1.5 Steps of the hierarchical (pedigree) clustering method
Taking Example 1 as the analysis example.
Cluster analysis example:
b=[7.9 39.77 8.49 12.94 19.27 11.05 2.04 13.29
7.68 50.37 11.35 13.3 19.25 14.59 2.75 14.87
9.42 27.93 8.2 8.14 16.17 9.42 1.55 9.76
9.16 27.98 9.01 9.32 15.99 9.1 1.82 11.35
10.06 28.64 10.52 10.05 16.18 8.39 1.96 10.81];
d1=pdist(b); % Euclidean distance between every pair of rows of b; output is a row vector
% D=squareform(d1) % display the distances as a matrix (a real symmetric matrix)
D=tril(squareform(d1)) % display the pairwise Euclidean distances as a lower-triangular matrix
% The two lines above may be omitted
z1=linkage(d1) % cluster the data, merging at increasing distances until everything is in one class (default: shortest distance, i.e. single linkage)
% z2=linkage(d1,'complete'); % longest distance (complete linkage)
% z3=linkage(d1,'average'); % average distance
% z4=linkage(d1,'centroid'); % centroid distance
% z5=linkage(d1,'ward'); % Ward's method (minimum sum of squared deviations)
H=dendrogram(z1) % draw the dendrogram (pedigree clustering diagram)
T=cluster(z1,'maxclust',3) % cut z1 into three classes and output the class assignments
The running results show the output when the data are divided into three classes. The figure below shows that the first row forms one group, the second row forms one group, and the third, fourth, and fifth rows form one group.
When the data are instead divided into two classes, the figure below shows that the first and second rows form one group, and the third, fourth, and fifth rows form the other.
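The same grouping can be reproduced with SciPy's hierarchical-clustering routines (a Python sketch equivalent to the MATLAB code above, not part of the original notes):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

b = np.array([
    [7.9,   39.77,  8.49, 12.94, 19.27, 11.05, 2.04, 13.29],
    [7.68,  50.37, 11.35, 13.30, 19.25, 14.59, 2.75, 14.87],
    [9.42,  27.93,  8.20,  8.14, 16.17,  9.42, 1.55,  9.76],
    [9.16,  27.98,  9.01,  9.32, 15.99,  9.10, 1.82, 11.35],
    [10.06, 28.64, 10.52, 10.05, 16.18,  8.39, 1.96, 10.81],
])

z1 = linkage(pdist(b), method='single')     # single linkage, as in MATLAB's default
T3 = fcluster(z1, 3, criterion='maxclust')  # cut into three classes
T2 = fcluster(z1, 2, criterion='maxclust')  # cut into two classes

print(T3)  # rows 3-5 share one label; rows 1 and 2 each form their own class
print(T2)  # rows 1-2 share one label; rows 3-5 share another
```

The three-class cut puts row 1 alone, row 2 alone, and rows 3-5 together; the two-class cut merges rows 1 and 2, matching the results described above.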
K-means clustering algorithm
The K-means (k-means) algorithm takes k as a parameter and divides the n objects into k clusters, so that objects within a cluster are highly similar while the similarity between clusters is low.
Similarity is computed with respect to the mean value of the objects in a cluster, which can be regarded as the cluster's center of gravity (centroid).
The figure below shows the result after each of three iterations.
K-means clustering code for the following example:
x=[0 1 0 1 2 1 2 3 6 7 8 6 7 8 9 7 8 9 8 9;
0 0 1 1 1 2 2 2 6 6 6 7 7 7 7 8 8 8 9 9];
figure(1)
plot(x(1,:),x(2,:),'r*') % horizontal axis: all columns of row 1; vertical axis: all columns of row 2
% Step 1: choose the initial cluster centers; let K=2
Z1=[x(1,1);x(2,1)];
Z2=[x(1,2);x(2,2)]; % cluster centers z1=(0,0) and z2=(1,0)
R1=[];
R2=[]; % two clusters; these store the members of each
K=1; % iteration counter
dif1=inf;
dif2=inf;
% Step 2: compute the distance from every point to each cluster center
while (dif1>eps||dif2>eps) % eps (machine epsilon) is the tolerance; iterate until both centers stop moving
for i=1:20
dist1=sqrt((x(1,i)-Z1(1)).^2+(x(2,i)-Z1(2)).^2);
dist2=sqrt((x(1,i)-Z2(1)).^2+(x(2,i)-Z2(2)).^2);
temp=[x(1,i),x(2,i)]';
if dist1<dist2
R1=[R1,temp]; % assign the current point to cluster R1
else
R2=[R2,temp]; % assign the current point to cluster R2
end
end
Z11=mean(R1,2); % mean(A,2) returns a column vector of the row means
Z22=mean(R2,2); % the new cluster centers
t1=Z1-Z11; % test whether the centers have changed; many methods work, this is a simple one
t2=Z2-Z22;
dif1=sqrt(dot(t1,t1));
dif2=sqrt(dot(t2,t2)); % dot is the dot product of two vectors
Z1=Z11;
Z2=Z22; % store the new cluster centers in the original variables
K=K+1; % increment the iteration counter
R1=[]; % clear the memberships for the next pass
R2=[];
end
hold on
plot([Z1(1),Z2(1)],[Z1(2),Z2(2)],'g+') % mark the final cluster centers with green plus signs
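The same computation can be cross-checked in Python (a NumPy sketch of the identical procedure, not part of the original notes). With these two well-separated groups of points and the same initial centers, K-means converges to centers near (1.25, 1.125) and (7.67, 7.33):

```python
import numpy as np

# The 20 sample points from the example above (each column of x is a point)
x = np.array([
    [0, 1, 0, 1, 2, 1, 2, 3, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 8, 9],
    [0, 0, 1, 1, 1, 2, 2, 2, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9, 9],
], dtype=float)

z1, z2 = x[:, 0].copy(), x[:, 1].copy()  # initial centers (0,0) and (1,0)

while True:
    # assign every point to the nearer of the two centers
    d1 = np.linalg.norm(x - z1[:, None], axis=0)
    d2 = np.linalg.norm(x - z2[:, None], axis=0)
    mask = d1 < d2
    new_z1 = x[:, mask].mean(axis=1)     # recompute each center as its cluster mean
    new_z2 = x[:, ~mask].mean(axis=1)
    if np.allclose(new_z1, z1) and np.allclose(new_z2, z2):
        break                            # centers no longer move: converged
    z1, z2 = new_z1, new_z2

print(z1, z2)  # approximately [1.25 1.125] and [7.667 7.333]
```

The first center is the mean of the eight points near the origin and the second is the mean of the twelve points in the upper-right group, confirming the two-cluster structure the MATLAB plot displays.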