Mathematical Modeling Algorithms and Applications: Introduction to and Practice of Basic Cluster Analysis

1.1 What is cluster analysis

Clustering:

  • Clustering is the process of dividing a data set into several groups (classes) or clusters, so that data objects within the same group are highly similar, while data objects in different groups are dissimilar.
  • Similarity or dissimilarity is determined from the values of the attributes that describe the data, and is usually expressed as the distance between data objects.
  • Cluster analysis is particularly suitable for exploring the relationships among samples in order to make a preliminary assessment of the sample structure.

The difference between clustering and classification:
Clustering is an unsupervised learning method. Unlike classification, it does not rely on predefined classes or on a training set of samples labeled with class information; clustering is therefore learning by observation rather than learning from examples.
Applications of cluster analysis:

  • Market analysis: helping analysts discover distinct customer groups within the overall customer base and characterize each group by its purchase patterns;
  • World Wide Web: clustering web log data to discover common user access patterns;
  • image processing;
  • pattern recognition;
  • outlier (isolated point) detection, etc.

1.2 Similarity measure between samples—distance


Euclidean distance is the most commonly used. Note that different distance formulas can lead to different clustering results.
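To illustrate how the choice of metric changes which points count as "close", here is a minimal Python sketch (the points are made up for illustration) in which the nearest point to `a` flips between the Euclidean and Manhattan metrics:

```python
import math

def euclidean(p, q):
    # straight-line distance: sqrt of the sum of squared coordinate differences
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def manhattan(p, q):
    # city-block distance: sum of absolute coordinate differences
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

a, b, c = (0, 0), (3, 3), (0, 5)
# Euclidean: b is nearer to a (sqrt(18) ≈ 4.24 < 5)
# Manhattan: c is nearer to a (5 < 6)
```

So a clustering built on Euclidean distance could merge `a` with `b` first, while one built on Manhattan distance would merge `a` with `c` first.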

1.3 Similarity measure between variables - similarity coefficient

Taking the data of Example 1, compute the correlation coefficient and the cosine of the angle between each pair of indicators:

a=[7.9 39.77 8.49 12.94 19.27 11.05 2.04 13.29
7.68 50.37 11.35 13.3 19.25 14.59 2.75 14.87
9.42 27.93 8.2 8.14 16.17 9.42 1.55 9.76
9.16 27.98 9.01 9.32 15.99 9.1 1.82 11.35
10.06 28.64 10.52 10.05 16.18 8.39 1.96 10.81];
R=corrcoef(a)% correlation coefficients between the indicators (columns)
a1=normc(a)% normalize each column of a to a unit vector
J=a1'*a1% cosine of the angle between each pair of columns

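For readers without MATLAB, the same computation can be sketched in Python with NumPy (an equivalent rewrite of the MATLAB code above, not part of the original post):

```python
import numpy as np

# The data of Example 1: rows are samples, columns are indicators
a = np.array([
    [7.90, 39.77,  8.49, 12.94, 19.27, 11.05, 2.04, 13.29],
    [7.68, 50.37, 11.35, 13.30, 19.25, 14.59, 2.75, 14.87],
    [9.42, 27.93,  8.20,  8.14, 16.17,  9.42, 1.55,  9.76],
    [9.16, 27.98,  9.01,  9.32, 15.99,  9.10, 1.82, 11.35],
    [10.06, 28.64, 10.52, 10.05, 16.18,  8.39, 1.96, 10.81],
])

R = np.corrcoef(a, rowvar=False)    # correlation between indicators (columns), like corrcoef(a)
a1 = a / np.linalg.norm(a, axis=0)  # normalize each column to a unit vector, like normc(a)
J = a1.T @ a1                       # cosine of the angle between each pair of columns
```

Both `R` and `J` are symmetric 8×8 matrices with ones on the diagonal.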

1.4 Distance between classes

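The distance between two classes is commonly defined in several standard ways: shortest distance (nearest neighbor), longest distance (furthest neighbor), class-average distance, centroid distance, and Ward's sum of squared deviations; these are the same linkage options used in the code later in this post. A small Python sketch of the first four definitions (the two example classes are made up for illustration):

```python
import numpy as np
from scipy.spatial.distance import cdist

G1 = np.array([[0.0, 0.0], [1.0, 0.0]])  # class G1: two sample points
G2 = np.array([[4.0, 3.0], [5.0, 3.0]])  # class G2: two sample points

D = cdist(G1, G2)                  # all pairwise distances between the two classes
d_single   = D.min()               # shortest distance (nearest neighbor)
d_complete = D.max()               # longest distance (furthest neighbor)
d_average  = D.mean()              # class-average distance
d_centroid = np.linalg.norm(G1.mean(axis=0) - G2.mean(axis=0))  # centroid distance
```

By construction the shortest distance never exceeds the class average, which never exceeds the longest distance.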

1.5 Steps of the hierarchical (pedigree) clustering method

Take Example 1 as an illustration.

Cluster Analysis Example


b=[7.9 39.77 8.49 12.94 19.27 11.05 2.04 13.29
7.68 50.37 11.35 13.3 19.25 14.59 2.75 14.87
9.42 27.93 8.2 8.14 16.17 9.42 1.55 9.76
9.16 27.98 9.01 9.32 15.99 9.1 1.82 11.35
10.06 28.64 10.52 10.05 16.18 8.39 1.96 10.81];
d1=pdist(b);% Euclidean distance between every pair of rows of b, returned as a row vector
% D=squareform(d1)% display the distances as a symmetric matrix
D=tril(squareform(d1))% display the pairwise distances as a lower-triangular matrix
% The two lines above may be omitted
z1=linkage(d1)% hierarchical clustering; the default method is 'single' (shortest distance)
% z2=linkage(d1,'complete');% longest distance
% z3=linkage(d1,'average');% class-average distance
% z4=linkage(d1,'centroid');% centroid distance
% z5=linkage(d1,'ward');% Ward's sum of squared deviations
H=dendrogram(z1)% plot the dendrogram (pedigree clustering diagram)
T=cluster(z1,'maxclust',3)% cut z1 into three clusters and output the assignments

When the result is cut into three clusters, the output T shows that row 1 forms one cluster, row 2 forms a second, and rows 3, 4, and 5 form the third.

When the result is cut into two clusters instead, rows 1 and 2 form one cluster, and rows 3, 4, and 5 form the other.
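The same hierarchical clustering can be reproduced in Python with SciPy (an equivalent sketch of the MATLAB example above, not part of the original post):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

b = np.array([
    [7.90, 39.77,  8.49, 12.94, 19.27, 11.05, 2.04, 13.29],
    [7.68, 50.37, 11.35, 13.30, 19.25, 14.59, 2.75, 14.87],
    [9.42, 27.93,  8.20,  8.14, 16.17,  9.42, 1.55,  9.76],
    [9.16, 27.98,  9.01,  9.32, 15.99,  9.10, 1.82, 11.35],
    [10.06, 28.64, 10.52, 10.05, 16.18,  8.39, 1.96, 10.81],
])

d1 = pdist(b)                      # pairwise Euclidean distances between the rows
z1 = linkage(d1, method='single')  # shortest-distance (single) linkage, like MATLAB's default
T3 = fcluster(z1, t=3, criterion='maxclust')  # cut into three clusters
T2 = fcluster(z1, t=2, criterion='maxclust')  # cut into two clusters
```

With three clusters, rows 1 and 2 each stand alone and rows 3-5 group together; with two clusters, rows 1-2 merge into one group, matching the MATLAB output described above.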

K-means clustering algorithm

The K-means algorithm takes k as a parameter and divides the n objects into k clusters, so that objects within a cluster are highly similar while objects in different clusters are dissimilar.
Similarity is computed with respect to the mean of the objects in a cluster, which is regarded as the cluster's center of gravity.
In the example below the centers stabilize after three iterations. The following MATLAB code finds the K-means clustering of these 20 points (with K = 2).

x=[0 1 0 1 2 1 2 3 6 7 8 6 7 8 9 7 8 9 8 9;
    0 0 1 1 1 2 2 2 6 6 6 7 7 7 7 8 8 8 9 9];
figure(1)
plot(x(1,:),x(2,:),'r*')% first row on the horizontal axis, second row on the vertical axis
% Step 1: choose the initial cluster centers, with K=2
Z1=[x(1,1);x(2,1)];
Z2=[x(1,2);x(2,2)];% initial centers z1(0,0) and z2(1,0)
K=1;% iteration counter
dif1=inf;
dif2=inf;
% Step 2: assign each point to the nearer center, then recompute the centers,
% repeating until the centers move by less than eps
while (dif1>eps&&dif2>eps)% eps (machine precision) serves as the convergence tolerance
    R1=[];
    R2=[];% empty the two clusters before reassigning the points
    for i=1:20
        dist1=sqrt((x(1,i)-Z1(1)).^2+(x(2,i)-Z1(2)).^2);
        dist2=sqrt((x(1,i)-Z2(1)).^2+(x(2,i)-Z2(2)).^2);
        temp=[x(1,i),x(2,i)]';
        if dist1<dist2
            R1=[R1,temp];% assign the current point to cluster R1
        else
            R2=[R2,temp];% assign the current point to cluster R2
        end
    end
    Z11=mean(R1,2);% mean(A,2) averages along each row, giving a column vector
    Z22=mean(R2,2);% the new cluster centers
    t1=Z1-Z11;% test whether the centers have changed (one simple way among many)
    t2=Z2-Z22;
    dif1=sqrt(dot(t1,t1));
    dif2=sqrt(dot(t2,t2));% dot is the dot product of two vectors
    Z1=Z11;
    Z2=Z22;% store the new centers in the original variables
    K=K+1;% increment the iteration counter
end
hold on
plot([Z1(1),Z2(1)],[Z1(2),Z2(2)],'g+')% mark the final centers with green plus signs
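The same procedure can be sketched in Python with NumPy; the 20 points, K = 2, and the choice of the first two points as initial centers follow the MATLAB example above:

```python
import numpy as np

# The same 20 points as the MATLAB example, one point per row
x = np.array([
    [0, 0], [1, 0], [0, 1], [1, 1], [2, 1], [1, 2], [2, 2], [3, 2],
    [6, 6], [7, 6], [8, 6], [6, 7], [7, 7], [8, 7], [9, 7], [7, 8],
    [8, 8], [9, 8], [8, 9], [9, 9],
], dtype=float)

K = 2
centers = x[:K].copy()             # step 1: take the first K points as the initial centers
for _ in range(100):               # safety cap on the number of iterations
    # step 2: assign every point to its nearest center
    dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # step 3: recompute each center as the mean of its members
    new_centers = np.array([x[labels == k].mean(axis=0) for k in range(K)])
    if np.allclose(new_centers, centers):  # stop when the centers no longer move
        break
    centers = new_centers
```

On this data the loop converges quickly, splitting the points into the lower-left group (first 8 points) and the upper-right group (last 12 points), with centers at their respective means.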

Origin blog.csdn.net/Luohuasheng_/article/details/128545839