Mathematical modeling: cluster analysis

Cluster analysis concepts

Cluster analysis is the process of grouping data objects based on the information found in the data that describes the objects and their relationships.

Clustering is a technique for discovering the inherent structure in data. It organizes all data instances into groups of similar items, called clusters. Instances in the same cluster are similar (related) to each other, while instances in different clusters are dissimilar (unrelated).

Cluster analysis is a form of unsupervised learning. Unlike supervised learning, there are no labels indicating the category of each data point. Instead, the unlabeled samples are divided into several clusters according to certain rules, revealing the patterns hidden in the data.

In mathematical modeling, cluster analysis can be applied during data preprocessing. For multi-dimensional data with a complex structure, clustering can be used to aggregate and standardize the data. It can also reveal dependencies between data items, so that closely dependent items can be removed or merged, and it can serve as a preprocessing step for other data mining methods (such as association rules or rough set methods). In business problems, cluster analysis is an effective tool for market segmentation: it is used to discover distinct customer groups, to study consumer behavior, and to find new potential markets by characterizing the features of each customer group.

Cluster analysis algorithm

Cluster analysis algorithms are mainly divided into five categories: partition-based clustering methods, hierarchical-based clustering methods, density-based clustering methods, grid-based clustering methods and model-based clustering methods.

  1. Partition-based clustering (k-means algorithm, k-medoids algorithm, k-prototype algorithm)
  2. Hierarchy-based clustering
  3. Density-based clustering (DBSCAN algorithm, OPTICS algorithm, DENCLUE algorithm)
  4. Grid-based clustering
  5. Model-based clustering (fuzzy clustering, Kohonen neural network clustering)

Common algorithms for mathematical modeling

There are many kinds of cluster analysis algorithms. Partition-based methods are the most commonly used in mathematical modeling, so this article mainly introduces k-means clustering.

The k-means clustering algorithm computes the distance between each sample point and the cluster centroids, and assigns each point to the cluster of its nearest centroid. Distance is used to measure the similarity between samples: the farther apart two samples are, the lower their similarity, and vice versa.

k-means algorithm:

  1. Select k initial centroids (k must be specified by the user). The initial centroids can be chosen at random; each centroid defines one cluster.
  2. For every remaining sample point, compute the Euclidean distance to each centroid and assign the point to the cluster of the nearest centroid.
  3. After all sample points have been assigned, recompute the centroid of each cluster from its members, then reassign every sample point to its nearest centroid.
  4. Repeat steps 2 and 3 until the centroids no longer change or the maximum number of iterations is reached.
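The four steps above can be sketched in Python. This is a minimal illustration rather than the article's MATLAB code; the function name `k_means` and the optional `init` parameter (for passing explicit starting centroids) are my own choices.

```python
import numpy as np

def k_means(X, k, init=None, max_iter=100, seed=0):
    """Plain k-means: assign each point to its nearest centroid,
    recompute the centroids, and repeat until they stop moving."""
    rng = np.random.default_rng(seed)
    # Step 1: k initial centroids, random samples unless given explicitly.
    if init is None:
        centroids = X[rng.choice(len(X), size=k, replace=False)]
    else:
        centroids = np.asarray(init, dtype=float)
    for _ in range(max_iter):
        # Step 2: Euclidean distance from every point to every centroid,
        # then assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        # (this sketch assumes no cluster becomes empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

On well-separated data the loop typically converges in a handful of iterations; the stopping test compares consecutive centroid positions, matching step 4.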

Advantages and disadvantages of k-means algorithm

  1. The k-means algorithm is simple in principle, easy to implement, and relatively efficient (advantage)
  2. The clustering results of k-means are easy to interpret, and the algorithm can be applied to high-dimensional data (advantage)
  3. k-means uses a greedy strategy, so it tends to converge to a local optimum and can be slow on large-scale data sets (disadvantage)
  4. k-means is very sensitive to outliers and noise; a small number of such points can distort the cluster means and thus the clustering results (disadvantage)
  5. The choice of initial cluster centers also has a large impact on the result: different initial centers may lead to different clusterings. To address this, researchers proposed the k-means++ algorithm, whose idea is to make the initial cluster centers as far apart from each other as possible

k-means++ algorithm:

  1. Randomly select one sample point x1 as the first cluster center
  2. For every other sample point x, compute the distance d(x) to its nearest existing cluster center
  3. Select a new sample point as the next cluster center with probability \frac{d(x)^{2}}{\sum d(x)^{2}}; the larger the distance, the higher the probability of being selected
  4. Repeat steps 2 and 3 until k cluster centers have been chosen
  5. Run the ordinary k-means procedure starting from these k centers
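The seeding steps can be sketched as follows. This is an illustrative Python fragment, not the article's code; the name `kmeans_pp_init` is hypothetical, and ties between equally distant points are resolved by the random draw.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding: each new center is drawn with probability
    proportional to d(x)^2, the squared distance from x to the
    nearest center already chosen."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]  # step 1: one random sample
    for _ in range(k - 1):
        # step 2: squared distance from each point to its nearest center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # step 3: draw the next center with probability d(x)^2 / sum d(x)^2
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```

The returned centers would then be fed to an ordinary k-means routine (step 5); points already chosen as centers have d(x) = 0 and so cannot be picked twice.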

Advantages and disadvantages of k-means++ algorithm

  1. Improves the quality of the local optimum found and speeds up convergence (advantage)
  2. Requires more computation than randomly selecting the initial centers (disadvantage)

Cluster analysis evaluation

Cluster analysis evaluation is the final step of the clustering process.

The clustering process

  1. Data preparation: including feature standardization and dimensionality reduction;
  2. Feature selection: select the most effective features from the initial features and store them in the vector;
  3. Feature extraction: forming new prominent features by transforming selected features;
  4. Clustering (or grouping): first choose a distance function suited to the feature type (or construct a new one) to measure proximity, then perform the clustering or grouping;
  5. Evaluation of the clustering results, of which there are three main types: external validity evaluation, internal validity evaluation, and correlation test evaluation.
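Step 1, feature standardization, is commonly done with a z-score transform so that no feature dominates the Euclidean distance by scale alone. A minimal sketch (assuming numeric features in the columns of a NumPy array; the helper name `zscore` is my own):

```python
import numpy as np

def zscore(X):
    """Standardize each feature (column) to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features
    return (X - mu) / sigma
```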

A good clustering algorithm should be scalable, able to handle different data types and noisy data, insensitive to the order of the input samples, well-behaved under constraints, easy to interpret, and easy to use.

The quality of cluster analysis results can be judged from internal indicators and external indicators:

  • External indicators judge the quality of a clustering result against a pre-specified reference (for example, known class labels).
  • Internal indicators refer to using only the samples participating in the clustering to judge the quality of the clustering results without resorting to any external reference.
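As one concrete internal indicator, the within-cluster sum of squares uses only the samples and the fitted centroids, with no external reference; lower values mean tighter clusters for a fixed number of clusters. A small sketch (the helper name `wcss` is my own):

```python
import numpy as np

def wcss(X, labels, centroids):
    """Within-cluster sum of squares: total squared distance of each
    point to the centroid of its own cluster (an internal index)."""
    return sum(np.sum((X[labels == j] - c) ** 2)
               for j, c in enumerate(centroids))
```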

Case

Five varieties and eight attributes are clustered:

% Clustering with five varieties and eight attributes
% The MATLAB program is as follows:
X = [7.90  39.77   8.49  12.94  19.27  11.05   2.04  13.29
     7.68  50.37  11.35  13.30  19.25  14.59   2.75  14.87
     9.42  27.93   8.20   8.14  16.17   9.42  10.55   9.76
     9.16  27.98   9.01   9.32  15.99   9.10   1.82  11.35
    10.06  28.64  10.52  10.05  16.18   8.39   1.96  10.81]';  % transposed: each row of X is now one attribute
Y = pdist(X);                  % pairwise distances between the rows of X
SF = squareform(Y);            % distance matrix in square form
Z = linkage(Y, 'average');     % hierarchical clustering with average linkage
dendrogram(Z);                 % plot the cluster tree (dendrogram)
T = cluster(Z, 'maxclust', 3)  % cut the tree into 3 clusters

 

K-means clustering MATLAB code

function [Idx, Center] = K_means(X, xstart)
% K-means clustering
% Idx labels which cluster each point belongs to; Center holds the cluster centers
% X contains all 2-D data points; xstart holds the initial cluster centers

len = size(X, 1);       % number of data points in X
Idx = zeros(len, 1);    % cluster id of each data point

C1 = xstart(1,:);       % center of cluster 1
C2 = xstart(2,:);       % center of cluster 2
C3 = xstart(3,:);       % center of cluster 3

for i_for = 1:100
    % To keep the loop from running too long, a fixed iteration count is used;
    % alternatively, stop when the centers move less than some threshold.
    
    % Update the cluster assignment of each data point
    for i = 1:len
        x_temp = X(i,:);           % extract a single data point
        d1 = norm(x_temp - C1);    % distance to cluster 1
        d2 = norm(x_temp - C2);    % distance to cluster 2
        d3 = norm(x_temp - C3);    % distance to cluster 3
        d = [d1; d2; d3];
        [~, id] = min(d);          % the point belongs to the nearest cluster
        Idx(i) = id;
    end
    
    % Update the cluster centers
    L1 = X(Idx == 1,:);     % points assigned to cluster 1
    L2 = X(Idx == 2,:);     % points assigned to cluster 2
    L3 = X(Idx == 3,:);     % points assigned to cluster 3
    C1 = mean(L1);          % new center of cluster 1
    C2 = mean(L2);          % new center of cluster 2
    C3 = mean(L3);          % new center of cluster 3
end

Center = [C1; C2; C3];  % final cluster centers


%Demo data
%% 1 random sample
%Generate three groups of random points
a = rand(30,2) * 2;
b = rand(30,2) * 5;
c = rand(30,2) * 10;
figure(1);
subplot(2,2,1); 
plot(a(:,1), a(:,2), 'r.'); hold on
plot(b(:,1), b(:,2), 'g*');
plot(c(:,1), c(:,2), 'bx'); hold off
grid on;
title('raw data');

%% 2 K-means cluster
X = [a; b; c];  %data points to be clustered
xstart = [2 2; 5 5; 8 8];  %initial cluster centers
subplot(2,2,2);
plot(X(:,1), X(:,2), 'kx'); hold on
plot(xstart(:,1), xstart(:,2), 'r*'); hold off
grid on;
title('raw data center');

[Idx, Center] = K_means(X, xstart);
subplot(2,2,4);
plot(X(Idx==1,1), X(Idx==1,2), 'kx'); hold on
plot(X(Idx==2,1), X(Idx==2,2), 'gx');
plot(X(Idx==3,1), X(Idx==3,2), 'bx');
plot(Center(:,1), Center(:,2), 'r*'); hold off
grid on;
title('K-means cluster result');

disp('xstart = ');
disp(xstart);
disp('Center = ');
disp(Center);


Origin blog.csdn.net/m0_51260564/article/details/124236947