Machine learning - K-means (clustering) and face recognition

    For details of Yiru’s complete project/code, please refer to GitHub: https://github.com/yiru1225 (please credit the source when reprinting; don't just freeload, star the projects, thanks)

Table of contents

Series Article Directory

1. The principle, process and analysis of K-means clustering algorithm

1. Principle of K-means algorithm

2. K-means algorithm process

3. K-means algorithm analysis

1. Advantages

2. Disadvantages

2. Simple practice and visualization of K-means clustering

3. K-means for face and object clustering and visualization

1. Data import

2. K-means clustering

3. LDA dimensionality reduction

4. Dimensionality reduction and visualization

5. Results and Analysis

5.1 Visualization Results

5.2 The influence of data set and K on clustering accuracy

5.3 Comparison before and after K-means optimization

4. Innovative Clustering Algorithm Design

5. Analysis and optimization of some problems in K-means clustering

① Inaccurate clustering

② There is an infinite loop in the program

③ Selection of K value

④ The program is prone to empty clusters (the final number of clusters is less than K)

6. Others

1. Datasets and resources

2. References

Summary


Series Article Directory

This series of blogs focuses on the concepts, principles, and code practice of machine learning, and does not include tedious mathematical derivations (if you have any questions, please discuss them in the comment area or contact me directly by private message).

The code can be copied in full, but it is far more meaningful to understand the principle and process and reproduce it yourself!

Chapter 1  Machine Learning - PCA (Principal Component Analysis) and Face Recognition

Chapter 2  Machine Learning - LDA (Linear Discriminant Analysis) and Face Recognition_@李忆如的博客-CSDN博客

Chapter 3  Machine Learning - LR (Linear Regression), LRC (Linear Regression Classification) and Face Recognition_@李梦如的博客

Chapter 4  Machine Learning - SVM (Support Vector Machine) and Face Recognition_@李梦如的博客

Chapter 5 Machine Learning - K-means (clustering) and face recognition


Synopsis

This blog mainly introduces the K-means (clustering) algorithm, covering its principle, process, and analysis; it uses classic K-means for simple clustering and visualization, and uses K-means and its optimized variants for face recognition and visualization (data sets and MATLAB code attached).

K-means clustering of the MNIST handwritten digit dataset:

Optimization method - K-means realizes clustering of handwritten digital images_@李敬如的博客-CSDN博客


1. The principle, process and analysis of K-means clustering algorithm

1. Principle of K-means algorithm

K-means is an unsupervised learning method: by repeatedly assigning each data point to its nearest seed point (centroid), it automatically groups similar objects into the same cluster (forming k clusters in total), and it is commonly used for cluster analysis. The key ingredient of K-means is the measure used to find the center of a point group, namely the Euclidean distance. The formula (taking n-dimensional data as an example) is as follows:
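
The Euclidean distance between an n-dimensional sample x and a centroid c is the standard formula below (reproduced here in LaTeX, since the original image of the formula is not shown):

d(x, c)=\sqrt{\sum_{i=1}^{n}\left(x_{i}-c_{i}\right)^{2}}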

A simple example of the K-Means algorithm (K=2) is shown below:

2. K-means algorithm process

1. Randomly select k samples from the data set as the initial cluster centers a_1, a_2, ..., a_k;

2. For each sample x_i in the data set, calculate its distance to the k cluster centers and assign it to the class of the nearest cluster center;

3. For each class c_j, recalculate its cluster center a_{j}=\frac{1}{\left|c_{j}\right|} \sum_{x \in c_{j}} x (that is, the centroid of all samples belonging to that class);

4. Repeat steps 2 and 3 until a termination condition is reached (maximum number of iterations, minimum error change (the centroid position change falls below a specified threshold, 0.0001 by default), etc.), thereby determining the final cluster centers.
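
For reference, a compact MATLAB sketch of steps 1~4 is given below (a minimal illustration only, assuming data is an m-by-n matrix and K is given; pdist2 comes from the Statistics and Machine Learning Toolbox, the same toolbox as kmeans):

% Minimal K-means iteration sketch (illustrative; the full hand-written version appears in Section 2)
center = data(randperm(size(data,1), K), :);          % step 1: K distinct samples as initial centers
for iter = 1 : 10000                                  % cap the number of iterations
    dist = pdist2(data, center);                      % m-by-K distances to every center
    [~, label] = min(dist, [], 2);                    % step 2: assign each sample to its nearest center
    new_center = center;
    for j = 1 : K                                     % step 3: recompute each centroid
        if any(label == j)
            new_center(j,:) = mean(data(label == j, :), 1);
        end
    end
    if max(sqrt(sum((new_center - center).^2, 2))) < 0.0001 % step 4: stop when centroids barely move
        break;
    end
    center = new_center;
end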

3. K-means algorithm analysis

1. Advantages

① The algorithm is simple, easy to understand, and the clustering effect is good

② When dealing with large data sets, the algorithm can guarantee better scalability

③ When the clusters are approximately Gaussian-distributed, the effect is relatively good

2. Disadvantages

① The K value needs to be set manually, and different K values have a great influence on the experimental results

② Sensitive to the initial cluster center, different selection methods have a greater impact on the experimental results

③ Sensitive to outliers

④ Each sample can only be assigned to a single cluster, which is unsuitable for tasks where samples belong to multiple classes

⑤ Not suitable for highly scattered data, heavily imbalanced class sizes, or non-convex cluster shapes

2. Simple practice and visualization of K-means clustering

① Problem description: To master the K-means clustering algorithm and the display of its results, run a clustering experiment on 2~3 classes of points (10 points per class) in 2D or 3D space, and indicate the clustering results with different colors and symbols.

② Core of the implementation: first use the mvnrnd function to generate 3 groups of Gaussian-distributed data (on which clustering works relatively well), then follow the K-means process in section 1.2 (or call the library function) to iteratively determine the k cluster centroids and complete the clustering.

③ The hand-written code is as follows (a library call would also work):

clear;
clc;
times = 0;
N = input('Please set the number of clusters: ');% set the number of clusters
%% First group of data
mu1=[0 0];  % mean
S1=[0.1 0 ; 0 0.1];  % covariance
data1=mvnrnd(mu1,S1,10);   % generate Gaussian-distributed data
%% Second group of data
mu2=[-1.25 1.25];
S2=[0.1 0 ; 0 0.1];
data2=mvnrnd(mu2,S2,10);
%% Third group of data
mu3=[1.25 1.25];
S3=[0.1 0 ; 0 0.1];
data3=mvnrnd(mu3,S3,10);
%% Display the raw data
plot(data1(:,1),data1(:,2),'b+');
hold on;
plot(data2(:,1),data2(:,2),'b+');
plot(data3(:,1),data3(:,2),'b+');
%% Initialization
data = [data1;data2;data3];
[m,n] = size(data); % m = 30, n = 2
center = zeros(N,n);% initialize the cluster centers as an N-by-n zero matrix
pattern = data;     % copy the whole data set into the pattern matrix
%% Algorithm
for x = 1 : N
    center(x,:) = data(randi(m,1),:); % pick the initial cluster centers at random; randi(m,1) returns a random index in 1..m (fixed from randi(3,1), which only sampled the first 3 rows)
end
while true
distence = zeros(1,N);   % 1-by-N zero matrix of distances
num = zeros(1,N);        % 1-by-N zero matrix of cluster sizes
new_center = zeros(N,n); % N-by-n zero matrix for the new centers
%% Label every point with 1, 2, 3, ..., N
for x = 1 : m
    for y = 1 : N
        distence(y) = norm(data(x,:) - center(y,:)); % norm gives the distance to each cluster center
    end
    [~,temp] = min(distence); % find the smallest distance; ~ is the distance value, temp is the cluster index
    pattern(x,n + 1) = temp;         
end
times = times+1;
k = 0;
%% Sum the coordinates of all points in the same cluster and compute the new center coordinates
for y = 1 : N
    for x = 1 : m
        if pattern(x,n + 1) == y
           new_center(y,:) = new_center(y,:) + pattern(x,1:n);
           num(y) = num(y) + 1;
        end
    end
    new_center(y,:) = new_center(y,:) / num(y);
    if norm(new_center(y,:) - center(y,:)) < 0.0001 % minimum error change (threshold)
        k = k + 1;
    end
end
if k == N || times > 10000 % termination condition (with a maximum-iteration limit added)
     break;
else
     center = new_center;
end
end
[m, n] = size(pattern); %[m,n] = [30,3]
 
%% Finally display the clustered data
figure;
hold on;
for i = 1 : m
    if pattern(i,n) == 1 
         plot(pattern(i,1),pattern(i,2),'r*');
         plot(center(1,1),center(1,2),'ko');
    elseif pattern(i,n) == 2
         plot(pattern(i,1),pattern(i,2),'g*');
         plot(center(2,1),center(2,2),'ko');
    elseif pattern(i,n) == 3
         plot(pattern(i,1),pattern(i,2),'b*');
         plot(center(3,1),center(3,2),'ko');
    elseif pattern(i,n) == 4
         plot(pattern(i,1),pattern(i,2),'y*');
         plot(center(4,1),center(4,2),'ko');
    else
         plot(pattern(i,1),pattern(i,2),'m*');
         plot(center(5,1),center(5,2),'ko');
    end
end

④ Running K-means clustering (the termination condition is that the centroid position change is less than the threshold (0.0001) or the number of iterations exceeds the limit (10000)), the results for K=2 and K=3 are shown in the figures below:

Tips: points of the same color belong to the same K-means cluster, and the black circles mark the final centroids

K=2 before and after clustering

K=3 before and after clustering

Analysis: As the two figures above show, the program clusters the points according to the manually set K value, and the clustering effect is good.

3. K-means for face and object clustering and visualization

Problem description: Run clustering experiments on face images (the face images of the first 2~3 people) and rotating objects (the images of the first 2~3 classes in the COIL20 dataset), express the results with different colors and symbols, place the corresponding image next to each point, and list the clustering accuracy on different databases for different K.

1. Data import

Use imread to import the face or object database in batches, or directly load the corresponding mat file. During import, reshape each face into a column vector to form reshaped_faces, take out 2~3 classes as the data to be clustered, and abstract the import into a framework that adapts to different datasets (this experiment's framework supports the ORL, AR, FERET, and COIL20 datasets).

Tips: The code can be found in the second article of this series (LDA and face recognition); it is basically the same.
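
For completeness, a minimal import sketch is given below (the folder layout, file naming, and image format here are assumptions made for illustration; the actual import code is in the LDA article):

% Minimal batch-import sketch (assumed ORL-style layout: orl/s1/1.pgm, orl/s2/1.pgm, ...)
class_num = 3;           % take the first 3 people/classes
pic_num_of_each = 10;    % assumed number of images per class
reshaped_faces = [];
for i = 1 : class_num
    for j = 1 : pic_num_of_each
        img = double(imread(sprintf('orl/s%d/%d.pgm', i, j))); % path/format are assumptions
        reshaped_faces = [reshaped_faces, img(:)];             % pull each face into a column vector
    end
end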

2. K-means clustering

K = 3; % set K for K-means
% K-means training
test_data = reshaped_faces(:,1:pic_num_of_each * 3);
[idx,center] = kmeans(test_data',K); % idx holds the cluster labels, center holds the centroids
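
To report the clustering accuracy discussed later, the cluster labels in idx first have to be matched to the true classes; a simple majority-vote matching sketch (true_label is an assumed ground-truth vector, built here from the class-ordered layout of reshaped_faces) is:

% Map each cluster to the majority true class it contains, then score
true_label = repelem(1:K, pic_num_of_each)';    % assumed ground truth: pic_num_of_each images per class, in order
mapped = zeros(size(idx));
for j = 1 : K
    members = (idx == j);
    if any(members)
        mapped(members) = mode(true_label(members)); % majority class inside cluster j
    end
end
accuracy = mean(mapped == true_label);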

3. LDA dimensionality reduction

The code is basically the same as in the second article of this series (LDA and face recognition); the dimensionality-reduction method used in this experiment is pseudo-inverse LDA.
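
For readers who do not want to jump back to that article, a minimal pseudo-inverse LDA sketch follows (the label vector and the exact variable names are assumptions; the code in the LDA article should be taken as authoritative):

% Pseudo-inverse LDA sketch: pinv(Sw) replaces inv(Sw) so a singular within-class scatter is tolerated
% label is an assumed vector giving the class of each column of reshaped_faces
all_mean = mean(reshaped_faces, 2);               % overall mean face
dim = size(reshaped_faces, 1);
Sw = zeros(dim); Sb = zeros(dim);                 % for large images, PCA is often applied first to keep these manageable
for c = unique(label(:))'
    Xc = reshaped_faces(:, label == c);
    mc = mean(Xc, 2);
    Sw = Sw + (Xc - mc) * (Xc - mc)';             % within-class scatter
    Sb = Sb + size(Xc,2) * (mc - all_mean) * (mc - all_mean)'; % between-class scatter
end
[V, D] = eig(pinv(Sw) * Sb);
[~, order] = sort(diag(real(D)), 'descend');
eigen_vectors = real(V(:, order));                % projection directions, sorted by eigenvalue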

4. Dimensionality reduction and visualization

% Dimensionality reduction and visualization
class_num_to_show = 3;
pic_num_in_a_class = pic_num_of_each;
pic_to_show = class_num_to_show * pic_num_in_a_class;
m = 3; % number of dimensions for visualization
% take the corresponding number of eigenvectors
project_matrix = eigen_vectors(:,1:m);
% projection
projected_test_data = project_matrix' * (reshaped_faces - all_mean);
projected_test_data = projected_test_data(:,1:pic_to_show);
pattern = projected_test_data';

% Visualization
if(m == 2)
figure;
[max_xy,index]=max(pattern); % used to place the original-class legend on the figure
for i = 1 : pic_num_of_each * 3
    if(i <= pic_num_of_each)
        if idx(i,1) == 1 
         scatter(pattern(i,1),pattern(i,2),'o','r*');
    elseif idx(i,1) == 2
         scatter(pattern(i,1),pattern(i,2),'o','g*');
    elseif idx(i,1) == 3
         scatter(pattern(i,1),pattern(i,2),'o','b*');
    elseif idx(i,1) == 4
         scatter(pattern(i,1),pattern(i,2),'o','y*');
        end
    elseif(i <= pic_num_of_each * 2)
        if idx(i,1) == 1 
         scatter(pattern(i,1),pattern(i,2),'^','r*');
    elseif idx(i,1) == 2
         scatter(pattern(i,1),pattern(i,2),'^','g*');
    elseif idx(i,1) == 3
         scatter(pattern(i,1),pattern(i,2),'^','b*');
    elseif idx(i,1) == 4
         scatter(pattern(i,1),pattern(i,2),'^','y*');
        end
    elseif(i <= pic_num_of_each * 3)
        if idx(i,1) == 1 
         scatter(pattern(i,1),pattern(i,2),'x','r*');
    elseif idx(i,1) == 2
         scatter(pattern(i,1),pattern(i,2),'x','g*');
    elseif idx(i,1) == 3
         scatter(pattern(i,1),pattern(i,2),'x','b*');
    elseif idx(i,1) == 4
         scatter(pattern(i,1),pattern(i,2),'x','y*');
        end
    end 
hold on;
end
text(max_xy(1,1)-10,max_xy(1,2),'Class 1: o');
text(max_xy(1,1)-10,max_xy(1,2)-15,'Class 2: ▲');
text(max_xy(1,1)-10,max_xy(1,2)-30,'Class 3: x');
end

if(m==3)
figure
[max_xyz,index]=max(pattern); % used to place the original-class legend on the figure
for i = 1 :pic_num_of_each * 3
    if(i <= pic_num_of_each)
         if idx(i,1) == 1 
         scatter3(pattern(i,1),pattern(i,2),pattern(i,3),'o','r*');
    elseif idx(i,1) == 2
         scatter3(pattern(i,1),pattern(i,2),pattern(i,3),'o','g*');
    elseif idx(i,1) == 3
         scatter3(pattern(i,1),pattern(i,2),pattern(i,3),'o','b*');
    elseif idx(i,1) == 4
         scatter3(pattern(i,1),pattern(i,2),pattern(i,3),'o','y*');
         end
    elseif(i <= pic_num_of_each * 2)
         if idx(i,1) == 1 
         scatter3(pattern(i,1),pattern(i,2),pattern(i,3),'^','r*');
    elseif idx(i,1) == 2
         scatter3(pattern(i,1),pattern(i,2),pattern(i,3),'^','g*');
    elseif idx(i,1) == 3
         scatter3(pattern(i,1),pattern(i,2),pattern(i,3),'^','b*');
    elseif idx(i,1) == 4
         scatter3(pattern(i,1),pattern(i,2),pattern(i,3),'^','y*');
         end
    elseif(i <= pic_num_of_each * 3)
         if idx(i,1) == 1 
         scatter3(pattern(i,1),pattern(i,2),pattern(i,3),'x','r*');
    elseif idx(i,1) == 2
         scatter3(pattern(i,1),pattern(i,2),pattern(i,3),'x','g*');
    elseif idx(i,1) == 3
         scatter3(pattern(i,1),pattern(i,2),pattern(i,3),'x','b*');
    elseif idx(i,1) == 4
         scatter3(pattern(i,1),pattern(i,2),pattern(i,3),'x','y*');
         end
    end 
    hold on;
end    
text(max_xyz(1,1)-10,max_xyz(1,2),max_xyz(1,3),'Class 1: o');
text(max_xyz(1,1)-10,max_xyz(1,2)-15,max_xyz(1,3)-15,'Class 2: ▲');
text(max_xyz(1,1)-10,max_xyz(1,2)-30,max_xyz(1,3)-30,'Class 3: x');
end

5. Results and Analysis

Data sets used in this experiment: face (ORL5646, AR5040) and object (COIL20); the code also works with other data sets.

5.1 Visualization Results

K-means clustering (K=2 and K=3) is applied to the different data sets; the two-dimensional and three-dimensional visualization results are shown below (K=2 uses the AR data set as an example, K=3 uses the ORL data set):

Tips: points of the same color belong to the same K-means cluster; points of the same shape belong to the same class in the original data set

Two-dimensional clustering visualization of the AR dataset (K=2)

Three-dimensional clustering visualization of the AR dataset (K=2)

Two-dimensional clustering visualization of the ORL dataset (K=3)

Three-dimensional clustering visualization of the ORL dataset (K=3)

Analysis: The four figures above show that, for the AR and ORL datasets, K-means clusters the different faces largely correctly (different shapes map cleanly to different colors), with high clustering accuracy and a good overall effect.

5.2 The influence of data set and K on clustering accuracy

K-means clustering was tested on different data sets and different values of K; for each data set, 20 experiments were run per K and the average clustering accuracy was recorded. The results are as follows:

Analysis: As the two figures above show, both the choice of K and the data set affect the clustering result, which is consistent with the theoretical analysis. In this experiment, the clustering accuracy of K-means on the COIL20 dataset is much lower than on the other two datasets, and it decreases as K increases; the reason is that K-means has difficulty handling rotating objects. For the ORL and AR datasets, the accuracy fluctuates only slightly with K, the average accuracy is higher, and the clustering effect is good.

5.3 Comparison before and after K-means optimization

On the different data sets (K=3), K-means, K-means++, and the innovative clustering method described in Section 4 below were each used for clustering, to compare the efficiency and clustering accuracy of the different algorithms. For each data set, 20 experiments were run and the average clustering accuracy was recorded; the results are as follows:

Analysis: As the two figures above show, K-means++ and the innovative K-means, as optimized versions of K-means, both greatly improve the efficiency and clustering accuracy of K-means on the different data sets. The innovative clustering algorithm in particular, as the experimental comparison shows, is superior to classic K-means and its other optimizations, with clear improvements in efficiency, clustering effect, and stability.

4. Innovative Clustering Algorithm Design

① Deficiencies of classic K-means: difficulty in choosing the K value and the initial centroids, assumptions about the data distribution, sensitivity to outliers, and unsuitability for tasks where samples belong to multiple classes.

② Existing improvements: K-means++, X-means, ISODATA, kernel K-means, etc.

③ Brief description of innovative clustering algorithm:

1) K selection: since the K-means algorithm is strongly affected by the K value, manual selection is abandoned and the optimal K value is determined with the Gap statistic method.

2) Initial cluster centers: since the K-means algorithm is strongly affected by the initial centers, the random centroid selection of traditional K-means is abandoned; the data are pre-processed with hierarchical clustering, and the k centers obtained there are used as the initial centers of the K-means algorithm (a seeding sketch follows after this list).

3) Centroid iteration: traditional cluster centers are only updated after a full pass over the data. This method updates the cluster centers in real time: every time a sample is assigned to a new cluster, the centers of the cluster it joins and the cluster it leaves are immediately recomputed, which strengthens the convergence of the algorithm.

4) Variable-K selection: to pursue minimal intra-class variance and maximal inter-class variance, and considering that a fixed K may not yield a good clustering, the previously fixed number of cluster centers is replaced by a floating range: the original K is the minimum number of cluster centers and an upper limit maxK is set. Concretely:

4.1) When a sample to be clustered finds its nearest center, compute the intra-class variance of that cluster before and after assigning the sample to it; if the difference exceeds a threshold, use that sample to spawn a new cluster center.

4.2) When the current number of cluster centers equals the set maximum, merge the two closest clusters; to keep the resulting clustering balanced, clusters of smaller size are merged first.

5) Termination check: to prevent inaccurate clustering and endless loops, the maximum number of iterations and the minimum error change (a small threshold) are both used as termination conditions; once either is met, the clustering figure and results are output.
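
A minimal sketch of point 2 (hierarchical pre-clustering used to seed K-means) is given below; the other points change the iteration itself and are not reproduced here, and the variable names are assumptions:

% Seed kmeans with centroids obtained from hierarchical (Ward) clustering
Z = linkage(test_data', 'ward');            % agglomerative clustering of the samples (rows)
pre_label = cluster(Z, 'maxclust', K);      % cut the tree into K groups
seed = zeros(K, size(test_data,1));
for j = 1 : K
    seed(j,:) = mean(test_data(:, pre_label == j), 2)'; % centroid of each pre-cluster
end
[idx, center] = kmeans(test_data', K, 'Start', seed);   % hand the seeds to kmeans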

5. Analysis and optimization of some problems in K-means clustering

The problems observed with K-means in the theoretical analysis and in the experiments, together with their optimizations, are summarized as follows:

① Inaccurate clustering

Examples of situations where the clustering is inaccurate are as follows:

Analysis: For the sample set on the right, the naked eye would cluster it as shown by the red boxes, but the result obtained with K-means differs considerably from this expectation. There are many possible reasons, including but not limited to how the data set happens to be distributed, the threshold setting, and the choice of K.

Optimization: reduce the threshold (that is, the minimum allowed change in centroid position) to obtain more accurate clustering.

② There is an infinite loop in the program

Analysis: A data set may admit more than one possible clustering, and in some cases it is genuinely impossible for every centroid change to fall below the threshold.

Optimization: add a variable times that records how many while-loop iterations have been executed. When times reaches a large value and the program has still not stopped, it can be judged that an infinite loop has occurred, and the current result is output directly without further computation.
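
When the built-in kmeans is used instead of the hand-written loop, the same guard can be expressed through its options (a brief sketch; the iteration cap of 10000 mirrors the hand-written code):

opts = statset('MaxIter', 10000);                                         % cap the number of iterations
[idx, center] = kmeans(test_data', K, 'Options', opts, 'Replicates', 5); % Replicates reruns kmeans and keeps the best solution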

③ Selection of K value

Analysis: Different K values have a great influence on the experimental results, yet in classic K-means the K value is chosen manually.

Optimization: use the elbow method (core: choose the K at the inflection point of the within-cluster distance curve) or the Gap statistic method instead of direct manual selection. The Gap statistic chooses the K with the largest Gap(K), defined as follows:

\operatorname{Gap}(K)=E\left(\log D_{K}\right)-\log D_{K}

where D_K is the loss (within-cluster dispersion) for K clusters, and E(log D_K) is the expectation of log D_K, estimated by Monte Carlo simulation.
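
A minimal elbow-method sketch is given below (the Gap statistic additionally needs reference-set sampling and is longer, so only the within-cluster distance curve is plotted here; test_data is the matrix used earlier):

% Elbow method: plot the total within-cluster distance against K and read off the inflection point
max_K = 8;
sse = zeros(1, max_K);
for k = 1 : max_K
    [~, ~, sumd] = kmeans(test_data', k, 'Replicates', 5); % sumd: within-cluster sums of point-to-centroid distances
    sse(k) = sum(sumd);
end
figure; plot(1:max_K, sse, '-o');
xlabel('K'); ylabel('total within-cluster distance');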

④ The program is prone to empty clusters (the final number of clusters is less than K)

Analysis: Classic K-means is easily affected by the initial centroids and may converge to a local minimum; as a result, the clustering can easily produce empty clusters.

Optimization: use K-means++ instead of K-means, or use other methods to optimize the initial centroids.
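
Recent versions of the built-in kmeans already use k-means++ style seeding by default ('Start','plus'); for reference, a minimal manual k-means++ seeding sketch (illustrative only) is:

% k-means++ seeding: each new center is drawn with probability proportional to the squared distance
X = test_data';                        % rows are samples
center = X(randi(size(X,1)), :);       % first center: a uniformly random sample
for j = 2 : K
    d2 = min(pdist2(X, center).^2, [], 2);       % squared distance to the nearest chosen center
    p = d2 / sum(d2);
    next = find(rand <= cumsum(p), 1, 'first');  % sample an index according to p
    center = [center; X(next, :)];
end
[idx, ~] = kmeans(X, K, 'Start', center);        % run K-means from these seeds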

6. Others

1. Datasets and resources

Data sets used in this experiment: ORL5646, AR5040, COIL20.

The commonly used face data sets are shared below (don't just freeload them, haha)

Link: https://pan.baidu.com/s/12Le0mKEquGMgh5fhNagZGw 
Extraction code: yrnb

K-means and simple practice complete code: Li Yiru/Yiru's machine learning - Gitee.com

2. References

1. [Machine Learning] K-means (very detailed) - Zhihu (zhihu.com)

2. K-Means Algorithm Implementation (Matlab)_Mathematicians are my ideal blog-CSDN blog_k-means++ matlab

3. k-means clustering - MATLAB kmeans - MathWorks Nordic

4. [Machine Learning] K-Means algorithm and various optimization and improvement algorithms, clustering model evaluation_Day-yong's Blog-CSDN Blog

5. Improvement of K-means algorithm in pattern recognition_Improvement of k-means algorithm-C++ code resources-CSDN library


Summary

As a classic clustering algorithm, K-means partitions the data into k clusters by iteratively determining the best centroids. It still performs well in many areas of machine learning (data clustering, language and image processing, recommendation systems), and the algorithm is simple in principle and easy to implement. However, as an unsupervised method, K-means does not make use of prior information about the data, and it still suffers from problems such as inaccurate clustering, a tendency to produce empty clusters, and sensitivity to the choice of centroids and of k. In addition, the data properties that K-means assumes are often not satisfied in real-world problems, which affects the experimental results. This blog has proposed some optimization methods and ideas; later blogs will analyze other algorithms and optimizations for the remaining problems.


Source: blog.csdn.net/weixin_51426083/article/details/125015975