Comparison of two-dimensional projections of the iris data set based on the LDA and PCA algorithms

1. Introduction of the authors

Xin Wang, male, graduate student (enrolled 2022), School of Electronic Information, Xi'an Polytechnic University
Research direction: machine vision and artificial intelligence
Email: [email protected]

Lu Zhidong, male, graduate student (enrolled 2022), School of Electronic Information, Xi'an Polytechnic University, member of Zhang Hongwei's artificial intelligence research group
Research direction: machine vision and artificial intelligence
Email: [email protected]

2. Introduction to LDA and PCA Algorithms

2.1 LDA Algorithm

LDA (Linear Discriminant Analysis) is a classic supervised learning algorithm in machine learning; in its original Fisher formulation it is a binary classifier, and it is commonly used for feature extraction, data dimensionality reduction and classification tasks. It plays an important role in fields such as face recognition and face detection.
The main idea: given a set of labelled training samples, project the samples onto a straight line (or a low-dimensional subspace) so that projections of samples from the same class are as close together as possible and projections of samples from different classes are as far apart as possible, i.e. minimize the within-class scatter and maximize the between-class scatter.
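For the two-class case this idea is usually written as maximizing the Fisher criterion; the following is the standard formulation, stated here for completeness rather than taken from the original text:

J(\mathbf{w}) = \frac{\mathbf{w}^{\top} S_B \mathbf{w}}{\mathbf{w}^{\top} S_W \mathbf{w}}, \qquad
S_B = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^{\top}, \qquad
S_W = \sum_{c=1}^{2} \sum_{\mathbf{x} \in X_c} (\mathbf{x} - \boldsymbol{\mu}_c)(\mathbf{x} - \boldsymbol{\mu}_c)^{\top}

where \boldsymbol{\mu}_1 and \boldsymbol{\mu}_2 are the class means, S_B is the between-class scatter matrix and S_W the within-class scatter matrix; the optimal projection direction satisfies \mathbf{w} \propto S_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2).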

2.2 PCA Algorithm

PCA (Principal Component Analysis) is the most commonly used linear dimensionality reduction method. Its goal is to map high-dimensional data into a lower-dimensional space through a linear projection, chosen so that the projected data carries as much information as possible (has the largest variance). In this way fewer dimensions are used while more of the characteristics of the original data points are retained.
The purpose of PCA dimensionality reduction is therefore to reduce the dimensionality of the original features while losing as little information as possible, that is, to project the original features onto the directions that carry the largest amount of projected information, so that the loss of information after dimensionality reduction is minimized.
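Formally, with a mean-centred data matrix X_c of n samples, PCA solves the following optimization problem (again a standard formulation, not quoted from the original text):

\max_{\mathbf{w}} \; \mathbf{w}^{\top} C \, \mathbf{w} \quad \text{s.t.} \quad \mathbf{w}^{\top}\mathbf{w} = 1, \qquad
C = \frac{1}{n} X_c^{\top} X_c

whose solutions are the eigenvectors of the covariance matrix C associated with the largest eigenvalues.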

2.3 The difference and connection between the two algorithms

Similarities:
1. Both are dimensionality reduction methods.
2. Both use the idea of matrix eigendecomposition for dimensionality reduction.
3. Both assume that the data follow a Gaussian distribution.
Differences:
1. PCA is an unsupervised dimensionality reduction method, while LDA is a supervised one.
2. In addition to dimensionality reduction, LDA can also be used for classification.
3. LDA can reduce the data to at most k-1 dimensions, where k is the number of classes (for the iris data set, k = 3, so at most 2 dimensions), while PCA has no such restriction.
4. LDA chooses the projection direction with the best class separability, while PCA chooses the direction with the largest variance of the projected sample points. As Figure 1 below shows, LDA reduces dimensionality better than PCA for some data distributions; however, as Figure 2 below shows, PCA is better than LDA for other data distributions:
[Figure 1 and Figure 2: LDA vs. PCA projection directions under two different data distributions]

3. Experimental procedure

3.1 Dataset Introduction

Iris, also known as the iris flower data set, is a classic data set for multivariate analysis. It contains 150 samples with 4 feature columns and one class column: sepal length, sepal width, petal length and petal width, all measured in centimetres. The task is to use these four attributes to predict which of three iris varieties a sample belongs to: Setosa (mountain iris), Versicolour (variegated iris) or Virginica (Virginia iris), with 50 samples per class.
[Figure: the first few rows of the iris data set]
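These figures (a 150x4 feature matrix, three classes of 50 samples each) can be checked directly; the following is a minimal sketch using scikit-learn and NumPy:

import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
print(iris.data.shape)           # (150, 4): 150 samples, 4 features (in cm)
print(iris.feature_names)        # sepal length/width, petal length/width
print(iris.target_names)         # ['setosa' 'versicolor' 'virginica']
print(np.bincount(iris.target))  # [50 50 50]: 50 samples per class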

3.2 Algorithm process

The two-dimensional projection comparison of the iris data set based on the LDA and PCA algorithms is implemented as follows:
1. Load the iris data set: use the relevant library to load the iris data set, including the feature vectors and the corresponding class labels.
2. LDA dimensionality reduction: use the LDA algorithm to reduce the dimensionality of the feature vectors. First, compute the mean vector and the within-class scatter matrix for each class. Then compute the between-class scatter matrix and solve the resulting generalized eigenvalue problem to obtain the projection directions (eigenvectors). Select the eigenvectors corresponding to the k largest eigenvalues as the projection directions, where k is the number of dimensions to reduce to. Finally, map the feature vectors into the new low-dimensional space using the selected projection directions to obtain the reduced data (see the sketch after this list).
3. PCA dimensionality reduction: use the PCA algorithm to reduce the dimensionality of the feature vectors. First, mean-centre the feature vectors. Then compute their covariance matrix and solve for its eigenvalues and eigenvectors. Select the eigenvectors corresponding to the k largest eigenvalues as the projection directions, where k is the number of dimensions to reduce to. Finally, map the feature vectors into the new low-dimensional space using the selected projection directions to obtain the reduced data.
4. Plot the results: use a visualization library to draw the two-dimensional projections obtained by the two dimensionality reduction methods. A scatter plot can be used to display the samples, with different colours marking the different classes.
5. Compare the results: show the LDA and PCA projections in the same figure. Subplots can be used to show the LDA and PCA results separately, or different colours or markers can be used to show both methods in the same subplot.
By comparing the dimensionality reduction results of LDA and PCA, we can observe how differently they behave on the iris data set, as well as how well each retains data information and class separability.
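As a concrete illustration of steps 2 and 3 above, here is a minimal NumPy sketch of the scatter-matrix and covariance computations. It is meant only for illustration: it reproduces the directions found by the scikit-learn calls used in the next sections only up to sign, scaling and centring, and is not the article's implementation.

import numpy as np
from sklearn import datasets

X, y = datasets.load_iris(return_X_y=True)
n_features, classes = X.shape[1], np.unique(y)

# --- LDA (step 2): within-class and between-class scatter matrices ---
overall_mean = X.mean(axis=0)
S_W = np.zeros((n_features, n_features))
S_B = np.zeros((n_features, n_features))
for c in classes:
    X_c = X[y == c]
    mean_c = X_c.mean(axis=0)
    S_W += (X_c - mean_c).T @ (X_c - mean_c)
    diff = (mean_c - overall_mean).reshape(-1, 1)
    S_B += len(X_c) * diff @ diff.T

# Solve the generalized eigenvalue problem S_W^{-1} S_B w = lambda w
eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
order = np.argsort(eigvals.real)[::-1]
W_lda = eigvecs[:, order[:2]].real        # top-2 discriminant directions
X_lda_manual = X @ W_lda                  # projected data, 150 x 2

# --- PCA (step 3): eigendecomposition of the covariance matrix ---
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals_p, eigvecs_p = np.linalg.eigh(cov)   # eigh: ascending eigenvalues
W_pca = eigvecs_p[:, ::-1][:, :2]            # top-2 principal directions
X_pca_manual = X_centered @ W_pca            # projected data, 150 x 2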

3.3 Core Algorithm Introduction

# Load the iris data set
iris = datasets.load_iris()
X = iris.data  # feature data of the iris data set
y = iris.target  # class labels of the iris data set
target_names = iris.target_names  # names of the target classes

# Dimensionality reduction with LDA
lda = LinearDiscriminantAnalysis(n_components=2)  # create a LinearDiscriminantAnalysis object, reducing to 2 dimensions
X_lda = lda.fit_transform(X, y)  # reduce the feature data with linear discriminant analysis; the result is stored in X_lda

# Dimensionality reduction with PCA
pca = PCA(n_components=2)  # create a PCA object, reducing to 2 dimensions
X_pca = pca.fit_transform(X)  # reduce the feature data with principal component analysis; the result is stored in X_pca

Load the iris data set; store its feature data in the variable X, its labels in y, and its target class names in target_names.
Create a LinearDiscriminantAnalysis object with the target dimensionality set to 2, reduce the feature data with linear discriminant analysis and store the result in X_lda; then create a PCA object with the target dimensionality set to 2, reduce the feature data with principal component analysis and store the result in X_pca.
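As an optional check that is not part of the original code, both fitted objects expose an explained_variance_ratio_ attribute in scikit-learn, which reports how much of the relevant variance each of the two retained components captures:

# Optional check, assuming lda and pca have been fitted as above
print(lda.explained_variance_ratio_)  # fraction of between-class variance explained by each discriminant
print(pca.explained_variance_ratio_)  # fraction of total variance explained by each principal component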

# Plot the LDA result
plt.figure(figsize=(12, 6))  # create a figure window of size 12x6
plt.subplot(121)  # create a 1x2 subplot layout and select the first subplot
for color, i, target_name in zip(['navy', 'turquoise', 'darkorange'], [0, 1, 2], target_names):  # zip() pairs up the colour list, the label list [0, 1, 2] and the class-name list, so that in each iteration color, i and target_name hold the corresponding elements
    plt.scatter(X_lda[y == i, 0], X_lda[y == i, 1], color=color, alpha=.8, lw=2, label=target_name)  # scatter plot of the LDA result: X_lda[y == i, 0] selects the first reduced dimension of the samples with label i, X_lda[y == i, 1] the second; color sets the point colour, alpha the transparency, lw the marker edge width, and label the legend entry
plt.legend(loc='best', shadow=False, scatterpoints=1)  # add a legend at the best position, without a shadow and with a single marker per entry
plt.title('LDA of IRIS dataset')  # subplot title

# Plot the PCA result
plt.subplot(122)  # select the second subplot
for color, i, target_name in zip(['navy', 'turquoise', 'darkorange'], [0, 1, 2], target_names):  # iterate over the colour, label and class-name lists in the same way
    plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1], color=color, alpha=.8, lw=2, label=target_name)  # scatter plot of the PCA result, drawn in the same way as the LDA plot
plt.legend(loc='best', shadow=False, scatterpoints=1)  # add a legend
plt.title('PCA of IRIS dataset')  # subplot title

Create a figure window of size 12x6; create a 1x2 subplot layout in the window and select the first subplot; iterate over the colour list, the label list [0, 1, 2] and the class-name list, where zip() pairs up the corresponding elements so that in each iteration color, i and target_name hold the elements at the same position in the three lists; use scatter() to draw a scatter plot of the LDA result, where X_lda[y == i, 0] selects the first reduced dimension of the samples whose label equals i and X_lda[y == i, 1] selects the second, color sets the point colour, alpha the transparency, lw the marker edge width and label the legend entry for each class; add a legend, where loc='best' places it at the best position, shadow=False disables its shadow and scatterpoints=1 shows a single marker per entry; set the subplot title to 'LDA of IRIS dataset'.
Select the second subplot; iterate over the colour, label and class-name lists in the same way; use scatter() to draw a scatter plot of the PCA result in the same way as before, but using the PCA output; add a legend; set the subplot title to 'PCA of IRIS dataset'.

3.4 Complete code

import numpy as np
import matplotlib.pyplot as plt  # pyplot module of Matplotlib, used for plotting
from sklearn import datasets  # datasets module of Scikit-learn, used to load the data set
from sklearn.decomposition import PCA  # PCA, used for principal component analysis dimensionality reduction
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis  # LDA, used for linear discriminant analysis dimensionality reduction

# Load the iris data set
iris = datasets.load_iris()
X = iris.data  # feature data of the iris data set
y = iris.target  # class labels of the iris data set
target_names = iris.target_names  # names of the target classes

# Dimensionality reduction with LDA
lda = LinearDiscriminantAnalysis(n_components=2)  # create a LinearDiscriminantAnalysis object, reducing to 2 dimensions
X_lda = lda.fit_transform(X, y)  # reduce the feature data with linear discriminant analysis; the result is stored in X_lda

# Dimensionality reduction with PCA
pca = PCA(n_components=2)  # create a PCA object, reducing to 2 dimensions
X_pca = pca.fit_transform(X)  # reduce the feature data with principal component analysis; the result is stored in X_pca

# Plot the LDA result
plt.figure(figsize=(12, 6))  # create a figure window of size 12x6
plt.subplot(121)  # create a 1x2 subplot layout and select the first subplot
for color, i, target_name in zip(['navy', 'turquoise', 'darkorange'], [0, 1, 2], target_names):  # zip() pairs up the colour list, the label list [0, 1, 2] and the class-name list
    plt.scatter(X_lda[y == i, 0], X_lda[y == i, 1], color=color, alpha=.8, lw=2, label=target_name)  # scatter plot of the LDA result for the samples of class i
plt.legend(loc='best', shadow=False, scatterpoints=1)  # add a legend at the best position, without a shadow and with a single marker per entry
plt.title('LDA of IRIS dataset')  # subplot title

# Plot the PCA result
plt.subplot(122)  # select the second subplot
for color, i, target_name in zip(['navy', 'turquoise', 'darkorange'], [0, 1, 2], target_names):  # iterate over the colour, label and class-name lists in the same way
    plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1], color=color, alpha=.8, lw=2, label=target_name)  # scatter plot of the PCA result for the samples of class i
plt.legend(loc='best', shadow=False, scatterpoints=1)  # add a legend
plt.title('PCA of IRIS dataset')  # subplot title

# Show the figure
plt.tight_layout()  # adjust the subplot layout to make it compact
plt.show()  # display the figure

This code loads the iris data set, reduces its dimensionality with the LDA and PCA algorithms, and then draws the two results in two subplots, with the LDA result on the left and the PCA result on the right.
The scatter points in each subplot represent samples from the different classes and are identified by a legend. Finally, the figure is displayed to present the comparison.

3.5 Experimental results and analysis

[Figure: LDA (left) and PCA (right) two-dimensional projections of the iris data set]
The comparison of the LDA and PCA two-dimensional projections of the iris data set can be summarized as follows:
LDA excels at retaining class information. By maximizing the between-class scatter and minimizing the within-class scatter, LDA preserves as much separability between the different classes as possible while reducing the dimensionality. In the two-dimensional projection, LDA separates the different iris classes effectively and shows a clear clustering effect.
PCA has advantages in data presentation and compression. By selecting the projection directions with the largest variance, PCA achieves good data compression while retaining the main information in the data. In the two-dimensional projection, PCA spreads the data set along its two main directions and shows the overall distribution of the data.
In summary, for the iris data set, LDA performs well in retaining class information and distinguishing between classes, while PCA is better suited to data presentation and compression. Depending on the requirements of the task, the appropriate dimensionality reduction algorithm can be chosen to obtain the best data representation.
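One possible way to put numbers on this comparison, not part of the original experiment, is a cluster-quality measure such as the silhouette score computed on each two-dimensional projection together with the true labels; the sketch below assumes X_lda, X_pca and y from the code in Section 3.4 and is only one of several reasonable measures:

from sklearn.metrics import silhouette_score

# Higher scores indicate better separation of the three classes in the 2D projection
print("LDA projection:", silhouette_score(X_lda, y))
print("PCA projection:", silhouette_score(X_pca, y))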

Source: blog.csdn.net/m0_37758063/article/details/131289620