Dimensionality reduction of the MNIST dataset based on the LPP algorithm

1. Introduction of the authors

Liu Chenyu, male, 2022 graduate student, School of Electronic Information, Xi'an Polytechnic University
Research direction: medical image segmentation
Email: [email protected]

Chen Mengdan, female, 2022 graduate student, School of Electronic Information, Xi'an Polytechnic University, member of Zhang Hongwei's artificial intelligence research group
Research direction: machine vision and artificial intelligence
Email: [email protected]

2. Introduction to the LPP Algorithm

2.1 Basic concepts and principles

LPP (Locality Preserving Projection) is a commonly used dimensionality reduction algorithm. It is a linear approximation of the Laplacian Eigenmaps (LE) algorithm and preserves local information, so it can be regarded as an alternative to PCA.

The LPP algorithm preserves the local structure of the data by modeling the distance relationships between sample pairs in the original space and retaining those relationships as much as possible under the dimensionality-reducing projection.
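Concretely, the neighborhood relationship is encoded in a weight matrix W. With the heat-kernel (RBF) weighting used in the code in Section 3.2, where t is a bandwidth parameter, the weight between samples x_i and x_j is

W_{ij} = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^2}{t}\right)

when x_j is among the k nearest neighbors of x_i, and W_{ij} = 0 otherwise.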

2.2 Algorithm process

The core idea of the LPP algorithm is to project high-dimensional data into a low-dimensional space through a linear mapping, so that the local neighborhood relationships among samples are maintained in the low-dimensional space.

The specific steps of the algorithm are as follows:

  • Construct a neighborhood graph: compute the Euclidean distance between every pair of samples and connect each sample to its k nearest neighbors.
  • Build a weight matrix: assign a weight to each edge of the graph. A common choice is the radial basis (heat kernel) function, so that closer samples receive larger weights.
  • Construct the graph Laplacian: form the diagonal degree matrix D, whose diagonal entries are the row sums of the weight matrix W, and the Laplacian L = D − W.
  • Construct the objective function: LPP minimizes the weighted sum of squared distances between projected neighbors, so that samples connected in the graph stay close after projection. Written in matrix form, this becomes a generalized eigenvalue problem, and the optimal projection matrix is obtained from its eigenvectors (see the formulas after this list).
  • Dimensionality reduction: map the high-dimensional data into the low-dimensional space with the obtained projection matrix to complete the dimensionality reduction.
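Written out (in the row-sample convention used by the code in Section 3.2, with X the N×D data matrix, D the diagonal degree matrix with D_{ii} = \sum_j W_{ij}, and L = D - W the graph Laplacian), the objective for the projection matrix A is

\min_A \tfrac{1}{2} \sum_{i,j} \lVert y_i - y_j \rVert^2 W_{ij} = \min_A \operatorname{tr}\!\left(A^\top X^\top L X A\right) \quad \text{s.t. } A^\top X^\top D X A = I,

which reduces to the generalized eigenvalue problem

X^\top L X \, a = \lambda \, X^\top D X \, a.

The eigenvectors belonging to the smallest nonzero eigenvalues form the columns of A, and the low-dimensional embedding is Y = XA.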

3. Implementation of LPP algorithm

3.1 The handwritten digits dataset

The handwritten digits dataset (digits) ships with the commonly used sklearn library and can be loaded directly. The main data contained in digits are divided into images, data, target, and target_names, where:

  • images is a three-dimensional array of 1797 8×8 images;
  • data contains the actual samples: 1797 rows, each an 8×8-pixel image flattened row by row into a 64-dimensional vector;
  • target gives the label of each image, that is, the digit the image represents;
  • target_names lists all possible labels: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].
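A quick way to check these fields is shown below (a minimal sketch; the printed shapes follow directly from the dataset description above):

from sklearn.datasets import load_digits

digits = load_digits()
print(digits.images.shape)   # (1797, 8, 8): 1797 images of 8x8 pixels
print(digits.data.shape)     # (1797, 64): each image flattened row by row
print(digits.target.shape)   # (1797,): the digit label of each image
print(digits.target_names)   # [0 1 2 3 4 5 6 7 8 9]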

3.2 Code implementation

3.2.1 Complete code

# Import packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
# Compute the pairwise squared Euclidean distances
# X has shape [N, D]
def cal_pairwise_dist(X):
    N,D = np.shape(X)
    # Expand and tile the data to shape [N, N, D]
    tile_xi = np.tile(np.expand_dims(X,1),[1,N,1])
    tile_xj = np.tile(np.expand_dims(X,axis=0),[N,1,1])
    # Squared Euclidean distance between every pair of points
    dist = np.sum((tile_xi-tile_xj)**2,axis=-1)
    # Return the [N, N] matrix of distances between any two points
    return dist
# RBF (radial basis / heat kernel) function
def rbf(dist, t = 1.0):
    return np.exp(-(dist/t))
# Build the pairwise weight matrix W
def cal_rbf_dist(data, n_neighbors = 10, t = 1):
    # Pairwise squared Euclidean distances of the input data
    dist = cal_pairwise_dist(data)
    # Clamp tiny negative values caused by floating-point error to 0
    dist[dist < 0] = 0
    # The number of samples N is the first dimension of the distance matrix
    N = dist.shape[0]
    # Apply the RBF kernel to the distances
    rbf_dist = rbf(dist, t)
    # Initialize the N x N weight matrix
    W = np.zeros([N, N])
    # Process the samples row by row
    for i in range(N):
        # Sort by distance (ascending), skip the point itself, and keep the n_neighbors nearest points
        index_ = np.argsort(dist[i])[1:1 + n_neighbors]
        # Fill in the RBF weights for these neighbors, symmetrically
        W[i, index_] = rbf_dist[i, index_]
        W[index_, i] = rbf_dist[index_, i]
    return W
# X: input data of shape [N, D]; n_neighbors: number of nearest neighbors; t: RBF kernel parameter
def lpp(X,n_dims = 2,n_neighbors = 30, t = 1.0):
    N = X.shape[0]
    W = cal_rbf_dist(X, n_neighbors, t)
    D = np.zeros_like(W)
    # Compute the diagonal degree matrix D
    for i in range(N):
        D[i,i] = np.sum(W[i])
    L = D - W
    XDXT = np.dot(np.dot(X.T, D), X)
    XLXT = np.dot(np.dot(X.T, L), X)
    # Solve the eigenproblem of pinv(X^T D X) @ (X^T L X)
    eig_val, eig_vec = np.linalg.eig(np.dot(np.linalg.pinv(XDXT), XLXT))
    # Indices that sort the eigenvalues by absolute value, ascending
    sort_index_ = np.argsort(np.abs(eig_val))
    # Reorder the eigenvalues accordingly
    eig_val = eig_val[sort_index_]
    # Print the first ten eigenvalues
    print("First ten eigenvalues:", eig_val[:10])
    # Skip eigenvalues that are numerically zero
    j = 0
    while np.abs(eig_val[j]) < 1e-6:
        j+=1
    # j is the index of the first sufficiently large eigenvalue
    print("\nj: ", j)
    # Take the indices of the n_dims smallest remaining eigenvalues
    sort_index_ = sort_index_[j:j+n_dims]
    # The selected eigenvalues
    eig_val_picked = eig_val[j:j+n_dims]
    print("Selected eigenvalues:", eig_val_picked)
    # Projection matrix A: the eigenvectors corresponding to the selected eigenvalues
    A = eig_vec[:, sort_index_]
    # Project the data: Y = X @ A
    Y = np.dot(X, A)
    return Y

if __name__ == '__main__':
    # Test on the load_digits data
    X = load_digits().data
    Y = load_digits().target
    n_neighbors = 5
    dist = cal_pairwise_dist(X)
    max_dist = np.max(dist)
    data_2d_LPP = lpp(X, n_neighbors = n_neighbors, t = 0.01*max_dist)
    data_2d_PCA = PCA(n_components=2).fit_transform(X)
    # Plot the LPP and PCA embeddings side by side
    plt.figure(figsize=(12,6))
    plt.subplot(121)
    plt.title("LPP")
    plt.scatter(data_2d_LPP[:, 0], data_2d_LPP[:, 1], c = Y)
    plt.subplot(122)
    plt.title("PCA")
    plt.scatter(data_2d_PCA[:, 0], data_2d_PCA[:, 1], c = Y)
    plt.show()
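Note that the kernel bandwidth is tied to the scale of the data (t = 0.01*max_dist), which keeps the heat-kernel weights in a useful range. To compare the two embeddings more quantitatively, one option is a silhouette score against the digit labels; the following is a minimal sketch, assuming the script above has just been run in the same session (np.real guards against the numerically complex eigenvectors that np.linalg.eig may return):

from sklearn.metrics import silhouette_score

# A higher silhouette score means samples of the same digit are grouped
# more tightly relative to samples of other digits in the 2-D embedding.
print("LPP silhouette:", silhouette_score(np.real(data_2d_LPP), Y))
print("PCA silhouette:", silhouette_score(data_2d_PCA, Y))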

3.2.2 Running results

From the LPP result plot, samples of the same digit form more concentrated clusters, while the PCA result plot shows a more dispersed distribution.
[Figure: 2-D embeddings of the digits data, LPP (left) vs. PCA (right)]

4. Reference links

[1] Analysis of the locality preserving projection algorithm (in Chinese): https://zhuanlan.zhihu.com/p/340121889
[2] LPP algorithm implementation code reference: https://github.com/heucoder/dimensionality_reduction_alo_codes/blob/master/codes/LPP/LPP.py
