07 Unsupervised Learning - Dimensionality Reduction

1. Overview of Dimensionality Reduction

Curse of Dimensionality: usually refers to the phenomenon in which, in problems involving vector calculations, the amount of computation grows exponentially as the number of dimensions increases.

1.1 What is dimensionality reduction?

1. Dimensionality reduction converts the samples (instances) in the training data from a high-dimensional space to a low-dimensional space.

2. There are many kinds of algorithms that can reduce the dimensionality of the original data. In many of these methods, the dimensionality reduction is realized through a linear transformation of the original data.

1.2 Why dimensionality reduction?

1. High-dimensional data increases the difficulty of computation: the higher the dimension, the harder it is for an algorithm to search the space.

2. High dimensionality weakens the generalization ability of learning algorithms. Dimensionality reduction can improve the interpretability of the data, which helps in discovering meaningful structure in the data.

1.3 The main functions of dimensionality reduction:

  • Reduce redundant features and lower the data dimension

  • Data visualization

Advantages of dimensionality reduction:

  • By reducing the feature dimensionality, the storage space required for the dataset is reduced accordingly, and the computation and training time are reduced as well;
  • Reducing the dimensionality of the dataset's features makes it easier to visualize the data quickly;
  • It eliminates redundant features by addressing multicollinearity.

Disadvantages of dimensionality reduction:

  • Some information may be lost due to dimensionality reduction;
  • In the principal component analysis (PCA) dimensionality reduction technique, it is sometimes difficult to determine how many principal components should be kept, so rules of thumb are often used.

2. Singular value decomposition

Singular Value Decomposition (SVD) is an algorithm widely used in the field of machine learning. It can be used not only for feature decomposition in dimensionality reduction algorithms, but also in recommendation systems and natural language processing. It is the cornerstone of many machine learning algorithms.

SVD decomposes a matrix A into the product of three matrices:
A = UΣV^T
where U is an orthogonal matrix, Σ is a diagonal matrix, and V^T is the transpose of an orthogonal matrix V.

The role of decomposition: linear transformation = rotation + stretch + rotation

In the SVD, the singular values on the diagonal of Σ are arranged in descending order, and they decrease particularly fast: in many cases, the sum of the largest 10% or even 1% of the singular values accounts for more than 99% of the sum of all singular values.

That is to say, singular values play a role similar to the eigenvalues in eigendecomposition, and we can likewise use the largest k singular values and the corresponding left and right singular vectors to approximate the original matrix.
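As a concrete illustration, the following is a minimal NumPy sketch of this rank-k approximation; the random matrix and the choice k = 2 are assumptions made for the example, not values from the text.

```python
# A minimal NumPy sketch of a rank-k SVD approximation (the random matrix A
# and k = 2 are illustrative assumptions, not values from the text).
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 20))                 # hypothetical data matrix

# Thin SVD: A = U @ diag(s) @ Vt, with s sorted in descending order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                          # keep the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k approximation of A

print(s[:k].sum() / s.sum())                        # share of the singular-value mass kept
print(np.linalg.norm(A - A_k) / np.linalg.norm(A))  # relative approximation error
```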

3. Principal component analysis

Principal Component Analysis (PCA) is a dimensionality reduction method that converts a large feature set into a smaller one that still contains most of the information in the original data, thereby reducing the dimensionality of the original data.
Reducing the number of features in a dataset naturally comes at the expense of accuracy, but the trick of dimensionality reduction is to trade a bit of accuracy for simplicity: smaller datasets are easier to explore and visualize, and machine learning algorithms can analyze the data faster and more easily without having to deal with extra features.

PCA identifies the axis that accounts for the largest amount of variance in the training set.

Two ways to implement the PCA algorithm:
(1) PCA based on SVD of the covariance matrix

PCA reduces the data from n dimensions to k dimensions.
Given m pieces of n-dimensional data, the original data forms a matrix X with n rows and m columns.
The first step is mean normalization: we compute the mean and standard deviation of every feature and then convert the values to z-scores.
The second step is to compute the covariance matrix Σ, whose eigenvectors are the principal components we want to find; in this variant they are obtained from the SVD of the covariance matrix, as sketched below.
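A minimal NumPy sketch of this SVD-based variant follows; the function name pca_svd, the random data, and k = 2 are illustrative assumptions, and the data matrix is stored with samples as rows (m x n), i.e. the transpose of the n x m layout described above.

```python
# A minimal sketch of PCA via SVD of the covariance matrix; the function name,
# the random data, and k = 2 are illustrative.  Note: here X is stored with
# samples as rows (m x n), the transpose of the n x m layout in the text.
import numpy as np

def pca_svd(X, k):
    # Step 1: mean normalization / z-scores
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix (n x n) and its SVD; for a symmetric matrix
    # the columns of U are its eigenvectors, i.e. the principal components
    cov = np.cov(X_std, rowvar=False)
    U, s, Vt = np.linalg.svd(cov)
    components = U[:, :k]                 # top-k principal directions
    return X_std @ components, s[:k]      # projected data, variance of each component

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))             # hypothetical dataset: 200 samples, 5 features
Z, explained = pca_svd(X, k=2)
print(Z.shape, explained)                 # (200, 2) and the two largest variances
```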

(2) PCA based on eigenvalue decomposition of the covariance matrix

PCA reduces the data from n dimensions to k dimensions:
given m pieces of n-dimensional data, the original data forms a matrix X with n rows and m columns.
The first step is mean normalization: we compute the mean and standard deviation of every feature and then convert the values to z-scores.
The second step is to compute the covariance matrix Σ, whose eigenvectors are the principal components we want to find, via eigenvalue decomposition of the matrix.
For a matrix A with a set of eigenvectors v, orthogonalizing and normalizing this set of vectors yields a set of orthonormal unit vectors. Eigenvalue decomposition factorizes the matrix A as:
A = PΣP^(-1)
where P is a matrix whose columns are the eigenvectors of A, and Σ is a diagonal matrix whose diagonal elements are the eigenvalues. A sketch of this approach follows below.
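Below is a minimal NumPy sketch of this eigendecomposition-based variant; the function name pca_eig, the random data, and k = 2 are illustrative assumptions, and the data matrix again stores samples as rows.

```python
# A minimal sketch of PCA via eigenvalue decomposition of the covariance matrix
# (A = P Σ P^(-1)); the function name, the random data, and k = 2 are
# illustrative, and X again stores samples as rows (m x n).
import numpy as np

def pca_eig(X, k):
    # Step 1: mean normalization / z-scores
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix and its eigendecomposition
    cov = np.cov(X_std, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: symmetric matrix, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # re-sort in descending order
    P = eigvecs[:, order[:k]]                 # top-k eigenvectors = principal components
    return X_std @ P, eigvals[order[:k]]

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))                 # hypothetical dataset
Z, top_eigvals = pca_eig(X, k=2)
print(Z.shape, top_eigvals)                   # (200, 2) and the two largest eigenvalues
```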

Disadvantages of PCA:

What PCA pursues is to retain as much of the intrinsic information of the data as possible after dimensionality reduction, measuring the importance of a direction by the variance of the data projected onto it. However, such a projection does not necessarily keep the data distinguishable: points from different classes may end up mixed together and become impossible to separate.
This is also the biggest problem with PCA, and it is why PCA often does not perform well for classification.

4. t-distributed stochastic neighbor embedding (t-SNE)

Steps:

  • As with PCA, the data is normalized before processing
  • Compute the similarity between every pair of data points in the high-dimensional space, and the similarity between the corresponding points in the low-dimensional space
  • In the low-dimensional space, the similarities are measured with the Student t-distribution
  • Compute the difference between the high-dimensional and low-dimensional similarity matrices, use it to define the loss function, and then minimize it with gradient descent (see the sketch below)
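
The following is a minimal sketch of running t-SNE with scikit-learn; the digits dataset and the parameter values are illustrative assumptions, not part of the original text.

```python
# A minimal sketch of running t-SNE with scikit-learn (assuming scikit-learn is
# installed); the digits dataset and the parameter values are illustrative.
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# Normalize the data first, as with PCA
X_std = StandardScaler().fit_transform(X)

# TSNE matches high- and low-dimensional similarity distributions (a KL-divergence
# loss minimized by gradient descent) and returns a 2-D embedding.
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_std)
print(X_2d.shape)   # (1797, 2), ready for a scatter plot colored by the labels y
```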
