### PCA Dimensionality Reduction (Principal Component Analysis)

# The basic idea of PCA

**Principal component analysis (PCA) finds the most important directions in the data and uses them to represent the original data.** In plain terms, it **reduces the data from n dimensions to n' dimensions**, while hoping that the n'-dimensional feature set retains as much of the original information as possible.

# PCA mathematical derivation (maximum variance method)

For example, if the data lies in a two-dimensional plane, we project the original data (blue dots) onto a new coordinate axis (the yellow-and-blue crosshairs). The way to find this new axis is to **make the projected points (red points) spread as far as possible from the origin of the new axis**, i.e. to maximize the variance of the projections. This is the maximum variance method.

## The first step: center the data

Reasoning about both the new and the old coordinate axes at once is cumbersome, so we first center the data (shift it so that its mean lies at the origin). If the data are not centered, we cannot find the optimal projection for dimensionality reduction. **This step is necessary.**
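The centering step can be sketched in a few lines of NumPy (the data values below are made up, purely for illustration):

```python
import numpy as np

# Toy 2-D data set (hypothetical values, just for illustration)
X = np.array([[2.0, 1.0],
              [3.0, 4.0],
              [4.0, 2.0],
              [5.0, 5.0]])

# Centering: subtract the per-feature mean so the data mean sits at the origin
X_centered = X - X.mean(axis=0)

print(X_centered.mean(axis=0))  # each feature mean is now numerically zero
```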

## The second step: find the new coordinate axis

How do we find the best new coordinate axis for principal component analysis?

The greater the distances between the projected points and the origin of the axis, the better (this is exactly maximizing the variance).

As shown in the figure, the red dotted line is the new coordinate axis, which we call PC1; the green dots are the original sample points; the green crosses are their projections onto the new axis; and d1, d2, d3, ..., d6 are the distances from the projected points to the origin.

**All we need to do is find the axis that maximizes the sum of squared distances, ∑di².**

**Here comes the math!**

As the figure above suggests, finding the best new axis actually amounts to computing the **eigenvalues and eigenvectors of the data's correlation coefficient matrix**: the eigenvector with the largest eigenvalue points in the direction of maximum variance.
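A minimal NumPy sketch of this eigen-decomposition step (here on randomly generated toy data, using the covariance matrix of the centered data; on standardized data this is the same as the correlation matrix). The variance of the projections onto PC1 equals the largest eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D toy data (assumed, for illustration only)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0],
                                          [1.2, 0.5]])
X = X - X.mean(axis=0)                 # step 1: center the data

C = np.cov(X, rowvar=False)            # covariance matrix of centered data
eigvals, eigvecs = np.linalg.eigh(C)   # symmetric matrix -> eigh (ascending)

order = np.argsort(eigvals)[::-1]      # sort eigenvalues descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pc1 = eigvecs[:, 0]                    # direction of maximum variance
proj = X @ pc1                         # projected coordinates d_i on PC1
# variance of the projections == largest eigenvalue
print(proj.var(ddof=1), eigvals[0])
```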

## The third step: choose how much of the information you need

When PCA finds the new coordinate axes, the number of axes equals the number of original data features. But we project the data onto the new axes precisely so that as few features as possible can express most of the information in the whole data set.

In the figure below, we can see that we have obtained two coordinate axes, PC1 and PC2, respectively. The two coordinate axes are perpendicular to each other and do not interfere with each other.

Of the total information, **PC1 carries 83% and PC2 carries 17%**. Anyone can see which axis should be kept to represent the information after dimensionality reduction.
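The share of information carried by each axis is just each eigenvalue divided by the sum of all eigenvalues. A tiny sketch, with hypothetical eigenvalues chosen to match the 83% / 17% split above:

```python
import numpy as np

# Hypothetical eigenvalues of the covariance matrix (made up to
# match the 83% / 17% split described in the text)
eigvals = np.array([4.15, 0.85])

ratios = eigvals / eigvals.sum()   # share of total variance per component
print(ratios)                      # [0.83 0.17]
```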

In three dimensions the same idea applies: projecting onto PC1 and PC2 reduces the data from three dimensions to a two-dimensional plane while keeping most of the information.

# Advantages and disadvantages of PCA algorithm

As an **unsupervised dimensionality reduction method, PCA needs only an eigenvalue decomposition to compress and denoise data**, so it is widely used in practical scenarios.

**The main advantages** of the PCA algorithm are:

1) Only the variance is needed to measure the amount of information, so the method is not affected by factors outside the data set.

2) The principal components are orthogonal to each other, which eliminates the mutual influence between the original data components.

3) The computation is simple: the main operation is an eigenvalue decomposition, which is easy to implement.
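To illustrate how simple the whole procedure is, here is a minimal end-to-end PCA sketch combining the three steps above (in practice one would normally use a library implementation such as `sklearn.decomposition.PCA`; the data here is random, for illustration only):

```python
import numpy as np

def pca(X, n_components):
    """Minimal PCA via eigen-decomposition of the covariance matrix.

    A sketch of the three steps in the text: center, decompose, project.
    """
    X = X - X.mean(axis=0)               # step 1: center the data
    C = np.cov(X, rowvar=False)          # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C) # step 2: eigen-decomposition
    order = np.argsort(eigvals)[::-1]    # largest variance first
    W = eigvecs[:, order[:n_components]] # step 3: keep top components
    return X @ W                         # reduced representation

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))            # 100 samples, 3 features
Z = pca(X, 2)
print(Z.shape)                           # (100, 2)
```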

**The main disadvantages** of the PCA algorithm are:

1) The feature dimensions of the principal components are somewhat ambiguous in meaning; they are not as interpretable as the original sample features.

2) Non-principal components with small variance may still contain important information about differences between samples; discarding them during dimensionality reduction may affect subsequent data processing.

3) When the data are not normally distributed, PCA does not work very well.

# Questions

1. **Why square the distances?** Because without squaring, positive and negative distances would cancel each other out.

2. **Why the correlation coefficient matrix instead of the covariance matrix?** This is answered in the section below.

# Does PCA require standardization (removing units)?

(1) When the units of all attributes are the same (for example, all in kg, or all in meters), the attributes are directly comparable, so it is enough to compute the covariance between attributes. **In general, the size of the covariance does not indicate the degree of correlation (covariance only indicates whether the correlation is positive or negative), but when the units are the same, we can regard a larger covariance as indicating a stronger correlation.**

(2) When the attributes have different units (for example, one in kg and another in meters), the covariance no longer indicates the degree of correlation. In that case we need the correlation coefficient instead.

The correlation coefficient is the covariance divided by the two standard deviations (dividing by the standard deviations is a way of removing the units). **It eliminates the influence of the ranges of the two variables and simply reflects how similarly the two variables change per unit.**
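This relationship between covariance and correlation can be checked directly in NumPy (the data below is randomly generated, purely for illustration): dividing the covariance by the two standard deviations gives the same value as `np.corrcoef`.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 3.0 * x + rng.normal(size=500)    # toy variable correlated with x

cov_xy = np.cov(x, y)[0, 1]           # covariance (carries units of x*y)
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))  # correlation coefficient

# np.corrcoef computes exactly the same normalized quantity
print(r, np.corrcoef(x, y)[0, 1])
```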