PCA: Principal Component Analysis

PCA concept:

The main idea of PCA is to map n-dimensional features onto k dimensions (k < n). These k dimensions are new, mutually orthogonal features, called principal components, reconstructed from the original n-dimensional features. PCA works by finding a set of mutually orthogonal coordinate axes in the original space, where the choice of new axes is closely tied to the data itself. The first new axis is the direction of greatest variance in the original data; the second is the direction of greatest variance among all directions orthogonal to the first axis; the third is the direction of greatest variance among all directions orthogonal to the first two; and so on, until n such axes are obtained. For axes chosen this way, most of the variance is contained in the first k axes, and the remaining axes contain almost no variance. We can therefore ignore the remaining axes and keep only the first k, which hold most of the variance. In effect, this keeps only the feature dimensions that contain most of the variance and ignores the dimensions whose variance is nearly zero, which is exactly the dimensionality reduction we want.
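Stated as an optimization (a standard formulation of the same idea, not spelled out in the original post): with mean-centered data $X \in \mathbb{R}^{n \times d}$ and sample covariance matrix $C$, the first axis is the unit direction that maximizes the projected variance:

$$w_1 = \arg\max_{\lVert w \rVert = 1} \operatorname{Var}(Xw) = \arg\max_{\lVert w \rVert = 1} w^{\top} C w, \qquad C = \frac{1}{n-1} X^{\top} X$$

The maximizer is the eigenvector of $C$ with the largest eigenvalue; each later axis solves the same problem under the extra constraint of being orthogonal to all earlier axes, and turns out to be the eigenvector with the next-largest eigenvalue. This is why the pseudocode below boils down to an eigendecomposition of the covariance matrix.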

PCA algorithm:

Advantages: reduces the complexity of the data and identifies its most important features.

Disadvantages: may not always be necessary, and useful information can be lost.

Applicable data type: numerical data.

 

Dataset Download Link: http://archive.ics.uci.edu/ml/machine-learning-databases/

Data set used in the PCA example: http://archive.ics.uci.edu/ml/machine-learning-databases/secom/

(1) Open the data set and compute the number of features (each row of the secom data is one sample); replace every NaN value with the mean of the non-NaN values in that feature column (see the sketch after this list).

(2) Remove the mean from the feature values.

(3) Compute the covariance matrix and analyze its eigenvalues.
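A minimal NumPy sketch of the preprocessing step, assuming the file secom.data from the UCI link above is in the working directory (the function name replace_nan_with_mean is chosen here, not taken from the original post):

import numpy as np

def replace_nan_with_mean(path='secom.data'):
    # Whitespace-separated file, one sample per row; 'NaN' fields parse as nan.
    data = np.genfromtxt(path, dtype=float)
    print('samples x features:', data.shape)   # SECOM should be 1567 x 590
    for j in range(data.shape[1]):
        col = data[:, j]
        col[np.isnan(col)] = np.nanmean(col)   # fill NaNs with the non-NaN column mean
    return data

data = replace_nan_with_mean()

np.nanmean averages only the non-NaN entries, which matches step (1); a column that is entirely NaN would still come back as NaN and would have to be dropped separately.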

Pseudocode for transforming the data into the top N principal components (a NumPy sketch follows the list):

(1) Remove the mean.

(2) Compute the covariance matrix.

(3) Compute the eigenvalues and eigenvectors of the covariance matrix.

(4) Sort the eigenvalues in descending order.

(5) Keep the top N eigenvectors.

(6) Transform the data into the new space constructed from these N eigenvectors.
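The six steps map almost line for line onto NumPy. This is a sketch, not the original post's code; the function and variable names are mine:

import numpy as np

def pca(data, top_n):
    mean_vals = data.mean(axis=0)
    mean_removed = data - mean_vals                  # (1) remove the mean
    cov_mat = np.cov(mean_removed, rowvar=False)     # (2) covariance matrix
    eig_vals, eig_vects = np.linalg.eigh(cov_mat)    # (3) eigenvalues and eigenvectors
    order = np.argsort(eig_vals)[::-1]               # (4) sort eigenvalues descending
    top_vects = eig_vects[:, order[:top_n]]          # (5) keep the top N eigenvectors
    return mean_removed @ top_vects                  # (6) project into the new space

For the cleaned SECOM matrix, pca(data, 6) returns a 1567 x 6 representation. np.linalg.eigh is used rather than eig because a covariance matrix is symmetric, so the eigenvalues are guaranteed to be real.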

Note:

Reference: https://zhuanlan.zhihu.com/p/37777074

Sample mean:

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Sample variance:

$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$

Covariance of sample x and sample y:

$\mathrm{Cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$

Differences among the sample mean, variance, and covariance:

Sample mean: averages a single dimension across the different samples.

Sample variance: also computed along a single dimension across the n samples; it measures how spread out that dimension is.

Covariance: needs at least two-dimensional data (it extends to any number of dimensions) and describes the relationship between two dimensions of the samples. A positive covariance means x and y move together, a negative one means they move in opposite directions, and a covariance of 0 means x and y are uncorrelated (see the numeric check below).
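A quick numeric check of the three formulas above (toy numbers chosen here; ddof=1 selects the 1/(n-1) sample versions):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 4.0])

print(np.mean(x))          # sample mean of x: 2.5
print(np.var(x, ddof=1))   # sample variance of x: 5/3 = 1.6667
print(np.cov(x, y)[0, 1])  # sample covariance of x and y: 3.5/3 = 1.1667

np.cov already uses the 1/(n-1) normalization by default, and the sign of the [0, 1] entry tells you whether x and y move together or in opposite directions.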

For example, for three-dimensional data (x, y, z), the covariance matrix is:

$$C = \begin{pmatrix} \mathrm{Cov}(x,x) & \mathrm{Cov}(x,y) & \mathrm{Cov}(x,z) \\ \mathrm{Cov}(y,x) & \mathrm{Cov}(y,y) & \mathrm{Cov}(y,z) \\ \mathrm{Cov}(z,x) & \mathrm{Cov}(z,y) & \mathrm{Cov}(z,z) \end{pmatrix}$$

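NumPy builds this entire matrix in one call (toy random data; by default np.cov treats each row as one variable and each column as one observation):

import numpy as np

rng = np.random.default_rng(0)
xyz = rng.normal(size=(3, 100))   # rows are x, y, z; 100 samples of each

C = np.cov(xyz)                   # the 3 x 3 covariance matrix shown above
print(C.shape)                    # (3, 3)
print(np.allclose(C, C.T))        # True: Cov(x, y) = Cov(y, x), so C is symmetric

The symmetry check reflects the definition: swapping x and y in the covariance formula changes nothing.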

Original post: https://www.cnblogs.com/shuangcao/p/11670765.html