Task II: Data compression - Principal Component Analysis (PCA)

Assignment 2:
The steps of principal component analysis and a code implementation of its application

First, what is PCA?

1, PCA Introduction

PCA reduces the number of indicators that need to be analyzed while losing as little as possible of the information contained in the original indicators, so that the collected data can still be analyzed comprehensively. Because the variables are correlated with one another to some degree, it is possible to summarize the various kinds of information carried by all the variables with a smaller number of composite indices. The core idea is to emphasize the heavily weighted features and to discard those that carry little information.

2, PCA application

PCA is mainly used for dimensionality reduction. Take an image represented as a matrix: some elements are not distinctive and cannot be used for recognition, while other elements have obvious characteristics and a large variance (the variance of an element measures how spread out it is over the whole data set); those high-variance elements are usually the basis for image recognition. The role of PCA is to remove the small-variance, "generic" dimensions and to keep the large-variance dimensions that help recognition. This reduces the dimensionality of the image matrix and therefore the amount of computation.

3, data dimensionality reduction

Reference "Linear Algebra":, in general, will be able to analyze a n-1 dimensional subspace in the data center and the rotating coordinate system n-dimensional space of the n points.

Take three-dimensional space as an example: although the operation above reduces three-dimensional data to two dimensions, the reduced data loses no information, because the components of the data along the third dimension are all zero.
Now suppose the data has a slight jitter along the z' axis. We can still represent it with the two-dimensional data described above, because we can regard the x' and y' axes as carrying the principal components of the information, while the jitter along z' can be treated as noise.
In other words, this data set should have been perfectly correlated; the noise makes the correlation imperfect, but the spread of the data along the z' axis subtends only a very small angle at the origin, so there is still a strong correlation along z'. Taken together, the projection of the data onto the x' and y' axes can be regarded as the principal components of the data.
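
As a rough numerical sketch of this idea (the data below is synthetic and only for illustration), one can generate points whose z' component is a small jitter and check that dropping it loses almost nothing:

import numpy as np

rng = np.random.default_rng(0)
xy = rng.normal(size=(100, 2))          # strong components along x' and y'
z = 0.01 * rng.normal(size=(100, 1))    # tiny jitter along z', treated as noise
points = np.hstack([xy, z])             # 100 samples in 3-D

print(points.var(axis=0))               # the z' variance is tiny compared with x' and y'
projected = points[:, :2]               # projecting onto the x'-y' plane keeps almost all information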

The idea of PCA is to map n-dimensional features onto k dimensions (k < n); these k new features are orthogonal to one another and are called the principal components.

Note that:
1, the k new dimensions are constructed from scratch, rather than obtained by simply removing n-k of the original n features.
2, we do not know the concrete meaning of the new k dimensions; only the machine "knows" it.

Second, linear algebra background

1, matrix representation of a change of basis

In general, suppose we have M N-dimensional vectors and want to express them in a new space spanned by R N-dimensional basis vectors. First arrange the R basis vectors as the rows of a matrix A, and arrange the original vectors as the columns of a matrix B; then the product AB is the transformed result, and the m-th column of AB is the transform of the m-th column of B.
R can be less than N, and R determines the dimensionality of the transformed data. In other words, we can transform data from an N-dimensional space into a lower-dimensional one, and the dimensionality after the transform depends on the number of basis vectors. Therefore this kind of matrix multiplication can also represent a dimensionality-reducing transform.

This gives a physical interpretation of matrix multiplication:
multiplying two matrices means transforming every column vector of the right-hand matrix into the space whose basis is formed by the row vectors of the left-hand matrix (each column of the product is one transformed vector).
More abstractly, a matrix can represent a linear transformation.
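
A minimal numpy sketch of this change of basis (the basis vectors and data below are made up for illustration):

import numpy as np

# R = 2 basis vectors arranged as the rows of A, each of dimension N = 3
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
# M = 5 vectors arranged as the columns of B
B = np.random.rand(3, 5)

AB = A @ B            # the m-th column of AB is the transform of the m-th column of B
print(AB.shape)       # (2, 5): the data has been reduced from 3 dimensions to 2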

2, the covariance matrix

- covariance (assuming zero mean):

$Cov\left(a,b\right)=\frac{1}{m}\sum^{m}_{i=1}a_{i}b_{i}$
Intuitively, the covariance represents the expectation of the joint error of the two variables.

- covariance matrix:

The covariance matrix contains the covariances between the different dimensions, not between different samples. For two zero-mean variables a and b it is
$C=\begin{pmatrix} Cov(a,a) & Cov(a,b)\\ Cov(b,a) & Cov(b,b)\end{pmatrix}$
The values on the diagonal of the matrix are the variances of the individual elements, and the values at the other positions are the covariances between elements.

  • The larger an element on the diagonal, the stronger the signal and the more important the variable; a smaller element indicates possible noise or a secondary variable.
  • The size of an off-diagonal element corresponds to the degree of correlation (redundancy) between the corresponding pair of observed variables.
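
A small numpy sketch of computing such a covariance matrix from zero-mean data (synthetic data, illustrative names):

import numpy as np

X = np.random.rand(100, 3)
X = X - X.mean(axis=0)                  # zero-mean each column (each column is one variable)

C = (X.T @ X) / X.shape[0]              # covariances between dimensions, not between samples
print(np.diag(C))                       # diagonal entries: the variance of each variable
print(np.allclose(C, np.cov(X, rowvar=False, bias=True)))  # agrees with numpy's np.cov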

Let A be the raw data matrix, M the coordinate-transformation matrix, and L the data matrix after dimensionality reduction; then:
$$A_{N\times P}\cdot M_{P\times K}=L_{N\times K}$$
where the covariance matrix corresponding to A is C and the covariance matrix corresponding to L is D. The relationship between C and D is derived as follows:
$$C_{P\times P}=\frac{1}{N}A^{T}A$$
$$D=\frac{1}{N}\left(AM\right)^{T}\left(AM\right)=\frac{1}{N}M^{T}A^{T}AM=M^{T}\left(\frac{1}{N}A^{T}A\right)M=M^{T}CM$$
Objective: make D a diagonal matrix, i.e. a matrix whose nonzero values appear only on the diagonal.

Thus, the problem of solving for the coordinate-transformation matrix M is converted into the problem of diagonalizing the real symmetric matrix C and finding the matrix formed by its eigenvectors.

The size of the eigenvalue corresponding to an eigenvector represents how important that direction is.
Multiplying the original data matrix A by the first k columns of M, whose columns are the eigenvectors arranged from left to right by decreasing eigenvalue, yields the dimension-reduced data matrix L that we need.
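
The following sketch (synthetic data; np.linalg.eigh is used because C is real symmetric) walks through this derivation: it diagonalizes C with its eigenvector matrix and projects A onto the first k columns:

import numpy as np

A = np.random.rand(100, 4)
A = A - A.mean(axis=0)                      # centre the data
C = (A.T @ A) / A.shape[0]                  # covariance matrix of A

eigvals, M = np.linalg.eigh(C)              # eigendecomposition of a real symmetric matrix
order = np.argsort(eigvals)[::-1]           # sort eigenvectors by decreasing eigenvalue
eigvals, M = eigvals[order], M[:, order]

D = M.T @ C @ M                             # D = M^T C M
print(np.allclose(D, np.diag(eigvals)))     # True: D is diagonal up to numerical error

k = 2
L = A @ M[:, :k]                            # dimension-reduced data matrix L
print(L.shape)                              # (100, 2)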

Third, PCA implementation steps

Suppose we have m pieces of n-dimensional data.

  1. Arrange the raw data into a matrix X with m rows and n columns (one sample per row).
  2. Zero-mean each column of X (each column represents one attribute field), i.e. subtract the mean of that column, as data pre-processing.
  3. Compute the covariance matrix Cov.
  4. Compute the eigenvalues of the covariance matrix and the corresponding eigenvectors.
  5. Arrange the eigenvectors as columns, from left to right by decreasing eigenvalue, and take the first k columns to form the matrix P.
  6. Y = XP is the data reduced to dimension k.

Fourth, code implementation

Here is an implementation based on Python's numpy library:

import numpy as np


def zeroMean(dataMat):
    average = np.mean(dataMat, axis=0)  # compute the mean of each column
    new_dataMat = dataMat - average
    return new_dataMat


def pca(dataMat, n):
    new_dataMat = zeroMean(dataMat)
    covMat = np.cov(new_dataMat, rowvar=0)  # covMat is the covariance matrix we want

    # Compute the eigenvalues and right eigenvectors of a square array.
    # Returns the eigenvalues and eigenvectors of the matrix;
    # each column of eigVects is one eigenvector.
    eigVals, eigVects = np.linalg.eig(np.mat(covMat))

    eigValsSort = np.argsort(eigVals)  # argsort sorts the eigenvalues in ascending order and returns their indices

    n_eigValsSort = eigValsSort[-1:-(n + 1):-1]  # indices of the n largest eigenvalues
    n_eigVects = eigVects[:, n_eigValsSort]
    low_dataMat = new_dataMat * n_eigVects

    return low_dataMat
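
A short usage example for the function above, on synthetic data (for illustration only):

if __name__ == "__main__":
    np.random.seed(0)
    data = np.random.rand(10, 5)     # 10 samples with 5 attributes
    low_dim = pca(data, 2)           # keep the 2 principal components
    print(low_dim.shape)             # (10, 2)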
