[ML] Principal Component Analysis (PCA)

Theory

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

In short, its effect is dimensionality reduction: PCA reduces the dimensionality (the number of variables) of a data set while retaining as much variance as possible.

Given p samples with n features collected for each, we obtain a p × n feature map (data matrix). At most min(n, p) orthogonal principal components can be extracted from this feature map; usually p >> n.
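
As a rough illustration of this bound, the sketch below builds a small random data matrix and counts the covariance eigenvalues that are numerically non-zero. The sizes p = 5 and n = 8 and the tolerance 1e-10 are arbitrary choices for the example, not values from the text. Note that subtracting the mean uses up one direction, so centered data gives p − 1 non-zero components here, which stays within min(n, p).

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 8                       # hypothetical sizes: fewer samples than features
X = rng.normal(size=(p, n))       # p x n feature map

Xc = X - X.mean(axis=0)           # center each feature
cov = Xc.T @ Xc / (p - 1)         # n x n sample covariance matrix
eigvals = np.linalg.eigvalsh(cov) # real eigenvalues of a symmetric matrix

# Count the directions that actually carry variance
n_nonzero = int(np.sum(eigvals > 1e-10))
print(n_nonzero, "<=", min(n, p))
```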

PCA is mainly used for feature extraction in data pre-processing.

feature selection : choose a specific subset of the original features directly
feature extraction : construct a new set of features from the original features
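
A minimal sketch of the difference, assuming a hypothetical 100 × 4 data matrix: selection keeps some of the original columns unchanged, while extraction (done here with PCA via SVD) builds new columns as linear combinations of all the original ones. The column indices and the number of components are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))     # hypothetical data: 100 samples, 4 features

# Feature selection: keep a subset of the original columns as they are
X_selected = X[:, [0, 2]]

# Feature extraction: build new features from all columns (PCA via SVD)
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_extracted = Xc @ Vt[:2].T       # coordinates along the first 2 principal axes
```

Each column of X_extracted mixes all four original features, so it no longer corresponds to any single measured variable.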

Details

PCA from different perspectives

Figure: PCA of a multivariate Gaussian distribution centered at (1,3) with a standard deviation of 3 in roughly the (0.866, 0.5) direction and of 1 in the orthogonal direction. The vectors shown are the eigenvectors of the covariance matrix scaled by the square root of the corresponding eigenvalue, and shifted so their tails are at the mean.
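
The setup in this caption can be reproduced approximately with the sketch below. The sample size (5000) and the random seed are assumptions made for the example. It draws points from the described Gaussian, estimates the covariance matrix, and scales each eigenvector by the square root of its eigenvalue, as the caption describes.

```python
import numpy as np

rng = np.random.default_rng(42)
mean = np.array([1.0, 3.0])

# Unit vector along (0.866, 0.5) and its orthogonal direction
d1 = np.array([0.866, 0.5])
d1 /= np.linalg.norm(d1)
d2 = np.array([-d1[1], d1[0]])

# Covariance with standard deviation 3 along d1 and 1 along d2
R = np.column_stack([d1, d2])
cov_true = R @ np.diag([3.0**2, 1.0**2]) @ R.T
X = rng.multivariate_normal(mean, cov_true, size=5000)

# Eigen-decomposition of the estimated covariance, sorted by decreasing eigenvalue
cov_est = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov_est)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Arrow vectors: eigenvectors (columns) scaled by sqrt(eigenvalue);
# drawing them from X.mean(axis=0) places their tails at the mean
arrows = eigvecs * np.sqrt(eigvals)
```

The first arrow should point along (0.866, 0.5) or its negative with length close to 3, and the second should be orthogonal to it with length close to 1. The sign of an eigenvector is arbitrary.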

Variance maximization from a probabilistic perspective

Low variance can often be assumed to represent undesired background noise. The dimensionality of the data can therefore be reduced, without loss of relevant information, by extracting a lower dimensional component space covering the highest variance. Using a lower number of principal components instead of the high-dimensional original data is a common pre-processing step that often improves results of subsequent analyses such as classification.
For visualization, the first and second components can be plotted against each other to obtain a two-dimensional representation of the data that captures most of the variance (assumed to be most of the relevant information), which is useful for analyzing and interpreting the structure of a data set.
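
As a sketch of how the share of variance kept by the first two components might be checked, the example below uses synthetic data: 200 samples and 5 features driven by 2 latent directions plus small noise, all of which are assumptions for illustration. It computes the fraction of variance explained by each component and the two-dimensional coordinates one would plot.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 5 observed features driven by 2 latent factors plus noise
latent = rng.normal(size=(200, 2)) * np.array([5.0, 2.0])
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Variance along each principal axis and its share of the total
explained_variance = S**2 / (len(X) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# 2-D representation: coordinates along the first two components
X_2d = Xc @ Vt[:2].T
print(explained_ratio.round(3))
```

X_2d[:, 0] and X_2d[:, 1] are the values one would pass to a scatter plot.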

From a geometric perspective

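PCA learns a linear orthogonal projection, a rotation w, such that the directions of greatest variance are aligned, in order, with the axes of the new space.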
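
This geometric picture can be checked numerically. The sketch below uses hypothetical 3-D data, and the matrix W plays the role of the rotation w in the text. It verifies that W is orthogonal, that the variances along the new axes come out in decreasing order, and that none of a batch of random unit directions has more variance than the first axis. Strictly speaking, the eigenvector matrix may also include a reflection, since eigenvector signs are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 3)) * np.array([4.0, 2.0, 0.5])  # hypothetical data
Xc = X - X.mean(axis=0)

# eigh returns eigenvalues in ascending order, so reverse the columns
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
W = eigvecs[:, ::-1]                 # orthogonal matrix, largest variance first

Z = Xc @ W                           # data expressed in the new axes
axis_var = Z.var(axis=0, ddof=1)     # variance along each new axis

# Variance along 1000 random unit directions, for comparison
dirs = rng.normal(size=(3, 1000))
dirs /= np.linalg.norm(dirs, axis=0)
random_var = (Xc @ dirs).var(axis=0, ddof=1)

print(np.allclose(W.T @ W, np.eye(3)), axis_var.round(2))
```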

Properties

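  • PCA transforms the data into a representation whose elements are mutually uncorrelated, which can help remove unknown factors of variation in the data, i.e. noise.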
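
The decorrelation property can be seen in a small sketch with two deliberately correlated features. The coefficients 0.8 and 0.3 and the sample size are made up for the example. After projecting onto the eigenvectors of the covariance matrix, the off-diagonal covariance is zero up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=500)
# Second feature is strongly correlated with the first
X = np.column_stack([x, 0.8 * x + 0.3 * rng.normal(size=500)])

Xc = X - X.mean(axis=0)
cov_before = np.cov(Xc, rowvar=False)   # off-diagonal clearly non-zero

eigvals, W = np.linalg.eigh(cov_before)
Z = Xc @ W                              # PCA representation
cov_after = np.cov(Z, rowvar=False)     # diagonal, with eigvals on the diagonal

print(cov_before.round(3))
print(cov_after.round(3))
```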

Algorithm Flow

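1. Subtract the mean, i.e. subtract from each feature dimension its own mean (to avoid effects from differing units and orders of magnitude, the data should be standardized first)
2. Compute the covariance matrix
3. Compute the eigenvalues and eigenvectors of the covariance matrix
4. Sort the eigenvalues from largest to smallest
5. Keep the eigenvectors with the largest eigenvalues (k of them, where k is the chosen target dimension)
6. Transform the data into the new space spanned by these k eigenvectors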
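
The steps above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the function name pca, the parameter k (the number of components kept), and the standardize flag are names chosen for this example.

```python
import numpy as np

def pca(X, k, standardize=True):
    """Project X (samples x features) onto its top-k principal components."""
    X = np.asarray(X, dtype=float)

    # Step 1: subtract the per-feature mean (optionally scale to unit variance)
    Xc = X - X.mean(axis=0)
    if standardize:
        std = Xc.std(axis=0, ddof=1)
        std[std == 0] = 1.0          # leave constant features unscaled
        Xc = Xc / std

    # Step 2: covariance matrix (features x features)
    cov = np.cov(Xc, rowvar=False)

    # Step 3: eigenvalues and eigenvectors (eigh, since cov is symmetric)
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Step 4: sort eigenvalues from largest to smallest
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Step 5: keep the k eigenvectors with the largest eigenvalues
    W = eigvecs[:, :k]

    # Step 6: transform the data into the new k-dimensional space
    Z = Xc @ W
    return Z, W, eigvals

# Usage on hypothetical data: 50 samples, 3 correlated features, keep 2 components
rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.1]])
X = rng.normal(size=(50, 3)) @ A
Z, W, eigvals = pca(X, k=2)
print(Z.shape, eigvals.round(3))
```

The variance of each column of Z equals the corresponding eigenvalue, which gives a quick sanity check on the result.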

Application Scenarios


Reposted from blog.csdn.net/baishuo8/article/details/81869116