Principal Component Analysis (PCA)

A theoretical analysis of principal component analysis (the PCA dimensionality reduction algorithm)

1. Feature extraction and feature selection

Strictly speaking, principal component analysis belongs to feature extraction rather than feature selection.

Let's first look at what feature selection is.

For example, suppose our training data set is:
\[ \left \{ (x_{1},y_{1}),(x_{2},y_{2}),(x_{3},y_{3}),...,(x_{p},y_{p}) \right \} \]
where:
\[ x_{i}=[x_{i1},x_{i2},x_{i3},...,x_{in}]^{T} \]
i.e. each \(x_{i}\) is an n-dimensional column vector.

So what problem does feature selection solve?

Among the n dimensions there may be redundancy, i.e. some dimensions of \(x_{i}\) have no effect on the problem being studied. How do we select the most useful m (m < n) features from the n dimensions?

This is the feature selection problem. In any case, the final features are chosen only from the original ones; no new features are created.

And what problem does feature extraction solve?

From the n dimensions \([x_{i1},x_{i2},x_{i3},...,x_{in}]\), we construct:
\[ \left \{ f_{1}(x_{i1},x_{i2},x_{i3},...,x_{in}), f_{2}(x_{i1},x_{i2},x_{i3},...,x_{in}), f_{3}(x_{i1},x_{i2},x_{i3},...,x_{in}),..., f_{m}(x_{i1},x_{i2},x_{i3},...,x_{in}) \right \} \]
That is, we construct \(\left ( f_{1},f_{2},f_{3},...,f_{m} \right )\), where each \(f\) maps \([x_{i1},x_{i2},x_{i3},...,x_{in}]\) to a new value \(f_{i}(x_{i1},x_{i2},x_{i3},...,x_{in})\). If m < n, this achieves the same dimensionality reduction. This is called feature extraction, and every feature produced is brand new.

Clearly, then, principal component analysis (PCA) belongs to feature extraction.
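To make the distinction concrete, here is a minimal NumPy sketch (my own toy illustration, not part of the original post): feature selection keeps a subset of the original columns unchanged, while feature extraction builds brand-new features as functions, here simple linear combinations, of all the original columns.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))       # 100 samples, n = 5 original features

# Feature selection: keep m = 2 of the original columns as they are.
selected = X[:, [0, 3]]             # shape (100, 2), still the original features

# Feature extraction: each new feature f_j depends on ALL n original columns.
W = rng.normal(size=(5, 2))         # two arbitrary linear maps f_1, f_2
extracted = X @ W                   # shape (100, 2), brand-new features

print(selected.shape, extracted.shape)   # (100, 2) (100, 2)
```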

2. Theoretical derivation of principal component analysis

Derivation of the first principal component

The idea behind principal component analysis is very simple; the hard part is the theoretical proof. Principal component analysis is essentially matrix multiplication, and matrix multiplication essentially projects each column of the right matrix into a new space whose basis is given by the rows of the left matrix. If the left matrix has fewer rows than the right matrix, then, viewed row-wise, this achieves dimensionality reduction. For example:
\[ A_{3\times 5}\times B_{5\times 8}=C_{3\times 8} \]
Viewed this way, the columns of \(B\) go from 5-dimensional down to 3-dimensional. OK, with this understanding, let's look at principal component analysis.
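A quick NumPy sketch of the shape bookkeeping above (my own illustration with arbitrary random matrices): each of the 8 columns of \(B\) is 5-dimensional, and after left-multiplying by \(A_{3\times 5}\) the corresponding column of \(C\) is 3-dimensional.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))   # rows of A span the new 3-dimensional space
B = rng.normal(size=(5, 8))   # 8 column vectors, each 5-dimensional

C = A @ B                     # each column of B is projected onto the rows of A
print(C.shape)                # (3, 8): the 8 columns are now 3-dimensional
```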

Principal component analysis does the following: construct \(A_{m\times n}\) and \(b_{m\times 1}\) such that \(Y_{m\times 1}=A_{m\times n}X_{n\times 1}+b_{m\times 1}\), with m < n.

In this way, \(X_{n\times 1}\) is reduced to \(Y_{m\times 1}\).

It still looks very simple: just find an \((A, b)\) that performs the dimensionality reduction above. But we cannot help asking one question: if we reduce the dimension with a matrix like this, won't we lose the information contained in \(X\)?

The answer is yes: dimensionality reduction almost certainly loses information, so we certainly cannot pick \((A, b)\) arbitrarily. What we need is an \((A, b)\) that retains as much as possible of the information contained in \(X\) after the reduction. But then the questions arise:

1. Where is the information of \(X\)?

2. How do we measure it?

3. How do we find the \((A, b)\) that retains the most information?

In fact, in statistics the information in a distribution is mainly measured by the variance (or standard deviation), so what principal component analysis does is:

Find the direction along which the projected data has the largest variance, and project onto that direction.

For example:

Each red point represents one sample \(X_{i}\), where \(X_{i}=[x_{i1},x_{i2}]^{T}\).

That is, each sample \(x_{i}\) has two dimensions, so if we want to reduce the dimension we can only go down to one dimension. Suppose there are two candidate projection directions, a and b, with the green points showing the result of projecting onto a. Principal component analysis considers a the better projection direction. You can think of it this way: the more spread out the projected points are, the less likely they are to coincide, and coinciding points necessarily mean a complete loss of information. The projection onto a is called the first principal component, and the projection onto b the second principal component. And how do we measure "spread out"? By the variance! Therefore, the larger the variance of the projected data, the more information it contains and the less information is lost.
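The "project onto the direction with the largest variance" idea can be checked numerically. Below is a small sketch with made-up data: 2-D points stretched along the 45-degree line, a direction a along that spread, and a direction b perpendicular to it. The variance of the projection onto a is much larger, so projecting onto a loses far less information.

```python
import numpy as np

rng = np.random.default_rng(0)
# 2-D samples stretched along the 45-degree line.
scale = np.array([[3.0, 0.0], [0.0, 0.5]])
theta = np.pi / 4
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
X = rng.normal(size=(500, 2)) @ scale @ rot.T

a = np.array([1.0,  1.0]) / np.sqrt(2)   # direction along the spread
b = np.array([1.0, -1.0]) / np.sqrt(2)   # perpendicular direction

Xc = X - X.mean(axis=0)                  # center the data first
print("variance along a:", np.var(Xc @ a))   # large (around 9)
print("variance along b:", np.var(Xc @ b))   # small (around 0.25)
```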

OK, so now let's find the best \((A, b)\). First, rewrite the expression as:
\[ Y_{m\times 1}=A_{m\times n}(X_{n\times 1}-\overline{X_{n\times 1}}) \]
In this way \(b_{m\times 1}\) simply becomes \(-A_{m\times n}\overline{X_{n\times 1}}\), and we only need to find \(A_{m\times n}\).

First, write \(A_{m\times n}\) in row-vector form:
\[ A_{m\times n}=\begin{bmatrix} a_{1}\\ a_{2}\\ ...\\ a_{m} \end{bmatrix} \]
where \(a_{i}=[a_{i1},a_{i2},a_{i3},...,a_{in}]\), and each \(a_{i}\) represents one projection direction.

In this case:
\[ Y_{m\times 1}=\begin{bmatrix} a_{1}\\ a_{2}\\ ...\\ a_{m} \end{bmatrix}_{m\times n} \times (X_{n\times 1}-\overline{X_{n\times 1}}) = \begin{bmatrix} a_{1}(X_{n\times 1}-\overline{X_{n\times 1}})\\ a_{2}(X_{n\times 1}-\overline{X_{n\times 1}})\\ ...\\ a_{m}(X_{n\times 1}-\overline{X_{n\times 1}}) \end{bmatrix} \]
Here we assume there are \(p\) samples \(X\) in total, i.e. \(\left \{ X_{i} \right \}_{i=1\sim p}\).

Then similarly:
\[ \left \{ Y_{i} \right \}_{m\times 1} = \begin{bmatrix} a_{1}(\left \{ X_{i} \right \}_{n\times 1}-\overline{X_{n\times 1}})\\ a_{2}(\left \{ X_{i} \right \}_{n\times 1}-\overline{X_{n\times 1}})\\ ...\\ a_{m}(\left \{ X_{i} \right \}_{n\times 1}-\overline{X_{n\times 1}}) \end{bmatrix} ,i=1\sim p \]

Similarly, write each \(Y_{i}\) in component form:
\[ \left \{ Y_{i} \right \}_{m\times 1}=\begin{bmatrix} y_{i1}\\ y_{i2}\\ ...\\ y_{im} \end{bmatrix},\ i=1\sim p \]
Now recall what we said earlier: each \(a_{i}\) represents one projection direction.

Now, since \(y_{i1}=a_{1}(\left \{ X_{i} \right \}_{n\times 1}-\overline{X_{n\times 1}})\),

\(y_{i1}\) represents the result of projecting onto the first direction \(a_{1}\) and is called the first principal component. What we want is for the variance of this projection result to be maximal, namely:
\[ max: \sum_{i=1}^{p}(y_{i1}-\overline{y}_{i1})^{2} \]
So we first compute \(\overline{y}_{i1}\):
\[ \overline{y}_{i1}=\frac{1}{p}\sum_{i=1}^{p}y_{i1}=\frac{1}{p}\sum_{i=1}^{p}a_{1}(\left \{ X_{i} \right \}_{n\times 1}-\overline{X_{n\times 1}})=\frac{a_{1}}{p}[(\sum_{i=1}^{p}\left \{ X_{i} \right \}_{n\times 1})-p\overline{X_{n\times 1}}] \]
Since
\[ p\overline{X_{n\times 1}}=p*\frac{1}{p}\sum_{i=1}^{p}\left \{ X_{i} \right \}_{n\times 1}=\sum_{i=1}^{p}\left \{ X_{i} \right \}_{n\times 1} \]
we get
\[ \overline{y}_{i1}=\frac{1}{p}\sum_{i=1}^{p}y_{i1}=\frac{1}{p}\sum_{i=1}^{p}a_{1}(\left \{ X_{i} \right \}_{n\times 1}-\overline{X_{n\times 1}})=\frac{a_{1}}{p}[(\sum_{i=1}^{p}\left \{ X_{i} \right \}_{n\times 1})-p\overline{X_{n\times 1}}]=0 \]
so the optimization problem becomes:
\[ max: \sum_{i=1}^{p}y_{i1}^{2}=\sum_{i=1}^{p}[a_{1}(\left \{ X_{i} \right \}_{n\times 1}-\overline{X_{n\times 1}})]^{2} \]
A remark here: \(a_{1}=(a_{11},a_{12},a_{13},...,a_{1n})\) is a row vector, so \(a_{1}(\left \{ X_{i} \right \}_{n\times 1}-\overline{X_{n\times 1}})\) is a constant (a scalar)!

Since a constant is equal to its own transpose, the formula above can be rewritten by transposing the second factor on the right:
\[ max: \sum_{i=1}^{p}y_{i1}^{2}=\sum_{i=1}^{p}a_{1}(\left \{ X_{i} \right \}_{n\times 1}-\overline{X_{n\times 1}})*[a_{1}(\left \{ X_{i} \right \}_{n\times 1}-\overline{X_{n\times 1}})]^{T} \]
Using \((AB)^{T}=B^{T}A^{T}\), expand the transposed factor on the right:
\[ max: a_{1}[\sum_{i=1}^{p}(\left \{ X_{i} \right \}_{n\times 1}-\overline{X_{n\times 1}})(\left \{ X_{i} \right \}_{n\times 1}-\overline{X_{n\times 1}})^{T}]a_{1}^{T} \]
The final result of this whole expression is still a constant: its dimensions are \((1\times n)(n\times n)(n\times 1)=1\times 1\).

OK, now let's use a single symbol for the middle part of the formula:
\[ \sum_{i=1}^{p}(\left \{ X_{i} \right \}_{n\times 1}-\overline{X_{n\times 1}})(\left \{ X_{i} \right \}_{n\times 1}-\overline{X_{n\times 1}})^{T}=\Sigma \]
In statistics this is called the covariance matrix.
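The covariance matrix can be assembled exactly as written above, as a sum of outer products of the centered samples. Here is a small NumPy sketch of my own; note that the \(\Sigma\) of this post is a raw sum, while np.cov divides by p-1 by default, so the two agree only up to that scaling factor.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 200, 4
X = rng.normal(size=(p, n))              # p samples, each an n-vector

x_bar = X.mean(axis=0)
Sigma = np.zeros((n, n))
for x_i in X:                            # sum_i (x_i - x_bar)(x_i - x_bar)^T
    d = (x_i - x_bar).reshape(n, 1)
    Sigma += d @ d.T

print(np.allclose(Sigma / (p - 1), np.cov(X.T)))   # True, up to the 1/(p-1) factor
print(np.allclose(Sigma, Sigma.T))                 # True: Sigma is symmetric
```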

Here is one more fact we will need: \((A+B)^{T}=A^{T}+B^{T}\)

Now look at the following computation:
\[ \Sigma^{T}=[\sum_{i=1}^{p}(\left \{ X_{i} \right \}_{n\times 1}-\overline{X_{n\times 1}})(\left \{ X_{i} \right\}_{n\times 1}-\overline{X_{n\times 1}})^{T}]^{T}=\\ [\sum_{i=1}^{p}(\left \{ X_{i} \right \}_{n\times 1}-\overline{X_{n\times 1}})(\left \{ X_{i}\right\}_{n\times 1}-\overline{X_{n\times 1}})^{T}]=\Sigma \]
We will use this below.

In principle the optimization problem has now been set up: the covariance matrix is known and we solve for \(a_{1}\). But we must also impose a constraint on \(a_{i}\). Why? As we said, \(a_{i}\) is a projection direction; mathematically it is a multi-dimensional vector, and a vector has both a direction and a length. Since what we care about is the direction, the length must be held fixed, so we impose a normalization constraint on \(a_{i}\).

So the complete optimization problem is:
\[ \text{maximize: } a_{1}\Sigma a_{1}^{T}\\ \text{subject to: } a_{1}a_{1}^{T}=\left | a_{1} \right |^{2}=1 \]
To solve this optimization problem we use the method of Lagrange multipliers.

If you are not familiar with it, see this article: https://zhuanlan.zhihu.com/p/38625079

The Lagrangian is:
\[ F(a_{1})=a_{1}\Sigma a_{1}^{T}-\lambda (a_{1}a_{1}^{T}-1) \]
Differentiate with respect to \(a_{1}\). First, here are two matrix-derivative results; matrix calculus may be unfamiliar, so look them up online if they are unclear:

\(\frac{\mathrm{d} (a_{1}\Sigma a_{1}^{T})}{\mathrm{d} a_{1}}=2(\Sigma a_{1}^{T})^{T},\ \ \ \frac{\mathrm{d} (a_{1}a_{1}^{T})}{\mathrm{d} a_{1}}=2a_{1}\)
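If these matrix-derivative results feel unfamiliar, they can also be spot-checked with finite differences. The sketch below (my own check, treating \(a_{1}\) as a row vector and using a symmetric \(\Sigma\), as in the derivation) confirms both identities numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.normal(size=(n, n))
Sigma = M @ M.T                     # a symmetric, covariance-like matrix
a1 = rng.normal(size=n)             # a_1 as a row vector (1-D array)

def num_grad(f, a, eps=1e-6):
    """Central-difference gradient of a scalar function f at a."""
    g = np.zeros_like(a)
    for k in range(a.size):
        e = np.zeros_like(a)
        e[k] = eps
        g[k] = (f(a + e) - f(a - e)) / (2 * eps)
    return g

# d(a Sigma a^T)/da = 2 (Sigma a^T)^T = 2 a Sigma, since Sigma is symmetric
g1 = num_grad(lambda a: a @ Sigma @ a, a1)
print(np.allclose(g1, 2 * Sigma @ a1, atol=1e-4))   # True

# d(a a^T)/da = 2 a
g2 = num_grad(lambda a: a @ a, a1)
print(np.allclose(g2, 2 * a1, atol=1e-6))           # True
```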

So the derivative is:
\[ \frac{\partial F}{\partial a_{1}}=2(\Sigma a_{1}^{T})^{T}-2\lambda a_{1} \]
Set it equal to zero:
\[ 2(\Sigma a_{1}^{T})^{T}-2\lambda a_{1}=0\\ \Rightarrow (\Sigma a_{1}^{T})^{T}=\lambda a_{1} \]
Now transpose both sides of the last equation (note that \(\lambda\) is a scalar!):
\[ \Sigma a_{1}^{T}=(\lambda a_{1})^{T}=\lambda a_{1}^{T} \]
By the definition of eigenvalues and eigenvectors, in the equation above (here comes the key point!):

\(a_{1}^{T}\) is an eigenvector of the covariance matrix \(\Sigma\), and \(\lambda\) is its corresponding eigenvalue!

We want to maximize \(a_{1}\Sigma a_{1}^{T}\), which can be transformed as follows:
\[ a_{1}\Sigma a_{1}^{T}=a_{1}(\Sigma a_{1}^{T})=a_{1}\lambda a_{1}^{T}=\lambda(a_{1}a_{1}^{T})=\lambda \]
At this point we can conclude: to maximize \(a_{1}\Sigma a_{1}^{T}\) we must maximize \(\lambda\). So \(\lambda\) is the largest eigenvalue of the covariance matrix \(\Sigma\), and the \(a_{1}\) we seek is the eigenvector corresponding to that largest eigenvalue.

The \(a_{1}\) obtained this way is the direction that maximizes the variance, so the resulting first principal component retains the most information.
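This can be verified numerically: take the eigenvector of \(\Sigma\) with the largest eigenvalue as \(a_{1}\); the variance of the data projected onto it then equals that eigenvalue. A sketch with synthetic data follows (I use the 1/p-scaled covariance here so that "variance" and "eigenvalue" match directly).

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 1000, 3
X = rng.normal(size=(p, n)) @ rng.normal(size=(n, n))   # correlated samples

Xc = X - X.mean(axis=0)
Sigma = (Xc.T @ Xc) / p                  # covariance matrix with 1/p scaling

eigvals, eigvecs = np.linalg.eigh(Sigma) # eigh: ascending eigenvalues, unit eigenvectors
a1 = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue

proj = Xc @ a1                           # y_i1 = a_1 (x_i - x_bar) for every sample
print(np.var(proj), eigvals[-1])         # the two values agree: var(y_1) = lambda_max
```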

Derivation of the second principal component

Now let's find the second principal component, i.e. solve for \(a_{2}\). Besides the normalization constraint, \(a_{2}\) is also required to be orthogonal to \(a_{1}\); you can think of this as the two directions being perpendicular.

So its complete optimization problem is:
\[ \text{maximize: } a_{2}\Sigma a_{2}^{T}\\ \text{subject to: } a_{2}a_{2}^{T}=\left | a_{2} \right |^{2}=1\\ a_{2}a_{1}^{T}=a_{1}a_{2}^{T}=0\ \ (a_{1}\text{ and }a_{2}\text{ orthogonal}) \]
Construct the Lagrangian:
\[ F(a_{2})=a_{2}\Sigma a_{2}^{T}-\lambda (a_{2}a_{2}^{T}-1)-\beta a_{2}a_{1}^{T} \]
where \(\frac{\mathrm{d} (a_{2}a_{1}^{T})}{\mathrm{d} a_{2}}=a_{1}\).

Differentiating the Lagrangian with respect to \(a_{2}\) gives:
\[ \frac{\mathrm{d} F}{\mathrm{d} a_{2}}=2(\Sigma a_{2}^{T})^{T}-2\lambda a_{2}-\beta a_{1} \]
Set it equal to zero:
\[ 2(\Sigma a_{2}^{T})^{T}-2\lambda a_{2}-\beta a_{1}=2a_{2}\Sigma ^{T}-2\lambda a_{2}-\beta a_{1}=0 \]
We now need to show that \(\beta=0\).

Here we use the fact shown earlier that \(\Sigma=\Sigma^{T}\); substituting it into the second equality above gives:
\[ 2a_{2}\Sigma -2\lambda a_{2}-\beta a_{1}=0 \]
Look at the structure of this expression: every term in it is a \(1\times n\) row vector.

Now right-multiply both sides by \(a_{1}^{T}\):
\[ (2a_{2}\Sigma -2\lambda a_{2}-\beta a_{1})a_{1}^{T}=0 \]
(here the 0 is a scalar!)

Simplify, using \(\Sigma a_{1}^{T}=\lambda_{1}a_{1}^{T}\), \(a_{2}a_{1}^{T}=0\) and \(a_{1}a_{1}^{T}=1\):
\[ (2a_{2}\Sigma-2\lambda a_{2}-\beta a_{1})a_{1}^{T}=0\\ \Rightarrow 2a_{2}\Sigma a_{1}^{T}-2\lambda a_{2}a_{1}^{T}-\beta a_{1}a_{1}^{T}=0\\ \Rightarrow 2a_{2}\lambda_{1}a_{1}^{T}-\beta =0\\ \Rightarrow \beta=2a_{2}\lambda_{1}a_{1}^{T}=0 \]
Then:
\[ \frac{\mathrm{d} F}{\mathrm{d} a_{2}}=2(\Sigma a_{2}^{T})^{T}-2\lambda a_{2}=0\\ \Rightarrow (\Sigma a_{2}^{T})^{T}-\lambda a_{2}=0\\ \Rightarrow (\Sigma a_{2}^{T})^{T}=\lambda a_{2}\\ \Rightarrow \Sigma a_{2}^{T}=\lambda a_{2}^{T} \]
Again, by the definition of eigenvalues and eigenvectors, \(a_{2}^{T}\) is an eigenvector of the covariance matrix and \(\lambda\) is its corresponding eigenvalue.

Our goal is to maximize \(a_{2}\Sigma a_{2}^{T}\), and the same transformation applies:
\[ a_{2}\Sigma a_{2}^{T}=a_{2}(\Sigma a_{2}^{T})=a_{2}\lambda a_{2}^{T}=\lambda a_{2} a_{2}^{T}=\lambda \]
In other words, we want to take the largest possible \(\lambda\). But the largest eigenvalue of the covariance matrix \(\Sigma\) was already taken by the first principal component, so here we can only take the second largest eigenvalue, and the projection direction obtained is the eigenvector corresponding to that second largest eigenvalue.
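Numerically, the second principal component is simply the eigenvector of the second largest eigenvalue: it is orthogonal to \(a_{1}\), and the variance of its projection equals \(\lambda_{2}\). Here is a continuation of the earlier sketch, again with synthetic data and 1/p scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 1000, 4
X = rng.normal(size=(p, n)) @ rng.normal(size=(n, n))

Xc = X - X.mean(axis=0)
Sigma = (Xc.T @ Xc) / p

eigvals, eigvecs = np.linalg.eigh(Sigma)         # ascending order
a1 = eigvecs[:, -1]                              # largest eigenvalue's direction
a2 = eigvecs[:, -2]                              # second largest eigenvalue's direction

print(np.isclose(a1 @ a2, 0.0))                  # True: a_2 is orthogonal to a_1
print(np.isclose(np.var(Xc @ a2), eigvals[-2]))  # True: projected variance = lambda_2
```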

Generalization by induction

Through the derivations above we can find the first and second principal components. How many principal components to compute depends on your needs. Each further one is found in much the same way; the only difference is that the optimization problem gains additional constraints: the newly sought \(a_{i}\) must be orthogonal to all previously found directions \((a_{i-1},a_{i-2},....)\). This proves, in theory, how dimensionality reduction is carried out via the covariance matrix.

The algorithm flow of practical principal component analysis (the PCA dimensionality reduction algorithm)

Step 1: compute the covariance matrix:
\[ \Sigma =\sum _{i=1}^{p}(X_{i}-\overline{X})(X_{i}-\overline{X})^{T} \]
Step 2: compute the eigenvalues of the covariance matrix above and sort them in descending order: \((\lambda _{1},\lambda _{2},\lambda _{3},...)\)

The corresponding eigenvectors, in the same order, are \((a_{1}^{T},a_{2}^{T},a_{3}^{T},....)\).

Step 3: normalize every \(a_{i}\) so that \(a_{i}a_{i}^{T}=1\).

Step 4: take the eigenvectors corresponding to the top m (m < n) eigenvalues to form the dimensionality reduction matrix \(A\):
\[ A=\begin{bmatrix} a_{1}\\ a_{2}\\ ...\\ a_{m} \end{bmatrix}_{m\times n} \]
Step 5: reduce the dimension:
\[ Y_{m\times 1}= \begin{bmatrix} a_{1}\\ a_{2}\\ ...\\ a_{m} \end{bmatrix}_{m\times n} \times (X_{n\times 1}-\overline{X_{n\times 1}}) = \begin{bmatrix} a_{1}(X_{n\times 1}-\overline{X_{n\times 1}})\\ a_{2}(X_{n\times 1}-\overline{X_{n\times 1}})\\ ...\\ a_{m}(X_{n\times 1}-\overline{X_{n\times 1}}) \end{bmatrix} \]
This reduces the dimension from \(n\times 1\) down to \(m\times 1\).
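To wrap up, here is a compact sketch of the five steps as a single NumPy function (my own implementation of the procedure described above, with samples stored as rows). Up to sign flips of individual components, its output should agree with a standard library implementation such as sklearn.decomposition.PCA.

```python
import numpy as np

def pca_reduce(X, m):
    """Reduce the rows of X (shape p x n) from n to m dimensions with PCA.

    Returns (Y, A, x_bar) such that each row of Y equals A @ (x - x_bar).
    """
    p, n = X.shape
    x_bar = X.mean(axis=0)
    Xc = X - x_bar

    # Step 1: covariance matrix Sigma = sum_i (x_i - x_bar)(x_i - x_bar)^T
    Sigma = Xc.T @ Xc

    # Step 2: eigenvalues / eigenvectors, sorted from largest to smallest
    eigvals, eigvecs = np.linalg.eigh(Sigma)     # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    eigvecs = eigvecs[:, order]

    # Step 3: eigh already returns unit-norm eigenvectors, i.e. a_i a_i^T = 1

    # Step 4: stack the top-m eigenvectors as rows to form A (m x n)
    A = eigvecs[:, :m].T

    # Step 5: project the centered data; row i is A (x_i - x_bar)
    Y = Xc @ A.T
    return Y, A, x_bar

# Toy usage with random data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Y, A, x_bar = pca_reduce(X, m=2)
print(Y.shape)          # (200, 2): each sample reduced from 5 to 2 dimensions
```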

Source: www.cnblogs.com/LUOyaXIONG/p/11356201.html