Introduction to Machine Learning 05: Principal Component Analysis (PCA)

1 Introduction

1.1 Characteristics
  • An unsupervised machine learning algorithm
  • Mainly used for dimensionality reduction; the reduced representation can expose features that are easier for humans to interpret
  • Other applications: visualization; noise reduction
1.2 Mathematical Idea


  1. Find the axis along which the spread between the samples is largest
  2. Define the spread between samples using the variance: $Var(x) = \frac{1}{m}\sum_{i=1}^{m}(x_i - \bar{x})^2$ (a quick NumPy check of this formula follows this list)
  3. Find an axis such that, after all points in the sample space are projected onto it, the variance is maximized.
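
As a quick check of the variance formula above, here is a minimal NumPy sketch (the array `x` is made-up sample data, not part of the original text):

import numpy as np

x = np.array([1., 2., 3., 4., 5.])                # made-up 1-D samples
m = len(x)
var_manual = np.sum((x - np.mean(x)) ** 2) / m    # 1/m * sum((x_i - x_bar)^2)
print(var_manual, np.var(x))                      # both print 2.0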
1.3 Steps
  1. Demean all samples, i.e. shift them so that their mean becomes 0; then $Var(x) = \frac{1}{m}\sum_{i=1}^{m}x_i^2$ with $\bar{x} = 0$ (a minimal demean sketch follows this list)
  2. Find the direction of an axis $w = (w_1, w_2)$ such that, after all samples are projected onto $w$, $Var(x_{pro}) = \frac{1}{m}\sum_{i=1}^{m}\left(x_{pro}^{(i)} - \bar{x}_{pro}\right)^2$ is maximized.
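
A minimal demean sketch for step 1, assuming the data sits in a NumPy array `xx` of shape (m, n) (the random data here is only a placeholder):

import numpy as np

xx = np.random.uniform(0., 100., size=(100, 2))   # placeholder samples, m=100, n=2
xx_demean = xx - np.mean(xx, axis=0)              # subtract the mean of every feature
print(np.mean(xx_demean, axis=0))                 # every feature mean is now ~0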
1.4 Derivation

1) $Var(x_{pro}) = \frac{1}{m}\sum_{i=1}^{m}\left(x_{pro}^{(i)} - \bar{x}_{pro}\right)^2 = \frac{1}{m}\sum_{i=1}^{m}\left\|x_{pro}^{(i)} - \bar{x}_{pro}\right\|^2 = \frac{1}{m}\sum_{i=1}^{m}\left\|x_{pro}^{(i)}\right\|^2$

2) Computing $x_{pro}^{(i)}$

Let $x^{(i)} = \left(x_1^{(i)}, x_2^{(i)}\right)$ denote a sample and $x_{pro}^{(i)} = \left(x_{pro1}^{(i)}, x_{pro2}^{(i)}\right)$ its projection onto the axis $w$.

$x^{(i)} \cdot w = \left\|x^{(i)}\right\|\,\|w\|\cos\theta$

Since $\|w\| = 1$, we have $x^{(i)} \cdot w = \left\|x^{(i)}\right\|\cos\theta = \left\|x_{pro}^{(i)}\right\|$

$x_{pro}^{(i)} = x^{(i)} \cdot w$, so $Var(x_{pro}) = \frac{1}{m}\sum_{i=1}^{m}\left\|x_{pro}^{(i)}\right\|^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)} \cdot w\right)^2$
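
A small sketch of this projection step, assuming `w` has already been normalized to unit length (the numbers are made up for illustration):

import numpy as np

x_i = np.array([3., 4.])                  # one sample x^(i)
w = np.array([1., 1.])
w = w / np.linalg.norm(w)                 # make ||w|| = 1
x_pro_len = x_i.dot(w)                    # x^(i) . w = ||x_pro^(i)||
print(x_pro_len)                          # ~4.95, the length of the projection onto w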

3) Goal: find $w$ that maximizes $Var(x_{pro}) = \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)} \cdot w\right)^2$

2 Computing the Gradient for Gradient Ascent

$Var(x_{pro}) = \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)} \cdot w\right)^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_1^{(i)}w_1 + x_2^{(i)}w_2 + \cdots + x_n^{(i)}w_n\right)^2 = \frac{1}{m}\sum_{i=1}^{m}\left(\sum_{j=1}^{n}x_j^{(i)}w_j\right)^2$

Let $f(w) = \frac{1}{m}\sum_{i=1}^{m}\left(x_1^{(i)}w_1 + x_2^{(i)}w_2 + \cdots + x_n^{(i)}w_n\right)^2$. Then

$$\nabla f = \begin{pmatrix} \partial f/\partial w_1 \\ \partial f/\partial w_2 \\ \vdots \\ \partial f/\partial w_n \end{pmatrix} = \frac{2}{m}\begin{pmatrix} \sum_{i=1}^{m}\left(x_1^{(i)}w_1 + x_2^{(i)}w_2 + \cdots + x_n^{(i)}w_n\right)x_1^{(i)} \\ \sum_{i=1}^{m}\left(x_1^{(i)}w_1 + x_2^{(i)}w_2 + \cdots + x_n^{(i)}w_n\right)x_2^{(i)} \\ \vdots \\ \sum_{i=1}^{m}\left(x_1^{(i)}w_1 + x_2^{(i)}w_2 + \cdots + x_n^{(i)}w_n\right)x_n^{(i)} \end{pmatrix} = \frac{2}{m}\begin{pmatrix} \sum_{i=1}^{m}\left(x^{(i)} \cdot w\right)x_1^{(i)} \\ \sum_{i=1}^{m}\left(x^{(i)} \cdot w\right)x_2^{(i)} \\ \vdots \\ \sum_{i=1}^{m}\left(x^{(i)} \cdot w\right)x_n^{(i)} \end{pmatrix}$$

$$= \frac{2}{m}\left(\left(x^{(1)} \cdot w,\; x^{(2)} \cdot w,\; \ldots,\; x^{(m)} \cdot w\right)\begin{pmatrix} x_1^{(1)} & x_2^{(1)} & \cdots & x_n^{(1)} \\ x_1^{(2)} & x_2^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & & \vdots \\ x_1^{(m)} & x_2^{(m)} & \cdots & x_n^{(m)} \end{pmatrix}\right)^T = \frac{2}{m}\left((Xw)^T X\right)^T = \frac{2}{m}X^T(Xw)$$

where $X$ is the $m \times n$ matrix whose $i$-th row is the sample $x^{(i)}$.
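
The last expression is exactly what the `df` function in the PCA class below computes. As a sanity check, the analytic gradient can be compared against a numerical one; this is only an illustrative sketch, and the helper `df_numerical` is not part of the original code:

import numpy as np

def f(w, X):
    return np.sum(X.dot(w) ** 2) / len(X)

def df(w, X):
    # vectorized gradient: 2/m * X^T (X w)
    return X.T.dot(X.dot(w)) * 2. / len(X)

def df_numerical(w, X, eps=1e-6):
    # approximate each partial derivative with a central difference
    grad = np.empty(len(w))
    for j in range(len(w)):
        w1, w2 = w.copy(), w.copy()
        w1[j] += eps
        w2[j] -= eps
        grad[j] = (f(w1, X) - f(w2, X)) / (2 * eps)
    return grad

X = np.random.random((100, 2))
w = np.random.random(2)
print(np.allclose(df(w, X), df_numerical(w, X)))   # True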

3 Computing the First n Principal Components

1) After the first principal component has been found, how do we obtain the next one?

→ Transform the data: remove from each sample its component along the first principal component, and then find the first principal component of the new data.


$x^{(i)} \cdot w = \left\|x_{pro}^{(i)}\right\|$, so $x_{pro}^{(i)} = \left\|x_{pro}^{(i)}\right\| w = \left(x^{(i)} \cdot w\right) w$, and the new sample is $x'^{(i)} = x^{(i)} - x_{pro}^{(i)}$
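
In NumPy this removal step can be written as below, assuming `xx` is the demeaned data matrix and `w` the first (unit-length) principal component; the same line appears inside the fit method later:

import numpy as np

xx = np.random.random((100, 2))
xx = xx - np.mean(xx, axis=0)                  # placeholder demeaned data
w = np.array([1., 2.])
w = w / np.linalg.norm(w)                      # a unit vector standing in for the first component

# xx.dot(w) gives each sample's projection length onto w; multiplying by w
# (broadcast over the rows) gives the projection vectors x_pro^(i)
xx_rest = xx - xx.dot(w).reshape(-1, 1) * w
print(np.allclose(xx_rest.dot(w), 0))          # the remaining data is orthogonal to w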

4 Dimensionality Reduction

Dimensionality reduction shortens the time needed to process the data, but some accuracy is lost.

1) Mapping high-dimensional data to a lower dimension

$$X = \begin{pmatrix} x_1^{(1)} & x_2^{(1)} & \cdots & x_n^{(1)} \\ x_1^{(2)} & x_2^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & & \vdots \\ x_1^{(m)} & x_2^{(m)} & \cdots & x_n^{(m)} \end{pmatrix}$$

The first $k$ principal components:

$$W_k = \begin{pmatrix} w_1^{(1)} & w_2^{(1)} & \cdots & w_n^{(1)} \\ w_1^{(2)} & w_2^{(2)} & \cdots & w_n^{(2)} \\ \vdots & \vdots & & \vdots \\ w_1^{(k)} & w_2^{(k)} & \cdots & w_n^{(k)} \end{pmatrix}$$

Mapping from $n$ dimensions down to $k$ dimensions:

$$\underset{m\times n}{X}\;\underset{n\times k}{W_k^T} = \underset{m\times k}{X_k} = \begin{pmatrix} x_1^{(1)} & x_2^{(1)} & \cdots & x_k^{(1)} \\ x_1^{(2)} & x_2^{(2)} & \cdots & x_k^{(2)} \\ \vdots & \vdots & & \vdots \\ x_1^{(m)} & x_2^{(m)} & \cdots & x_k^{(m)} \end{pmatrix}$$

Reconstructing from $k$ dimensions back to $n$ dimensions (information is lost in the reconstruction):

$$\underset{m\times k}{X_k}\;\underset{k\times n}{W_k} = \underset{m\times n}{X_m}$$
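
These two matrix products correspond directly to transform and inverse_transform in the PCA class below. A minimal sketch with random placeholder matrices (W_k here is not a real set of principal components, just a stand-in with the right shape):

import numpy as np

m, n, k = 100, 5, 2
X = np.random.random((m, n))          # placeholder data matrix, m x n
W_k = np.random.random((k, n))        # placeholder component matrix, k x n

X_k = X.dot(W_k.T)                    # reduce:  (m, n) x (n, k) -> (m, k)
X_restored = X_k.dot(W_k)             # restore: (m, k) x (k, n) -> (m, n)
print(X_k.shape, X_restored.shape)    # (100, 2) (100, 5)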

5 Code

1) PCA.py

import numpy as np
import matplotlib.pyplot as plt


class PCA:

    def __init__(self, n_components):
        """ 初始化PCA"""
        assert n_components >= 1, "n_components must be valid."
        self.n_components = n_components
        # matrix of the computed principal components
        self.components_ = None

    def fit(self, xx, eta=0.01, n_iters=1e4):
        """ 获得数据集xx的前n个主成分"""
        assert self.n_components <= xx.shape[1], "n_components must be greater than the feature num of xx."

        def demean(xx):
            return xx - np.mean(xx, axis=0)

        def f(w, xx):
            """ 函数f """
            return np.sum((xx.dot(w) ** 2)) / len(xx)

        def df(w, xx):
            """ 函数f的梯度 """
            return xx.T.dot(xx.dot(w)) * 2. / len(xx)

        def direction(w):
            """ 求w的单位向量"""
            return w / np.linalg.norm(w)

        def first_component(xx, initial_w, eta=0.01, n_iters=1e4, epsilon=1e-8):
            w = direction(initial_w)
            cur_iter = 0

            while cur_iter < n_iters:
                gradient = df(w, xx)
                last_w = w
                w = w + eta * gradient
                w = direction(w)  # Note 1: renormalize w to a unit vector at every step
                if (abs(f(w, xx) - f(last_w, xx)) < epsilon):
                    break
                cur_iter += 1
            return w

        xx_pca = demean(xx)
        self.components_ = np.empty(shape=(self.n_components, xx.shape[1]))
        for i in range(self.n_components):
            # Note 2: the initial vector must not be the zero vector
            initial_w = np.random.random(xx_pca.shape[1])
            w = first_component(xx_pca, initial_w, eta, n_iters)
            self.components_[i, :] = w

            # plot the demeaned data and the direction just found (meaningful for 2-D data)
            plt.scatter(xx_pca[:, 0], xx_pca[:, 1])
            plt.plot([0, w[0] * 30], [0, w[1] * 30], color='r')
            plt.show()

            # remove the component along the principal component just found
            xx_pca = xx_pca - xx_pca.dot(w).reshape(-1, 1) * w

        # Note 3: do not standardize the data with StandardScaler; rescaling each feature would change the variance structure PCA relies on
        return self

    def transform(self, xx):
        """ 将给定的xx,映射到各个主成分分量中,降维 """
        assert xx.shape[1] == self.components_.shape[1]

        return xx.dot(self.components_.T)

    def inverse_transform(self, xx):
        """ 将给定的xx,反向映射回原来的特征空间"""
        assert xx.shape[1] == self.components_.shape[0]

        return xx.dot(self.components_)

    def __repr__(self):
        return "PCA(n_components=%d)" % self.n_components

2) Test code: pca_test.py

import numpy as np
import matplotlib.pyplot as plt
from PCA_pro.PCA import PCA

# 1 Prepare the data
xx = np.empty((100, 2))
xx[:, 0] = np.random.uniform(0., 100., size=100)
xx[:, 1] = 0.75 * xx[:, 0] + 3. + np.random.normal(0, 10., size=100)

plt.scatter(xx[:, 0], xx[:, 1])
plt.show()

# 2 Compute the principal components
mpca2 = PCA(2)
mpca2.fit(xx)
print(mpca2.components_)

# 3 Dimensionality reduction
xx_reduction = mpca2.transform(xx)
print("xx_reduction.shape:", xx_reduction.shape)

# Reconstruct from the reduced data (information has been lost)
xx_restore = mpca2.inverse_transform(xx_reduction)
print("xx_restore.shape", xx_restore.shape)

plt.scatter(xx[:, 0], xx[:, 1], color='g')
plt.scatter(xx_restore[:, 0], xx_restore[:, 1], color='r', alpha=0.5)
plt.show()

3) Output:
(the plots produced by the plt.show() calls above are omitted)

[[ 0.76608005  0.64274518]
 [-0.64274259  0.76608221]]
xx_reduction.shape: (100, 2)
xx_restore.shape (100, 2)
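
The two rows of components_ should be (nearly) orthogonal unit vectors. A quick check on the values printed above (the exact numbers will vary from run to run):

import numpy as np

w1 = np.array([0.76608005, 0.64274518])
w2 = np.array([-0.64274259, 0.76608221])
print(w1.dot(w2))                                # close to 0: the two components are orthogonal
print(np.linalg.norm(w1), np.linalg.norm(w2))    # both close to 1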


Reposted from blog.csdn.net/lihaogn/article/details/81697754