Andrew Ng's "Machine Learning" - PCA Dimensionality Reduction

The data set and source files can be obtained in the Github project
link: https://github.com/Raymond-Yang-2001/AndrewNg-Machine-Learing-Homework

1. Principal component analysis

1.1 Motivation for data dimensionality reduction

In some cases, the data used for machine learning can be very large, and data compression becomes necessary. Suppose a data set includes two feature dimensions: the radius of a circle and its area. Simple mathematics tells us these two features are related to each other; from the perspective of describing a circle, there is no difference between radius and area, so in machine learning it is entirely possible to learn from only one of them.

This is data dimensionality reduction: "redundant" feature dimensions in the data are compressed to obtain a lower-dimensional representation - based, of course, on minimizing the loss of information in the data.

1.2 Analysis of PCA dimensionality reduction target problem

Suppose we have the following data:
(Figure: scatter plot of the two-dimensional example data)
If we want to compress the data from two dimensions to one, the horizontal direction is clearly the better dimension to keep. Mathematically, this is because the variance in the horizontal direction is greater than in the vertical direction: the horizontal feature spans a wider range of values, so the samples are more distinguishable along it.
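For intuition, here is a minimal numpy sketch (with made-up two-dimensional data, not the course data set) comparing the variance along each axis:

import numpy as np

# Hypothetical 2-D data: wide spread horizontally, narrow spread vertically
rng = np.random.default_rng(0)
x = np.column_stack([rng.normal(0, 3.0, size=100),    # horizontal feature
                     rng.normal(0, 0.5, size=100)])   # vertical feature
print(x.var(axis=0))  # the horizontal variance is far larger, so that direction is kept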

Therefore, the optimization objectives of PCA can be summarized as the following two:

  1. After dimensionality reduction, the variance within each retained dimension is as large as possible
  2. The correlation between different dimensions is 0

If features of different dimensions are correlated, there is some internal relationship between them, and one feature can be inferred from the other. Obviously, keeping both does not serve the goal of dimensionality reduction.

2. PCA mathematical principle analysis

In order to simultaneously represent the variance within each dimension and the correlation between different dimensions, we introduce the covariance matrix. Let the data dimension be $D$; then the covariance matrix is:
$$C=\left[\begin{matrix} \mathrm{cov}(x_{1},x_{1}) & \mathrm{cov}(x_{1},x_{2}) & \cdots & \mathrm{cov}(x_{1},x_{D}) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{cov}(x_{D},x_{1}) & \mathrm{cov}(x_{D},x_{2}) & \cdots & \mathrm{cov}(x_{D},x_{D}) \end{matrix}\right]_{D\times D}$$
The diagonal elements are the variances of the individual dimensions, and the off-diagonal elements (the covariances between different dimensions) indicate the correlation between dimensions. The ideal matrix after optimization is:
$$C^{\prime}=\left[\begin{matrix} \delta_{11} & 0 & \cdots & 0 \\ 0 & \delta_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \delta_{RR} \end{matrix}\right]_{R\times R}$$
Here, $X$ is the original data matrix with shape $(N, D)$, and $P$ is the transformation matrix used for dimensionality reduction.

Consider the covariance matrix of the transformed data $XP$:
$$C^{\prime}=\frac{1}{N}(XP)^{\top}(XP)=\frac{1}{N}P^{\top}X^{\top}XP=P^{\top}\left(\frac{1}{N}X^{\top}X\right)P=P^{\top}CP$$
To reduce the correlation between different dimensions as much as possible, we let $P^{\top}P=I$, i.e. the columns of $P$ are orthogonal.

The optimization problem is as follows:
$$\left\{\begin{aligned} &\max\ \mathrm{tr}(P^{\top}CP) \\ &P^{\top}P=I \end{aligned}\right.$$

Using a Lagrange multiplier, we obtain the unconstrained extremum problem:
$$f(P)=\mathrm{tr}(P^{\top}CP)+\lambda(I-P^{\top}P)$$
Taking the derivative of $f(P)$ with respect to $P$:
$$\frac{\partial\,\mathrm{tr}(P^{\top}CP)}{\partial P}=\frac{\partial\,\mathrm{tr}(PP^{\top}C)}{\partial P}=(P^{\top}C)^{\top}=C^{\top}P$$
$$\frac{\partial\,\lambda(I-P^{\top}P)}{\partial P}=-\lambda P$$
These steps use standard identities for trace and matrix derivatives; if they are unfamiliar, you can look up the relevant linear algebra on your own.
$$\frac{\partial f(P)}{\partial P}=C^{\top}P-\lambda P$$
Since $C$ is a symmetric matrix, $C^{\top}=C$, so:
$$\frac{\partial f(P)}{\partial P}=CP-\lambda P$$
$$\frac{\partial f(P)}{\partial P}=0 \;\Rightarrow\; CP=\lambda P$$
It can be seen here that the columns of $P$ are eigenvectors of $C$. In fact, eigenvectors of a symmetric matrix corresponding to unequal eigenvalues are orthogonal.

In addition, there are:
$$P^{\top}CP=P^{\top}\lambda P$$
Since the eigenvalue matrix $\lambda$ is diagonal (all off-diagonal entries are 0), we get $C^{\prime}=P^{\top}CP=P^{\top}\lambda P=\lambda P^{\top}P=\lambda$.

It can be seen that $\lambda$ actually holds the variance of each dimension after the transformation, that is, eigenvalue = variance. Choosing to retain the dimensions with large variance therefore means retaining the dimensions with the larger eigenvalues.
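As a quick numerical check of this conclusion, the following sketch (arbitrary random data, not the course data set) verifies that the variance of each projected dimension equals the corresponding eigenvalue:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 3))
x = x - x.mean(axis=0)                    # center the data
c = x.T @ x / x.shape[0]                  # covariance matrix C (divide by N)
eigvals, eigvecs = np.linalg.eigh(c)      # eigendecomposition of the symmetric matrix C
projected = x @ eigvecs                   # project the data onto the eigenvectors
print(np.allclose(projected.var(axis=0), eigvals))   # True: variance of each new dimension == eigenvalue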

2.1 Some thoughts on finding the covariance matrix

We mentioned earlier that, for the data matrix $X$ of shape $(n\_samples=N,\ n\_features=D)$, the covariance matrix is computed as:
$$C=\frac{1}{N}X^{\top}X$$
Next, we derive why the covariance can be calculated in this way.

From probability theory, the covariance of two random variables $X$ and $Y$ is defined as:
$$\mathrm{cov}(X,Y)=\mathbb{E}[(X-\mu_{X})(Y-\mu_{Y})]$$
where $\mu_{X}$ and $\mu_{Y}$ are the means of the two random variables.

Before performing PCA, we must first center the data so that its mean is 0. The usual method is to subtract the mean, i.e. $X_{new}=X-\mu_{X}$, so the covariance simplifies to:
$$\mathrm{cov}(X,Y)=\mathbb{E}(XY)$$
Here, $X$ and $Y$ are each one dimension (column) of the data. Converting to matrix operations, we get:
$$Cov=\frac{1}{N}X^{\top}X$$

Of course, if we want an unbiased estimate from the sample, the expectation is computed with a denominator of $N-1$, i.e. $\mathbb{E}(X)=\frac{1}{N-1}\sum_{i=1}^{N}X_{i}$ (review the relevant probability theory if this is unfamiliar), so the covariance matrix becomes:
$$Cov=\frac{1}{N-1}X^{\top}X$$
This is how it is calculated in the numpy library.
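As a sanity check, the manual formula can be compared with numpy's built-in np.cov (a small sketch; np.cov treats columns as variables when rowvar=False and divides by N-1 by default):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 4))
x_centered = x - x.mean(axis=0)
cov_manual = x_centered.T @ x_centered / (x.shape[0] - 1)   # unbiased estimate, divide by N-1
cov_numpy = np.cov(x, rowvar=False)                         # numpy also divides by N-1 by default
print(np.allclose(cov_manual, cov_numpy))                   # True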

2.2 PCA implementation method

As can be seen from the above, the steps to implement PCA are as follows:

  1. Compute the covariance matrix $C$ of the input data
  2. Compute the eigenvalues and eigenvectors of $C$, sorted in descending order of eigenvalue
  3. To keep $k$ dimensions, select the first $k$ eigenvectors to form the matrix $P$
  4. The new data matrix is $X^{\prime}=XP$

3. Python implementation

import numpy as np


class PCA:
    """
    ----------------------------------------------------------------------------
    Attributes:
    components_: ndarray with shape of (n_features, n_components)
        Principal axes in feature space, representing the directions of maximum
        variance in the data.

    explained_variance_ : ndarray of shape (n_components,)
        The amount of variance explained by each of the selected components.

    explained_variance_ratio_ : ndarray of shape (n_components,)
        Percentage of variance explained by each of the selected components.
    """
    def __init__(self, n_components):
        self.n_components = n_components
        self.explained_variance_ = None
        self.explained_variance_ratio_ = None
        self.components_ = None

        self.__mean = None

    def fit(self, x):
        """
        :param x: (n,d)
        :return:
        """
        self.__mean = x.mean(axis=0)
        x_norm = x - self.__mean
        x_cov = (x_norm.T @ x_norm) / (x.shape[0] - 1)
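        # The covariance matrix is symmetric positive semi-definite, so its SVD
        # coincides with its eigendecomposition: the columns of `vectors` are the
        # eigenvectors and `variance` holds the eigenvalues in descending order.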
        vectors, variance, _ = np.linalg.svd(x_cov)
        # (n_feature, n_components)
        self.components_ = vectors[:, :self.n_components]
        if len(self.components_.shape) == 1:
            self.components_ = np.expand_dims(vectors[:, :self.n_components], axis=1)
        self.explained_variance_ = variance[:self.n_components]
        self.explained_variance_ratio_ = self.explained_variance_ / variance.sum()

    def transform(self, x):
        """
        :param x: (n, n_feature)
        :return:
        """
        if self.__mean is not None:
            x = x - self.__mean
        x_transformed = x @ self.components_
        return x_transformed

Data set visualization:
(Figure: scatter plot of the original two-dimensional data set)
Perform PCA dimensionality reduction and visualize the data:

from PCA import PCA
m_pca = PCA(1)
m_pca.fit(x)
x_reduced = m_pca.transform(x)
print("The principal axes of components is: {}\n"
      "The variance of each components is: {}\n"
      "The variance ratio of selected components is: {}"
      .format(m_pca.components_.tolist(), m_pca.explained_variance_.tolist(), m_pca.explained_variance_ratio_.tolist()))
The principal axes of components is: [[-0.7690815341368202], [-0.6391506816469459]]
The variance of each components is: [2.1098781795840327]
The variance ratio of selected components is: [0.8706238489732337]

(Figure: visualization of the data after PCA dimensionality reduction)

3.1 Compress face data

Display face data

from scipy.io import loadmat

faces = loadmat('data/ex7faces.mat')
face_x = faces['X']
face_x.shape
(5000, 1024)

You can see there are 5000 face samples, each with 1024 dimensions.
(Figure: a sample of faces from the data set)
Use PCA to reduce dimensionality to 100 dimensions

from PCA import PCA
face_pca = PCA(n_components=100)
face_pca.fit(face_x)
face_r = face_pca.transform(face_x)

(Figure: faces reconstructed after reducing to 100 dimensions)
It can be seen that most of the facial features are retained.
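The reconstruction above can be reproduced by projecting the reduced data back into the original pixel space. A minimal sketch, assuming the PCA class defined earlier (the recovery step itself is not part of that class, and the mean is recomputed from the data because the class stores it privately):

# Map the 100-dimensional codes back to 1024 pixels and add the mean back
face_recovered = face_r @ face_pca.components_.T + face_x.mean(axis=0)
print(face_recovered.shape)   # (5000, 1024)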


Original article: blog.csdn.net/d33332/article/details/128577363