7.5 Applications to Image Processing and Statistics (PCA)

These are reading notes for *Linear Algebra and Its Applications*.

The main goal of this section is to explain a technique, called principal component analysis (PCA), used to analyze multivariate data.

Principal component analysis can be applied to any data that consist of lists of measurements made on a collection of objects or individuals. For instance, consider a chemical process that produces a plastic material. To monitor the process, 300 samples are taken of the material produced, and each sample is subjected to a battery of eight tests, such as melting point, density, ductility, tensile strength, and so on. The laboratory report for each sample is a vector in $\R^8$, and the set of such vectors forms an $8\times 300$ matrix, called the matrix of observations.

Loosely speaking, we can say that the process control data are eight-dimensional. The next two examples describe data that can be visualized graphically.

EXAMPLE 1
An example of two-dimensional data is given by a set of weights and heights of $N$ college students. Let $\boldsymbol X_j$ denote the observation vector in $\R^2$ that lists the weight and height of the $j$th student. If $w$ denotes weight and $h$ denotes height, then the matrix of observations has the form

$$\begin{bmatrix} \boldsymbol X_1 & \boldsymbol X_2 & \cdots & \boldsymbol X_N \end{bmatrix} = \begin{bmatrix} w_1 & w_2 & \cdots & w_N \\ h_1 & h_2 & \cdots & h_N \end{bmatrix}$$
The set of observation vectors can be visualized as a two-dimensional scatter plot. See Figure 1.

[Figure 1: A scatter plot of the observation vectors $\boldsymbol X_1,\dots,\boldsymbol X_N$.]
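As a concrete illustration of Example 1, the sketch below builds a $2\times N$ matrix of observations from synthetic weight and height data and draws the scatter plot of Figure 1. This is only a sketch: it assumes NumPy and Matplotlib are available, and every number is invented for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
N = 50  # hypothetical number of students

# Hypothetical, roughly correlated weights (pounds) and heights (inches).
heights = rng.normal(68, 3, N)
weights = 5.0 * heights - 200 + rng.normal(0, 15, N)

# Matrix of observations: column j is the observation vector X_j = [w_j, h_j]^T.
X = np.vstack([weights, heights])   # shape (2, N)

# Two-dimensional scatter plot of the observation vectors, as in Figure 1.
plt.scatter(X[0], X[1])
plt.xlabel("weight (w)")
plt.ylabel("height (h)")
plt.show()
```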

EXAMPLE 2 Multichannel Image Processing
Sensors aboard Landsat satellites acquire seven simultaneous images of any region on earth to be studied. The sensors record energy from separate wavelength bands: three in the visible light spectrum and four in infrared and thermal bands. Each image is digitized and stored as a rectangular array of numbers, each number indicating the signal intensity at a corresponding small point (or pixel).

The seven Landsat images of one fixed region typically contain much redundant information, since some features will appear in several images. Yet other features, because of their color or temperature, may reflect light that is recorded by only one or two sensors. One goal of multichannel image processing is to view the data in a way that extracts information better than studying each image separately.

Principal component analysis is an effective way to suppress redundant information and provide in only one or two composite images most of the information from the initial data. Roughly speaking, the goal is to find a special linear combination of the images, that is, a list of weights that at each pixel combine all seven corresponding image values into one new value. The weights are chosen in a way that makes the range of light intensities (the scene variance) in the composite image, called the first principal component, greater than that in any of the original images.

Principal component analysis is illustrated in the photos below, taken over Railroad Valley, Nevada. Images from three Landsat spectral bands are shown in (a)–(c). The total information in the three bands is rearranged in the three principal component images in (d)–(f). The first component (d) displays (or "explains") 93.5% of the scene variance present in the initial data. In this way, the three-channel initial data have been reduced to one-channel data, with a loss in some sense of only 6.5% of the scene variance.

[Photographs of Railroad Valley, Nevada: the three Landsat spectral-band images (a)–(c) and the three principal component images (d)–(f).]

The first three photographs of Railroad Valley, Nevada, can be viewed as one image of the region, with three spectral components. Each photograph gives different information about the same physical region. To each pixel there corresponds an observation vector in $\R^3$ that lists the signal intensities for that pixel in the three spectral bands. Typically, the image is $2000\times 2000$ pixels, so there are 4 million pixels in the image. The data for the image form a matrix with 3 rows and 4 million columns. The data can be visualized as a cluster of 4 million points in $\R^3$, perhaps as in Figure 2.

[Figure 2: The observation vectors for the image viewed as a cluster of points in $\R^3$.]

Notice that the points lie almost entirely in a two-dimensional plane: the original three coordinates can be reduced to two, which is exactly the kind of dimension reduction we are after. How, then, do we describe the direction along which the data vary the most, and the directions along which they hardly vary at all (the latter being the directions that can safely be discarded)? The answer lies in the notion of variance introduced below.
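Before turning to the definitions, here is a minimal sketch of how such a matrix of observations is assembled from a multiband image: each pixel of a three-band image becomes one observation vector in $\R^3$, and stacking these vectors as columns gives a $3\times(\text{number of pixels})$ matrix. The array sizes below are illustrative stand-ins, not the actual Landsat data.

```python
import numpy as np

rng = np.random.default_rng(1)
rows, cols = 200, 200                 # a small stand-in for a 2000 x 2000 scene
bands = rng.random((3, rows, cols))   # three spectral bands, one 2-D array per band

# Column k of X is the observation vector in R^3 for pixel k:
# the three signal intensities recorded at that pixel.
X = bands.reshape(3, rows * cols)
print(X.shape)   # (3, 40000)
```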

Mean and Covariance

To prepare for principal component analysis, let $\begin{bmatrix} \boldsymbol X_1 & \cdots & \boldsymbol X_N \end{bmatrix}$ be a $p\times N$ matrix of observations. The sample mean, $\boldsymbol M$, of the observation vectors is given by

$$\boldsymbol M = \frac{1}{N}(\boldsymbol X_1 + \cdots + \boldsymbol X_N)$$
For $k = 1,\dots,N$, let

$$\hat{\boldsymbol X}_k = \boldsymbol X_k - \boldsymbol M$$
The columns of the $p\times N$ matrix

$$B = \begin{bmatrix} \hat{\boldsymbol X}_1 & \hat{\boldsymbol X}_2 & \cdots & \hat{\boldsymbol X}_N \end{bmatrix}$$
have a zero sample mean, and $B$ is said to be in mean-deviation form.

The (sample) covariance matrix is the $p\times p$ matrix $S$ defined by

$$S = \frac{1}{N-1}BB^T$$

Since any matrix of the form $BB^T$ is positive semidefinite, so is $S$. (See Section 7.2.)

For $j = 1,\dots,p$, the diagonal entry $s_{jj}$ in $S$ is called the variance of $x_j$. The total variance of the data is the sum of the variances on the diagonal of $S$. In general, the sum of the diagonal entries of a square matrix $S$ is called the trace of the matrix, written $\operatorname{tr}(S)$. Thus

$$\{\text{total variance}\} = \operatorname{tr}(S)$$
The entry $s_{ij}$ in $S$ for $i\neq j$ is called the covariance of $x_i$ and $x_j$. If the covariance between $x_i$ and $x_j$ is 0, statisticians say that $x_i$ and $x_j$ are uncorrelated. Analysis of the multivariate data in $\boldsymbol X_1,\dots,\boldsymbol X_N$ is greatly simplified when most or all of the variables $x_1,\dots,x_p$ are uncorrelated, that is, when the covariance matrix of $\boldsymbol X_1,\dots,\boldsymbol X_N$ is diagonal or nearly diagonal.
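The definitions above translate directly into a few lines of NumPy. This is only a sketch on a small, made-up data set (the weights and heights are invented), not data from the text:

```python
import numpy as np

# A made-up 2 x 5 matrix of observations (p = 2 variables, N = 5 observations).
X = np.array([[120., 125., 125., 135., 145.],   # weights
              [ 61.,  60.,  64.,  68.,  72.]])  # heights
p, N = X.shape

M = X.mean(axis=1, keepdims=True)   # sample mean M (a p x 1 column vector)
B = X - M                           # mean-deviation form: column k is X_k - M
S = (B @ B.T) / (N - 1)             # sample covariance matrix, p x p, symmetric PSD

print(M.ravel())                    # sample means of weight and height
print(S)                            # diagonal entries are the variances s_jj
print(np.trace(S))                  # total variance tr(S)
```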

Principal Component Analysis

For simplicity, assume that the matrix $\begin{bmatrix} \boldsymbol X_1 & \cdots & \boldsymbol X_N \end{bmatrix}$ is already in mean-deviation form. The goal of principal component analysis is to find an orthogonal $p\times p$ matrix $P = \begin{bmatrix} \boldsymbol u_1 & \cdots & \boldsymbol u_p \end{bmatrix}$ that determines a change of variable, $\boldsymbol X = P\boldsymbol Y$, or

$$\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{bmatrix} = \begin{bmatrix} \boldsymbol u_1 & \boldsymbol u_2 & \cdots & \boldsymbol u_p \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{bmatrix}$$

with the property that the new variables $y_1,\dots,y_p$ are uncorrelated and are arranged in order of decreasing variance.

The orthogonal change of variable $\boldsymbol X = P\boldsymbol Y$ means that each observation vector $\boldsymbol X_k$ receives a "new name," $\boldsymbol Y_k$, such that $\boldsymbol X_k = P\boldsymbol Y_k$. Notice that $\boldsymbol Y_k$ is the coordinate vector of $\boldsymbol X_k$ with respect to the columns of $P$, and $\boldsymbol Y_k = P^T\boldsymbol X_k$ for $k = 1,\dots,N$.

It is not difficult to verify that for any orthogonal $P$, the covariance matrix of $\boldsymbol Y_1,\dots,\boldsymbol Y_N$ is $P^TSP$ [Hint: $\boldsymbol Y_1,\dots,\boldsymbol Y_N$ are in mean-deviation form]. So the desired orthogonal matrix $P$ is one that makes $P^TSP$ diagonal. Let $D$ be a diagonal matrix with the eigenvalues $\lambda_1,\dots,\lambda_p$ of $S$ on the diagonal, arranged so that $\lambda_1\geq\lambda_2\geq\cdots\geq\lambda_p\geq 0$, and let $P$ be an orthogonal matrix whose columns are the corresponding unit eigenvectors $\boldsymbol u_1,\dots,\boldsymbol u_p$. Then $S = PDP^T$ and $P^TSP = D$.

The unit eigenvectors $\boldsymbol u_1,\dots,\boldsymbol u_p$ of the covariance matrix $S$ are called the principal components of the data (in the matrix of observations). The first principal component is the eigenvector corresponding to the largest eigenvalue of $S$, the second principal component is the eigenvector corresponding to the second largest eigenvalue, and so on.

The first principal component $\boldsymbol u_1$ determines the new variable $y_1$ in the following way. Let $c_1,\dots,c_p$ be the entries in $\boldsymbol u_1$. Since $\boldsymbol u_1^T$ is the first row of $P^T$, the equation $\boldsymbol Y = P^T\boldsymbol X$ shows that

$$y_1 = \boldsymbol u_1^T\boldsymbol X = c_1x_1 + c_2x_2 + \cdots + c_px_p$$

Thus $y_1$ is a linear combination of the original variables $x_1,\dots,x_p$, using the entries in the eigenvector $\boldsymbol u_1$ as weights. In a similar fashion, $\boldsymbol u_2$ determines the variable $y_2$, and so on.
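A minimal sketch of this procedure in NumPy (the helper name `principal_components` is my own, and the random data serve only as a sanity check): `numpy.linalg.eigh` returns the eigenvalues of the symmetric matrix $S$ in ascending order, so they are reversed to obtain $\lambda_1\geq\cdots\geq\lambda_p$ and the principal components as the columns of $P$.

```python
import numpy as np

def principal_components(X):
    """Principal component analysis of a p x N matrix of observations X."""
    B = X - X.mean(axis=1, keepdims=True)     # put the data in mean-deviation form
    S = (B @ B.T) / (X.shape[1] - 1)          # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)      # eigh: ascending eigenvalues of symmetric S
    order = np.argsort(eigvals)[::-1]         # reorder so lambda_1 >= ... >= lambda_p
    eigvals, P = eigvals[order], eigvecs[:, order]
    Y = P.T @ B                               # change of variable: rows are y_1, ..., y_p
    return eigvals, P, Y

# Sanity check on random data: the covariance matrix of the new variables
# is (numerically) the diagonal matrix D = diag(lambda_1, ..., lambda_p).
rng = np.random.default_rng(2)
X = rng.random((3, 1000))
lam, P, Y = principal_components(X)
D = (Y @ Y.T) / (Y.shape[1] - 1)
print(np.allclose(D, np.diag(lam)))   # True
print(lam)                            # variances of y_1, y_2, y_3 in decreasing order
```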

EXAMPLE 4
The initial data for the multispectral image of Railroad Valley (Example 2) consisted of 4 million vectors in $\R^3$. The associated covariance matrix is

[The $3\times 3$ covariance matrix $S$ of the three spectral bands appears here.]
Find the principal components of the data, and list the new variable determined by the first principal component.
SOLUTION
The eigenvalues of $S$ and the associated principal components (the unit eigenvectors) are

[The eigenvalues $\lambda_1 > \lambda_2 > \lambda_3$ of $S$ and the corresponding unit eigenvectors $\boldsymbol u_1, \boldsymbol u_2, \boldsymbol u_3$ are listed here.]
Using two decimal places for simplicity, the variable for the first principal component is

[$y_1$ is written as a linear combination of $x_1, x_2, x_3$, with the entries of $\boldsymbol u_1$, rounded to two decimal places, as the weights.]
This equation was used to create photograph (d) in Example 2. The variables $x_1, x_2$, and $x_3$ are the signal intensities in the three spectral bands. At each pixel in photograph (d), the gray scale value is computed from $y_1$, a weighted linear combination of $x_1, x_2$, and $x_3$. In this sense, photograph (d) "displays" the first principal component of the data.
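A sketch of how a photograph such as (d) is assembled from the band images: every pixel's gray value is the combination $y_1 = c_1x_1 + c_2x_2 + c_3x_3$, with the entries of $\boldsymbol u_1$ as weights. The arrays and weights below are placeholders, not the actual Landsat data or eigenvector from Example 4.

```python
import numpy as np

rng = np.random.default_rng(3)
band1, band2, band3 = rng.random((3, 200, 200))   # stand-ins for the three spectral bands

# Entries of the first principal component u_1 (placeholder values here;
# in Example 4 they come from the unit eigenvector of S for the largest eigenvalue).
c1, c2, c3 = np.ones(3) / np.sqrt(3)

# One composite gray value per pixel: the first-principal-component image.
pc1_image = c1 * band1 + c2 * band2 + c3 * band3
print(pc1_image.shape)   # (200, 200)
```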

In Example 4, the covariance matrix for the transformed data, using variables $y_1, y_2$, and $y_3$, is

[The diagonal matrix $D$ whose diagonal entries are the eigenvalues of $S$, in decreasing order.]
Although $D$ is obviously simpler than the original covariance matrix $S$, the merit of constructing the new variables is not yet apparent. However, the variances of the variables $y_1, y_2$, and $y_3$ appear on the diagonal of $D$, and the first variance in $D$ is much larger than the other two. As we shall see, this fact will permit us to view the data as essentially one-dimensional rather than three-dimensional.

Reducing the Dimension of Multivariate Data

Principal component analysis is potentially valuable for applications in which most of the variation, or dynamic range, in the data is due to variations in only a few of the new variables, $y_1,\dots,y_p$.

It can be shown that an orthogonal change of variables, $\boldsymbol X = P\boldsymbol Y$, does not change the total variance of the data. (If $A$ and $B$ are $n\times n$ matrices, then $\operatorname{tr}(AB) = \operatorname{tr}(BA)$; this follows from a direct computation with the entries. Thus $\operatorname{tr}(P^TSP) = \operatorname{tr}(SPP^T) = \operatorname{tr}(S)$.) This means that if $S = PDP^T$, then

$$\{\text{total variance of } x_1,\dots,x_p\} = \{\text{total variance of } y_1,\dots,y_p\} = \operatorname{tr}(D) = \lambda_1 + \cdots + \lambda_p$$

The variance of $y_j$ is $\lambda_j$, and the quotient $\lambda_j/\operatorname{tr}(S)$ measures the fraction of the total variance that is "explained" or "captured" by $y_j$.
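In code, the fractions $\lambda_j/\operatorname{tr}(S)$ are a one-liner once the eigenvalues are known; the numbers below are illustrative only, not the Railroad Valley values.

```python
import numpy as np

# Illustrative eigenvalues of S, largest first (made-up values).
eigvals = np.array([9.0, 0.7, 0.3])

fractions = eigvals / eigvals.sum()   # lambda_j / tr(S), since tr(S) = lambda_1 + ... + lambda_p
print(np.round(100 * fractions, 1))   # percentages of total variance explained: [90.  7.  3.]
```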

EXAMPLE 5
Compute the various percentages of variance of the Railroad Valley multispectral data that are displayed in the principal component photographs, (d)–(f), shown in Example 2.
SOLUTION

[The total variance is computed as $\operatorname{tr}(D) = \lambda_1 + \lambda_2 + \lambda_3$.]
The percentages of the total variance explained by the principal components are
[The percentages $\lambda_j/\operatorname{tr}(S)$ for the three principal components are listed here; the first is the 93.5% quoted in Example 2.]
The calculations in Example 5 show that the data have practically no variance in the third (new) coordinate. The values of $y_3$ are all close to zero. Geometrically, the data points lie nearly in the plane $y_3 = 0$, and their locations can be determined fairly accurately by knowing only the values of $y_1$ and $y_2$. In fact, $y_2$ also has relatively small variance, which means that the points lie approximately along a line, and the data are essentially one-dimensional. See Figure 2, in which the data resemble a popsicle stick.
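Reducing the dimension then amounts to keeping only the first $q$ rows of $Y = P^TB$, that is, projecting each observation onto the subspace spanned by the first $q$ principal components. A self-contained sketch on random data (the data and the choice $q = 1$ are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.random((3, 1000))                        # p = 3 variables, N = 1000 observations
M = X.mean(axis=1, keepdims=True)
B = X - M                                        # mean-deviation form
S = (B @ B.T) / (B.shape[1] - 1)

eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, P = eigvals[order], eigvecs[:, order]

q = 1                                            # keep only the first principal component
Y_reduced = P[:, :q].T @ B                       # q x N matrix of new coordinates
X_approx = P[:, :q] @ Y_reduced + M              # reconstruction of the data from q components
print(Y_reduced.shape, X_approx.shape)           # (1, 1000) (3, 1000)
```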

Characterizations of Principal Component Variables

If $y_1,\dots,y_p$ arise from a principal component analysis of a $p\times N$ matrix of observations, then the variance of $y_1$ is as large as possible in the following sense:
If $\boldsymbol u$ is any unit vector and if $y = \boldsymbol u^T\boldsymbol X$, then the variance of the values of $y$ as $\boldsymbol X$ varies over the original data $\boldsymbol X_1,\dots,\boldsymbol X_N$ turns out to be $\boldsymbol u^TS\boldsymbol u$: with the data in mean-deviation form, so that $B = \begin{bmatrix}\boldsymbol X_1&\cdots&\boldsymbol X_N\end{bmatrix}$ and $y_i = \boldsymbol u^T\boldsymbol X_i$,
$$\frac{1}{N-1}\sum_{i=1}^N y_i^2=\frac{1}{N-1}\sum_{i=1}^N(\boldsymbol u^T\boldsymbol X_i)(\boldsymbol X_i^T\boldsymbol u)=\frac{1}{N-1}\boldsymbol u^T\Bigl(\sum_{i=1}^N\boldsymbol X_i\boldsymbol X_i^T\Bigr)\boldsymbol u=\frac{1}{N-1}\boldsymbol u^TBB^T\boldsymbol u=\boldsymbol u^TS\boldsymbol u.$$
The maximum value of the quadratic form $\boldsymbol u^TS\boldsymbol u$, over all unit vectors $\boldsymbol u$, is the largest eigenvalue $\lambda_1$ of $S$, and this variance is attained when $\boldsymbol u$ is the corresponding eigenvector $\boldsymbol u_1$. In the same way, $y_2$ has maximum possible variance among all variables $y = \boldsymbol u^T\boldsymbol X$ that are uncorrelated with $y_1$. Likewise, $y_3$ has maximum possible variance among all variables uncorrelated with both $y_1$ and $y_2$, and so on.
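This maximum property can be checked numerically: for unit vectors $\boldsymbol u$, the quadratic form $\boldsymbol u^TS\boldsymbol u$ stays below the largest eigenvalue $\lambda_1$ and reaches it at the corresponding eigenvector. A small sketch with a random symmetric positive semidefinite matrix (not data from the text):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.random((3, 3))
S = A @ A.T                               # an arbitrary symmetric positive semidefinite matrix

eigvals, eigvecs = np.linalg.eigh(S)
lam1, u1 = eigvals[-1], eigvecs[:, -1]    # largest eigenvalue and its unit eigenvector

# The quadratic form u^T S u over many random unit vectors never exceeds lambda_1 ...
U = rng.normal(size=(3, 10000))
U /= np.linalg.norm(U, axis=0)            # normalize each column to a unit vector
quad = np.sum(U * (S @ U), axis=0)        # column j gives u_j^T S u_j
print((quad <= lam1 + 1e-12).all())       # True

# ... and it attains lambda_1 at the eigenvector u_1.
print(np.isclose(u1 @ S @ u1, lam1))      # True
```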



Reposted from blog.csdn.net/weixin_42437114/article/details/109009322