7.5 Applications to Image Processing and Statistics (PCA)

These are reading notes for *Linear Algebra and Its Applications*.

The main goal of this section is to explain a technique, called principal component analysis (PCA), used to analyze multivariate data.

Principal component analysis can be applied to any data that consist of lists of measurements made on a collection of objects or individuals. For instance, consider a chemical process that produces a plastic material. To monitor the process, 300 samples are taken of the material produced, and each sample is subjected to a battery of eight tests, such as melting point, density, ductility, tensile strength, and so on. The laboratory report for each sample is a vector in $\R^8$, and the set of such vectors forms an $8\times 300$ matrix, called the matrix of observations.

Loosely speaking, we can say that the process control data are eight-dimensional. The next two examples describe data that can be visualized graphically.

EXAMPLE 1
An example of two-dimensional data is given by a set of weights and heights of $N$ college students. Let $\boldsymbol X_j$ denote the observation vector in $\R^2$ that lists the weight and height of the $j$th student. If $w$ denotes weight and $h$ denotes height, then the matrix of observations has the form

$$\begin{bmatrix} \boldsymbol X_1 & \boldsymbol X_2 & \cdots & \boldsymbol X_N \end{bmatrix} = \begin{bmatrix} w_1 & w_2 & \cdots & w_N \\ h_1 & h_2 & \cdots & h_N \end{bmatrix}$$
The set of observation vectors can be visualized as a two-dimensional scatter plot. See Figure 1.

[Figure 1: A scatter plot of the observation vectors $\boldsymbol X_1,\dots,\boldsymbol X_N$.]
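As a concrete illustration of Example 1, the sketch below builds a $2\times N$ matrix of observations from synthetic weight and height data and draws the scatter plot of Figure 1. This is only a sketch: it assumes NumPy and Matplotlib are available, and every number is invented for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
N = 50  # hypothetical number of students

# Hypothetical, roughly correlated weights (pounds) and heights (inches).
heights = rng.normal(68, 3, N)
weights = 5.0 * heights - 200 + rng.normal(0, 15, N)

# Matrix of observations: column j is the observation vector X_j = [w_j, h_j]^T.
X = np.vstack([weights, heights])   # shape (2, N)

# Two-dimensional scatter plot of the observation vectors, as in Figure 1.
plt.scatter(X[0], X[1])
plt.xlabel("weight (w)")
plt.ylabel("height (h)")
plt.show()
```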

EXAMPLE 2 Multichannel Image Processing
Sensors aboard Landsat satellites acquire seven simultaneous images of any region on earth to be studied. The sensors record energy from separate wavelength bands: three in the visible light spectrum and four in infrared and thermal bands. Each image is digitized and stored as a rectangular array of numbers, each number indicating the signal intensity at a corresponding small point (or pixel).

The seven Landsat images of one fixed region typically contain much redundant information, since some features will appear in several images. Yet other features, because of their color or temperature, may reflect light that is recorded by only one or two sensors. One goal of multichannel image processing is to view the data in a way that extracts information better than studying each image separately.

Principal component analysis is an effective way to suppress redundant information and provide in only one or two composite images most of the information from the initial data. Roughly speaking, the goal is to find a special linear combination of the images, that is, a list of weights that at each pixel combine all seven corresponding image values into one new value. The weights are chosen in a way that makes the range of light intensities (the scene variance) in the composite image, called the first principal component, greater than that in any of the original images.

Principal component analysis is illustrated in the photos below, taken over Railroad Valley, Nevada. Images from three Landsat spectral bands are shown in (a)–(c). The total information in the three bands is rearranged in the three principal component images in (d)–(f). The first component (d) displays (or "explains") 93.5% of the scene variance present in the initial data. In this way, the three-channel initial data have been reduced to one-channel data, with a loss in some sense of only 6.5% of the scene variance.

[Photographs of Railroad Valley, Nevada: the three Landsat spectral-band images (a)–(c) and the three principal component images (d)–(f).]

The first three photographs of Railroad Valley, Nevada, can be viewed as one image of the region, with three spectral components. Each photograph gives different information about the same physical region. To each pixel there corresponds an observation vector in $\R^3$ that lists the signal intensities for that pixel in the three spectral bands. Typically, the image is $2000\times 2000$ pixels, so there are 4 million pixels in the image. The data for the image form a matrix with 3 rows and 4 million columns. The data can be visualized as a cluster of 4 million points in $\R^3$, perhaps as in Figure 2.

[Figure 2: The observation vectors for the image viewed as a cluster of points in $\R^3$.]

Notice that the points lie almost entirely in a two-dimensional plane: the original three coordinates can be reduced to two, which is exactly the kind of dimension reduction we are after. How, then, do we describe the direction along which the data vary the most, and the directions along which they hardly vary at all (the latter being the directions that can safely be discarded)? The answer lies in the notion of variance introduced below.
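Before turning to the definitions, here is a minimal sketch of how such a matrix of observations is assembled from a multiband image: each pixel of a three-band image becomes one observation vector in $\R^3$, and stacking these vectors as columns gives a $3\times(\text{number of pixels})$ matrix. The array sizes below are illustrative stand-ins, not the actual Landsat data.

```python
import numpy as np

rng = np.random.default_rng(1)
rows, cols = 200, 200                 # a small stand-in for a 2000 x 2000 scene
bands = rng.random((3, rows, cols))   # three spectral bands, one 2-D array per band

# Column k of X is the observation vector in R^3 for pixel k:
# the three signal intensities recorded at that pixel.
X = bands.reshape(3, rows * cols)
print(X.shape)   # (3, 40000)
```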

Mean and Covariance

To prepare for principal component analysis, let $\begin{bmatrix} \boldsymbol X_1 & \cdots & \boldsymbol X_N \end{bmatrix}$ be a $p\times N$ matrix of observations. The sample mean, $\boldsymbol M$, of the observation vectors is given by

$$\boldsymbol M = \frac{1}{N}(\boldsymbol X_1 + \cdots + \boldsymbol X_N)$$
For $k = 1,\dots,N$, let

$$\hat{\boldsymbol X}_k = \boldsymbol X_k - \boldsymbol M$$
The columns of the $p\times N$ matrix

$$B = \begin{bmatrix} \hat{\boldsymbol X}_1 & \hat{\boldsymbol X}_2 & \cdots & \hat{\boldsymbol X}_N \end{bmatrix}$$
have a zero sample mean, and $B$ is said to be in mean-deviation form.

The (sample) covariance matrix is the $p\times p$ matrix $S$ defined by

$$S = \frac{1}{N-1}BB^T$$

Since any matrix of the form $BB^T$ is positive semidefinite, so is $S$. (See Section 7.2.)

For $j = 1,\dots,p$, the diagonal entry $s_{jj}$ in $S$ is called the variance of $x_j$. The total variance of the data is the sum of the variances on the diagonal of $S$. In general, the sum of the diagonal entries of a square matrix $S$ is called the trace of the matrix, written $\operatorname{tr}(S)$. Thus

$$\{\text{total variance}\} = \operatorname{tr}(S)$$
The entry $s_{ij}$ in $S$ for $i\neq j$ is called the covariance of $x_i$ and $x_j$. If the covariance between $x_i$ and $x_j$ is 0, statisticians say that $x_i$ and $x_j$ are uncorrelated. Analysis of the multivariate data in $\boldsymbol X_1,\dots,\boldsymbol X_N$ is greatly simplified when most or all of the variables $x_1,\dots,x_p$ are uncorrelated, that is, when the covariance matrix of $\boldsymbol X_1,\dots,\boldsymbol X_N$ is diagonal or nearly diagonal.
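The definitions above translate directly into a few lines of NumPy. This is only a sketch on a small, made-up data set (the weights and heights are invented), not data from the text:

```python
import numpy as np

# A made-up 2 x 5 matrix of observations (p = 2 variables, N = 5 observations).
X = np.array([[120., 125., 125., 135., 145.],   # weights
              [ 61.,  60.,  64.,  68.,  72.]])  # heights
p, N = X.shape

M = X.mean(axis=1, keepdims=True)   # sample mean M (a p x 1 column vector)
B = X - M                           # mean-deviation form: column k is X_k - M
S = (B @ B.T) / (N - 1)             # sample covariance matrix, p x p, symmetric PSD

print(M.ravel())                    # sample means of weight and height
print(S)                            # diagonal entries are the variances s_jj
print(np.trace(S))                  # total variance tr(S)
```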

Principal Component Analysis

For simplicity, assume that the matrix $\begin{bmatrix} \boldsymbol X_1 & \cdots & \boldsymbol X_N \end{bmatrix}$ is already in mean-deviation form. The goal of principal component analysis is to find an orthogonal $p\times p$ matrix $P = \begin{bmatrix} \boldsymbol u_1 & \cdots & \boldsymbol u_p \end{bmatrix}$ that determines a change of variable, $\boldsymbol X = P\boldsymbol Y$, or

$$\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{bmatrix} = \begin{bmatrix} \boldsymbol u_1 & \boldsymbol u_2 & \cdots & \boldsymbol u_p \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{bmatrix}$$

with the property that the new variables $y_1,\dots,y_p$ are uncorrelated and are arranged in order of decreasing variance.

The orthogonal change of variable $\boldsymbol X = P\boldsymbol Y$ means that each observation vector $\boldsymbol X_k$ receives a "new name," $\boldsymbol Y_k$, such that $\boldsymbol X_k = P\boldsymbol Y_k$. Notice that $\boldsymbol Y_k$ is the coordinate vector of $\boldsymbol X_k$ with respect to the columns of $P$, and $\boldsymbol Y_k = P^T\boldsymbol X_k$ for $k = 1,\dots,N$.

It is not difficult to verify that for any orthogonal $P$, the covariance matrix of $\boldsymbol Y_1,\dots,\boldsymbol Y_N$ is $P^TSP$ [Hint: $\boldsymbol Y_1,\dots,\boldsymbol Y_N$ are in mean-deviation form]. So the desired orthogonal matrix $P$ is one that makes $P^TSP$ diagonal. Let $D$ be a diagonal matrix with the eigenvalues $\lambda_1,\dots,\lambda_p$ of $S$ on the diagonal, arranged so that $\lambda_1\geq\lambda_2\geq\cdots\geq\lambda_p\geq 0$, and let $P$ be an orthogonal matrix whose columns are the corresponding unit eigenvectors $\boldsymbol u_1,\dots,\boldsymbol u_p$. Then $S = PDP^T$ and $P^TSP = D$.

The unit eigenvectors $\boldsymbol u_1,\dots,\boldsymbol u_p$ of the covariance matrix $S$ are called the principal components of the data (in the matrix of observations). The first principal component is the eigenvector corresponding to the largest eigenvalue of $S$, the second principal component is the eigenvector corresponding to the second largest eigenvalue, and so on.

The first principal component $\boldsymbol u_1$ determines the new variable $y_1$ in the following way. Let $c_1,\dots,c_p$ be the entries in $\boldsymbol u_1$. Since $\boldsymbol u_1^T$ is the first row of $P^T$, the equation $\boldsymbol Y = P^T\boldsymbol X$ shows that

$$y_1 = \boldsymbol u_1^T\boldsymbol X = c_1x_1 + c_2x_2 + \cdots + c_px_p$$

Thus $y_1$ is a linear combination of the original variables $x_1,\dots,x_p$, using the entries in the eigenvector $\boldsymbol u_1$ as weights. In a similar fashion, $\boldsymbol u_2$ determines the variable $y_2$, and so on.
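A minimal sketch of this procedure in NumPy (the helper name `principal_components` is my own, and the random data serve only as a sanity check): `numpy.linalg.eigh` returns the eigenvalues of the symmetric matrix $S$ in ascending order, so they are reversed to obtain $\lambda_1\geq\cdots\geq\lambda_p$ and the principal components as the columns of $P$.

```python
import numpy as np

def principal_components(X):
    """Principal component analysis of a p x N matrix of observations X."""
    B = X - X.mean(axis=1, keepdims=True)     # put the data in mean-deviation form
    S = (B @ B.T) / (X.shape[1] - 1)          # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)      # eigh: ascending eigenvalues of symmetric S
    order = np.argsort(eigvals)[::-1]         # reorder so lambda_1 >= ... >= lambda_p
    eigvals, P = eigvals[order], eigvecs[:, order]
    Y = P.T @ B                               # change of variable: rows are y_1, ..., y_p
    return eigvals, P, Y

# Sanity check on random data: the covariance matrix of the new variables
# is (numerically) the diagonal matrix D = diag(lambda_1, ..., lambda_p).
rng = np.random.default_rng(2)
X = rng.random((3, 1000))
lam, P, Y = principal_components(X)
D = (Y @ Y.T) / (Y.shape[1] - 1)
print(np.allclose(D, np.diag(lam)))   # True
print(lam)                            # variances of y_1, y_2, y_3 in decreasing order
```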

EXAMPLE 4
The initial data for the multispectral image of Railroad Valley (Example 2) consisted of 4 million vectors in $\R^3$. The associated covariance matrix is

[The $3\times 3$ covariance matrix $S$ of the three spectral bands appears here.]
Find the principal components of the data, and list the new variable determined by the first principal component.
SOLUTION
The eigenvalues of $S$ and the associated principal components (the unit eigenvectors) are

[The eigenvalues $\lambda_1 > \lambda_2 > \lambda_3$ of $S$ and the corresponding unit eigenvectors $\boldsymbol u_1, \boldsymbol u_2, \boldsymbol u_3$ are listed here.]
Using two decimal places for simplicity, the variable for the first principal component is

[$y_1$ is written as a linear combination of $x_1, x_2, x_3$, with the entries of $\boldsymbol u_1$, rounded to two decimal places, as the weights.]
This equation was used to create photograph (d) in Example 2. The variables $x_1, x_2$, and $x_3$ are the signal intensities in the three spectral bands. At each pixel in photograph (d), the gray scale value is computed from $y_1$, a weighted linear combination of $x_1, x_2$, and $x_3$. In this sense, photograph (d) "displays" the first principal component of the data.
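A sketch of how a photograph such as (d) is assembled from the band images: every pixel's gray value is the combination $y_1 = c_1x_1 + c_2x_2 + c_3x_3$, with the entries of $\boldsymbol u_1$ as weights. The arrays and weights below are placeholders, not the actual Landsat data or eigenvector from Example 4.

```python
import numpy as np

rng = np.random.default_rng(3)
band1, band2, band3 = rng.random((3, 200, 200))   # stand-ins for the three spectral bands

# Entries of the first principal component u_1 (placeholder values here;
# in Example 4 they come from the unit eigenvector of S for the largest eigenvalue).
c1, c2, c3 = np.ones(3) / np.sqrt(3)

# One composite gray value per pixel: the first-principal-component image.
pc1_image = c1 * band1 + c2 * band2 + c3 * band3
print(pc1_image.shape)   # (200, 200)
```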

In Example 4, the covariance matrix for the transformed data, using variables $y_1, y_2$, and $y_3$, is

[The diagonal matrix $D$ whose diagonal entries are the eigenvalues of $S$, in decreasing order.]
Although $D$ is obviously simpler than the original covariance matrix $S$, the merit of constructing the new variables is not yet apparent. However, the variances of the variables $y_1, y_2$, and $y_3$ appear on the diagonal of $D$, and the first variance in $D$ is much larger than the other two. As we shall see, this fact will permit us to view the data as essentially one-dimensional rather than three-dimensional.

Reducing the Dimension of Multivariate Data

Principal component analysis is potentially valuable for applications in which most of the variation, or dynamic range, in the data is due to variations in only a few of the new variables, $y_1,\dots,y_p$.

It can be shown that an orthogonal change of variables, $\boldsymbol X = P\boldsymbol Y$, does not change the total variance of the data. (If $A$ and $B$ are $n\times n$ matrices, then $\operatorname{tr}(AB) = \operatorname{tr}(BA)$; this follows from a direct computation with the entries. Thus $\operatorname{tr}(P^TSP) = \operatorname{tr}(SPP^T) = \operatorname{tr}(S)$.) This means that if $S = PDP^T$, then

$$\{\text{total variance of } x_1,\dots,x_p\} = \{\text{total variance of } y_1,\dots,y_p\} = \operatorname{tr}(D) = \lambda_1 + \cdots + \lambda_p$$

The variance of $y_j$ is $\lambda_j$, and the quotient $\lambda_j/\operatorname{tr}(S)$ measures the fraction of the total variance that is "explained" or "captured" by $y_j$.
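In code, the fractions $\lambda_j/\operatorname{tr}(S)$ are a one-liner once the eigenvalues are known; the numbers below are illustrative only, not the Railroad Valley values.

```python
import numpy as np

# Illustrative eigenvalues of S, largest first (made-up values).
eigvals = np.array([9.0, 0.7, 0.3])

fractions = eigvals / eigvals.sum()   # lambda_j / tr(S), since tr(S) = lambda_1 + ... + lambda_p
print(np.round(100 * fractions, 1))   # percentages of total variance explained: [90.  7.  3.]
```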

EXAMPLE 5
Compute the various percentages of variance of the Railroad Valley multispectral data that are displayed in the principal component photographs, (d)–(f), shown in Example 2.
SOLUTION

[The total variance is computed as $\operatorname{tr}(D) = \lambda_1 + \lambda_2 + \lambda_3$.]
The percentages of the total variance explained by the principal components are
[The percentages $\lambda_j/\operatorname{tr}(S)$ for the three principal components are listed here; the first is the 93.5% quoted in Example 2.]
The calculations in Example 5 show that the data have practically no variance in the third (new) coordinate. The values of $y_3$ are all close to zero. Geometrically, the data points lie nearly in the plane $y_3 = 0$, and their locations can be determined fairly accurately by knowing only the values of $y_1$ and $y_2$. In fact, $y_2$ also has relatively small variance, which means that the points lie approximately along a line, and the data are essentially one-dimensional. See Figure 2, in which the data resemble a popsicle stick.
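Reducing the dimension then amounts to keeping only the first $q$ rows of $Y = P^TB$, that is, projecting each observation onto the subspace spanned by the first $q$ principal components. A self-contained sketch on random data (the data and the choice $q = 1$ are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.random((3, 1000))                        # p = 3 variables, N = 1000 observations
M = X.mean(axis=1, keepdims=True)
B = X - M                                        # mean-deviation form
S = (B @ B.T) / (B.shape[1] - 1)

eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, P = eigvals[order], eigvecs[:, order]

q = 1                                            # keep only the first principal component
Y_reduced = P[:, :q].T @ B                       # q x N matrix of new coordinates
X_approx = P[:, :q] @ Y_reduced + M              # reconstruction of the data from q components
print(Y_reduced.shape, X_approx.shape)           # (1, 1000) (3, 1000)
```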

Characterizations of Principal Component Variables

If $y_1,\dots,y_p$ arise from a principal component analysis of a $p\times N$ matrix of observations, then the variance of $y_1$ is as large as possible in the following sense:
If $\boldsymbol u$ is any unit vector and if $y = \boldsymbol u^T\boldsymbol X$, then the variance of the values of $y$ as $\boldsymbol X$ varies over the original data $\boldsymbol X_1,\dots,\boldsymbol X_N$ turns out to be $\boldsymbol u^TS\boldsymbol u$: with the data in mean-deviation form, so that $B = \begin{bmatrix}\boldsymbol X_1&\cdots&\boldsymbol X_N\end{bmatrix}$ and $y_i = \boldsymbol u^T\boldsymbol X_i$,
$$\frac{1}{N-1}\sum_{i=1}^N y_i^2=\frac{1}{N-1}\sum_{i=1}^N(\boldsymbol u^T\boldsymbol X_i)(\boldsymbol X_i^T\boldsymbol u)=\frac{1}{N-1}\boldsymbol u^T\Bigl(\sum_{i=1}^N\boldsymbol X_i\boldsymbol X_i^T\Bigr)\boldsymbol u=\frac{1}{N-1}\boldsymbol u^TBB^T\boldsymbol u=\boldsymbol u^TS\boldsymbol u.$$
The maximum value of the quadratic form $\boldsymbol u^TS\boldsymbol u$, over all unit vectors $\boldsymbol u$, is the largest eigenvalue $\lambda_1$ of $S$, and this variance is attained when $\boldsymbol u$ is the corresponding eigenvector $\boldsymbol u_1$. In the same way, $y_2$ has maximum possible variance among all variables $y = \boldsymbol u^T\boldsymbol X$ that are uncorrelated with $y_1$. Likewise, $y_3$ has maximum possible variance among all variables uncorrelated with both $y_1$ and $y_2$, and so on.
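This maximum property can be checked numerically: for unit vectors $\boldsymbol u$, the quadratic form $\boldsymbol u^TS\boldsymbol u$ stays below the largest eigenvalue $\lambda_1$ and reaches it at the corresponding eigenvector. A small sketch with a random symmetric positive semidefinite matrix (not data from the text):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.random((3, 3))
S = A @ A.T                               # an arbitrary symmetric positive semidefinite matrix

eigvals, eigvecs = np.linalg.eigh(S)
lam1, u1 = eigvals[-1], eigvecs[:, -1]    # largest eigenvalue and its unit eigenvector

# The quadratic form u^T S u over many random unit vectors never exceeds lambda_1 ...
U = rng.normal(size=(3, 10000))
U /= np.linalg.norm(U, axis=0)            # normalize each column to a unit vector
quad = np.sum(U * (S @ U), axis=0)        # column j gives u_j^T S u_j
print((quad <= lam1 + 1e-12).all())       # True

# ... and it attains lambda_1 at the eigenvector u_1.
print(np.isclose(u1 @ S @ u1, lam1))      # True
```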



Reposted from blog.csdn.net/weixin_42437114/article/details/109009322