[Machine Learning] Singular Value Decomposition (SVD) and Principal Component Analysis (PCA)

1. Description

In machine learning (ML), some of the most important linear algebra concepts are singular value decomposition (SVD) and principal component analysis (PCA). With all the raw data collected, how do we discover structure? For example, with interest rates for the past 6 days, can we understand their composition and spot trends?

For high-dimensional raw data, this becomes more difficult. It's like finding a needle in a haystack. SVD allows us to extract and unpack the information. In this article, we will introduce SVD and PCA in detail. We assume that you have basic knowledge of linear algebra, including rank and eigenvectors. If you have trouble reading this article, I recommend refreshing those concepts first. At the end of the article, we will answer some questions from the interest rate example above. This article also contains optional sections. Feel free to skip them depending on your interest level.

2. Misconceptions (optional for beginners)

I am aware of some common questions that non-beginners may ask. Let me start by addressing the elephant in the room. Is PCA just dimensionality reduction? PCA reduces dimensions, but it is much more than that. I like the Wikipedia description (though if you don't know PCA yet, it will read as gibberish):

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (each taking on various numerical values) into a set of values of linearly uncorrelated variables called principal components.

In simple terms, PCA linearly transforms the data into new attributes that are uncorrelated with each other. For ML, framing PCA as feature extraction may allow us to explore its potential better than viewing it only as dimensionality reduction.

What is the difference between SVD and PCA? SVD gives you the whole nine yards of diagonalizing a matrix into special matrices that are easy to manipulate and analyze. It lays the foundation for untangling data into independent components. PCA skips the less important components. Clearly, we can use SVD to compute PCA by truncating the less important basis vectors in the original SVD matrices.

4. Matrix Diagonalization

In the article on eigenvalues and eigenvectors, we described a method for decomposing an n × n square matrix A into

A = V Λ V⁻¹

where the columns of V hold the eigenvectors of A and the diagonal matrix Λ holds the corresponding eigenvalues.

If A is a square matrix with n linearly independent eigenvectors, then it can be diagonalized. Now, it's time to develop a solution for all matrices using SVD.
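To make this concrete, here is a minimal numpy sketch of eigendecomposition; the matrix A below is made up for illustration:

```python
import numpy as np

# A hypothetical 2 x 2 matrix with two linearly independent eigenvectors.
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# Columns of V are the eigenvectors; Lambda holds the eigenvalues.
eigvals, V = np.linalg.eig(A)
Lambda = np.diag(eigvals)

# Verify the diagonalization A = V Lambda V^-1.
print(np.allclose(A, V @ Lambda @ np.linalg.inv(V)))  # True
```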

5. Singular vectors and singular values

The matrices AAᵀ and AᵀA are very special in linear algebra. Given any m × n matrix A, we can multiply it by Aᵀ to form AAᵀ and AᵀA respectively. These matrices are

  • symmetric,
  • square,
  • at least positive semidefinite (all eigenvalues are zero or positive),
  • both matrices have the same positive eigenvalues, and
  • both have the same rank r as A.

The covariance matrix we often use in ML also takes this form. Since these matrices are symmetric, we can choose their eigenvectors to be orthonormal (perpendicular to each other and of unit length), a fundamental property of symmetric matrices.

Let us introduce some frequently used terms in SVD. We name the eigenvectors of AAᵀ as uᵢ and those of AᵀA as vᵢ, and call these sets of vectors u and v the singular vectors of A. Both matrices have the same positive eigenvalues. The square roots of these eigenvalues are called the singular values.

Not much explanation so far, but let's put everything together first and explain it afterwards. We concatenate the vectors uᵢ into U and the vectors vᵢ into V to form orthogonal matrices.

Since these vectors are orthonormal, it is easy to show that U and V obey UᵀU = I and VᵀV = I.

6. SVD

Let's start with the hard part first. SVD states that any matrix A can be decomposed as

A = U S Vᵀ

where U and V are orthogonal matrices whose columns are the orthonormal eigenvectors chosen from AAᵀ and AᵀA respectively. S is a diagonal matrix whose r nonzero elements equal the square roots of the positive eigenvalues of AAᵀ or AᵀA (both matrices have the same positive eigenvalues anyway). These diagonal elements are the singular values.

That is, an m × n matrix can be factorized as the product of an m × m orthogonal matrix U, an m × n diagonal matrix S, and the transpose of an n × n orthogonal matrix V.

We can arrange the eigenvectors in different orders to produce U and V. To standardize the solution, we order the eigenvectors so that those with higher eigenvalues come before those with smaller eigenvalues.

In contrast to eigendecomposition, SVD works for non-square matrices. U and V are invertible for any matrix in SVD, and they are orthogonal matrices, which we like. Though we will not prove it here, singular values are also numerically more stable than eigenvalues.

Example ( example source )

Before going too far, let's demonstrate it with a simple example. This will make things very easy to understand.

We calculate:

These matrices are at least positive semidefinite (all eigenvalues are positive or zero). As shown, they have the same positive eigenvalues (25 and 9). The figure below also shows their corresponding eigenvectors.

The singular values are the square roots of the positive eigenvalues, 5 and 3. Therefore, the SVD decomposition is
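Here is a minimal numpy sketch of this example. The example matrix itself is not reproduced above, so we assume a standard 2 × 3 matrix whose AAᵀ and AᵀA have positive eigenvalues 25 and 9 and singular values 5 and 3, matching the numbers quoted:

```python
import numpy as np

# Assumed 2 x 3 example matrix with positive eigenvalues 25, 9
# and singular values 5, 3 (matching the numbers in the text).
A = np.array([[3.0, 2.0, 2.0],
              [2.0, 3.0, -2.0]])

# A A^T and A^T A share the same positive eigenvalues.
print(np.linalg.eigvalsh(A @ A.T))   # [ 9. 25.]
print(np.linalg.eigvalsh(A.T @ A))   # approximately [0, 9, 25]

# Singular values are the square roots of those positive eigenvalues.
U, s, Vt = np.linalg.svd(A)
print(s)                              # [5. 3.]
print(np.allclose(A, U @ np.diag(s) @ Vt[:2, :]))  # True: A = U S V^T
```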

7. Proof (optional)

To prove SVD, we want to solve for U, S and V satisfying the following equations:

We have 3 unknowns. Hopefully we can solve them with the 3 equations above.  The transpose of A is


We calculate AᵀA,

The last equation is equivalent to the eigenvector definition for the matrix AᵀA. We just put all the eigenvectors into one matrix,

with  VS²  equal to

V holds all the eigenvectors vᵢ of AᵀA, and S holds the square roots of all the eigenvalues of AᵀA. We can repeat the same process for AAᵀ and obtain a similar equation.

        Now, we only need to solve for  U, V  and  S

        and prove the theorem.

8. Review

        The following is a review of SVD.

where

9. Reformulation of SVD

Since the matrix V is orthogonal, VᵀV equals I. We can rewrite the SVD equation A = USVᵀ as AV = US.

This equation establishes an important relationship between uᵢ and vᵢ.

Recall that A = USVᵀ. Applying AV = US column by column gives

Avᵢ = σᵢuᵢ

and, expanding A = USVᵀ term by term, this is equivalent to

A = σ₁u₁v₁ᵀ + σ₂u₂v₂ᵀ + … + σᵣuᵣvᵣᵀ

That is, the SVD decomposition can be viewed as a series of outer products of uᵢ and vᵢ.

This formulation of SVD is key to understanding the components of A. It provides an important way to break an m × n array of entangled data into r components. Since uᵢ and vᵢ are unit vectors, we can even ignore the terms σᵢuᵢvᵢᵀ with very small singular values σᵢ. (We'll come back to this point later.)

Let's start by reusing the previous example and showing how it works.

The above matrix  A  can be decomposed into
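To make the outer-product view concrete, here is a short numpy sketch reusing the same assumed example matrix as before:

```python
import numpy as np

A = np.array([[3.0, 2.0, 2.0],
              [2.0, 3.0, -2.0]])
U, s, Vt = np.linalg.svd(A)

# Rebuild A as a sum of rank-1 outer products sigma_i * u_i * v_i^T.
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
print(np.allclose(A, A_rebuilt))      # True

# Keeping only the largest singular value gives the best rank-1 approximation.
A_rank1 = s[0] * np.outer(U[:, 0], Vt[0, :])
print(np.linalg.norm(A - A_rank1))    # error equals the dropped sigma_2 = 3
```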

10. Column Space, Row Space, Left Null Space and Null Space (optional - for advanced users)

Next, let's look at what U and V consist of. Suppose A is an m × n matrix of rank r. AᵀA is an n × n symmetric matrix. Every symmetric matrix has a choice of n orthonormal eigenvectors vⱼ. Since Avᵢ = σᵢuᵢ and the vⱼ are orthonormal eigenvectors of AᵀA, we can compute the value of uᵢᵀuⱼ as

It equals zero for i ≠ j. That is, uᵢ and uⱼ are orthogonal to each other. As mentioned earlier, they are also eigenvectors of AAᵀ.

From Avᵢ = σᵢuᵢ, we can see that uᵢ lies in the column space of A.

Since A has rank r, we can choose these r uᵢ vectors to be orthonormal. So what are the remaining m − r orthonormal eigenvectors of AAᵀ? Since the left null space of A is orthogonal to the column space, it is natural to choose them as the remaining eigenvectors. (The left null space N(Aᵀ) is the span of the vectors x satisfying Aᵀx = 0.) A similar argument applies to the eigenvectors of AᵀA. Therefore,

        Returning to the previous SVD equation, from

we simply fill in the remaining eigenvectors from the left null space and the null space.
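To make this concrete, here is a small numpy sketch with a made-up rank-deficient matrix, showing how the columns of U and V split into the four subspaces:

```python
import numpy as np

# A hypothetical 3 x 4 matrix of rank 2 (the third row is the sum of the first two).
A = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 1.0],
              [1.0, 3.0, 1.0, 2.0]])

U, s, Vt = np.linalg.svd(A)
r = np.sum(s > 1e-10)                 # numerical rank, here r = 2

col_space  = U[:, :r]                 # first r columns of U: column space of A
left_null  = U[:, r:]                 # remaining columns: left null space N(A^T)
row_space  = Vt[:r, :].T              # first r columns of V: row space of A
null_space = Vt[r:, :].T              # remaining columns: null space N(A)

print(np.allclose(A @ null_space, 0))    # True: A x = 0 for null-space vectors
print(np.allclose(A.T @ left_null, 0))   # True: A^T x = 0 for left-null vectors
```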

11. Moore-Penrose Pseudoinverse

For a system of linear equations Ax = b, we can compute the inverse of a square matrix A to solve for x.

But not all matrices are invertible. Moreover, in ML, an exact solution is unlikely to exist when the data contains noise. Our goal is to find the model that best fits the data. To find the best-fit solution, we compute the pseudoinverse

        This minimizes the least squares error below.

        The solution for x can be estimated as,

In a linear regression problem, x is our linear model, A contains the training data, and b contains the corresponding labels. We can solve for x by

Below is an example.
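Since the original numbers are not reproduced here, the sketch below uses made-up data; np.linalg.pinv computes the Moore-Penrose pseudoinverse via SVD:

```python
import numpy as np

# Hypothetical overdetermined system: 4 noisy observations, 2 model parameters.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])            # training data (with a bias column)
b = np.array([2.1, 2.9, 4.2, 4.8])    # labels

# x = A^+ b minimizes ||Ax - b||^2.
x = np.linalg.pinv(A) @ b
print(x)

# Same answer as numpy's built-in least-squares solver.
print(np.linalg.lstsq(A, b, rcond=None)[0])
```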

12. Variance and covariance

In ML, we recognize patterns and relationships. How do we identify correlations between attributes in the data? Let's start the discussion with an example. We sample the height and weight of 12 people and compute their means. We zero-center the data by subtracting the mean from each original value. For example, the matrix A below holds the zero-centered heights and weights.

When we plot the data points, we can see that height and weight are positively correlated. But how do we quantify this relationship?

First, what is variance? We probably started learning about variance in high school. Let us introduce its cousin, the sample variance, defined as S² = Σᵢ (xᵢ − x̄)² / (n − 1).

Note that it divides by n − 1 instead of n as in the population variance. With a finite sample size, the sample mean is itself an estimate that depends on the sample. The average squared distance from this sample mean will be smaller than the average squared distance from the true population mean. Dividing by n − 1 compensates for this underestimate, and the sample variance S² can be shown to be an unbiased estimator of the variance σ². (The proof is not very important, so I simply provide a link to the proof here.)

13. Covariance Matrix

Variance measures how a single variable varies with itself, while covariance measures how two variables (a and b) vary together: cov(a, b) = Σᵢ (aᵢ − ā)(bᵢ − b̄) / (n − 1).

        We can store all these possible combinations of covariances in a matrix called the covariance matrix  Σ.

We can rewrite it in a simple matrix form: for zero-centered data A (rows are variables, columns are samples), Σ = AAᵀ / (n − 1).

The diagonal elements hold the variance of each variable (such as height), and the off-diagonal elements hold the covariance between two variables. Now let's calculate the sample covariance.

A positive sample covariance indicates that weight and height are positively related. It is negative if they are negatively correlated, and zero if they are uncorrelated.
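Here is a minimal numpy sketch of the sample covariance matrix, using made-up heights and weights (the article's 12 actual measurements are not shown above); it checks that Σ = AAᵀ/(n − 1) agrees with np.cov:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up heights (cm) and weights (kg) for 12 people.
height = rng.normal(170, 10, 12)
weight = 0.9 * height - 95 + rng.normal(0, 3, 12)   # roughly correlated with height

# Rows = variables, columns = samples, zero-centered across each row.
A = np.vstack([height - height.mean(), weight - weight.mean()])

# Sample covariance matrix: Sigma = A A^T / (n - 1).
n = A.shape[1]
Sigma = A @ A.T / (n - 1)
print(Sigma)
print(np.allclose(Sigma, np.cov(A)))   # np.cov agrees (it also divides by n - 1)
```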

Covariance Matrix and SVD

We can use SVD to decompose the sample covariance matrix. Since σ₂ is relatively small compared to σ₁, we can even ignore the σ₂ term. When we train an ML model, we can combine weight and height into a single new attribute, rather than treating them as two separate but correlated attributes (entangled data often makes model training harder).

u₁ has a special significance. It is the principal component of S.

The sample covariance matrix in the context of SVD has several properties:

  • The total variance of the data equals the trace of the sample covariance matrix S, which equals the sum of its eigenvalues (its singular values, since S is symmetric positive semidefinite). With this, we can compute the fraction of variance lost if we drop the smaller σᵢ terms, which reflects how much information would be lost (see the sketch after this list).

  •  The first eigenvector  u₁  of S points in the most important direction of the data. In our example, it quantifies the typical ratio between weight and height.

  • The error, computed as the sum of the perpendicular squared distances from the sample points to u₁ (perpendicular least squares), is minimized when SVD is used.
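A small sketch of the trace property and the fraction of retained variance, continuing the made-up height/weight data from above:

```python
import numpy as np

rng = np.random.default_rng(0)
height = rng.normal(170, 10, 12)
weight = 0.9 * height - 95 + rng.normal(0, 3, 12)
A = np.vstack([height - height.mean(), weight - weight.mean()])
S = A @ A.T / (A.shape[1] - 1)        # sample covariance matrix

# SVD of the symmetric PSD matrix S: its singular values equal its eigenvalues.
sigmas = np.linalg.svd(S, compute_uv=False)

# Total variance = trace(S) = sum of S's singular values.
print(np.isclose(np.trace(S), sigmas.sum()))   # True

# Fraction of variance kept if we drop all but the first component.
print(sigmas[0] / sigmas.sum())
```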

Properties

The covariance matrix Σ is not only symmetric but also positive semidefinite. Since variance is zero or positive, uᵀΣu below is always greater than or equal to zero. By the energy test, Σ is positive semidefinite.

therefore

Often, we want to know the covariance of the data after some linear transformation A. It can be computed from the transformation matrix A and the covariance Σ of the original data: the transformed covariance is AΣAᵀ.
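A quick empirical check of this rule, Cov(AX) = AΣAᵀ, with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)

X = rng.normal(size=(3, 1000))        # 3 variables, 1000 samples (made-up data)
A = np.array([[1.0, 2.0, 0.0],        # some linear transformation
              [0.0, 1.0, -1.0]])

Sigma = np.cov(X)                     # covariance of the original data
Sigma_transformed = np.cov(A @ X)     # covariance of the transformed data

# Covariance of the transformed data equals A Sigma A^T.
print(np.allclose(Sigma_transformed, A @ Sigma @ A.T))   # True
```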

Correlation matrix

The correlation matrix is ​​a scaled version of the covariance matrix. The correlation matrix standardizes (scales) the variables to have a standard deviation of 1.

A correlation matrix is used when the variables are on very different scales. Poor scaling can hurt ML algorithms such as gradient descent.
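A small sketch of converting a covariance matrix into a correlation matrix by scaling with the standard deviations:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(2, 500)) * np.array([[10.0], [0.1]])   # very different scales

Sigma = np.cov(X)
stddev = np.sqrt(np.diag(Sigma))

# Correlation = D^-1 Sigma D^-1, where D is diagonal with the standard deviations.
Corr = Sigma / np.outer(stddev, stddev)
print(Corr)
print(np.allclose(Corr, np.corrcoef(X)))   # matches numpy's correlation matrix
```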

14. Visualization

So far we have seen many equations. Let's visualize SVD in action and develop insights step by step. SVD decomposes a matrix A into USVᵀ. Applying A to a vector x (i.e. computing Ax) can be visualized as performing a rotation (Vᵀ), a scaling (S) and another rotation (U) on x.

As shown above, A transforms each singular vector vᵢ of V into σᵢuᵢ, or in full matrix form, AV = US.

(Figure: the matrix form illustrated for the case r = m < n.)
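To see the rotate-scale-rotate picture numerically, here is a sketch that applies Vᵀ, S and U to a vector step by step (the matrix and vector are made up):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
x = np.array([1.0, -1.0])

U, s, Vt = np.linalg.svd(A)

step1 = Vt @ x                # rotation (or reflection) by V^T
step2 = np.diag(s) @ step1    # scaling by the singular values
step3 = U @ step2             # rotation (or reflection) by U

print(np.allclose(A @ x, step3))   # True: Ax = U S V^T x
```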

15. Insights from SVD

As mentioned earlier, SVD can be expressed as A = σ₁u₁v₁ᵀ + σ₂u₂v₂ᵀ + …

Since uᵢ and vᵢ have unit length, the most important factor determining the importance of each term is the singular value σᵢ. We purposely sort the σᵢ in descending order. If the singular values become too small, we can ignore the remaining terms (+ σᵢuᵢvᵢᵀ + …).

This formulation has some interesting implications. For example, suppose we have a matrix containing the returns of stocks traded by different investors.

As fund managers, what information can we learn from it? Finding patterns and structure is the first step. Perhaps we can identify the combinations of stocks and investors with the highest returns. SVD decomposes an m × n matrix into r components, and the singular value σᵢ indicates the significance of each. Think of it as a way of disentangling correlated, entangled properties into fewer principal directions with no correlation between them.

If the data are highly correlated, we should expect many σi values ​​to be small and negligible.

In our previous example, weight and height were highly correlated. If we have a matrix holding the weights and heights of 1000 people, the first component of the SVD decomposition will dominate. The u₁ vector indeed reflects the typical ratio between weight and height among these 1000 people, as discussed earlier.

16. Principal Component Analysis

Technically, SVD extracts data along the directions of greatest variance. PCA is a linear model that maps m-dimensional input features to k-dimensional latent factors (the k principal components). If we ignore the less important terms, we remove the components we care less about but keep the principal directions with the highest variance (the most information).

Consider the 3-D data points shown below as blue dots. They can be approximated well by a plane.


You may quickly realize that we can use SVD to find the projection matrix W. Consider the following data points lying in a two-dimensional space.

SVD chooses the projection that maximizes the variance of its output. Therefore, PCA picks the blue line rather than the green line, since the projected variance along it is higher.

As indicated below, we keep the eigenvectors corresponding to the top k highest singular values.
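Here is a minimal sketch of PCA via truncated SVD on made-up zero-centered data; the matrix W below plays the role of the projection matrix W mentioned above:

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up data: 200 samples of 5 correlated features (rows = samples here).
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(200, 5))
X = X - X.mean(axis=0)                 # zero-center each feature

# SVD of the data matrix; rows of Vt are the principal directions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
W = Vt[:k].T                           # keep the top-k directions (n_features x k)
Z = X @ W                              # project onto the k principal components
X_approx = Z @ W.T                     # reconstruct from the latent factors

print(np.linalg.norm(X - X_approx) / np.linalg.norm(X))   # small reconstruction error
```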

Interest rates

Let's illustrate the concept in more depth using interest rate data from the U.S. Treasury Department. Rates (in basis points) for 9 different maturities (3 months, 6 months, ..., up to 20 years) are collected over 6 consecutive business days, and A stores the day-over-day differences. The elements of A also have their means subtracted, i.e. A is zero-centered (across its rows).

The sample covariance matrix is equal to S = AAᵀ/(5 − 1).
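The actual Treasury numbers are not reproduced above, so the following sketch uses placeholder data for the 9 × 5 matrix of zero-centered daily rate changes:

```python
import numpy as np

rng = np.random.default_rng(4)

# Placeholder for the 9 x 5 matrix of daily rate changes
# (9 maturities, 5 day-over-day differences); the real Treasury data
# is not reproduced here.
A = rng.normal(size=(9, 5))
A = A - A.mean(axis=1, keepdims=True)   # zero-center across each row

S = A @ A.T / (5 - 1)                   # sample covariance matrix, as above

U, sigmas, _ = np.linalg.svd(S)

# Focus on the first three principal components and the variance they explain.
print(U[:, :3])                          # principal directions across maturities
print(sigmas[:3].sum() / sigmas.sum())   # fraction of total variance retained
```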

Now we have the covariance matrix S that we want to decompose. SVD decomposes it into

From the SVD decomposition, we realize that we can focus on the first three principal components.

As shown, the first principal component corresponds to a weighted average of the daily changes across all maturities. The second principal component adjusts the daily changes according to the bond maturity. (The third principal component is likely curvature, i.e. the second derivative.)

We know the relationship between interest rate changes and maturities all too well from daily life, so here the principal components simply reaffirm our beliefs about how interest rates behave. However, when faced with unfamiliar raw data, PCA is very helpful for extracting the principal components of the data and discovering the underlying structure of the information. This goes some way toward answering the question of how to find the needle in the haystack.

17. Tips

Scale the features before performing SVD.

For example, if we want to retain 99% of the variance, we can choose the smallest k such that the top k singular values of the covariance matrix account for at least 99% of their total sum.
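A small sketch of choosing k from the cumulative fraction of retained variance (the data is made up; 0.99 is the threshold quoted above):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 20))   # made-up low-rank data
X = X - X.mean(axis=0)

S = np.cov(X, rowvar=False)              # covariance matrix of the features
sigmas = np.linalg.svd(S, compute_uv=False)

# Smallest k whose leading singular values retain at least 99% of the variance.
ratio = np.cumsum(sigmas) / np.sum(sigmas)
k = int(np.searchsorted(ratio, 0.99) + 1)
print(k, ratio[k - 1])
```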


Source: blog.csdn.net/gongdiwudu/article/details/131956558