Machine Learning Notes - Math in Principal Component Analysis

1. Principal Component Analysis

        Dimensions are a key attribute in data science: the dimensions are all the characteristics (features) of a dataset. For example, if you are looking at a dataset of music clips, the dimensions might be the genre, the length of the clip, the number of instruments, the presence of a singer, etc.

        You can think of all these dimensions as distinct columns. When there are only two dimensions, the data can be plotted on the X and Y axes. If you add color, you can represent a third dimension. The situation is similar when you have dozens or hundreds of dimensions; it is just much harder to visualize.

        When you have that many dimensions, some of them are correlated. For example, we can take for granted that the genre of a piece of music is related to the instruments that appear in it. One way to reduce the number of dimensions is to keep only some of them, but then highly informative features are very likely to be lost. So we need a way to reduce the dimensionality while keeping the important information in the dataset.

        The purpose of Principal Component Analysis (PCA) is to reduce the dimensionality of a dataset. PCA gives us a new set of dimensions, the principal components (PC). They are ordered: the first principal component is the direction associated with the largest variance. Furthermore, the principal components are orthogonal; remember that two vectors are orthogonal when their dot product equals 0. This means that each principal component is uncorrelated with the previous ones. Each principal component is a linear combination of the original features, for example a linear combination of the length of the piece and the number of instruments played, and you can choose to keep only the first few of them.

        Vectors that are orthogonal to each other and have unit norm are called orthonormal.

Orthogonal vector
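        Before diving into the math, here is a minimal sketch of what "keeping only the first few principal components" looks like in practice, assuming scikit-learn is available; the dataset and the feature relationship below are made up purely for illustration:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
data = rng.rand(100, 4)                               # 100 samples, 4 features
data[:, 1] = 2 * data[:, 0] + 0.1 * rng.randn(100)    # make two features correlated

pca = PCA(n_components=2)        # keep only the first two principal components
reduced = pca.fit_transform(data)

print(reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)  # fraction of variance captured by each component

        The sections below rebuild this behaviour from scratch with NumPy.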

2. Mathematical derivation

1. Problem description

        The problem can be expressed as finding an encoding function f that takes a data point x from R^n and returns a code c = f(x) in R^l; in other words, we want to change the dimension of the dataset from n to l. If l < n, the new dataset is compressed because it has a reduced number of features. We also need a decoding function g that can map the transformed dataset back to (an approximation of) the original dataset: g(f(x)) ≈ x.

Principal Component Analysis of Coordinate System Change

         The first step is to understand the shape of the data. x^(i) is a data point with n dimensions, and there are m data points in total:

        { x^(1), x^(2), ..., x^(m) },   with each x^(i) ∈ R^n

        Stacking the m data points gives the following matrix form:

        X = [ x^(1)  x^(2)  ⋯  x^(m) ]^T,   an (m × n) matrix,

        which can also be written as X = [ x_1  x_2  ⋯  x_n ],

        where each column x_1 ⋯ x_n is a vector containing the m observations of one feature.

        Then, our goal is to transform each x ∈ R^n into a code c ∈ R^l.

2. Constraints

         The encoding function f(x) converts x to c, and the decoding function converts c back to an approximation of x. For simplicity, PCA will obey some constraints:

        Constraint 1:

        The decoding function must be a simple matrix multiplication: g(c)=Dc; by applying the matrix D to the dataset from the new coordinate system, it should transform back to the original coordinate system.

        Constraint 2:

        The columns of D must be orthogonal.

        Constraint 3:

        The columns of D must have unit norm.
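        Constraints 2 and 3 together say that D^T D is the (l × l) identity matrix. Here is a small NumPy check, using a matrix with orthonormal columns obtained from a QR decomposition (an illustrative construction, not part of the derivation):

import numpy as np

n, l = 5, 2
rng = np.random.RandomState(1)
D, _ = np.linalg.qr(rng.randn(n, l))    # D has orthonormal columns, shape (n, l)

print(np.allclose(D.T @ D, np.eye(l)))  # True: columns are orthogonal with unit norm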

3. Find the encoding function

         For the time being we will consider only one data point. Therefore, these matrices and vectors have the following dimensions: x is (n × 1), the code c is (l × 1), and the decoding matrix D is (n × l).

         We want a decoding function which is a simple matrix multiplication. Therefore, we have g(c)=Dc. Then we will find the encoding function from the decoding function.

        We want to minimize the error between the decoded data points and the actual data points. This means reducing the distance between x and g(c). We will use the squared L2 norm as the error metric.

        Let us call c* the optimal c:

        c* = arg min_c || x − g(c) ||₂²

        This means that we want to find the value of the vector c for which the error || x − g(c) ||₂² is as small as possible.

        The squared L2 norm of a vector y can be expressed as:

        || y ||₂² = y^T y

        We named the variable y to avoid confusion with x; here y = x − g(c).

        Therefore, the equation we want to minimize becomes:

        ( x − g(c) )^T ( x − g(c) )

        Expanding it with the distributive property, we obtain:

        x^T x − x^T g(c) − g(c)^T x + g(c)^T g(c)

        By the properties of the transpose, ( x^T g(c) )^T = g(c)^T x.

        Since the result of x^T g(c) is a scalar, it is equal to its own transpose, so x^T g(c) = g(c)^T x.

        So the equation becomes:

        x^T x − 2 x^T g(c) + g(c)^T g(c)

         The first term does not depend on c. Since we want to minimize the function with respect to c, we can remove this term and simplify to:

        c* = arg min_c ( −2 x^T g(c) + g(c)^T g(c) )

        Since g(c) = Dc:

        c* = arg min_c ( −2 x^T D c + (Dc)^T Dc )

        Because (Dc)^T Dc = c^T D^T D c, we have:

        c* = arg min_c ( −2 x^T D c + c^T D^T D c )

        As we know, D^T D = I_l because D is orthogonal (actually, it is semi-orthogonal if n ≠ l) and its columns have unit norm. We can replace D^T D by the identity in the equation:

        c* = arg min_c ( −2 x^T D c + c^T c )

4. Minimize the function 

         The goal now is to find the minimum of this function. A widely used method is the gradient descent algorithm. That is not our focus here, so briefly: the main idea is that the derivative of the function at a particular value of x tells you whether you need to increase or decrease x to reach the minimum. When the slope is close to 0, the minimum has been reached.

gradient descent

         However, functions with local minima may affect the descent:

Gradient descent can get stuck in local minima

         The examples are two-dimensional, but the principle generalizes to higher dimensions. The gradient is a vector containing the partial derivatives with respect to all dimensions; its mathematical notation is ∇_x f(x).

5. Calculate the gradient of the function 

         Here, we want to minimize with respect to each dimension of c; we are looking for a slope of 0. The equation is:

        ∇_c( −2 x^T D c + c^T c ) = 0

        Let's compute the derivative of the two terms separately. For the first term, the derivative of −2 x^T D c with respect to c is −2 (x^T D)^T. We want c (and hence the gradient) to be a column vector of shape (l, 1), which is why we transpose x^T D:

        ∇_c( −2 x^T D c ) = −2 D^T x

Dimension of matrix/vector dot product

        The second term gives:

        ∇_c( c^T c ) = 2 c

        Setting the gradient to zero, we then get:

        −2 D^T x + 2 c = 0

        and therefore c = D^T x. We have found the encoding function:

        f(x) = D^T x

        To get from c back to an approximation of x, we use g(c) = Dc. The reconstruction function is:

        r(x) = g(f(x)) = D D^T x

Reconstruction function
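        Here is a minimal NumPy sketch of the encoding and reconstruction functions derived above, for a single data point; the matrix D with orthonormal columns is built with QR purely for the demo:

import numpy as np

def encode(D, x):
    # f(x) = D^T x : project x (length n) onto the l principal directions
    return D.T @ x

def decode(D, c):
    # g(c) = D c : map the code back to the original n-dimensional space
    return D @ c

def reconstruct(D, x):
    # r(x) = g(f(x)) = D D^T x
    return decode(D, encode(D, x))

rng = np.random.RandomState(2)
D, _ = np.linalg.qr(rng.randn(4, 2))   # illustrative (4, 2) matrix with orthonormal columns
x = rng.randn(4)
print(encode(D, x).shape)       # (2,)
print(reconstruct(D, x).shape)  # (4,)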

6. Calculate D

         The next step is to find the matrix D. The purpose of PCA is to change the coordinate system so that the variance is maximized along the first dimension of the projected space. This is equivalent to minimizing the error between the data point and its reconstruction. See the covariance matrix below for more details.

        Since we have to take all points into account (the same matrix D will be used for all of them), we will use the Frobenius norm of the error, which is the matrix equivalent of the L2 norm. Here is the formula for the Frobenius norm of a matrix A:

        ||A||_F = sqrt( Σ_{i,j} A_{i,j}² )

        We call D* the optimal D (in the sense that the reconstruction error is as small as possible). We have:

        D* = arg min_D sqrt( Σ_{i,j} ( x_j^(i) − r(x^(i))_j )² )   subject to D^T D = I_l

        The constraint D^T D = I_l appears because we chose constraints that make the columns of D orthogonal with unit norm.
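        As a quick numerical sanity check of the Frobenius norm formula (the matrix below is arbitrary), NumPy exposes it directly via np.linalg.norm with the 'fro' order:

import numpy as np

A = np.array([[1., 2.],
              [3., 4.]])
manual = np.sqrt(np.sum(A**2))        # sqrt of the sum of squared entries
builtin = np.linalg.norm(A, 'fro')    # same value
print(manual, builtin)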

 7. The first principal component

         We only look for the first principal component for now. Therefore l = 1, and the matrix D has shape (n × 1): it is a simple column vector. Since it is a vector, let's call it d; d is the direction of the first principal component. Because we take the squared L2 norm of the reconstruction error, the problem is:

        d* = arg min_d Σ_i || x^(i) − r(x^(i)) ||₂²   subject to ||d||₂ = 1

        With l = 1 the reconstruction is r(x) = D D^T x = d d^T x. Bringing this into the equation:

        d* = arg min_d Σ_i || x^(i) − d d^T x^(i) ||₂²   subject to ||d||₂ = 1

        Due to constraint 3 (the columns of D have unit norm) we have d^T d = 1: d is the only column of D and therefore has unit norm.

        Instead of summing over the m data points x^(i), we can use the matrix X that collects all observations (its i-th row is x^(i)T):

        d* = arg min_d || X − X d d^T ||_F²   subject to d^T d = 1

        where X d d^T stacks the reconstructions d d^T x^(i) as rows.

8. Use the trace operator

         We will now use the trace operator to simplify the equation to minimize. For any matrix A, the squared Frobenius norm can be written with the trace:

        ||A||_F² = Tr(A A^T)

        Because of this, with A = X − X d d^T, the problem becomes:

        d* = arg min_d Tr( (X − X d d^T)(X − X d d^T)^T )

        Since we can cycle the order of the matrices inside a trace (Tr(ABC) = Tr(CAB) = Tr(BCA)), expanding the product and moving X^T to the front gives:

        arg min_d ( Tr(X X^T) − 2 Tr(X^T X d d^T) + Tr(X^T X d d^T d d^T) )

        Let's plug this into our equation. We can remove the first term, Tr(X X^T), because it does not depend on d:

        arg min_d ( −2 Tr(X^T X d d^T) + Tr(X^T X d d^T d d^T) )

        Also, because d^T d = 1 (constraint 3), the last term simplifies: d d^T d d^T = d (d^T d) d^T = d d^T, which gives:

        arg min_d ( −2 Tr(X^T X d d^T) + Tr(X^T X d d^T) ) = arg min_d ( −Tr(X^T X d d^T) )

        Minimizing the negative is the same as maximizing, and cycling the trace once more:

        d* = arg max_d Tr(X^T X d d^T) = arg max_d Tr(d^T X^T X d) = arg max_d d^T X^T X d   subject to d^T d = 1

        We will see that we can find this maximum by computing the eigenvectors of X^T X.
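        Here is a small sketch that checks this claim numerically on made-up correlated data: the eigenvector of X^T X with the largest eigenvalue should score at least as high as any random unit vector on d^T X^T X d (variable names are illustrative):

import numpy as np

rng = np.random.RandomState(3)
X = rng.randn(100, 2)
X[:, 1] = 2 * X[:, 0] + 0.5 * rng.randn(100)   # correlated columns

A = X.T @ X                                    # symmetric (2 x 2) matrix
eigVals, eigVecs = np.linalg.eigh(A)           # eigh because A is symmetric
d_best = eigVecs[:, np.argmax(eigVals)]        # eigenvector with the largest eigenvalue

def objective(d):
    return d @ A @ d                           # d^T X^T X d

random_scores = []
for _ in range(1000):
    d = rng.randn(2)
    d /= np.linalg.norm(d)                     # random unit vector
    random_scores.append(objective(d))

print(objective(d_best) >= max(random_scores))  # True (up to numerical noise)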

9. Covariance matrix 

        As we wrote above, the optimization problems of maximizing the variance of the components and minimizing the error between the reconstructed data and the real data are equivalent.

        If we center the data around 0, the matrix X^T X is proportional to the covariance matrix of the data (it equals the covariance matrix up to a factor of 1/(m − 1)).

        The covariance matrix is an n × n matrix (n is the number of dimensions). Its diagonal entries are the variances of the corresponding dimensions, and the other cells are the covariances (redundancy) between the two corresponding dimensions.

        This means that the larger the covariance between two dimensions, the more redundancy there is between those dimensions. It also means that if the variance along a direction is high, the line of best fit in that direction is associated with a small error. Maximizing the variance and minimizing the covariance (to decorrelate the dimensions) means that the ideal covariance matrix is a diagonal matrix (non-zero values only on the diagonal). Therefore, diagonalizing the covariance matrix gives us the optimal solution.
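        A small sketch of this covariance-matrix view on made-up data: after projecting the centered data onto the eigenvectors of its covariance matrix, the new covariance matrix is (approximately) diagonal, i.e. the dimensions are decorrelated:

import numpy as np

rng = np.random.RandomState(4)
x = 5 * rng.rand(100)
y = 2 * x + rng.randn(100)
X = np.column_stack([x, y])
X = X - X.mean(axis=0)                   # center the data around 0

C = np.cov(X, rowvar=False)              # (2 x 2) covariance matrix
eigVals, eigVecs = np.linalg.eigh(C)     # eigh because C is symmetric

X_rot = X @ eigVecs                      # change of coordinate system
print(np.round(np.cov(X_rot, rowvar=False), 6))   # off-diagonal terms are ~0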

3. PCA Application Example

        As an example, let's create a 2D dataset. To see the effect of PCA, we will introduce some correlations between the two dimensions. Let's create 100 data points with 2 dimensions:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(123)
# Create 100 points in 2D with a strong correlation between the two dimensions
x = 5*np.random.rand(100)
y = 2*x + 1 + np.random.randn(100)

x = x.reshape(100, 1)
y = y.reshape(100, 1)

X = np.hstack([x, y])
X.shape

plt.plot(X[:,0], X[:,1], '*')
plt.show()

         Highly correlated data means that dimensions are redundant. One can be predicted from the other without losing too much information.

        The first thing we do is to center the data around 0. PCA is a regression model with no intercept, so the first component must pass through the origin.

         Below is a simple function that subtracts the mean of each column from every data point in that column. It can be used to center the data points around 0.

def centerData(X):
    # Subtract the column means so that every feature has mean 0
    X = X.copy()
    X -= np.mean(X, axis=0)
    return X

         So let's center the data X in both dimensions around 0:

X_centered = centerData(X)
plt.plot(X_centered[:,0], X_centered[:,1], '*')
plt.show()

         We can now look for the principal components. We saw above that they correspond to the values of d that maximize the following function:

        d^T X^T X d   subject to d^T d = 1

         To find d, we can compute the eigenvectors of X^T X. So let's do this:

# Eigendecomposition of X^T X, computed on the centered data
eigVals, eigVecs = np.linalg.eig(X_centered.T.dot(X_centered))
eigVecs

         We get:

array([[-0.91116273, -0.41204669],
       [ 0.41204669, -0.91116273]])

        These are the vectors that maximize our function. Each column vector is associated with an eigenvalue. Vectors associated with larger eigenvalues tell us the direction associated with larger variance in the data.

        First, let's create a function plotVectors() to plot vectors:

def plotVectors(vecs, cols, alpha=1):
    # Plot each 2D vector in vecs as an arrow starting at the origin,
    # using the corresponding color in cols.
    plt.figure()
    plt.axvline(x=0, color='#A9A9A9', zorder=0)
    plt.axhline(y=0, color='#A9A9A9', zorder=0)

    for i in range(len(vecs)):
        # x holds [x_start, y_start, x_end, y_end] for the arrow
        x = np.concatenate([[0, 0], vecs[i]])
        plt.quiver([x[0]],
                   [x[1]],
                   [x[2]],
                   [x[3]],
                   angles='xy', scale_units='xy', scale=1, color=cols[i],
                   alpha=alpha)

        Now let's visualize the data again, together with the two eigenvectors:

orange = '#FF9A13'
blue = '#1190FF'
plotVectors(eigVecs.T, [orange, blue])
plt.plot(X_centered[:,0], X_centered[:,1], '*')
plt.xlim(-3, 3)
plt.ylim(-3, 3)
plt.show()

         We can see that the blue vector direction corresponds to the slanted shape of our data. If you project the data points on a line corresponding to the direction of the blue vector, you will end up with maximum variance. This vector has a direction that maximizes the variance of the projected data. Take a look at the image below:

Projection of data points: the line direction is the direction with the greatest variance

         When you project the data points onto the pink line, the projected values are more spread out: the direction of this line maximizes the variance of the projected data points. The same goes for the image above: our blue vector points in the direction along which the projected data have the higher variance. The second eigenvector is orthogonal to the first.

        In the image above, the blue vector is the second eigenvector, so let's check if it is associated with a larger eigenvalue:

        The value of eigVals is

array([  18.04730409,  798.35242844])

        So yes, the second vector corresponds to the largest eigenvalue.
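        We can also verify this directly by comparing the variance of the data projected onto each eigenvector, reusing X_centered and eigVecs from the code above (just a sanity check, not a new step):

proj_first = X_centered @ eigVecs[:, 0]    # projection onto the first eigenvector
proj_second = X_centered @ eigVecs[:, 1]   # projection onto the second eigenvector

print(np.var(proj_first), np.var(proj_second))
# the second projection has the (much) larger variance, matching the larger eigenvalue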

        Now that we have found the matrix D of eigenvectors, we will use the encoding function to rotate the data. The goal of the rotation is to end up with a new coordinate system in which the data are uncorrelated, so that the basis axes capture all of the variance. Then only a few of these axes need to be kept: this is what dimensionality reduction is about.

        D is the matrix containing the eigenvectors we calculated earlier. The encoding formula c = D^T x applies to a single data point x whose dimensions are stored as rows. In our case we apply it to all data points at once, and since X stores the dimensions as columns, we need to transpose it.

X_new = eigVecs.T.dot(X_centered.T)   # encode all points at once: D^T X^T

plt.plot(X_new[0, :], X_new[1, :], '*')
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.show()

         We rotated the data so that the maximum variance lies along one axis: the rotation transformed our dataset so that most of its variance is now carried by a single basis axis. You could keep only that dimension and still have a very representative dataset.
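        To make the dimensionality reduction explicit, here is a short sketch that keeps only the component with the largest eigenvalue and reconstructs the data from it, reusing eigVals, eigVecs and X_centered from above:

d = eigVecs[:, np.argmax(eigVals)].reshape(-1, 1)   # direction of largest variance, shape (2, 1)

codes = d.T @ X_centered.T          # encode: c = d^T x for every point, shape (1, 100)
X_approx = (d @ codes).T            # decode: x ~ d c, back to shape (100, 2)

plt.plot(X_centered[:, 0], X_centered[:, 1], '*', label='original (centered)')
plt.plot(X_approx[:, 0], X_approx[:, 1], '.', label='reconstructed from 1 component')
plt.legend()
plt.show()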


Source: blog.csdn.net/bashendixie5/article/details/124302345