Covariance matrix concept (easy to understand)

1. Basic concepts of statistics

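The most basic concepts in statistics are the sample mean, variance, and standard deviation. Given a set of n samples X_1, ..., X_n, the formulas for these concepts are:

Mean: $$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

Standard deviation: $$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2}$$

Variance: $$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2$$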

The mean describes the central point of the sample set, but the information it gives us is limited. The standard deviation describes, roughly, the average distance between each sample point and the mean.

Take the two sets [0, 8, 12, 20] and [8, 9, 11, 12] as an example. Both have a mean of 10, but the two sets are obviously very different: the standard deviation of the first is about 8.3 and that of the second is about 1.8. The second set is clearly more concentrated, so its standard deviation is smaller. The standard deviation describes exactly this "spread". We divide by n-1 instead of n because this lets a small sample better approximate the spread of the whole population: with the n-1 divisor, the sample variance is what statistics calls an "unbiased estimate" of the population variance. The variance is simply the square of the standard deviation.
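The numbers in this example can be checked with a few lines of Matlab. This is a minimal sketch; the variable names are chosen here for illustration only.

```matlab
% Two sample sets with the same mean but very different spread
a = [0 8 12 20];
b = [8 9 11 12];

mean_a = mean(a);   % 10
mean_b = mean(b);   % 10

% Sample standard deviation written out with the n-1 divisor
std_a = sqrt(sum((a - mean_a).^2) / (numel(a) - 1));   % about 8.33
std_b = sqrt(sum((b - mean_b).^2) / (numel(b) - 1));   % about 1.83

% The built-in std uses the same n-1 divisor by default
builtin_a = std(a);
builtin_b = std(b);

% The variance is the square of the standard deviation
var_a = std_a^2;    % 208/3, about 69.33
```

The hand-written formula and the built-in std should agree up to rounding error.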

 

2. Why covariance is needed

Standard deviation and variance generally describe one-dimensional data, but in real life we often meet data sets with several dimensions. The simplest example is school, where test scores are recorded for several subjects. For such a data set we can of course compute the variance of each dimension independently, but usually we want to know more: for example, whether there is any connection between how sleazy a boy is and how popular he is with girls. Covariance is the statistic used to measure the relationship between two random variables. Following the definition of variance:

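$$\mathrm{var}(X) = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(X_i-\bar{X})}{n-1}$$

which measures how far a single dimension deviates from its mean, the covariance can be defined as:

$$\mathrm{cov}(X,Y) = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{n-1}$$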

What does the value of the covariance mean? If it is positive, the two variables are positively correlated (the covariance also leads to the definition of the "correlation coefficient"): the sleazier a boy is, the more popular he is with girls. If it is negative, the two are negatively correlated. If it is 0, the two are uncorrelated, meaning there is no linear relationship between sleaziness and popularity. Note that statistically independent variables always have zero covariance, but zero covariance by itself does not guarantee independence.
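A small numerical sketch of the three cases, using made-up vectors (all values and names here are hypothetical). The last pair shows that a covariance of 0 only rules out a linear relationship: v is completely determined by u, yet their covariance is 0.

```matlab
% Sample covariance of two equally long vectors, using the n-1 divisor
sample_cov = @(x, y) sum((x - mean(x)) .* (y - mean(y))) / (numel(x) - 1);

x = [1 2 3 4 5];
y_up   = [2 4 5 4 6];     % tends to rise with x
y_down = [6 4 5 4 2];     % tends to fall as x rises

c_up   = sample_cov(x, y_up);     % positive
c_down = sample_cov(x, y_down);   % negative

% Zero covariance without independence: v is fully determined by u
u = [-2 -1 0 1 2];
v = u.^2;
c_zero = sample_cov(u, v);        % 0
```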

We can also see some obvious properties from the definition of covariance, such as:

$$\mathrm{cov}(X,X) = \mathrm{var}(X)$$

$$\mathrm{cov}(X,Y) = \mathrm{cov}(Y,X)$$

 

3. The covariance matrix

The sleaziness-and-popularity question above is a typical two-dimensional problem, and a single covariance can only handle two dimensions. When the number of dimensions grows, several covariances have to be calculated: an n-dimensional data set needs $\frac{n!}{(n-2)! \cdot 2} = \frac{n(n-1)}{2}$ covariances, one per pair of dimensions. It is then natural to organize these values in a matrix. The covariance matrix is defined as:

$$C_{n \times n} = (c_{i,j}), \quad c_{i,j} = \mathrm{cov}(\mathrm{Dim}_i, \mathrm{Dim}_j)$$

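This definition is easy to understand with a three-dimensional example. Assuming the data set has three dimensions {x, y, z}, the covariance matrix is:

$$C = \begin{pmatrix} \mathrm{cov}(x,x) & \mathrm{cov}(x,y) & \mathrm{cov}(x,z) \\ \mathrm{cov}(y,x) & \mathrm{cov}(y,y) & \mathrm{cov}(y,z) \\ \mathrm{cov}(z,x) & \mathrm{cov}(z,y) & \mathrm{cov}(z,z) \end{pmatrix}$$

It can be seen that the covariance matrix is symmetric, and that its diagonal holds the variance of each dimension.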
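The definition can be turned into code almost literally. The sketch below fills the matrix one pair of dimensions at a time and then checks the two properties just mentioned. The data values are hypothetical, and rows are assumed to be samples and columns dimensions.

```matlab
% Hypothetical data: 5 samples (rows) of 3 dimensions (columns)
data = [2 1 7; 4 3 5; 6 2 4; 8 6 3; 10 8 1];

n_samples = size(data, 1);
n_dims = size(data, 2);
C = zeros(n_dims);

% c(i,j) = cov(Dim_i, Dim_j), one entry per pair of columns
for i = 1:n_dims
    for j = 1:n_dims
        di = data(:, i) - mean(data(:, i));
        dj = data(:, j) - mean(data(:, j));
        C(i, j) = sum(di .* dj) / (n_samples - 1);
    end
end

% Symmetry: cov(x,y) equals cov(y,x)
is_symmetric = isequal(C, C');

% The diagonal holds the variance of each dimension
diag_is_var = max(abs(diag(C)' - var(data))) < 1e-12;
```

The double loop computes every off-diagonal entry twice. That is wasteful, but it mirrors the definition directly.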

 

4. Covariance in practice with Matlab

It must be clear that the covariance matrix holds the covariances between different dimensions, not between different samples. The following demonstration uses Matlab. To show how the calculation works, we do not call Matlab's cov function directly at first:

First, randomly generate a 10 × 3 integer matrix as the sample set: 10 is the number of samples and 3 is the number of dimensions of each sample.

Figure 1 Using Matlab to generate a sample set
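A minimal sketch of this step. It assumes the sample set is stored in a variable named MySample (the name used in the update further down) and that the integers lie between 0 and 49, a range picked here purely for illustration.

```matlab
% 10 samples (rows), 3 dimensions (columns), integer values in 0..49
MySample = fix(rand(10, 3) * 50);
```

Because the matrix is random, the concrete numbers differ on every run.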

According to the formula, calculating a covariance requires the mean. As emphasized above, the covariance matrix holds the covariances between different dimensions, so keep this in mind: each row of the sample matrix is a sample and each column is a dimension, which means the mean must be calculated by column. For convenience, we first assign the data of each dimension to its own variable:

Figure 2 Assignment of data in three dimensions
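A sketch of the assignment, using the names dim1, dim2, and dim3 from the text. The first line only creates a stand-in sample set with hypothetical values so that the snippet runs on its own.

```matlab
MySample = fix(rand(10, 3) * 50);   % stand-in sample set (hypothetical values)

% One column per dimension
dim1 = MySample(:, 1);
dim2 = MySample(:, 2);
dim3 = MySample(:, 3);

% mean works column-wise on a matrix, so this gives one mean per dimension
col_means = mean(MySample);
```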

Calculate the covariance of dim1 and dim2, dim1 and dim3, dim2 and dim3:

Figure 3 Calculating three covariances
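A sketch of the three covariances, written directly from the covariance formula with the n-1 divisor. The result names cov12, cov13, and cov23 are chosen here for illustration, and the sample set is again a hypothetical stand-in.

```matlab
MySample = fix(rand(10, 3) * 50);   % stand-in sample set (hypothetical values)
dim1 = MySample(:, 1);
dim2 = MySample(:, 2);
dim3 = MySample(:, 3);
n = size(MySample, 1);              % number of samples

% Sum of products of deviations from the mean, divided by n-1
cov12 = sum((dim1 - mean(dim1)) .* (dim2 - mean(dim2))) / (n - 1);
cov13 = sum((dim1 - mean(dim1)) .* (dim3 - mean(dim3))) / (n - 1);
cov23 = sum((dim2 - mean(dim2)) .* (dim3 - mean(dim3))) / (n - 1);
```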

The elements on the diagonal of the covariance matrix are the variances of each dimension. Below we calculate these variances in turn:

Figure 4 Calculate the variance on the diagonal
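A sketch of the diagonal entries. It relies on the fact that the variance is the square of the standard deviation. The names var1, var2, and var3 are hypothetical, as is the stand-in sample set.

```matlab
MySample = fix(rand(10, 3) * 50);   % stand-in sample set (hypothetical values)
dim1 = MySample(:, 1);
dim2 = MySample(:, 2);
dim3 = MySample(:, 3);

% std uses the n-1 divisor by default, so its square is the sample variance
var1 = std(dim1)^2;
var2 = std(dim2)^2;
var3 = std(dim3)^2;
```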

We now have all the values needed to fill in the covariance matrix. To verify them, we can call Matlab's cov function, which returns the covariance matrix directly:

Figure 5 Use the cov function of Matlab to directly calculate the covariance matrix of the sample

Filling the previously calculated values into the matrix gives exactly the same result as the cov function.
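The comparison can be sketched as follows. The helper pair_cov and the names Manual, Builtin, and max_diff are hypothetical, and the sample set is a stand-in. Because floating-point arithmetic is involved, the sketch checks the largest difference rather than exact equality.

```matlab
MySample = fix(rand(10, 3) * 50);   % stand-in sample set (hypothetical values)
dim1 = MySample(:, 1);
dim2 = MySample(:, 2);
dim3 = MySample(:, 3);
n = size(MySample, 1);

% Sample covariance of two column vectors with the n-1 divisor
pair_cov = @(a, b) sum((a - mean(a)) .* (b - mean(b))) / (n - 1);

% Variances on the diagonal, covariances mirrored across it
Manual = [pair_cov(dim1, dim1) pair_cov(dim1, dim2) pair_cov(dim1, dim3);
          pair_cov(dim2, dim1) pair_cov(dim2, dim2) pair_cov(dim2, dim3);
          pair_cov(dim3, dim1) pair_cov(dim3, dim2) pair_cov(dim3, dim3)];

% cov treats rows as samples and columns as dimensions
Builtin = cov(MySample);

% Largest entry-wise difference, expected to be at rounding-error level
max_diff = max(abs(Manual(:) - Builtin(:)));
```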

 

Update: I discovered today that the covariance matrix can also be calculated another way. First center the sample matrix, that is, subtract from every column (dimension) its own mean, so that each dimension has mean 0. Then multiply the transpose of the centered matrix by the centered matrix itself and divide by (N-1), where N is the number of samples. This method follows directly from the earlier formula. It is less intuitive, but it is commonly used when deriving abstract formulas. A Matlab implementation:

% Center the sample matrix so that the mean of each dimension (column) is 0
X = MySample - repmat(mean(MySample), size(MySample, 1), 1);
% Transpose times the centered matrix, divided by (number of samples - 1)
C = (X' * X) ./ (size(X, 1) - 1);
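Why this works can be seen by writing out a single entry of the product. The derivation below is a short sketch; the symbols a_{ki} and x_{ki} are introduced here for illustration and do not appear in the original formulas.

```latex
% a_{ki}: value of sample k in dimension i (raw sample matrix with N rows)
% x_{ki} = a_{ki} - \bar{a}_i: the same entry after centering column i
\left(\frac{X^{T}X}{N-1}\right)_{ij}
  = \frac{1}{N-1}\sum_{k=1}^{N} x_{ki}\,x_{kj}
  = \frac{\sum_{k=1}^{N}(a_{ki}-\bar{a}_i)(a_{kj}-\bar{a}_j)}{N-1}
  = \mathrm{cov}(\mathrm{Dim}_i,\mathrm{Dim}_j)
```

Entry (i, j) of the product is therefore the covariance between dimensions i and j, which is exactly the definition of the covariance matrix.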

 

5. Summary

The key to understanding the covariance matrix is to remember that it holds the covariances between different dimensions, not between different samples. When you get a sample matrix, the first thing to establish is whether a row is a sample or a dimension. Once that is clear, the whole calculation follows naturally and you will not get confused.

 

 

Original address:

http://pinkyjie.com/2010/08/31/covariance/

Origin www.cnblogs.com/xwh-blogs/p/12678547.html