[Repost] Covariance and correlation coefficient

Reprinted from: http://redstonewill.com/1511/

What is covariance?

Covariance measures how two variables vary together, whereas variance measures the variation of a single variable. If the two variables tend to move in the same direction, that is, when one is above its expected value the other tends to be above its own expected value as well, the covariance between them is positive. If they tend to move in opposite directions, that is, when one is above its expected value the other tends to be below its own, the covariance between them is negative.

Where does covariance come from?

Briefly, covariance reflects the correlation between two variables X and Y. This correlation falls broadly into three types: positive correlation, negative correlation, and no correlation.

What is positive correlation? For example, the larger the floor area of a house (X), the higher its price (Y), so floor area and price are positively correlated.

What is negative correlation? For example, the more time a student spends playing games (X), the worse the student's academic performance (Y), so gaming time and academic performance are negatively correlated.

What is no correlation? For example, there is no significant relationship between how light or dark a person's skin is (X) and how healthy that person is (Y), so the two are uncorrelated.

Let's look at the first case. Let the variables X and Y be:

X = [11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30]

Y = [12 15 17 21 22 21 18 23 26 25 22 28 24 28 30 33 28 34 36 35]

Plotting X and Y on a coordinate plane gives the following scatter plot:

[Figure: scatter plot of Y against X]

Clearly, Y tends to increase as X increases; that is, X and Y change in the same direction. In this case, we say X and Y are positively correlated.

Now the second case. Let the variables X and Y be:

X = [11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30]

Y = [35 35 29 29 28 28 27 26 26 23 21 22 25 19 16 19 20 16 15 16]

Plotting X and Y on a coordinate plane gives the following scatter plot:

[Figure: scatter plot of Y against X]

Clearly, Y tends to decrease as X increases; that is, X and Y change in opposite directions. In this case, we say X and Y are negatively correlated.

Now the third case. Let the variables X and Y be:

X = [11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30]

Y = [16 16 28 17 20 26 20 17 21 15 12 29 24 25 16 15 21 13 17 25]

Plotting X and Y on a coordinate plane gives the following scatter plot:

[Figure: scatter plot of Y against X]

Here Y shows no clear increasing or decreasing trend as X increases. In this case, we say X and Y are uncorrelated.

Returning to the positively correlated case, let EX and EY be the expectations of X and Y. What is an expectation? Here we can treat it as the mean: EX is the mean of X and EY is the mean of Y. Drawing the lines X = EX and Y = EY on the scatter plot gives the following figure:

[Figure: scatter plot of the positively correlated data, divided into regions I, II, III, and IV by the lines X = EX and Y = EY]

In the figure above, the lines X = EX and Y = EY divide the plane into four regions I, II, III, and IV. The points (X, Y) lie mostly in regions I and III, with only a few in regions II and IV.

In region I, X > EX and Y > EY, so (X - EX)(Y - EY) > 0;

In region II, X < EX and Y > EY, so (X - EX)(Y - EY) < 0;

In region III, X < EX and Y < EY, so (X - EX)(Y - EY) > 0;

In region IV, X > EX and Y < EY, so (X - EX)(Y - EY) < 0.

Clearly, (X - EX)(Y - EY) > 0 in regions I and III, and (X - EX)(Y - EY) < 0 in regions II and IV. When X and Y are positively correlated, most of the data lies in regions I and III and only a small part in regions II and IV. Therefore, on average, positive correlation satisfies:

$$E[(X - EX)(Y - EY)] > 0$$

This formula says that the expectation of (X - EX)(Y - EY) is greater than zero; that is, the average value of (X - EX)(Y - EY) is greater than zero.
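The quadrant argument above can be checked numerically. The following is a minimal Python sketch using the X and Y values of the positively correlated case; the helper name `quadrant_counts` is hypothetical and introduced only for illustration, and EX and EY are taken to be the sample means.

```python
# Data from the positively correlated case above.
X = list(range(11, 31))
Y = [12, 15, 17, 21, 22, 21, 18, 23, 26, 25,
     22, 28, 24, 28, 30, 33, 28, 34, 36, 35]

def quadrant_counts(xs, ys):
    """Count points with a positive product (regions I, III) and a
    negative product (regions II, IV), and return the mean product."""
    ex = sum(xs) / len(xs)  # EX, taken as the mean of X
    ey = sum(ys) / len(ys)  # EY, taken as the mean of Y
    products = [(x - ex) * (y - ey) for x, y in zip(xs, ys)]
    same_side = sum(1 for p in products if p > 0)      # regions I and III
    opposite_side = sum(1 for p in products if p < 0)  # regions II and IV
    return same_side, opposite_side, sum(products) / len(products)

same_side, opposite_side, mean_product = quadrant_counts(X, Y)
print(same_side, opposite_side, mean_product)
```

For this data, 16 of the 20 points fall in regions I and III and 4 fall in regions II and IV, so the average product is positive.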

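Next, consider the case where X and Y are negatively correlated:

[Figure: scatter plot of the negatively correlated data, divided into regions I, II, III, and IV by the lines X = EX and Y = EY]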

In the figure above, the points (X, Y) lie mostly in regions II and IV, with only a few in regions I and III.

Likewise, (X - EX)(Y - EY) > 0 in regions I and III, and (X - EX)(Y - EY) < 0 in regions II and IV. When X and Y are negatively correlated, most of the data lies in regions II and IV and only a small part in regions I and III. Therefore, on average, negative correlation satisfies:

$$E[(X - EX)(Y - EY)] < 0$$

This formula says that the expectation of (X - EX)(Y - EY) is less than zero; that is, the average value of (X - EX)(Y - EY) is less than zero.

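Finally, consider the case where X and Y are uncorrelated:

[Figure: scatter plot of the uncorrelated data, divided into regions I, II, III, and IV by the lines X = EX and Y = EY]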

In the figure above, the points (X, Y) are spread roughly evenly across regions I, II, III, and IV.

Likewise, (X - EX)(Y - EY) > 0 in regions I and III, and (X - EX)(Y - EY) < 0 in regions II and IV. When X and Y are uncorrelated, the data is spread roughly evenly over all four regions, so the positive and negative products cancel out. Therefore, on average, no correlation satisfies:

$$E[(X - EX)(Y - EY)] = 0$$

This formula says that the expectation of (X - EX)(Y - EY) is zero; that is, the average value of (X - EX)(Y - EY) is equal to zero.

To sum up, we get the following conclusions:

  • When X and Y are positively correlated: $E[(X - EX)(Y - EY)] > 0$
  • When X and Y are negatively correlated: $E[(X - EX)(Y - EY)] < 0$
  • When X and Y are uncorrelated: $E[(X - EX)(Y - EY)] = 0$

This leads to the concept of covariance, a numerical characteristic that describes the relationship between X and Y. We define the covariance as:

$$\mathrm{Cov}(X, Y) = E[(X - EX)(Y - EY)]$$

 

According to the discussion above,

  • When Cov(X, Y) > 0, X and Y are positively correlated;
  • When Cov(X, Y) < 0, X and Y are negatively correlated;
  • When Cov(X, Y) = 0, X and Y are uncorrelated.

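It is worth mentioning that E denotes taking the expectation. In practice, the covariance can also be computed from N samples as an average:

$$\mathrm{Cov}(X, Y) = \frac{1}{N-1}\sum_{i=1}^{N}(X_i - \bar{X})(Y_i - \bar{Y})$$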

Here we divide by N-1 rather than N because this gives an unbiased estimate of the population covariance from a sample. Incidentally, if we set Y = X, the covariance becomes the variance of X.
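To make the last remark concrete, substituting Y = X into the definition of covariance gives the definition of variance. This is a short worked step, written with the same symbols as the definition of Cov(X, Y):

$$\mathrm{Cov}(X, X) = E[(X - EX)(X - EX)] = E[(X - EX)^2] = \mathrm{Var}(X)$$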

Next, we use the covariance formula to compute the covariance of X and Y in each of the three cases above.

When X and Y are positively correlated, Cov(X, Y) = 37.3684;

When X and Y are negatively correlated, Cov(X, Y) = -34.0789;

When X and Y are uncorrelated, Cov(X, Y) = -1.0263, which is close to zero compared with the other two cases.
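These three values can be reproduced with a few lines of Python. The following is a minimal sketch that assumes the sample covariance with the N-1 denominator; the function name `sample_cov` and the variable names `Y_pos`, `Y_neg`, and `Y_unc` are hypothetical and chosen only for illustration.

```python
def sample_cov(xs, ys):
    """Sample covariance of two equal-length sequences (N-1 denominator)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# X is the same in all three cases; the Y values are copied from the text.
X = list(range(11, 31))
Y_pos = [12, 15, 17, 21, 22, 21, 18, 23, 26, 25,
         22, 28, 24, 28, 30, 33, 28, 34, 36, 35]
Y_neg = [35, 35, 29, 29, 28, 28, 27, 26, 26, 23,
         21, 22, 25, 19, 16, 19, 20, 16, 15, 16]
Y_unc = [16, 16, 28, 17, 20, 26, 20, 17, 21, 15,
         12, 29, 24, 25, 16, 15, 21, 13, 17, 25]

for label, ys in [("positive", Y_pos), ("negative", Y_neg), ("uncorrelated", Y_unc)]:
    print(f"{label}: Cov(X, Y) = {sample_cov(X, ys):.4f}")
```

Calling `sample_cov(X, X)` with the same sequence twice returns the sample variance of X, in line with the remark that setting Y = X turns the covariance into the variance.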

What is the relationship between the correlation coefficient and covariance?

We now know what covariance is and where its formula comes from. Knowing whether the covariance of two variables X and Y is greater than, less than, or equal to zero, we can infer whether X and Y are positively correlated, negatively correlated, or uncorrelated. This raises a question: does the magnitude of the covariance represent the strength of the correlation? In other words, does a covariance of 100 necessarily indicate a stronger positive correlation than a covariance of 10?

Look at the following example!

Variables X1 and Y1 are as follows:

X1 = [11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30]

Y1 = [12 12 13 15 16 16 17 19 21 22 22 23 23 26 25 28 29 29 31 32]

Variables X2 and Y2 are as follows:

X2 = [110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300]

Y2 = [113 172 202 206 180 184 242 180 256 209 288 255 240 278 319 322 345 289 333 372]

The scatter plots of X1, Y1 and of X2, Y2 are shown below:

[Figure: scatter plots of Y1 against X1 and of Y2 against X2]

The figure shows that both X1, Y1 and X2, Y2 are positively correlated, and that the positive correlation between X1 and Y1 is clearly stronger than that between X2 and Y2. Next, let's compute the two covariances and see whether they reflect this.

Cov(X1, Y1) = 37.5526

Cov(X2, Y2) = 3730.26

The covariance of X2 and Y2 is about 100 times that of X1 and Y1, even though X1 and Y1 are more strongly correlated. The reason is that the values in the two cases vary over different ranges (or have different units). When computing the covariance we do not take the difference in the variables' scales into account, so covariances cannot be compared against a common standard.
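Why rescaling changes the covariance can be seen directly from the definition. As a short worked step, assume a and b are constants (symbols introduced here only for illustration):

$$\mathrm{Cov}(aX, bY) = E[(aX - aEX)(bY - bEY)] = ab\,E[(X - EX)(Y - EY)] = ab\,\mathrm{Cov}(X, Y)$$

In the example, X2 is exactly 10 times X1 and Y2 is on a scale roughly 10 times that of Y1, which is consistent with a covariance about 100 times larger.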

To eliminate this effect and measure the degree of correlation between the variables accurately, we divide the covariance by the standard deviation of each variable. This gives the expression for the correlation coefficient:

$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

As can be seen, the correlation coefficient is the covariance divided by the standard deviations of X and Y. The standard deviation is computed as:

$$\sigma_X = \sqrt{E[(X - EX)^2]}$$

Why does dividing by the standard deviation of each variable eliminate the effect of scale? Because the standard deviation itself reflects how widely a variable's values vary, dividing by it cancels out that scale and standardizes the covariance. As a result, the correlation coefficient is normalized to the range [-1, 1].
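The cancellation can be written out explicitly. As a sketch, assume positive constants a and b (symbols introduced here only for illustration), and use the facts that Cov(aX, bY) = ab Cov(X, Y) and that the standard deviation of aX is a times the standard deviation of X:

$$\rho(aX, bY) = \frac{ab\,\mathrm{Cov}(X, Y)}{(a\sigma_X)(b\sigma_Y)} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} = \rho(X, Y)$$

The scale factors appear in both the numerator and the denominator, so they drop out of the correlation coefficient.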

Next, we compute the correlation coefficients of X1, Y1 and of X2, Y2 from the example above.

ρ(X1, Y1) = 0.9939

ρ(X2, Y2) = 0.9180
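The two coefficients can be reproduced as follows. This is a minimal Python sketch, assuming the correlation coefficient is the sample covariance divided by the product of the sample standard deviations; the function name `pearson` is hypothetical and used only for illustration.

```python
import math

def pearson(xs, ys):
    """Correlation coefficient: covariance divided by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (std_x * std_y)

# Data copied from the example above.
X1 = list(range(11, 31))
Y1 = [12, 12, 13, 15, 16, 16, 17, 19, 21, 22,
      22, 23, 23, 26, 25, 28, 29, 29, 31, 32]
X2 = list(range(110, 301, 10))
Y2 = [113, 172, 202, 206, 180, 184, 242, 180, 256, 209,
      288, 255, 240, 278, 319, 322, 345, 289, 333, 372]

print(f"rho(X1, Y1) = {pearson(X1, Y1):.4f}")
print(f"rho(X2, Y2) = {pearson(X2, Y2):.4f}")
```

Because the same N-1 factor appears in the covariance and in both standard deviations, dividing by N instead would give the same coefficients.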

So the correlation coefficient of X1 and Y1 is greater than that of X2 and Y2, which matches what the scatter plots show. In other words, the correlation coefficient tells us how strongly two variables are correlated, and we reach the following conclusions:

  • If the correlation coefficient is greater than zero, the two variables are positively correlated, and the larger the coefficient, the stronger the positive correlation;
  • If the correlation coefficient is less than zero, the two variables are negatively correlated, and the smaller the coefficient (the closer to -1), the stronger the negative correlation;
  • If the correlation coefficient is equal to zero, the two variables are uncorrelated, that is, they have no linear relationship.

Looking back at the relationship between covariance and the correlation coefficient: the correlation coefficient is simply the covariance in standardized, normalized form, with the effects of units and differing scales removed. In practice, when comparing the correlation between different pairs of variables, the correlation coefficient is the more sound and accurate choice. Covariance itself, however, is used in many areas of machine learning and is very important!



Origin www.cnblogs.com/huangm1314/p/11509309.html