[Statistics] Understanding the formula for Pearson's correlation coefficient

Pearson's correlation coefficient formula

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$$
In words: the correlation coefficient ρ(X, Y) equals the covariance cov(X, Y) of X and Y divided by the product of their standard deviations σX and σY.
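As a minimal sketch, the formula can be computed directly from its definition. The function names and the sample data below are made up for illustration, and the population form (dividing by n) is assumed; dividing by n - 1 in both numerator and denominator would give the same coefficient.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    # Mean of the products of the deviations from each mean.
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def std_dev(values):
    # The variance is the covariance of a variable with itself.
    return math.sqrt(covariance(values, values))


def pearson(xs, ys):
    # Covariance divided by the product of the standard deviations.
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))


# Hypothetical data: ys rises almost linearly with xs.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]
r = pearson(xs, ys)
print(r)  # close to +1 for this nearly linear data
```

The result always lies between -1 and +1, which is what makes it comparable across data sets.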

1. First, the numerator: why use covariance?

We want to study the correlation between two sets of data. If the two sets are related, the most basic condition they must meet is that they change in a consistent way, either moving in the same direction (positive correlation) or in opposite directions (negative correlation).

Covariance tells us exactly this. Its formula is:
$$\operatorname{cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big]$$
In words: for two variables X and Y, multiply "the difference between the X value and its mean" by "the difference between the Y value and its mean" at each moment, then sum these products over all moments and take the mean.

If X and Y change in a regular way, for example with a positive correlation, then when X is below its mean, Y will probably be below its mean too, and when X is above its mean, Y will probably be above its mean. Each product of deviations is therefore most likely positive (many positive products plus a few negative ones), so the mathematical expectation is positive. A positive correlation ends up with a positive sign.

With a negative correlation, when X is below its mean, Y will probably be above its mean, and when X is above its mean, Y will probably be below its mean. Each product of deviations is therefore most likely negative (many negative products plus a few positive ones), so the mathematical expectation is negative. A negative correlation ends up with a negative sign.

If X and Y change with no regular pattern, then when X is below its mean, Y may be either below or above its mean. Some of the products are positive and some are negative, so they cancel each other out when the expectation is computed (positive and negative products are roughly balanced), and the result for unrelated data is very close to 0.
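The three situations above can be illustrated with a small sketch. The data sets are invented for this example (they are not the ones plotted in the original figures), and the helper names are hypothetical. The code lists the per-moment products of deviations so that the "many positive, few negative" argument is visible before the mean is taken.

```python
def deviation_products(xs, ys):
    # (x - mean_x) * (y - mean_y) for every moment.
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return [(x - mx) * (y - my) for x, y in zip(xs, ys)]


def covariance(xs, ys):
    products = deviation_products(xs, ys)
    return sum(products) / len(products)


xs = [1, 2, 3, 4, 5, 6]
ys_up = [2, 3, 5, 4, 6, 7]    # mostly rises with xs
ys_down = [7, 6, 4, 5, 3, 2]  # mostly falls as xs rises
ys_flat = [3, 5, 4, 4, 5, 3]  # no consistent relation to xs

for label, ys in [("up", ys_up), ("down", ys_down), ("flat", ys_flat)]:
    products = deviation_products(xs, ys)
    positives = sum(1 for p in products if p > 0)
    negatives = sum(1 for p in products if p < 0)
    print(label, positives, negatives, covariance(xs, ys))
```

For the first data set most products are positive and the covariance is positive; for the second the signs flip; for the third the positive and negative products cancel.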

2. Next, the denominator: why use the product of the standard deviations?

Dividing the covariance by the product of the standard deviations is a normalization step. Its purpose is to eliminate the influence of the units of measurement (the "dimension") so that the result reflects only how similarly the two variables behave per unit of change.

Why do the units matter? In the figure below, the red curves of Case 1 and Case 2 look very different in amplitude, but the only real difference between them is that their units differ by a factor of 10,000. The green curve relates to the red curve in the same way in both cases: when the green curve is at its lowest point, so is the red curve; when the green curve is at its highest point, so is the red curve. The correlation in the two cases should therefore be the same (the correlation coefficient only cares about how strongly the red and green curves move together).
[Figure: Case 1 and Case 2, each showing a red curve and a green curve]
If we looked only at covariance, the covariance in one case would be far larger than in the other. What we want to study is how the two variables move together, and we do not want a difference in units to affect the result, so the influence of the units has to be eliminated.
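A small sketch of this effect, using invented numbers rather than the data behind the figure: the same X measurements are expressed once in kilograms and once in units of 0.1 g (10,000 times smaller), while Y stays the same. The helper names are hypothetical and the population form (dividing by n) is assumed.

```python
import math


def covariance(xs, ys):
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def pearson(xs, ys):
    sx = math.sqrt(covariance(xs, xs))
    sy = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)


ys = [1.0, 2.0, 2.5, 4.0]                  # hypothetical Y values
xs_kg = [1.0, 2.0, 3.0, 4.0]               # X measured in kilograms
xs_small = [x * 10000.0 for x in xs_kg]    # same X in units of 0.1 g

cov_kg = covariance(xs_kg, ys)
cov_small = covariance(xs_small, ys)
r_kg = pearson(xs_kg, ys)
r_small = pearson(xs_small, ys)

print(cov_kg, cov_small)  # the covariance grows with the unit change
print(r_kg, r_small)      # the correlation coefficient does not
```

The covariance changes by the same factor of 10,000 as the units, while the correlation coefficient is the same in both cases.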

Why can the standard deviation be used to eliminate the influence of the units?
The formula for the standard deviation is:
$$\sigma_X = \sqrt{E\big[(X - E[X])^2\big]}$$
In words: take each sample's deviation from the mean; because a deviation can be positive or negative, square it; average the squared deviations to get the mathematical expectation of the squared deviation; then take the square root to bring the result back to the original magnitude.

(Squaring is there to solve the sign problem. If a data set swings far above its mean at one moment and far below it at another, adding the raw deviations directly makes them cancel, so the expected deviation becomes 0, as if the data did not vary at all, which is not what we want. That is why the deviations are squared. The square root then undoes the squaring and brings the expected deviation back to the original magnitude.)
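A minimal sketch of why the raw deviations cannot simply be averaged, using a made-up data set that swings between -10 and +10 (the variable names are hypothetical):

```python
import math

swings = [-10.0, 10.0, -10.0, 10.0]  # hypothetical data with large swings
mu = sum(swings) / len(swings)

# Averaging the raw deviations: positive and negative values cancel.
mean_deviation = sum(v - mu for v in swings) / len(swings)

# Squaring first, averaging, then taking the square root.
std = math.sqrt(sum((v - mu) ** 2 for v in swings) / len(swings))

print(mean_deviation)  # 0.0, as if the data did not vary at all
print(std)             # 10.0, the typical size of a swing
```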

Therefore, the standard deviation represents the degree of deviation within a set of data, which can also be understood as the magnitude of change. This change may be large or small.

We want the correlation coefficient to be free of the influence of the units, and here a difference in units and a difference in the magnitude of change amount to the same thing.

For example, in Case 1 the unit of X is 1 kilogram and the unit of Y is 1 yuan; each 1-yuan increase in Y increases X by 1 kilogram. In Case 2 the unit of X is 0.1 gram and the unit of Y is 1 yuan; each 1-yuan increase in Y increases X by 0.1 gram. The two units of X differ by a factor of 10,000, which makes the magnitudes of change differ by a factor of 10,000 as well (1 kg/yuan versus 0.1 g/yuan).

Therefore, dividing the covariance by the product of the standard deviations turns it into the covariance per unit of change, which eliminates the influence of the units (equivalently, the influence of the magnitude of change).
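The cancellation can be written out as a short derivation. This is a sketch under one assumption that is not spelled out in the text: changing the unit of X is modeled as multiplying X by a positive constant a (a = 10000 in the example), with Y unchanged.

```latex
% Assumption: a change of units replaces X by aX with a > 0 (here a = 10000).
\begin{align*}
\operatorname{cov}(aX, Y) &= E\big[(aX - aE[X])(Y - E[Y])\big] = a\,\operatorname{cov}(X, Y) \\
\sigma_{aX} &= \sqrt{E\big[(aX - aE[X])^2\big]} = a\,\sigma_X \\
\rho_{aX,Y} &= \frac{a\,\operatorname{cov}(X, Y)}{a\,\sigma_X\,\sigma_Y} = \rho_{X,Y}
\end{align*}
```

The factor a appears once in the numerator and once in the denominator, so it drops out of the ratio.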

Summary

That is my understanding of the formula for Pearson's correlation coefficient. In short:

  1. The covariance in the numerator captures whether, and in which direction, the two variables move together
  2. The product of the standard deviations in the denominator eliminates the influence of the units (or the magnitude of change)

Source of the formulas and figures (a very good article): How to explain the concepts of "covariance" and "correlation coefficient" in a simple and understandable way?

Origin blog.csdn.net/weixin_38705903/article/details/106805267