[Statistics Notes] Correlation and three correlation coefficients in statistics

Correlation is a non-deterministic relationship between variables, and the correlation coefficient measures the degree of linear correlation between the variables studied.

Depending on the object of study, the correlation coefficient is defined in several ways.

Simple correlation coefficient: also called the correlation coefficient or linear correlation coefficient, generally denoted by the letter r. It measures the linear relationship between two variables.

Multiple correlation coefficient: also called the complex correlation coefficient. Multiple correlation refers to the correlation between a dependent variable and several independent variables. For example, the seasonal demand for a commodity is in a multiple correlation with its price level, the income level of employees, and other factors.

Canonical correlation coefficient (sometimes translated as "typical correlation coefficient"): first build composite indices, that is, linear combinations, from each of the original groups of variables in a way similar to principal component analysis, and then study the correlation between the original groups of variables through the linear correlation coefficient between these composite indices.

Correlation analysis describes and measures the linear relationship between two variables. The problems to be solved include:

Are there relationships between the variables?
If there is a relationship, what kind of relationship is it?
How strong is the relationship between the variables?
Can the relationship observed in the sample represent the relationship between the variables in the population?
To answer these questions, the analysis makes two main assumptions about the population:

First: the two variables are linearly related;

Second: both variables are random variables.

Correlation analysis determines whether a correlation exists, its form and direction, and how close it is. The main methods are drawing correlation charts and calculating correlation coefficients.
1) Correlation table
Before compiling a correlation table, a series of paired observations must first be obtained through actual investigation; these serve as the raw data for the correlation analysis.
Correlation tables are classified into simple correlation tables and grouped correlation tables. In a univariate grouped correlation table, the independent variable is grouped and counted, while the corresponding dependent variable is not grouped and only its average is calculated for each group. This kind of table condenses lengthy data and shows the correlation between the two variables more clearly. In a bivariate grouped correlation table, both the independent and the dependent variable are grouped. This table resembles a chessboard, so it is also called a chessboard correlation table.
2) Correlation diagram
A correlation diagram uses the first quadrant of a rectangular coordinate system, with the independent variable on the horizontal axis and the dependent variable on the vertical axis, and plots each pair of values as a point to show how the points are distributed. This graph is commonly called a scatter diagram. If the factor variable is grouped and the result variable is expressed as group averages, the diagram drawn is a polyline, also called a correlation curve.
3) Correlation coefficient
1. The correlation coefficient is calculated by the product-moment method: it is based on the deviations of the two variables from their respective means, and the correlation between the two variables is reflected by multiplying the two deviations. The focus here is on the simple linear correlation coefficient.
2. Determine the mathematical expression of the correlation.
3. Determine the degree of error in the estimated value of the dependent variable.
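
As an illustration of the univariate grouped correlation table described above, here is a minimal sketch. The function name `grouped_correlation_table`, the bin width, and the data are hypothetical and are not taken from the original notes; the independent variable is grouped into equal-width bins and only the count and mean of the dependent variable are kept for each group.

```python
from collections import defaultdict


def grouped_correlation_table(xs, ys, bin_width):
    """Group x into equal-width bins; report the count and mean of y per bin."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        lower = (x // bin_width) * bin_width  # lower edge of the bin holding x
        groups[lower].append(y)
    # One row per bin: (lower edge, upper edge, number of pairs, mean of y)
    return [
        (lower, lower + bin_width, len(vals), sum(vals) / len(vals))
        for lower, vals in sorted(groups.items())
    ]


# Hypothetical paired observations, for illustration only.
xs = [12, 15, 21, 24, 28, 33, 37, 41]
ys = [30, 34, 40, 44, 47, 55, 58, 66]

table = grouped_correlation_table(xs, ys, 10)
for lower, upper, count, mean_y in table:
    print(f"x in [{lower}, {upper}): n={count}, mean y={mean_y:.1f}")
```

Plotting the group means against the bin midpoints would give the polyline (correlation curve) mentioned under the correlation diagram.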

When performing correlation analysis, first draw a scatterplot to judge the relationship between the variables. If the relationship is linear, the correlation coefficient can be used to measure its strength, and a significance test is then performed on the correlation coefficient to determine whether the relationship seen in the sample represents the relationship between the two variables in the population.

According to the scatterplot, when the independent variable takes a given value, the dependent variable follows a corresponding probability distribution. If this distribution is the same for every value of the independent variable, the dependent variable and the independent variable are not related. Conversely, if the distribution of the dependent variable changes when the value of the independent variable changes, the two are correlated.
 

The scatterplot can determine whether there is a correlation between the two variables, and roughly describe the relationship between the variables, but the scatterplot cannot accurately reflect the strength of the relationship between the variables. Therefore, in order to accurately measure the strength of the relationship between the two variables, the correlation coefficient needs to be calculated.

The correlation coefficient is a statistic, calculated from sample data, that measures the strength of the linear relationship between two variables.

If the correlation coefficient is calculated from all the data in the population, it is called the population correlation coefficient and is denoted \rho.

If it is calculated from sample data, it is called the sample correlation coefficient and is denoted r.

The calculation formula of the sample correlation coefficient is:

r = \frac{n\sum xy - \sum x\sum y}{\sqrt{n\sum x^{2}-\left ( \sum x \right )^{2}}\times \sqrt{n\sum y^{2}-\left ( \sum y \right )^{2}}}

The correlation coefficient calculated with the above formula is also called the linear correlation coefficient, or Pearson's correlation coefficient.
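
The sample formula for r can be sketched directly in code. This is a minimal illustration using only the standard library; the function name `pearson_r` and the data are hypothetical and not part of the original notes.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson r computed from raw sums, term by term as in the formula."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when a variable has zero variance")
    return numerator / denominator


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(round(r, 4))
```

The raw-sum form mirrors the formula above; for very large values a mean-centered computation is usually numerically safer.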

Generally, the population correlation coefficient \rho is unknown, and the sample correlation coefficient r is used as an approximate estimate of \rho.

However, because r is calculated from sample data, it is affected by sampling fluctuation. Different samples give different values of r, so r is a random variable. Can the degree of correlation in the population be inferred from the sample correlation coefficient? Answering this requires assessing the reliability of the sample correlation coefficient, that is, conducting a significance test.
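
The notes do not give the test statistic. A common choice, stated here as standard background and assuming the two variables are approximately bivariate normal, is the t-test of the null hypothesis \rho = 0:

t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}, \qquad t \sim t(n-2) \ \text{under} \ H_0: \rho = 0

For example, with r = 0.5 and n = 30, t = 0.5\sqrt{28}/\sqrt{0.75} \approx 3.06, which exceeds the two-sided 5% critical value of about 2.048 for 28 degrees of freedom, so the hypothesis of zero correlation would be rejected at that level.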

The purpose of correlation analysis: measure the strength of the relationship between variables.

Tools used: correlation coefficient


Pearson correlation coefficient

The Pearson correlation coefficient measures whether two data sets lie on a straight line, that is, the linear relationship between interval-scale variables.

Examples include the linear correlation between national income and household savings deposits, between height and weight, and between high school grades and college entrance examination scores. When both variables are normally distributed continuous variables and the relationship between them is linear, their correlation is expressed by the product-moment correlation coefficient, chiefly the Pearson simple correlation coefficient.

Applicable conditions:

  1. The sample size is at least 30, so that the data are representative and the calculated product-moment correlation coefficient can meaningfully describe the correlation between the two variables.
  2. The populations of the two variables are normally distributed, or at least unimodal and close to normal.
  3. Both variables are continuous data obtained by measurement.
  4. The relationship between the two variables is linear.
  5. The influence of covarying factors is excluded.
  6. It is used for correlation analysis between continuous variables or variables measured on an interval scale.

Spearman correlation coefficient

In statistics, the Spearman rank correlation coefficient, named after Charles Spearman, is usually called simply the Spearman correlation coefficient and is often denoted by the Greek letter ρ. It is a non-parametric measure of the dependence between two variables: it assesses how well the relationship between two variables can be described by a monotonic function. If the data contain no repeated values and the two variables are perfectly monotonically related, the Spearman correlation coefficient is +1 or −1.

Applicable conditions:

  1. There are only two variables, and both are ordinal (rank) variables, or one column of data is ordinal and the other is continuous.
  2. It is suitable for describing association in ordinal data (and in nominal data only when the categories can be meaningfully ranked).
  3. The data are observations of two continuous variables, but at least one column was assessed roughly rather than measured. For example, in job analysis the evaluator can only make a rough judgment based on a given standard and on personal experience.
  4. As these conditions show, Spearman rank correlation is not restricted by sample size, by the form of the variable distributions, or by whether the data are continuous. So when the data do not meet the conditions for Pearson product-moment correlation, Spearman rank correlation can be used. However, Spearman rank correlation requires converting continuous data into ordinal data, which loses some of the original information, so it is less precise than product-moment correlation. Therefore, when the data do meet the conditions for product-moment correlation, rank correlation should not be used instead.

The Spearman correlation coefficient is defined as the Pearson correlation coefficient between rank variables. For a sample of size n, the n raw values are converted into ranks x_i and y_i, and the correlation coefficient ρ is:

\rho = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^{2}\sum_i (y_i-\bar{y})^{2}}}

In practical applications, when ties between values are absent or negligible, ρ can be calculated in simpler steps. Let d_i = x_i - y_i be the difference between the ranks of the two variables for observation i; then ρ is

\rho = 1-\frac{6\sum d_i^2}{n(n^2-1)}

Spearman rank correlation is a method for studying the correlation between two variables based on rank data. It is calculated from the differences between the paired ranks, so it is also called the "rank difference method".

Spearman rank correlation does not impose data conditions as strict as those of the product-moment correlation coefficient. As long as the observations of the two variables are paired rank data, or rank data converted from continuous observations, Spearman rank correlation can be used regardless of the population distributions of the two variables and the sample size.

Spearman's rank correlation coefficient reflects how closely the two variables are related. Like the correlation coefficient r, it takes a value between -1 and +1. The difference is that it is calculated from ranks.

Let's illustrate with an example. A factory gave its workers a job-skills examination. To study whether the examination results are related to monthly output, a sample was drawn at random; the examination results and output figures are as follows:

[Table not reproduced: examination scores, monthly output, and their ranks in columns 3 and 4. Caption: Spearman rank correlation coefficient]

The figures in the table show that the higher a worker's test score, the higher the output, so the two agree perfectly in order. Yet the correlation coefficient r = 0.676 is not very high, because the relationship between them is not linear. If the test scores and outputs are converted into ranks (see columns 3 and 4 of the table above), the rank correlation coefficient between them is 1. To calculate the rank correlation coefficient, either apply the original correlation coefficient formula to the ranks, or compute the rank difference d_i for each pair of observations and then use the following formula:

  \rho=1-\frac{6\sum d_i^2}{n^3-n}

In this example the ranks are identical, so every d_i = 0 and ρ = 1. Like the ordinary correlation coefficient, the rank correlation coefficient depends on the sample size; its sampling variation is large, especially when the sample is small. The significance test for the rank correlation coefficient is the same as for the ordinary correlation coefficient.
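
The effect described in this example can be reproduced with a short sketch. The original table is not available here, so the scores and outputs below are hypothetical; they are chosen only to be monotonic but non-linear. The helper names `ranks`, `pearson_r` and `spearman_rho` are also illustrative, and `ranks` assumes there are no tied values.

```python
import math


def ranks(values):
    """Rank from 1 (smallest) upward; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def pearson_r(xs, ys):
    """Pearson r in mean-centered form."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(xs, ys):
    """Rank-difference formula: rho = 1 - 6 * sum(d_i^2) / (n^3 - n)."""
    rank_x, rank_y = ranks(xs), ranks(ys)
    n = len(xs)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


# Hypothetical data: output rises with score, but not along a straight line.
scores = [50, 60, 70, 80, 90]
output = [10, 12, 16, 30, 80]

print(pearson_r(scores, output))                 # noticeably below 1
print(spearman_rho(scores, output))              # 1.0, the ranks agree exactly
print(pearson_r(ranks(scores), ranks(output)))   # Pearson on ranks gives the same value
```

Both routes mentioned in the text, the Pearson formula applied to ranks and the rank-difference formula, agree when there are no ties.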

Kendall rank correlation coefficient 

Definition of the Kendall coefficient: n objects of the same kind are ordered by one attribute, and their other attributes are then generally out of order. The Kendall coefficient is the ratio of the difference between the number of concordant pairs and the number of discordant pairs to the total number of pairs, n(n-1)/2.

If the agreement between the two rankings is perfect (that is, the two rankings are the same), the value of the coefficient is 1.

If the disagreement between the two rankings is perfect (that is, one ranking is the reverse of the other), the value of the coefficient is -1.

For all other arrangements the value lies between -1 and 1, and a larger value means greater agreement between the rankings. If the rankings are completely independent, the expected value of the coefficient is 0.

The Kendall-tau coefficient is defined:

\tau = \frac{2P}{\frac{1}{2}{n(n-1)}} - 1 = \frac{4P}{n(n-1)} - 1

Here n is the number of items and P is the sum, over all items, of the number of items ranked after the given item by both rankings.

P can also be interpreted as the number of concordant pairs. The denominator in the definition of τ can be interpreted as the total number of item pairs. Therefore, a high value of P means that most pairs are concordant, which indicates that the two rankings agree. Note that a tied pair is counted as neither concordant nor discordant. If there are many ties, the total number of pairs (in the denominator of the expression for τ) should be adjusted accordingly.
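
The notes do not spell out the adjustment for ties. One widely used version, given here as standard background, is Kendall's tau-b, where C and D are the numbers of concordant and discordant pairs, t_i is the number of tied values in the i-th group of ties in the first ranking, and u_j is the same for the second ranking:

\tau_b = \frac{C - D}{\sqrt{(n_0 - n_1)(n_0 - n_2)}}, \qquad n_0 = \frac{n(n-1)}{2}, \quad n_1 = \sum_i \frac{t_i(t_i-1)}{2}, \quad n_2 = \sum_j \frac{u_j(u_j-1)}{2}

With no ties, n_1 = n_2 = 0, C = P and D = n_0 - P, so \tau_b reduces to \frac{2P}{n_0} - 1, the definition of τ given above.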

Suppose we rank a group of 8 people by height and by weight, where person A is the tallest and the third heaviest, and so on:

Person A B C D E F G H
Rank by Height 1 2 3 4 5 6 7 8
Rank by Weight 3 4 1 2 5 7 8 6

We see a certain relationship between these two rankings, but this relationship is far from perfect. We can use the Kendall-tau coefficient to objectively measure the degree of correspondence.

Note that in the weight ranking above, the first entry, 3, has seven other elements to its right (4, 1, 2, 5, 7, 8, 6). In the other ranking, which elements are to the right of 3?

In the height ranking, the elements to the right of 3 are 4, 5, 6, 7, 8. So five elements (4, 5, 6, 7, 8) are to the right of 3 in both rankings, and this entry's contribution to P is 5.

Moving to the second entry, 4, there are six elements to its right. Of these, four (5, 6, 7, 8) are also to the right of 4 in the other ranking, so the contribution to P is 4. Continuing in this way, we find

P = 5 + 4 + 5 + 4 + 3 + 1 + 0 + 0 = 22.

Therefore: \tau = \frac{88}{56} - 1 = \frac{44}{28} - 1 \approx 0.57.

This result shows that, as expected, there is strong agreement between the two rankings.
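
The worked example can be checked with a short sketch that uses the height and weight rankings from the table above. The function names are illustrative, and both functions assume there are no ties. `kendall_p` follows the counting procedure in the text, while `kendall_tau_pairs` uses the definition as concordant pairs minus discordant pairs over all pairs.

```python
from itertools import combinations


def kendall_p(rank_a, rank_b):
    """P: for each item, count the later items (in rank_a order) that rank_b also ranks later."""
    order = sorted(range(len(rank_a)), key=lambda i: rank_a[i])
    b_sorted = [rank_b[i] for i in order]
    return sum(
        sum(1 for later in b_sorted[pos + 1:] if later > value)
        for pos, value in enumerate(b_sorted)
    )


def kendall_tau(rank_a, rank_b):
    """tau = 4P / (n(n-1)) - 1, valid when there are no ties."""
    n = len(rank_a)
    return 4 * kendall_p(rank_a, rank_b) / (n * (n - 1)) - 1


def kendall_tau_pairs(rank_a, rank_b):
    """tau = (concordant - discordant) / (n(n-1)/2), valid when there are no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Rankings of persons A..H from the table above.
height = [1, 2, 3, 4, 5, 6, 7, 8]
weight = [3, 4, 1, 2, 5, 7, 8, 6]

print(kendall_p(height, weight))             # 22
print(round(kendall_tau(height, weight), 2)) # 0.57
print(kendall_tau_pairs(height, weight))     # same value via pair counting
```

Both functions compare every pair, so they take time quadratic in n; that is fine for small examples like this one.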

The Kendall correlation coefficient is a statistic that measures the correlation between two ordered variables or two rank variables, so it also belongs to non-parametric statistics. The difference from Spearman is that one of the data series being compared needs to be sorted, and in that sorted case the calculation is faster than Spearman's.

  1. It is an indicator of correlation between categorical variables and applies when both categorical variables are ordered.
  2. It provides a non-parametric correlation test for related ordered variables.
  3. Kendall's rank correlation coefficient is suitable for ordered variables, or for interval data that do not satisfy the normality assumption.
  4. If Kendall rank correlation analysis is used where it is not appropriate, the resulting correlation coefficient may turn out too small.

What is rank correlation coefficient

In practical applications, the raw data sometimes have no specific numerical values, and a phenomenon can only be described by grades. To analyze the correlation between such phenomena, only a grade (rank) correlation coefficient can be used.

The grade correlation coefficient is also called the "rank correlation coefficient". It is a statistical index reflecting the degree of rank correlation. Commonly used rank correlation methods include Spearman rank correlation and Kendall rank correlation.

Calculation steps of rank correlation coefficient

1. Assign rank numbers, in order, to the observed values of the quantitative attribute and of the qualitative attribute.

2. Find the difference between each pair of rank numbers of the two attributes.

3. Calculate the correlation coefficient as follows:

r_s=1-\frac{6\sum d_i^2}{n(n^2-1)}

Here r_s is the rank correlation coefficient, d_i is the difference between the ranks of the two variables for each pair of observations, and n is the sample size.

Like the correlation coefficient, the rank correlation coefficient ranges from -1 to +1. A positive r_s indicates positive correlation, a negative r_s indicates negative correlation, and r_s equal to zero indicates zero correlation. The difference is that it is calculated from ranks and is therefore better suited to reflecting the correlation of ordinal variables.
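
As a small sketch of these calculation steps, the code below applies them to the height and weight rankings of persons A to H used in the Kendall example earlier in these notes. Those data are already ranks, so step 1 is done. The function name `rank_correlation` is illustrative, and the formula assumes there are no tied ranks.

```python
def rank_correlation(rank_x, rank_y):
    """r_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), assuming no tied ranks."""
    n = len(rank_x)
    # Step 2: difference between each pair of rank numbers.
    diffs = [a - b for a, b in zip(rank_x, rank_y)]
    # Step 3: plug the sum of squared differences into the formula.
    sum_d2 = sum(d * d for d in diffs)
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Step 1 is already done: these are the ranks from the height/weight table.
height = [1, 2, 3, 4, 5, 6, 7, 8]
weight = [3, 4, 1, 2, 5, 7, 8, 6]

r_s = rank_correlation(height, weight)
print(round(r_s, 4))  # 1 - 6*22/504
```

Here the squared differences sum to 22, so r_s = 1 - 132/504, about 0.74. This is larger than the Kendall value of about 0.57 found for the same rankings; the two coefficients are on different scales and are not directly comparable.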


What are the similarities and differences between the three correlation analysis methods of Pearson, Kendall and Spearman

When two continuous variables are linearly correlated, the Pearson product-moment correlation coefficient is used. When the conditions for product-moment correlation analysis are not met, the Spearman rank correlation coefficient is used instead.

The Spearman correlation coefficient, also known as the rank correlation coefficient, performs linear correlation analysis on the ranks of the two variables. It makes no requirement on the distribution of the original variables. It is a non-parametric method with a wider range of application. The Spearman correlation coefficient can also be calculated for data that satisfy the conditions of the Pearson correlation coefficient, but its statistical efficiency is lower. The formula for the Pearson correlation coefficient can be applied unchanged to the Spearman correlation coefficient, with x and y in the formula replaced by the corresponding ranks.

When the two variables do not satisfy the assumption of a bivariate normal distribution, Spearman rank correlation is needed to describe how the variables change together.

Kendall's tau-b rank correlation coefficient is an indicator of correlation between categorical variables and applies when both categorical variables are ordered. It provides a non-parametric correlation test for ordered variables, takes values between -1 and 1, and is suitable for square tables. In summary: the Pearson product-moment correlation coefficient is used for continuous variables; the Spearman rank correlation coefficient is suitable for ordered variables, or for interval data that do not satisfy the normality assumption; and the Kendall rank correlation coefficient is likewise suitable for ordered variables, or for interval data that do not satisfy the normality assumption.

Choosing a coefficient: when the data do not follow a bivariate normal distribution, when the population distribution is unknown, or when the original data are expressed as grades, Spearman or Kendall correlation should be used.

 


Origin blog.csdn.net/seagal890/article/details/105692590