Correlation analysis: similarities, differences, and applicable scenarios of the Pearson, Kendall, and Spearman correlation coefficients

Correlation analysis studies how variables relate to one another: whether they move together, in which direction, and how strongly. In practical data analysis, we typically need correlation analysis in the following situations:

  • Determine the degree of correlation between two or more variables
  • Identify and exclude several highly correlated variables in machine learning tasks
  • Use correlation to help explore causality between variables

When performing correlation analysis, check that the assumptions of the chosen coefficient hold (for example, the Pearson coefficient only captures linear relationships), and confirm that the data are complete and correct, so as to avoid wrong conclusions caused by data errors or bias.
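
As a rough illustration of the second use case listed above (excluding highly correlated variables before a machine learning task), the sketch below drops one column from every pair whose absolute Pearson correlation exceeds a cutoff. The column names, the data, and the 0.9 threshold are all hypothetical choices made for this example, not recommendations.

```python
import pandas as pd

# Hypothetical feature table: "b" is an exact multiple of "a".
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
})

threshold = 0.9            # assumed cutoff; tune it for your own data
corr = df.corr().abs()     # absolute Pearson correlation matrix

to_drop = []
cols = list(corr.columns)
for i, col in enumerate(cols):
    if col in to_drop:
        continue
    # Compare each kept column only with the columns that come after it.
    for other in cols[i + 1:]:
        if other not in to_drop and corr.loc[col, other] > threshold:
            to_drop.append(other)

reduced = df.drop(columns=to_drop)
print(to_drop)
print(list(reduced.columns))
```

Which member of a correlated pair to keep is a judgment call; this sketch simply keeps the column that appears first.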


1 The similarities between Pearson correlation coefficient, Kendall correlation coefficient and Spearman correlation coefficient

  • All three are commonly used indicators of the correlation between two random variables.

  • Whichever coefficient is used, the data must form a complete and meaningful data set. Note also that a correlation coefficient only indicates how strongly two variables are associated; it cannot establish a causal relationship.


2 Differences

  • Pearson correlation coefficient: usually used to measure the linear relationship between two continuous variables. It is computed by taking the covariance of the two variables and normalizing it by the product of their standard deviations; its value ranges from -1 to 1.
  • Kendall correlation coefficient: usually used to measure the relationship between two ordinal variables. It compares the relative ordering of every pair of observations in the two variables and is based on the proportion of concordant (same order) versus discordant (opposite order) pairs; its value ranges from -1 to 1.
  • Spearman correlation coefficient: also used to measure the relationship between two ordinal variables. Like the Kendall coefficient, it depends only on the ordering of the data, but it is computed differently: the original values are replaced by their ranks, and the Pearson coefficient (normalized covariance) of the ranks is then calculated; its value also ranges from -1 to 1.
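
To make the three definitions above concrete, here is a minimal pure-Python sketch. The function names are chosen for this example only, and the Kendall version is the simple tau-a variant, which assumes there are no ties (pandas and scipy use tau-b, which gives the same value when there are no ties).

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Covariance of x and y divided by the product of their spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Pearson coefficient computed on the ranks of the data."""
    return pearson(ranks(x), ranks(y))


def kendall_tau_a(x, y):
    """(concordant - discordant) / number of pairs; assumes no ties."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(pearson(x, y))        # about 0.8
print(spearman(x, y))       # about 0.8 (the data are already ranks)
print(kendall_tau_a(x, y))  # 0.6 (8 concordant pairs, 2 discordant, 10 total)
```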

3 Applicable scenarios

  • Pearson correlation coefficient: suitable for measuring the linear relationship between two continuous variables, usually when the data are approximately normally distributed.
  • Kendall correlation coefficient: suitable for measuring the relationship between two ordinal variables. It can also be applied to interval and ratio data, and it is usually more reliable than the Pearson coefficient when the data are not normally distributed.
  • Spearman correlation coefficient: also suitable for measuring the relationship between two ordinal variables, and likewise applicable to interval and ratio data. Because it works on ranks, it is far less sensitive to outliers than the Pearson coefficient.
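
The practical difference between these scenarios can be seen on two small made-up data sets: one where the relationship is monotonic but not linear, and one that contains a single extreme value. The helper names below are hypothetical, and `simple_ranks` assumes there are no tied values.

```python
from math import sqrt


def pearson(x, y):
    """Pearson coefficient: normalized covariance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def simple_ranks(values):
    """1-based ranks; only valid when all values are distinct."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman(x, y):
    """Spearman coefficient: Pearson computed on the ranks."""
    return pearson(simple_ranks(x), simple_ranks(y))


x = [1, 2, 3, 4, 5, 6]

# Case 1: monotonic but non-linear relationship (y = x cubed).
y_cubic = [v ** 3 for v in x]
print(pearson(x, y_cubic))    # noticeably below 1
print(spearman(x, y_cubic))   # 1 up to rounding: the ordering is identical

# Case 2: a linear trend with one extreme value at the end.
y_outlier = [1, 2, 3, 4, 5, 100]
print(pearson(x, y_outlier))   # pulled well below 1 by the single outlier
print(spearman(x, y_outlier))  # still 1 up to rounding
```

In both cases the rank-based coefficient only sees the ordering of the values, which is why it is unaffected by the curvature and by the size of the outlier.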

4 Calculating correlation coefficients with Python

import pandas as pd

# Build a small example DataFrame
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

# Pearson (the default method)
p = df.corr()

# Kendall
k = df.corr(method='kendall')

# Spearman
s = df.corr(method='spearman')

When calculating correlation coefficients in pandas, the Pearson correlation coefficient is computed by default if no method is specified. In addition, scipy can also be used to calculate correlation coefficients.
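
For the scipy route mentioned above, a minimal sketch using `scipy.stats` might look like the following. The data are made up for illustration; each function returns the coefficient together with a p-value.

```python
from scipy import stats

# Made-up sample data with no tied values
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

# Each call returns (coefficient, p-value)
r, r_p = stats.pearsonr(x, y)
rho, rho_p = stats.spearmanr(x, y)
tau, tau_p = stats.kendalltau(x, y)

print("Pearson:", r, r_p)
print("Spearman:", rho, rho_p)
print("Kendall:", tau, tau_p)
```

Unlike `DataFrame.corr`, which returns a matrix for all column pairs, these functions compare two sequences at a time, and the p-value gives a rough indication of whether the observed correlation could be due to chance.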

Origin blog.csdn.net/qq_42774234/article/details/130213282