Spearman correlation analysis: detailed explanation + Python example code


Foreword

Correlation analysis is a classic building block of many algorithms and modeling tasks. It can be used to quantify how strongly features are related and in which direction they move together. Three correlation coefficients are in common use: the Pearson correlation coefficient, the Spearman correlation coefficient, and Kendall's tau-b rank correlation coefficient. Each has its own usage and applicable scenarios. I will cover the principles, calculation, and code for all three in my column. The mathematical modeling column already covers many traditional machine learning prediction algorithms, dimensionality reduction algorithms, time series prediction algorithms, and weighting algorithms; interested readers are welcome to take a look.

Previous article: Detailed explanation of Pearson correlation analysis + Python example code


1. Definition

The Spearman correlation coefficient is often denoted by the Greek letter ρ (written r_s for a sample). It is a nonparametric measure of the dependence between two variables: it evaluates how well the relationship between two statistical variables can be described by a monotonic function. If the data contain no repeated values and the two variables are perfectly monotonically related, the Spearman correlation coefficient is +1 or −1. The Spearman correlation coefficient is defined as the Pearson correlation coefficient between the rank variables. For a sample of size n, the n raw observations X_i, Y_i are converted to ranks x_i, y_i, and the correlation coefficient ρ is:

$$\rho = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2 \sum_{i}(y_i - \bar{y})^2}}$$

In practical applications, ties (repeated values) are often negligible. When all ranks are distinct, if the ranks x_i and y_i of the corresponding observations X_i and Y_i are subtracted to obtain a difference d_i, the above formula can also be transformed into:

$$\rho = 1 - \frac{6\sum_{i} d_i^2}{n(n^2 - 1)}$$

where d_i is the rank difference between X_i and Y_i.

d_i is calculated as:

$$d_i = x_i - y_i$$
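The two formulas can be checked against each other numerically. Below is a minimal sketch, assuming NumPy and SciPy are installed; the helper names `spearman_by_ranks` and `spearman_by_differences` and the sample data are made up for illustration and contain no ties.

```python
import numpy as np
from scipy import stats


def spearman_by_ranks(x, y):
    """Pearson correlation computed on the ranks (the general definition)."""
    rx = stats.rankdata(x)  # tied values would receive their average rank
    ry = stats.rankdata(y)
    return np.corrcoef(rx, ry)[0, 1]


def spearman_by_differences(x, y):
    """Simplified formula 1 - 6*sum(d_i^2) / (n*(n^2 - 1)); exact only without ties."""
    d = stats.rankdata(x) - stats.rankdata(y)
    n = len(d)
    return 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))


# Hypothetical paired sample without repeated values
x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]

rho_ranks = spearman_by_ranks(x, y)
rho_diff = spearman_by_differences(x, y)
print(rho_ranks, rho_diff)  # both are 0.8 up to floating-point rounding
```

Here the rank differences are 0, −1, 1, −1, 1, so the sum of squared differences is 4 and ρ = 1 − 24/120 = 0.8 under both formulas.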

2. Usage scenarios of the Spearman correlation coefficient

The Spearman correlation coefficient applies under broader conditions than the Pearson correlation coefficient. As long as the observations of the two variables are paired rank (ordinal) data, or rank data converted from observations of continuous variables, the Spearman rank correlation coefficient can be used regardless of the shape of the two variables' population distributions and regardless of the sample size. It is suitable whenever the data follow a monotonic relationship (for example a linear, exponential, or logarithmic function).

The Spearman correlation coefficient is less sensitive to outliers because it is calculated from ranks, so the magnitude of the differences between the actual values has no direct effect on the result.
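To illustrate the monotonic-relationship point, here is a small sketch, assuming NumPy and SciPy are available. The data are hypothetical: y grows exponentially with x, so the relationship is monotonic but far from linear.

```python
import numpy as np
from scipy import stats

# Hypothetical data: strictly increasing, but not linear
x = np.arange(1, 11)
y = np.exp(x)

pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)

print(f"Pearson:  {pearson_r:.3f}")     # noticeably below 1
print(f"Spearman: {spearman_rho:.3f}")  # 1.000, because the rank orders agree exactly
```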
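The effect of an outlier can be seen in a small experiment. This is only a sketch with made-up data, assuming NumPy and SciPy: a single extreme value is injected without changing the rank order, and both coefficients are computed before and after.

```python
import numpy as np
from scipy import stats

x = np.arange(1, 11)
y_clean = 2.0 * x
y_outlier = y_clean.copy()
y_outlier[-1] = 1000.0  # one extreme value; the rank order stays the same

results = {}
for label, y in [("clean", y_clean), ("with outlier", y_outlier)]:
    r, _ = stats.pearsonr(x, y)
    rho, _ = stats.spearmanr(x, y)
    results[label] = (r, rho)
    print(f"{label:>12}: Pearson = {r:.3f}, Spearman = {rho:.3f}")
```

The Pearson coefficient drops sharply once the outlier is present, while the Spearman coefficient is unchanged because the ranks are unchanged.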

3. Calculation of Spearman correlation coefficient

As in the previous article, you can use the pandas method DataFrame.corr:

DataFrame.corr(method='pearson', 
               min_periods=1,
               numeric_only=_NoDefault.no_default)

Parameter description:

method : {'pearson', 'kendall', 'spearman'} or callable. Method of correlation.

  • pearson : standard correlation coefficient, Pearson coefficient

  • kendall : Kendall Tau correlation coefficient, Kendall coefficient

  • spearman : Spearman rank correlation, Spearman coefficient

min_periods : int, optional. The minimum number of observations required per pair of columns to produce a valid result. Currently only available for Pearson and Spearman correlations.

numeric_only : bool. Include only floating point, integer, or boolean data. The default is True in pandas 1.x and False from pandas 2.0 onward.

It's simple to implement:

# df_test is assumed to be an existing DataFrame with numeric columns
rho = df_test.corr(method='spearman')
rho
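For a self-contained run, here is a minimal sketch assuming pandas is installed. The contents of `df_test` and its column names (`hours`, `score`, `errors`) are invented for illustration and are not the data used in this article.

```python
import pandas as pd

# Hypothetical data set with three numeric columns and no repeated values
df_test = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6],
    "score": [52, 55, 61, 70, 68, 90],
    "errors": [9, 8, 7, 5, 6, 1],
})

# Pairwise Spearman rank correlation between all columns
rho = df_test.corr(method="spearman")
print(rho.round(3))
```

The result is a symmetric matrix with ones on the diagonal; each off-diagonal entry is the Spearman coefficient of one pair of columns.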

Heat map:

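import matplotlib.pyplot as plt
import seaborn as sns

# SimHei lets matplotlib render Chinese labels; unicode_minus=False keeps minus signs displaying correctly
plt.rcParams['font.family'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
sns.heatmap(rho, annot=True)
plt.title('Heat Map', fontsize=18)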

 

Or use the stats module from scipy; the result is the same:

from scipy import stats

# data1 and data2 are two equal-length sequences of paired observations
rho, p_value = stats.spearmanr(data1, data2)
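To check that the two routes agree, the following sketch computes the coefficient with both scipy and pandas. It assumes both libraries are installed; `data1` and `data2` here are made-up values without ties.

```python
import pandas as pd
from scipy import stats

# Hypothetical paired observations
data1 = [3.1, 4.7, 2.2, 8.0, 6.5, 5.9]
data2 = [10, 14, 9, 30, 18, 21]

# scipy returns the coefficient together with a p-value
rho_scipy, p_value = stats.spearmanr(data1, data2)

# pandas computes the same coefficient for two Series
rho_pandas = pd.Series(data1).corr(pd.Series(data2), method="spearman")

print(f"scipy:  rho = {rho_scipy:.4f}, p = {p_value:.4f}")
print(f"pandas: rho = {rho_pandas:.4f}")
```

Unlike DataFrame.corr, stats.spearmanr also returns a p-value for the null hypothesis that the two variables are uncorrelated.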

 4. Hypothesis test of Spearman's correlation coefficient

There are two cases: small samples and large samples.

For small samples (n ≤ 30), look up the critical value table directly.
H0: r_s = 0; H1: r_s ≠ 0
Compare the absolute value of the computed Spearman correlation coefficient r_s with the corresponding critical value; if it exceeds the critical value, reject H0.

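For large samples, the following statistic approximately follows the standard normal distribution under H0:

$$r_s \sqrt{n - 1} \sim N(0, 1)$$

H0: r_s = 0; H1: r_s ≠ 0. Calculate the test value z* = r_s \sqrt{n - 1}, find the corresponding p-value, and compare it with 0.05.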
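The large-sample test can be sketched in a few lines. This is an illustration only, assuming NumPy and SciPy and the normal approximation z* = r_s·sqrt(n − 1); the function name `spearman_large_sample_test` and the simulated data are hypothetical.

```python
import numpy as np
from scipy import stats


def spearman_large_sample_test(x, y):
    """Two-sided test of H0: r_s = 0 using z* = r_s * sqrt(n - 1) ~ N(0, 1)."""
    r_s, _ = stats.spearmanr(x, y)
    n = len(x)
    z = r_s * np.sqrt(n - 1)
    p = 2 * stats.norm.sf(abs(z))  # two-sided p-value from the standard normal
    return r_s, z, p


# Simulated sample with n = 100 and a clear positive relationship
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = x + rng.normal(size=100)

r_s, z, p = spearman_large_sample_test(x, y)
print(f"r_s = {r_s:.3f}, z* = {z:.3f}, p = {p:.3g}")
print("reject H0" if p < 0.05 else "fail to reject H0")
```

The p-value returned by stats.spearmanr itself is, as far as I know, based on a t-distribution approximation, so it need not match this normal-approximation p-value exactly.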


Follow me so you don't miss future posts. If you spot any mistakes, please leave a comment to point them out; thank you very much.

That's all for this issue. I'm fanstuck. If you have any questions, feel free to leave a comment to discuss. See you next time.

References

Mathematical modeling - correlation coefficient (4) - Spearman correlation coefficient (spearman)


Origin blog.csdn.net/master_hunter/article/details/128609473