Table of contents
1. Configure the virtual environment
2. Library version introduction
1. cal_pearson (calculate Pearson correlation coefficient)
a. Experiment 1 (strong positive correlation):
b. Experiment 2 (almost no linear correlation):
c. Experiment 3 (very strong positive correlation):
d. Experiment 4 (Spearman correlation coefficient matrix):
1. Experiment introduction
This experiment implements a custom Pearson correlation coefficient function and uses it for correlation analysis.
Correlation analysis is a common statistical method for evaluating the degree of association between two or more variables. This experiment uses two common correlation measures: the Pearson correlation coefficient and the Spearman correlation coefficient. The Pearson coefficient measures the linear relationship between two continuous variables, while the Spearman coefficient evaluates any monotonic relationship between two variables, whether linear or not.
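To make the difference between the two coefficients concrete, here is a minimal sketch on a relationship that is monotonic but not linear. The exponential data is my own illustrative choice and is not part of the experiment.

```python
import numpy as np
from scipy import stats

# Illustrative data (an assumption, not from the experiment):
# y grows monotonically with x, but far from linearly.
x = np.linspace(0, 10, 50)
y = np.exp(x)

pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)

print(pearson_r)     # clearly below 1: the relationship is not linear
print(spearman_rho)  # 1 up to rounding: the ranks of x and y agree exactly
```

Because Spearman only looks at ranks, any strictly increasing transformation of y would leave it unchanged, while the Pearson value would move.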
2. Experimental environment
This series of experiments runs in a PyTorch deep learning environment, set up as follows (it builds on the environment from the deep learning series of articles):
1. Configure the virtual environment
The environment of the deep learning series of articles
conda create -n DL python=3.7
conda activate DL
pip install torch==1.8.1+cu102 torchvision==0.9.1+cu102 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
conda install matplotlib
conda install scikit-learn
Newly added for this series
conda install pandas
conda install seaborn
conda install networkx
conda install statsmodels
pip install pyHSICLasso
Note: In my experimental environment the libraries were installed in the order above. Installing them all in a single command may also work, but I have not tested it.
2. Library version introduction
Package | Version used in this experiment | Latest version at the time of writing |
matplotlib | 3.5.3 | 3.8.0 |
numpy | 1.21.6 | 1.26.0 |
python | 3.7.16 | |
scikit-learn | 0.22.1 | 1.3.0 |
torch | 1.8.1+cu102 | 2.0.1 |
torchaudio | 0.8.1 | 2.0.2 |
torchvision | 0.9.1+cu102 | 0.15.2 |
Newly added
networkx | 2.6.3 | 3.1 |
pandas | 1.2.3 | 2.1.1 |
pyHSICLasso | 1.4.2 | 1.4.2 |
seaborn | 0.12.2 | 0.13.0 |
statsmodels | 0.13.5 | 0.14.0 |
3. IDE
PyCharm is recommended (the pyHSICLasso library raises an error in VS Code, and I have not yet found a solution).
3. Experimental content
0. Import necessary tools
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
1. cal_pearson (calculate Pearson correlation coefficient)
def cal_pearson(x, y):
    n = len(x)
    mean_x = np.mean(x)
    mean_y = np.mean(y)
    x_ = x - mean_x
    y_ = y - mean_y
    s_x = np.sqrt(np.sum(x_ * x_))
    s_y = np.sqrt(np.sum(y_ * y_))
    r = np.sum((x_ / s_x) * (y_ / s_y))
    t = r / np.sqrt((1 - r * r) / (n - 2))  # the p-value has to be looked up in a t table from this t value
    return r
- Compute n, the length of x, that is, the number of samples.
- Compute the means of x and y.
- Center both variables (x_ and y_) and compute s_x and s_y, the square roots of their sums of squared deviations. These are proportional to the standard deviations of x and y.
- Compute the Pearson correlation coefficient r by dividing x_ by s_x and y_ by s_y, multiplying the results element by element, and summing.
- Compute the t value as r / sqrt((1 - r^2) / (n - 2)), where n - 2 is the number of degrees of freedom of the test. The function does not use t any further; the p-value would have to be looked up in a t table.
- Return the Pearson correlation coefficient r.
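Instead of looking up the t value in a table, the p-value can be computed from the t distribution directly. The following is only a sketch: cal_pearson_with_p is a hypothetical variant of the function above, the test data is made up, and I assume a two-sided test with n - 2 degrees of freedom.

```python
import numpy as np
from scipy import stats

def cal_pearson_with_p(x, y):
    """Hypothetical variant of cal_pearson that also returns a two-sided p-value."""
    n = len(x)
    x_ = x - np.mean(x)
    y_ = y - np.mean(y)
    r = np.sum(x_ * y_) / (np.sqrt(np.sum(x_ * x_)) * np.sqrt(np.sum(y_ * y_)))
    t = r / np.sqrt((1 - r * r) / (n - 2))
    # Two-sided p-value from the t distribution with n - 2 degrees of freedom,
    # replacing the manual table lookup.
    p = 2 * stats.t.sf(abs(t), n - 2)
    return r, p

# Made-up data with a moderate correlation, so the p-value is not vanishingly small.
rng = np.random.default_rng(0)
x = rng.random(30)
y = 0.5 * x + rng.random(30)

r, p = cal_pearson_with_p(x, y)
r_ref, p_ref = stats.pearsonr(x, y)
print(r, p)
print(r_ref, p_ref)
```

Both lines should agree up to rounding, since the default p-value of scipy.stats.pearsonr is based on an equivalent null distribution. Note that the t value is undefined when r is exactly 1 or -1.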
2. Main program
a. Experiment 1 (strong positive correlation):
x1 = np.random.random(100)
y1 = np.random.random(100) + x1
plt.scatter(x1, y1, marker='.')
plt.show()
pearson1, p1 = stats.pearsonr(x1, y1)
r1 = cal_pearson(x1, y1)
print(pearson1)
print(r1)
print()
- Generate two random arrays x1 and y1 of length 100, where y1 is x1 plus some random noise.
- Draw a scatter plot of x1 and y1.
- Compute the Pearson correlation coefficient and p-value of x1 and y1 with scipy.stats.pearsonr.
- Compute the same correlation coefficient with the custom cal_pearson function.
0.6991720710419989
0.6991720710419991
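The two printed values differ only in the last digits, which is ordinary floating-point rounding: the two implementations do the same arithmetic in a different order. A small sketch of how to compare them robustly (the seed is my assumption, taken from the integrated code at the end of this article):

```python
import numpy as np
from scipy import stats

def cal_pearson(x, y):
    # Same computation as the custom function above, condensed.
    x_ = x - np.mean(x)
    y_ = y - np.mean(y)
    s_x = np.sqrt(np.sum(x_ * x_))
    s_y = np.sqrt(np.sum(y_ * y_))
    return np.sum((x_ / s_x) * (y_ / s_y))

np.random.seed(0)
x1 = np.random.random(100)
y1 = np.random.random(100) + x1

pearson1, _ = stats.pearsonr(x1, y1)
r1 = cal_pearson(x1, y1)

# Exact equality (pearson1 == r1) may fail because of rounding,
# so compare with a tolerance instead.
print(np.isclose(pearson1, r1))
```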
b. Experiment 2 (almost no linear correlation):
x2 = np.random.random(100)
y2 = np.random.random(100)
plt.scatter(x2, y2, marker='.')
plt.show()
pearson2, p2 = stats.pearsonr(x2, y2)
r2 = cal_pearson(x2, y2)
print(pearson2)
print(r2)
print()
Two random arrays x2 and y2 of length 100 are generated independently, so y2 does not depend on x2 at all. As before, a scatter plot is drawn and the Pearson correlation coefficient is computed with both functions.
-0.11511730616773974
-0.11511730616773967
c. Experiment 3 (very strong positive correlation):
Two random arrays x3 and y3 of length 100 are generated, where y3 is x3 scaled by 50 plus random noise of the same size as before, so the noise is small relative to the linear term. As before, a scatter plot is drawn and the Pearson correlation coefficient is computed with both functions.
x3 = np.random.random(100)
y3 = np.random.random(100) + x3 * 50
plt.scatter(x3, y3, marker='.')
plt.show()
pearson3, p3 = stats.pearsonr(x3, y3)
r3 = cal_pearson(x3, y3)
print(pearson3)
print(r3)
print()
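The three experiments differ only in the factor a in y = a * x + noise (a = 1, 0 and 50). If x and the noise are independent with the same variance, as they are here, then Cov(x, y) = a * Var(x) and Var(y) = (a^2 + 1) * Var(x), so the population coefficient is a / sqrt(a^2 + 1). The sketch below checks this on a large sample; theoretical_r is a hypothetical helper and the sample size is my own choice.

```python
import numpy as np
from scipy import stats

def theoretical_r(a):
    # Population Pearson coefficient for y = a * x + u, with x and u
    # independent and of equal variance (hypothetical helper).
    return a / np.sqrt(a * a + 1)

rng = np.random.default_rng(0)
n = 1_000_000  # large sample so the estimate is close to the population value

results = {}
for a in (0, 1, 50):
    x = rng.random(n)
    y = rng.random(n) + a * x
    r, _ = stats.pearsonr(x, y)
    results[a] = r
    print(a, theoretical_r(a), r)
```

For a = 1 the formula gives about 0.707, which fits the value of roughly 0.70 observed in Experiment 1, and for a = 50 it gives about 0.9998.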
d. Experiment 4 (Spearman correlation coefficient matrix):
A random array data with shape (10, 10) is generated. scipy.stats.spearmanr then computes the Spearman correlation coefficient and p-value between every pair of columns in data, and the results are printed.
data = np.random.random((10, 10))
spearman_np, p_np = stats.spearmanr(data)
print(spearman_np, p_np)
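Each column of data is treated as one variable, so both results are 10 x 10 matrices. The Spearman coefficient is the Pearson coefficient of the ranks, which the following sketch checks by hand. The seed and the use of scipy.stats.rankdata with np.corrcoef are my own choices for illustration.

```python
import numpy as np
from scipy import stats

np.random.seed(0)
data = np.random.random((10, 10))

spearman_np, p_np = stats.spearmanr(data)
print(spearman_np.shape, p_np.shape)  # one row and one column per variable

# Rank each column separately, then take the Pearson correlation of the ranks.
ranks = np.apply_along_axis(stats.rankdata, 0, data)
rank_corr = np.corrcoef(ranks, rowvar=False)

print(np.allclose(spearman_np, rank_corr))
```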
3. Code integration
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt


def cal_pearson(x, y):
    n = len(x)
    mean_x = np.mean(x)
    mean_y = np.mean(y)
    x_ = x - mean_x
    y_ = y - mean_y
    s_x = np.sqrt(np.sum(x_ * x_))
    s_y = np.sqrt(np.sum(y_ * y_))
    r = np.sum((x_ / s_x) * (y_ / s_y))
    t = r / np.sqrt((1 - r * r) / (n - 2))  # the p-value has to be looked up in a t table from this t value
    return r


if __name__ == '__main__':
    np.random.seed(0)

    # Experiment 1: y1 is x1 plus random noise
    x1 = np.random.random(100)
    y1 = np.random.random(100) + x1
    plt.scatter(x1, y1, marker='.')
    plt.show()
    pearson1, p1 = stats.pearsonr(x1, y1)
    r1 = cal_pearson(x1, y1)
    print(pearson1)
    print(r1)
    print()

    # Experiment 2: x2 and y2 are independent
    x2 = np.random.random(100)
    y2 = np.random.random(100)
    plt.scatter(x2, y2, marker='.')
    plt.show()
    pearson2, p2 = stats.pearsonr(x2, y2)
    r2 = cal_pearson(x2, y2)
    print(pearson2)
    print(r2)
    print()

    # Experiment 3: y3 is 50 * x3 plus random noise
    x3 = np.random.random(100)
    y3 = np.random.random(100) + x3 * 50
    plt.scatter(x3, y3, marker='.')
    plt.show()
    pearson3, p3 = stats.pearsonr(x3, y3)
    r3 = cal_pearson(x3, y3)
    print(pearson3)
    print(r3)
    print()

    # Experiment 4: Spearman correlation matrix between the columns of data
    data = np.random.random((10, 10))
    spearman_np, p_np = stats.spearmanr(data)
    print(spearman_np, p_np)