[Bioinformatics] Correlation analysis using Pearson correlation coefficient

Table of contents

1. Experiment introduction

2. Experimental environment

1. Configure the virtual environment

2. Library version introduction

3. IDE

3. Experimental content

0. Import necessary tools

1. cal_pearson (calculate Pearson correlation coefficient)

2. Main program

a. Experiment 1 (strong positive correlation):

b. Experiment 2 (almost no linear correlation):

c. Experiment 3 (very strong positive correlation):

d. Experiment 4 (Spearman correlation coefficient matrix):

3. Code integration


1. Experiment introduction

        This experiment implements a custom function for the Pearson correlation coefficient and uses it for correlation analysis, checking each result against SciPy.

        Correlation analysis is a common statistical method for evaluating the degree of association between two or more variables. This experiment uses two common measures: the Pearson correlation coefficient and the Spearman correlation coefficient. The Pearson coefficient measures the linear relationship between two continuous variables, while the Spearman coefficient is suited to evaluating any monotonic relationship between two variables, whether linear or not.
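
To make the difference between the two coefficients concrete, here is a minimal sketch. The exponential test data is an assumption chosen purely for illustration and is not part of the experiment: y increases with x everywhere, but not along a straight line.

```python
import numpy as np
from scipy import stats

# Illustrative data (an assumption, not from the experiment):
# y grows monotonically with x, but not linearly.
x = np.linspace(0.0, 5.0, 50)
y = np.exp(x)

pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)

print(pearson_r)     # clearly below 1: the points do not lie on a straight line
print(spearman_rho)  # 1 up to rounding: the rank orders of x and y agree exactly
```

Spearman only looks at the order of the values, so any strictly increasing transformation of y leaves it unchanged, while Pearson drops as the curve bends away from a line.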

2. Experimental environment

    This series of experiments reuses the PyTorch environment from the deep learning series of articles. The setup steps are as follows:

1. Configure the virtual environment

The environment of the deep learning series of articles

conda create -n DL python=3.7 
conda activate DL
pip install torch==1.8.1+cu102 torchvision==0.9.1+cu102 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
conda install matplotlib
conda install scikit-learn

New addition

conda install pandas
conda install seaborn
conda install networkx
conda install statsmodels
pip install pyHSICLasso

Note: In my environment the libraries were installed one by one in the order shown above. Installing them all in a single command has not been tested and may or may not cause problems.

2. Library version introduction

Package         This experiment    Latest version at the time of writing
matplotlib      3.5.3              3.8.0
numpy           1.21.6             1.26.0
python          3.7.16
scikit-learn    0.22.1             1.3.0
torch           1.8.1+cu102        2.0.1
torchaudio      0.8.1              2.0.2
torchvision     0.9.1+cu102        0.15.2

Newly added

Package         This experiment    Latest version at the time of writing
networkx        2.6.3              3.1
pandas          1.2.3              2.1.1
pyHSICLasso     1.4.2              1.4.2
seaborn         0.12.2             0.13.0
statsmodels     0.13.5             0.14.0
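
To compare your own environment against the versions listed above, a small check like the following may help. It is only a sketch: the helper name installed_version is hypothetical, and the PACKAGES list is an assumption covering the three libraries that the code in section 3 imports.

```python
import importlib

# Import names of the libraries used by the code in section 3 (an assumed list;
# extend it with any other package you want to check).
PACKAGES = ["numpy", "scipy", "matplotlib"]


def installed_version(name):
    """Return the package's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


versions = {name: installed_version(name) for name in PACKAGES}
for name, version in versions.items():
    print(name, version if version is not None else "not installed")
```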

3. IDE

        PyCharm is recommended (the pyHSICLasso library raises an error in VS Code, and a solution has not yet been found).

Win11 install Anaconda (2022.10) + PyCharm (2022.3/2023.1.4) + configure virtual environment (QomolangmaH's blog, CSDN): https://blog.csdn.net/m0_63834988/article/details/128693741

3. Experimental content

0. Import necessary tools

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

1. cal_pearson (calculate Pearson correlation coefficient)

def cal_pearson(x, y):
    n = len(x)

    mean_x = np.mean(x)
    mean_y = np.mean(y)

    x_ = x - mean_x
    y_ = y - mean_y

    s_x = np.sqrt(np.sum(x_ * x_))
    s_y = np.sqrt(np.sum(y_ * y_))

    r = np.sum((x_ / s_x) * (y_ / s_y))
    t = r / np.sqrt((1 - r * r) / (n - 2))  # the p-value must be looked up from this t value in a t-distribution table

    return r
  • Take the length of x, that is, the number of samples n.

  • Calculate the means of x and y.

  • Center both variables by subtracting their means, giving x_ and y_.

  • Calculate s_x and s_y, the square roots of the sums of squared deviations. These equal the standard deviations up to a common factor, which cancels in r.

  • Calculate the Pearson correlation coefficient r by dividing x_ and y_ element-wise by s_x and s_y, multiplying the results at corresponding positions, and summing.

  • Calculate the t value by dividing r by sqrt((1 - r^2) / (n - 2)). Here n - 2 is the number of degrees of freedom of the t distribution against which t is compared to obtain a p-value. The function computes t but does not use it further.

  • Return the Pearson correlation coefficient r.
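
The table lookup mentioned in the code comment can also be done numerically. The sketch below is a hypothetical variant of cal_pearson (the name pearson_with_p and the test data are assumptions, not the author's code) that turns t into a two-sided p-value with the survival function of Student's t distribution, scipy.stats.t.sf, and compares the result with scipy.stats.pearsonr.

```python
import numpy as np
from scipy import stats


def pearson_with_p(x, y):
    """Hypothetical variant of cal_pearson that also returns a two-sided p-value."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    x_ = x - x.mean()
    y_ = y - y.mean()
    r = np.sum(x_ * y_) / np.sqrt(np.sum(x_ * x_) * np.sum(y_ * y_))
    t = r / np.sqrt((1 - r * r) / (n - 2))
    # Two-sided tail probability of Student's t with n - 2 degrees of freedom.
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    return r, p


# Illustrative data (an assumption): a weak linear dependence plus uniform noise.
rng = np.random.RandomState(0)
x = rng.random_sample(100)
y = rng.random_sample(100) + 0.3 * x

r, p = pearson_with_p(x, y)
r_ref, p_ref = stats.pearsonr(x, y)
print(r, r_ref)
print(p, p_ref)
```

Note that the t formula divides by zero when r is exactly 1 or -1, so perfectly collinear input would need separate handling.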

2. Main program

a. Experiment 1 (strong positive correlation):

    x1 = np.random.random(100)
    y1 = np.random.random(100) + x1
    plt.scatter(x1, y1, marker='.')
    plt.show()
    pearson1, p1 = stats.pearsonr(x1, y1)
    r1 = cal_pearson(x1, y1)
    print(pearson1)
    print(r1)
    print()
  • Generate two random arrays x1 and y1 of length 100, where y1 is x1 plus uniform random noise.
  • Draw a scatter plot of x1 and y1.
  • Calculate the Pearson correlation coefficient and p-value of x1 and y1 with scipy.stats.pearsonr.
  • Calculate the same correlation coefficient with the custom cal_pearson function.

The two printed values agree up to floating-point rounding:

0.6991720710419989
0.6991720710419991
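
The two values printed above differ only in the last digits, most likely because the two implementations perform the same arithmetic in a different order. When checking results like this in code, it is safer to compare with a tolerance than with ==. A minimal sketch, using np.corrcoef as a third, independent reference (the seed and variable names are assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)  # fixed seed so the run is repeatable
x = rng.random_sample(100)
y = rng.random_sample(100) + x

r_scipy, _ = stats.pearsonr(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2x2 correlation matrix

# Compare with a tolerance instead of exact equality.
print(np.isclose(r_scipy, r_numpy))
```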

b. Experiment 2 (almost no linear correlation):

    x2 = np.random.random(100)
    y2 = np.random.random(100)
    plt.scatter(x2, y2, marker='.')
    plt.show()
    pearson2, p2 = stats.pearsonr(x2, y2)
    r2 = cal_pearson(x2, y2)
    print(pearson2)
    print(r2)
    print()

        Two independent random arrays x2 and y2 of length 100 are generated, so y2 does not depend on x2 at all. A scatter plot is drawn, and the Pearson correlation coefficient is calculated with both functions.

-0.11511730616773974
-0.11511730616773967
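
A value such as -0.115 does not indicate a real negative relationship: with only 100 samples, r fluctuates around 0 from run to run. The following sketch repeats the setup of Experiment 2 many times to show the size of that fluctuation. The number of trials and the seed are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 2000

# Each row is one repetition of Experiment 2: independent uniform x and y.
x = rng.random((trials, n))
y = rng.random((trials, n))

# Row-wise Pearson r, using the same centering and normalizing steps as cal_pearson.
x_ = x - x.mean(axis=1, keepdims=True)
y_ = y - y.mean(axis=1, keepdims=True)
r = (x_ * y_).sum(axis=1) / np.sqrt((x_ ** 2).sum(axis=1) * (y_ ** 2).sum(axis=1))

print(r.mean())            # close to 0
print(r.std())             # close to 1 / sqrt(n - 1)
print(1 / np.sqrt(n - 1))  # about 0.1005
```

With a spread of roughly 0.1, an observed r of about -0.115 lies only a little more than one standard deviation from zero.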

c. Experiment 3 (very strong positive correlation):

        Two random arrays x3 and y3 of length 100 are generated, where y3 is x3 scaled by 50 plus uniform random noise of the same size as in Experiment 1. The noise is therefore small relative to the signal, and the correlation is close to 1. A scatter plot is drawn, and the Pearson correlation coefficient is calculated with both functions.

    x3 = np.random.random(100)
    y3 = np.random.random(100) + x3 * 50
    plt.scatter(x3, y3, marker='.')
    plt.show()
    pearson3, p3 = stats.pearsonr(x3, y3)
    r3 = cal_pearson(x3, y3)
    print(pearson3)
    print(r3)
    print()
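
The values to expect in Experiments 1 and 3 can be worked out in advance. For y = a * x + e, where x and e are independent and have the same variance (which holds for two uniform draws from np.random.random), the population correlation is a / sqrt(a^2 + 1). The helper name expected_r below is hypothetical and the block is only a sketch of that formula:

```python
import numpy as np


def expected_r(a):
    """Population Pearson r for y = a * x + e, assuming x and e are
    independent and have the same variance."""
    return a / np.sqrt(a * a + 1.0)


print(expected_r(1))   # setup of Experiment 1: about 0.7071
print(expected_r(50))  # setup of Experiment 3: about 0.9998
```

This explains why scaling x by 50 pushes r so close to 1: the noise keeps its size while the signal becomes 50 times larger.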

d. Experiment 4 (Spearman correlation coefficient matrix):

        A random array data of shape (10, 10) is generated. scipy.stats.spearmanr treats each column as one variable and calculates the Spearman correlation coefficient and p-value between every pair of columns, so both results are 10 x 10 matrices. Both are printed.

    data = np.random.random((10, 10))
    spearman_np, p_np = stats.spearmanr(data)
    print(spearman_np, p_np)
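
The Spearman coefficient is the Pearson coefficient applied to the ranks of the data rather than to the raw values. The sketch below checks this for the matrix case by ranking each column with scipy.stats.rankdata and passing the ranks to np.corrcoef. The seed and the variable names are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
data = rng.random_sample((10, 10))  # rows are observations, columns are variables

rho, p = stats.spearmanr(data)

# Replace every column by the ranks of its values, then take Pearson on the ranks.
ranks = np.apply_along_axis(stats.rankdata, 0, data)
rho_from_ranks = np.corrcoef(ranks, rowvar=False)

print(rho.shape, p.shape)                # (10, 10) (10, 10)
print(np.allclose(rho, rho_from_ranks))  # True
```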

3. Code integration

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt


def cal_pearson(x, y):
    n = len(x)

    mean_x = np.mean(x)
    mean_y = np.mean(y)

    x_ = x - mean_x
    y_ = y - mean_y

    s_x = np.sqrt(np.sum(x_ * x_))
    s_y = np.sqrt(np.sum(y_ * y_))

    r = np.sum((x_ / s_x) * (y_ / s_y))
    t = r / np.sqrt((1 - r * r) / (n - 2))  # the p-value must be looked up from this t value in a t-distribution table

    return r


if __name__ == '__main__':
    np.random.seed(0)

    x1 = np.random.random(100)
    y1 = np.random.random(100) + x1
    plt.scatter(x1, y1, marker='.')
    plt.show()
    pearson1, p1 = stats.pearsonr(x1, y1)
    r1 = cal_pearson(x1, y1)
    print(pearson1)
    print(r1)
    print()

    x2 = np.random.random(100)
    y2 = np.random.random(100)
    plt.scatter(x2, y2, marker='.')
    plt.show()
    pearson2, p2 = stats.pearsonr(x2, y2)
    r2 = cal_pearson(x2, y2)
    print(pearson2)
    print(r2)
    print()

    x3 = np.random.random(100)
    y3 = np.random.random(100) + x3 * 50
    plt.scatter(x3, y3, marker='.')
    plt.show()
    pearson3, p3 = stats.pearsonr(x3, y3)
    r3 = cal_pearson(x3, y3)
    print(pearson3)
    print(r3)
    print()

    data = np.random.random((10, 10))
    spearman_np, p_np = stats.spearmanr(data)
    print(spearman_np, p_np)

Origin blog.csdn.net/m0_63834988/article/details/133497929