[Mathematical Modeling] - Correlation Coefficient

Part I: Calculating the Pearson correlation coefficient and descriptive statistics

In this lecture we introduce the two most commonly used correlation coefficients: the Pearson correlation coefficient and the Spearman rank correlation coefficient. Both measure the strength of the association between two variables. Which coefficient to compute and analyze depends on the conditions the data satisfy; choosing the wrong one is among the most common mistakes in modeling papers.

Population and sample:

Population Pearson correlation coefficient:

Formulas/definitions for the various terms in the Pearson correlation coefficient:

Population Pearson correlation coefficient:

Sample Pearson correlation coefficient (the denominator changes to n-1):
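For reference, the standard textbook definitions can be written as follows. This is a sketch: the notation (X_i, Y_i, n, E, \sigma, S) is an assumption and may differ from the symbols used in the course slides.

$$
\text{Population:}\quad
\mathrm{Cov}(X,Y)=\frac{\sum_{i=1}^{n}\bigl(X_i-E(X)\bigr)\bigl(Y_i-E(Y)\bigr)}{n},\qquad
\sigma_X=\sqrt{\frac{\sum_{i=1}^{n}\bigl(X_i-E(X)\bigr)^2}{n}},\qquad
\rho_{XY}=\frac{\mathrm{Cov}(X,Y)}{\sigma_X\,\sigma_Y}
$$

$$
\text{Sample:}\quad
\mathrm{Cov}(X,Y)=\frac{\sum_{i=1}^{n}\bigl(X_i-\bar{X}\bigr)\bigl(Y_i-\bar{Y}\bigr)}{n-1},\qquad
S_X=\sqrt{\frac{\sum_{i=1}^{n}\bigl(X_i-\bar{X}\bigr)^2}{n-1}},\qquad
r_{XY}=\frac{\mathrm{Cov}(X,Y)}{S_X\,S_Y}
$$

Both coefficients lie in [-1, 1], and the n or n-1 factors cancel in the ratio, so the choice only matters for the covariance and standard deviation themselves.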


Correlation visualization (SPSS version):

Common misunderstandings about the Pearson correlation coefficient: (before using the Pearson coefficient, we must confirm that the two variables are linearly related)

(Before using the Pearson correlation coefficient, we can draw a scatter plot first to judge whether the relationship is linear.)
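A minimal MATLAB sketch of why the linearity check matters. The data are made up for illustration: y is completely determined by x, yet the Pearson coefficient is essentially zero because the relationship is quadratic rather than linear.

```matlab
% Made-up data: y depends on x exactly, but not linearly
x = (-10:10)';
y = x.^2;

R = corrcoef(x, y);   % 2-by-2 correlation matrix
r = R(1,2);           % Pearson coefficient between x and y (approximately 0)

scatter(x, y)         % the scatter plot reveals the parabola immediately
fprintf('Pearson r = %.4f\n', r);
```

A coefficient near zero therefore only says "no linear relationship", not "no relationship".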

Interpreting the magnitude of the correlation coefficient:

Example: find the correlations among the physical fitness test results of eighth-grade girls:

Solution 1: Use MATLAB to compute descriptive statistics for each variable

(MATLAB functions used to compute the descriptive statistics)

Code:

Save the results to an Excel table:
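One possible way to do this from MATLAB is sketched below. The data matrix and the output file name are made up for illustration; writematrix is available from R2019a onward, and older releases can use xlswrite instead.

```matlab
% Made-up stand-in data: rows are students, columns are measured variables
S = [150 42; 155 45; 160 50; 158 47];

% Stack a few descriptive statistics row by row (one column per variable)
Result = [min(S); max(S); mean(S); median(S); std(S)];

outFile = 'descriptive_stats.xlsx';   % assumed output file name
writematrix(Result, outFile);         % writes the numeric matrix to the first sheet
```

Row and column labels can then be added in Excel by hand.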

Solution 2: Use spss software

a. Import data

b. Statistics

Analyze -> Descriptive Statistics -> Descriptives -> Ctrl+A (select all variables) -> Options (choose the statistics) -> OK -> wait for the output table to be generated

Calculating the Pearson correlation coefficient (R):

a: MATLAB provides the corrcoef function for computing correlation coefficients

R = corrcoef(data)
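A short usage sketch with made-up numbers (each row is an observation, each column a variable). Column 2 is exactly twice column 1, so their correlation should be 1.

```matlab
% Made-up data: 5 observations of 3 variables
data = [1  2 9;
        2  4 7;
        3  6 8;
        4  8 3;
        5 10 1];

R = corrcoef(data);   % R(i,j) is the Pearson correlation between columns i and j
disp(R)
```

The result is a symmetric matrix with ones on the diagonal, which is the table that gets formatted in Excel in the next step.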

Formatting the correlation coefficient table:

Import the data into Excel

1 Adjust the row height, font size, alignment (centered), and column width, and keep four decimal places

2 Apply a color scale so the table is easier to read: Home -> Conditional Formatting -> Color Scales (any scheme works; the picture above uses red-white-blue) -> Rule Type -> Edit Formatting Rule -> Minimum -> Number -> -1, Midpoint -> Number -> 0, Maximum -> Number -> 1.

 

b: Use SPSS to generate a scatterplot matrix for visualizing the correlations

Steps: Graphs -> Legacy Dialogs -> Scatter/Dot -> Matrix Scatter -> Define -> Ctrl+A (select all) and move the variables into Matrix Variables -> OK

Generated image:

                                                                                 

Part II: Hypothesis Testing

  1. Null hypothesis (H0): the hypothesis we want to test and possibly reject. By default it states that the observed phenomenon is due to chance, with no real effect or association. For example, we can assume that there is no association between campus traffic accidents and electric vehicle (EV) speeding, that is, H0: EV speeding is unrelated to campus traffic accidents.
  2. Alternative hypothesis (H1): the complement of the null hypothesis, which states that the observed phenomenon is caused by a real effect. In this example, H1: EV speeding is related to campus traffic accidents.
  3. Significance level (α): the largest probability of wrongly rejecting a true null hypothesis (a Type I error) that we are willing to accept. Common choices are 0.05 and 0.01. The appropriate level depends on the purpose of the study and on the conventions of the field.
  4. Test statistic: choose a test statistic suited to the research question and the data type. For the relationship between campus traffic accidents and EV speeding, methods such as a chi-square test or regression analysis can be used to evaluate the association.
  5. Calculate the p-value: compute the p-value from the chosen test statistic and the sample data. The p-value is the probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one actually observed.
  6. Make a decision: compare the p-value with the significance level. If the p-value is less than the significance level, reject the null hypothesis and regard the result as statistically significant evidence in favor of the alternative hypothesis. If the p-value is greater than or equal to the significance level, the null hypothesis cannot be rejected; this does not prove that it is true.

It is important to note that hypothesis testing is a method of statistical inference, and the results do not always lead to firm conclusions, but instead provide evidence against the null hypothesis. In addition, the reliability of hypothesis testing also depends on the quality of sample data collected, sample size and the satisfaction of other assumptions. Therefore, when performing hypothesis testing, it is necessary to interpret the results carefully and consider other relevant factors comprehensively.

If the p-value is less than the chosen α, we reject the null hypothesis.

If the p-value is greater than α, we cannot reject the null hypothesis.

In hypothesis testing, we can use one-sided or two-sided tests to assess the feasibility of the null hypothesis. The choice of these two tests depends on the research question and the direction of the expected effect.

  1. One-tailed test: in a one-tailed (one-sided) test, we only ask whether the effect is significant in one direction. It is appropriate when there is a clear theoretical basis or research purpose that fixes the direction of the effect in advance. For example, when studying whether a new drug lowers blood pressure, we only care whether it reduces blood pressure significantly, not whether it raises it. In a one-sided test, the entire significance level (α) is placed in one tail.
  2. Two-tailed test: in a two-tailed (two-sided) test, we ask whether the effect is significant in either direction. It is appropriate when we have no clear expectation about the direction and only want to know whether there is a significant effect. For example, when studying a new teaching method, we may not know whether it will raise or lower student achievement. In a two-sided test, the significance level (α) is split between the two tails, α/2 in each.

When performing a one-sided or two-sided test, we compare the calculated test statistic with the corresponding critical value. For a one-sided test, only the critical value of one tail matters; for a two-sided test, the critical values of both tails must be considered. If the test statistic falls in the rejection region (beyond the critical value), or equivalently if the p-value is less than the significance level (α), the null hypothesis is rejected and the result is considered statistically significant.

It should be noted that when choosing a one-sided test or a two-sided test, it should be determined according to the research question and expected effect. If there is a clear expected effect direction, one-sided test can be selected; if there is no clear expected effect direction, two-sided test can be selected.

(The above figure is a one-sided test)

For a two-sided test, the one-sided p-value must be multiplied by 2 before it is compared with α:

                                                           

Part III: Pearson Correlation Coefficient Hypothesis Test

Explanation of the magnitude of the correlation coefficient:

Hypothesis test for the Pearson correlation coefficient:

Steps:
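The usual form of this test can be sketched as follows. It is the standard textbook procedure and assumes the data come from a normal population; the symbols r (sample correlation), n (sample size) and \rho (population correlation) are notational assumptions.

$$
H_0:\ \rho = 0 \qquad \text{versus} \qquad H_1:\ \rho \neq 0
$$

$$
t = r\sqrt{\frac{n-2}{1-r^{2}}} \ \sim\ t(n-2) \quad \text{under } H_0
$$

$$
\text{Reject } H_0 \text{ at level } \alpha \text{ if } |t| > t_{1-\alpha/2}(n-2), \quad \text{or equivalently if } p = 2\,P\bigl(T_{n-2} \ge |t|\bigr) < \alpha
$$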

 

Find the critical value in MATLAB:
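A minimal sketch using tinv, the inverse cumulative distribution function of the t distribution (Statistics and Machine Learning Toolbox). The sample size and significance level are assumed values for illustration.

```matlab
n = 30;          % assumed sample size
alpha = 0.05;    % assumed significance level

% Two-sided critical value of the t distribution with n-2 degrees of freedom
tCrit = tinv(1 - alpha/2, n - 2);

% Equivalent critical value for |r|, obtained by solving t = r*sqrt((n-2)/(1-r^2)) for r
rCrit = tCrit / sqrt(n - 2 + tCrit^2);

fprintf('t critical = %.4f, |r| critical = %.4f\n', tCrit, rCrit);
```

With these values, a sample correlation whose absolute value exceeds rCrit is significant at the 5% level.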

A better approach: the p-value method:

Find the p-value in MATLAB:

When corrcoef is called with two outputs, the first is the correlation matrix and the second is the matrix of p-values

One-sided: 1 - tcdf(test statistic, degrees of freedom), where tcdf is the cumulative distribution function of the t distribution

Two-sided: one-sided result * 2
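The sketch below computes the p-value by hand and compares it with the second output of corrcoef. The paired sample is made up; tcdf requires the Statistics and Machine Learning Toolbox.

```matlab
% Made-up paired sample
x = [1 2 3 4 5 6 7 8]';
y = [2 1 4 3 6 5 8 7]';
n = numel(x);

[R, P] = corrcoef(x, y);   % R: correlation matrix, P: two-sided p-values
r = R(1,2);

% Test statistic with n-2 degrees of freedom
t = r * sqrt((n - 2) / (1 - r^2));

pOneSided = 1 - tcdf(abs(t), n - 2);   % upper-tail probability
pTwoSided = 2 * pOneSided;             % double it for the two-sided test

fprintf('r = %.4f, manual p = %.6f, corrcoef p = %.6f\n', r, pTwoSided, P(1,2));
```

The manual two-sided value should agree with P(1,2) up to rounding, which confirms that corrcoef reports two-sided p-values.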

Significance markers: typically *** for p < 0.01, ** for 0.01 <= p < 0.05, and * for 0.05 <= p < 0.1
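One way to turn a p-value matrix into star labels is sketched here. The matrix is made up, and the variable name stars is hypothetical.

```matlab
% Made-up symmetric p-value matrix
P = [1     0.004 0.03;
     0.004 1     0.2;
     0.03  0.2   1];

stars = repmat({''}, size(P));   % start with empty labels
stars(P < 0.1)  = {'*'};         % p < 0.1
stars(P < 0.05) = {'**'};        % p < 0.05 overrides the single star
stars(P < 0.01) = {'***'};       % p < 0.01 overrides the double star

disp(stars)
```

Applying the thresholds from loosest to strictest means each cell ends up with the strongest marker it qualifies for.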

Calculate the correlation coefficients and p-values between columns

It is also convenient to calculate the p-values with SPSS:

Generate the output with significance markers (SPSS generally shows at most two *):

                                                                            

Part IV: Conditions for the Pearson Correlation Coefficient Hypothesis Test

Normality test: Jarque-Bera (JB) test (large samples, n > 30)

Definition:

Skewness and kurtosis:
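As a reference, the standard definitions can be written as below. The symbols S (skewness), K (kurtosis), \mu, \sigma and n are notational assumptions.

$$
S = \frac{E\bigl[(X-\mu)^{3}\bigr]}{\sigma^{3}}, \qquad
K = \frac{E\bigl[(X-\mu)^{4}\bigr]}{\sigma^{4}}
$$

$$
JB = \frac{n}{6}\left[S^{2} + \frac{(K-3)^{2}}{4}\right]
$$

A normal distribution has S = 0 and K = 3, so JB is close to 0 for normal data. Under the null hypothesis of normality and for large n, JB approximately follows a \chi^{2}(2) distribution, and large values lead to rejection.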

JB test function in MATLAB: jbtest accepts only one column (a vector) at a time, so a loop is needed to test the data column by column and collect the result for each column

Code implementation (test data: eighth-grade girls' physical fitness test):

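% JB test
% jbtest can only test one column at a time
[h,p] = jbtest(S(:,1),0.05); % arguments: the column to test for normality, and the significance level alpha
[h,p] = jbtest(S(:,1),0.01);
% Run the JB test on every column
[r,c] = size(S);
% Preallocate the result vectors to save time
H = zeros(1,c);
P = zeros(1,c);
% Each call to jbtest checks only one column, so loop over all columns
for i=1:c
    [h,p] = jbtest(S(:,i),0.05);
    H(i) = h;
    P(i) = p;
end
disp(H)
disp(P)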

Output:

H is the test decision: it returns 0 if the null hypothesis (the data are normally distributed) cannot be rejected, and 1 if it is rejected

If the p-value is too small, jbtest returns 0.001, the smallest tabulated value (it can be regarded as 0)

Shapiro-Wilk test (small sample 3<=n<=50):

Check with SPSS

Test result:

Q-Q plot check for normality

Check whether the data points lie approximately on the straight line. Clear deviations from the line suggest that the data are not normally distributed (a separate Q-Q plot is generated for each column)

Q-Q plot function in MATLAB:

qqplot(data)
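A small sketch that draws one Q-Q plot per column. The data are made up (one normal column and one skewed column) so the two shapes can be compared; normrnd, exprnd and qqplot require the Statistics and Machine Learning Toolbox.

```matlab
rng(0);   % fix the seed so the made-up data are reproducible

% Column 1: normal sample, column 2: exponential (right-skewed) sample
data = [normrnd(0, 1, 100, 1), exprnd(1, 100, 1)];

figure;
nCols = size(data, 2);
for k = 1:nCols
    subplot(1, nCols, k);            % one panel per column
    qqplot(data(:, k));              % sample quantiles against normal quantiles
    title(sprintf('Column %d', k));
end
```

The points of the normal column should follow the reference line closely, while the skewed column bends away from it at the ends.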

In SPSS, the Q-Q plots of all columns can be displayed directly:

(The Q-Q plots are generated as part of the Shapiro-Wilk test procedure)

                                                                                  

Part V: Spearman Correlation Coefficient

Definition:

The Spearman correlation coefficient is computed from the ranks of the data: the values of each variable are sorted, every value is replaced by its rank, and the coefficient R is calculated from those ranks.
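In the usual textbook notation (an assumption here), with d_i the difference between the ranks of X_i and Y_i and n the sample size, the coefficient is:

$$
r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^{2}}{n\,(n^{2}-1)}
$$

Equal values receive the average of the ranks they occupy. Equivalently, r_s is the Pearson correlation coefficient of the two rank vectors; the two definitions coincide exactly when there are no ties.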

Two ways to compute the Spearman coefficient:

Code:
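A minimal sketch of both approaches on made-up data without ties: the rank-difference formula evaluated by hand, and the built-in corr with the 'Spearman' option. tiedrank and corr require the Statistics and Machine Learning Toolbox.

```matlab
% Made-up paired data without ties
x = [3 8 4 7 2]';
y = [5 10 9 6 3]';
n = numel(x);

% Method 1: rank-difference formula
d = tiedrank(x) - tiedrank(y);               % rank differences
rs1 = 1 - 6 * sum(d.^2) / (n * (n^2 - 1));

% Method 2: built-in function
rs2 = corr(x, y, 'type', 'Spearman');

fprintf('formula: %.4f, corr: %.4f\n', rs1, rs2);
```

Note that corr expects column vectors (or matrices whose columns are variables), which is why x and y are transposed.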

It can also be generated with SPSS:

 

Comparison of the Spearman and Pearson correlation coefficients:

Hypothesis testing for the Spearman correlation coefficient:

MATLAB function for the Spearman hypothesis test:
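A sketch with made-up data: corr returns the p-values as its second output. The large-sample normal approximation at the end is an assumption added for comparison (under H0, r_s * sqrt(n-1) is approximately standard normal); it will not match the p-value reported by corr exactly, because corr uses its own method.

```matlab
rng(1);                                  % reproducible made-up data
X = randn(40, 3);                        % 40 observations of 3 variables
X(:,2) = X(:,1) + 0.5 * randn(40, 1);    % make column 2 depend on column 1

% R: Spearman coefficients between columns, P: corresponding p-values
[R, P] = corr(X, 'type', 'Spearman');

% Large-sample approximation for the pair (1,2), two-sided
n = size(X, 1);
z = abs(R(1,2)) * sqrt(n - 1);
pApprox = 2 * (1 - normcdf(z));

fprintf('r_s = %.4f, corr p = %.3g, approximate p = %.3g\n', R(1,2), P(1,2), pApprox);
```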

 

Summary and comparison of the two correlation coefficients:

After obtaining a data set, we can first use SPSS to test whether it follows a normal distribution

Analyze -> Descriptive Statistics -> Explore -> add the variables -> Plots -> Normality plots with tests

If the p-value > 0.05, the null hypothesis of normality cannot be rejected, and the data are treated as normally distributed

If the p-value < 0.05, the data do not follow a normal distribution

Conclusion: the two sets of data do not follow a normal distribution

Reference code for the eighth-grade girls' physical fitness test data used in this article:

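clear; clc
% S holds the girls' data, B holds the boys' data
load 相关性系数\girl_data.mat
% Descriptive statistics
MIN = min(S);            % minimum
MAX = max(S);            % maximum
MEAN = mean(S);          % mean
MEDIAN = median(S);      % median
SKEWNESS = skewness(S);  % skewness
KURTOSIS = kurtosis(S);  % kurtosis
STD = std(S);            % standard deviation
Result = [MIN;MAX;MEAN;MEDIAN;SKEWNESS;KURTOSIS;STD];

% Correlation coefficients R and p-values P between columns
[R,P] = corrcoef(S);
% Significance marking by the p-value method
threeStars = P < 0.01;                 % cells to mark with three stars
twoStars = (P >= 0.01) & (P < 0.05);   % cells to mark with two stars
oneStar = (P >= 0.05) & (P < 0.10);    % cells to mark with one star
% Generate normally distributed random numbers (mean 2, standard deviation 3, 100-by-100 matrix)
x = normrnd(2,3,100);
% Skewness of each column
skewness(x);
% Kurtosis of each column
kurtosis(x);
% JB test
% jbtest can only test one column at a time
[h,p] = jbtest(S(:,1),0.05); % arguments: the column to test for normality, and the significance level alpha
[h,p] = jbtest(S(:,1),0.01);
% Run the JB test on every column
[r,c] = size(S);
% Preallocate the result vectors to save time
H = zeros(1,c);
P = zeros(1,c);          % note: this overwrites the correlation p-values computed above
% Each call to jbtest checks only one column, so loop over all columns
for i=1:c
    [h,p] = jbtest(S(:,i),0.05);
    H(i) = h;
    P(i) = p;
end
disp('H:')
disp(H)
disp('P:')
disp(P)
qqplot(S(:,1))
%~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
% Analyze the boys' data with the Spearman coefficient
% Number of rows and columns of the boys' data
[l,h] = size(B);
% Spearman correlation coefficients and their significance (p-values)
[R2,P2] = corr(B,'type','Spearman')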

The author mainly follows the Qingfeng Mathematical Modeling course, and some of the pictures are screenshots taken from the course videos.

Origin blog.csdn.net/weixin_73612682/article/details/131788104