Correlation coefficients in R

The correlation coefficient describes the relationship between quantitative variables.
Its sign indicates the direction of the relationship, and its absolute value indicates the strength of the relationship.
A common rule of thumb: below about 0.3 is a weak correlation, between 0.3 and 0.7 is a moderate correlation, and above 0.7 is a strong correlation.
The correlations described below are linear correlations; a result near zero only means there is no linear correlation between the variables, not that they are unrelated.

Types of correlation
R can compute a variety of correlation coefficients, including the Pearson correlation coefficient, the Spearman correlation coefficient, the Kendall correlation coefficient, partial correlation coefficients, and polychoric and polyserial correlation coefficients.
1. Pearson, Spearman, and Kendall correlations
The Pearson product-moment correlation coefficient measures the degree of linear association between two quantitative variables.
The Spearman rank-order correlation coefficient measures the degree of association between two rank-ordered (ordinal) variables.
Kendall's tau is a nonparametric measure of rank correlation.

Differences between the Pearson, Spearman, and Kendall correlations:
when a linear association exists between two continuous variables, use the Pearson product-moment correlation coefficient.
When the conditions for product-moment correlation analysis are not met (bivariate normality, linear association), use Spearman's rank correlation to describe how the variables change together.
The Spearman coefficient, also called the rank correlation coefficient, performs a correlation analysis on the ranks of the two variables; it places no requirements on the distribution of the original variables and is a nonparametric statistical method.
It is therefore applicable more broadly than the Pearson correlation. For data that suit the Pearson coefficient, the Spearman coefficient can also be computed, but with lower efficiency
(because Spearman ignores the actual values of the original variables and uses only each value's rank within the data).
Kendall's tau-b rank correlation coefficient reflects the association between categorical variables and applies when both categorical variables are ordinal.

1. For continuous variables whose distributions are unknown, or that are measured on unequal-interval scales, rank correlation is available (Pearson correlation can also be used); for discrete variables that are fully ranked, use rank correlation.
2. When the data do not follow a bivariate normal distribution, or the population distribution is unknown, or the raw data are already ranks, Spearman or Kendall correlation is appropriate.
3. Where the Kendall rank correlation coefficient is applicable, the correlations it yields tend to be smaller. For the usual case where the data are assumed to be normally distributed, use the default Pearson analysis.

Spearman's rank correlation studies the association between two variables based on their rank data. It is computed from the differences between the paired ranks of the two variables, so it is also called the "rank-difference method".
Spearman's correlation places less strict requirements on the data than the product-moment correlation: as long as the paired observations of the two variables are ranks, or are ranks converted from continuous observations,
Spearman correlation can be used regardless of the overall distributions of the two variables or the sample size.
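A quick way to see what "uses only ranks" means is that the Spearman coefficient is just the Pearson coefficient computed on the ranks. A minimal base-R sketch with made-up data:

```r
# Made-up data: Spearman correlation equals Pearson correlation on the ranks
x <- c(5, 1, 4, 2, 3)
y <- c(10, 2, 9, 1, 4)
cor(x, y, method = "spearman")   # uses only the ranks of the values
cor(rank(x), rank(y))            # same result: Pearson on the ranks
```

Both calls return the same value, which is why any rescaling of the raw values that preserves their order leaves the Spearman coefficient unchanged.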
Kendall's coefficients
Kendall's W coefficient, also called the coefficient of concordance, is a measure of correlation among multiple columns of rank variables. It is generally suited to data collected by ranking,
for example when K judges (raters) each rank N items, or one judge ranks N items on K occasions. Each rater assigns each of the N items a rank:
1 for the lowest, N for the highest. Tied items share the average of the ranks they occupy; for example, two items tied for first place occupy ranks 1 and 2,
so each gets rank 1.5. If one item is first, two are tied for second, and three are tied for third, their ranks are 1, 2.5, 2.5, 5, 5, 5, where 2.5 is the average of 2 and 3, and 5 is the average of 4, 5, and 6.
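As a side note, base R's rank() function implements exactly this tied-rank averaging:

```r
# one item first, two tied for second, three tied for third:
rank(c(1, 2, 2, 4, 4, 4))
# ranks: 1.0 2.5 2.5 5.0 5.0 5.0
```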
Kendall's U coefficient, also called the coefficient of consistency, is likewise a measure of correlation among multiple columns of rank variables.
This method also applies when K judges (raters) each assess N items, or one judge assesses N items on K occasions,
but the assessment is done by pairwise comparison: each rater compares the N items two at a time. The results are entered into an N-by-N table
(the diagonal cells may contain anything): enter 1 if item i is preferred to item j, 0 if item j is preferred to item i, and 0.5 if they are judged equal. Each rater produces one such table, giving K tables in total;
stacking the K tables and summing the entries in corresponding positions gives the data used in the calculation, denoted γij.

Computing the three correlation coefficients
The cor() function can compute all three correlation coefficients,
and the cov() function can be used to compute covariances.
The format of the cor() function is
cor(x, use =, method =)
The parameter x is a matrix or data frame.
The parameter use specifies how missing data are handled. The options are all.obs (assumes there are no missing data; an error is raised if any are encountered),
everything (when missing data are encountered, the corresponding correlation coefficients are set to missing), complete.obs (listwise deletion), and pairwise.complete.obs (pairwise deletion).
The parameter method specifies the type of correlation; the options are pearson, spearman, and kendall.
The defaults are use = "everything" and method = "pearson".
example

> # use the data in the state.x77 dataset
> states <- state.x77[,1:6]
> options(digits = 2)
> # compute the covariance matrix
> cov(states)
           Population Income Illiteracy Life Exp Murder HS Grad
Population   19931684 571230     292.87  -407.84 5663.5 -3551.5
Income         571230 377573    -163.70   280.66 -521.9  3076.8
Illiteracy        293   -164       0.37    -0.48    1.6    -3.2
Life Exp         -408    281      -0.48     1.80   -3.9     6.3
Murder           5664   -522       1.58    -3.87   13.6   -14.5
HS Grad         -3552   3077      -3.24     6.31  -14.5    65.2
> # compute the correlation matrix
> cor(states)
           Population Income Illiteracy Life Exp Murder HS Grad
Population      1.000   0.21       0.11   -0.068   0.34  -0.098
Income          0.208   1.00      -0.44    0.340  -0.23   0.620
Illiteracy      0.108  -0.44       1.00   -0.588   0.70  -0.657
Life Exp       -0.068   0.34      -0.59    1.000  -0.78   0.582
Murder          0.344  -0.23       0.70   -0.781   1.00  -0.488
HS Grad        -0.098   0.62      -0.66    0.582  -0.49   1.000
> # compute Spearman correlation coefficients
> cor(states,method = "spearman")
           Population Income Illiteracy Life Exp Murder HS Grad
Population       1.00   0.12       0.31    -0.10   0.35   -0.38
Income           0.12   1.00      -0.31     0.32  -0.22    0.51
Illiteracy       0.31  -0.31       1.00    -0.56   0.67   -0.65
Life Exp        -0.10   0.32      -0.56     1.00  -0.78    0.52
Murder           0.35  -0.22       0.67    -0.78   1.00   -0.44
HS Grad         -0.38   0.51      -0.65     0.52  -0.44    1.00

Note that these results do not indicate whether the correlation coefficients differ significantly from zero; for that, a significance test of the correlation coefficients is needed.

Partial correlation
A partial correlation is the correlation between two quantitative variables while controlling for one or more other quantitative variables.
In a multivariate setting, the relationship between one variable and another may also be influenced by a third variable;
for example, the relationship between the amount of fertilizer and the crop yield may also be affected by the weather and by the fertility of the land.
Simply computing the correlation coefficient between fertilizer and yield cannot truly reflect the strength of their relationship.
If instead we hold the weather conditions and the fertility of the land constant, the correlation between fertilizer and yield will be closer to the truth;
that correlation is the partial correlation coefficient.
Computing a partial correlation
The pcor() function in the ggm package computes partial correlations. The format is
pcor(u, s)
The parameter u is a numeric vector: the first two entries are the indices of the variables whose correlation is to be computed, and the remaining entries are the indices of the conditioning variables (the variables whose influence is to be excluded).
The parameter s is the covariance matrix of the variables.
example

> library(ggm)
> colnames(states)
[1] "Population" "Income"     "Illiteracy" "Life Exp"   "Murder"     "HS Grad"   
> pcor(c(1,5,2,3,6),cov(states))
[1] 0.35
> 

Other types of correlation
The hetcor() function in the polycor package can compute a heterogeneous correlation matrix, containing Pearson product-moment correlations between numeric variables,
polyserial correlations between numeric and ordinal variables, polychoric correlations between ordinal variables, and tetrachoric correlations between two dichotomous variables.
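A minimal sketch of hetcor(), assuming the polycor package is installed; the ordinal variable here is invented by binning Illiteracy into three ordered levels:

```r
library(polycor)  # assumed installed
# mix a numeric variable with a made-up ordinal one
df <- data.frame(income = state.x77[, "Income"],
                 illit  = cut(state.x77[, "Illiteracy"], 3,
                              ordered_result = TRUE))
hetcor(df)  # Pearson for numeric pairs, polyserial for mixed pairs
```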

2. Testing correlations for significance
Significance of a correlation: whether the statistical relationship between the two (or more) variables is significant; generally p < 0.05 is required.
(If p > 0.05, there is no point discussing the correlation, no matter how strong the coefficient is.)
The cor.test() function tests a single Pearson, Spearman, or Kendall correlation coefficient.
The format of cor.test() is
cor.test(x, y, alternative =, method =)
The parameters x and y are the variables whose correlation is to be tested.
alternative specifies a two-sided or one-sided test (values: two.sided, less, greater).
method specifies the type of correlation to compute ("pearson", "kendall", or "spearman").

When the research hypothesis is that the population correlation is less than 0, use alternative = "less".
When the research hypothesis is that the population correlation is greater than 0, use alternative = "greater".
When the research hypothesis is that the population correlation is not 0, use alternative = "two.sided" (the default).
Note that cor.test() can test only one correlation at a time; unlike cor(), it cannot compute a correlation coefficient matrix for several variables at once.
examples

>cor.test(states[,3],states[,5])

	Pearson's product-moment correlation

data:  states[, 3] and states[, 5]
t = 7, df = 48, p-value = 1e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.53 0.82
sample estimates:
cor 
0.7 

> cor(states[,3],states[,5])
[1] 0.7
# cor.test(states[,3],states[,5]) returns p-value = 1e-08
# if the population correlation were 0, a sample correlation as large as 0.7 would be expected only about once in 100 million samples (i.e., p = 1e-08)
# 0.7 is the sample correlation

The corr.test() function in the psych package can compute correlation coefficient matrices and significance-level matrices for many variables at once.
examples

> states <- state.x77[,1:6]
> library(psych)
Warning message:
package 'psych' was built under R version 3.6.1 
> corr.test(states,use="complete")
Call:corr.test(x = states, use = "complete")
Correlation matrix 
           Population Income Illiteracy Life Exp Murder HS Grad
Population       1.00   0.21       0.11    -0.07   0.34   -0.10
Income           0.21   1.00      -0.44     0.34  -0.23    0.62
Illiteracy       0.11  -0.44       1.00    -0.59   0.70   -0.66
Life Exp        -0.07   0.34      -0.59     1.00  -0.78    0.58
Murder           0.34  -0.23       0.70    -0.78   1.00   -0.49
HS Grad         -0.10   0.62      -0.66     0.58  -0.49    1.00
Sample Size 
[1] 50
Probability values (Entries above the diagonal are adjusted for multiple tests.) 
           Population Income Illiteracy Life Exp Murder HS Grad
Population       0.00   0.59       1.00      1.0   0.10       1
Income           0.15   0.00       0.01      0.1   0.54       0
Illiteracy       0.46   0.00       0.00      0.0   0.00       0
Life Exp         0.64   0.02       0.00      0.0   0.00       0
Murder           0.01   0.11       0.00      0.0   0.00       0
HS Grad          0.50   0.00       0.00      0.0   0.00       0

 To see confidence intervals of the correlations, print with the short=FALSE option
> head(states)
           Population Income Illiteracy Life Exp Murder HS Grad
Alabama          3615   3624        2.1       69   15.1      41
Alaska            365   6315        1.5       69   11.3      67
Arizona          2212   4530        1.8       71    7.8      58
Arkansas         2110   3378        1.9       71   10.1      40
California      21198   5114        1.1       72   10.3      63
Colorado         2541   4884        0.7       72    6.8      64

Other significance tests
Significance test for a partial correlation coefficient
Under the assumption of multivariate normality, the pcor.test() function in the ggm package can be used to test whether two variables are independent when one or more additional variables are controlled.
The format of pcor.test() is
pcor.test(r, q, n)
The parameter r is the partial correlation coefficient obtained from pcor(), q is the number of variables being controlled (given as a number), and n is the sample size.
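Putting pcor() and pcor.test() together, a sketch (assuming the ggm package is installed) that tests the partial correlation between Population and Murder computed earlier:

```r
library(ggm)  # assumed installed
states <- state.x77[, 1:6]
# partial correlation of variables 1 (Population) and 5 (Murder),
# controlling for variables 2, 3, 6 (Income, Illiteracy, HS Grad)
r <- pcor(c(1, 5, 2, 3, 6), cov(states))
# r = the partial correlation, q = 3 controlled variables, n = 50 states
pcor.test(r, 3, 50)
```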
The r.test() function in the psych package provides several useful significance-testing methods. It can be used to test:
the significance of a single correlation coefficient; whether the difference between two independent correlation coefficients is significant; whether the difference between two dependent correlation coefficients sharing one variable is significant;
and whether the difference between two dependent correlation coefficients based on entirely different variables is significant.
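The simplest of these uses, testing a single correlation coefficient for significance, can be sketched as follows (assuming the psych package is installed):

```r
library(psych)  # assumed installed
# is a sample correlation of 0.7 from n = 50 observations
# significantly different from zero?
r.test(n = 50, r12 = 0.7)
```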

4. The t-test
A common research task is comparing the behavior of two groups.
Do patients receiving a new drug show greater improvement than patients on an existing drug? Does one manufacturing process produce fewer defective items than another?
In such cases one variable is dichotomous (a categorical variable with only two levels) and the other, the outcome variable, is continuous and assumed to be normally distributed. A t-test can then be used as the significance test.
Because the t-test generally assumes normality, check the data for normality before running it.
Ways to check whether the data follow a normal distribution:
1. Draw a histogram and see whether it looks like a bell curve.
2. Draw a Q-Q plot and see whether the points roughly follow the diagonal reference line; if they do, the data are close to normally distributed.
3. Use shapiro.test(); this approach suits relatively small samples (N < 20).
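The three checks above can be sketched in base R, using the Murder column of state.x77 as example data:

```r
x <- state.x77[, "Murder"]
hist(x)                  # 1. histogram: look for a bell shape
qqnorm(x); qqline(x)     # 2. Q-Q plot: points should hug the line
shapiro.test(x)          # 3. Shapiro-Wilk test: p > 0.05 is
                         #    consistent with normality
```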
1. Independent-samples t-test
A two-sample t-test on independent samples can be used to test the hypothesis that two population means are equal (that is, to check whether the hypothesis of equal population means holds; if the test is not significant, the hypothesis stands).
It assumes the two groups of data are independent and drawn from normal populations.
The format of the t-test is
t.test(y ~ x, data)
where the parameter y is a numeric variable and x is a dichotomous variable.
The t-test may also be written as
t.test(y1, y2)
where the parameters y1 and y2 are numeric vectors (the outcome variable for each group). The optional data argument takes a data frame or matrix containing the variables.
Unlike most other statistical software, R's t-test assumes unequal variances by default and applies the Welch degrees-of-freedom correction. Add the argument var.equal = TRUE to assume equal variances instead.
example

> library(MASS)
> t.test(Prob~So,data = UScrime)

	Welch Two Sample t-test

data:  Prob by So
t = -4, df = 25, p-value = 7e-04
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.039 -0.012
sample estimates:
mean in group 0 mean in group 1 
          0.039           0.064 
# the returned p-value is below 0.001, so the null hypothesis can be rejected
# note: since the outcome is a proportion, you might try a normalizing transformation before running the t-test
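For comparison, a sketch of the same test under the equal-variance assumption (the classical pooled-variance t-test):

```r
library(MASS)  # for the UScrime data
t.test(Prob ~ So, data = UScrime, var.equal = TRUE)  # pooled variances
```

The Welch version above is the safer default; var.equal = TRUE is only appropriate when the two group variances can reasonably be assumed equal.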

2. Dependent-samples t-test
For example, when testing whether the unemployment rate for younger males is higher than that for older males, the two groups of data are not independent: you cannot claim there is no relationship between the unemployment rates of younger and older men.
When the observations in the two groups are related, you have a dependent-groups design. Pre-post designs and repeated-measures designs likewise produce dependent groups.
The dependent-samples t-test assumes that the differences between the groups are normally distributed.
The format is
t.test(y1, y2, paired = TRUE)
where y1 and y2 are the numeric vectors for the two dependent groups.
example

> library(MASS)
> sapply(UScrime[c("U1","U2")],function(x)(c(mean=mean(x),sd=sd(x))))
     U1   U2
mean 95 34.0
sd   18  8.4
> t.test(UScrime$U1,UScrime$U2,paired = TRUE)

	Paired t-test

data:  UScrime$U1 and UScrime$U2
t = 32, df = 46, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 58 65
sample estimates:
mean of the differences 
                     61 

For more on the t-test, see the separate articles detailing t-tests in R and the t.test formula interface.
When there are more than two groups
If you want to compare more than two groups, and can assume the data are independent samples from normal populations, use analysis of variance (ANOVA).

Five non-parametric test group differences
parameter t test or ANOVA if the data can not meet the assumptions, non-parametric methods may be used instead. For example, if it is serious outcome variable bias or present relationship, in essence, an orderly, then you can use the difference between the groups of non-parametric tests
when reasonable assumptions t-test, efficacy parameters tested stronger (easier to detect differences exist). Rather than parametric tests when very unreasonable assumptions (such as for the level ordinal data) is more suitable for
1, two groups were compared
when two independent sets of data, you can use the Wilcoxon rank sum test (better known name is Mann-Whitney U test ) the format is
wilcox.test (y ~ x, data)
or
wilcox.test (y1, y2)
example of using Mann-Whitney U test to answer questions about the north and south of the state incarceration rate

> with(UScrime,by(Prob,So,median))
So: 0
[1] 0.038
------------------------------------------------------------------- 
So: 1
[1] 0.056
> wilcox.test(Prob~So,data=UScrime)

	Wilcoxon rank sum test

data:  Prob by So
W = 81, p-value = 8e-05
alternative hypothesis: true location shift is not equal to 0

Judging from the returned p-value, the hypothesis that southern and non-southern states have the same incarceration rate can be rejected (p < 0.001).
Nonparametric test for dependent samples
Example

> sapply(UScrime[c("U1","U2")],median)
U1 U2 
92 34 
> with(UScrime,wilcox.test(U1,U2,paired = TRUE))

	Wilcoxon signed rank test with continuity correction

data:  U1 and U2
V = 1128, p-value = 2e-09
alternative hypothesis: true location shift is not equal to 0

Comparing more than two groups
If the outcome variable does not satisfy the assumptions of an ANOVA design, nonparametric methods can be used to assess between-group differences:
if the groups are independent, the Kruskal-Wallis test can be used; if the groups are not independent, use the Friedman test.
The calling format of the Kruskal-Wallis test is:
kruskal.test(y ~ A, data)
The format of the Friedman test is
friedman.test(y ~ A | B, data)
where y is the numeric outcome variable, A is the grouping variable, and B is a blocking variable that identifies matched sets of observations.
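A minimal made-up sketch of the Friedman format, with 4 raters (the blocking variable B) each scoring 3 treatments (the grouping variable A):

```r
# made-up scores: 4 raters each rank treatments A, B, C
y         <- c(1, 2, 3,  2, 3, 1,  1, 3, 2,  1, 2, 3)
treatment <- gl(3, 1, 12, labels = c("A", "B", "C"))  # grouping variable
rater     <- gl(4, 3, 12)                             # blocking variable
friedman.test(y ~ treatment | rater)
```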
An example using the Kruskal-Wallis test to answer a question about illiteracy rates:

> # first add the region names to the dataset; this information is in the state.region dataset distributed with the base R installation
> states <- data.frame(state.region,state.x77)
> head(states)
           state.region Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
Alabama           South       3615   3624        2.1       69   15.1      41    20  50708
Alaska             West        365   6315        1.5       69   11.3      67   152 566432
Arizona            West       2212   4530        1.8       71    7.8      58    15 113417
Arkansas          South       2110   3378        1.9       71   10.1      40    65  51945
California         West      21198   5114        1.1       72   10.3      63    20 156361
Colorado           West       2541   4884        0.7       72    6.8      64   166 103766
> kruskal.test(Illiteracy~state.region,data=states)

	Kruskal-Wallis rank sum test

data:  Illiteracy by state.region
Kruskal-Wallis chi-squared = 23, df = 3, p-value = 5e-05

This example rejects the null hypothesis of no difference, but the test does not tell you which regions differ significantly from the others. To find out, you can:
1. use the Wilcoxon test to compare two groups at a time, or
2. use a helper such as the wmc() function to run Wilcoxon tests on every pair of groups at once, adjusting the probability values with p.adjust().
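Base R also provides pairwise.wilcox.test(), which performs both steps at once; a sketch on the same illiteracy data:

```r
states <- data.frame(state.region, state.x77)
# Wilcoxon test for every pair of regions, p-values Holm-adjusted
pairwise.wilcox.test(states$Illiteracy, states$state.region,
                     p.adjust.method = "holm")
```

(The data contain ties, so R will warn that exact p-values cannot be computed; the adjusted approximate p-values are still reported.)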

Summary:
R can compute several correlation coefficients, including the Pearson correlation coefficient, the Spearman correlation coefficient, and the Kendall correlation coefficient.
The cor() function computes correlation coefficients, and cov() computes covariances.
The pcor() function in the ggm package computes partial correlation coefficients.
Significance tests of correlations:
Significance of a correlation means the statistical relationship between the two (or more) variables is significant; generally p < 0.05 is required.
The cor.test() function tests a single Pearson, Spearman, or Kendall correlation coefficient for significance.
The corr.test() function in the psych package computes many correlation coefficients and significance levels at once, producing matrices.
The pcor.test() function tests the significance of a partial correlation coefficient.
t-test:
When one variable is dichotomous (a categorical variable with only two levels) and the outcome variable is continuous and assumed normally distributed, a t-test can be used for significance testing.
The format of the t-test is
t.test(y ~ x, data)

A good article describing the correlation coefficient is linked here: correlation coefficient


Origin blog.csdn.net/weixin_42712867/article/details/99574176