Correlation Analysis and Heat Mapping

1. What is correlation analysis?

Correlation analysis measures how closely two or more variables are related. It is only meaningful when there is some plausible connection between the variables being compared. In omics sequencing (for example, transcriptomics), each condition is sampled with several biological replicates, and a correlation analysis across those replicates shows whether the replicate data are consistent enough to use in subsequent analysis. If one replicate disagrees with the others, it can be removed so that a single unusable replicate does not distort the results.
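As a rough illustration of the replicate check described above, the sketch below simulates four replicates, one of which is deliberately unrelated to the others, and flags any replicate whose median correlation with the rest falls below a cut-off. The simulated data, the variable names and the 0.8 threshold are all assumptions made for this example, not values from the original article.

```r
# Simulated expression values: 200 genes, four replicates of one condition.
set.seed(42)
base <- rnorm(200, mean = 8, sd = 2)       # shared expression profile
expr <- data.frame(
  rep1 = base + rnorm(200, sd = 0.3),
  rep2 = base + rnorm(200, sd = 0.3),
  rep3 = base + rnorm(200, sd = 0.3),
  rep4 = rnorm(200, mean = 8, sd = 2)      # deliberately unrelated "failed" replicate
)

# Correlation between replicates (columns); ignore the diagonal of 1s.
cor_mat <- cor(expr, method = "pearson")
diag(cor_mat) <- NA

# Median correlation of each replicate with the others.
median_cor <- apply(cor_mat, 1, median, na.rm = TRUE)

threshold <- 0.8                           # hypothetical cut-off
flagged <- names(median_cor)[median_cor < threshold]
flagged
```

The median is used instead of the mean so that one bad replicate does not drag down the score of the good ones.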

There are three common correlation analysis methods: Pearson correlation coefficient, Spearman correlation coefficient and Kendall correlation coefficient.

Table 1 Strength of correlation by absolute value of the coefficient

Absolute value of coefficient    Strength of correlation
0.0-0.2                          Very weak or no correlation
0.2-0.4                          Weak correlation
0.4-0.6                          Moderate correlation
0.6-0.8                          Strong correlation
0.8-1.0                          Very strong correlation
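The bands in Table 1 can be applied in code with a small lookup. The sketch below is only one way to do it: `correlation_strength` is a hypothetical helper name, the top band is labelled "very strong" here, and a value that sits exactly on a boundary is assigned to the higher band, which is an assumption because the table does not say which side a boundary belongs to.

```r
# Label the strength of a correlation coefficient using the bands in Table 1.
# `correlation_strength` is a hypothetical helper, not part of any package.
correlation_strength <- function(r) {
  labels <- c("very weak or none", "weak", "moderate", "strong", "very strong")
  # The bands apply to |r|. right = FALSE makes each band [lower, upper);
  # include.lowest = TRUE lets |r| = 1 fall into the last band.
  as.character(cut(abs(r),
                   breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1),
                   labels = labels,
                   right = FALSE,
                   include.lowest = TRUE))
}

correlation_strength(c(0.05, -0.35, 0.5, 0.7, -0.95, 1))
```

Because the absolute value is taken first, a coefficient of -0.95 is reported as just as strong as +0.95; the sign only gives the direction.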

1 Pearson correlation coefficient (Pearson)

The Pearson correlation coefficient, also known as the linear correlation coefficient or product-moment correlation coefficient, was proposed by the British statistician Karl Pearson in the 1890s. It is a statistic that reflects the degree of linear correlation between two variables, and it is best suited to data that follow a normal distribution. The coefficient is denoted by r and is computed from the sample size n and the observed values and means of the two variables. The value of r lies between -1 and +1, and the larger its absolute value, the stronger the correlation. If r>0, the two variables are positively correlated: the larger the value of one variable, the larger the value of the other. If r<0, the two variables are negatively correlated: the larger the value of one variable, the smaller the value of the other. If r=0, there is no linear correlation between the two variables. Note that a correlation does not imply a causal relationship.

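The calculation formula is:

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

where n is the sample size, $x_i$ and $y_i$ are the sample values of the two variables, and $\bar{x}$ and $\bar{y}$ are their sample means.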
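Pearson's r is the sum of the products of the deviations from the two means, divided by the product of the square roots of the summed squared deviations. The sketch below writes that out term by term and compares it with R's built-in `cor()`. The numbers in `x` and `y` are made up for illustration, and `pearson_manual` is a hypothetical helper name.

```r
# Hypothetical example values, chosen only for illustration.
x <- c(2.1, 3.4, 4.0, 5.6, 7.2, 8.1)
y <- c(1.8, 3.9, 4.1, 6.0, 6.8, 9.0)

# Pearson r written out from its definition.
pearson_manual <- function(x, y) {
  dx <- x - mean(x)                  # deviations of x from its mean
  dy <- y - mean(y)                  # deviations of y from its mean
  sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))
}

r_manual  <- pearson_manual(x, y)
r_builtin <- cor(x, y, method = "pearson")

# Both routes should agree up to floating-point tolerance.
all.equal(r_manual, r_builtin)
```

In practice `cor()` is the function to use; the manual version is only there to make the structure of the formula visible.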

2 Spearman correlation coefficient (Spearman)

The Spearman correlation coefficient, also known as the Spearman rank correlation coefficient, applies the linear correlation calculation to the ranks of the two variables rather than to the actual data values, so it measures how consistently one variable increases or decreases with the other. Compared with the Pearson correlation coefficient, it is a non-parametric method and has a wider scope of application. It is often represented by the Greek letter ρ.

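Its calculation formula is:

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)}$$

where $d_i$ is the difference between the ranks of the two values in the i-th pair of observations (x, y), and n is the number of observation pairs. This form of the formula is exact when there are no tied ranks.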
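The rank-difference form of Spearman's coefficient, 1 - 6Σd²/(n(n²-1)), can be checked against R's built-in calculation. The sketch below uses made-up values with no ties; `spearman_manual` is a hypothetical helper name. It also shows that the same number comes out of a Pearson correlation computed on the ranks.

```r
# Hypothetical example values with no tied ranks.
x <- c(10, 25, 31, 47, 52, 68)
y <- c(3.2, 2.9, 4.8, 5.1, 7.5, 6.9)

# Spearman rho from the rank differences d_i.
spearman_manual <- function(x, y) {
  d <- rank(x) - rank(y)             # rank difference for each pair
  n <- length(x)                     # number of observation pairs
  1 - 6 * sum(d^2) / (n * (n^2 - 1))
}

rho_manual  <- spearman_manual(x, y)
rho_builtin <- cor(x, y, method = "spearman")
rho_ranks   <- cor(rank(x), rank(y), method = "pearson")

all.equal(rho_manual, rho_builtin)
all.equal(rho_manual, rho_ranks)
```

With tied values the shortcut formula is only approximate, whereas `cor(..., method = "spearman")` still works because it correlates the averaged ranks directly.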

3 Kendall correlation coefficient (Kendall)

The Kendall correlation coefficient, usually written with the Greek letter τ, is also a rank correlation coefficient. It measures the degree of association between two ordinal or ranked variables and, like the Spearman coefficient, is a non-parametric statistic.
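The article gives no formula for Kendall's coefficient. As background, the usual definition for data without ties counts concordant pairs (both variables change in the same direction) and discordant pairs (they change in opposite directions) and divides their difference by the total number of pairs. The sketch below implements that definition on made-up values and compares it with `cor(..., method = "kendall")`; `kendall_manual` is a hypothetical helper name.

```r
# Hypothetical example values with no ties.
x <- c(10, 25, 31, 47, 52, 68)
y <- c(3.2, 2.9, 4.8, 5.1, 7.5, 6.9)

# Kendall tau from concordant and discordant pairs (no-ties definition).
kendall_manual <- function(x, y) {
  n <- length(x)
  concordant <- 0
  discordant <- 0
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      s <- sign(x[i] - x[j]) * sign(y[i] - y[j])
      if (s > 0) concordant <- concordant + 1   # same direction
      if (s < 0) discordant <- discordant + 1   # opposite direction
    }
  }
  (concordant - discordant) / (n * (n - 1) / 2)
}

tau_manual  <- kendall_manual(x, y)
tau_builtin <- cor(x, y, method = "kendall")
all.equal(tau_manual, tau_builtin)
```

When the data contain ties, `cor()` applies a tie correction, so the two values would no longer be expected to match exactly.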

2. The code

1 Data preparation

Data input format (csv format):
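The original illustration of the input file is not available here, so the layout below is an assumption inferred from the import call used later (`row.names = 1, header = TRUE, sep = ','`): a header row of sample names, a first column of row labels, and numeric values everywhere else. Because `cor()` correlates columns, the items to be compared (here, samples) are placed in columns. All names and numbers are hypothetical.

```r
# Hypothetical input: five genes (rows) measured in six samples (columns).
csv_text <- "gene,A1,A2,A3,B1,B2,B3
gene1,10.2,11.0,9.8,20.5,21.1,19.7
gene2,5.1,4.8,5.5,2.2,2.0,2.6
gene3,7.7,8.1,7.9,7.5,8.3,8.0
gene4,1.2,1.0,1.4,6.8,7.1,6.5
gene5,15.3,14.9,15.8,9.9,10.4,10.1"

# Write the text to a temporary file and read it back the same way the
# tutorial reads its own CSV file.
tmp <- tempfile(fileext = ".csv")
writeLines(csv_text, tmp)
example <- read.table(file = tmp, row.names = 1, header = TRUE, sep = ",")

dim(example)        # rows x columns
colnames(example)   # sample names taken from the header row
```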

2 R package loading and data import

# Install the packages (only needed once)
install.packages("corrplot")
install.packages("ggcorrplot")
install.packages("psych")
install.packages("vcd")
install.packages("ggrepel")  # needed for library(ggrepel) below

# Load the packages
library(corrplot)
library(ggplot2)
library(ggcorrplot)
library(vcd)
library(psych)
library(ggrepel)

# Import the data
data <- read.table(file = 'C:/Rdata/jc/相关性热图数据.csv', row.names = 1, header = TRUE, sep = ',')
dim(data)
data <- as.matrix(data)          # convert the data set to a matrix so that corrplot can use it
data <- data.frame(scale(data))  # standardise the data
head(data)

# Compute the correlation matrix
data <- cor(data, method = "spearman")  # method can be "pearson", "spearman" or "kendall"
round(data, 2)                          # show the coefficients to two decimal places
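The plotting calls later in this article mention a p-value matrix (`p.mat = p_mat`, `p.mat = res1$p`) but never define one. The sketch below shows one way such a matrix could be built with base R's `cor.test()`. `cor_pmat_sketch` is a hypothetical helper name and the demo data are simulated. The p-values have to be computed from the sample data, that is, before `data` is overwritten by the correlation matrix. The corrplot and ggcorrplot packages also ship helpers for this purpose (`cor.mtest()` and `cor_pmat()` respectively).

```r
# Pairwise p-values for the correlations between the columns of a data frame.
# `cor_pmat_sketch` is a hypothetical helper, not part of any package.
cor_pmat_sketch <- function(df, method = "pearson") {
  vars <- colnames(df)
  p <- matrix(0, nrow = length(vars), ncol = length(vars),
              dimnames = list(vars, vars))   # diagonal stays 0
  for (i in seq_along(vars)) {
    for (j in seq_along(vars)) {
      if (i < j) {
        p_ij <- cor.test(df[[i]], df[[j]], method = method)$p.value
        p[i, j] <- p_ij                      # fill both triangles
        p[j, i] <- p_ij
      }
    }
  }
  p
}

# Simulated demo data: b depends on a, c is independent noise.
set.seed(1)
demo <- data.frame(a = rnorm(20))
demo$b <- demo$a + rnorm(20, sd = 0.3)
demo$c <- rnorm(20)

p_mat <- cor_pmat_sketch(demo)
round(p_mat, 3)
```

A matrix of this shape can then be passed as `p.mat` so that coefficients above the chosen `sig.level` are marked or blanked out.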

3 Correlation heat map drawing

3.1 ggcorrplot package draws heatmap

# Draw the correlation heat map
ggcorrplot(data, method = "circle")  # circle size varies with the coefficient

Figure 1 Basic correlation heat map

# Adjust and polish the plot
ggcorrplot(data,
           method = "circle",       # "square" (default) or "circle"; far fewer choices than corrplot
           type = "upper",          # "full" (default), "lower" triangle or "upper" triangle
           ggtheme = ggplot2::theme_minimal,
           title = "",
           show.legend = TRUE,      # whether to show the legend
           legend.title = "Corr",   # legend title
           show.diag = TRUE,        # TRUE shows the diagonal, FALSE hides it
           colors = c("blue", "white", "red"),  # three colours for the low, mid and high values
           outline.color = "gray",  # outline colour of the squares or circles
           hc.order = FALSE,        # whether to order by hierarchical clustering (hclust)
           hc.method = "complete",  # same role as hclust.method in corrplot; see ?hclust
           lab = TRUE,              # whether to print the correlation coefficients
           lab_col = "black",       # coefficient colour, only used when lab = TRUE
           lab_size = 4,            # coefficient size, only used when lab = TRUE
           p.mat = NULL,            # e.g. p.mat = p_mat, insig = "pch", pch.col = "red", pch.cex = 4
           sig.level = 0.05,
           insig = "pch",           # "pch" or "blank"
           tl.cex = 12,             # size of the variable labels
           tl.col = "black",        # colour of the variable labels
           tl.srt = 45,             # rotation angle of the variable labels
           digits = 2)              # decimal places for the coefficients (default 2)

Figure 2 Adjusted and beautified heat map

3.2 corrplot package draws correlation heat map

3.2.1 Corrplot package basic heat map and explanation

# Plot with the corrplot package
corrplot(data)
corrplot(data,
         method = "circle",          # also "square", "ellipse", "number", "shade", "color", "pie"
         title = "spearman",         # plot title; matches the method passed to cor()
         type = "full",              # "full" (default), "lower" triangle or "upper" triangle
         # col = c("#FF6666", "white", "#0066CC"),  # colours; palettes from grDevices or RColorBrewer also work
         outline = TRUE,             # draw an outline around circles, squares or ellipses (default FALSE)
         diag = TRUE,                # show the diagonal (default TRUE)
         mar = c(0, 0, 1, 0),        # margins (bottom, left, top, right); top margin leaves room for the title
         bg = "white",               # background colour
         add = FALSE,                # add to an existing plot; the default FALSE starts a new plot
         is.corr = TRUE,             # TRUE for a correlation matrix; FALSE to visualise any numeric matrix
         addgrid.col = "darkgray",   # grid colour; default is white for method "color" or "shade", grey otherwise
         addCoef.col = NULL,         # colour of the printed coefficients; only used when method is not "number"
         addCoefasPercent = FALSE,   # print coefficients as percentages to save space (default FALSE)
         order = "original",         # "original", "AOE" (angular order of eigenvectors), "FPC" (first principal component), "hclust" or "alphabet"
         hclust.method = "complete", # only used when order = "hclust": complete, ward, single, average, mcquitty, median, centroid
         addrect = NULL,             # only used when order = "hclust": an integer number of rectangles to draw
         rect.col = "black",         # rectangle colour
         rect.lwd = 2,               # rectangle line width
         tl.pos = NULL,              # label position: "lt" (left and top), "ld" (left and diagonal), "td" (top and diagonal), "d" (diagonal), "n" (none); default "lt" for type "full", "ld" for "lower", "td" for "upper"
         tl.cex = 1,                 # label size
         tl.col = "black",           # label colour
         cl.pos = NULL               # legend position: "r" (right), "b" (bottom) or "n" (none); default "r" for type "full"/"upper", "b" for "lower"
         # addshade = c("negative", "positive", "all"),  # only for method = "shade": shade negative (135 degrees), positive (45 degrees) or all coefficients
         # shade.lwd = 1,            # shading line width
         # shade.col = "white",      # shading line colour
         # p.mat = res1$p, sig.level = 0.01, insig = "pch", pch.col = "blue", pch.cex = 3
         # sig.level and the pch options only take effect when p.mat is supplied;
         # pch.col and pch.cex only take effect when insig = "pch"
         )

Figure 3 corrplot package drawing

3.2.2 The corrplot package mixes graphics and values

# Show shapes and numbers together
# First layer: circles over the full matrix
corrplot(data,
         method = "circle",          # also "square", "ellipse", "number", "shade", "color", "pie"
         title = "spearman",         # plot title; matches the method passed to cor()
         type = "full",              # "full" (default), "lower" triangle or "upper" triangle
         # col = c("#FF6666", "white", "#0066CC"),  # colours; palettes from grDevices or RColorBrewer also work
         outline = FALSE,            # draw an outline around the shapes (default FALSE)
         diag = TRUE,                # show the diagonal (default TRUE)
         mar = c(0, 0, 1, 0),        # margins (bottom, left, top, right); top margin leaves room for the title
         bg = "white",               # background colour
         add = FALSE,                # FALSE starts a new plot
         is.corr = TRUE,             # TRUE for a correlation matrix; FALSE to visualise any numeric matrix
         addgrid.col = "darkgray",   # grid colour
         addCoef.col = NULL,         # colour of the printed coefficients; only used when method is not "number"
         addCoefasPercent = FALSE,   # print coefficients as percentages (default FALSE)
         order = "original",         # "original", "AOE", "FPC", "hclust" or "alphabet"
         hclust.method = "complete", # only used when order = "hclust"
         addrect = NULL,             # only used when order = "hclust": number of rectangles to draw
         rect.col = "black",         # rectangle colour
         rect.lwd = 2,               # rectangle line width
         tl.pos = NULL,              # label position: "lt", "ld", "td", "d" or "n"
         tl.cex = 1,                 # label size
         tl.col = "black",           # label colour
         cl.pos = NULL               # legend position: "r" (right), "b" (bottom) or "n" (none)
         # addshade = c("negative", "positive", "all"),  # only for method = "shade"
         # shade.lwd = 1,            # shading line width
         # shade.col = "white",      # shading line colour
         # p.mat = res1$p, sig.level = 0.01, insig = "pch", pch.col = "blue", pch.cex = 3
         # sig.level and the pch options only take effect when p.mat is supplied
         )
# Second layer: numbers drawn over the lower triangle of the existing plot
corrplot(data,
         title = "",
         method = "number",   # print the coefficients as numbers
         outline = FALSE,
         add = TRUE,          # draw on top of the existing plot
         type = "lower",      # lower triangle only
         order = "original",
         col = "black",       # colour of the numbers
         diag = FALSE,        # skip the diagonal
         tl.pos = "n",        # no variable labels on this layer
         cl.pos = NULL        # legend position; "n" suppresses a second legend on the overlay
         )

Figure 4 Mixed shapes and numeric values drawn with the corrplot package

# Ellipses with coefficient values
corrplot(data,
         method = "ellipse",
         order = "original",
         addCoef.col = "black",  # colour of the printed coefficients; only used when method is not "number"
         type = "full",          # "full" (default), "lower" triangle or "upper" triangle
         title = "Ellipses with black coefficient values",
         add = FALSE,            # FALSE starts a new plot
         diag = TRUE,            # show the diagonal (default TRUE)
         tl.cex = 1,             # label size
         tl.col = "black",       # label colour
         cl.pos = NULL,          # legend position: "r", "b" or "n"; default "r" for type "full"/"upper", "b" for "lower"
         mar = c(1, 1, 1, 1))    # margins (bottom, left, top, right)

Figure 5 Ellipse plus all values

# Coefficients shown as percentages
corrplot(data,
         method = "ellipse",
         order = "original",
         addCoef.col = "black",    # colour of the printed coefficients; only used when method is not "number"
         addCoefasPercent = TRUE,  # print coefficients as percentages to save space (default FALSE)
         type = "full",            # "full" (default), "lower" triangle or "upper" triangle
         title = "Ellipses with black percentages",
         add = FALSE,              # FALSE starts a new plot
         diag = TRUE,              # show the diagonal (default TRUE)
         tl.cex = 1,               # label size
         tl.col = "black",         # label colour
         cl.pos = NULL,            # legend position: "r", "b" or "n"; default "r" for type "full"/"upper", "b" for "lower"
         mar = c(1, 1, 1, 1))      # margins (bottom, left, top, right)

Source of this article: Senior Pan, who is playing Xiaodoudou

Origin blog.csdn.net/hu397313168/article/details/129744054