R in Practice: Linear Regression Analysis and Correlation Matrix Visualization

               



Multiple regression analysis

Multiple regression analysis is a forecasting method that builds a predictive model from the relationship between two or more independent variables and one dependent variable. When the relationship between the independent variables and the dependent variable is linear, the method is called multiple linear regression analysis.
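In symbols, the linear case can be written as below. The notation is introduced here for illustration and does not come from the original text: p is the number of independent variables, the β terms are the regression coefficients to be estimated, and ε is the error term.

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon
```

Fitting the model means estimating β_0 through β_p from the data.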



Conditions for applying multiple linear regression:

(1) The independent variables have a significant effect on the dependent variable.

(2) The linear relationship between the independent variables and the dependent variable must be genuine, not merely apparent.

(3) The independent variables should be reasonably independent of one another; that is, they should not be more strongly correlated with each other than they are with the dependent variable.

(4) Complete statistical data should be available.

Below, we work through an example that fits a multiple linear regression model in R.

Training data: a tab-separated txt file in which the rows are samples and the columns are foods. In this example, we use 10 types of food as features, so there are 10 independent variables. The dependent variable is the protein content (Protein).
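Since test.txt itself is not provided, here is a minimal sketch of the layout that the read.table() call in this article expects. The sample names, food column names, and values are made up for illustration; only the shape (samples in rows, 10 food columns, then Protein) follows the description above.

```r
# Hypothetical stand-in for test.txt: 5 samples, 10 food columns, 1 Protein column.
set.seed(1)
foods <- matrix(round(runif(50, 0, 20), 1), nrow = 5,
                dimnames = list(paste0("sample", 1:5), paste0("food", 1:10)))
demo <- data.frame(foods, Protein = round(runif(5, 5, 15), 1))

# Write tab-separated text with a header row and the sample names in column 1.
path <- tempfile(fileext = ".txt")
write.table(demo, path, sep = "\t", quote = FALSE, col.names = NA)

# Read it back with the same arguments the article uses for test.txt.
x <- read.table(path, header = TRUE, sep = "\t", row.names = 1)
str(x)
```

With row.names = 1, the first column becomes the row names, so columns 1 to 10 hold the foods and column 11 holds Protein.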




1 Check the pairwise relationships between the independent variables


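library(corrplot)  # provides corrplot()

x <- read.table("test.txt", header = TRUE, sep = "\t", row.names = 1)  # rows are samples

x <- as.matrix(x)

z <- x[, 1:10]  # the 10 food columns (independent variables)

cor <- cor(z, use = "pairwise.complete.obs", method = "pearson")  # pairwise Pearson correlations

cor

corrplot(cor, method = "color")  # draw the matrix as colored cells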


The correlation plot is drawn with the corrplot package. For the detailed steps, see the corrplot walkthrough later in this article.

The correlation plot shows that the independent variables are only weakly correlated with one another, so they can be used directly as regressors. This matters because when the independent variables are more strongly correlated with each other than with the dependent variable, the regression model becomes unstable.
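The comparison described above can also be made numerically. The following is a minimal sketch, not the author's procedure: it uses the built-in mtcars data as a stand-in for test.txt, with mpg playing the role of the dependent variable and four arbitrarily chosen columns as the independent variables.

```r
# Stand-in data: mtcars ships with R; mpg plays the role of the dependent variable.
z <- as.matrix(mtcars[, c("wt", "hp", "qsec", "drat")])
y <- mtcars$mpg

pred_cor <- cor(z, method = "pearson")  # independent variable vs. independent variable
resp_cor <- cor(z, y)[, 1]              # independent variable vs. dependent variable

# Largest absolute correlation among distinct independent variables.
max_between <- max(abs(pred_cor[upper.tri(pred_cor)]))
# Largest absolute correlation between any independent variable and y.
max_with_y <- max(abs(resp_cor))

cat("max |r| among independent variables:", round(max_between, 3), "\n")
cat("max |r| with dependent variable:    ", round(max_with_y, 3), "\n")
if (max_between > max_with_y) {
  message("Independent variables correlate more with each other than with y.")
}
```

If the message appears, condition (3) from the list of applicable conditions is in doubt, and it may be worth dropping or combining some variables before fitting.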


2 Build the multiple regression model


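Next, we use the 10 features as independent variables and the protein content as the dependent variable to fit a multiple linear regression model.

y <- x[, 11]  # column 11: protein content (dependent variable)

lm.result <- lm(y ~ z)  # regress protein content on the 10 food columns

summary(lm.result)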


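In the summary output, *** marks a coefficient as extremely significant (p < 0.001), ** as highly significant (p < 0.01), and * as significant (p < 0.05). These are the results of the model tests. Because our example data are incomplete, you may want to try this with your own data and see how well the model fits.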
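The numbers behind the significance stars can be pulled out of the summary object directly. This is a small sketch on the built-in mtcars data (a stand-in, since test.txt is not available); the model formula is chosen only for illustration.

```r
# Stand-in model on mtcars; the choice of variables is arbitrary.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
s <- summary(fit)

coefs <- coef(s)          # matrix: Estimate, Std. Error, t value, Pr(>|t|)
p <- coefs[, "Pr(>|t|)"]  # one p-value per coefficient

# Reproduce the significance codes that summary() prints.
stars <- cut(p, breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
             labels = c("***", "**", "*", ".", " "), include.lowest = TRUE)

print(data.frame(estimate = coefs[, "Estimate"], p.value = p, signif = stars))
cat("R-squared:", s$r.squared, " adjusted R-squared:", s$adj.r.squared, "\n")
```

Working from coef(summary(fit)) is convenient when the coefficients need to be filtered or exported rather than just read off the console.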


3 Residual analysis


par(mfrow = c(2, 2))  # 2 x 2 grid so the four diagnostic plots share one page

plot(lm.result)  # diagnostic plots for the fitted model

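For an lm fit, plot() draws four panels by default: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage. The same questions can be asked numerically. The sketch below uses mtcars as stand-in data, and the checks chosen (a Shapiro-Wilk test and a cutoff of 2 for standardized residuals) are common conventions rather than anything prescribed by the original article.

```r
# Stand-in model on mtcars.
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)

# With an intercept in the model, least-squares residuals average to
# zero up to floating-point error.
cat("mean residual:", mean(res), "\n")

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal.
print(shapiro.test(res))

# Standardized residuals beyond +/- 2 are worth a closer look.
std_res <- rstandard(fit)
print(names(std_res)[abs(std_res) > 2])
```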

That completes the process of fitting a multiple linear regression model in R.
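Because the stated purpose of the model is forecasting, a short sketch of prediction may help. It departs from the code above in one respect, which is an assumption made here for convenience: the model is fitted from a data frame with named columns instead of a matrix, because predict() matches new data by column name. The data are again mtcars, and the train/new split is arbitrary.

```r
# Arbitrary split of the stand-in data into rows to fit on and rows to predict.
train <- mtcars[1:24, ]
new   <- mtcars[25:32, ]

# Fit from a data frame so predict() can match columns by name.
fit <- lm(mpg ~ wt + hp + qsec, data = train)

# Point predictions plus 95% prediction intervals for the held-out rows.
pred <- predict(fit, newdata = new, interval = "prediction")
print(cbind(observed = new$mpg, pred))
```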



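Correlation matrix

Correlation matrices are computed frequently in bioinformatics analyses, so visualizing them well matters. Many papers feature vivid, eye-catching correlation plots.

Below, we introduce one way to visualize a correlation matrix in R: the corrplot package.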


1 Install the corrplot package


install.packages("corrplot")

library(corrplot)

Next, we compute a correlation matrix from mtcars, an example dataset that ships with R.

data(mtcars)

mtcars


2 Compute the correlation matrix


cor <- cor(mtcars)  # Pearson correlations between all pairs of columns

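Before plotting, it can be useful to list the strongest correlations as a table. This is a minimal sketch in base R; the names cor_mat and pair_tab are introduced here for illustration.

```r
cor_mat <- cor(mtcars)

# The matrix is symmetric, so keep each pair once via the upper triangle.
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
pair_tab <- data.frame(var1 = rownames(cor_mat)[idx[, "row"]],
                       var2 = colnames(cor_mat)[idx[, "col"]],
                       r    = cor_mat[idx])

# Sort by absolute correlation, strongest first.
pair_tab <- pair_tab[order(-abs(pair_tab$r)), ]
print(head(pair_tab, 5), row.names = FALSE)
```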

3 Plot the correlation matrix


corrplot(cor, method="circle",type="full",order="hclust")


cor: the correlation matrix to plot.

method: how each cell is drawn. There are 7 options: "circle" (default), "square", "ellipse", "number", "pie", "shade" and "color".

type: which part of the matrix is shown. There are 3 options: "full" (default), "upper" or "lower".

order: how the variables are ordered. The options are:

"original" for the original order (default).

"AOE" for the angular order of the eigenvectors.

"FPC" for the first principal component order.

"hclust" for the hierarchical clustering order.

"alphabet" for alphabetical order.
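To make the "hclust" option more concrete, the sketch below reorders a correlation matrix by hierarchical clustering using base R only. It illustrates the idea (cluster the variables using 1 - r as a distance) and is not a copy of corrplot's internals, which may differ in detail.

```r
cor_mat <- cor(mtcars)

# Treat 1 - r as a distance: strongly positively correlated variables are close.
hc <- hclust(as.dist(1 - cor_mat))

# hc$order is the left-to-right order of the variables in the dendrogram.
ord <- hc$order
reordered <- cor_mat[ord, ord]

print(colnames(cor_mat))    # original order
print(colnames(reordered))  # clustered order
```

Placing similar variables next to each other is what makes blocks of related variables visible in the plot.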

Additional parameters refine the plot. Here, addCoef.col prints the correlation coefficients on the cells in the given color:

corrplot(cor,method="color",type="upper", order="hclust",addCoef.col = "black")


To change the colors and shapes and get a more attractive figure, you only need to adjust these parameters.
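As one more variation, the sketch below combines a few further corrplot() arguments (tl.col and tl.srt for the label color and rotation, diag to hide the diagonal) and writes the figure to a PNG file. The output file name and image size are arbitrary choices, and the code only runs if the corrplot package is installed.

```r
if (requireNamespace("corrplot", quietly = TRUE)) {
  cor_mat <- cor(mtcars)

  # Hypothetical output location; change it to wherever the figure should go.
  out <- file.path(tempdir(), "mtcars_corrplot.png")

  png(out, width = 800, height = 800)  # open a PNG graphics device
  corrplot::corrplot(cor_mat, method = "ellipse", type = "lower",
                     order = "hclust", diag = FALSE,
                     tl.col = "black", tl.srt = 45)
  invisible(dev.off())                 # close the device to finish the file

  cat("saved to", out, "\n")
}
```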


Origin blog.51cto.com/15127592/2674941