Linear Regression Analysis in R

Regression analysis is one of the most basic yet most widely applicable statistical prediction tools, and R ships with built-in functions for both simple and multiple regression.

For simple (one-predictor) regression, we use the women dataset that comes with R:

data("women")
plot(x = women$height, y = women$weight)
model <- lm(weight ~ height, data = women)
abline(model)

Note that the x and y passed to plot() must correspond to the y ~ x in lm() (predictor on the x-axis, response on the y-axis); otherwise abline() cannot draw the fitted line on the plot.
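For reference, abline(model) simply reads the fitted coefficients off the model object; a minimal equivalent sketch:

abline(a = coef(model)[1], b = coef(model)[2])  # same line, intercept and slope supplied by hand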

summary(model)

Call:
lm(formula = weight ~ height, data = women)

Residuals:
    Min      1Q  Median      3Q     Max
-1.7333 -1.1333 -0.3833  0.7417  3.1167

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
height        3.45000    0.09114   37.85 1.09e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.525 on 13 degrees of freedom
Multiple R-squared:  0.991, Adjusted R-squared:  0.9903
F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14

summary() reports a range of basic statistics, including the residual extremes, median and quartiles, and the coefficient estimates with the intercept, as well as several goodness-of-fit measures such as the multiple R-squared and the adjusted R-squared.
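If you need those numbers programmatically rather than printed, they can be pulled out of the model and summary objects directly; a short sketch:

coef(model)                  # intercept and slope as a named vector
confint(model, level = 0.95) # 95% confidence intervals for the coefficients
summary(model)$r.squared     # multiple R-squared
summary(model)$adj.r.squared # adjusted R-squared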

anova(model)

Analysis of Variance Table

Response: weight
          Df Sum Sq Mean Sq F value    Pr(>F)
height     1 3332.7  3332.7    1433 1.091e-14 ***
Residuals 13   30.2     2.3
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova() produces an analysis-of-variance table, which shows whether each individual term is significant, along with some other information.
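anova() can also compare two nested models directly; a quick sketch, where the intercept-only baseline is my own choice for illustration:

null_model <- lm(weight ~ 1, data = women) # intercept-only baseline
anova(null_model, model)                   # F-test: does adding height improve the fit?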

Other information can be obtained with helper functions such as cor() for the correlation (cov() gives the covariance) and residuals() for the residuals, or shown in plots, which is more intuitive, for example:

plot(predict(model, type = "response"), residuals(model, type = "deviance"))
plot(hatvalues(model))
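The corresponding numeric diagnostics, in case the plots are not enough; a short sketch:

cor(women$height, women$weight) # Pearson correlation between predictor and response
head(residuals(model))          # first few raw residuals
hatvalues(model)                # leverage of each observation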

Now let's make a prediction; the two heights below were picked arbitrarily:


pre <- data.frame(height = c(56, 74))
predict(model, pre, interval = "prediction", level = 0.95)
       fit     lwr      upr
1 105.6833 101.847 109.5197
2 167.7833 163.947 171.6197

R's predict() returns an interval: the fitted value (fit) and the lower (lwr) and upper (upr) bounds are all reported, and which one to use is up to you.
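Note that interval = "confidence" gives a different, narrower interval, for the mean response rather than for an individual observation; a minimal sketch for comparison:

predict(model, pre, interval = "confidence", level = 0.95) # CI for the mean response at the same heights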

For multiple linear regression, most of the workflow is much the same as in the simple case: just list more variables after the ~ in the formula argument of lm(). As for how many variables to include, stepwise regression can be used to screen them.
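Incidentally, when the response is to be regressed on every other column of a data frame, the Y ~ . shorthand saves typing; a minimal sketch, using the tdata frame constructed below:

model_all <- lm(Y ~ ., data = tdata) # expands to Y ~ x1 + x2 + x3 + x4 here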

tdata <- data.frame(
  x1 = c( 7,  1, 11, 11,  7, 11,  3,  1,  2, 21,  1, 11, 10),
  x2 = c(26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68),
  x3 = c( 6, 15,  8,  8,  6,  9, 17, 22, 18,  4, 23,  9,  8),
  x4 = c(60, 52, 20, 47, 33, 22,  6, 44, 22, 26, 34, 12, 12),
  Y  = c(78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5,
         93.1, 115.9, 83.8, 113.3, 109.4)
)
model <- lm(Y ~ x1 + x2 + x3 + x4, data = tdata)
summary(model)
Call:
lm(formula = Y ~ x1 + x2 + x3 + x4, data = tdata)

Residuals:
    Min      1Q  Median      3Q     Max
-3.1750 -1.6709  0.2508  1.3783  3.9254

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  62.4054    70.0710   0.891   0.3991
x1            1.5511     0.7448   2.083   0.0708 .
x2            0.5102     0.7238   0.705   0.5009
x3            0.1019     0.7547   0.135   0.8959
x4           -0.1441     0.7091  -0.203   0.8441
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.446 on 8 degrees of freedom
Multiple R-squared:  0.9824, Adjusted R-squared:  0.9736
F-statistic: 111.5 on 4 and 8 DF,  p-value: 4.756e-07

From the t-values above, none of the terms is significant for Y at the 0.05 level (even though the overall F-test is highly significant), so we apply stepwise regression to improve this model.

mystep <- step(model)
Start:  AIC=26.94
Y ~ x1 + x2 + x3 + x4

       Df Sum of Sq    RSS    AIC
- x3    1    0.1091 47.973 24.974
- x4    1    0.2470 48.111 25.011
- x2    1    2.9725 50.836 25.728
<none>              47.864 26.944
- x1    1   25.9509 73.815 30.576

Step:  AIC=24.97
Y ~ x1 + x2 + x4

       Df Sum of Sq    RSS    AIC
<none>               47.97 24.974
- x4    1      9.93  57.90 25.420
- x2    1     26.79  74.76 28.742
- x1    1    820.91 868.88 60.629

R's step() function decides which variable to delete based on the AIC value, so the stepwise procedure can sometimes end while some terms are still non-significant, as in this example.
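The AIC values printed in the trace can be reproduced by hand, since step() relies on extractAIC() internally; a short sketch:

extractAIC(model)  # (equivalent df, AIC) for the full model: 5, 26.944
extractAIC(mystep) # same for the reduced model Y ~ x1 + x2 + x4: 4, 24.974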

summary(mystep)

Call:
lm(formula = Y ~ x1 + x2 + x4, data = tdata)

Residuals:
    Min      1Q  Median      3Q     Max
-3.0919 -1.8016  0.2562  1.2818  3.8982

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  71.6483    14.1424   5.066 0.000675 ***
x1            1.4519     0.1170  12.410 5.78e-07 ***
x2            0.4161     0.1856   2.242 0.051687 .
x4           -0.2365     0.1733  -1.365 0.205395
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.309 on 9 degrees of freedom
Multiple R-squared:  0.9823, Adjusted R-squared:  0.9764
F-statistic: 166.8 on 3 and 9 DF,  p-value: 3.323e-08

As the summary shows, the variables are still not all significant. Here you can use the drop1() function for backward elimination, as sketched just below (its counterpart for forward selection is add1()), or pass an explicit k = 2 to step() (k = 2 is in fact the default AIC penalty, so this reproduces the same trace).
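A minimal sketch of the drop1() route; test = "F" adds F-tests alongside the AIC column:

drop1(mystep, test = "F") # per-term F-tests for dropping each remaining variable

And the step() call with an explicit penalty: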

step(model, trace = 1, k = 2)
Start:  AIC=26.94
Y ~ x1 + x2 + x3 + x4

       Df Sum of Sq    RSS    AIC
- x3    1    0.1091 47.973 24.974
- x4    1    0.2470 48.111 25.011
- x2    1    2.9725 50.836 25.728
<none>              47.864 26.944
- x1    1   25.9509 73.815 30.576

Step:  AIC=24.97
Y ~ x1 + x2 + x4

       Df Sum of Sq    RSS    AIC
<none>               47.97 24.974
- x4    1      9.93  57.90 25.420
- x2    1     26.79  74.76 28.742
- x1    1    820.91 868.88 60.629

Call:
lm(formula = Y ~ x1 + x2 + x4, data = tdata)

Coefficients:
(Intercept)           x1           x2           x4
    71.6483       1.4519       0.4161      -0.2365

From these results we can see that deleting both x3 and x4 is the better choice:

model2 <- lm(Y ~ x1 + x2, data = tdata)
summary(model2)

Call:
lm(formula = Y ~ x1 + x2, data = tdata)

Residuals:
   Min     1Q Median     3Q    Max
-2.893 -1.574 -1.302  1.363  4.048

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 52.57735    2.28617   23.00 5.46e-10 ***
x1           1.46831    0.12130   12.11 2.69e-07 ***
x2           0.66225    0.04585   14.44 5.03e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.406 on 10 degrees of freedom
Multiple R-squared:  0.9787, Adjusted R-squared:  0.9744
F-statistic: 229.5 on 2 and 10 DF,  p-value: 4.407e-09

And the final summary confirms it: after deleting x3 and x4, both x1 and x2 are significant.
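As a closing check, the reduced model can be used for prediction exactly as in the simple-regression example; the new x1/x2 values below are made up:

newdata <- data.frame(x1 = c(8, 12), x2 = c(50, 60))
predict(model2, newdata, interval = "prediction", level = 0.95)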


Reposted from blog.csdn.net/weixin_43745631/article/details/88129640