Linear Regression (Least-Squares) Model Diagnostics in R

Regression Diagnostics

Regression diagnostics provide the tools needed to evaluate whether your regression model is appropriate.
8.3.1 Standard method
Applying the plot() function to the object returned by lm() generates four diagnostic plots for evaluating how well the model fits.
Example

fit <- lm(weight~height,data=women)
par(mfrow=c(2,2))
plot(fit)

[Figure: the four diagnostic plots produced by plot(fit) for the simple linear fit]
The upper-left panel is the Residuals vs Fitted plot, which can be used to check the linearity assumption. If the dependent variable is linearly related to the independent variables, there should be no systematic relationship between the residuals and the predicted (fitted) values; in other words, apart from white noise, the model should capture all of the systematic variance in the data. In the Residuals vs Fitted plot you can clearly see a curved relationship, which suggests that you may need to add a quadratic term to the regression model.
The upper-right panel is the Normal Q-Q plot, a probability plot of the standardized residuals against the quantiles of a normal distribution. To satisfy the normality assumption, the dependent variable should be normally distributed for fixed values of the predictors, so the residuals should be normally distributed with mean 0. If the normality assumption holds, the points on the Normal Q-Q plot should fall on a straight line at a 45-degree angle; if they do not, the assumption is violated.
The lower-left panel is the Scale-Location plot, which checks homoscedasticity. If the constant-variance assumption holds, the points around the horizontal line in this plot should be randomly scattered. The points in the figure appear to satisfy this assumption.
The lower-right panel, the Residuals vs Leverage plot, provides information about individual observations you may want to pay attention to. From this plot you can identify outliers, high-leverage points, and influential points. These are described in detail below.
An observation is an outlier if the regression model predicts it poorly (it has a large positive or negative residual). A rough criterion is a standardized residual greater than 2 or less than -2; that is, look at how far each point deviates from 0 in the y-axis direction.
An observation has high leverage if it has an unusual combination of predictor values, that is, if it lies far from the other observations in terms of the predictors. In other words, it is an outlier in predictor space.
The value of the response variable is not used in calculating an observation's leverage. The criterion for a high-leverage point is a hat value greater than 2 or 3 times the average hat value.
An observation is an influential point if it has a disproportionately large effect on the estimated model parameters. Influential points can be identified with Cook's distance. (The sketch below shows how to obtain these quantities numerically.)
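A minimal sketch (base R only; not part of the original listing) that inspects the same quantities numerically for the fit object above:

rstandard(fit)                   # standardized residuals; |value| > 2 flags possible outliers
hatvalues(fit)                   # hat values; compare against 2 or 3 times mean(hatvalues(fit))
cooks.distance(fit)              # Cook's distances; large values flag influential observations
which(abs(rstandard(fit)) > 2)   # observations with unusually large standardized residuals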

Diagnosing the quadratic fit
Example

fit <- lm(weight~height+I(height^2),data=women)
par(mfrow=c(2,2))
plot(fit)

[Figure: the four diagnostic plots for the quadratic fit]
The plots indicate that the quadratic (second-degree polynomial) regression fits well: it satisfies the linearity assumption, residual normality (except for observation 13),
and homoscedasticity (constant residual variance). Observation 15 appears to be an influential point; deleting it would change the parameter estimates.
You could delete observations 13 and 15 and refit the model, as in the sketch below.
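A minimal sketch of that refit (dropping the two rows by index; not part of the original listing):

newfit <- lm(weight ~ height + I(height^2), data = women[-c(13, 15), ])
par(mfrow = c(2, 2))
plot(newfit)   # re-run the diagnostics on the reduced data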

3. Using the same basic approach, let's look at the diagnostics for a multiple regression fit.
Example

states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
fit <- lm(Murder~.,data=states)
par(mfrow=c(2,2))
plot(fit)

[Figure: the four diagnostic plots for the multiple regression fit]

8.3.2 Improved methods
The car package provides a number of functions that greatly enhance your ability to fit and evaluate regression models:
qqPlot(): quantile comparison plot, for checking the normality of the residuals
durbinWatsonTest(): Durbin-Watson test for autocorrelated errors, for checking the independence of the residuals
crPlots(): component plus residual plots, for checking the linearity of the regression
ncvTest(): score test for non-constant error variance, for checking homoscedasticity
spreadLevelPlot(): spread-level plot, also for checking homoscedasticity
outlierTest(): Bonferroni outlier test, for identifying outliers
influencePlot(): regression influence plot, combining information on outliers, high-leverage points, and influential points in one graph
scatterplot(): enhanced scatter plot
scatterplotMatrix(): enhanced scatter plot matrix
vif(): variance inflation factors, for checking multicollinearity among the predictors

1. Normality. Compared with the base plot() function, the qqPlot() function provides a more accurate check of the normality assumption. It plots the studentized residuals (also called studentized deleted residuals or jackknifed residuals) against a t distribution with n-p-1 degrees of freedom, where n is the sample size and p is the number of regression parameters (including the intercept). The code is as follows:

library(car)
states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
fit <- lm(Murder~.,data=states)
par(mfrow=c(1,1))
qqPlot(fit,labels=row.names(states),id.method="identify",simulate=TRUE,main="QQ-Plot")

[Figure: qqPlot() output with the 95% confidence envelope; the point for Nevada falls outside it]
The id.method = "identify" option makes the plot interactive: once the graph is drawn, clicking a point with the mouse labels it with the value set in the labels option.
Pressing Esc, choosing Stop from the graph's drop-down menu, or right-clicking on the graph exits this interactive mode.
With simulate = TRUE, a 95% confidence envelope is generated using a parametric bootstrap.
The plot shows that the point for Nevada does not fall inside the confidence envelope.
Compare Nevada's actual value with its fitted value:

states["Nevada",]
       Murder Population Illiteracy Income Frost
Nevada   11.5        590        0.5   5149   188
> fitted(fit)["Nevada"]
  Nevada 
3.878958 
> # residual value
> residuals(fit)["Nevada"]
  Nevada 
7.621042 

Studentized residuals: a studentized residual is a residual divided by an estimate of its standard deviation. It gives an intuitive way to judge whether the assumption that the error term is normally distributed holds;
if the assumption holds, the studentized residuals should also follow a normal distribution.
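A small sketch for inspecting the studentized residuals of the states model directly (base R's rstudent() and shapiro.test(); this check is not part of the original text):

sr <- rstudent(fit)
head(sort(abs(sr), decreasing = TRUE))   # the largest studentized residuals
shapiro.test(sr)                         # Shapiro-Wilk test; the null hypothesis is normality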
There are other ways to check whether the residuals are normally distributed. For example, the residplot() function in the listing below draws a histogram of the studentized residuals and adds a normal curve, a kernel density curve, and a rug plot.
residplot() is built from base R functions only, so the car package does not need to be loaded.
Example

residplot <- function(fit,nbreaks=10) {
               # compute the studentized residuals
                z <- rstudent(fit)
                # draw a histogram of the studentized residuals
                # freq=FALSE plots densities rather than counts
                hist(z,breaks=nbreaks,freq=FALSE,
                     xlab="Studentized Residual",
                     main="Distribution of Errors")
                 # add a rug plot
                rug(jitter(z),col="brown")
                # curve() draws the graph of a function over a given range;
                # with add=TRUE the curve is drawn on the existing plot, so no range needs to be given
                curve(dnorm(x,mean=mean(z),sd=sd(z)),add=TRUE,col="blue",lwd=2)
                lines(density(z)$x,density(z)$y,col="red",lwd=2,lty=2)
                legend("topright",legend=c("Normal Curve","Kernel Density Curve"),
                       lty=1:2,col=c("blue","red"),cex=.7)
}
residplot(fit)

[Figure: residplot() output - histogram of studentized residuals with normal curve, kernel density curve, and rug plot]
2. Independence of errors
As mentioned in earlier chapters, the best way to judge whether the dependent-variable values (or the residuals) are independent of one another is prior knowledge of how the data were collected (that is, your own understanding of the data).
For example, time-series data often exhibit autocorrelation.
The car package provides a function for the Durbin-Watson test, which can detect serially correlated errors.
Example

library(car)
> states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
> fit <- lm(Murder~.,data=states)
> durbinWatsonTest(fit)
 lag Autocorrelation D-W Statistic p-value
   1            -0.2           2.3    0.25
 Alternative hypothesis: rho != 0
> 

The p-value is not significant (p = 0.25 here; it varies slightly between runs because the p-value is obtained by bootstrapping), indicating no autocorrelation: the error terms are independent. The lag value (lag = 1) means that each observation in the data set is compared with the observation that follows it.
Note that this test is applicable to time-dependent data; it is not appropriate for non-clustered data.
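A hedged sketch (assuming the simulate and max.lag arguments of car::durbinWatsonTest, which are not used in the original listing): fix the random seed to make the bootstrapped p-value reproducible, and test more than one lag if desired.

set.seed(1234)                        # make the bootstrapped p-value reproducible
durbinWatsonTest(fit, max.lag = 2)    # test autocorrelation at lags 1 and 2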

3. Linearity
Component plus residual plots (also called partial residual plots) show whether the relationship between the dependent variable and the independent variables is nonlinear, and whether there is any systematic departure from the specified linear model.
The closer the fitted curve is to the straight line, the stronger the linearity.
The plots can be drawn with the crPlots() function in the car package.
Code example

library(car)
states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
fit <- lm(Murder~.,data=states)
crPlots(fit)

[Figure: component plus residual plots from crPlots(fit)]
4. Homoscedasticity
The car package provides two useful functions for judging whether the error variance is constant.
The ncvTest() function performs a score test with the null hypothesis of constant error variance against the alternative that the error variance changes with the level of the fitted values.
If the test is significant, heteroscedasticity (non-constant error variance) is present.
The spreadLevelPlot() function creates a scatter plot of the absolute standardized residuals versus the fitted values, with a line of best fit added.
Example

library(car)
> states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
> fit <- lm(Murder~.,data=states)
> ncvTest(fit)
Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 1.7, Df = 1, p = 0.2
> # the score test is not significant (p = 0.2), so the constant-variance assumption is satisfied
> spreadLevelPlot(fit)

Suggested power transformation:  1.2 

[Figure: spread-level plot from spreadLevelPlot(fit)]
In the plot, the points are randomly scattered around a horizontal line of best fit. If the assumption were violated, you would see a non-horizontal curve.
The suggested power transformation in the output means that raising the response to that power (Y^p) would stabilize a non-constant error variance. In this example the heteroscedasticity is negligible, so the suggested power is close to 1 (no transformation is needed).
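For illustration only (a hypothetical case, not needed for this data): if spreadLevelPlot() had suggested a power well away from 1, say 0.5, you could refit with the transformed response and re-run the score test.

fit_sqrt <- lm(sqrt(Murder) ~ Population + Illiteracy + Income + Frost, data = states)   # Y^0.5 transformation
ncvTest(fit_sqrt)                                                                        # re-check constancy of the error variance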

8.3.3 Global validation of linear model assumptions
The gvlma() function in the gvlma package performs a global validation of the linear model assumptions, along with separate evaluations of skewness, kurtosis, and heteroscedasticity.
In other words, it provides a single omnibus (pass/fail) test of the model assumptions.
Example

 states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
> fit <- lm(Murder~.,data=states)
> library(gvlma)
> gvmodel <- gvlma(fit)
> summary(gvmodel)

Call:
lm(formula = Murder ~ ., data = states)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7960 -1.6495 -0.0811  1.4815  7.6210 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.235e+00  3.866e+00   0.319   0.7510    
Population  2.237e-04  9.052e-05   2.471   0.0173 *  
Illiteracy  4.143e+00  8.744e-01   4.738 2.19e-05 ***
Income      6.442e-05  6.837e-04   0.094   0.9253    
Frost       5.813e-04  1.005e-02   0.058   0.9541    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.535 on 45 degrees of freedom
Multiple R-squared:  0.567,	Adjusted R-squared:  0.5285 
F-statistic: 14.73 on 4 and 45 DF,  p-value: 9.133e-08


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma(x = fit) 

                    Value p-value                Decision
Global Stat        2.7728  0.5965 Assumptions acceptable.
Skewness           1.5374  0.2150 Assumptions acceptable.
Kurtosis           0.6376  0.4246 Assumptions acceptable.
Link Function      0.1154  0.7341 Assumptions acceptable.
Heteroscedasticity 0.4824  0.4873 Assumptions acceptable.
> 

From the output (the Decision column for Global Stat) you can see that the data satisfy all of the statistical assumptions of the OLS regression model (p = 0.597).
If p < 0.05, one or more assumptions are violated, and you can use the methods discussed in the previous sections to determine which assumptions are not met.

8.3.4 Multicollinearity
A regression coefficient measures the effect of a predictor on the response variable when the other predictors are held constant. When predictors are strongly correlated, holding the other variables constant also restricts how the predictor in question can vary.
This problem is called multicollinearity. It inflates the confidence intervals of the model parameters and makes individual coefficients difficult to interpret.
Multicollinearity can be detected with the variance inflation factor (VIF). The square root of the VIF indicates the degree to which the confidence interval for that variable's regression parameter is expanded relative to a model with uncorrelated predictors.
The vif() function in the car package computes the VIF values; as a general rule, a square root of VIF greater than 2 indicates a multicollinearity problem.
Example

> states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
> fit <- lm(Murder~.,data=states)
> library(car)
> vif(fit)
Population Illiteracy     Income      Frost 
       1.2        2.2        1.3        2.1 
> sqrt(vif(fit)) >2
Population Illiteracy     Income      Frost 
     FALSE      FALSE      FALSE      FALSE 

The results show that none of the square roots of the VIFs exceed 2, so there is no multicollinearity problem.

8.4 Unusual observations
A comprehensive regression analysis also covers unusual observations: outliers, high-leverage points, and influential points.
8.4.1 Outliers
Outliers are observations that the model predicts poorly. They usually have large positive or negative residuals; a positive residual means the model underestimates the response value, and a negative residual means the model overestimates it.
Outliers can be identified from the Q-Q plot: points that fall outside the confidence envelope are outliers.
Another rough rule of thumb: points with standardized residuals greater than 2 or less than -2 may be outliers and deserve special attention.
The car package also provides a statistical test for outliers. The outlierTest() function reports the Bonferroni-adjusted p-value for the largest absolute studentized residual:

library(car)
> states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
> fit <- lm(Murder~.,data=states)
> outlierTest(fit)
       rstudent unadjusted p-value Bonferroni p
Nevada      3.5            0.00095        0.048

Here you can see that Nevada is identified as an outlier (p = 0.048).
Note that this function judges whether there are outliers based only on the significance of the single largest (absolute) residual. If it is not significant, there are no outliers in the data set;
if it is significant, you must delete that outlier and then re-run the test to see whether any other outliers are present, as in the sketch below.
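A small sketch of that re-test (dropping Nevada by row name and refitting; not part of the original listing):

states2 <- states[rownames(states) != "Nevada", ]   # remove the flagged outlier
fit2 <- lm(Murder ~ ., data = states2)
outlierTest(fit2)                                    # check whether further outliers remain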

8.4.2 High-leverage points
High-leverage observations are outliers in the space of the independent variables: they are made up of unusual combinations of predictor values, and the value of the response plays no role in determining leverage.
In other words, a high-leverage point is an outlier among the predictors. For example, if there is only one predictor and one observation lies far from all the other predictor values, that observation is a high-leverage point.
In simple linear regression, high-leverage observations are easy to spot: we simply look for observations whose predictor value is outside the normal range.
In multiple regression with many predictors, however, there may be observations
whose value on each individual predictor is within the normal range, but that are unusual when the whole set of predictors is considered together.

High-leverage observations can be identified with the hat statistic.
For a given data set, the average hat value is p/n, where p is the number of parameters estimated by the model (including the intercept) and n is the sample size.
In general, an observation with a hat value greater than 2 or 3 times the average hat value can be considered a high-leverage point.
Code example

states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
fit <- lm(Murder~.,data=states)

hat.plot <- function(fit) {
  p <- length(coefficients(fit))
  n <- length(fitted(fit))
  # hatvalues(fit) computes the hat statistic for each observation in fit
  plot(hatvalues(fit),main = "Index Plot of Hat Values")
  abline(h=c(2,3)*p/n,col="red",lty=2)
  identify(1:n,hatvalues(fit),names(hatvalues(fit)))
}
hat.plot(fit)
hatvalues(fit)

[Figure: index plot of hat values from hat.plot(fit)]
In this plot you can see that Alaska and California are quite unusual. Examining their predictor values and comparing them with the other 48 states shows that
Alaska has a much higher income than the other states while having a low population and temperature, and California has a much larger population than the other states together with a high income and temperature.
High-leverage points may or may not be influential points; that depends on whether they are also outliers.

8.4.3 Influential points
Influential points are observations that have a disproportionate impact on the estimated model parameters. For example, if removing a single observation changes the model dramatically, you should check whether the data contain influential points.
There are two methods for detecting influential points: Cook's distance (the D statistic) and added-variable plots.
Cook's distance
As a general rule, a Cook's D value greater than 4/(n-k-1), where n is the sample size and k is the number of predictors, indicates an influential point. The Cook's D plot can be drawn with the following code:

states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
fit <- lm(Murder~.,data=states)

cutoff <- 4/(nrow(states)-length(fit$coefficients)-2)
plot(fit, which = 4, cook.levels = cutoff)
abline(h=cutoff,lty=2,col="red")

[Figure: Cook's distance plot]
From the plot you can see that Alaska, Hawaii, and Nevada are influential points.

Added-variable plots
The Cook's D plot helps identify influential points, but it provides no information about how those points affect the model. Added-variable plots fill this gap.
For each predictor Xk, an added-variable plot graphs the residuals from regressing the response variable on the other k-1 predictors against the residuals from regressing Xk on those same k-1 predictors. A manual sketch of one such plot follows.
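To make the definition concrete, here is a sketch (not in the original text; the choice of Illiteracy is arbitrary) that builds a single added-variable plot by hand:

res_y <- resid(lm(Murder ~ Population + Income + Frost, data = states))       # response vs the other predictors
res_x <- resid(lm(Illiteracy ~ Population + Income + Frost, data = states))   # Illiteracy vs the other predictors
plot(res_x, res_y, xlab = "Illiteracy | others", ylab = "Murder | others")
abline(lm(res_y ~ res_x), col = "red")   # this slope equals the Illiteracy coefficient in the full model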
The avPlots() function in the car package draws the added-variable plots for every predictor automatically:

states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
fit <- lm(Murder~.,data=states)
library(car)
avPlots(fit,ask=FALSE,id.method="identify")

[Figure: added-variable plots from avPlots(fit)]
The straight line in each plot represents the actual regression coefficient of the corresponding predictor.
Using the influencePlot() function in the car package, you can combine information about outliers, leverage, and influence into a single plot.
Code example

states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
fit <- lm(Murder~.,data=states)
library(car)
influencePlot(fit,id.method="identify",main="Influence Plot",sub="Circle size is proportional to Cook's distance")

[Figure: influence plot from influencePlot(fit); circle size is proportional to Cook's distance]
States above +2 or below -2 on the vertical axis can be considered outliers; states beyond 0.2 or 0.3 on the horizontal axis have high leverage; and circle size is proportional to influence. Observations with large circles may be influential points that have a disproportionate effect on the parameter estimates.
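As a purely numeric complement (a sketch using base R's influence.measures(), which is not mentioned in the original text):

summary(influence.measures(fit))   # tabulates dfbetas, dffits, covariance ratios, Cook's distances and hat values,
                                   # and flags the observations that stand out on any of these measures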

Summary:
The basic approach to regression diagnostics is to apply the plot() function to the object returned by lm(), which produces four plots for evaluating model fit.
Improved methods:
1. Use the qqPlot() function in the car package to check the normality of the residuals.
2. The car package provides a Durbin-Watson test function for checking the independence of the errors.
3. Component plus residual plots (partial residual plots), drawn with the crPlots() function in the car package, show whether the relationship between the dependent variable and the independent variables is nonlinear.
4. Homoscedasticity: the car package provides two useful functions for judging whether the error variance is constant. ncvTest() performs a score test, and spreadLevelPlot() creates a scatter plot of the absolute standardized residuals versus the fitted values with a line of best fit.
5. The gvlma() function in the gvlma package performs a global validation of the linear model assumptions.
6. Multicollinearity: the vif() function in the car package computes VIF values for judging whether multicollinearity is present.
7. Unusual observations:
A. Outliers: the outlierTest() function in the car package reports the Bonferroni-adjusted p-value for the largest absolute studentized residual, which identifies outliers.
B. High-leverage points: outliers in the space of the predictors, arising from unusual combinations of predictor values unrelated to the response; they can be identified with the hat statistic.
C. Influential points: two methods for detecting them are Cook's distance (the D statistic) and added-variable plots.
D. The influencePlot() function in the car package combines information about outliers, leverage, and influence into a single plot.


Source: blog.csdn.net/weixin_42712867/article/details/99815449