R Language - regression analysis (least squares) - model improvement measures

There are four ways to deal with a regression model that violates its assumptions:
1. Remove the offending observations
2. Transform variables
3. Add or delete variables
4. Use other regression methods

8.5.1 Removing observations
Removing outliers from the data set generally improves the fit to the normality assumption, and influential points that distort the model are usually removed along with them.
After removing the largest outlier or influential point, the model must be refit. If outliers or influential points remain, repeat the process.
If the data themselves are wrong, because of a recording error, a failure to follow the protocol, or a subject who misunderstood the instructions, the observations can be deleted.
But when the data are correct, these unusual observations may help you study the topic more deeply, or you may discover questions you had not thought to ask.
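A minimal sketch of this process (assuming the states data used later in this chapter; the observation removed is whatever outlierTest() from the car package flags, not a fixed answer):

library(car)
states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
outlierTest(fit)                      # reports the observation with the largest studentized residual
# drop the flagged observation (assumed here to be "Nevada") and refit, then test again
states2 <- states[rownames(states) != "Nevada", ]
fit2 <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states2)
outlierTest(fit2)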

8.5.2 Variable transformation
When the model does not meet the normality, constant variance, or linearity assumptions, transforming one or more variables can often improve or fine-tune the model.
When the model violates the normality assumption, you can usually try a transformation of the response variable.
The powerTransform() function in the car package uses maximum likelihood to estimate the power λ that normalizes the variable X^λ.

library(car)
Loading required package: carData
> states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
> summary(powerTransform(states$Murder))
bcPower Transformation to Normality 
              Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
states$Murder    0.6055           1       0.0884       1.1227

Likelihood ratio test that transformation parameter is equal to 0
 (log transformation)
                           LRT df     pval
LR test, lambda = (0) 5.665991  1 0.017297

Likelihood ratio test that no transformation is needed
                           LRT df    pval
LR test, lambda = (1) 2.122763  1 0.14512
> 

The results suggest that Murder^0.6 could be used to normalize the variable Murder. However, in this case the hypothesis λ = 1 cannot be rejected (p = 0.145), so there is no strong evidence that a transformation is actually needed here.
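If you did decide to apply the suggested power (purely illustrative here, since the test indicates no transformation is needed), a sketch of refitting with the transformed response:

states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
# refit with the power suggested by powerTransform() (illustration only)
fit_trans <- lm(I(Murder^0.6) ~ Population + Illiteracy + Income + Frost, data = states)
summary(fit_trans)
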
When the linearity assumption is violated, transforming the predictor variables is often more useful.
Example:

library(car)
> states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
> boxTidwell(Murder~Population + Illiteracy,data=states)
           MLE of lambda Score Statistic (z) Pr(>|z|)
Population       0.86939             -0.3228   0.7468
Illiteracy       1.35812              0.6194   0.5357

iterations =  19 
> 

The results suggest that the linear relationship could be improved by using Population^0.87 and Illiteracy^1.36. However, the score tests for Population (p = 0.75) and Illiteracy (p = 0.54) indicate that no transformation of these variables is required.
These results are consistent with the component-plus-residual plots.
Transforming the response variable can also improve heteroscedasticity (non-constant error variance).
Note: a transformation is much easier to interpret when it is meaningful in itself, such as the logarithm of income or the inverse of distance; if a transformation does not make sense, it should be avoided.
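As noted above, transforming the response can also help with heteroscedasticity. A minimal sketch (assumed workflow) using spreadLevelPlot() from the car package, which suggests a power transformation of the response to stabilize the error variance:

library(car)
states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
# plots absolute studentized residuals vs. fitted values and prints a
# suggested power transformation of the response
spreadLevelPlot(fit)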

8.5.3 Adding or deleting variables
Deleting variables is a very important way to deal with multicollinearity.
If you only need to make predictions, multicollinearity is not a problem; but if you also need to interpret each predictor, then it must be addressed.
The most common approach is to delete one of the collinear variables. Another option is ridge regression, a variant of multiple regression designed to handle multicollinearity.
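A minimal sketch of both options (assumed workflow: check collinearity with vif() from the car package, then either drop a predictor or fit a ridge regression with lm.ridge() from MASS):

library(car)
library(MASS)
states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
vif(fit)                     # variance inflation factors; large values flag collinearity
# option 1: drop one of the collinear predictors and refit
# option 2: ridge regression over a grid of penalty values
ridge <- lm.ridge(Murder ~ Population + Illiteracy + Income + Frost,
                  data = states, lambda = seq(0, 10, 0.1))
select(ridge)                # HKB, LW, and GCV suggestions for lambda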

8.5.4 Other methods
Use a different method to fit the model:
If there is multicollinearity, a ridge regression model can be used.
If outliers and/or influential points are present, a robust regression model can be used instead of OLS regression (a sketch follows this list).
If the normality assumption is violated, a nonparametric regression model may be used.
If there is significant nonlinearity, a nonlinear regression model can be tried.
If the assumption of independent errors is violated, a model with a specialized error structure can be used.
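A minimal sketch of the robust-regression alternative (assumption: M-estimation via rlm() from the MASS package as a stand-in for OLS when unusual observations are present):

library(MASS)
states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
fit.ols <- lm(Murder ~ Population + Illiteracy, data = states)
fit.rlm <- rlm(Murder ~ Population + Illiteracy, data = states)   # down-weights large residuals
summary(fit.rlm)
# compare coefficients with the OLS fit
coef(fit.ols); coef(fit.rlm)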

8.6 Choosing the best regression model
The anova() function in the base installation can compare the goodness of fit of two nested models. A nested model is one whose terms are completely contained within the other model.
In the multiple regression model for the states data, we found that the regression coefficients for Income and Frost were not significant. We can therefore test whether the model without these two variables predicts as well as the model containing them.
The code is shown below:

 states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
> fit1 <- lm(Murder~.,data=states)
> fit2 <- lm(Murder~Population+Illiteracy,data=states)
> anova(fit2,fit1)
Analysis of Variance Table

Model 1: Murder ~ Population + Illiteracy
Model 2: Murder ~ Population + Illiteracy + Income + Frost
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     47 289.25                           
2     45 289.17  2  0.078505 0.0061 0.9939
> 

Here, model 1 is nested within model 2. The anova() function then tests whether Income and Frost should be added to the linear model.
Since the test is not significant (p = 0.994), we can conclude that there is no need to add these two variables to the linear model.

The AIC (Akaike Information Criterion) can also be used to compare models; it takes into account a model's statistical fit and the number of parameters needed to achieve that fit.
Models with smaller AIC values are preferred, indicating an adequate fit with fewer parameters.
Sample code:

 states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
> fit1 <- lm(Murder~.,data=states)
> fit2 <- lm(Murder~Population+Illiteracy,data=states)
> AIC(fit2,fit1)
     df      AIC
fit2  4 237.6565
fit1  6 241.6429

Here the AIC values suggest that the model without Income and Frost is better.
Note that anova() requires nested models, whereas the AIC approach does not.

8.6.2 Variable Selection
There are two popular methods for selecting a final set of predictor variables from a large number of candidates: stepwise regression and all-subsets regression.
1. Stepwise regression
In stepwise regression, variables are added to or deleted from the model one at a time according to a chosen criterion.
The stepAIC() function in the MASS package implements stepwise regression, using exact AIC as the selection criterion.
Sample code:

 library(MASS)
> states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
> fit <- lm(Murder~.,data=states)
> # direction = "backward" specifies backward stepwise regression
> stepAIC(fit,direction = "backward")
Start:  AIC=97.75
Murder ~ Population + Illiteracy + Income + Frost

             Df Sum of Sq    RSS     AIC
- Frost       1     0.021 289.19  95.753
- Income      1     0.057 289.22  95.759
<none>                    289.17  97.749
- Population  1    39.238 328.41 102.111
- Illiteracy  1   144.264 433.43 115.986

Step:  AIC=95.75
Murder ~ Population + Illiteracy + Income

             Df Sum of Sq    RSS     AIC
- Income      1     0.057 289.25  93.763
<none>                    289.19  95.753
- Population  1    43.658 332.85 100.783
- Illiteracy  1   236.196 525.38 123.605

Step:  AIC=93.76
Murder ~ Population + Illiteracy

             Df Sum of Sq    RSS     AIC
<none>                    289.25  93.763
- Population  1    48.517 337.76  99.516
- Illiteracy  1   299.646 588.89 127.311

Call:
lm(formula = Murder ~ Population + Illiteracy, data = states)

Coefficients:
(Intercept)   Population   Illiteracy  
  1.6515497    0.0002242    4.0807366  

The model at the start contains all four predictor variables. At each step, the AIC column gives the AIC value of the model after the variable on that row is deleted; the <none> row gives the AIC of the model with no variable deleted.
At each step, the deletion yielding the smallest AIC is chosen, and the process is repeated until no further deletion reduces the AIC; the resulting model is the one we want.

2. All-subsets regression
As the name suggests, all-subsets regression tests every possible model.
The parameter nbest = n displays the n best models of each subset size: first the n best single-predictor models are shown, then the n best two-predictor models, and so on, up to the model containing all predictors.
All-subsets regression can be performed with the regsubsets() function in the leaps package. The best model can be chosen by R-squared, adjusted R-squared, the Mallows Cp statistic, or other criteria.
R-squared measures how well the predictors explain the response variable, but R-squared always increases as the number of variables increases.
Adjusted R-squared is similar, but provides a more honest estimate of the population R-squared.
The Mallows Cp statistic is also commonly used as a stopping rule in stepwise regression. Extensive study has shown that for a good model the Cp statistic is very close to the number of model parameters (including the intercept).
Sample code:

library(leaps)
states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
leaps <- regsubsets(Murder~Population + Illiteracy + Income + Frost, data=states, nbest=4)
plot(leaps, scale="adjr2")
library(car)
subsets(leaps,statistic="cp",
       main="Cp plot for All Subsets Regression")
abline(1,1,lty=2,col="red")

[Figure: all-subsets regression models plotted by adjusted R-squared]
In the first row (at the bottom of the plot), you can see that the model containing the intercept and Income has an adjusted R-squared of 0.033, while the model containing the intercept and Population has an adjusted R-squared of 0.1. Skipping to the 12th row, the model containing the intercept, Population, Illiteracy, and Income has an adjusted R-squared of 0.54, while the model containing only the intercept, Population, and Illiteracy has an adjusted R-squared of 0.55. The plot shows that the two-predictor model (Population and Illiteracy) is the best one.

[Figure: Cp plot for all-subsets regression]
In this plot, the better models are those lying close to the line with intercept 1 and slope 1.

In most cases, all-subsets regression is better than stepwise regression because it considers more models. But when there are many predictor variables, all-subsets regression becomes very slow. In general, automatic variable selection should be seen as an aid to model selection rather than a method that directly determines the model; your background knowledge of the subject is the ultimate guide to a good model.

Deeper analysis
1. Cross-validation
In cross-validation, a certain proportion of the data is selected as the training sample and the remaining data is kept as the holdout sample. The regression equation is first obtained on the training sample and then used to make predictions on the holdout sample. Because the holdout sample is not involved in selecting the model or its parameters, it provides a more accurate estimate of how well the equation will perform on new data.

In k-fold cross-validation, the sample is divided into k subsamples. In turn, k-1 of the subsamples are combined as the training set and the remaining subsample serves as the holdout set. This yields k prediction equations; the prediction performance on the k holdout samples is recorded and then averaged. (When k equals n, the total number of observations, this method is also called the jackknife.)

The crossval() function in the bootstrap package can perform k-fold cross-validation.

library(bootstrap)
> shrinkage<-function(fit,k=10){
+   require(bootstrap)
+   theta.fit<-function(x,y){lsfit(x,y)}
+   theta.predict<-function(fit,x){cbind(1,x)%*%fit$coef}
+   x<-fit$model[,2:ncol(fit$model)]
+   y<-fit$model[,1]
+   results<-crossval(x,y,theta.fit,theta.predict,ngroup=k)
+   r2<-cor(y,fit$fitted.values)^2
+   r2cv<-cor(y,results$cv.fit)^2
+   cat("Original R-square=",r2,"\n")
+   cat(k,"Fold Cross-Validated R-square=",r2cv,"\n")
+   cat("Change=",r2-r2cv,"\n")
+ }
> fit<-lm(Murder~Population+Income+Illiteracy+Frost,data=states)
> shrinkage(fit)
Original R-square= 0.5669502 
10 Fold Cross-Validated R-square= 0.4541713 
Change= 0.112779 

As you can see, the R-squared based on the original sample (0.567) is overly optimistic; a better estimate of the variance explained in new data is the cross-validated R-squared (0.454). (Note that observations are randomly assigned to the k groups, so the result will differ slightly each time the code is run.)

2. Relative importance
If we want to know which variables are most important: when the predictor variables are uncorrelated with one another, they can simply be ranked by their correlation with the response variable.
But when the predictor variables are correlated with each other, we can instead compare standardized regression coefficients. A standardized coefficient represents the expected change in the response variable (in standard deviation units) for a one standard deviation change in that predictor, holding the other predictors constant.
Before running the regression, the scale() function can be used to standardize the data set to mean 0 and standard deviation 1; fitting the regression in R then yields standardized regression coefficients. (Note that scale() returns a matrix, while lm() requires a data frame, so a conversion is needed.)
Sample code:

 states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
> zstates<-as.data.frame(scale(states))
> zfit<-lm(Murder~Population+Income+Illiteracy+Frost,data=zstates)
> coef(zfit)
  (Intercept)    Population        Income    Illiteracy         Frost 
-2.054026e-16  2.705095e-01  1.072372e-02  6.840496e-01  8.185407e-03 
> 

From this we can see that, holding the other factors constant, a one standard deviation change in the illiteracy rate increases the murder rate by 0.68 standard deviations. Based on the standardized regression coefficients, Illiteracy can be regarded as the most important predictor variable.

Summary:
When the model violates the normality assumption, you can generally try a transformation of the response variable, using the powerTransform() function in the car package.

When the linearity assumption is violated, transforming the predictor variables is often useful; the boxTidwell() function in the car package can be used.

The anova() function or the AIC() function can be used to compare the fit of two models; unlike AIC(), the anova() function requires the two models to be nested.

Stepwise regression and all-subsets regression can be used to select the final predictor variables from a set of candidates. All-subsets regression performs better because it considers more models, but it requires more computation and can become very slow when there are many variables.

If you want to know how the model will really perform, cross-validation can be used to validate its predictive performance.

If you want to know the importance of the independent variables, you can use standardized regression coefficients to compare the relative importance of multiple independent variables.
