Table of contents
Build regression equation
Variable Selection with Stepwise Regression
Regression Diagnosis
Regression Diagnosis: A General Approach
Regression Prediction
Appendix
In order to study the changing trend of China's civil aviation passenger traffic and its causes, we take civil aviation passenger traffic (y, 10,000 persons) as the dependent variable, and take total passenger traffic (x1, 10,000 persons), inbound tourists (x2, 10,000 persons), foreign inbound tourists (x3, 10,000 persons), domestic residents traveling abroad (x4, 10,000 persons), and domestic tourists (x5, 10,000 persons) as the explanatory variables; a multiple linear regression model is then established and analyzed. (The data file is at the end of the article.)
Source: Annual data from the National Bureau of Statistics of the People's Republic of China, http://www.stats.gov.cn/tjsj/.
Figure 1 Multi-factor analysis data of civil aviation passenger traffic
Build regression equation
Using the data in ex2.3.csv, we establish a linear regression equation of y on x1, x2, x3, x4 and x5 and conduct significance tests on the equation and the regression coefficients. The R program is as follows:
d2.3<-read.csv("ex2.3.csv",header = T) # read ex2.3.csv into d2.3
lm.exam<-lm(y~x1+x2+x3+x4+x5,data=d2.3) # fit the linear regression of y on x1, x2, x3, x4 and x5 using d2.3
summary(lm.exam)
Running the program gives the following output:
From the above results, the F value of the regression equation is 1413 and the corresponding P value is 2.2e-16, indicating that the equation as a whole is significant. However, the P values of the t tests show that the intercept and some of the coefficients are significant while the remaining coefficients are not, so variable selection is needed.
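The overall F statistic and the per-coefficient t-test p-values quoted above can also be pulled out of the summary object programmatically. A minimal sketch on simulated stand-in data (the data frame `d` and model `fit` here are hypothetical stand-ins for `d2.3` and `lm.exam`):

```r
set.seed(1)
# Stand-in data: y depends on x1 but not on x2, mimicking mixed significance
d <- data.frame(x1 = 1:30, x2 = rnorm(30))
d$y <- 5 + 2 * d$x1 + rnorm(30)
fit <- lm(y ~ x1 + x2, data = d)
s <- summary(fit)
s$fstatistic             # overall F value with its degrees of freedom
coef(s)[, "Pr(>|t|)"]    # p-value of each coefficient's t test
```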
Variable Selection with Stepwise Regression
In order to obtain the "optimal" regression equation, a stepwise regression method is used to establish a linear regression equation of y on x1, x2, x3, x4 and x5 and to perform significance tests on the equation and the regression coefficients. The R program is as follows:
# Stepwise regression
lm.step<-step(lm.exam,direction="both") # perform stepwise regression in both directions
Running the program gives the following output:
Interpretation of output results:
(1) In the first round, with all five explanatory variables in the model, removing one particular variable yields the minimum AIC of 274.25, so R removes that variable and enters the second round.
(2) The AIC is now 274.25. Removing one further variable brings it down to the minimum of 273.29, so R removes that variable and enters the third round.
(3) The AIC is now 273.29. Removing one more variable brings it down to the minimum of 272.65, so R removes that variable and enters the fourth round.
(4) The AIC is now 272.65. Removing or adding any variable would now increase the AIC, so the calculation stops and the optimal model is obtained: the linear regression of y on x2 and x5.
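step() chooses between models by comparing AIC values like those above; extractAIC() returns the same quantity for any fitted model. A minimal sketch on simulated data (all names here are illustrative, not from the ex2.3.csv analysis):

```r
set.seed(2)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
d$y <- 1 + 3 * d$x1 + rnorm(50)                    # only x1 truly matters
full <- lm(y ~ x1 + x2 + x3, data = d)
sel  <- step(full, direction = "both", trace = 0)  # trace = 0 suppresses the round-by-round log
extractAIC(full)[2]   # AIC of the full model
extractAIC(sel)[2]    # AIC of the selected model; never larger than the full model's
formula(sel)          # the retained variables
```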
Now use the command summary(lm.step) to get the following summary information of the regression model:
summary(lm.step) # coefficient estimates and significance tests
Running the program gives the following output:
Conclusion: the intercept and the coefficients of x2 and x5 are all significant, and the model as a whole is also significant (all P values are less than 0.05, so the null hypotheses are rejected), which gives the "optimal" regression equation of y on x2 and x5.
Regression Diagnosis
Residual analysis and detection of outliers (points that deviate significantly from the main body of the data)
The residual vector is an estimate of the random error term in the model. Residual analysis can diagnose whether the basic assumptions of the model hold, such as the independence and normality of the random errors. In R, the functions residuals(), rstandard(), and rstudent() compute the ordinary, standardized, and studentized residuals, respectively. If the regression model describes the data well, the scatterplot of the residuals against the fitted values should look like randomly scattered points. A "large" residual indicates that the point lies far from the main body of the data. As a rule of thumb, observations whose standardized residual has an absolute value of at least 2 are treated as suspicious points, and those with an absolute value of at least 3 are treated as outliers.
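The standardized residual divides each ordinary residual e_i by its estimated standard deviation s*sqrt(1 - h_ii), where h_ii is the i-th leverage; rstandard() does exactly this. A minimal sketch verifying the relation on simulated data (all names illustrative):

```r
set.seed(3)
d <- data.frame(x = rnorm(25))
d$y <- 2 + d$x + rnorm(25)
fit <- lm(y ~ x, data = d)
e <- residuals(fit)                  # ordinary residuals
h <- hatvalues(fit)                  # leverages: diagonal of the hat matrix
s <- summary(fit)$sigma              # residual standard error
r.manual <- e / (s * sqrt(1 - h))
all.equal(r.manual, rstandard(fit))  # the two computations agree
```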
Next, use residuals(), rstandard(), and rstudent() to calculate the ordinary residuals, standardized residuals, and studentized residuals of the above stepwise regression model lm.step. The R program is as follows:
# The stepwise regression model lm.step has been obtained
y.res<-residuals(lm.step) # ordinary residuals of lm.step
y.rst<-rstandard(lm.step) # standardized residuals of lm.step
print(y.rst) # print the standardized residuals
y.fit<-predict(lm.step) # fitted values of lm.step
plot(y.res~y.fit) # plot ordinary residuals against fitted values
plot(y.rst~y.fit) # plot standardized residuals against fitted values
After running, the standardized residuals y.rst of the regression model lm.step are as follows:
It can be seen that the absolute value of the standardized residual of the 12th point (3.18) is greater than 3, so the 12th observation may be an outlier.
The residual scatterplots of the regression model lm.step are shown in Figure 2 and Figure 3. The residuals tend to decrease first and then increase with the fitted values, so the basic assumption of homoscedasticity may not hold.
Figure 2 Ordinary residual plot
Figure 3 Standardized residual plot
If the assumption of homoscedasticity does not hold, the problem can sometimes be solved by applying a suitable transformation to the dependent variable y. The common variance-stabilizing transformations are:
(1) Logarithmic transformation: y* = ln(y);
(2) Square-root transformation: y* = sqrt(y);
(3) Reciprocal transformation: y* = 1/y.
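Each of these transformations can be written directly inside a model formula. A minimal sketch on simulated data whose error spread grows with the mean (names are illustrative):

```r
set.seed(4)
d <- data.frame(x = 1:40)
d$y <- exp(0.1 * d$x + rnorm(40, sd = 0.2))  # multiplicative errors: spread grows with the mean
fit.log  <- lm(log(y)  ~ x, data = d)  # (1) logarithmic transformation
fit.sqrt <- lm(sqrt(y) ~ x, data = d)  # (2) square-root transformation
fit.inv  <- lm(I(1/y)  ~ x, data = d)  # (3) reciprocal transformation; I() protects 1/y in a formula
```

On data like this, with errors multiplicative on the original scale, the log transformation makes the variance constant, which is why it is the one tried first in the text.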
Next, we use the logarithmic transformation to address the heteroscedasticity of the stepwise regression model lm.step. The R program is as follows:
# Apply a log transformation to lm.step to address the heteroscedasticity
lm.step_new<-update(lm.step,log(.)~.) # refit the model with a log-transformed response
y.rst<-rstandard(lm.step_new) # standardized residuals of lm.step_new
y.fit<-predict(lm.step_new) # fitted values of lm.step_new
plot(y.rst~y.fit) # plot standardized residuals against fitted values
Figure 4 Standardized residual plot after logarithmic transformation
Comparing the standardized residual plots in Figure 3 and Figure 4 shows that the residual scatter improves after the logarithmic transformation, but point 12 remains an outlier. As a simple treatment, we remove observation 12 and repeat the regression analysis and residual analysis. The R program is as follows:
# Remove observation 12 and repeat the regression and residual analysis
lm.exam<-lm(log(y)~x1+x2+x3+x4+x5,data=d2.3[-c(12),]) # refit the full-variable model without observation 12
lm.step<-step(lm.exam,direction = "both") # stepwise regression in both directions
y.rst<-rstandard(lm.step) # standardized residuals of lm.step
y.fit<-predict(lm.step) # fitted values of lm.step
plot(y.rst~y.fit) # plot standardized residuals against fitted values
Figure 5 Standardized residuals after logarithmic transformation: Observation No. 12 removed
Regression Diagnosis: A General Approach
The residual analysis above judges whether the basic model assumptions hold and which points may be outliers by computing the residual at each sample point, but it cannot detect influential points, that is, points whose presence has a large effect on the fitted model. Below we give a general approach to regression diagnosis, which can check whether the basic assumptions hold, which points are outliers, and which points are strongly influential. In R, the functions plot() and influence.measures() draw the diagnostic plots and compute the diagnostic statistics. We now apply them to the stepwise regression model lm.step_new. The R program is as follows:
# The model lm.step_new has been obtained
par(mfrow=c(2,2)) # arrange 4 plots in a 2x2 grid
plot(lm.step_new) # draw the model diagnostic plots
influence.measures(lm.step_new)
Running the above program produces the regression diagnostic plots in Figure 6 and the values of the diagnostic statistics for the 20 observations:
Figure 6 Regression diagnostic diagram
Figure 6 shows four regression diagnostic plots for the stepwise regression model lm.step_new:
(1) Residuals vs Fitted; (2) Normal Q-Q;
(3) Scale-Location; (4) Residuals vs Leverage.
These four plots show that the points in the Residuals vs Fitted plot are basically randomly scattered, and the points in the Normal Q-Q plot fall roughly on a straight line, indicating that the residuals are approximately normally distributed. In both the Scale-Location plot and the Residuals vs Leverage plot, observation No. 20 deviates farthest from the center, suggesting that it may be an outlier or a strongly influential point.
influence.measures(lm.step_new) gives the values of the diagnostic statistics DFBETAS (dfb.*), DFFITS (dffit), the covariance ratio (cov.r), Cook's distance (cook.d), and the hat-matrix diagonal (hat):
Note the asterisk at the right end of the rows for observations No. 1 and No. 20, indicating that these two points are diagnosed as influential.
It should be noted that this method can identify influential observations, that is, strongly influential points; however, such points should not simply be deleted, and how to handle them requires further discussion.
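The asterisks in the influence.measures() printout correspond to a logical matrix is.inf in the returned object, so the flagged observations can also be extracted programmatically. A minimal sketch on simulated data with one planted influential point (all names illustrative):

```r
set.seed(5)
d <- data.frame(x = rnorm(20))
d$y <- 1 + 2 * d$x + rnorm(20)
d[20, ] <- c(5, 30)                         # plant a high-leverage, large-residual point
fit <- lm(y ~ x, data = d)
im <- influence.measures(fit)               # dfbetas, dffits, cov.r, cook.d, hat
flagged <- which(apply(im$is.inf, 1, any))  # rows carrying at least one asterisk
flagged                                     # observation 20 is among them
summary(im)                                 # prints only the flagged observations
```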
Regression Prediction
Regression prediction includes point prediction and interval prediction, both available through predict(). Given the explanatory variable values x2 = 10903.82 and x5 = 7025, we use the regression model lm.step to make a point prediction and a 95% interval prediction. The R program is as follows:
# Assume the model lm.step has been obtained
preds<-data.frame(x2=10903.82,x5=7025) # values of the explanatory variables x2 and x5
predict(lm.step,newdata = preds,interval = "prediction",level = 0.95) # point and interval prediction
Running the program gives the following point and interval predictions:
In the program, the option interval="prediction" requests an interval prediction and level=0.95 sets the confidence level to 95%. The resulting point prediction is 11406.428, and the interval prediction is [-571.5959, 3384.451].
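For contrast, interval = "confidence" gives the narrower interval for the mean response at the same point, while interval = "prediction" also accounts for the noise of a single new observation. A minimal sketch on simulated data (names illustrative, not from the ex2.3.csv analysis):

```r
set.seed(6)
d <- data.frame(x = 1:30)
d$y <- 10 + 3 * d$x + rnorm(30)
fit <- lm(y ~ x, data = d)
new <- data.frame(x = 15)
pred.int <- predict(fit, newdata = new, interval = "prediction", level = 0.95)
conf.int <- predict(fit, newdata = new, interval = "confidence", level = 0.95)
pred.int  # columns fit, lwr, upr: point prediction and prediction interval
conf.int  # same point prediction, but a narrower interval for the mean response
```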
Appendix
All R programs:
d2.3<-read.csv("ex2.3.csv",header = T) # read ex2.3.csv into d2.3
lm.exam<-lm(y~x1+x2+x3+x4+x5,data=d2.3) # fit the linear regression of y on x1, x2, x3, x4 and x5 using d2.3
summary(lm.exam)
# Stepwise regression
lm.step<-step(lm.exam,direction="both") # perform stepwise regression in both directions
summary(lm.step) # coefficient estimates and significance tests
# The stepwise regression model lm.step has been obtained
y.res<-residuals(lm.step) # ordinary residuals of lm.step
y.rst<-rstandard(lm.step) # standardized residuals of lm.step
print(y.rst) # print the standardized residuals
y.fit<-predict(lm.step) # fitted values of lm.step
plot(y.res~y.fit) # plot ordinary residuals against fitted values
plot(y.rst~y.fit) # plot standardized residuals against fitted values
# Apply a log transformation to lm.step to address the heteroscedasticity
lm.step_new<-update(lm.step,log(.)~.) # refit the model with a log-transformed response
y.rst<-rstandard(lm.step_new) # standardized residuals of lm.step_new
y.fit<-predict(lm.step_new) # fitted values of lm.step_new
plot(y.rst~y.fit) # plot standardized residuals against fitted values
# Remove observation 12 and repeat the regression and residual analysis
lm.exam<-lm(log(y)~x1+x2+x3+x4+x5,data=d2.3[-c(12),]) # refit the full-variable model without observation 12
lm.step<-step(lm.exam,direction = "both") # stepwise regression in both directions
y.rst<-rstandard(lm.step) # standardized residuals of lm.step
y.fit<-predict(lm.step) # fitted values of lm.step
plot(y.rst~y.fit) # plot standardized residuals against fitted values
# The model lm.step_new has been obtained
par(mfrow=c(2,2)) # arrange 4 plots in a 2x2 grid
plot(lm.step_new) # draw the model diagnostic plots
influence.measures(lm.step_new)
# Assume the model lm.step has been obtained
preds<-data.frame(x2=10903.82,x5=7025) # values of the explanatory variables x2 and x5
predict(lm.step,newdata = preds,interval = "prediction",level = 0.95) # point and interval prediction
Data for this exercise:
t y x1 x2 x3 x4 x5
1999 6094 1394413 7279.56 843.23 923.24 71900
2000 6722 1478573 8344.39 1016.04 1047.26 74400
2001 7524 1534122 8901.3 1122.64 1213.44 78400
2002 8594 1608150 9790.8 1343.95 1660.23 87800
2003 8759 1587497 9166.21 1140.29 2022.19 87000
2004 12123 1767453 10903.82 1693.25 2885 110200
2005 13827 1847018 12029.23 2025.51 3102.63 121200
2006 15968 2024158 12494.21 2221.03 3452.36 139400
2007 18576.21 2227761 13187.33 2610.97 4095.4 161000
2008 19251.16 2867892.14 13002.74 2432.53 4584.44 171200
2009 23051.64 2976897.83 12647.59 2193.75 4765.62 190200
2010 26769.14 3269508.17 13376.22 2612.69 5738.65 210300
2011 29316.66 3526318.73 13542.35 2711.2 7025 264100
2012 31936.05 3804034.9 13240.53 2719.16 8318.17 295700
2013 35396.63 2122991.55 12907.78 2629.03 9818.52 326200
2014 39194.88 2032218 12849.83 2636.08 11659.32 361100
2015 43618 1943271 13382.04 2598.54 12786 400000
2016 48796.05 1900194.34 13844.38 2815.12 13513 444000
2017 55156.11 1848620.12 13948 2917 14272.74 500000
2018 61173.77 1793820.32 14119.83 3054.29 16199 553900