A Multiple Linear Regression Model of the Trend in China's Civil Aviation Passenger Volume, Built in R (with R code and commentary)

Table of contents

Building the Regression Equation

Variable Selection with Stepwise Regression

Regression Diagnosis

Regression Diagnosis: A General Approach

Regression Prediction

Appendix

All R programs

Topic data


       To study the trend in China's civil aviation passenger volume and its causes, we take civil aviation passenger volume y (10,000 people) as the dependent variable, and take total passenger traffic x_{1} (10,000 people), inbound tourists x_{2} (10,000 people), foreign inbound tourists x_{3} (10,000 people), domestic residents traveling abroad x_{4} (10,000 people), and domestic tourists x_{5} (10,000 people) as the main explanatory variables; a multiple linear regression model is then established and analyzed. (The data file is at the end of the article.)

Source: Annual data from the National Bureau of Statistics of the People's Republic of China, http://www.stats.gov.cn/tjsj/. 

Figure 1 Multi-factor analysis data of civil aviation passenger traffic

Building the Regression Equation

      Using the data in ex2.3.csv, we establish a linear regression equation of y on x_{1}, x_{2}, x_{3}, x_{4}, and x_{5} and test the significance of the equation and of the regression coefficients. The R program is as follows:

d2.3<-read.csv("ex2.3.csv",header = T) # read ex2.3.csv into d2.3
lm.exam<-lm(y~x1+x2+x3+x4+x5,data=d2.3) # fit the linear regression of y on x1-x5 using d2.3
summary(lm.exam)

       Running the program gives the following output:

       From the output, the F statistic of the regression equation is 1413 with a p-value below 2.2e-16, so the equation as a whole is significant. The t-test p-values, however, show that only the intercept and x_{5} are significant, while x_{1}, x_{2}, x_{3}, and x_{4} are not.

Variable Selection with Stepwise Regression

      To obtain an "optimal" regression equation, stepwise regression is used to select variables for the linear regression of y on x_{1}, x_{2}, x_{3}, x_{4}, and x_{5}, and the equation and regression coefficients are again tested for significance. The R program is as follows:

#Stepwise regression
lm.step<-step(lm.exam,direction="both") # run stepwise selection in both directions

 Running the program gives the following output:

 Interpretation of output results:

(1) Starting from the full model, removing x_{1} gives the smallest AIC, 274.25, so R removes x_{1} and proceeds to the second round.

(2) With AIC now 274.25, removing x_{3} reduces the AIC to its minimum, 273.29, so R removes x_{3} and proceeds to the third round.

(3) With AIC now 273.29, removing x_{4} reduces the AIC to its minimum, 272.65, so R removes x_{4} and proceeds to the fourth round.

(4) With AIC now 272.65, removing or adding any variable would increase the AIC, so the search stops. The resulting optimal model is the linear regression of y on x_{2} and x_{5}.
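The round-by-round AIC comparison that step() performs can be reproduced by hand with drop1(). A minimal sketch on synthetic data (not the article's dataset; all names here are illustrative):

```r
# Synthetic illustration of AIC-based stepwise selection (not the article's data)
set.seed(5)
n  <- 40
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x2 + rnorm(n)                # only x2 truly enters the model
fit <- lm(y ~ x1 + x2 + x3)
drop1(fit)                                 # AIC after dropping each term, vs "<none>"
fit.step <- step(fit, direction = "both", trace = 0)  # silent stepwise search
formula(fit.step)                          # the retained model
```

At each stage step() picks the action (drop or re-add a term) with the lowest AIC, exactly as drop1() tabulates, and stops when "<none>" is the minimum.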

Now use the command summary(lm.step) to get the following summary information of the regression model:

summary(lm.step) # report coefficient estimates and significance tests

 Output:

 Conclusion: the intercept, x_{2}, and x_{5} are all significant, and the model as a whole is significant (all p-values are below α = 0.05, so the null hypotheses are rejected). We therefore obtain the following "optimal" regression equation:

Regression Diagnosis

       Residual analysis and detection of outliers (points that deviate markedly from the bulk of the data)

The residual vector e=y-\hat{y}=y-X\hat{\beta} is an estimate of the model's random error term \varepsilon. Residual analysis can diagnose whether the model's basic assumptions hold (for example, independence and normality of the random errors). In R, residuals(), rstandard(), and rstudent() compute the ordinary, standardized, and studentized residuals, respectively. If the regression model describes the data well, a scatterplot of the residuals against the fitted values should look like randomly scattered points. A "large" residual indicates a point far from the bulk of the data. As a rule of thumb, observations whose standardized residuals have absolute value at least 2 are treated as suspicious points, and those with absolute value at least 3 as outliers.
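The 2/3 screening rule above can be sketched as follows, on synthetic data with one planted gross error (not the article's dataset):

```r
# Flag suspicious points and outliers by standardized residuals (synthetic data)
set.seed(1)
x <- 1:30
y <- 2 * x + rnorm(30)
y[12] <- y[12] + 20                    # plant a gross error at observation 12
fit <- lm(y ~ x)
r   <- rstandard(fit)                  # standardized residuals
which(abs(r) >= 2)                     # suspicious points
which(abs(r) >= 3)                     # outliers
```

The planted point dominates both lists; in the article the same screen singles out observation 12 of the real data.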

       Next, use residuals(), rstandard(), and rstudent() to calculate the ordinary residuals, standardized residuals, and studentized residuals of the above stepwise regression model lm.step. The R program is as follows:

#The stepwise model lm.step has been obtained
y.res<-residuals(lm.step) # ordinary residuals of lm.step
y.rst<-rstandard(lm.step) # standardized residuals of lm.step
y.rstu<-rstudent(lm.step) # studentized residuals of lm.step
print(y.rst) # print the standardized residuals of lm.step
y.fit<-predict(lm.step) # fitted values of lm.step
plot(y.res~y.fit) # scatterplot of ordinary residuals vs fitted values
plot(y.rst~y.fit) # scatterplot of standardized residuals vs fitted values

       After running, the standardized residual y.rst of the regression model lm.step is obtained as follows:

       From the standardized residuals, observation 12 has a standardized residual with absolute value 3.18 > 3, so observation 12 may be an outlier.

       The residual scatterplots of the regression model lm.step are shown in Figures 2 and 3. The residuals tend to decrease and then increase with the fitted values, so the basic assumption of homoscedasticity may not hold.

 Figure 2 Ordinary residual plot

 Figure 3 Standardized residual plot

      If the homoscedasticity assumption fails, the problem of non-constant variance can sometimes be resolved by a suitable transformation of the dependent variable. Common variance-stabilizing transformations are:

(1) Logarithmic transformation: z=\ln y

(2) Square root transformation: z=\sqrt{y}

(3) Reciprocal transformation: z=\frac{1}{y}
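All three transformations can be applied to an existing model with update(), as the article does below for the logarithm. A toy sketch on synthetic heteroscedastic data (not the article's dataset):

```r
# Variance-stabilizing transformations via update() (synthetic data)
set.seed(2)
x <- 1:50
y <- exp(0.1 * x + rnorm(50, sd = 0.1))  # multiplicative error: spread grows with x
fit      <- lm(y ~ x)
fit.log  <- update(fit, log(.) ~ .)      # (1) logarithmic
fit.sqrt <- update(fit, sqrt(.) ~ .)     # (2) square root
fit.inv  <- update(fit, I(1/.) ~ .)      # (3) reciprocal
```

In update(), the "." on the left-hand side stands for the original response, so log(.) refits the same model with a transformed dependent variable.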

       Next, we use the logarithmic transformation to address the non-constant variance of the stepwise model lm.step. The R program is as follows:

#With lm.step obtained, apply a log transformation to address the non-constant variance
lm.step_new<-update(lm.step,log(.)~.) # log-transform the response
y.rst<-rstandard(lm.step_new) # standardized residuals of lm.step_new
y.fit<-predict(lm.step_new) # fitted values of lm.step_new
plot(y.rst~y.fit) # scatterplot of standardized residuals vs fitted values

 Figure 4 Standardized residual plot after logarithmic transformation

      Comparing the standardized residual plots in Figures 3 and 4 shows that the residual scatter improves after the log transformation, but observation 12 remains an outlier. As a simple treatment, we remove observation 12 and repeat the regression and residual analysis. The R program is as follows:

#Remove observation 12 and repeat the regression and residual analysis
lm.exam<-lm(log(y)~x1+x2+x3+x4+x5,data=d2.3[-c(12),]) # refit the full model without observation 12
lm.step<-step(lm.exam,direction = "both") # stepwise regression
y.rst<-rstandard(lm.step) # standardized residuals of lm.step
y.fit<-predict(lm.step) # fitted values of lm.step
plot(y.rst~y.fit) # scatterplot of standardized residuals vs fitted values

 Figure 5 Standardized residuals after logarithmic transformation: Observation No. 12 removed

Regression Diagnosis: A General Approach

       The residual analysis above judges whether the model's basic assumptions hold, and which points may be outliers, by computing the residual at each sample point; it cannot, however, identify influential points, i.e., observations that strongly affect inference about the model. A general regression-diagnostic approach is given below: it can assess the model's basic assumptions, flag outliers, and flag influential points. In R, the functions plot() and influence.measures() draw the diagnostic plots and compute the diagnostic statistics, respectively. We now describe the output of these two functions and run the diagnostics on the stepwise model lm.step_new. The R program is as follows:

#The model lm.step_new has been obtained
par(mfrow=c(2,2)) # 2x2 grid of plotting regions
plot(lm.step_new) # draw the four diagnostic plots
influence.measures(lm.step_new) # compute the influence diagnostics

        Running the program produces the regression diagnostic plots in Figure 6 and the values of the diagnostic statistics for the 20 observations:

 Figure 6 Regression diagnostic diagram

          Figure 6 shows four regression diagnostic plots for the stepwise regression model lm.step_new:

(1) Residuals vs Fitted; (2) Normal Q-Q; 

(3) Scale-Location; (4) Residuals vs Leverage. 

      These four plots show that the points in the Residuals vs Fitted plot are essentially randomly scattered, and the points in the Normal Q-Q plot fall roughly on a straight line, indicating approximately normal residuals. In the Scale-Location and Residuals vs Leverage plots, observation 20 deviates farthest from the center, suggesting it may be an outlier or an influential point.

influence.measures(lm.step_new) gives the values of the diagnostic statistics DFBETAS, DFFITS (dffit), covariance ratio (cov.r), Cook's distance (cook.d), and hat values (hat):

      Note the asterisk at the right end of the rows for observations 1 and 20, indicating that these two points are diagnosed as influential.

      Note that this method identifies influential observations, but such points cannot simply be deleted; how to handle them requires further analysis.
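The asterisk flags come from the is.inf component of the influence.measures() result, and summary() on that result prints only the flagged rows. A minimal sketch on synthetic data (not the article's model):

```r
# Inspect which observations influence.measures() flags (synthetic data)
set.seed(3)
x <- 1:20
y <- x + rnorm(20)
y[20] <- y[20] + 15                     # make the last point highly influential
fit <- lm(y ~ x)
im  <- influence.measures(fit)          # DFBETAS, DFFITS, cov.r, Cook's d, hat
summary(im)                             # prints only the flagged (starred) rows
which(apply(im$is.inf, 1, any))         # indices of flagged observations
```

An observation is starred if it exceeds the threshold for any one of the statistics, so the flagged set should be read as candidates for inspection, not for automatic deletion.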

Regression Prediction

        Regression prediction includes point prediction and interval prediction, both available through predict(). Given the explanatory variable values x_{2}=10903.82 and x_{5}=7025, we use the regression model lm.step for point prediction and 95% interval prediction. The R program is as follows:

#Assume the model lm.step has been obtained
preds<-data.frame(x2=10903.82,x5=7025) # values of the explanatory variables x2 and x5
predict(lm.step,newdata = preds,interval = "prediction",level = 0.95) # point and interval prediction

Running the program gives the following point and interval predictions for y:

        The option interval="prediction" requests an interval prediction, and level=0.95 sets the confidence level to 95%. The resulting point prediction of y is 11406.428, and the interval prediction is [-571.5959, 3384.451].
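For comparison, interval = "prediction" gives limits for a new individual observation, while interval = "confidence" gives narrower limits for the mean response at the same point. A sketch on synthetic data (not the article's model; the coefficients and ranges here are made up):

```r
# Prediction vs confidence intervals from predict() (synthetic data)
set.seed(4)
x2 <- runif(30, 8000, 14000)
x5 <- runif(30, 70000, 560000)
y  <- 0.5 * x2 + 0.05 * x5 + rnorm(30, sd = 500)
fit <- lm(y ~ x2 + x5)
new <- data.frame(x2 = 10903.82, x5 = 300000)
predict(fit, newdata = new, interval = "prediction", level = 0.95)  # new observation
predict(fit, newdata = new, interval = "confidence", level = 0.95)  # mean response
```

The prediction interval is always the wider of the two, since it adds the variance of a single new error term to the uncertainty of the fitted mean.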

Appendix

All R programs:

d2.3<-read.csv("ex2.3.csv",header = T) # read ex2.3.csv into d2.3
lm.exam<-lm(y~x1+x2+x3+x4+x5,data=d2.3) # fit the linear regression of y on x1-x5 using d2.3
summary(lm.exam)

#Stepwise regression
lm.step<-step(lm.exam,direction="both") # run stepwise selection in both directions

summary(lm.step) # report coefficient estimates and significance tests


#The stepwise model lm.step has been obtained
y.res<-residuals(lm.step) # ordinary residuals of lm.step
y.rst<-rstandard(lm.step) # standardized residuals of lm.step
y.rstu<-rstudent(lm.step) # studentized residuals of lm.step
print(y.rst) # print the standardized residuals of lm.step
y.fit<-predict(lm.step) # fitted values of lm.step
plot(y.res~y.fit) # scatterplot of ordinary residuals vs fitted values
plot(y.rst~y.fit) # scatterplot of standardized residuals vs fitted values


#With lm.step obtained, apply a log transformation to address the non-constant variance
lm.step_new<-update(lm.step,log(.)~.) # log-transform the response
y.rst<-rstandard(lm.step_new) # standardized residuals of lm.step_new
y.fit<-predict(lm.step_new) # fitted values of lm.step_new
plot(y.rst~y.fit) # scatterplot of standardized residuals vs fitted values


#Remove observation 12 and repeat the regression and residual analysis
lm.exam<-lm(log(y)~x1+x2+x3+x4+x5,data=d2.3[-c(12),]) # refit the full model without observation 12
lm.step<-step(lm.exam,direction = "both") # stepwise regression
y.rst<-rstandard(lm.step) # standardized residuals of lm.step
y.fit<-predict(lm.step) # fitted values of lm.step
plot(y.rst~y.fit) # scatterplot of standardized residuals vs fitted values

#The model lm.step_new has been obtained
par(mfrow=c(2,2)) # 2x2 grid of plotting regions
plot(lm.step_new) # draw the four diagnostic plots
influence.measures(lm.step_new) # compute the influence diagnostics


#Assume the model lm.step has been obtained
preds<-data.frame(x2=10903.82,x5=7025) # values of the explanatory variables x2 and x5
predict(lm.step,newdata = preds,interval = "prediction",level = 0.95) # point and interval prediction



 Topic data:

t	y	x1	x2	x3	x4	x5
1999	6094	1394413	7279.56	843.23	923.24	71900
2000	6722	1478573	8344.39	1016.04	1047.26	74400
2001	7524	1534122	8901.3	1122.64	1213.44	78400
2002	8594	1608150	9790.8	1343.95	1660.23	87800
2003	8759	1587497	9166.21	1140.29	2022.19	87000
2004	12123	1767453	10903.82	1693.25	2885	110200
2005	13827	1847018	12029.23	2025.51	3102.63	121200
2006	15968	2024158	12494.21	2221.03	3452.36	139400
2007	18576.21	2227761	13187.33	2610.97	4095.4	161000
2008	19251.16	2867892.14	13002.74	2432.53	4584.44	171200
2009	23051.64	2976897.83	12647.59	2193.75	4765.62	190200
2010	26769.14	3269508.17	13376.22	2612.69	5738.65	210300
2011	29316.66	3526318.73	13542.35	2711.2	7025	264100
2012	31936.05	3804034.9	13240.53	2719.16	8318.17	295700
2013	35396.63	2122991.55	12907.78	2629.03	9818.52	326200
2014	39194.88	2032218	12849.83	2636.08	11659.32	361100
2015	43618	1943271	13382.04	2598.54	12786	400000
2016	48796.05	1900194.34	13844.38	2815.12	13513	444000
2017	55156.11	1848620.12	13948	2917	14272.74	500000
2018	61173.77	1793820.32	14119.83	3054.29	16199	553900

Origin blog.csdn.net/weixin_44734502/article/details/129312784