Several Solutions to Multicollinearity Problems

Abstract:      Among the classical assumptions of the multiple linear regression model, one important assumption is that there is no linear relationship between the explanatory variables, that is, none of the explanatory variables X1, X2, ..., Xk can be expressed as a linear combination of the others. When this assumption is violated, the linear regression model is said to suffer from multicollinearity. Multicollinearity violates the classical assumption of uncorrelatedness among the explanatory variables and has serious consequences for ordinary least squares. This article summarizes eight methods for dealing with multicollinearity and then examines the stepwise regression method in detail through a worked example.

    Among the classical assumptions of the multiple linear regression model, one important assumption is that there is no linear relationship between the explanatory variables of the regression model, that is, none of the explanatory variables X1, X2, ..., Xk can be expressed as a linear combination of the other explanatory variables. If this assumption is violated, that is, one explanatory variable is linearly related to the other explanatory variables, the linear regression model is said to suffer from multicollinearity. Multicollinearity violates the classical assumption of uncorrelatedness among the explanatory variables and has serious consequences for ordinary least squares.

    The so-called multicollinearity refers to the distortion or inaccuracy of model estimation caused by exact or high correlation among the explanatory variables of a linear regression model. Here we summarize eight methods for dealing with multicollinearity, which you can refer to when you encounter such problems:

1. Retain important explanatory variables and remove secondary or alternative explanatory variables

    Collinearity between independent variables indicates that the information they provide overlaps, so unimportant independent variables can be deleted to reduce the redundant information. However, care must be taken when removing independent variables from the model: only remove variables that actual economic analysis shows to be relatively unimportant and that tests of the partial correlation coefficients confirm to be a cause of the collinearity. If a variable is deleted improperly, a model specification error occurs, resulting in seriously biased parameter estimates.

2. Change the form of the explanatory variable

    Changing the form of the explanatory variables is an easy way to deal with multicollinearity, such as using relative variables for cross-sectional data and incremental variables for time series data.
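    As a minimal sketch of these transformations in Python with pandas (the data frame and column names below are hypothetical, purely for illustration):

```python
import pandas as pd

# Hypothetical yearly data; the column names are made up for illustration.
df = pd.DataFrame({
    "consumption": [120, 135, 150, 170, 195, 220],
    "income":      [800, 880, 960, 1080, 1210, 1350],
    "assets":      [300, 330, 370, 420, 480, 540],
})

# Relative variables (e.g. shares of income), often used with cross-sectional data.
relative = df[["consumption", "assets"]].div(df["income"], axis=0)

# Incremental (first-difference) variables, often used with time-series data.
incremental = df.diff().dropna()

print(relative)
print(incremental)
```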

3. Difference method

4. Stepwise regression analysis

    Stepwise regression is a commonly used method to eliminate multicollinearity and select the "optimal" regression equation. The method introduces the independent variables one by one, the condition for introduction being that the variable is significant according to an F test. After each new variable is introduced, the variables already selected are re-tested one by one; if a previously selected variable is no longer significant, it is removed. Introducing one variable or removing one variable from the regression equation is one step of stepwise regression, and an F test is performed at each step to ensure that only significant variables are included in the regression equation before each new variable is introduced. This process is repeated until no insignificant independent variable can be added to the regression equation and no significant independent variable can be removed from it.
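    A minimal sketch of such a selection procedure with statsmodels is shown below; it uses the p-values of the coefficient t tests (for a single added variable this is equivalent to the partial F test), and the data frame `df`, the response column `y_col` and the thresholds are assumptions made for illustration:

```python
import statsmodels.api as sm

def stepwise_select(df, y_col, enter_p=0.05, remove_p=0.10):
    """Forward selection with backward re-checks, as described above."""
    y = df[y_col]
    candidates = [c for c in df.columns if c != y_col]
    selected = []
    changed = True
    while changed:
        changed = False
        # Try to introduce the most significant variable not yet selected.
        best_p, best_var = 1.0, None
        for var in candidates:
            if var in selected:
                continue
            X = sm.add_constant(df[selected + [var]])
            p = sm.OLS(y, X).fit().pvalues[var]
            if p < best_p:
                best_p, best_var = p, var
        if best_var is not None and best_p < enter_p:
            selected.append(best_var)
            changed = True
        # Re-test the variables already selected; drop one that is no longer significant.
        if selected:
            X = sm.add_constant(df[selected])
            pvals = sm.OLS(y, X).fit().pvalues.drop("const")
            worst_var = pvals.idxmax()
            if pvals[worst_var] > remove_p:
                selected.remove(worst_var)
                changed = True
    return selected
```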

5. Principal Component Analysis

As a common method of multivariate statistical analysis, principal component analysis has clear advantages in dealing with multivariate problems, its dimensionality-reduction ability in particular. Principal component regression is applicable to general multicollinearity problems, especially when the collinearity between the variables is strong.
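A minimal sketch of principal component regression with scikit-learn (the simulated collinear data and the number of retained components are assumptions made for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical collinear regressors: x2 is almost an exact multiple of x1.
x1 = rng.normal(size=50)
x2 = 2 * x1 + rng.normal(scale=0.01, size=50)
x3 = rng.normal(size=50)
X = np.column_stack([x1, x2, x3])
y = 1.0 + 0.5 * x1 - 0.3 * x3 + rng.normal(scale=0.1, size=50)

# Standardize, keep the leading principal components, then regress y on them.
pcr = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
pcr.fit(X, y)
print("R^2 on the training data:", pcr.score(X, y))
```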

6. Partial Least Squares Regression

7. Ridge Regression

    Ridge regression estimation remedies multicollinearity by modifying the least squares method and allowing biased estimators of the regression coefficients. It accepts a small amount of bias in exchange for much higher precision than the unbiased estimator, so the estimates are more likely to be close to the true values. Used flexibly, ridge regression can provide unique and effective help in analyzing the roles of, and relationships among, the variables.
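    A minimal sketch of ridge regression with scikit-learn (the penalty grid and the simulated collinear data are assumptions made for illustration):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical collinear regressors, as in the PCR sketch above.
x1 = rng.normal(size=50)
x2 = 2 * x1 + rng.normal(scale=0.01, size=50)
X = np.column_stack([x1, x2])
y = 1.0 + 0.5 * x1 + rng.normal(scale=0.1, size=50)

# Choose the ridge penalty by cross-validation, then inspect the shrunken coefficients.
model = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 13)))
model.fit(X, y)
print("chosen alpha:", model[-1].alpha_)
print("ridge coefficients:", model[-1].coef_)
```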

8. Increase the sample size

    The essence of the multicollinearity problem is that the model parameters cannot be estimated accurately due to insufficient sample information. Therefore, adding sample information is an effective way to solve this problem. However, due to the difficulty of data collection and investigation, it is sometimes not easy to add sample information in practice.


This time we mainly study how the stepwise regression analysis method handles the multicollinearity problem.

The basic idea of stepwise regression analysis is to judge the relative merits of a series of regression equations from three aspects: the correlation coefficient r, the goodness of fit R², and the standard error, so as to obtain the optimal regression equation. The specific procedure has two steps:

Step 1: First regress the explained variable y on each explanatory variable separately (simple regressions):

Carry out statistical tests on each regression equation (the correlation coefficient r, the goodness of fit R², and the standard error), and combine this with economic-theory analysis to select the optimal regression equation, also called the basic regression equation.

Step 2: Introduce the other explanatory variables into the basic regression equation one by one to build a series of regression equations, and examine the influence of each newly added explanatory variable on the regression coefficients through its standard error and the multiple correlation coefficient. The judgement is generally made according to the following criteria (a small code sketch of these checks follows the list):

1. If the newly introduced explanatory variable raises R² while the other regression coefficients remain reasonable both statistically and in terms of economic theory, the new variable is considered beneficial to the regression model and may be retained as an explanatory variable.

2. If the newly introduced explanatory variable does not noticeably improve R² and has little effect on the other regression coefficients, it need not be retained in the regression model.

3. If the newly introduced explanatory variable not only changes R² but also clearly affects the values or signs of the other regression coefficients, it is considered an unfavorable variable: introducing it causes multicollinearity in the regression model. An unfavorable variable is not necessarily redundant. If it may be indispensable for explaining the dependent variable, it cannot simply be discarded; instead one should study how to improve the form of the model, look for a specification that better fits reality, and re-estimate. If testing shows that, of two clearly linearly related explanatory variables, one can be well explained by the other, then the one with the smaller influence on the explained variable can be omitted while the one with the larger influence is retained in the model.
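A minimal sketch of these checks with statsmodels (the data frame `df`, the response column name and the variable lists are placeholders for whatever data set is at hand):

```python
import statsmodels.api as sm

def compare_with_candidate(df, y_col, base_vars, new_var):
    """Fit the basic equation and the equation with one added variable,
    then report the quantities used in criteria 1-3 above."""
    y = df[y_col]
    base = sm.OLS(y, sm.add_constant(df[base_vars])).fit()
    extended = sm.OLS(y, sm.add_constant(df[base_vars + [new_var]])).fit()
    print("R^2:", base.rsquared, "->", extended.rsquared)
    print("t value of", new_var, ":", extended.tvalues[new_var])
    # How much do the existing coefficients (values and signs) move?
    for var in base_vars:
        print(var, ":", base.params[var], "->", extended.params[var])
    return extended
```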

Below we use an example to illustrate the concrete process of applying stepwise regression analysis to a multicollinearity problem.

A concrete example

Example 1. Table 1 gives 10 years of survey data for a certain region on clothing consumption, disposable income, liquid assets, the clothing price index and the general price index. Build a demand-function model.

   Table 1  Survey data on clothing consumption and related variables

 

(1) Suppose the demand function for clothing takes the linear form C = β0 + β1Y + β2L + β3Pc + β4P0 + u, where C is clothing expenditure, Y is disposable income, L is liquid assets, Pc is the clothing price index and P0 is the general price index.

Estimating by ordinary least squares yields the fitted model, whose test statistics are R² = 0.998, D.W. = 3.383 and F = 626.4634.

    Since R² is close to 1, the regression model fits the original data very well. The F statistic leads us to reject the null hypothesis, so we conclude that there is a significant relationship between clothing expenditure and the explanatory variables.
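Since the survey values of Table 1 are not reproduced here, the following sketch only shows how these statistics would be computed with statsmodels; the data frame below is filled with placeholder numbers, not the actual Table 1 data:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Placeholder data standing in for Table 1; only the structure matters here.
rng = np.random.default_rng(42)
Y = rng.uniform(800, 2000, size=10)
data = pd.DataFrame({
    "Y":  Y,
    "L":  0.3 * Y + rng.normal(scale=20, size=10),
    "Pc": np.linspace(100, 130, 10) + rng.normal(scale=1, size=10),
    "P0": np.linspace(100, 125, 10) + rng.normal(scale=1, size=10),
})
data["C"] = 0.1 * data["Y"] - 0.5 * data["Pc"] + 0.6 * data["P0"] + rng.normal(scale=5, size=10)

# Full regression of C on all four explanatory variables.
X = sm.add_constant(data[["Y", "L", "Pc", "P0"]])
fit = sm.OLS(data["C"], X).fit()
print("R^2 =", fit.rsquared, " D.W. =", durbin_watson(fit.resid), " F =", fit.fvalue)
```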

(2) Compute the simple correlation coefficients between the explanatory variables.

The simple correlation coefficients show that the explanatory variables are highly correlated with one another, that is, there is fairly severe multicollinearity.
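As a sketch of how these pairwise correlations (and, as a common complementary diagnostic, the variance inflation factors) can be computed, reusing the placeholder `data` frame built in the previous sketch:

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# `data` is the same placeholder frame built in the previous sketch.
regressors = data[["Y", "L", "Pc", "P0"]]

# Pairwise correlations among the explanatory variables.
print(regressors.corr())

# Variance inflation factors; values well above 10 are usually read as severe collinearity.
exog = sm.add_constant(regressors)
for i, name in enumerate(exog.columns):
    if name != "const":
        print(name, variance_inflation_factor(exog.values, i))
```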

(3) To examine the influence of the multicollinearity, run the following simple regressions of clothing expenditure on each explanatory variable:

The numbers in parentheses below each equation are the t values of the coefficients of the corresponding explanatory variables.

Comparing the four equations, by economic theory and the statistical tests (income has the largest t value, 41.937, and the highest goodness of fit), income Y is the most important explanatory variable, and the regression of clothing expenditure on Y is taken as the optimal simple (basic) regression equation.
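The simple regressions of step (3) can be sketched as a loop over the candidate explanatory variables (again on the placeholder `data` frame, so the printed numbers will not match those quoted above):

```python
import statsmodels.api as sm

# One simple regression of clothing expenditure C on each candidate variable.
for var in ["Y", "L", "Pc", "P0"]:
    fit = sm.OLS(data["C"], sm.add_constant(data[[var]])).fit()
    print(f"{var}: t = {fit.tvalues[var]:.3f}, R^2 = {fit.rsquared:.4f}")
```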

(4) Introduce the remaining variables one by one; the calculation results are shown in Table 2.

Table 2  Estimates of the clothing consumption model


Analysis of the results:

① Introducing the variable Pc into the optimal simple regression equation raises R² from 0.9955 to 0.9957. According to economic-theory analysis, the positive sign on the income coefficient and the negative sign on the Pc coefficient are reasonable. However, the t test for Pc is not significant, even though economic analysis says Pc should be an important factor. Although Y and Pc are highly correlated, this does not affect the significance or the stability of the regression coefficient of income Y. By criterion 1, Pc may be a "beneficial variable" and is retained for the time being.

② Introducing the variable L into the model raises R² from 0.9957 to 0.9959, a slight improvement. On the one hand, although Y and L, and Pc and L, are all highly correlated, introducing L has little effect on the regression coefficients already in the equation (the coefficient of Y changes from 0.1257 to 0.1387 and the coefficient of Pc from -0.0361 to -0.0345, both very small changes); on the other hand, according to economic theory, L and clothing expenditure C should be positively related, i.e. the sign of the coefficient of L should be positive rather than negative. By criterion 2, the explanatory variable L need not be retained in the model.

③ Dropping the variable L and adding the variable P0 raises R² from 0.9957 to 0.9980, a substantial improvement. All three regression coefficients are significant (the absolute values of their t statistics all exceed the critical value), and they are also reasonable in economic terms (clothing expenditure C is positively related to Y and P0 and negatively related to the clothing price Pc). By criterion 1, both Pc and P0 can be regarded as "beneficial variables" and are retained.

④ Finally, the variable L is introduced again. This time R² = 0.9980 does not increase (or hardly increases), and the newly introduced variable has little effect on the parameter estimates of the other three explanatory variables, so L can be confirmed to be a redundant variable. By criterion 2, the explanatory variable L need not be retained in the model (a code sketch of these checks follows).
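As a usage note, the checks in ① and ④ could also be reproduced mechanically with the compare_with_candidate sketch given earlier (again only on the placeholder data, so the numbers will differ from those above):

```python
# Criterion checks for introducing Pc into the basic equation (step ①) ...
compare_with_candidate(data, "C", ["Y"], "Pc")
# ... and for introducing L once Y, Pc and P0 are already in the model (step ④).
compare_with_candidate(data, "C", ["Y", "Pc", "P0"], "L")
```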

We therefore reach the following conclusion: the regression model with Y, Pc and P0 as explanatory variables is the optimal model.

    Through the analysis of the above case, we have seen concretely, from both the theoretical and the practical side, how stepwise regression analysis handles the multicollinearity problem. In fact, general statistical software such as SPSS offers a stepwise-entry option in the regression dialog; ticking it amounts to choosing to build the regression model using the idea of stepwise regression. Using SPSS does not require us to understand the mechanics behind it, but as analysts, knowing and understanding the theory behind the model helps us better understand the model and interpret the meaning behind its conclusions, and thus analyze problems better.


Reference: https://yq.aliyun.com/articles/54540
