Statistics (Jia Junping), Thinking Questions: Chapter 12 Multiple Linear Regression

1. Explain the meaning of multiple regression models, multiple regression equations, and estimated multiple regression equations.

(1) Multiple regression model: let the dependent variable be $y$ and the $k$ independent variables be $x_1, x_2, \ldots, x_k$. The equation describing how $y$ depends on the independent variables $x_1, x_2, \ldots, x_k$ and the error term $\varepsilon$ is called a multiple regression model. Its general form can be expressed as:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$

(2) Multiple regression equation: under the assumptions of the regression model, $E(\varepsilon) = 0$, so taking the expectation of the model removes the error term and gives

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$

This is called the multiple regression equation; it describes the relationship between the expected value of the dependent variable $y$ and the independent variables $x_1, x_2, \ldots, x_k$.

(3) Estimated multiple regression equation: the parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ in the regression equation are unknown and must be estimated from sample data. Using the sample statistics $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_k$ to estimate the unknown parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ yields the estimated multiple regression equation, whose general form is:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_k x_k$$

In this equation, $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_k$ are the estimates of the parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$, and $\hat{y}$ is the estimated value of the dependent variable $y$. The coefficients $\beta_1, \beta_2, \ldots, \beta_k$ are called partial regression coefficients.
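The following is a minimal sketch of fitting such an equation by ordinary least squares with statsmodels; the data are synthetic and the coefficient values are assumptions chosen purely for illustration.

```python
# Fit an estimated multiple regression equation y_hat = b0 + b1*x1 + ... + bk*xk
# on synthetic data (all numbers below are illustrative assumptions).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, k = 100, 3
X = rng.normal(size=(n, k))                       # independent variables x1, x2, x3
beta = np.array([2.0, 0.5, -1.0, 3.0])            # "true" beta_0 .. beta_3 (assumed)
y = beta[0] + X @ beta[1:] + rng.normal(size=n)   # y = b0 + b1*x1 + ... + error term

model = sm.OLS(y, sm.add_constant(X)).fit()       # add_constant supplies the intercept column
print(model.params)                               # beta_hat_0, ..., beta_hat_3
```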

2. What are the basic assumptions in the multiple linear regression model?

The basic assumptions of the multiple regression model are:

(1) The independent variables $x_1, x_2, \ldots, x_k$ are non-random, fixed, and uncorrelated with one another (no multicollinearity), and the sample size must exceed the number of regression coefficients to be estimated, i.e., $n > k$;
(2) The error term $\varepsilon$ is a random variable with expected value 0, i.e., $E(\varepsilon) = 0$;
(3) For all values of the independent variables $x_1, x_2, \ldots, x_k$, the variance of $\varepsilon$ is the same $\sigma^2$, and there is no serial correlation, i.e., $D(\varepsilon_i) = \sigma^2$ and $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$;
(4) The error term $\varepsilon$ is a normally distributed random variable, and the error terms are independent of one another, i.e., $\varepsilon \sim N(0, \sigma^2)$.
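In practice, the assumptions about the error term are checked through the residuals of the fitted model. Below is a rough sketch of such checks on synthetic data; the diagnostics shown (mean of residuals, Durbin-Watson statistic, Shapiro-Wilk test) are common choices, not the only ones.

```python
# Diagnostic checks of the error-term assumptions via residuals (synthetic data).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 + X @ np.array([0.5, -1.0, 3.0]) + rng.normal(size=100)

resid = sm.OLS(y, sm.add_constant(X)).fit().resid

print(resid.mean())           # assumption (2): should be close to 0
print(durbin_watson(resid))   # assumption (3): values near 2 suggest no serial correlation
print(stats.shapiro(resid))   # assumption (4): tests normality of the residuals
```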

3. Explain the meaning and function of multiple determination coefficient and adjusted multiple determination coefficient.

(1) The multiple determination coefficient is the ratio of the regression sum of squares to the total sum of squares in multiple regression. It is a statistic that measures how well the multiple regression equation fits the data, reflecting the proportion of the total variation in the dependent variable $y$ that is explained by the estimated regression equation. Its formula is:

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

(2) The adjusted multiple determination coefficient takes into account both the sample size ($n$) and the number of independent variables in the model ($k$). This makes the value of $R_a^2$ always less than $R^2$, and $R_a^2$ does not get closer to 1 merely because more independent variables are added to the model. Its formula is:

$$R_a^2 = 1 - (1 - R^2) \times \frac{n-1}{n-k-1}$$
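Both formulas can be verified directly against the attributes statsmodels computes; the sketch below reuses the same kind of synthetic data as before, purely for illustration.

```python
# Compute R^2 and adjusted R^2 from the formulas and compare with statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, k = 100, 3
X = rng.normal(size=(n, k))
y = 2.0 + X @ np.array([0.5, -1.0, 3.0]) + rng.normal(size=n)
model = sm.OLS(y, sm.add_constant(X)).fit()

sse = np.sum(model.resid ** 2)          # error sum of squares
sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
r2 = 1 - sse / sst
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(r2, model.rsquared)               # the two values should agree
print(r2_adj, model.rsquared_adj)       # likewise for the adjusted version
```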

4. Explain the meaning of multicollinearity.

When two or more independent variables in a regression model are correlated with each other, it is said that there is multicollinearity in the regression model.

5. What impact does multicollinearity have on regression analysis?

When there is multicollinearity in regression analysis, the following problems will arise:

(1) When the variables are highly correlated, the regression results can be confounded and the analysis may even be led astray;
(2) Multicollinearity may affect the signs of the parameter estimates; in particular, the sign of $\hat{\beta}_i$ may be the opposite of the expected sign.
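The sign problem in (2) is easy to reproduce. In the sketch below, $x_2$ is constructed to be nearly identical to $x_1$ while $y$ truly depends only on $x_1$ with a positive coefficient; because of the near-collinearity, the fitted coefficients split unstably between the two variables, and the one on $x_2$ can come out negative (the exact values depend on the random draw).

```python
# Demonstration (synthetic data) of unstable coefficient signs under
# near-perfect multicollinearity.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # nearly collinear with x1
y = 1.0 + 2.0 * x1 + rng.normal(size=200)    # the true model uses only x1

exog = sm.add_constant(np.column_stack([x1, x2]))
print(sm.OLS(y, exog).fit().params)          # the x1/x2 coefficients split unstably;
                                             # one sign may be opposite to expectation
```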

6. What are the main methods for identifying multicollinearity?

There are many ways to detect multicollinearity. The simplest is to calculate the correlation coefficient between each pair of independent variables in the model and perform a significance test on each correlation coefficient. If one or more of the correlation coefficients is significant, some of the independent variables in the model are correlated, that is, multicollinearity is present.

Specifically, if the following conditions appear, it implies the existence of multicollinearity:
(1) There is a significant correlation between pairs of independent variables in the model;
(2) The linear relationship test ($F$ test) for the model is significant, yet the $t$ tests for almost all of the regression coefficients $\beta_i$ are not significant;
(3) The signs of the regression coefficients are opposite to those expected;
(4) Tolerance and the variance inflation factor (VIF). The tolerance of an independent variable equals 1 minus the coefficient of determination of the regression in which that variable serves as the dependent variable and the other $k-1$ independent variables serve as predictors, i.e., $1 - R_i^2$. The smaller the tolerance, the more serious the multicollinearity; it is generally held that a tolerance below 0.1 indicates severe multicollinearity. The variance inflation factor equals the reciprocal of the tolerance, i.e., $VIF = 1/(1 - R_i^2)$. Clearly, the larger the $VIF$, the more serious the multicollinearity; a $VIF$ greater than 10 is generally taken to indicate severe multicollinearity.
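The pairwise-correlation check in (1) and the tolerance/VIF check in (4) are straightforward to run. The sketch below uses statsmodels' variance_inflation_factor on synthetic data in which $x_2$ is deliberately built to be correlated with $x_1$; the variable names and coefficients are assumptions for illustration.

```python
# Identify multicollinearity via pairwise correlations and tolerance/VIF.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=100)   # deliberately correlated with x1
x3 = rng.normal(size=100)
X_df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

print(X_df.corr())                                # check (1): pairwise correlations

exog = sm.add_constant(X_df.values)               # VIF needs the constant included
for i, name in enumerate(X_df.columns, start=1):  # index 0 is the constant; skip it
    vif = variance_inflation_factor(exog, i)
    print(f"{name}: VIF = {vif:.2f}, tolerance = {1 / vif:.2f}")
```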

7. What are the methods of dealing with multicollinearity?

There are several ways to deal with multicollinearity:

(1) Remove one or more of the correlated independent variables from the model, so that the remaining independent variables are as uncorrelated as possible.
(2) If all the independent variables are to be kept in the model, then one should: ① avoid testing the individual parameters $\beta$ with the $t$ statistic; ② restrict inference (estimation or prediction) about the dependent variable $y$ to the range of the sample values of the independent variables.

8. In multiple linear regression, what are the methods for selecting independent variables?

In multiple linear regression, the methods of variable selection mainly include: forward selection, backward elimination, stepwise regression, optimal subset, etc.

(1) Forward selection starts with no independent variables in the model and adds them one at a time, stopping when adding a further independent variable no longer yields a significant increase in the $F$ statistic (a sketch of this procedure follows the list);
(2) Backward elimination works in the opposite direction: it starts with all the independent variables in the model and removes them one at a time, stopping when removing any remaining variable would significantly increase SSE; at that point all the variables left in the model are significant;
(3) Stepwise regression combines forward selection and backward elimination. Its first two steps are the same as forward selection; it then keeps adding variables to the model while also reconsidering whether variables added earlier should be removed, stopping when adding a variable no longer leads to a significant reduction in SSE.
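The sketch below implements a simplified forward selection by hand; it is illustrative only, not a library routine. Since exactly one variable is added per step, the $t$ test on the new coefficient is equivalent to the partial $F$ test ($F = t^2$), so the p-value of the added variable is used as the entry criterion; the variable names and significance level are assumptions.

```python
# Simplified forward selection (illustrative sketch, synthetic data).
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_select(y, X_df, alpha=0.05):
    selected, remaining = [], list(X_df.columns)
    while remaining:
        best_p, best_var = 1.0, None
        for var in remaining:                 # try each candidate variable in turn
            fit = sm.OLS(y, sm.add_constant(X_df[selected + [var]])).fit()
            p = fit.pvalues[var]              # t-test p-value of the newly added variable
            if p < best_p:
                best_p, best_var = p, var
        if best_var is None or best_p >= alpha:
            break                             # no remaining candidate enters significantly
        selected.append(best_var)
        remaining.remove(best_var)
    return selected

rng = np.random.default_rng(0)
X_df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["x1", "x2", "x3", "x4"])
y = 2.0 + 3.0 * X_df["x1"] - 1.5 * X_df["x3"] + rng.normal(size=100)
print(forward_select(y, X_df))                # expected to select x1 and x3
```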

Origin blog.csdn.net/J__aries/article/details/131317633