Applied regression analysis final exam knowledge points summary

Fill-in-the-blank review
- In unary (simple) linear regression, the expectation of the residual is E(e_i) = 0.
- If a regression analysis suffers from heteroscedasticity, it should be handled with the weighted least squares method.
- In multiple linear regression analysis, |X'X| ≈ 0 leads to multicollinearity, where X is the design matrix.
- The sample data commonly used in regression analysis are divided into time series data and cross-sectional data.
- The two important branches of modern statistics that study statistical relationships are regression analysis and correlation analysis.
- Regression analysis is a mathematical-statistical method for handling correlation relationships between variables.
- Regression models are commonly used for factor analysis, prediction and control of variables.

The heteroscedasticity problem in regression analysis has the following three consequences: (1) the parameter estimates are still unbiased, but they are no longer minimum-variance (best) linear unbiased estimators; (2) the significance tests of the parameters fail; (3) the regression equation performs very poorly in application.

Why is the choice of independent variables such an important step in building a regression model?
Answer: (1) If important variables are omitted, the regression equation will certainly not perform well. (2) If too many independent variables are included, some of them may be unimportant to the problem, some may have poor data quality, and some may overlap heavily with other independent variables; this not only increases the amount of computation but also makes the fitted regression equation very unstable, which harms its application.

What is the difference and connection between regression analysis and correlation analysis?
Connection: both are statistical methods for studying relationships between variables.
Differences:
a. In regression analysis the variable y is the dependent variable and occupies the special position of being explained; in correlation analysis, x and y are on an equal footing, i.e., studying how closely y is related to x is the same as studying how closely x is related to y.
b. In correlation analysis both x and y are random variables; in regression analysis the dependent variable y is random, while the independent variable x may be either a random variable or a non-random deterministic variable.
c. Correlation analysis mainly describes how closely two variables are linearly related; regression analysis can not only reveal the influence of x on y, but also be used for prediction and control through the regression equation.

What is the significance of the random error term ε in a regression model?
Answer: ε is the random error term. It is precisely because the random error term is introduced that the relationship between the variables is written as a stochastic equation, so that probabilistic methods can be used to study the relationship between y and x1, x2, ..., xp. Objective economic phenomena are intricate, and it is difficult to explain an economic phenomenon exactly with a limited number of factors. The random error term collects the various accidental factors that cannot be taken into account because of the limits of our knowledge and other objective causes.
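The items above can be checked numerically. Below is a minimal sketch (my own illustration on simulated data, not part of the original notes) that fits a unary linear regression by OLS with statsmodels and verifies that the residuals of an intercept-containing fit average to zero.

```python
# Minimal sketch (illustrative only, not from the original notes): fit a
# unary linear regression by OLS and check that the residuals average to zero.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=100)   # true line plus random error

X = sm.add_constant(x)            # design matrix with an intercept column
fit = sm.OLS(y, X).fit()          # ordinary least squares

print(fit.params)                 # estimated intercept and slope
print(fit.resid.mean())           # ~0: residuals of an OLS fit with intercept sum to zero
```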
What are the basic assumptions of a linear regression model?
Answer: 1. The explanatory variables x1, x2, ..., xp are non-random, and the observed values xi1, xi2, ..., xip are constants. 2. The Gauss-Markov conditions (zero mean, equal variance, no correlation): E(ε_i) = 0, Var(ε_i) = σ², and Cov(ε_i, ε_j) = 0 for i ≠ j, i, j = 1, 2, ..., n. 3. The normality assumption: ε_i ~ N(0, σ²) and the ε_i are independent of one another. 4. The sample size exceeds the number of explanatory variables, i.e., n > p.

Why should the regression model be tested?
Answer: The purpose of building a regression model is to apply it to the study of economic problems, but it would be imprudent to use the model for prediction, control and analysis immediately. We must therefore test the model to determine whether it really reveals the relationship between the explained variable and the explanatory variables.

Write the matrix representation of the multiple linear regression model and give its basic assumptions.
Answer: The matrix form is y = Xβ + ε. The basic assumptions are: (1) the explanatory variables x1, ..., xp are deterministic, not random, and rank(X) = p + 1 < n is required; (2) the random error term has zero mean, equal variance and no correlation: E(ε_i) = 0, and Cov(ε_i, ε_j) = σ² when i = j and 0 when i ≠ j, i, j = 1, 2, ..., n; (3) the normality assumption holds.

Criteria for the selection of independent variables: 1. the degrees-of-freedom-adjusted coefficient of determination (adjusted R²) is maximized; 2. the AIC and BIC criteria; 3. the C(p) statistic is minimized.

How does multicollinearity affect the estimation of the regression parameters?
Answer: 1. Under perfect collinearity the parameter estimator does not exist; 2. Under near collinearity the OLS estimator is not efficient; 3. The economic meaning of the parameter estimates becomes unreasonable; 4. The significance tests of the variables lose their meaning; 5. The model's predictions fail.

Discuss the relationship between the sample size n and the number of independent variables p. How do they affect the parameter estimates of the model?
Answer: In the multiple linear regression model the sample size n and the number of independent variables p should satisfy n >> p; if n <= p, the parameter estimation of the model is seriously affected. The reasons are: 1. The multiple linear regression model has p + 1 parameters β to estimate, so the sample size must exceed the number of explanatory variables, otherwise the parameters cannot be estimated. 2. The explanatory variable matrix X is deterministic and rank(X) = p + 1 < n is required, which means the columns of the design matrix X are linearly independent, i.e., X has full column rank. If rank(X) < p + 1, the explanatory variables are linearly dependent, X'X is singular so (X'X)^{-1} does not exist, and the estimate of β is unstable or cannot be obtained at all.
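As a quick illustration of the matrix form and the full-rank requirement, the sketch below (illustrative data, not from the notes) computes the least squares estimate from the normal equations; if one column of X were an exact linear combination of the others, X'X would be singular and the solve would fail.

```python
# Sketch of the matrix form y = X beta + eps: the least squares estimate is
# beta_hat = (X'X)^{-1} X'y, which requires X to have full column rank
# (rank(X) = p + 1 < n). Data are simulated for illustration.
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 2
X = np.column_stack([np.ones(n),            # intercept column
                     rng.normal(size=n),    # x1
                     rng.normal(size=n)])   # x2
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Solve the normal equations (more stable than forming the inverse explicitly).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)        # close to beta_true when the basic assumptions hold and n >> p
```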
Heteroscedasticity tests: residual plot analysis (1. with the fitted values on the horizontal axis; 2. with x on the horizontal axis; 3. with time or the observation index on the horizontal axis) and the rank correlation coefficient method (the Spearman test).

Methods for eliminating heteroscedasticity in the unary case: the weighted least squares method (the most commonly used), the Box-Cox transformation, and variance-stabilizing transformations. (A short code sketch of the Spearman check and the weighted least squares remedy follows after the serial correlation discussion below.)

What are the consequences of heteroscedasticity? (1) The parameter estimators are no longer efficient; (2) the significance tests of the variables become meaningless; (3) the regression equation performs very poorly in application. In general, when heteroscedasticity is present, the variability of the OLS parameter estimates increases, so the prediction errors for y become larger, the prediction accuracy drops, and the prediction function fails.

Causes of serial correlation: 1. omitting key variables produces serial autocorrelation; 2. lags in economic variables bring autocorrelation into the series; 3. using the wrong functional form for the regression produces autocorrelation; 4. the cobweb phenomenon causes autocorrelation; 5. data processing can introduce serial correlation into the error terms.

What are the serious consequences of autocorrelation between the error terms?
Answer: When the ordinary least squares method is applied directly to a linear regression model whose random error terms are serially correlated, the following problems arise: (1) the parameter estimators are still unbiased but no longer efficient, because with autocorrelation the variance of the parameter estimates is larger than without it; (2) the mean square error MSE may seriously underestimate the variance of the error term; (3) the significance tests of the variables lose their meaning: these tests rely on a correct estimate of the parameter variance, and when that variance is seriously underestimated the t and F values tend to be inflated, which can lead to the seriously wrong conclusion that the regression parameters and the regression equation are significant when in fact they are not; (4) with serial correlation the estimator is still unbiased, but in any particular sample it may seriously distort the real situation, i.e., least squares becomes very sensitive to sampling fluctuations; (5) prediction and structural analysis based on the model fail.
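Below is a rough sketch (my own illustration on assumed data, not from the notes) of the Spearman rank check for heteroscedasticity and the weighted least squares remedy described above, together with the Durbin-Watson statistic used for first-order serial correlation in the next question.

```python
# Illustrative sketch: Spearman rank check for heteroscedasticity, a weighted
# least squares refit, and the Durbin-Watson statistic for serial correlation.
# The data and the weight choice 1/x^2 are assumptions made for this example.
import numpy as np
import statsmodels.api as sm
from scipy.stats import spearmanr
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5 * x)   # error variance grows with x

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()

# Spearman test: rank-correlate |residuals| with the explanatory variable;
# a significant correlation suggests heteroscedasticity.
rho, pval = spearmanr(np.abs(ols_fit.resid), x)
print(f"Spearman rho = {rho:.3f}, p-value = {pval:.4f}")

# Weighted least squares, assuming Var(eps_i) is proportional to x_i^2,
# so the weights are 1/x_i^2.
wls_fit = sm.WLS(y, X, weights=1.0 / x**2).fit()
print(wls_fit.params)

# Durbin-Watson statistic: values near 2 indicate no first-order
# autocorrelation; values near 0 or 4 indicate positive or negative
# autocorrelation respectively.
print(durbin_watson(ols_fit.resid))
```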
Summarize the advantages and disadvantages of the DW test.
Answer: Advantages: (1) it is widely used, and the DW value can be computed by common statistical software; (2) it is suitable for small samples; (3) it can test serial correlation of the first-order autoregressive form in the random disturbance term. Disadvantages: (1) the DW test has two inconclusive regions; once the DW value falls into such a region no judgment can be made, and one can only increase the sample size or use another method; (2) the tables of upper and lower bounds for the DW statistic require n > 15, because with a very small sample it is hard to draw a meaningful conclusion about autocorrelation from the residuals; (3) the DW test is not suitable for testing random terms with higher-order serial correlation.

5.6 What are the advantages and disadvantages of the forward method and the backward method?
Answer: The advantage of the forward method is that the independent variables that affect the dependent variable can be brought in one by one in order of significance, and the amount of computation is small. Its disadvantage is that it cannot reflect the changes that occur after new variables are introduced, and a variable that has been selected cannot be removed even if it later becomes non-significant. The advantage of the backward method is that independent variables with no significant influence on the dependent variable can be eliminated one by one in order of insignificance, so the remaining independent variables are all significant. Its disadvantage is that it starts with a large amount of computation, and once an independent variable has been removed it has no chance to enter the model again. If the independent variables are correlated with each other, the regression equations produced by the forward method and the backward method will both have problems to different degrees.

5.7 Discuss the idea behind the stepwise regression method.
Answer: The basic idea of stepwise regression is that variables can both enter and leave. Concretely, variables are introduced one at a time, and every time a new variable is introduced, the variables already selected are re-tested one by one and any that have become non-significant are removed. Introducing one variable into, or removing one variable from, the regression equation constitutes one step of stepwise regression, and an F test is carried out at each step, so that the regression equation contains only significant variables before every new variable is introduced. This process is repeated until no significant variable can be introduced into the regression equation and no non-significant variable can be removed from it. In this way the defects of the forward and backward methods are avoided, and the final regression subset is guaranteed to be the optimal one.

List the diagnostic methods for multicollinearity and describe them briefly.
Variance inflation factor method: when the variance inflation factor VIF_j of an independent variable x_j is large (a common rule of thumb is VIF_j >= 10), there is serious multicollinearity between x_j and the other independent variables.
Eigenvalue (characteristic root) method: 1. eigenvalue analysis; 2. the condition number: when the condition number is large (cutoffs of about 100 and 1000 are commonly used), the model has strong multicollinearity.
Intuitive judgment method: judgments are made directly from the model output, for example when the sign of a regression coefficient contradicts its economic meaning, when the correlation coefficients between independent variables are large, or when some important independent variables fail the significance test.
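A small sketch of the two numerical diagnostics just listed (simulated data; the VIF >= 10 and condition number cutoffs are the usual rules of thumb, not values taken from the notes):

```python
# Illustrative multicollinearity diagnostics: variance inflation factors and
# the condition number of the standardized predictors. x2 is built to be
# almost collinear with x1 so that both diagnostics flag a problem.
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)        # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2, x3])   # design matrix with intercept

# VIF_j = 1 / (1 - R_j^2); a common rule of thumb flags VIF_j >= 10.
for j in range(1, X.shape[1]):
    print(f"VIF for x{j}: {variance_inflation_factor(X, j):.1f}")

# Condition number = largest eigenvalue / smallest eigenvalue of Z'Z, where Z
# holds the centered and standardized predictors (intercept excluded).
Z = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0)
eigvals = np.linalg.eigvalsh(Z.T @ Z)
print("condition number:", eigvals.max() / eigvals.min())
```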
Is multicollinearity related to the sample size n and the number of independent variables p?
Answer: Yes. Increasing the sample size cannot eliminate the multicollinearity in the model, but it can mitigate its consequences to some extent. When the number of independent variables p is large, multicollinearity generally arises more easily, so the independent variables should be kept few and well chosen.

Methods for eliminating multicollinearity: 1. remove unimportant or redundant explanatory variables; 2. increase the sample size; 3. use biased estimation of the regression coefficients (for example, ridge regression).

What are the methods for choosing the ridge parameter k?
Answer: The optimal k depends on the unknown parameters β and σ². Several common selection methods are: (1) the ridge trace method: choose k at the point where every ridge estimate is basically stable, the signs of the ridge estimates are reasonable, no regression coefficient has an absolute value that contradicts its economic meaning, and the residual sum of squares has not increased too much; (2) the variance inflation factor method: c(k) = (X'X + kI)^{-1} X'X (X'X + kI)^{-1}, whose diagonal elements c_jj(k) are the variance inflation factors of the ridge estimates; choose k so that c_jj(k) <= 10; (3) the residual sum of squares method: take the largest k that satisfies SSE(k) < c·SSE.

What basic principles should be followed when selecting independent variables with the ridge regression method?
The usual principles are: 1. In ridge regression calculations it is usually assumed that the design matrix has been centered and standardized, so the standardized ridge regression coefficients can be compared directly in magnitude; independent variables whose standardized ridge regression coefficients are relatively stable but small in absolute value can be eliminated. 2. When k is small, some standardized ridge regression coefficients are not small in absolute value but are unstable and shrink rapidly toward zero as k increases; independent variables with such unstable, vanishing ridge regression coefficients can also be removed. 3. Remove independent variables whose standardized ridge regression coefficients are very unstable. If several ridge regression coefficients are unstable, how many to remove and which ones to remove should be decided according to the effect of re-running the ridge regression after removing each candidate variable.

General principles for choosing the value of k by the ridge trace method: 1. the ridge estimates of the regression coefficients are basically stable; 2. regression coefficients whose signs were unreasonable under least squares estimation become reasonable under ridge estimation; 3. no regression coefficient has an absolute value that contradicts its economic meaning; 4. the residual sum of squares does not increase too much.
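A minimal ridge trace sketch under assumed data (not from the notes): the coefficients are recomputed over a grid of k values on standardized predictors, and k is then chosen where the trace stabilizes, following the principles above.

```python
# Illustrative ridge trace: refit ridge regression for several values of k on
# centered and standardized data and watch where the coefficients stabilize.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # collinear pair, to make ridge useful
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 3.0 * x1 - 1.0 * x2 + 0.5 * x3 + rng.normal(size=n)

Xs = StandardScaler().fit_transform(X)    # center and standardize, as the notes assume
yc = y - y.mean()                         # center the response

for k in [0.01, 0.1, 1.0, 10.0, 100.0]:
    coefs = Ridge(alpha=k, fit_intercept=False).fit(Xs, yc).coef_
    print(f"k = {k:6.2f}  standardized coefficients: {np.round(coefs, 3)}")
# Pick k where the trace flattens, the signs are sensible, and the residual
# sum of squares has not grown too much.
```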


Origin blog.csdn.net/qq_56437391/article/details/125404642