Mathematical Modeling: Multiple Linear Regression Analysis (plus Lasso Regression in Stata)

Part I: Introduction to Regression Analysis

Definition: Regression analysis is one of the most basic and important tools in data analysis, and most data analysis problems can be approached with regression ideas. Its task is to explain how Y is formed by studying the correlation between the independent variable X and the dependent variable Y, and then to predict Y from X.

There are five common types of regression analysis: linear regression, 0-1 regression, ordinal regression, count regression, and survival regression. They are distinguished by the type of the dependent variable Y. This lecture focuses on linear regression.

The idea of regression:

The first keyword: correlation

Correlation != causation: we cannot conclude that two variables are causally related just because they are correlated.

The second keyword: Y

The third keyword: X

Example of 0-1 regression (a 0-1 regression question has only two possible answers, so Y takes only two values)

The mission of regression analysis:

Part II: Processing methods for different data types

 

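Classification of data:

  1. Cross-sectional data: different subjects observed at the same point in time

  2. Time series data: the same subject observed at successive points in time

  3. Panel data: several subjects each observed over time, combining the two types above

Processing methods for different data types: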

Part III: Understanding Linear Regression and Investigating Endogeneity

Univariate linear regression:

The disturbance term is estimated by the residual: $\hat{\mu}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$
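For reference, the fitted coefficients in the residual formula come from ordinary least squares, which chooses the intercept and slope that minimize the sum of squared residuals. The standard statement is written out below; $\bar{x}$ and $\bar{y}$ (the sample means) and $n$ (the sample size) are symbols introduced here for the sketch.

```latex
(\hat{\beta}_0, \hat{\beta}_1)
  = \arg\min_{\beta_0,\,\beta_1} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
                     {\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```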

Understanding linearity:

Interpretation of regression coefficients:

Exploring endogeneity:

If the disturbance term is uncorrelated with every independent variable, the model is exogenous. If it is correlated with any of them, the model suffers from endogeneity, and the OLS coefficient estimates are then biased and inconsistent. We therefore need to check whether any independent variable is correlated with the disturbance term.

Monte Carlo simulation of endogeneity:

Matlab practice:
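The same kind of experiment can be sketched in Stata, the language used for the rest of this article. The data-generating process, sample size, number of replications, and the program name `ovb_sim` are all illustrative assumptions, not the original Matlab code. The idea is to leave out a variable that is correlated with x1, so that it ends up in the disturbance term, and see what happens to the estimated coefficient on x1.

```stata
// Monte Carlo sketch of endogeneity caused by an omitted variable
// (all numbers and names here are illustrative assumptions)
clear all
set seed 2023

program define ovb_sim, rclass
    drop _all
    set obs 500
    generate x1 = rnormal()
    generate x2 = 0.5*x1 + rnormal()            // x2 is correlated with x1
    generate y  = 0.5 + 2*x1 + 5*x2 + rnormal() // true coefficient on x1 is 2
    regress y x1                                // x2 is pushed into the disturbance term
    return scalar b_short = _b[x1]
    regress y x1 x2                             // correctly specified model
    return scalar b_long = _b[x1]
end

// Repeat the experiment 200 times and compare the two sets of estimates
simulate b_short = r(b_short) b_long = r(b_long), reps(200) nodots: ovb_sim
summarize b_short b_long
```

Under these assumptions the omitted-variable formula gives 2 + 5 × 0.5 = 4.5 as the value the short-regression estimates should centre on, while the estimates from the full regression should centre on the true value 2.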

     

Core explanatory variables and control variables:

The factors we care about are included in the model as variables; all remaining factors can be treated as part of the disturbance term.

Part IV: Interpreting the Four Model Forms, Setting Dummy Variables, and Interpreting Interaction Terms

Interpretation of regression coefficients:

When should we take the logarithm?

Interpretation of the regression coefficients in the four models:
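As a reference sketch, the four model forms are usually understood as the four combinations of taking or not taking logarithms of y and x. The labels (level-level and so on) and the symbols below are conventional textbook notation assumed here, and the percentage readings are approximations that hold for small changes.

```latex
\begin{array}{lll}
\text{Form} & \text{Model} & \text{Reading of } \beta_1 \\
\hline
\text{level-level} & y = \beta_0 + \beta_1 x + \mu
  & x \text{ up by 1 unit} \Rightarrow y \text{ changes by } \beta_1 \text{ units} \\
\text{log-log} & \ln y = \beta_0 + \beta_1 \ln x + \mu
  & x \text{ up by } 1\% \Rightarrow y \text{ changes by about } \beta_1\% \\
\text{log-level} & \ln y = \beta_0 + \beta_1 x + \mu
  & x \text{ up by 1 unit} \Rightarrow y \text{ changes by about } 100\,\beta_1\% \\
\text{level-log} & y = \beta_0 + \beta_1 \ln x + \mu
  & x \text{ up by } 1\% \Rightarrow y \text{ changes by about } \beta_1 / 100 \text{ units}
\end{array}
```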

     

Special independent variables: dummy variables

Qualitative variables can be encoded as numbers, for example 1 for women and 0 for men.

Dummy variables for multiple categories:

To avoid perfect multicollinearity, the number of dummy variables introduced is generally the number of categories minus 1.
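The "categories minus 1" rule can be sketched as follows. The example assumes Stata's bundled `auto` dataset rather than the data used in this article, and the prefix `R` is an arbitrary choice.

```stata
// Illustration only: auto is Stata's bundled example dataset
sysuse auto, clear

// rep78 has five categories (1-5); tabulate creates the dummies R1-R5
// (cars with a missing rep78 get missing dummies and drop out of the regression)
tabulate rep78, generate(R)

// Include only four of the five dummies; R1 is the omitted baseline category
regress price mpg R2 R3 R4 R5

// Equivalent factor-variable notation: Stata omits the baseline automatically
regress price mpg i.rep78
```

Each dummy coefficient is then read relative to the omitted baseline category.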

Another special independent variable is the interaction term, the product of two independent variables.
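A minimal sketch of an interaction term, again assuming Stata's bundled `auto` dataset and a hypothetical variable name `mpg_foreign`:

```stata
// Illustration only: auto is Stata's bundled example dataset
sysuse auto, clear

// Build the interaction term by hand as the product of two regressors
generate mpg_foreign = mpg*foreign
regress price mpg foreign mpg_foreign

// The same model written with factor-variable operators
regress price c.mpg##i.foreign
```

With the interaction included, the effect of mpg on price is allowed to differ between domestic and foreign cars; the interaction coefficient measures that difference.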

Part V: Case Study

Introduction to Stata software:

File import:

     

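What some Stata commands do:

    // Press PageUp on the keyboard to recall the previously entered command (in Matlab, use the up arrow)
    // Clear all variables
    clear
    // Clear the screen, similar to clc in Matlab
    cls
    // Import the data (we actually pasted it in through the interface; importing by clicking through the menus is more convenient. Delete this note before copying the code into your paper; if the judges see it, they will know you did not write it yourself.)
    // import excel "C:\Users\hc_lzp\Desktop\数学建模视频录制\第7讲.多元回归分析\代码和例题数据\课堂中讲解的奶粉数据.xlsx", sheet("Sheet1") firstrow
    import excel "课堂中讲解的奶粉数据.xlsx", sheet("Sheet1") firstrow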
    
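    // Descriptive statistics for the quantitative variables
    // (团购价元 = group-buy price in yuan, 评价量 = number of reviews, 商品毛重kg = gross weight in kg)
    summarize 团购价元 评价量 商品毛重kg

    // Frequency tables for the qualitative variables; gen() also creates dummy variables whose names start with the given letter
    // (配方 = formula, 奶源产地 = milk source origin, 国产或进口 = domestic or imported, 适用年龄岁 = applicable age in years,
    //  包装单位 = packaging unit, 分类 = category, 段位 = stage)
    tabulate 配方, gen(A)
    tabulate 奶源产地, gen(B)
    tabulate 国产或进口, gen(C)
    tabulate 适用年龄岁, gen(D)
    tabulate 包装单位, gen(E)
    tabulate 分类, gen(F)
    tabulate 段位, gen(G)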
    
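    // Run the regression
    regress 评价量 团购价元 商品毛重kg

    // The following statements save the regression results to a Word document
    // Before first use, run the line below to install the package (comment it out after running it once)
    // ssc install reg2docx, all replace
    // If installation fails with a "connection timed out" error, try connecting through a mobile hotspot. If that also fails, skip this command and build the results table by hand, or simply take a screenshot of the regression output.
    est store m1
    reg2docx m1 using m1.docx, replace
    // *** p<0.01  ** p<0.05 * p<0.1

    // Stata automatically drops perfectly collinear variables
    regress 评价量 团购价元 商品毛重kg A1 A2 A3 B1 B2 B3 B4 B5 B6 B7 B8 B9 C1 C2 D1 D2 D3 D4 D5 E1 E2 E3 E4 F1 F2 G1 G2 G3 G4
    est store m2
    reg2docx m2 using m2.docx, replace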
    
    
    
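    // Get the standardized regression coefficients (b abbreviates the beta option)
    regress 评价量 团购价元 商品毛重kg, b

    // Draw the residual plots
    regress 评价量 团购价元 商品毛重kg A1 A2 A3 B1 B2 B3 B4 B5 B6 B7 B8 B9 C1 C2 D1 D2 D3 D4 D5 E1 E2 E3 E4 F1 F2 G1 G2 G3 G4
    // Scatter plot of residuals against fitted values
    rvfplot
    graph export a1.png, replace
    // Scatter plot of residuals against the independent variable 团购价元
    rvpplot 团购价元
    graph export a2.png, replace

    // Why do negative fitted values of 评价量 appear?
    // Detailed descriptive statistics, including the values at each percentile
    summarize 评价量, d

    // Kernel density estimate of 评价量
    kdensity 评价量
    graph export a3.png, replace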
    
    
    
    // Breusch-Pagan (BP) test for heteroscedasticity
    estat hettest, rhs iid

    // White test for heteroscedasticity
    estat imtest, white

    // OLS with robust standard errors
    regress 评价量 团购价元 商品毛重kg A1 A2 A3 B1 B2 B3 B4 B5 B6 B7 B8 B9 C1 C2 D1 D2 D3 D4 D5 E1 E2 E3 E4 F1 F2 G1 G2 G3 G4, r
    est store m3
    reg2docx m3 using m3.docx, replace

    // Compute the VIFs
    estat vif
    
    
    
    // Stepwise regression (watch out for perfect multicollinearity)
    // Forward stepwise regression (the trailing r requests robust standard errors)
    stepwise reg 评价量 团购价元 商品毛重kg A1 A3 B1 B2 B3 B4 B5 B6 B7 B9 C1 D1 D2 D3 D4 E1 E2 E3 F1 G1 G2 G3,  r pe(0.05)
    // Backward stepwise regression (the trailing r requests robust standard errors)
    stepwise reg 评价量 团购价元 商品毛重kg A1 A3 B1 B2 B3 B4 B5 B6 B7 B9 C1 D1 D2 D3 D4 E1 E2 E3 F1 G1 G2 G3,  r pr(0.05)
    // Backward stepwise regression with standardized coefficients (just add b after r)
    stepwise reg 评价量 团购价元 商品毛重kg A1 A3 B1 B2 B3 B4 B5 B6 B7 B9 C1 D1 D2 D3 D4 E1 E2 E3 F1 G1 G2 G3,  r b pr(0.05)
    
    
    
    
    
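    // Additional syntax (you do not need to study Stata in depth; the commands taught in class are enough for mathematical modeling competitions)
    // In fact, learning Excel well is enough to handle about 90% of data preprocessing problems
    // (1) Generate new variables from existing ones
    generate lny = log(评价量)
    generate price_square = 团购价元^2
    generate interaction_term = 团购价元*商品毛重kg

    // (2) Rename a variable, since Chinese variable names can occasionally trigger unexpected bugs
    rename 团购价元 price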

Explanation of each indicator in the case:

The regression statement in Stata:

In the output table, Model corresponds to SSR, Residual corresponds to SSE, and Total corresponds to SST.

The entries in the df (degrees of freedom) column are k, n-k-1, and n-1, respectively.

Look at Prob > F: if it is below 0.1 (a 90% confidence level), the regression passes the joint significance test.
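For reference, the quantities in the regression table are linked by the standard identities below, written with this article's convention that SSR is the Model (regression) sum of squares and SSE is the Residual sum of squares. Here k is the number of independent variables and n the number of observations; $R^2_{adj}$ is notation introduced for the sketch.

```latex
SST = SSR + SSE

R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST},
\qquad
R^2_{adj} = 1 - \frac{SSE / (n - k - 1)}{SST / (n - 1)}

F = \frac{SSR / k}{SSE / (n - k - 1)} \sim F(k,\; n - k - 1)
\quad \text{under } H_0 : \beta_1 = \beta_2 = \dots = \beta_k = 0
```

The Prob value reported next to F is the p-value of this joint test.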

What to do if the goodness of fit is low:

Reasons for negative fitted values:

Standardized regression coefficients:

Stata command for standardized regression coefficients:
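A minimal sketch, assuming Stata's bundled `auto` dataset instead of the article's data. The `beta` option (abbreviated `b` in the listing above) adds a column of standardized coefficients. The second half checks the result by hand: standardizing every variable and running ordinary OLS should give the same slopes. The `z_` variable names are hypothetical.

```stata
// Illustration only: auto is Stata's bundled example dataset
sysuse auto, clear

// Standardized coefficients appear in the "Beta" column
regress price mpg weight, beta

// Cross-check: standardize each variable (mean 0, sd 1), then run plain OLS
egen z_price  = std(price)
egen z_mpg    = std(mpg)
egen z_weight = std(weight)
regress z_price z_mpg z_weight
```

Because standardized coefficients are unit-free, their absolute values can be compared to judge which independent variable matters more for the dependent variable.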

     

Part VI: Heteroscedasticity, Multicollinearity, and Interpretation of Interaction Terms

Conditions the disturbance term must satisfy:

Heteroscedasticity and how to fix it:

Testing for heteroscedasticity:

Reasons for negative fitted values:

The fitted values are unevenly distributed, R^2 is too small, and negative fitted values appear.

Hypothesis testing for heteroscedasticity:

Result of BP test:

White's test:

How to deal with heteroscedasticity:

OLS + Robust Standard Errors in Stata
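The whole heteroscedasticity workflow (plot, test, fix) can be sketched in a few lines. The example assumes Stata's bundled `auto` dataset in place of the article's data; the commands are the same ones used in the case study listing.

```stata
// Illustration only: auto is Stata's bundled example dataset
sysuse auto, clear
regress price mpg weight length

// Visual check: residuals against fitted values
rvfplot, yline(0)

// Breusch-Pagan test; H0: the disturbance term is homoscedastic
estat hettest, rhs iid

// White's test; same null hypothesis
estat imtest, white

// If H0 is rejected: keep the OLS coefficients but report robust standard errors
regress price mpg weight length, vce(robust)
```

A small p-value in either test rejects homoscedasticity. The robust option leaves the coefficient estimates unchanged and only alters the standard errors, t statistics, and p-values.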

Multicollinearity:

Testing for multicollinearity:
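The usual diagnostic is the variance inflation factor: for the m-th independent variable, VIF_m = 1 / (1 - R_m^2), where R_m^2 is the goodness of fit from regressing that variable on all the other independent variables. A minimal sketch, assuming Stata's bundled `auto` dataset:

```stata
// Illustration only: auto is Stata's bundled example dataset
sysuse auto, clear
regress price mpg weight length displacement

// Report the VIF of every regressor plus the mean VIF
estat vif
```

A commonly quoted rule of thumb treats a VIF above 10 as a sign of serious multicollinearity; this is a convention rather than a formal test.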

Handling multicollinearity:

Stepwise regression analysis:

Implementation of stepwise regression analysis in Stata:
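A minimal sketch of forward and backward selection, assuming Stata's bundled `auto` dataset. It uses the prefix form of the `stepwise` command; the case study listing above uses the older `stepwise reg ..., pe()` form.

```stata
// Illustration only: auto is Stata's bundled example dataset
sysuse auto, clear

// Forward selection: start from the empty model and add a variable
// whenever its p-value is below 0.05
stepwise, pe(0.05): regress price mpg weight length displacement headroom trunk foreign

// Backward elimination: start from the full model and remove a variable
// whenever its p-value is 0.05 or above
stepwise, pr(0.05): regress price mpg weight length displacement headroom trunk foreign
```

The two directions need not end at the same model, so it is worth running both and comparing the selected variables.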

 

 

 

Perfect multicollinearity error:

Update:

Lasso regression

Some independent variables in the data can introduce collinearity into the model, so lasso regression is used to remove the unimportant independent variables.
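As a reference sketch, lasso adds an L1 penalty on the coefficients to the least-squares objective. The notation below is assumed for illustration: n observations, k independent variables, tuning parameter $\lambda \ge 0$. Software packages may scale the two terms differently.

```latex
\hat{\beta}^{lasso}
  = \arg\min_{\beta}
    \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij} \Bigr)^2
    + \lambda \sum_{j=1}^{k} \lvert \beta_j \rvert
```

With $\lambda = 0$ this reduces to OLS. As $\lambda$ grows, more coefficients are shrunk exactly to zero, which is how lasso drops variables.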

 

We implement lasso regression in Stata:

Let us take the estimation of cotton yield as an example.

 

Independent variables measured in different units need to be standardized first.

In Stata, a variable is standardized with: egen new_name = std(variable_to_standardize). (The variables in this case share the same units; this is only an illustration of how to standardize.)

 

How do we run lasso regression in Stata?
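The article does not name the command it uses. The sketch below assumes `cvlasso` from the community-contributed `lassopack` package, which chooses the tuning parameter by cross-validation. The dataset (Stata's bundled `auto`), the variable list, and the seed value are illustrative assumptions, not the cotton example.

```stata
// One-time install of the community-contributed package (assumed here)
// ssc install lassopack

// Illustration only: auto is Stata's bundled example dataset
sysuse auto, clear

// Cross-validated lasso:
//   lopt   re-estimates the model at the lambda that minimizes the MSPE
//   seed() fixes the random split into folds so the run is reproducible
cvlasso price mpg weight length displacement headroom trunk turn, lopt seed(520)
```

Changing the seed changes the random folds, so the chosen lambda and the set of selected variables can change as well.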

 

Stata then produces two pieces of output: a data table and a results table.

Data table analysis:

The row marked with * in the data table is the one where the MSPE reaches its minimum; the λ in that row is the tuning parameter we use.

Results table analysis:

Selected: the independent variables retained by lasso.

Lasso: the coefficient estimates from the lasso fit.

Post-est OLS: the coefficient estimates from a standard multiple linear regression on the selected variables.

Lasso only helps us eliminate the independent variables x_i that may cause multicollinearity. When writing down the final multiple linear regression model, we still use the coefficients from the standard multiple regression.

Note: the selected variables can change with the random numbers generated from the seed.

 

Use of lasso regression: it helps us filter out unimportant variables when building a multiple linear regression model.

Steps:

1. Check whether the independent variables share the same units; if not, standardize them first.

2. Run lasso regression on the variables; those whose coefficients are not 0 are the important variables to keep.

 

 


Origin blog.csdn.net/weixin_73612682/article/details/132095675