Stata Simple Regression and Test – Pan Deng's stata notes

OLS regression

sysuse auto, clear
regress price weight   // OLS

aaplot price weight    // plot the fitted line (user-written command: ssc install aaplot)

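t-test of a coefficient

// The OLS coefficient estimate is a random variable; its SE measures how uncertain it is

regress price weight
local t = _b[weight]/_se[weight]
dis "t-value = " %4.2f `t'

// density of the t distribution with 72 degrees of freedom; the dashed line marks the t-value
twoway function y=tden(72, x),   ///
        range(-6 6) xline(`t', lp(dash) lc(red))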
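
The dashed line only shows where the t-value falls in the t distribution. As a minimal sketch, the two-sided p-value can also be computed by hand from the stored results, assuming the usual Student-t reference distribution with e(df_r) residual degrees of freedom. The local macro names t, df and p are illustrative.

```stata
sysuse auto, clear
regress price weight

// t statistic and residual degrees of freedom from the stored results
local t  = _b[weight]/_se[weight]
local df = e(df_r)

// two-sided p-value: twice the upper-tail probability beyond |t|
local p = 2*ttail(`df', abs(`t'))
dis "t = " %5.2f `t' "   df = " `df' "   p = " %6.4f `p'

// 5% two-sided critical value, for comparison with |t|
dis "critical value = " %5.3f invttail(`df', 0.025)
```

The values should match the t and P>|t| columns of the regress output.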

Heteroscedasticity-robust standard errors

sysuse auto, clear
reg price weight, robust

Calculate Fits and Residuals

regress price weight

predict price_fit, xb  // fitted values; xb is the default and may be omitted
gen price_fit2 = _b[_cons] + _b[weight]*weight // computed by hand

predict e, residual    // residuals; the residual option is required and may be abbreviated to r
gen e2 = price - price_fit // computed by hand

br price weight price_* e*
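
Because predict and generate store float variables by default, the two versions can differ slightly through rounding. The sketch below assumes double-precision storage and uses the illustrative variable names fit_p, fit_m, res_p and res_m to check that the manual calculations reproduce predict.

```stata
sysuse auto, clear
regress price weight

// store everything as double to avoid float rounding differences
predict double fit_p, xb
gen double fit_m = _b[_cons] + _b[weight]*weight

predict double res_p, residuals
gen double res_m = price - fit_m

// assert stops with an error if any observation differs beyond the tolerance
assert reldif(fit_p, fit_m) < 1e-8
assert reldif(res_p, res_m) < 1e-8

// OLS residuals average to zero when the model includes a constant
summarize res_p
```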

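Residual analysis

Compute the normal wage and the excess wage

sysuse nlsw88, clear

global x "age hours tenure collgrad married south"
reg wage $x
keep if e(sample)   // keep only the observations used in the regression; see D3_miss.do

predict normal_wage       // normal wage (linear fitted values)

predict excess_wage, res  // excess wage (residuals, which may be positive or negative)

// Further analysis

histogram excess_wage     // histogram; see G3_histogram.do

tabstat excess_wage, by(industry)  c(s)   /// summary statistics by industry
        s(mean N sd p50 min max) f(%4.2f)

global z "i.race union never_married"
reg excess_wage $z    // determinants of the excess wage (incomplete specification)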

Correlation coefficient matrix

Correlation Matrix Scatterplot

sysuse auto, clear
graph matrix price weight length mpg

Pearson correlation coefficient

sysuse nlsw88, clear

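// Official Stata command
global x "age grade wage hours ttl_exp tenure"
pwcorr $x      // drawbacks: (1) two decimal places would be preferable; (2) significance is not marked
pwcorr $x, sig        // p-values are shown, but tidying the table is tedious
pwcorr $x, star(0.05) // the number of decimal places is still hard to change

// User-written commands; the main difference is that pwcorr_a marks one to three stars by significance level
pwcorr_a $x, format(%7.3f)
pwcorr_c $x, star(0.05) format(%7.2f) // close to what most journals require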

Spearman correlation coefficient

sysuse nlsw88, clear

// Official Stata command
global x "age grade wage hours ttl_exp tenure"
spearman $x, star(0.05)

Combined presentation of the Spearman and Pearson correlation coefficient matrices

sysuse nlsw88, clear

// User-written command (corsp is not an official Stata command)
global x "age grade wage hours ttl_exp tenure"
corsp $x, format(%7.3f)
corsp $x, format(%7.3f) pvalue

Notes:

  • Pearson correlation coefficients are in the lower triangle
  • Spearman correlation coefficients are in the upper triangle
  • Asterisks can be added according to the p-values to mark significance levels
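
If corsp is not installed, a similar layout can be assembled from official commands. The following is only a sketch of that idea, not how corsp itself works: it takes r(C) from correlate and r(Rho) from spearman and merges the two triangles in Mata. The matrix names P, S and PS are illustrative.

```stata
sysuse nlsw88, clear
global x "age grade wage hours ttl_exp tenure"

// Pearson correlations (correlate drops observations with any missing value)
quietly correlate $x
matrix P = r(C)

// Spearman correlations (spearman also uses complete cases by default)
quietly spearman $x
matrix S = r(Rho)

// keep the lower triangle of P and the upper triangle of S;
// both include the unit diagonal, so subtract one identity matrix
mata:
P = st_matrix("P")
S = st_matrix("S")
st_matrix("PS", lowertriangle(P) + uppertriangle(S) - I(rows(P)))
end

matrix rownames PS = $x
matrix colnames PS = $x
matrix list PS, format(%7.3f)
```

This version shows coefficients only; significance stars would still have to be added separately.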

Difference between the Spearman and Pearson correlation coefficients

  • If the variables are continuous, approximately normally distributed, and linearly related, either coefficient is acceptable, and the Pearson coefficient is preferred;
  • If any of these conditions fails, use the Spearman coefficient instead of the Pearson coefficient
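
A small simulated example may make the difference concrete. The data below are artificial (the variables x and y are made up for illustration): y increases monotonically in x, but not linearly.

```stata
clear
set obs 100
gen x = _n
gen y = exp(x/10)     // increasing in x, but far from linear

correlate x y         // Pearson: below 1 because the relationship is not linear
spearman x y          // Spearman: 1 because the two rankings agree perfectly
```

Spearman's coefficient depends only on ranks, so any strictly increasing transformation of y leaves it unchanged, while Pearson's coefficient changes with the functional form.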

t-test

Univariate t-test

sysuse nlsw88, clear

ttest wage, by(collgrad)  

ttest wage, by(race)            // error: race has three groups, but by() requires exactly two
ttest wage if race!=3, by(race) // restricting the sample to two groups works
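
The default two-sample t-test assumes equal variances in the two groups, which makes it numerically equivalent to an OLS regression on the group dummy. The sketch below illustrates this with collgrad; the local macro names t_ttest and t_reg are illustrative.

```stata
sysuse nlsw88, clear

// two-sample t-test with equal variances (the default)
ttest wage, by(collgrad)
local t_ttest = r(t)

// the same comparison as a regression on the group dummy
regress wage collgrad
local t_reg = _b[collgrad]/_se[collgrad]

// same magnitude; the sign differs because ttest reports group 0 minus group 1
dis "ttest: " %6.3f `t_ttest' "   regress: " %6.3f `t_reg'

// allow different variances in the two groups (Satterthwaite's approximation)
ttest wage, by(collgrad) unequal
```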

Multivariate t-test

Essentially, this combines the results of several univariate t-tests into one table.

sysuse nlsw88, clear
global x "wage hours tenure ttl_exp" // variables to test
ttable3 $x, by(collgrad)

normdiff command: output t-value or p-value

sysuse nlsw88, clear
global x "wage hours tenure ttl_exp" // variables to test
normdiff $x, over(collgrad)   ///
        diff t p n(below) f(%16.2f) quietly nonormdiff

normdiff: standardized difference

sysuse nlsw88, clear
global x "wage hours tenure ttl_exp" // variables to test
qui reg $x
keep if e(sample) // ensures that all variables have the same number of observations
normdiff $x, over(collgrad)   ///
            diff t p n(below) f(%16.2f) quietly

Differences in a variable across several grouping variables

In essence, this runs a univariate t-test for each grouping variable and merges the results.

sysuse nlsw88, clear
ttestplus wage, by(married union collgrad south)
// Group1: var=0; Group2: var=1

Robust standard errors

White Heteroscedasticity Robust Standard Errors

sysuse nlsw88, clear
global x "ttl_exp race age industry hours"

reg wage $x
est store homo

reg wage $x, robust  // White(1980)
est store robust

esttab homo robust, mtitle(Homo Het_Robust) nogap

Notes:

  • This is the method adopted by more than 90% of the literature;
  • Robust standard errors for the more complex models that follow are also largely built on White (1980)
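
In Stata, the robust option is shorthand for vce(robust), which applies the HC1 degrees-of-freedom scaling; regress also offers vce(hc2) and vce(hc3) for smaller samples. The comparison below is a sketch of how these variants could be set side by side; the stored names hc1 and hc3 are illustrative.

```stata
sysuse nlsw88, clear
global x "ttl_exp race age industry hours"

reg wage $x
est store homo

reg wage $x, vce(robust)   // same as the robust option (HC1 scaling)
est store hc1

reg wage $x, vce(hc3)      // more conservative small-sample variant
est store hc3

// coefficients are identical across columns; only the standard errors change
esttab homo hc1 hc3, se mtitle(Homo HC1 HC3) nogap
```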

Cluster-adjusted standard errors

Idea:

  • The error terms of individuals in the same industry are correlated with each other
  • The error terms of individuals in different industries are uncorrelated with each other

sysuse nlsw88, clear
global x "ttl_exp race age industry hours"
reg wage $x, vce(cluster industry)

// Clustering on two dimensions: industry and occupation
// (note: this clusters on each industry-occupation cell, which is not the same as two-way clustering)

egen indoccu = group(industry occupation) // see D5_egen.do
sort industry occupation
br industry occupation indoccu

reg wage $x, vce(cluster indoccu)

Bootstrap Robust Standard Errors

Basic idea: assuming the sample is drawn at random from the population, the population distribution is simulated by repeatedly resampling from the sample.

  1. Estimate the original model by OLS to obtain the coefficient estimate of $x$, $\beta_x$
  2. Draw N observations from the sample with replacement, run OLS, and record the coefficient estimate
  3. Repeat step 2 300 times to obtain 300 coefficient estimates, $\beta_j = \{\beta_1, \beta_2, ..., \beta_{300}\}$
  4. Compute the standard deviation of these 300 estimates, $sd(\beta_j) = sd\{\beta_1, \beta_2, ..., \beta_{300}\}$, and treat it as the standard error of the actual estimate $\beta_x$, namely $sd(\beta_j) = se(\beta_x)$
  5. Compute the t-value, $t = \beta_x/se(\beta_x)$, and the corresponding p-value

reg wage hours, vce(bs,reps(300) noheader nodots) 
reg wage hours, robust noheader // White s.e.
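
The five steps can also be written out by hand, which shows what vce(bs) does behind the scenes. This is a minimal sketch rather than Stata's internal implementation: it uses bsample inside a loop and postfile to collect the estimates, and the names sim, draws, b_hat and se_bs are illustrative. The p-value uses a normal approximation, which is an assumption of this sketch.

```stata
sysuse nlsw88, clear
keep if !missing(wage, hours)   // resample only observations used in the regression
set seed 13579                  // makes the resampling reproducible

// Step 1: estimate on the original sample
quietly reg wage hours
scalar b_hat = _b[hours]

// Steps 2-3: resample with replacement 300 times and record each coefficient
tempname sim
tempfile draws
postfile `sim' double b using `draws'
forvalues j = 1/300 {
    preserve
    bsample                     // draw _N observations with replacement
    quietly reg wage hours
    post `sim' (_b[hours])
    restore
}
postclose `sim'

// Step 4: the sd of the 300 estimates serves as the standard error
use `draws', clear
quietly summarize b
scalar se_bs = r(sd)

// Step 5: test statistic and p-value (normal approximation assumed here)
dis "b = " %6.4f scalar(b_hat) "   bootstrap se = " %6.4f scalar(se_bs)
dis "t = " %5.2f scalar(b_hat)/scalar(se_bs) ///
    "   p = " %6.4f 2*normal(-abs(scalar(b_hat)/scalar(se_bs)))
```

The resulting standard error should be close to, though not identical with, the one from vce(bs, reps(300)), because the random draws differ.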

Notes:

  1. In most cases, 1,000 bootstrap replications give very stable results
  2. Most Stata estimation commands support the vce(bs) option
  3. Before submitting a paper, set the seed so that the results can be reproduced

sysuse auto, clear
reg price weight, vce(bs,reps(1000) seed(13579))

Presentation and output of results

  • regfit: display the fitted linear equation (user-written command)

sysuse auto, clear
reg price weight length mpg trunk i.foreign i.rep78
regfit
dis in g "R-square = " in y %4.2f e(r2) in g "  F = " in y %4.2f e(F)

  • ereturn list: show the results stored by the estimation command

reg price weight length mpg trunk i.foreign i.rep78
ereturn list 
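
The stored results can be reused directly in later calculations. A short sketch of the three kinds of e() results (scalars, macros and matrices) follows; the matrix names b and V are illustrative.

```stata
sysuse auto, clear
reg price weight length mpg trunk i.foreign i.rep78

// scalars: sample size, R-squared, residual degrees of freedom
dis "N = " e(N) "   R2 = " %5.3f e(r2) "   df_r = " e(df_r)

// macros: the command and the dependent variable
dis "`e(cmd)' of `e(depvar)'"

// matrices: coefficient vector and its variance-covariance matrix
matrix b = e(b)
matrix V = e(V)
matrix list b

// weight is the first regressor, so its standard error is the
// square root of the first diagonal element of V
dis "se(weight) = " %8.4f sqrt(V[1,1]) "   _se[weight] = " %8.4f _se[weight]
```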

  • logout: export the results to a file (user-written command)

// load the data
sysuse nlsw88.dta, clear
 
global xx "wage age tenure ttl_exp hours married"
logout, save("Tab1_statis") excel replace: ///
tabstat $xx, stat(mean p50 sd min max)   ///
                        format(%3.2f) column(statis)

  • est store: store estimation results temporarily
  • esttab: display the stored results

// load the data
sysuse nlsw88.dta, clear
 
global xx "wage age tenure ttl_exp hours married"
reg $xx
est store full
reg $xx if race==1
est store white
reg $xx if race==2
est store black
reg $xx i.occupation
est store occu

esttab full white black occu, nogap // basic settings

// continued from above
// local s "using Tab3_reg.csv"    // local macro holding the output file (CSV, opens in Excel)
local m "full white black occu" // local macro holding the model names
// esttab `m' `s', nogap compress replace   writes the table directly to the csv file
esttab `m' , nogap compress             ///
        mtitle("Full" "White" "Black" "with_occu") ///
        b(%4.3f) t(%4.2f)                          ///
        scalar(N r2_a)                             ///
        star(* 0.1 ** 0.05 *** 0.01)               ///
        drop(*.*)

Where:

  • nogap: remove blank lines
  • compress: present the results in a more compact form
  • replace: overwrite an existing file
  • b(%4.3f): report coefficients with three decimal places
  • t(%4.2f): report t-values with two decimal places
  • scalar(): statistics shown in the last two rows: N is the number of observations; r2_a is the adjusted R-squared

Origin blog.csdn.net/weixin_52185313/article/details/130118380