Stata Simple Regression and Test – Pan Deng's stata notes
OLS regression
sysuse auto, clear
regress price weight // OLS
aaplot price weight // plot the fitted line (aaplot is a user-written command)
Coefficient's t-test
// The OLS coefficient estimate is a random variable; its SE measures how uncertain it is
regress price weight
dis "t-value = " %4.2f _b[weight]/_se[weight]
twoway function y=tden(72, x), ///
rang(-6 6) xline(5.2, lp(dash) lc(red))
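The dashed line in the plot marks where the t-value falls in the t distribution. As a small sketch (the local names `t` and `p` are only illustrative), the two-sided p-value can be recovered from the same stored results with `ttail()`:

```stata
sysuse auto, clear
quietly regress price weight
// t statistic from the stored coefficient and standard error
local t = _b[weight]/_se[weight]
// two-sided p-value from the t distribution with e(df_r) residual degrees of freedom
local p = 2*ttail(e(df_r), abs(`t'))
display "t = " %5.2f `t' "   df = " e(df_r) "   p = " %6.4f `p'
```

The degrees of freedom reported here should match the first argument of `tden()` used in the plot above.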
Heteroscedasticity-robust standard errors
sysuse auto, clear
reg price weight, robust
Calculate Fits and Residuals
regress price weight
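predict price_fit, xb // fitted values; xb is the default and can be omitted
gen price_fit2 = _b[_cons] + _b[weight]*weight // computed by hand
predict e, residual // residuals; the residual option is required and can be abbreviated to r
gen e2 = price - price_fit // computed by hand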
br price weight price_* e*
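Instead of browsing the data, the agreement between the built-in and hand-computed values can be checked directly. The sketch below uses its own illustrative variable names (`fit`, `fit2`, `e`) and stores everything as double to limit rounding differences:

```stata
sysuse auto, clear
quietly regress price weight
predict double fit, xb
predict double e, residuals
generate double fit2 = _b[_cons] + _b[weight]*weight
compare fit fit2      // the two sets of fitted values should agree
summarize e           // with a constant in the model, the residual mean is zero up to rounding
correlate e weight    // OLS residuals are uncorrelated with the regressor
```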
Residual analysis
Compute normal wages and excess wages
sysuse nlsw88, clear
global x "age hours tenure collgrad married south"
reg wage $x
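keep if e(sample) // keep only the observations used in the regression; see D3_miss.do
predict normal_wage // normal wage (linear fitted values)
predict excess_wage, res // excess wage (residuals; may be positive or negative)
// Further analysis
histogram excess_wage // histogram; see G3_histogram.do
tabstat excess_wage, by(industry) c(s) /// // summary statistics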
s(mean N sd p50 min max) f(%4.2f)
global z "i.race union never_married"
reg excess_wage $z // determinants of excess wage (incomplete specification)
Correlation coefficient matrix
Correlation Matrix Scatterplot
sysuse auto, clear
graph matrix price weight length mpg
Pearson correlation coefficient
sysuse nlsw88, clear
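// Official Stata command
global x "age grade wage hours ttl_exp tenure"
pwcorr $x // drawbacks: (1) two decimal places would be preferable; (2) significance levels are not marked
pwcorr $x, sig // tedious to tidy up
pwcorr $x, star(0.05) // the number of decimal places is hard to adjust
// User-written commands: the main difference between _a and _c is that _a marks one to three stars according to the significance level
pwcorr_a $x, format(%7.3f)
pwcorr_c $x, star(0.05) format(%7.2f) // closer to what most journals require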
Spearman correlation coefficient
sysuse nlsw88, clear
// Official Stata command
global x "age grade wage hours ttl_exp tenure"
spearman $x, star(0.05)
Combined presentation of the Spearman and Pearson correlation coefficient matrices
sysuse nlsw88, clear
// User-written command (corsp is not part of official Stata)
global x "age grade wage hours ttl_exp tenure"
corsp $x, format(%7.3f)
corsp $x, format(%7.3f) pvalue
Note:
- Pearson correlation coefficients are in the lower triangle
- Spearman correlation coefficients are in the upper triangle
- Asterisks can be added according to the p-value to mark the significance level
Difference Between Spearman and Pearson Correlation Coefficients
- If the variables are continuous, normally distributed, and linearly related, both coefficients are acceptable, and the Pearson correlation coefficient is preferred;
- If any of these conditions is not met, use the Spearman correlation coefficient instead of the Pearson correlation coefficient
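One way to see why Spearman is less demanding: it is the Pearson coefficient computed on ranks rather than on the raw values. The following sketch illustrates this on two variables from nlsw88; the variable names `r_wage` and `r_tenure` are only illustrative.

```stata
sysuse nlsw88, clear
// use a common sample so that the ranks are computed on the same observations
keep if !missing(wage, tenure)
egen r_wage   = rank(wage)
egen r_tenure = rank(tenure)
correlate r_wage r_tenure   // Pearson correlation of the ranks
spearman wage tenure        // Spearman correlation of the raw values; should match
```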
t-test
Univariate t-test
sysuse nlsw88, clear
ttest wage, by(collgrad)
ttest wage, by(race) // invalid command: race has more than two groups
ttest wage if race!=3, by(race) // restricting the sample to two groups fixes it
Multivariate t-test
Essentially, this combines the results of several univariate t-tests into one table
sysuse nlsw88, clear
global x "wage hours tenure ttl_exp" // list of variables to test
ttable3 $x, by(collgrad)
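If the user-written ttable3 command is not installed, the same tests can be run one at a time with official commands. This minimal sketch only loops over the variable list; it does not reproduce the combined table layout.

```stata
sysuse nlsw88, clear
global x "wage hours tenure ttl_exp"
// one univariate two-group t-test per variable in the list
foreach v of global x {
    display as text _newline "t-test of `v' by collgrad"
    ttest `v', by(collgrad)
}
```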
normdiff command: output t-value or p-value
sysuse nlsw88, clear
global x "wage hours tenure ttl_exp" // list of variables to test
normdiff $x, over(collgrad) ///
diff t p n(below) f(%16.2f) quietly nonormdiff
normdiff: standardized difference
sysuse nlsw88, clear
global x "wage hours tenure ttl_exp" // list of variables to test
qui reg $x
keep if e(sample) // ensure all variables have the same number of observations
normdiff $x, over(collgrad) ///
diff t p n(below) f(%16.2f) quietly
Variation of variables across multiple groups
In essence, this runs a univariate t-test on one variable for each grouping variable and merges the results
sysuse nlsw88, clear
ttestplus wage, by(married union collgrad south)
// Group1: var=0; Group2: var=1
Robust standard errors
White Heteroscedasticity Robust Standard Errors
sysuse nlsw88, clear
global x "ttl_exp race age industry hours"
reg wage $x
est store homo
reg wage $x, robust // White(1980)
est store robust
esttab homo robust, mtitle(Homo Het_Robust) nogap
Note:
- This is the method adopted in most of the literature (by a rough estimate, more than 90%);
- The robust standard errors used in the more complex models that follow are also largely based on White (1980)
Cluster-adjusted standard error
Idea:
- The error terms of individuals in the same industry are correlated with each other
- The error terms of individuals in different industries are uncorrelated with each other
sysuse nlsw88, clear
global x "ttl_exp race age industry hours"
reg wage $x, vce(cluster industry)
// Clustering on two dimensions: industry and occupation (clusters are industry-by-occupation cells)
egen indoccu = group(industry occupation) // see D5_egen.do
sort industry occupation
br industry occupation indoccu
reg wage $x, vce(cluster indoccu)
Bootstrap Robust Standard Errors
Basic idea: assuming that the sample is drawn at random from the population, the distribution of the population is simulated by repeatedly drawing samples from the sample:
- Estimate the original model by OLS to obtain the estimated coefficient of $x$, $\beta_x$
- Draw N observations from the sample with replacement, run OLS, and record the coefficient estimate
- Repeat the previous step 300 times to obtain 300 recorded coefficient estimates, namely $\beta_j = \{\beta_1, \beta_2, \ldots, \beta_{300}\}$;
- Compute the standard deviation of these 300 estimates, $sd(\beta_j) = sd\{\beta_1, \beta_2, \ldots, \beta_{300}\}$, and treat it as the standard error of the actual estimate $\beta_x$, namely $sd(\beta_j) = se(\beta_x)$
- Calculate the t-value, $t = \beta_x / se(\beta_x)$, and the corresponding p-value
reg wage hours, vce(bs,reps(300) noheader nodots)
reg wage hours, robust noheader // White s.e.
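The steps listed above can also be carried out by hand, which makes clear what the vce(bs) option does internally. The sketch below is one possible implementation on the auto data, not the exact routine Stata uses; the names `sim`, `draws`, and `b` are illustrative, and the seed is arbitrary.

```stata
sysuse auto, clear
set seed 13579
tempname sim
tempfile draws
postfile `sim' b using `draws'
forvalues i = 1/300 {
    preserve
    bsample                       // draw N observations with replacement
    quietly regress price weight
    post `sim' (_b[weight])       // record the coefficient estimate
    restore
}
postclose `sim'
// coefficient from the original sample
quietly regress price weight
local b = _b[weight]
// the standard deviation of the 300 estimates serves as the standard error
use `draws', clear
quietly summarize b
display "bootstrap se = " %6.4f r(sd) "   t = " %5.2f `b'/r(sd)
```

The resulting standard error should be close to, but not identical with, the one reported by `reg price weight, vce(bs, reps(300))`, since the draws differ.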
Note:
- In most cases, 1000 bootstrap replications give very stable results
- Most estimation commands in Stata support the vce(bs) option
- Before submitting a paper, set the seed so that the results can be reproduced
reg price weight, vce(bs,reps(1000) seed(13579))
Presentation and output of results
regfit: output the fitted linear equation
reg price weight length mpg trunk i.foreign i.rep78
regfit
dis in g "R-square = " in y %4.2f e(r2) in g " F = " in y %4.2f e(F)
ereturn: list the results stored after estimation
reg price weight length mpg trunk i.foreign i.rep78
ereturn list
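Anything shown by ereturn list can be used in later commands. As a brief sketch with a simpler model than the one above, scalars are referenced as e(name) and the coefficient vector is available as the matrix e(b):

```stata
sysuse auto, clear
quietly regress price weight length mpg
// scalars stored by regress
display "N = " e(N) "   adj. R2 = " %5.3f e(r2_a)
// coefficient vector stored as a matrix
matrix list e(b)
```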
logout: export results to a document
// Load the data
sysuse nlsw88.dta, clear
global xx "wage age tenure ttl_exp hours married"
logout, save("Tab1_statis") excel replace: ///
tabstat $xx, stat(mean p50 sd min max) ///
format(%3.2f) column(statis)
est store: store estimation results temporarily
esttab: display the stored results
// Load the data
sysuse nlsw88.dta, clear
global xx "wage age tenure ttl_exp hours married"
reg $xx
est store full
reg $xx if race==1
est store white
reg $xx if race==2
est store black
reg $xx i.occupation
est store occu
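esttab full white black occu, nogap // basic setup
// Continued from above
// local s "using Tab3_reg.csv" // local macro holding the output file (opens in Excel)
local m "full white black occu" // local macro holding the model names
// esttab `m' `s', nogap compress replace writes the table directly to a csv file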
esttab `m' , nogap compress ///
mtitle("Full" "White" "Black" "with_occu") ///
b(%4.3f) t(%4.2f) ///
scalar(N r2_a) ///
star(* 0.1 ** 0.05 *** 0.01) ///
drop(*.*)
where:
- nogap: remove blank lines
- compress: present the results in a more compact form
- replace: overwrite an existing file
- b(%4.3f): coefficients are shown to three decimal places
- t(%4.2f): t-values are shown to two decimal places
- scalar(): statistics for the last two rows: N, the number of observations; r2_a, the adjusted R-squared