Statistical Analysis: Parametric Tests

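When we want to infer a parameter of a large population from a small sample, the methods we use are called parametric tests. What is a parameter? Quantities that describe the population, such as its mean or its standard deviation, are parameters. For example:

Suppose we want to know the average age (the parameter) of first-year junior high school students (the population). We can test a claim about it using the actual average age of a group of students drawn by sampling (the sample).
Parametric tests fall into two classes according to how much is known about the population: 1) part of the population's parameters are known (for example the Z-test, which assumes a known population variance), and 2) nothing is known about the population's parameters (for example the t-test).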

The parametric testing procedure

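State the null hypothesis (the assumption the test starts from)
Choose the test statistic

In statistics, quantities that describe the population are called parameters, and the corresponding quantities computed from a sample are called statistics. For example, to study the average pulse rate (beats per minute) of adult men in some region, we measure 1000 adult men sampled from that region; the resulting sample mean is a statistic.
Compute the p-value of the observed statistic, that is, the probability, assuming the null hypothesis is true, of obtaining a statistic at least as extreme as the one observed. It is not the probability that the null hypothesis is true.
Compare the p-value with the significance level. The significance level is the probability threshold below which we treat an outcome as a rare event; it is usually set to 0.05. When p > 0.05, the data are consistent with the null hypothesis and we do not reject it; when p < 0.05, we reject it. A level of 0.05 means that a true null hypothesis is wrongly rejected about 5% of the time.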
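
The four steps above can be sketched in R for a one-sample t-test. The `ages` vector and the hypothesized mean `mu0` are made-up illustrative values, not data from this article.

```r
# Hypothetical sample: ages of 8 first-year students (illustrative values only)
ages <- c(12.1, 12.8, 13.0, 12.5, 13.4, 12.9, 12.2, 13.1)

# Step 1: null hypothesis H0: population mean age = mu0 (assumed value)
mu0 <- 13

# Step 2: test statistic t = (sample mean - mu0) / standard error
n <- length(ages)
t_obs <- (mean(ages) - mu0) / (sd(ages) / sqrt(n))

# Step 3: two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_obs), df = n - 1)

# Step 4: compare the p-value with the significance level
alpha <- 0.05
reject_h0 <- p_value < alpha

print(c(t = t_obs, p = p_value))
print(reject_h0)
```

The same statistic and p-value should come out of `t.test(ages, mu = mu0)`; writing the steps out only makes the procedure visible.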

Choosing a hypothesis test

Sample design	Distribution	Test family	Test
One sample	Normal, identically distributed	Parametric	One-sample t-test
One sample	Unknown	Nonparametric	Wilcoxon signed-rank test
Matched pairs	Normal, identically distributed	Parametric	Paired t-test
Matched pairs	Unknown	Nonparametric	Wilcoxon signed-rank test
Two independent samples	Normal, identically distributed	Parametric	Two-sample t-test
Two independent samples	Unknown	Nonparametric	Wilcoxon rank sum test (Mann-Whitney test)
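
As a rough guide, each row of the table corresponds to one call to `t.test()` or `wilcox.test()` in base R. The sketch below uses simulated vectors `x` and `y` (an assumption for illustration, not data from this article) and prints the method each call reports.

```r
set.seed(1)                  # simulated data, for illustration only
x <- rnorm(12, mean = 10)    # hypothetical sample 1
y <- rnorm(12, mean = 11)    # hypothetical sample 2 (or the paired partner of x)

one_t  <- t.test(x, mu = 10)                 # one sample, parametric
one_w  <- wilcox.test(x, mu = 10)            # one sample, nonparametric
pair_t <- t.test(x, y, paired = TRUE)        # matched pairs, parametric
pair_w <- wilcox.test(x, y, paired = TRUE)   # matched pairs, nonparametric
two_t  <- t.test(x, y, var.equal = TRUE)     # independent samples, parametric
two_w  <- wilcox.test(x, y)                  # independent samples, nonparametric

# Show which test each call actually ran
methods_used <- sapply(list(one_t, one_w, pair_t, pair_w, two_t, two_w),
                       function(res) trimws(res$method))
print(methods_used)
```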

Wilcoxon rank sum test for independent samples

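Q:
Suppose observations from one distribution F are randomly split into two groups. Test whether the two groups still follow the same distribution.
A = (1.3, 3.4), nA = 2, A~F
B = (4.9, 10.3, 3.3), nB = 3, B~F

Pool all observations, sort them, and assign ranks:
Order: 1.3 3.3 3.4 4.9 10.3
Ranks: 1 2 3 4 5
Compute the observed statistic: R1(obs) = sum of the ranks attached to A = 1 + 3 = 4
Find the distribution of R1 under the null hypothesis: every choice of 2 of the 5 ranks is equally likely, and there are 10 such choices.
Compute the p-value: only the rank pairs {1, 2} and {1, 3} have a sum of at most 4, so P(R1<=4) = P(R1=3) + P(R1=4) = 1/10 + 1/10 = 1/5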
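
The small example above can be checked by brute force. This sketch enumerates every way of assigning 2 of the 5 ranks to group A and counts how often the rank sum is at most the observed value. The variable names are illustrative.

```r
a <- c(1.3, 3.4)
b <- c(4.9, 10.3, 3.3)

# Ranks in the pooled sample; the first length(a) entries belong to A
r <- rank(c(a, b))
r_obs <- sum(r[seq_along(a)])

# Rank sums for all choose(5, 2) = 10 equally likely assignments under H0
all_sums <- combn(length(r), length(a), FUN = sum)

# One-sided p-value P(R1 <= r_obs)
p_enum <- mean(all_sums <= r_obs)

# Same probability from pwilcox(), which works with W = R1 - nA * (nA + 1) / 2
w_obs <- r_obs - length(a) * (length(a) + 1) / 2
p_pwilcox <- pwilcox(w_obs, length(a), length(b))

print(c(r_obs = r_obs, p_enum = p_enum, p_pwilcox = p_pwilcox))
```

Enumeration is only practical for very small samples; for larger ones `pwilcox()` or `wilcox.test()` does the same counting more efficiently.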

> term <- c(0.80, 0.83, 1.89, 1.04, 1.45,1.38, 1.91, 1.64, 0.73, 1.46)

> mid <- c(1.15, 0.88, 0.90, 0.74,1.21)

> rank(c(term,mid))
[1] 3 4 14 7 11 10 15 13 1 12 8 5 6 2 9

> sum(rank(c(term,mid))[1:10])
[1] 90

> sum(rank(c(term,mid))[1:10])-(10*11/2)
[1] 35

> 1-pwilcox(34,10,5) # Beware this is a discrete random variable....
[1] 0.1272061

Output from the test:

> wilcox.test(term, mid, alternative = "g") # greater
Wilcoxon rank sum test
data: term and mid
W = 35, p-value = 0.1272
alternative hypothesis: true mu is greater than 0

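Wilcoxon signed-rank test for paired samples (nonparametric counterpart of the paired t-test)

If boys and girls are equally likely, each birth has probability 0.5 of being a given sex. In the same way, if two paired measurements do not differ systematically, each paired difference is equally likely to be positive or negative. Under that assumption the signed-rank statistic is approximately normally distributed for large samples, so the question of whether the paired samples differ significantly becomes the question of whether the standardized statistic is consistent with the standard normal distribution N(0, 1).
Q:
Is there a significant difference between the mercury (Hg) levels in snakehead fish measured by two detection methods?

A:
Null hypothesis: the paired differences D are distributed symmetrically about 0

Using the mean and variance of the signed-rank statistic under the null hypothesis, standardize it over all pairs to obtain Z(obs)
Two-sided p-value: 2P(Z<=-|Z(obs)|)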
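
A minimal sketch of this normal approximation follows. The Hg measurements themselves are not listed in this article, so the differences `d` below are hypothetical values, chosen with no zeros and no tied absolute values so that the textbook formulas apply without a tie correction.

```r
# Hypothetical paired differences (method 1 minus method 2), illustrative only
d <- c(-0.03, 0.02, -0.05, -0.07, 0.01, -0.08, -0.04, -0.06, 0.10, -0.09)
n <- length(d)

# Signed-rank statistic: sum of the ranks of |d| over the positive differences
V <- sum(rank(abs(d))[d > 0])

# Mean and variance of V under the null hypothesis
mean_V <- n * (n + 1) / 4
var_V  <- n * (n + 1) * (2 * n + 1) / 24

# Standardize and take both tails of N(0, 1)
z_obs <- (V - mean_V) / sqrt(var_V)
p_two_sided <- 2 * pnorm(-abs(z_obs))

print(c(V = V, z = z_obs, p = p_two_sided))
```

With only 10 pairs the approximation is rough. Calling `wilcox.test(d, exact = FALSE, correct = FALSE)` should give the same p-value, while the default call uses the exact distribution when there are no ties.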

> pnorm(-1.27)
[1] 0.1020423
Two-sided p-value: 0.203

> wilcox.test(Hg~way,paired=T)
Wilcoxon signed rank test with correction
data: Hg by way
V = 107, p-value = 0.2242
alternative hypothesis: true mu is not equal to 0

> pt(-1.745,24)
[1] 0.04688927
Two-sided p-value: 0.094

> t.test(Hg~way,paired=T)
Paired t-test
data: Hg by way
t = -1.7448, df = 24, p-value = 0.0938
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.088189837 0.007389837
sample estimates:
mean of the differences
-0.0404
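
A paired t-test is a one-sample t-test on the differences, which the following sketch makes explicit. The two measurement vectors are hypothetical stand-ins for the Hg data, which this article does not list. The vectors are passed to `t.test()` directly instead of through a formula, on the assumption that this form is accepted across R versions; newer releases may reject `paired` in the formula interface.

```r
# Hypothetical paired measurements (illustrative values only)
method_a <- c(0.32, 0.41, 0.28, 0.55, 0.47, 0.39, 0.61, 0.35)
method_b <- c(0.35, 0.39, 0.33, 0.62, 0.46, 0.47, 0.65, 0.41)

# Work with the within-pair differences
d <- method_a - method_b
n <- length(d)

# t statistic and two-sided p-value with n - 1 degrees of freedom
t_obs <- mean(d) / (sd(d) / sqrt(n))
p_two_sided <- 2 * pt(-abs(t_obs), df = n - 1)

# Built-in paired test on the same data, for comparison
res <- t.test(method_a, method_b, paired = TRUE)
print(c(by_hand = t_obs, t_test = unname(res$statistic)))
print(c(by_hand = p_two_sided, t_test = res$p.value))
```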

One-sample t-test vs. Wilcoxon signed-rank test

House price

price <-c(120, 110, 108, 100, 150, 106, 100, 100, 114, 130, 122, 100, 120, 130, 115, 112, 126, 110, 120, 128)

hist(price)

Use the following command to perform a one-sample t-test, testing the null hypothesis that the population mean is 118.

t.test(price, mu=118)
## 
##  One Sample t-test
## 
## data:  price
## t = -0.67654, df = 19, p-value = 0.5069
## alternative hypothesis: true mean is not equal to 118
## 95 percent confidence interval:
##  110.0172 122.0828
## sample estimates:
## mean of x 
##    116.05
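
To see where the t statistic in this output comes from, it can be recomputed by hand from the `price` vector. This is only a sketch of the standard one-sample formula; the variable names other than `price` are illustrative.

```r
price <- c(120, 110, 108, 100, 150, 106, 100, 100, 114, 130,
           122, 100, 120, 130, 115, 112, 126, 110, 120, 128)
mu0 <- 118                      # hypothesized population mean
n <- length(price)

# Standard error of the sample mean
se <- sd(price) / sqrt(n)

# t statistic and two-sided p-value (the output above reports t = -0.67654)
t_obs <- (mean(price) - mu0) / se
p_value <- 2 * pt(-abs(t_obs), df = n - 1)

# 95% confidence interval for the mean
ci <- mean(price) + c(-1, 1) * qt(0.975, df = n - 1) * se

print(c(t = t_obs, df = n - 1, p = p_value))
print(ci)
```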

Use the wilcox.test() command to compare the results with a Wilcoxon signed-rank test. For this test, the null hypothesis is that the population median is 118.

wilcox.test(price, mu=118)
## Warning in wilcox.test.default(price, mu = 118): cannot compute exact p-
## value with ties
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  price
## V = 80, p-value = 0.3594
## alternative hypothesis: true location is not equal to 118
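
The statistic V in this output is the sum of the ranks of the positive deviations from 118. The sketch below recomputes it by hand; tied absolute deviations receive average ranks, which is also why R cannot compute an exact p-value here.

```r
price <- c(120, 110, 108, 100, 150, 106, 100, 100, 114, 130,
           122, 100, 120, 130, 115, 112, 126, 110, 120, 128)

# Deviations from the hypothesized median; zeros would be dropped
d <- price - 118
d <- d[d != 0]

# Rank the absolute deviations (ties get the average rank)
r <- rank(abs(d))

# Signed-rank statistic: sum of ranks belonging to positive deviations
V <- sum(r[d > 0])

# Number of distinct |d| values versus number of observations shows the ties
print(c(V = V, n = length(d), distinct_abs = length(unique(abs(d)))))
```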

Two-sample t-test vs. Wilcoxon rank sum test

Salary data

Salary <- c(18.9, 10.5, 17.5, 13.1, 13.0, 18.2, 22.0, 13.0, 25.0, 12.2,
            10.3, 15.5, 24.4, 11.8, 15.0, 25.6, 11.8, 22.8, 19.4, 12.3,
            22.7, 27.3, 16.0, 11.0, 12.6, 17.7, 17.2, 20.2, 34.0, 36.4,
            11.3, 24.0, 17.6, 26.0, 25.7, 17.2, 14.1, 22.0, 17.2, 20.9,
            16.8, 19.3, 15.8, 27.0, 20.4, 25.5, 30.1, 28.3, 29.5, 31.6)
Sector <- c(rep(0, 25), rep(1, 25))

hist(Salary[Sector==0])
hist(Salary[Sector==1])

Use the following command to perform the two-sample t-test. This assumes that the population variances for each population are equal. We are testing the null hypothesis that the population mean for group 1 = population mean for group 2.

t.test(Salary~Sector, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  Salary by Sector
## t = -3.3933, df = 48, p-value = 0.001392
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -9.16664 -2.34536
## sample estimates:
## mean in group 0 mean in group 1 
##          16.876          22.632
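
The equal-variance t statistic in this output can be reproduced from the pooled variance of the two groups. This is a sketch of the standard pooled formula; the helper names (`g0`, `g1`, `sp2`) are illustrative.

```r
Salary <- c(18.9, 10.5, 17.5, 13.1, 13.0, 18.2, 22.0, 13.0, 25.0, 12.2,
            10.3, 15.5, 24.4, 11.8, 15.0, 25.6, 11.8, 22.8, 19.4, 12.3,
            22.7, 27.3, 16.0, 11.0, 12.6, 17.7, 17.2, 20.2, 34.0, 36.4,
            11.3, 24.0, 17.6, 26.0, 25.7, 17.2, 14.1, 22.0, 17.2, 20.9,
            16.8, 19.3, 15.8, 27.0, 20.4, 25.5, 30.1, 28.3, 29.5, 31.6)
Sector <- c(rep(0, 25), rep(1, 25))

g0 <- Salary[Sector == 0]
g1 <- Salary[Sector == 1]
n0 <- length(g0)
n1 <- length(g1)

# Pooled variance: weighted average of the two sample variances
sp2 <- ((n0 - 1) * var(g0) + (n1 - 1) * var(g1)) / (n0 + n1 - 2)

# t statistic for the difference in means, with n0 + n1 - 2 degrees of freedom
df <- n0 + n1 - 2
t_obs <- (mean(g0) - mean(g1)) / sqrt(sp2 * (1 / n0 + 1 / n1))
p_value <- 2 * pt(-abs(t_obs), df = df)

print(c(t = t_obs, df = df, p = p_value))
```

Dropping `var.equal = TRUE` from the `t.test()` call would run Welch's test instead, which does not pool the variances and uses a different degrees-of-freedom formula.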

The histograms cast doubt on the normality assumption, so use the wilcox.test() command to perform a Wilcoxon rank sum test (Mann-Whitney test) and compare the results with the two-sample t-test.

wilcox.test(Salary~Sector)
## Warning in wilcox.test.default(x = c(18.9, 10.5, 17.5, 13.1, 13, 18.2,
## 22, : cannot compute exact p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Salary by Sector
## W = 156.5, p-value = 0.002547
## alternative hypothesis: true location shift is not equal to 0
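
The statistic W reported here follows the same recipe as the `term`/`mid` example earlier: rank the pooled data, sum the ranks of the first group, and subtract n(n+1)/2. A sketch with illustrative variable names:

```r
Salary <- c(18.9, 10.5, 17.5, 13.1, 13.0, 18.2, 22.0, 13.0, 25.0, 12.2,
            10.3, 15.5, 24.4, 11.8, 15.0, 25.6, 11.8, 22.8, 19.4, 12.3,
            22.7, 27.3, 16.0, 11.0, 12.6, 17.7, 17.2, 20.2, 34.0, 36.4,
            11.3, 24.0, 17.6, 26.0, 25.7, 17.2, 14.1, 22.0, 17.2, 20.9,
            16.8, 19.3, 15.8, 27.0, 20.4, 25.5, 30.1, 28.3, 29.5, 31.6)
Sector <- c(rep(0, 25), rep(1, 25))

# Ranks in the pooled sample (ties get the average rank)
ranks <- rank(Salary)

# Rank sum of group 0, shifted so that the smallest possible value is 0
n0 <- sum(Sector == 0)
rank_sum_0 <- sum(ranks[Sector == 0])
W <- rank_sum_0 - n0 * (n0 + 1) / 2

print(c(rank_sum_group0 = rank_sum_0, W = W))
```

Because the salaries contain ties, R falls back on the normal approximation with a continuity correction for the p-value, which is what the warning in the output refers to.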
