Probability and Statistics 17: Point Estimates and the Continuity Correction

  Original |   https://mp.weixin.qq.com/s/NV3ThVwhM5dTIDQAWITSQQ

  Probability and statistics are two closely related concepts that, in fact, study opposite problems.

  Probability uses a model with known parameters to predict the results the model will generate, and studies the related numerical features, such as expectation and variance. Suppose we know that a shooter's scores follow a normal distribution with mean 8.2 and variance 1.5; we can then make a rough estimate of the score of any single shot.

  Statistics is the opposite of probability: given a pile of data, it uses the data to infer the model and its parameters. Say a stranger arrives; we know nothing about him, but he claims to be a good professional athlete. After a series of shooting tests, the coaching staff collects his scores and, by inspecting the data, finds that his results follow a normal distribution (that is, they determine the model); the data are then used to infer the specific values of the model's parameters. For the normal distribution, those parameters are the mean and the variance.

  It turns out that most of the problems we encounter are statistical ones. Probability is the objective law of random events, but unfortunately that law always appears as an unknown quantity. As compensation, we have a set of data samples; although these may be far fewer than the population, we can still use them to estimate the population's parameters and thereby approximate the population's distribution. This process of estimating population parameters from a sample is called parameter estimation. Depending on the form of the result, it is divided into point estimation and interval estimation.

  

  A point estimate uses a statistic computed from a specific sample to directly infer an unknown population parameter, producing a single concrete value. The moment estimation, maximum likelihood estimation and Bayesian estimation we discussed earlier are all point estimates.

  Taking moment estimation as an example, let's look once more at how to estimate the population from a sample.

 

Population size and sample size

  We often denote the population size by n, but what exactly counts as a population?

  "Population" always suggests something large, but that is not necessarily so; the size of the "population" can differ enormously between problems. For example, a brewery may produce 10 million cans of beer in a year, while a class has 60 students; both the 10 million cans and the 60 students are populations. We denote the population size by n.

  Since we estimate the population from samples, sampling is of course involved; for more on sampling, see Data Analysis (4): a chat about sampling, or is a seemingly random sample really fair? We denote the sample size by m.

Estimating the population mean

  A doctor's diagnosis depends to a large extent on the results of a blood test. It takes a while to get the report after blood is drawn, and the wait is uncertain: with luck it may take ten minutes, or it may take an hour. Suppose we now obtain a set of samples X = {X₁, X₂, …, Xₘ}, where each data point is the time one patient waited for a single report. Our goal is to estimate the population's average waiting time from this sample.

  The calculation is very simple: we just compute the sample mean:

x̄ = (X₁ + X₂ + … + Xₘ) / m

  We believe that the sample's distribution resembles the population's, so the sample mean is the best approximate description of the population mean that the currently available data can give. Using the sample mean as the estimate of the population mean is the method of moment estimation, and the resulting value is a point estimate of the population mean.

  The following code shows the relationship between the population distribution and the distributions of samples drawn from it:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

fig = plt.figure(figsize=(10, 5))
plt.subplots_adjust(hspace=0.5)  # adjust the vertical spacing between subplots

mu, sigma_square = 30, 5  # mean and variance
sigma = sigma_square ** 0.5  # standard deviation
xs = np.arange(15, 45, 0.5)
ys = stats.norm.pdf(xs, mu, sigma)
ax = fig.add_subplot(2, 2, 1)
ax.plot(xs, ys, label='density curve')
ax.vlines(mu, 0, 0.2, linestyles='--', colors='r', label='mean')
ax.legend(loc='upper right')
ax.set_xlabel('X')
ax.set_ylabel('pdf')
ax.set_title(r'X~N($\mu$, $\sigma^2$), $\mu$={0}, $\sigma^2$={1}'.format(mu, sigma_square))

for i in [1, 2, 3]:
    m = 10 ** i  # sample size: 10, 100, 1000
    np.random.seed(m)
    X = stats.norm.rvs(loc=mu, scale=sigma, size=m)  # draw m normally distributed values
    X = np.trunc(X)  # truncate the data to integers
    mu_x = X.mean()  # sample mean
    ax = fig.add_subplot(2, 2, 1 + i)
    ax.hist(X, bins=40)
    ax.set_xlabel('X')
    ax.set_ylabel('frequency')
    ax.set_title('m={0}, sample mean={1:.2f}'.format(m, mu_x))

plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can also display Chinese labels (kept from the original)
# plt.rcParams['axes.unicode_minus'] = False  # fix minus-sign display when using a Chinese font
plt.show()

  As you can see, the more samples we draw, the closer the sample distribution is to the population distribution. There is also the question of notation for the mean: across different materials it is sometimes μ, sometimes μ wearing a hat (μ̂), and sometimes x̄ with a bar. Which one should we use?

  In the past we often said that some quantity X follows the N(μ, σ²) distribution, where μ is the population mean; the result produced by maximum likelihood estimation, by contrast, is an estimator of the population mean obtained from a sample. By convention, μ denotes the true population mean, while μ̂ (or x̄, the sample mean) denotes its estimate computed from a sample.

Estimating the population variance

  Suppose we have already computed the point estimate of the mean. Can the sample variance then be computed with the same formula as the population variance?

  From the plots above you can see that most of the data concentrate near the mean, and the probability of extreme values appearing is very low. This means that the smaller the sample, the less likely it is to contain extreme values. Variance characterizes how much the data fluctuate around the expectation; since extreme values rarely show up in a sample, the sample's fluctuation is likely to be lower than the population's, and a variance computed with the population formula ① will understate the population variance. To cope with this, we often see another formula ② used to compute the sample variance:

① σ̂² = ((X₁ − x̄)² + (X₂ − x̄)² + … + (Xₘ − x̄)²) / m

② s² = ((X₁ − x̄)² + (X₂ − x̄)² + … + (Xₘ − x̄)²) / (m − 1)

  Writing the common numerator as a, a/(m − 1) is certainly greater than a/m, which makes the result of ② slightly larger than ①; the smaller m is, the more obvious the difference between ① and ②. As the sample size grows, the chance of drawing extreme values also grows, and the difference between ① and ② shrinks. The sample variance used as the point estimate of the population variance is usually denoted s².

  It is worth mentioning that if we have m samples and want the actual variance of those samples themselves, we use ① directly; if we are using the samples to estimate the population's variance, we should use ②.
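  In NumPy the two formulas differ only in the ddof argument of np.var. A minimal sketch (the sample data here are made up for illustration):

import numpy as np

np.random.seed(42)
X = np.random.normal(30, 5 ** 0.5, size=10)  # hypothetical sample from N(30, 5)

print(np.var(X))          # ①: divide by m, the sample's own variance
print(np.var(X, ddof=1))  # ②: divide by m-1, estimate of the population variance

  With only 10 samples the two results differ noticeably; as m grows they converge.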

Estimating the population proportion

  Many people get their report within 30 minutes, and just as many have to wait longer. We can compute the proportion of successes in the sample (the number of people who got their report within 30 minutes) and use that proportion as a point estimate of the population probability:

p̂ = (number of successes in the sample) / m
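  A minimal sketch (the waiting times below are made-up numbers):

import numpy as np

waits = np.array([12, 45, 28, 33, 19, 52, 24, 31, 17, 40])  # hypothetical waiting times in minutes
p_hat = np.mean(waits <= 30)  # proportion who got the report within 30 minutes
print(p_hat)  # 0.5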

  So far point estimation has remained very simple, which is why people often scoff: what is there to study in something as simple as probability?

The probability of a sample

  After years of statistical analysis, the hospital can now state that every patient has a 50% probability of getting the report within 30 minutes. We write p = 50% for the proportion of people in the population who get their report within 30 minutes. If we regard a patient getting the report within 30 minutes as a success and let the random variable X denote the number of successes among m samples, then X follows a binomial distribution with parameters m and p, X~B(m, p): the number of successes follows a binomial distribution determined by the number of trials and the success rate. Holding the number of trials m fixed, the binomial distribution is approximated by a normal distribution with mean mp and variance mpq (q = 1 − p).

  The following code plots binomial distributions and their approximating normal distributions:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

fig = plt.figure(figsize=(10, 6))
plt.subplots_adjust(hspace=0.8, wspace=0.3)  # adjust the margins between subplots

p = 0.5  # probability of success on each trial
q = 1 - p  # probability of failure on each trial
m_list = [10, 15, 20]  # numbers of trials
c_list = ['r', 'g', 'b']  # curve colors
m_max = max(m_list)

# binomial distribution X~B(m,p)
for i, m in enumerate(m_list):
    ax = fig.add_subplot(3, 2, i * 2 + 1)
    xs = np.arange(0, m + 1, 1)  # values of the random variable
    ys = stats.binom.pmf(xs, m, p)  # binomial distribution X~B(m,p)
    ax.vlines(xs, 0, ys, colors=c_list[i], label='m={}, p={}'.format(m, p))
    ax.set_xticks(list(range(0, m_max + 1, 2)))  # reset the x-axis ticks
    ax.set_xlabel('X')
    ax.set_ylabel('pmf')
    ax.set_title('X~B(m, p)')
    ax.legend(loc='upper right')

# holding the number of trials m fixed, the binomial is approximated by a
# normal distribution with mean mp and variance mp(1-p):
for i, m in enumerate(m_list):
    ax = fig.add_subplot(3, 2, i * 2 + 2)
    xs = np.arange(0, m + 1, 0.1)  # values of the random variable
    mu, sigma = m * p, (m * p * q) ** 0.5
    ys = stats.norm.pdf(xs, mu, sigma)
    ax.plot(xs, ys, c=c_list[i], label='m={}, p={}'.format(m, p))
    ax.set_xticks(list(range(0, m_max + 1, 2)))  # reset the x-axis ticks
    ax.set_xlabel('X')
    ax.set_ylabel('pdf')
    ax.set_title('X~N(mp, mpq)')
    ax.legend(loc='upper right')

plt.show()

  One day 20 patients come in, and 12 of them get their report within 30 minutes (12 successes). According to the binomial distribution, the probability of this happening is:

P(X = 12) = C(20, 12) · 0.5¹² · 0.5⁸ ≈ 0.12
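  This value can be checked directly with scipy:

from scipy import stats

print(stats.binom.pmf(12, 20, 0.5))  # P(X=12) for X~B(20, 0.5), ≈ 0.1201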

  A hundred days pass, with 20 patients taking blood tests each day and xᵢ of them getting the report within 30 minutes; each day's sample corresponds to a probability:

Pᵢ(X = xᵢ) = C(mᵢ, xᵢ) · p^xᵢ · q^(mᵢ − xᵢ),  i = 1, 2, …, 100

  In the expression above every mᵢ equals 20; we write mᵢ to emphasize that although the sample size is the same each day, the samples themselves are different. If we regard these probabilities as random variables as well, they must follow some distribution, and once we understand that distribution we can answer how likely a given sample is to arise. Since a sample tells us the proportion of successes it contains, this distribution is equivalent to the distribution of "the proportion of successes in the sample". For example, if the proportion of successes in the 10th day's sample is p₁₀ = 45%, our goal is to find how likely p₁₀ is to occur, i.e. P(p₁₀) = ? In other words, we want to know the distribution formed by all the Pᵢ(X = xᵢ).

  

  We use ps to denote the proportion of successes in some particular sample, and use expectation and variance to peek at the distribution of ps. One relation is obvious: if 50% of the population get their report within 30 minutes, we expect to see the same proportion in a sample, and this is precisely what lets us estimate the population from a sample. With the random variable X denoting the number of successes in the sample, ps = X/m:

  We already know X~B(m, p), where m is the sample size and p, given in advance, is each sample's probability of success. The binomial distribution has expectation E[X] = mp and variance Var(X) = mpq, with q = 1 − p, therefore:

E[ps] = E[X/m] = E[X]/m = p

Var(ps) = Var(X/m) = Var(X)/m² = pq/m

  E[ps] tells us that the proportion of successes in the sample agrees with the proportion of successes in the population; Var(ps) tells us that the larger m is, the smaller the variance of ps, the closer the sample proportion is to the population proportion, and the more reliable ps is as an estimate of p. Since the binomial X~B(m, p) can be approximated by X~N(mp, mpq), ps = X/m can likewise be approximated by ps~N(p, pq/m). For this example, pq = 0.25 and pq/m = 0.0125.
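  A quick simulation makes this concrete. The sketch below draws 100 days' worth of success counts (the seed and day count are arbitrary choices) and compares the sample proportions' mean and variance with p and pq/m:

import numpy as np
from scipy import stats

np.random.seed(0)
m, p, days = 20, 0.5, 100
successes = stats.binom.rvs(m, p, size=days)  # success count for each day
ps = successes / m                            # success proportion for each day

print(ps.mean())       # close to p = 0.5
print(ps.var(ddof=1))  # close to pq/m = 0.0125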

  It is worth noting that the distribution of the proportion describes how the sample success proportion (i.e. X/m) varies, while the binomial distribution describes how the number of successes (i.e. X) in a sample of a given size varies. A proportion takes values in [0, 1], so when describing the distribution of ps, the random variable's valid range is [0, 1]. With m fixed, each success proportion corresponds to a particular kind of sample, so we can use the distribution of ps to compute the probability of drawing such a fixed-size sample from the population.

Continuity correction

  For the binomial distribution, holding the number of trials n fixed, the distribution is approximated by a normal distribution with mean np and variance npq. "Approximated" deserves emphasis here because the binomial's random variable is discrete while the normal's is continuous. But what difference does that make?

  First we need to understand how discrete and continuous distribution functions behave. For a continuous distribution, the distribution function is the integral of the density function:

F(x) = P(X ≤ x) = ∫₋∞ˣ f(t) dt

  For an integral, whether the interval from a to b contains the endpoint a or b makes no difference, so for the cumulative probability of a continuous random variable:

P(a < X < b) = P(a ≤ X < b) = P(a < X ≤ b) = P(a ≤ X ≤ b)

  But the equality above does not hold for discrete random variables. Below is the distribution function of a discrete distribution; c.d.f on the vertical axis is short for cumulative distribution function:

  The figure shows that P(X < 1) = 0 while P(X ≤ 1) = 0.5. In other words, for a discrete random variable we often have P(X ≤ a) ≠ P(X < a) (not always; it depends on a: in the figure above, P(X < 1.5) = P(X ≤ 1.5)), whereas a continuous random variable always has P(X ≤ a) = P(X < a).

  

  The normal distribution X~N(μ, σ²) with μ = 50 and σ² = 25 can be used to approximate the binomial distribution X~B(n, p) with n = 100 and p = 0.5. The figure below shows the distribution functions of the two (note that the curves are distribution functions, not density functions):
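  The comparison can be reproduced with a few lines of code (a sketch; the plotting range is my choice):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

n, p = 100, 0.5
mu, sigma = n * p, (n * p * (1 - p)) ** 0.5  # mu=50, sigma=5

xs = np.arange(35, 66)
plt.step(xs, stats.binom.cdf(xs, n, p), where='post', label='X~B(100, 0.5)')  # step-shaped c.d.f
xc = np.arange(35, 66, 0.1)
plt.plot(xc, stats.norm.cdf(xc, mu, sigma), label='X~N(50, 25)')
plt.xlabel('X')
plt.ylabel('c.d.f')
plt.legend(loc='lower right')
plt.show()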

  As the figure shows, because the binomial's discrete random variable can only take integer values, its distribution function is step-shaped, while the normal curve passes through the center of each step, splitting it into two halves: on the left half the discrete c.d.f lies above the continuous one, and on the right half the opposite holds:

  Denote the binomial and normal distribution functions by F_B(x) = P_B(X ≤ x) and F_N(x) = P_N(X ≤ x). For integer x, we have F_B(x) > F_N(x) on the interval [x, x + 0.5) and F_B(x) < F_N(x) on (x + 0.5, x + 1); only at the center point does F_B(x) = F_N(x).

  Now comes the problem. When using the normal approximation, if we approximate P_B(X < x) directly by F_N(x), the result is too large; if we approximate P_B(X ≤ x) by F_N(x), the result is too small:

  Sometimes too large and sometimes too small is not a good idea; what we want is a consistent approximation that always errs on the same side. One solution is a continuity correction of ±0.5 on X: use F_N(x + 0.5) to approximate P_B(X < x) and P_B(X ≤ x) and the result will never err low, or use F_N(x − 0.5) and the result will never err high. This is somewhat like choosing between the left and right rectangle rules when approximating an integral with Riemann sums:
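  A quick numeric check of the two corrected approximations against the exact binomial values, for the X~B(100, 0.5) example above:

from scipy import stats

n, p = 100, 0.5
mu, sigma = n * p, (n * p * (1 - p)) ** 0.5

for x in [45, 50, 55]:
    print('x =', x)
    print('  P(X <  x) =', stats.binom.cdf(x - 1, n, p),
          ' ~ F_N(x-0.5) =', stats.norm.cdf(x - 0.5, mu, sigma))
    print('  P(X <= x) =', stats.binom.cdf(x, n, p),
          ' ~ F_N(x+0.5) =', stats.norm.cdf(x + 0.5, mu, sigma))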

  Recalling the previous section, we obtained the normal approximation ps~N(p, pq/m) of the sample proportion. Since ps = X/m, a ±0.5 correction on X becomes a ±0.5/m correction on ps:

use F_N(ps + 0.5/m) or F_N(ps − 0.5/m)

  With the continuity correction we can now compute the probability of a sample proportion. For example, consider the earlier sample in which 12 of m = 20 patients got their report within 30 minutes:
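  A sketch of the computation, comparing against the exact binomial value:

from scipy import stats

m, p = 20, 0.5
mu, sigma = p, (p * (1 - p) / m) ** 0.5  # ps ~ N(p, pq/m)

approx = stats.norm.cdf(12 / m + 0.5 / m, mu, sigma)  # P(ps <= 12/20) with the correction
exact = stats.binom.cdf(12, m, p)                     # exact binomial value, for comparison
print(approx, exact)  # ≈ 0.868 vs ≈ 0.868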

 


  Source: WeChat official account "我是8位的"

  This article is intended for learning, research and sharing. For reprinting, please contact the author and credit the author and the source; non-commercial use only!

