Probability and Statistics 20: Criteria for Selecting Estimators

  There are many ways to estimate population parameters. To judge the merits of different estimators, we need the help of some selection criteria.

A mess of symbols

  I feel that texts on parameter estimation always set an artificially high threshold by mixing in all sorts of symbols: now X, now x; now θ, now θ̂(X); plus terms like "population parameter" and "parameter to be estimated". What does each of these mean?

  It is necessary to sort out these symbols.

  We use the height of 18-to-50-year-old males as an example: all 18-to-50-year-old males form the population. In probability and statistics, when we speak of a population we mean a random variable with a specific probability distribution; the population is represented by the random variable X, and X follows some distribution. n denotes the size of the population; supposing there are 300 million such men, n equals 300 million. A statistical survey certainly cannot cover everyone, as the cost would be far too high, so we take only a sample. Sampling itself comes in many forms, such as uniform sampling and rejection sampling, but that is another topic, to be taken up in the data analysis column.

  Now suppose we survey one million eligible men. These men constitute a sample from the population, written X1, X2, …, Xm, where Xi stands for the i-th man in the sample and m is the sample size, here m = 1,000,000. Each sampled man has a specific height, a concrete numerical value, denoted by lowercase x; x10 = 176 cm means the value of the 10th sample point is 176 cm, and at that moment X10 = x10. This parallels the meaning of P(X = x): X denotes the random variable itself, x denotes a particular value.

  Note the distinction: writing X1, X2, …, Xm for a sample stresses its randomness, a theoretical sample not yet drawn, in which each data point is a random variable; writing x1, x2, …, xm stresses that the random variables have taken specific values, a sample already in hand.

  In addition, n need not be enormous. If we survey the average height of one particular class, n is just the number of students in that class, say n = 60. Nor need n be a determined value: for example, how many pounds of beer has the nation consumed since its founding? There is no specific figure; we only know the number is immeasurably large.

  Now suppose the height of 18-to-50-year-old males follows a normal distribution with mean μ and variance σ², X ~ N(μ, σ²). μ and σ² are called "population parameters": it is these two values that determine the specific form of the distribution. The set of all parameters is usually denoted by a capital Θ; a population can have more than one parameter, and here both μ and σ² are population parameters. θ stands for one parameter of the population, which may represent μ or σ², so its meaning is somewhat fluid; it might be easier to understand if we wrote x instead of θ, but x is already taken. Further, X̄ denotes the sample mean and S² the sample variance.

  Now θ is a specific value that we do not know; we must estimate θ from the sample X1, X2, …, Xm, and the estimate is written θ̂(X1, X2, …, Xm), or θ̂ for short. Writing it as a function of X1, X2, …, Xm merely emphasizes that the estimate is computed from the sample; how it is computed is another matter. This is somewhat like y = y(x): the first y is a specific value, determined by x; the second y is a mapping relation, and what that mapping is, is again another matter. Sometimes the m sample points are written X = {X1, X2, …, Xm}, so that θ̂ = θ̂(X); if θ represents μ, then μ̂ = μ̂(X). Here X is no longer the population but the sample drawn from it; whether X means the population or the sample is determined from context.
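  To make the notation concrete, here is a minimal Python sketch (the population values 170 and 8 are invented for illustration): the random variables X1, …, Xm only become concrete lowercase values once the sample is actually drawn, and θ̂ is simply a function of those values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Population: X ~ N(mu, sigma^2). mu and sigma are the unknown population
# parameters theta; the values 170 and 8 here are invented for illustration.
mu, sigma = 170.0, 8.0

# Drawing the sample is the moment X1, ..., Xm (random variables) become
# the concrete values x1, ..., xm.
m = 1_000_000
x = rng.normal(mu, sigma, size=m)

# theta_hat = theta_hat(X1, ..., Xm): an estimate computed from the sample.
mu_hat = x.mean()
print(mu_hat)  # close to 170 for a sample this large
```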

Problems arising from multiple parameters

  The population X is known to have mean μ and variance σ² > 0, but the specific values of both are unknown. As compensation, we have m data points sampled from the population, X1, X2, …, Xm. We now want to estimate the specific values of μ and σ² from these samples.

  Knowing that the population has these two numerical features, expectation and variance, without knowing their values is still far better than starting completely in the dark.

  Suppose that by analyzing the sample with tools such as histograms, or by consulting domain experts directly, we have concluded that the population should follow a normal distribution, X ~ N(μ, σ²). Can we now use various methods to estimate μ and σ²?

  As introduced in Point estimation and continuity correction (Probability and Statistics 17), the sample-moment estimators are:

$$\hat{\mu} = \bar{X} = \frac{1}{m}\sum_{i=1}^{m}X_i, \qquad \hat{\sigma}^2 = S^2 = \frac{1}{m-1}\sum_{i=1}^{m}(X_i-\bar{X})^2$$

  As shown in Maximum likelihood estimation for the one-dimensional normal distribution (Probability 11), maximum likelihood estimation leads to a similar conclusion:

$$\hat{\mu} = \bar{X} = \frac{1}{m}\sum_{i=1}^{m}X_i, \qquad \hat{\sigma}^2 = \frac{1}{m}\sum_{i=1}^{m}(X_i-\bar{X})^2$$

  When m is large, the gap between 1/m and 1/(m-1) is tiny, so the conclusions of moment estimation and maximum likelihood estimation can be regarded as equal. Can we therefore conclude that the two estimation methods reach the same conclusion under any distribution?
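  As a quick numerical aside, here is a minimal sketch (assuming an N(0, 4) population) of just how small the 1/m-versus-1/(m-1) gap is in practice:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 2.0, size=10_000)  # assumed population: N(0, 4)
m = len(x)

ss = ((x - x.mean()) ** 2).sum()
var_mle    = ss / m        # maximum likelihood estimate: coefficient 1/m
var_moment = ss / (m - 1)  # sample variance S^2: coefficient 1/(m-1)

print(var_mle, var_moment)   # nearly identical for m = 10,000
print(var_moment - var_mle)  # the gap vanishes as m grows
```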

  

  Again we want to estimate the population mean and variance, but this time analysis of the sample tells us the population probably follows a uniform distribution, X ~ U[a, b].

  In A second look at the law of large numbers (Probability and Statistics 18) we already met the density function of the uniform distribution, from which its mean and variance follow:

$$E[X] = \frac{a+b}{2}, \qquad D(X) = \frac{(b-a)^2}{12}$$

  To estimate the mean and variance by moments, we set the sample moments equal to the population moments, which yields a system of equations in a and b:

$$\bar{X} = \frac{\hat{a}+\hat{b}}{2}, \qquad \frac{1}{m}\sum_{i=1}^{m}(X_i-\bar{X})^2 = \frac{(\hat{b}-\hat{a})^2}{12}$$

Solving it gives the moment estimators of a and b:

$$\hat{a} = \bar{X} - \sqrt{\frac{3}{m}\sum_{i=1}^{m}(X_i-\bar{X})^2}, \qquad \hat{b} = \bar{X} + \sqrt{\frac{3}{m}\sum_{i=1}^{m}(X_i-\bar{X})^2}$$

  Here we can also see the advantage of moment estimation: it is simple, and whatever distribution the population follows, the sample moments are computed in exactly the same way.

  

  Now look at maximum likelihood estimation for a sample from the uniform distribution.

  Let X_min and X_max denote the minimum and maximum sample values. For X ~ U[a, b], all sample values lie between a and b, i.e. X_min ≥ a and X_max ≤ b. The likelihood function is:

$$L(X; a, b) = \prod_{i=1}^{m}\frac{1}{b-a} = \frac{1}{(b-a)^m}, \qquad a \le X_{\min},\; b \ge X_{\max}$$

  The goal is then to find the values of a and b that maximize L(X; a, b) for the given sample. L grows as b − a shrinks, and the constraints stop it from shrinking past the data, so:

$$\hat{a} = X_{\min} = \min_{1\le i\le m} X_i, \qquad \hat{b} = X_{\max} = \max_{1\le i\le m} X_i$$

  This result differs markedly from the moment estimates.
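  A small simulation contrasting the two estimators (with assumed true values a = 3, b = 7) makes the difference visible:

```python
import numpy as np

rng = np.random.default_rng(2)
a, b = 3.0, 7.0                    # assumed true parameters
x = rng.uniform(a, b, size=1_000)
m = len(x)

# Moment estimators: solve xbar = (a+b)/2 and
# (1/m)*sum((x-xbar)^2) = (b-a)^2/12 for a and b.
xbar = x.mean()
half_width = np.sqrt(3.0 * ((x - xbar) ** 2).mean())
a_mom, b_mom = xbar - half_width, xbar + half_width

# Maximum likelihood estimators: the sample extremes.
a_mle, b_mle = x.min(), x.max()

print(a_mom, b_mom)  # roughly (3, 7); may even fall outside [x.min(), x.max()]
print(a_mle, b_mle)  # always inside the data range, biased slightly inward
```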

 

 

  The problem now is that we are left puzzling over which of the two estimators is better. This is a new problem we have to face.

 

  Let us write θ̂1 and θ̂2 for the estimators of the two schemes. Different estimators deviate from the true value by different errors, and an estimator cannot be judged by a single number; instead we look at a curve: the error of each estimator plotted as a function of θ over its whole range.

 

 

  For some values of θ, θ̂1 may be better, while for others the opposite holds. It is like the exam scores of two students: A does better in Chinese, while B is stronger in math. Can we find a student who excels at everything? In other words, for every parameter of the population we hope to obtain the best possible estimate, so that the distribution estimated from the sample comes close to the population distribution. This is a lovely wish, but as the number of parameters to be estimated grows, finding the all-around best student becomes drastically harder. Hence, to pick out an optimal estimator we must add some extra judging rules, which raises the question of how to evaluate estimators. The three most commonly used criteria are unbiasedness, efficiency, and consistency.

Unbiasedness

  Let X1, X2, …, Xm be a sample from the population, and let θ be a parameter of the population distribution, θ ∈ Θ. From the sample we can construct an estimator of θ:

$$\hat{\theta} = \hat{\theta}(X_1, X_2, \ldots, X_m)$$

  If the mathematical expectation of θ̂ exists, and:

$$E[\hat{\theta}] = \theta$$

  If the above equation holds for every θ in the population, i.e. for all θ ∈ Θ, then θ̂ is called an unbiased estimator of θ.

  What on earth does this mean? Why can a parameter have an expectation?

A mathematical explanation of unbiasedness

  First we need to review the first section and be clear about what these symbols really mean.

  Let the population X have mean μ and variance σ² > 0. Both are parameters of the population distribution, and both are unknown parameters to be estimated. Since μ and σ² are both population parameters, each can naturally be denoted by θ, and the estimator θ̂ then stands for μ̂ or σ̂². In this example, "θ̂ is an unbiased estimator of θ" means:

$$E[\hat{\mu}] = \mu, \qquad E[\hat{\sigma}^2] = \sigma^2$$

  If moment estimation is used, then by A second look at the law of large numbers (Probability and Statistics 18), the expectation and variance of the sample mean are:

$$E[\bar{X}] = \mu, \qquad D(\bar{X}) = \frac{\sigma^2}{m}$$

  This shows that the sample mean is an unbiased estimator of the population mean.

  

  The sample variance is:

$$S^2 = \frac{1}{m-1}\sum_{i=1}^{m}(X_i - \bar{X})^2$$

  The reason for writing Xi rather than xi here is to stress the randomness of the sample; think of it simply as planning to draw a random sample without actually having drawn it yet.

  

  Now let's see what E[S²] is. Expanding the sum first:

$$S^2 = \frac{1}{m-1}\left(\sum_{i=1}^{m}X_i^2 - m\bar{X}^2\right) \;\Rightarrow\; E[S^2] = \frac{1}{m-1}\left(\sum_{i=1}^{m}E[X_i^2] - mE[\bar{X}^2]\right)$$

  By the properties of variance:

$$E[X^2] = D(X) + (E[X])^2$$

  Every random variable in the sample has the same variance and expectation as the population:

$$E[X_i] = \mu,\quad D(X_i) = \sigma^2 \;\Rightarrow\; E[X_i^2] = \sigma^2 + \mu^2$$

  In addition:

$$E[\bar{X}^2] = D(\bar{X}) + (E[\bar{X}])^2 = \frac{\sigma^2}{m} + \mu^2$$

  Finally:

$$E[S^2] = \frac{1}{m-1}\left(m(\sigma^2+\mu^2) - m\left(\frac{\sigma^2}{m}+\mu^2\right)\right) = \frac{(m-1)\sigma^2}{m-1} = \sigma^2$$

  The conclusion above shows that the sample variance S² is also an unbiased estimator of the population variance, which incidentally explains why the coefficient of the sample variance is 1/(m-1): if 1/m were used, the estimator could not be guaranteed unbiased.

  

  This example also shows that whatever distribution the population follows, the sample mean is an unbiased estimator of the population mean, and the sample variance is an unbiased estimator of the population variance.
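  A simulation sketch can verify both claims (using an assumed N(5, 4) population): averaging many independent estimates approximates the expectation of the estimator, and only the 1/(m-1) version centers on σ².

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma2, m, trials = 5.0, 4.0, 10, 100_000

# Draw many independent samples of size m from an assumed N(5, 4) population.
x = rng.normal(mu, np.sqrt(sigma2), size=(trials, m))
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

# Averaging over many trials approximates the expectation of each estimator.
print((ss / (m - 1)).mean())  # ~4.0: E[S^2] = sigma^2, unbiased
print((ss / m).mean())        # ~3.6 = (m-1)/m * sigma^2, systematically low
```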

The meaning of unbiasedness

  The sample X1, X2, …, Xm is random, and therefore the estimator θ̂ computed from it is also random; we have stressed this point many times. Since it is random, a natural conclusion follows: depending on the sample, some estimates will run high and others low. But if the estimator is applied repeatedly many times, then "on average" its deviation is zero.

  In science and engineering, E[θ̂] − θ is called the systematic error of taking θ̂ as an estimate of θ. The practical meaning of an unbiased estimate is the absence of systematic error.

 

  That being so, does it mean an unbiased estimate is always better? Usually yes, but not necessarily: for instance, a biased estimator A whose estimates cluster tightly near the true value can clearly beat an unbiased estimator B whose estimates scatter widely around it.

Different unbiased estimators

  Suppose the population X follows an exponential distribution with probability density:

$$f(x;\theta) = \begin{cases}\dfrac{1}{\theta}e^{-x/\theta}, & x > 0\\[4pt] 0, & \text{otherwise}\end{cases}$$

  where the parameter θ is unknown, and X1, X2, …, Xm is a sample from X. By the properties of the exponential distribution:

$$E[\bar{X}] = E[X] = \theta$$

  Therefore the sample mean X̄ is an unbiased estimator of the parameter θ.

 

  However, there is more than one estimator. Let Z = min(X1, X2, …, Xm); then mZ below is also an unbiased estimator of θ:

  Z has probability density:

$$f_Z(z) = \begin{cases}\dfrac{m}{\theta}e^{-mz/\theta}, & z > 0\\[4pt] 0, & \text{otherwise}\end{cases}$$

  so E[Z] = θ/m and hence E[mZ] = θ.

  Clearly, one unknown parameter can have several different unbiased estimators.

Efficiency

  Why can the same parameter have different unbiased estimators? Picture a scenario: anyone can predict tomorrow's weather; whether the prediction is accurate is another matter. For the same forecasting task, the meteorological bureau is obviously more accurate. Yet judged by unbiasedness alone, the average deviation of an ordinary person and of the weather bureau can both be 0. It is like a shooting match between A and B: A's marksmanship is clearly better than B's, yet unbiasedness tells us their scores are the same, which is plainly absurd.

  In the shooting analogy, whoever lands closer to the bullseye shoots better, and that is precisely the basic logic of efficiency: of two unbiased estimators of the parameter θ, the one that stays closer to θ is better. A natural approach is to compare the absolute difference between each unbiased estimator and θ, but absolute values are awkward to handle, so the squared error is used instead, a common and convenient device. If for every θ ∈ Θ we have:

$$D(\hat{\theta}_1) \le D(\hat{\theta}_2)$$

  with strict inequality for at least one θ,

  then θ̂1 is said to be more efficient than θ̂2.

  To stress it once more: θ̂1 and θ̂2 are both random values, which is exactly why expectation is used to strip away the randomness before comparing which is more efficient. For an unbiased estimator, E[θ̂] = θ, so the mean squared error is just the variance:

$$E[(\hat{\theta} - \theta)^2] = D(\hat{\theta})$$

  Another point worth attention: efficiency also requires the inequality to hold for every θ ∈ Θ. If the population parameter θ contains two variables to be estimated, then only when both estimators of scheme 1 beat those of scheme 2 can scheme 1 be called more efficient than scheme 2.

  For the exponential distribution of the previous section:

$$D(\bar{X}) = \frac{D(X)}{m} = \frac{\theta^2}{m}, \qquad D(mZ) = m^2 D(Z) = m^2 \cdot \frac{\theta^2}{m^2} = \theta^2$$

  so for m > 1, X̄ is more efficient than mZ.
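  A quick simulation (with an assumed θ = 2 and m = 20) illustrates both points of the last two sections: X̄ and mZ are both centered on θ, but their spreads differ enormously.

```python
import numpy as np

rng = np.random.default_rng(4)
theta, m, trials = 2.0, 20, 100_000

# numpy's exponential is parameterized by its mean, which matches theta here.
x = rng.exponential(theta, size=(trials, m))
xbar = x.mean(axis=1)   # estimator 1: the sample mean
mz = m * x.min(axis=1)  # estimator 2: m * Z with Z = min(X1, ..., Xm)

print(xbar.mean(), mz.mean())  # both ~2.0: both estimators are unbiased
print(xbar.var(), mz.var())    # ~theta^2/m = 0.2 vs ~theta^2 = 4.0
```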

Consistency

  Simply put, if the estimator θ̂ converges to the true value of the parameter being estimated as the sample size grows, then θ̂ is called a consistent estimator of θ.

  Consistency is a basic requirement for an estimator. If an estimator is not consistent, then no matter how large the sample is, the parameter can never be pinned down accurately; such estimation amounts to little more than guesswork.
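  A one-loop sketch (again assuming an exponential population with θ = 2) shows consistency at work: the sample mean tightens around the true value as m grows.

```python
import numpy as np

rng = np.random.default_rng(5)
theta = 2.0  # assumed true mean of an exponential population

# The same estimator (the sample mean) closes in on theta as m grows.
for m in (10, 100, 10_000, 1_000_000):
    x = rng.exponential(theta, size=m)
    print(m, x.mean())
```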

Optimization strategies

  With the selection criteria in hand, we can apply some optimization strategies to find the best estimator.

  Unbiasedness imposes a constraint on estimators, and this constraint rules out most of the poor ones. After the unbiasedness filter, the optimum then picked out by efficiency is called the uniformly minimum variance unbiased estimator (UMVUE).

  Although we can find an optimum by whittling down the candidates, the fact to face is that finding an all-around optimum that works in every situation is far from easy. In that case we may as well change strategy and weaken the definition of optimality: any solution satisfying consistency and asymptotic efficiency is considered acceptable.

  Asymptotic efficiency: as the sample size n → ∞, the variance of the estimator converges to the theoretical lower bound (the Cramér-Rao bound).

  Maximum likelihood estimation is the most commonly used scheme under this strategy.
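  As a sketch of what "converging to the theoretical bound" looks like, the following simulation compares the empirical variance of the maximum likelihood estimate X̄ (for the exponential population above, with an assumed θ = 2) against the Cramér-Rao bound θ²/n. For this particular model the sample mean attains the bound at every n, so it illustrates the bound being met rather than merely approached.

```python
import numpy as np

rng = np.random.default_rng(6)
theta, trials = 2.0, 20_000

# For an exponential population with mean theta, the MLE of theta is the
# sample mean, and the Cramer-Rao bound for unbiased estimators is theta^2/n.
for n in (5, 50, 500):
    x = rng.exponential(theta, size=(trials, n))
    mle = x.mean(axis=1)
    print(n, mle.var(), theta**2 / n)  # empirical variance vs the bound
```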

  In minimum variance unbiased estimation we are in effect looking for the estimator with the best overall score, but this approach treats all parameters as equal and assigns them no particular weights. Bayesian estimation takes a different route to this problem.

  Whether in minimum variance unbiased estimation or maximum likelihood estimation, the parameter θ to be estimated is regarded as a fixed value, just as the People's Republic of China was founded on October 1, 1949, a definite date. In Bayesian estimation, θ is instead treated as a random variable, and what we solve for is the distribution of θ, namely the posterior distribution: a narrow posterior means high credibility, a wide one low credibility. It is like asking for the probability that the People's Republic of China was founded on October 1, 1949. The difficulty of Bayesian estimation lies in the complexity of computing the posterior. More on priors and posteriors will unfold in later chapters.

 


  Source: WeChat public account "我是8位的"

  This article is shared for learning and research. For reprints, please contact the author and credit the author and source. Non-commercial use only!



Original post: https://www.cnblogs.com/bigmonkey/p/12346914.html