Chapter 8 (Bayesian Statistical Inference): Bayesian Inference and the Posterior Distribution

This post contains my reading notes for Introduction to Probability.

Statistical Inference

  • Statistical inference is the process of extracting information about an unknown variable or an unknown model from available data. We aim to:
    • (a) Develop an appreciation of the two main approaches (Bayesian and classical), their differences, and their similarities.
    • (b) Present the main categories of inference problems (parameter estimation, hypothesis testing, and significance testing).
    • (c) Discuss the most important methodologies (maximum a posteriori probability (MAP) rule, least mean squares (LMS) estimation, maximum likelihood estimation, regression, likelihood ratio tests, etc.).

Bayesian versus Classical Statistics

  • Their fundamental difference relates to the nature of the unknown models or variables.
    • In the Bayesian view, they are treated as random variables with known prior distributions.
      • In particular, when trying to infer the nature of an unknown model, the Bayesian approach views the model as chosen randomly from a given model class. This is done by introducing a random variable $\Theta$ that characterizes the model, and by postulating a prior probability distribution $p_\Theta(\theta)$. Given the observed data $x$, one can, in principle, use Bayes' rule to derive a posterior probability distribution $p_{\Theta|X}(\theta|x)$. This captures all the information that $x$ can provide about $\theta$.
    • By contrast, the classical approach views the unknown quantity $\theta$ as a constant that happens to be unknown. It then strives to develop an estimate of $\theta$ that has some performance guarantees. This introduces an important conceptual difference from the Bayesian approach: we are not dealing with a single probabilistic model, but rather with multiple candidate probabilistic models, one for each possible value of $\theta$.

Model versus Variable Inference

  • Applications of statistical inference tend to be of two different types.
    • In model inference, the object of study is a real phenomenon or process for which we wish to construct or validate a model on the basis of available data (e.g., do planets follow elliptical trajectories?). Such a model can then be used to make predictions about the future, or to infer some hidden underlying causes.
    • In variable inference, we wish to estimate the value of one or more unknown variables by using some related, possibly noisy information (e.g., what is my current position, given a few GPS readings?).
  • The distinction between model and variable inference is not sharp; for example, by describing a model in terms of a set of variables, we can cast a model inference problem as a variable inference problem.
    • In any case, we will not emphasize this distinction in the sequel, because the same methodological principles apply to both types of inference.
    • In some applications, both model and variable inference issues may arise. For example, we may collect some initial data, use them to build a model, and then use the model to make inferences about the values of certain variables.

Example 8.1. A Noisy Channel.

  • A transmitter sends a sequence of binary messages $s_i \in \{0, 1\}$, and a receiver observes
    $$X_i = a s_i + W_i, \qquad i = 1, \ldots, n,$$
    where the $W_i$ are zero-mean normal random variables that model channel noise, and $a$ is a scalar that represents the channel attenuation.
  • In a model inference setting, $a$ is unknown. The transmitter sends a pilot signal consisting of a sequence of messages $s_1, \ldots, s_n$, whose values are known by the receiver. On the basis of the observations $X_1, \ldots, X_n$, the receiver wishes to estimate the value of $a$, that is, build a model of the channel.
  • Alternatively, in a variable inference setting, $a$ is assumed to be known (possibly because it has already been inferred using a pilot signal, as above). The receiver observes $X_1, \ldots, X_n$, and wishes to infer the values of $s_1, \ldots, s_n$.
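
A small simulation may help make this channel model concrete. The sketch below is my own illustration (not from the text); the attenuation $a = 0.8$ and the noise standard deviation $0.5$ are arbitrary assumed values.

```python
import numpy as np

# Simulate the noisy channel X_i = a * s_i + W_i of Example 8.1.
# The attenuation a and the noise level sigma_w are illustrative assumptions.
rng = np.random.default_rng(0)
a = 0.8                                    # channel attenuation
sigma_w = 0.5                              # std. deviation of the zero-mean noise W_i
s = rng.integers(0, 2, size=10)            # binary pilot messages s_1, ..., s_n
x = a * s + rng.normal(0.0, sigma_w, size=s.size)  # received observations X_i

print("sent:    ", s)
print("received:", np.round(x, 2))
```

In the model inference setting the receiver would use the pairs $(s_i, X_i)$ to estimate $a$; in the variable inference setting it would use the known $a$ and the observed $X_i$ to guess the $s_i$.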

A Rough Classification of Statistical Inference Problems

  • In an estimation problem, a model is fully specified, except for an unknown, possibly multidimensional, parameter $\theta$, which we wish to estimate. This parameter can be viewed either as a random variable (Bayesian approach) or as an unknown constant (classical approach). The usual objective is to arrive at an estimate of $\theta$ that is close to the true value in some sense. For example:
    • (a) In the noisy transmission problem of Example 8.1, use the knowledge of the pilot sequence and the observations to estimate $a$.
    • (b) On the basis of historical stock market data, estimate the mean and variance of the daily movement in the price of a particular stock.
  • In a binary hypothesis testing problem, we start with two hypotheses and use the available data to decide which of the two is true. For example:
    • (a) In the noisy transmission problem of Example 8.1, use the knowledge of $a$ and $X_i$ to decide whether $s_i$ was 0 or 1.
    • (b) Given a noisy picture, decide whether there is a person in the picture or not.
    • (c) Given a set of trials with two alternative medical treatments, decide which of the two treatments is more effective.
  • More generally, in an $m$-ary hypothesis testing problem, there is a finite number $m$ of competing hypotheses. The performance of a particular method is typically judged by the probability that it makes an erroneous decision.

Bayesian Inference and the Posterior Distribution


  • In Bayesian inference, the unknown quantity of interest, which we denote by $\Theta$, is modeled as a random variable or as a finite collection of random variables.

For simplicity, unless the contrary is explicitly stated, we view $\Theta$ as a single random variable.

  • We aim to extract information about $\Theta$, based on observing a collection $X = (X_1, \ldots, X_n)$ of related random variables, called observations, measurements, or an observation vector. For this, we assume that we know the joint distribution of $\Theta$ and $X$. Equivalently, we assume that we know:
    • (a) A prior distribution $p_\Theta$ or $f_\Theta$, depending on whether $\Theta$ is discrete or continuous.
    • (b) A conditional distribution $p_{X|\Theta}$ or $f_{X|\Theta}$, depending on whether $X$ is discrete or continuous.
  • Once a particular value $x$ of $X$ has been observed, a complete answer to the Bayesian inference problem is provided by the posterior distribution $p_{\Theta|X}(\theta|x)$ or $f_{\Theta|X}(\theta|x)$ of $\Theta$. This distribution is determined by the appropriate form of Bayes' rule. It encapsulates everything there is to know about $\Theta$, given the available information, and it is the starting point for further analysis.
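
When no closed-form posterior is available, the Bayes' rule computation can be carried out numerically. The following sketch is my own illustration (not from the text): it discretizes a continuous $\Theta$ on a grid, applies Bayes' rule, and checks the result against the closed-form posterior derived in Example 8.2 below, using an assumed observation $x = 0.25$.

```python
import numpy as np

def grid_posterior(theta_grid, prior, likelihood):
    """Bayes' rule on a uniform grid: posterior is proportional to prior * likelihood,
    normalized so that it integrates to one over theta_grid."""
    unnormalized = prior * likelihood
    dtheta = theta_grid[1] - theta_grid[0]
    return unnormalized / (unnormalized.sum() * dtheta)

theta = np.linspace(1e-3, 1.0, 2000)
prior = np.ones_like(theta)                           # f_Theta(theta) = 1 on [0, 1]
x = 0.25                                              # observed lateness (assumed value)
likelihood = np.where(theta >= x, 1.0 / theta, 0.0)   # f_{X|Theta}(x|theta) = 1/theta for theta >= x
post = grid_posterior(theta, prior, likelihood)

i = np.searchsorted(theta, 0.5)
print(post[i], 1.0 / (theta[i] * abs(np.log(x))))     # grid value vs. closed form 1/(theta |ln x|)
```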

Example 8.2.
Romeo and Juliet start dating, but Juliet will be late on any date by a random amount $X$, uniformly distributed over the interval $[0, \theta]$. The parameter $\theta$ is unknown and is modeled as the value of a random variable $\Theta$, uniformly distributed between zero and one hour. Assuming that Juliet was late by an amount $x$ on their first date, how should Romeo use this information to update the distribution of $\Theta$?

SOLUTION

  • Here the prior PDF is
    $$f_\Theta(\theta) = \begin{cases} 1, & \text{if } 0 \leq \theta \leq 1, \\ 0, & \text{otherwise}, \end{cases}$$
    and the conditional PDF of the observation is
    $$f_{X|\Theta}(x|\theta) = \begin{cases} 1/\theta, & \text{if } 0 \leq x \leq \theta, \\ 0, & \text{otherwise}. \end{cases}$$
  • Using Bayes' rule, and taking into account that $f_\Theta(\theta) f_{X|\Theta}(x|\theta)$ is nonzero only if $0 \leq x \leq \theta \leq 1$, we find that for any $x \in [0, 1]$, the posterior PDF is
    $$f_{\Theta|X}(\theta|x) = \frac{f_\Theta(\theta) f_{X|\Theta}(x|\theta)}{\int_0^1 f_\Theta(\theta') f_{X|\Theta}(x|\theta')\, d\theta'} = \frac{1/\theta}{\int_x^1 \frac{1}{\theta'}\, d\theta'} = \frac{1}{\theta \cdot |\ln x|}, \qquad \text{if } 0 \leq x \leq \theta \leq 1,$$
    and $f_{\Theta|X}(\theta|x) = 0$ if $\theta < x$ or $\theta > 1$.
  • Consider now a variation involving the first $n$ dates. Assume that Juliet is late by random amounts $X_1, \ldots, X_n$, which, given $\Theta = \theta$, are uniformly distributed in the interval $[0, \theta]$ and conditionally independent. Let $X = (X_1, \ldots, X_n)$ and $x = (x_1, \ldots, x_n)$. Similar to the case where $n = 1$, we have
    $$f_{X|\Theta}(x|\theta) = \begin{cases} 1/\theta^n, & \text{if } \overline{x} \leq \theta \text{ and } x_i \geq 0 \text{ for all } i, \\ 0, & \text{otherwise}, \end{cases}$$
    where
    $$\overline{x} = \max\{x_1, \ldots, x_n\}.$$
    The posterior PDF is
    $$f_{\Theta|X}(\theta|x) = \begin{cases} \dfrac{c(\overline{x})}{\theta^n}, & \text{if } \overline{x} \leq \theta \leq 1, \\ 0, & \text{otherwise}, \end{cases}$$
    where $c(\overline{x})$ is a normalizing constant that depends only on $\overline{x}$:
    $$c(\overline{x}) = \frac{1}{\int_{\overline{x}}^1 \frac{1}{(\theta')^n}\, d\theta'}.$$
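
For the $n$-date version, the integral in the denominator of $c(\overline{x})$ evaluates to $-\ln\overline{x}$ for $n = 1$ and to $(\overline{x}^{\,1-n} - 1)/(n - 1)$ for $n \geq 2$. The sketch below is my own illustration with made-up lateness values; it returns the posterior PDF as a function of $\theta$.

```python
import numpy as np

def juliet_posterior(x_obs):
    """Posterior PDF of Theta for Example 8.2 with n observations:
    f(theta | x) = c(x_bar) / theta**n on [x_bar, 1], and 0 elsewhere."""
    x_obs = np.asarray(x_obs, dtype=float)
    n = x_obs.size
    x_bar = x_obs.max()
    # c(x_bar) = 1 / integral_{x_bar}^{1} theta'**(-n) d theta'
    if n == 1:
        integral = -np.log(x_bar)
    else:
        integral = (x_bar ** (1 - n) - 1.0) / (n - 1.0)
    c = 1.0 / integral
    return lambda theta: np.where((theta >= x_bar) & (theta <= 1.0), c / theta ** n, 0.0)

post = juliet_posterior([0.3, 0.7, 0.45])   # hypothetical lateness amounts (hours)
print(post(np.array([0.5, 0.8, 1.0])))      # zero below x_bar = 0.7, positive above it
```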

Example 8.3. Inference of a Common Mean of Normal Random Variables.
We observe a collection $X = (X_1, \ldots, X_n)$ of random variables with an unknown common mean, whose value we wish to infer. We assume that, given the value of the common mean, the $X_i$ are normal and independent, with known variances $\sigma_1^2, \ldots, \sigma_n^2$. In a Bayesian approach to this problem, we model the common mean as a random variable $\Theta$, with a given prior. For concreteness, we assume a normal prior, with known mean $x_0$ and known variance $\sigma_0^2$.

  • Let us note, for future reference, that our model is equivalent to one of the form
    $$X_i = \Theta + W_i, \qquad i = 1, \ldots, n,$$
    where the random variables $\Theta, W_1, \ldots, W_n$ are independent and normal, with known means and variances. In particular, for any value $\theta$,
    $$E[W_i] = E[W_i \mid \Theta = \theta] = 0, \qquad \mathrm{var}(W_i) = \mathrm{var}(X_i \mid \Theta = \theta) = \sigma_i^2.$$
    A model of this type is common in many engineering applications involving several independent measurements of an unknown quantity.
  • According to our assumptions, we have
    $$f_\Theta(\theta) = c_1 \cdot \exp\bigg\{-\frac{(\theta - x_0)^2}{2\sigma_0^2}\bigg\}$$
    and
    $$f_{X|\Theta}(x|\theta) = c_2 \cdot \exp\bigg\{-\frac{(x_1 - \theta)^2}{2\sigma_1^2}\bigg\} \cdots \exp\bigg\{-\frac{(x_n - \theta)^2}{2\sigma_n^2}\bigg\},$$
    where $c_1$ and $c_2$ are normalizing constants that do not depend on $\theta$. We invoke Bayes' rule,
    $$f_{\Theta|X}(\theta|x) = \frac{f_\Theta(\theta) f_{X|\Theta}(x|\theta)}{\int f_\Theta(\theta') f_{X|\Theta}(x|\theta')\, d\theta'},$$
    and note that the numerator term, $f_\Theta(\theta) f_{X|\Theta}(x|\theta)$, is of the form
    $$c_1 c_2 \cdot \exp\bigg\{-\sum_{i=0}^n \frac{(x_i - \theta)^2}{2\sigma_i^2}\bigg\}.$$
    After some algebra, which involves completing the square inside the exponent, we find that the numerator is of the form
    $$d \cdot \exp\bigg\{-\frac{(\theta - m)^2}{2v}\bigg\},$$
    where
    $$m = \frac{\sum_{i=0}^n x_i/\sigma_i^2}{\sum_{i=0}^n 1/\sigma_i^2}, \qquad v = \frac{1}{\sum_{i=0}^n 1/\sigma_i^2},$$
    and $d$ is a constant that depends on the $x_i$ but does not depend on $\theta$. Since the denominator term in Bayes' rule does not depend on $\theta$ either, we conclude that the posterior PDF has the form
    $$f_{\Theta|X}(\theta|x) = a \cdot \exp\bigg\{-\frac{(\theta - m)^2}{2v}\bigg\}$$
    for some normalizing constant $a$, which depends on the $x_i$ but not on $\theta$. We recognize this as a normal PDF, and we conclude that the posterior PDF is normal with mean $m$ and variance $v$.
  • As a special case, suppose that $\sigma_0^2, \sigma_1^2, \ldots, \sigma_n^2$ have a common value $\sigma^2$. Then, the posterior PDF of $\Theta$ is normal with mean and variance
    $$m = \frac{x_0 + \cdots + x_n}{n+1}, \qquad v = \frac{\sigma^2}{n+1},$$
    respectively. In this case, the prior mean $x_0$ acts just as another observation, and contributes equally to determining the posterior mean $m$ of $\Theta$. Notice also that the standard deviation of the posterior PDF of $\Theta$ tends to zero, at the rough rate of $1/\sqrt{n}$, as the number of observations increases.
  • If the variances $\sigma_i^2$ are different, the formula for the posterior mean $m$ is still a weighted average of the $x_i$, but with larger weights on the observations with smaller variances.
  • The preceding example has the remarkable property that the posterior distribution of $\Theta$ is in the same family as the prior distribution, namely, the family of normal distributions. This is appealing for two reasons:
    • (a) The posterior can be characterized in terms of only two numbers, the mean and the variance.
    • (b) The form of the solution opens up the possibility of efficient recursive inference. Suppose that after $X_1, \ldots, X_n$ are observed, an additional observation $X_{n+1}$ is obtained. Instead of solving the inference problem from scratch, we can view $f_{\Theta|X_1,\ldots,X_n}$ as our prior, and use the new observation to obtain the new posterior $f_{\Theta|X_1,\ldots,X_n,X_{n+1}}$. It is then plausible (and can be formally established) that the new posterior of $\Theta$ will have mean
      $$\frac{(m/v) + (x_{n+1}/\sigma_{n+1}^2)}{(1/v) + (1/\sigma_{n+1}^2)}$$
      and variance
      $$\frac{1}{(1/v) + (1/\sigma_{n+1}^2)},$$
      where $m$ and $v$ are the mean and variance of the old posterior $f_{\Theta|X_1,\ldots,X_n}$.
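
These formulas are easy to check numerically. The sketch below is my own illustration (the prior mean, prior variance, observations, and observation variances are made-up values): it computes the posterior mean and variance in batch form, treating the prior as observation $i = 0$, and verifies that the recursive one-observation update gives the same answer.

```python
import numpy as np

def normal_posterior(x0, var0, x, variances):
    """Posterior mean m and variance v for Example 8.3; the prior N(x0, var0)
    enters the formulas exactly like one additional observation (index i = 0)."""
    xs = np.concatenate(([x0], np.asarray(x, dtype=float)))
    vs = np.concatenate(([var0], np.asarray(variances, dtype=float)))
    precision = np.sum(1.0 / vs)               # sum_{i=0}^{n} 1 / sigma_i^2
    return np.sum(xs / vs) / precision, 1.0 / precision

def recursive_update(m, v, x_new, var_new):
    """Fold one more observation into an existing normal posterior with mean m, variance v."""
    new_precision = 1.0 / v + 1.0 / var_new
    return (m / v + x_new / var_new) / new_precision, 1.0 / new_precision

m, v = normal_posterior(x0=0.0, var0=1.0, x=[1.2, 0.8], variances=[0.5, 0.5])
print(recursive_update(m, v, x_new=1.0, var_new=0.5))                # recursive answer
print(normal_posterior(0.0, 1.0, [1.2, 0.8, 1.0], [0.5, 0.5, 0.5]))  # batch answer (identical)
```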

Example 8.4. Beta Priors on the Bias of a Coin.
We wish to estimate the probability of heads, denoted by $\theta$, of a biased coin. We model $\theta$ as the value of a random variable $\Theta$ with known prior PDF $f_\Theta$. We consider $n$ independent tosses and let $X$ be the number of heads observed.

  • From Bayes' rule, the posterior PDF of $\Theta$ has the form, for $\theta \in [0, 1]$,
    $$f_{\Theta|X}(\theta|k) = c\, f_\Theta(\theta)\, p_{X|\Theta}(k|\theta) = d\, f_\Theta(\theta)\, \theta^k (1-\theta)^{n-k},$$
    where $c$ is a normalizing constant (independent of $\theta$), and $d = c \binom{n}{k}$.
  • Suppose now that the prior is a beta density with integer parameters $\alpha > 0$ and $\beta > 0$, of the form
    $$f_\Theta(\theta) = \begin{cases} \dfrac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)}, & \text{if } 0 \leq \theta \leq 1, \\ 0, & \text{otherwise}, \end{cases}$$
    where $B(\alpha, \beta)$ is a normalizing constant, known as the Beta function, given by
    $$B(\alpha, \beta) = \int_0^1 \theta^{\alpha-1}(1-\theta)^{\beta-1}\, d\theta = \frac{(\alpha-1)!\,(\beta-1)!}{(\alpha+\beta-1)!};$$
    the last equality can be obtained by integration by parts, or through a probabilistic argument (see Problem 30). Then, the posterior PDF of $\Theta$ is of the form
    $$f_{\Theta|X}(\theta|k) = \frac{d}{B(\alpha, \beta)}\, \theta^{k+\alpha-1}(1-\theta)^{n-k+\beta-1}, \qquad 0 \leq \theta \leq 1,$$
    and hence is a beta density with parameters
    $$\alpha' = k + \alpha, \qquad \beta' = n - k + \beta.$$
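
In other words, once a beta prior is adopted, the entire inference reduces to adding the observed counts to the prior parameters. A minimal sketch (the prior parameters and toss counts below are made-up values):

```python
def beta_posterior_params(alpha, beta, k, n):
    """Example 8.4 update: a Beta(alpha, beta) prior and k heads in n tosses
    give a Beta(alpha + k, beta + n - k) posterior."""
    return alpha + k, beta + n - k

# Hypothetical numbers: Beta(2, 2) prior, 7 heads observed in 10 tosses.
a_post, b_post = beta_posterior_params(2, 2, k=7, n=10)
print(a_post, b_post)              # posterior is Beta(9, 5)
print(a_post / (a_post + b_post))  # posterior mean of Theta: 9/14, approximately 0.643
```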


Reposted from blog.csdn.net/weixin_42437114/article/details/114039346