Chapter 8 (Bayesian Statistical Inference): Bayesian Least Mean Squares Estimation

This post is a set of reading notes for *Introduction to Probability*.

Bayesian Least Mean Squares Estimation

  • In this section, we discuss the conditional expectation estimator in more detail. In particular, we show that it achieves the least possible mean squared error, which is why it is called the least mean squares (LMS) estimator.

  • We start by considering the simpler problem of estimating $\Theta$ with a constant $\hat\theta$, in the absence of an observation $X$. The estimation error $\hat\theta-\Theta$ is random (because $\Theta$ is random), but the mean squared error $E[(\hat\theta-\Theta)^2]$ is a number that depends on $\hat\theta$, and can be minimized over $\hat\theta$:
    $$E[(\hat\theta-\Theta)^2]=\mathrm{var}(\hat\theta-\Theta)+\big(E[\hat\theta-\Theta]\big)^2=\mathrm{var}(\Theta)+\big(E[\Theta]-\hat\theta\big)^2$$
    It turns out that the best possible choice is to set $\hat\theta$ equal to $E[\Theta]$.
  • Suppose now that we use an observation $X$ to estimate $\Theta$, so as to minimize the mean squared error. Once we know the value $x$ of $X$, the situation is identical to the one considered earlier, except that we are now in a new “universe,” where everything is conditioned on $X = x$. We can therefore adapt our earlier conclusion and assert that the conditional expectation $E[\Theta\mid X=x]$ minimizes the conditional mean squared error $E[(\hat\theta-\Theta)^2\mid X=x]$ over all constants $\hat\theta$.
  • Generally, the (unconditional) mean squared estimation error associated with an estimator $g(X)$ is defined as
    $$E[(\Theta-g(X))^2]$$
    For any given value $x$ of $X$, $g(x)$ is a number, and therefore
    $$E\big[(\Theta-E[\Theta\mid X=x])^2\mid X=x\big]\leq E\big[(\Theta-g(x))^2\mid X=x\big]$$
    Thus,
    $$E\big[(\Theta-E[\Theta\mid X])^2\mid X\big]\leq E\big[(\Theta-g(X))^2\mid X\big]$$
    which is now an inequality between random variables (functions of $X$). Taking expectations of both sides and using the law of iterated expectations, we conclude that
    $$E\big[(\Theta-E[\Theta\mid X])^2\big]\leq E\big[(\Theta-g(X))^2\big]$$
    If we view $E[\Theta\mid X]$ as an estimator (a function of $X$), the preceding analysis shows that, out of all possible estimators, the mean squared estimation error is minimized when $g(X)=E[\Theta\mid X]$.
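As a quick numerical illustration (my own, not from the book), here is a minimal Monte Carlo sketch. It assumes the simple model $\Theta\sim N(0,1)$, $W\sim N(0,1)$ independent, $X=\Theta+W$, for which the conditional expectation is known to be $E[\Theta\mid X]=X/2$; this estimator should beat both a competing estimator $g(X)=X$ and the best constant estimate $E[\Theta]=0$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

theta = rng.standard_normal(n)   # prior: Theta ~ N(0, 1)
w = rng.standard_normal(n)       # noise: W ~ N(0, 1), independent of Theta
x = theta + w                    # observation X = Theta + W

# For this model, E[Theta | X] = X / 2.
mse_lms = np.mean((theta - x / 2) ** 2)   # ~0.5: the minimum
mse_other = np.mean((theta - x) ** 2)     # ~1.0: competing estimator g(X) = X
mse_const = np.mean(theta ** 2)           # ~1.0 = var(Theta): best constant estimate E[Theta] = 0

print(mse_lms, mse_other, mse_const)
```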

Example 8.11.
Let $\Theta$ be uniformly distributed over the interval $[4, 10]$ and suppose that we observe $\Theta$ with some random error $W$. In particular, we observe the value of the random variable
$$X=\Theta+W$$
where we assume that $W$ is uniformly distributed over the interval $[-1, 1]$ and independent of $\Theta$. What is the LMS estimate of $\Theta$?

SOLUTION

  • To calculate $E[\Theta\mid X = x]$, we note that $f_\Theta(\theta) = 1/6$ if $4\leq\theta\leq 10$, and $f_\Theta(\theta) = 0$ otherwise. Conditioned on $\Theta$ being equal to some $\theta$, $X$ is uniformly distributed over the interval $[\theta-1, \theta+1]$. Thus, the joint PDF is given by
    $$f_{\Theta,X}(\theta,x)=f_\Theta(\theta)f_{X\mid\Theta}(x\mid\theta)=\frac{1}{6}\cdot\frac{1}{2}=\frac{1}{12}$$
    if $4\leq\theta\leq10$ and $\theta-1\leq x\leq\theta +1$, and is zero for all other values of $(\theta, x)$. The parallelogram in the right-hand side of Fig. 8.8 is the set of pairs $(\theta, x)$ for which $f_{\Theta,X}(\theta, x)$ is nonzero.
    [Figure 8.8: the parallelogram of pairs $(\theta, x)$ on which the joint PDF $f_{\Theta,X}(\theta, x)$ is nonzero, shown on the right-hand side.]
  • Given that $X = x$, the posterior PDF $f_{\Theta\mid X}$ is uniform on the corresponding vertical section of the parallelogram. Thus $E[\Theta\mid X = x]$ is the midpoint of that section, which in this example happens to be a piecewise linear function of $x$: the section runs from $\max(4, x-1)$ to $\min(10, x+1)$, so the estimate equals $(x+5)/2$ for $3\leq x\leq 5$, $x$ for $5\leq x\leq 9$, and $(x+9)/2$ for $9\leq x\leq 11$.
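The piecewise-linear estimator is easy to check numerically. The sketch below is my own (the helper `lms_estimate` and the test points are not from the book): it implements the midpoint formula and cross-checks it by averaging simulated values of $\Theta$ whose observation falls near a chosen $x$.

```python
import numpy as np

def lms_estimate(x):
    """Midpoint of the posterior interval [max(4, x-1), min(10, x+1)]."""
    lo = max(4.0, x - 1.0)
    hi = min(10.0, x + 1.0)
    return (lo + hi) / 2.0        # piecewise linear in x on [3, 11]

# Monte Carlo cross-check: average Theta over samples whose X lands close to x0.
rng = np.random.default_rng(1)
n = 2_000_000
theta = rng.uniform(4.0, 10.0, n)        # Theta ~ Uniform[4, 10]
x = theta + rng.uniform(-1.0, 1.0, n)    # X = Theta + W, W ~ Uniform[-1, 1]

for x0 in (3.5, 7.0, 10.5):
    near = np.abs(x - x0) < 0.02
    print(x0, lms_estimate(x0), theta[near].mean())
```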

Problem 13.

  • (a) Let $Y_1, \dots , Y_n$ be independent identically distributed random variables and let $Y =Y_1+\dots+Y_n$. Show that
    $$E[Y_1\mid Y]=\frac{Y}{n}$$
  • (b) Let $\Theta$ and $W$ be independent zero-mean normal random variables, with positive integer variances $k$ and $m$, respectively. Use the result of part (a) to find $E[\Theta \mid\Theta + W]$.
  • (c) Repeat part (b) for the case where $\Theta$ and $W$ are independent Poisson random variables with integer means $\lambda$ and $\mu$, respectively.

SOLUTION

  • (a) By symmetry, we see that $E[Y_i\mid Y]$ is the same for all $i$. Furthermore,
    $$E[Y_1 +\dots+ Y_n \mid Y] = E[Y \mid Y] = Y$$
    Therefore, $E[Y_1\mid Y]=\frac{Y}{n}$.
  • (b) We can think of $\Theta$ and $W$ as sums of $k$ and $m$ independent standard normal random variables, respectively:
    $$\Theta=\Theta_1+\dots+\Theta_k,\qquad W=W_1+\dots+W_m$$
    We identify $Y$ with $\Theta + W$ and use the result from part (a) to obtain
    $$E[\Theta_i\mid\Theta+W]=\frac{\Theta+W}{k+m}$$
    Thus,
    $$E[\Theta\mid\Theta+W]=kE[\Theta_i\mid\Theta+W]=\frac{k}{k+m}(\Theta+W)$$
  • (c) We recall that the sum of independent Poisson random variables is Poisson. Thus the argument in part (b) goes through, by thinking of $\Theta$ and $W$ as sums of $\lambda$ (respectively, $\mu$) independent Poisson random variables with mean one. We then obtain
    $$E[\Theta\mid\Theta+W]=\frac{\lambda}{\lambda+\mu}(\Theta+W)$$
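Both formulas are easy to sanity-check by simulation. The sketch below is my own; the parameter values $k=4$, $m=9$, $\lambda=3$, $\mu=5$ are arbitrary choices. It compares the closed-form conditional expectation with a Monte Carlo average of $\Theta$ over samples whose sum lands at, or near, a chosen value.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000_000

# (b) Normal case: Theta ~ N(0, k), W ~ N(0, m); E[Theta | Theta + W = y] = k/(k+m) * y
k, m = 4, 9
theta = rng.normal(0.0, np.sqrt(k), n)
w = rng.normal(0.0, np.sqrt(m), n)
y = theta + w
for y0 in (-3.0, 2.0, 5.0):
    near = np.abs(y - y0) < 0.05
    print(y0, k / (k + m) * y0, theta[near].mean())

# (c) Poisson case: E[Theta | Theta + W = s] = lambda/(lambda+mu) * s
lam, mu = 3, 5
theta_p = rng.poisson(lam, n)
w_p = rng.poisson(mu, n)
s = theta_p + w_p
for s0 in (4, 8, 12):
    print(s0, lam / (lam + mu) * s0, theta_p[s == s0].mean())
```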

Some Properties of the Estimation Error

  • Let us use the notation
    $$\hat\Theta=E[\Theta\mid X],\qquad \tilde\Theta=\hat\Theta-\Theta$$
    for the LMS estimator and the associated estimation error, respectively. The random variables $\hat\Theta$ and $\tilde\Theta$ have a number of useful properties, which were derived in Section 4.3.
    [Summary box: key properties of the estimation error — $E[\tilde\Theta\mid X] = 0$, $\mathrm{cov}(\hat\Theta,\tilde\Theta)=0$, and $\mathrm{var}(\Theta)=\mathrm{var}(\hat\Theta)+\mathrm{var}(\tilde\Theta)$.]
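These properties are easy to confirm numerically. Here is a minimal sketch (my own), again using the model $\Theta\sim N(0,1)$, $X=\Theta+W$ with $W\sim N(0,1)$ independent, for which $\hat\Theta = E[\Theta\mid X]=X/2$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
theta = rng.standard_normal(n)
x = theta + rng.standard_normal(n)

theta_hat = x / 2                 # E[Theta | X] for this model
theta_tilde = theta_hat - theta   # estimation error

print(theta_tilde.mean())                                # ~0: the error has zero mean
print(np.cov(theta_hat, theta_tilde)[0, 1])              # ~0: error uncorrelated with the estimate
print(theta.var(), theta_hat.var() + theta_tilde.var())  # var(Theta) = var(hat) + var(tilde)
```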

Example 8.14.

  • Let us say that the observation $X$ is *uninformative* if the mean squared estimation error $E[\tilde\Theta^2]= \mathrm{var}(\tilde\Theta)$ is the same as $\mathrm{var}(\Theta)$, the unconditional variance of $\Theta$. When is this the case?
  • Using the formula
    $$\mathrm{var}(\Theta) = \mathrm{var}(\tilde\Theta) + \mathrm{var}(\hat\Theta)$$
    we see that $X$ is uninformative if and only if $\mathrm{var}(\hat\Theta)=0$. The variance of a random variable is zero if and only if that random variable is a constant, equal to its mean. We conclude that $X$ is uninformative if and only if the estimate is $\hat\Theta = E[\Theta]$, for every value of $X$.
  • If $\Theta$ and $X$ are independent, we have $E[\Theta \mid X = x] = E[\Theta]$ for all $x$, and $X$ is indeed uninformative, which is quite intuitive. The converse, however, is not true: it is possible for $E[\Theta \mid X = x]$ to be always equal to the constant $E[\Theta]$, without $\Theta$ and $X$ being independent. (In fact, if $E[\Theta \mid X = x]=E[\Theta]$ for all $x$, it can be shown that $\Theta$ and $X$ are uncorrelated.) An example of a dependent but uninformative observation is sketched below.
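A sketch of such a case (my own construction, not from the book): take $X\sim N(0,1)$ and set $\Theta = S X$, where $S=\pm1$ with equal probability, independent of $X$. Then $E[\Theta\mid X=x]=0=E[\Theta]$ for every $x$, so $X$ is uninformative for LMS estimation, even though $|\Theta|=|X|$ makes the two variables clearly dependent.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000
x = rng.standard_normal(n)
s = rng.choice([-1.0, 1.0], n)   # random sign, independent of X
theta = s * x                    # Theta and X are dependent: |Theta| = |X|

# The LMS estimator given X is the constant E[Theta] = 0, so X is uninformative:
print(np.mean(theta ** 2), theta.var())     # MSE of the constant estimate equals var(Theta)
print(np.corrcoef(theta, x)[0, 1])          # ~0: Theta and X are uncorrelated ...
print(np.corrcoef(theta**2, x**2)[0, 1])    # ~1: ... but certainly not independent
```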

The Case of Multiple Observations and Multiple Parameters

  • The preceding argument and its conclusions apply even if $X$ is a vector of random variables, $X = (X_1, \dots , X_n)$. Thus, the mean squared estimation error is minimized if we use $E[\Theta\mid X_1, \dots , X_n]$ as our estimator:
    $$E\big[(\Theta-E[\Theta\mid X_1, \dots , X_n])^2\big]\leq E\big[(\Theta-g(X_1, \dots , X_n))^2\big]$$
  • This provides a complete solution to the general problem of LMS estimation, but is often difficult to implement, for the following reasons:
    • (a) In order to compute the conditional expectation $E[\Theta\mid X_1,\dots,X_n]$, we need a complete probabilistic model, that is, the joint PDF $f_{\Theta,X_1, \dots ,X_n}$.
    • (b) Even if this joint PDF is available, $E[\Theta\mid X_1, \dots , X_n]$ can be a very complicated function of $X_1, \dots , X_n$.
  • As a consequence, practitioners often resort to approximations of the conditional expectation or focus on estimators that are not optimal but are simple and easy to implement.
    • The most common approach, discussed in the next section, involves a restriction to linear estimators.

  • Finally, let us consider the case where we want to estimate multiple parameters $\Theta_1, \dots , \Theta_m$. It is then natural to consider the criterion
    $$E[(\Theta_1-\hat\Theta_1)^2]+\dots+E[(\Theta_m-\hat\Theta_m)^2]$$
    and minimize it over all estimators $\hat\Theta_1, \dots , \hat\Theta_m$. But this is equivalent to finding, for each $i$, an estimator $\hat\Theta_i$ that minimizes $E[(\Theta_i-\hat\Theta_i)^2]$, so that we are essentially dealing with $m$ decoupled estimation problems, one for each unknown parameter $\Theta_i$, yielding $\hat\Theta_i=E[\Theta_i\mid X_1,\dots,X_n]$ for all $i$. A numerical illustration of this decoupling follows below.
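As a small illustration (my own, not from the book), suppose two parameters $\Theta_1,\Theta_2$ and a noise term $W$ are i.i.d. standard normal and we observe only their sum $X=\Theta_1+\Theta_2+W$. By the symmetry argument of Problem 13(a), $E[\Theta_i\mid X]=X/3$ for each $i$; using this pair of componentwise LMS estimators minimizes the summed criterion, and any other pair, e.g. $(X/2, X/2)$, does worse.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000
theta1 = rng.standard_normal(n)
theta2 = rng.standard_normal(n)
w = rng.standard_normal(n)
x = theta1 + theta2 + w    # one shared observation of both parameters

# Componentwise LMS estimators: E[Theta_i | X] = X / 3 (by symmetry, as in Problem 13(a)).
sum_mse_lms = np.mean((theta1 - x / 3) ** 2) + np.mean((theta2 - x / 3) ** 2)   # ~4/3
# A competing pair of estimators, (X/2, X/2), gives a larger summed error:
sum_mse_alt = np.mean((theta1 - x / 2) ** 2) + np.mean((theta2 - x / 2) ** 2)   # ~3/2

print(sum_mse_lms, sum_mse_alt)
```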

Reposted from blog.csdn.net/weixin_42437114/article/details/114150487