This article contains reading notes for *Introduction to Probability*.
Bayesian Least Mean Squares Estimation
- In this section, we discuss in more detail the conditional expectation estimator. In particular, we show that it results in the least possible mean squared error, which is why it is called the least mean squares (LMS) estimator.
- We start by considering the simpler problem of estimating $\Theta$ with a constant $\hat\theta$, in the absence of an observation $X$. The estimation error $\hat\theta-\Theta$ is random (because $\Theta$ is random), but the mean squared error $E[(\hat\theta-\Theta)^2]$ is a number that depends on $\hat\theta$, and can be minimized over $\hat\theta$.
$$E[(\hat\theta-\Theta)^2]=\mathrm{var}(\hat\theta-\Theta)+\big(E[\hat\theta-\Theta]\big)^2=\mathrm{var}(\Theta)+\big(E[\Theta]-\hat\theta\big)^2$$
It turns out that the best possible estimate is to set $\hat\theta$ equal to $E[\Theta]$.
- Suppose now that we use an observation $X$ to estimate $\Theta$, so as to minimize the mean squared error. Once we know the value $x$ of $X$, the situation is identical to the one considered earlier, except that we are now in a new "universe," where everything is conditioned on $X=x$. We can therefore adapt our earlier conclusion and assert that the conditional expectation $E[\Theta\mid X=x]$ minimizes the conditional mean squared error $E[(\hat\theta-\Theta)^2\mid X=x]$ over all constants $\hat\theta$.
- Generally, the (unconditional) mean squared estimation error associated with an estimator $g(X)$ is defined as
$$E\big[(\Theta-g(X))^2\big]$$
For any given value $x$ of $X$, $g(x)$ is a number, and therefore,
$$E\big[(\Theta-E[\Theta\mid X=x])^2\mid X=x\big]\leq E\big[(\Theta-g(x))^2\mid X=x\big]$$
Thus,
$$E\big[(\Theta-E[\Theta\mid X])^2\mid X\big]\leq E\big[(\Theta-g(X))^2\mid X\big]$$
which is now an inequality between random variables (functions of $X$). We take expectations of both sides, and use the law of iterated expectations, to conclude that
$$E\big[(\Theta-E[\Theta\mid X])^2\big]\leq E\big[(\Theta-g(X))^2\big]$$
If we view $E[\Theta\mid X]$ as an estimator (a function of $X$), the preceding analysis shows that out of all possible estimators, the mean squared estimation error is minimized when $g(X)=E[\Theta\mid X]$.
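The optimality above can be checked numerically. The following is a small Monte Carlo sketch of our own (not from the book): with $\Theta\sim N(0,1)$ and $X=\Theta+W$, $W\sim N(0,1)$ independent, the LMS estimator is $E[\Theta\mid X]=X/2$, and its empirical mean squared error is compared against two competing estimators.

```python
# Monte Carlo check (our own illustrative setup): the LMS estimator X/2
# should beat both g(X) = X and the constant estimator E[Theta] = 0.
import random

random.seed(0)
n = 200_000
mse_lms = mse_identity = mse_const = 0.0
for _ in range(n):
    theta = random.gauss(0, 1)
    x = theta + random.gauss(0, 1)
    mse_lms += (theta - x / 2) ** 2   # g(X) = E[Theta|X] = X/2
    mse_identity += (theta - x) ** 2  # g(X) = X
    mse_const += theta ** 2           # g(X) = E[Theta] = 0 (ignores X)
mse_lms /= n
mse_identity /= n
mse_const /= n

print(mse_lms, mse_identity, mse_const)
# mse_lms should be the smallest (theoretical value 1/2).
```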
Example 8.11.
- Let $\Theta$ be uniformly distributed over the interval $[4,10]$ and suppose that we observe $\Theta$ with some random error $W$. In particular, we observe the value of the random variable
$$X=\Theta+W$$
where we assume that $W$ is uniformly distributed over the interval $[-1,1]$ and independent of $\Theta$. What is the LMS estimate of $\Theta$?
SOLUTION
- To calculate $E[\Theta\mid X=x]$, we note that $f_\Theta(\theta)=1/6$ if $4\leq\theta\leq 10$, and $f_\Theta(\theta)=0$ otherwise. Conditioned on $\Theta$ being equal to some $\theta$, $X$ is uniformly distributed over the interval $[\theta-1,\theta+1]$. Thus, the joint PDF is given by
$$f_{\Theta,X}(\theta,x)=f_\Theta(\theta)f_{X\mid\Theta}(x\mid\theta)=\frac{1}{6}\cdot\frac{1}{2}=\frac{1}{12}$$
if $4\leq\theta\leq10$ and $\theta-1\leq x\leq\theta+1$, and is zero for all other values of $(\theta,x)$. The parallelogram in the right-hand side of Fig. 8.8 is the set of pairs $(\theta,x)$ for which $f_{\Theta,X}(\theta,x)$ is nonzero.
- Given that $X=x$, the posterior PDF $f_{\Theta\mid X}$ is uniform on the corresponding vertical section of the parallelogram. Thus $E[\Theta\mid X=x]$ is the midpoint of that section, which in this example happens to be a piecewise linear function of $x$.
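The midpoint rule above can be written out explicitly: the posterior of $\Theta$ given $X=x$ is uniform on $[\max(4,x-1),\,\min(10,x+1)]$, so the LMS estimate is the midpoint of that interval. A minimal sketch (the function name is our own):

```python
# LMS estimator for Example 8.11: Theta ~ U[4,10], X = Theta + W, W ~ U[-1,1].
def lms_estimate(x):
    """E[Theta | X = x]: midpoint of the posterior support."""
    lo = max(4.0, x - 1.0)   # lower end of the vertical section
    hi = min(10.0, x + 1.0)  # upper end of the vertical section
    return (lo + hi) / 2.0   # midpoint of a uniform posterior

print(lms_estimate(3.5))   # near left edge: midpoint of [4, 4.5] = 4.25
print(lms_estimate(7.0))   # interior: midpoint of [6, 8] = 7.0
print(lms_estimate(10.5))  # near right edge: midpoint of [9.5, 10] = 9.75
```

The kinks at $x=5$ and $x=9$, where one end of the section stops moving, are what make the estimator piecewise linear.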
Problem 13.
- (a) Let $Y_1,\ldots,Y_n$ be independent identically distributed random variables and let $Y=Y_1+\cdots+Y_n$. Show that
$$E[Y_1\mid Y]=\frac{Y}{n}$$
- (b) Let $\Theta$ and $W$ be independent zero-mean normal random variables, with positive integer variances $k$ and $m$, respectively. Use the result of part (a) to find $E[\Theta\mid\Theta+W]$.
- (c) Repeat part (b) for the case where $\Theta$ and $W$ are independent Poisson random variables with integer means $\lambda$ and $\mu$, respectively.
SOLUTION
- (a) By symmetry, we see that $E[Y_i\mid Y]$ is the same for all $i$. Furthermore,
$$E[Y_1+\cdots+Y_n\mid Y]=E[Y\mid Y]=Y$$
Therefore, $E[Y_1\mid Y]=\frac{Y}{n}$.
- (b) We can think of $\Theta$ and $W$ as sums of independent standard normal random variables:
$$\Theta=\Theta_1+\cdots+\Theta_k,\qquad W=W_1+\cdots+W_m$$
We identify $Y$ with $\Theta+W$ and use the result from part (a) to obtain
$$E[\Theta_i\mid\Theta+W]=\frac{\Theta+W}{k+m}$$
Thus,
$$E[\Theta\mid\Theta+W]=kE[\Theta_i\mid\Theta+W]=\frac{k}{k+m}(\Theta+W)$$
- (c) We recall that the sum of independent Poisson random variables is Poisson. Thus the argument in part (b) goes through, by thinking of $\Theta$ and $W$ as sums of $\lambda$ (respectively, $\mu$) independent Poisson random variables with mean one. We then obtain
$$E[\Theta\mid\Theta+W]=\frac{\lambda}{\lambda+\mu}(\Theta+W)$$
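Part (b) admits a quick Monte Carlo sanity check (our own verification, not from the book). Since $E[\Theta\mid S]=cS$ with $S=\Theta+W$, taking expectations of $\Theta S$ gives $E[\Theta S]=cE[S^2]$, so the slope $c=k/(k+m)$ equals the ratio $E[\Theta S]/E[S^2]$, which we estimate from samples:

```python
# Empirical check of E[Theta | Theta+W] = (k/(k+m)) * (Theta+W) for
# Theta ~ N(0, k), W ~ N(0, m) independent (here k = 2, m = 3).
import random

random.seed(1)
k, m = 2, 3
n = 200_000
sum_ts = sum_ss = 0.0
for _ in range(n):
    theta = random.gauss(0, k ** 0.5)
    s = theta + random.gauss(0, m ** 0.5)
    sum_ts += theta * s  # accumulates n * E[Theta * S] (both zero-mean)
    sum_ss += s * s      # accumulates n * E[S^2]
slope = sum_ts / sum_ss
print(slope)             # should be close to k/(k+m) = 0.4
```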
Some Properties of the Estimation Error
- Let us use the notation
$$\hat\Theta=E[\Theta\mid X],\qquad \tilde\Theta=\hat\Theta-\Theta$$
for the LMS estimator and the associated estimation error, respectively. The random variables $\hat\Theta$ and $\tilde\Theta$ have a number of useful properties, which were derived in Section 4.3.
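Two of these properties, that the error $\tilde\Theta$ has zero mean and is uncorrelated with the estimator $\hat\Theta$, can be illustrated empirically. The setup below is our own (the same $\Theta\sim N(0,1)$, $X=\Theta+N(0,1)$ model used earlier, for which $\hat\Theta=X/2$):

```python
# Empirical check: the LMS error has zero mean and zero covariance with
# the LMS estimator (our own illustrative Gaussian setup).
import random

random.seed(2)
n = 200_000
sum_err = sum_hat = sum_prod = 0.0
for _ in range(n):
    theta = random.gauss(0, 1)
    x = theta + random.gauss(0, 1)
    hat = x / 2              # LMS estimator E[Theta|X]
    err = hat - theta        # estimation error
    sum_err += err
    sum_hat += hat
    sum_prod += hat * err    # for the covariance estimate
mean_err = sum_err / n
cov_hat_err = sum_prod / n - (sum_hat / n) * mean_err

print(mean_err, cov_hat_err)  # both should be close to 0
```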
Example 8.14.
- Let us say that the observation $X$ is *uninformative* if the mean squared estimation error $E[\tilde\Theta^2]=\mathrm{var}(\tilde\Theta)$ is the same as $\mathrm{var}(\Theta)$, the unconditional variance of $\Theta$. When is this the case?
- Using the formula
$$\mathrm{var}(\Theta)=\mathrm{var}(\tilde\Theta)+\mathrm{var}(\hat\Theta)$$
we see that $X$ is uninformative if and only if $\mathrm{var}(\hat\Theta)=0$. The variance of a random variable is zero if and only if that random variable is a constant, equal to its mean. We conclude that $X$ is uninformative if and only if $\hat\Theta=E[\Theta]$, for every value of $X$.
- If $\Theta$ and $X$ are independent, we have $\hat\Theta=E[\Theta\mid X=x]=E[\Theta]$ for all $x$, and $X$ is indeed uninformative, which is quite intuitive. The converse, however, is not true: it is possible for $E[\Theta\mid X=x]$ to always equal the constant $E[\Theta]$ without $\Theta$ and $X$ being independent. (In fact, if $E[\Theta\mid X=x]=E[\Theta]$ for all $x$, it follows that $\Theta$ and $X$ are uncorrelated.)
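A concrete instance of the converse failing (our own example, not from the book): take $\Theta$ uniform on $\{-1,0,1\}$ and $X=\Theta^2$. Then $E[\Theta\mid X=x]=0=E[\Theta]$ for both $x=0$ and $x=1$, so $X$ is uninformative, yet $X$ is a deterministic function of $\Theta$ and the two are clearly dependent:

```python
# Exact computation with rational arithmetic: E[Theta | X = x] is the
# constant E[Theta] = 0 for every x, yet Theta and X are dependent.
from fractions import Fraction

# Joint PMF of (Theta, X) with Theta uniform on {-1, 0, 1} and X = Theta**2.
pmf = {(t, t * t): Fraction(1, 3) for t in (-1, 0, 1)}

cond_means = {}
for x in (0, 1):
    px = sum(p for (t, xx), p in pmf.items() if xx == x)  # P(X = x)
    cond_means[x] = sum(t * p for (t, xx), p in pmf.items() if xx == x) / px
print(cond_means)  # E[Theta | X = x] is 0 for both x = 0 and x = 1

# Dependence: P(Theta=1, X=0) = 0, but P(Theta=1) * P(X=0) = 1/3 * 1/3 = 1/9.
print(pmf.get((1, 0), Fraction(0)))
```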
The Case of Multiple Observations and Multiple Parameters
- The preceding argument and its conclusions apply even if $X$ is a vector of random variables, $X=(X_1,\ldots,X_n)$. Thus, the mean squared estimation error is minimized if we use $E[\Theta\mid X_1,\ldots,X_n]$ as our estimator:
$$E\big[(\Theta-E[\Theta\mid X_1,\ldots,X_n])^2\big]\leq E\big[(\Theta-g(X_1,\ldots,X_n))^2\big]$$
- This provides a complete solution to the general problem of LMS estimation, but it is often difficult to implement, for the following reasons:
- (a) In order to compute the conditional expectation $E[\Theta\mid X_1,\ldots,X_n]$, we need a complete probabilistic model, that is, the joint PDF $f_{\Theta,X_1,\ldots,X_n}$.
- (b) Even if this joint PDF is available, $E[\Theta\mid X_1,\ldots,X_n]$ can be a very complicated function of $X_1,\ldots,X_n$.
- As a consequence, practitioners often resort to approximations of the conditional expectation or focus on estimators that are not optimal but are simple and easy to implement.
- The most common approach, discussed in the next section, involves a restriction to linear estimators.
- Finally, let us consider the case where we want to estimate multiple parameters $\Theta_1,\ldots,\Theta_m$. It is then natural to consider the criterion
$$E[(\Theta_1-\hat\Theta_1)^2]+\cdots+E[(\Theta_m-\hat\Theta_m)^2]$$
and minimize it over all estimators $\hat\Theta_1,\ldots,\hat\Theta_m$. But this is equivalent to finding, for each $i$, an estimator $\hat\Theta_i$ that minimizes $E[(\Theta_i-\hat\Theta_i)^2]$, so that we are essentially dealing with $m$ decoupled estimation problems, one for each unknown parameter $\Theta_i$, yielding $\hat\Theta_i=E[\Theta_i\mid X_1,\ldots,X_n]$ for all $i$.
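The decoupling can be seen numerically in a small example of our own: with $\Theta_1,\Theta_2$ i.i.d. $N(0,1)$ and a single observation $X=\Theta_1+\Theta_2$, the component LMS estimators are $E[\Theta_i\mid X]=X/2$ (by Problem 13(a)), and minimizing each term separately also minimizes the summed criterion:

```python
# Monte Carlo comparison of the decoupled LMS pair (X/2, X/2) against an
# arbitrary alternative pair (X, 0) under the summed MSE criterion.
import random

random.seed(3)
n = 200_000
mse_lms = mse_alt = 0.0
for _ in range(n):
    t1, t2 = random.gauss(0, 1), random.gauss(0, 1)
    x = t1 + t2
    mse_lms += (t1 - x / 2) ** 2 + (t2 - x / 2) ** 2  # decoupled LMS pair
    mse_alt += (t1 - x) ** 2 + (t2 - 0) ** 2          # alternative pair
mse_lms /= n
mse_alt /= n

print(mse_lms, mse_alt)  # the LMS pair should have the smaller summed error
```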