Chapter 8 (Bayesian Statistical Inference): Bayesian Linear LMS Estimation

This post is part of my reading notes on *Introduction to Probability*.

Bayesian Linear LMS Estimation

  • In this section, we derive an estimator that minimizes the mean squared error within a restricted class of estimators: those that are linear functions of the observations. While this estimator may result in higher mean squared error, it has a significant practical advantage: it requires simple calculations. It is thus a useful alternative to the conditional expectation/LMS estimator in cases where the latter is hard to compute.

  • A linear estimator of a random variable $\Theta$, based on observations $X_1, \ldots, X_n$, has the form
    $$\hat\Theta=a_1X_1+\cdots+a_nX_n+b$$
    Given a particular choice of the scalars $a_1, \ldots, a_n, b$, the corresponding mean squared error is
    $$E[(\Theta-a_1X_1-\cdots-a_nX_n-b)^2]$$
    The linear LMS estimator chooses $a_1, \ldots, a_n, b$ to minimize the above expression.
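It can be shown that the minimizing coefficients solve a set of "normal equations": assuming the covariance matrix of $(X_1,\ldots,X_n)$ is invertible, the optimal $a=(a_1,\ldots,a_n)$ satisfies $Cov(X)\,a=\big(cov(\Theta,X_1),\ldots,cov(\Theta,X_n)\big)$ and $b=E[\Theta]-a^TE[X]$. A minimal NumPy sketch of this general solution (the function name and interface are my own):

```python
import numpy as np

def linear_lms(mean_theta, mean_x, cov_x, cov_theta_x):
    """Coefficients (a, b) of the linear LMS estimator a'X + b of Theta.

    mean_theta  : scalar E[Theta]
    mean_x      : (n,) array of E[X_i]
    cov_x       : (n, n) covariance matrix of the observations
    cov_theta_x : (n,) array of cov(Theta, X_i)
    """
    a = np.linalg.solve(np.asarray(cov_x, float), np.asarray(cov_theta_x, float))
    b = mean_theta - a @ np.asarray(mean_x, float)   # b = E[Theta] - a'E[X]
    return a, b

# Single-observation sanity check: a = cov(Theta, X) / var(X), b = E[Theta] - a E[X].
print(linear_lms(1.0, [2.0], [[4.0]], [3.0]))        # -> (array([0.75]), -0.5)
```

With a single observation this reduces to the formula derived in the next subsection.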

Linear LMS Estimation Based on a Single Observation

  • We are interested in finding $a$ and $b$ that minimize the mean squared estimation error $E[(\Theta-aX-b)^2]$ associated with the linear estimator $aX+b$ of $\Theta$.
  • Suppose that $a$ has already been chosen. How should we choose $b$? This is the same as choosing a constant $b$ to estimate the random variable $\Theta - aX$:
    $$\begin{aligned}E[(\Theta -aX-b)^2]&=var(\Theta -aX-b)+(E[\Theta -aX-b])^2 \\&=var(\Theta -aX)+(E[\Theta -aX]-b)^2\end{aligned}$$
    The best choice is
    $$b=E[\Theta -aX]=E[\Theta]-aE[X]$$
  • With this choice of $b$, it remains to minimize, with respect to $a$, the expression
    $$\begin{aligned}E[(\Theta -aX-E[\Theta]+aE[X])^2]&=var(\Theta -aX)\\ &=\sigma_\Theta^2+a^2\sigma_X^2-2a\cdot cov(\Theta,X) \end{aligned}$$
    To minimize this quadratic function of $a$, we set its derivative to zero and solve for $a$. This yields
    $$a=\frac{cov(\Theta,X)}{\sigma_X^2}=\frac{\rho\sigma_\Theta\sigma_X}{\sigma_X^2}=\rho\frac{\sigma_\Theta}{\sigma_X}$$
  • With this choice of $a$, the linear LMS estimator $\hat\Theta$ of $\Theta$ based on $X$ is
    $$\hat\Theta=a(X-E[X])+E[\Theta]=\rho\frac{\sigma_\Theta}{\sigma_X}(X-E[X])+E[\Theta]$$
    The mean squared estimation error of the resulting linear estimator $\hat\Theta$ is given by
    $$E[(\Theta-\hat\Theta)^2]=\sigma_\Theta^2+a^2\sigma_X^2-2a\cdot cov(\Theta,X)=(1-\rho^2)\sigma_\Theta^2$$

  • The formula for the linear LMS estimator only involves the means, variances, and covariance of $\Theta$ and $X$.
  • Furthermore, it has an intuitive interpretation. Suppose, for concreteness, that the correlation coefficient $\rho$ is positive. The estimator starts with the baseline estimate $E[\Theta]$ for $\Theta$, which it then adjusts by taking into account the value of $X - E[X]$. For example, when $X$ is larger than its mean, the positive correlation between $X$ and $\Theta$ suggests that $\Theta$ is expected to be larger than its mean, so the resulting estimate is set to a value larger than $E[\Theta]$. The value of $\rho$ also determines the quality of the estimate: by the formula above, the mean squared error $(1-\rho^2)\sigma_\Theta^2$ shrinks as $|\rho|$ approaches 1.
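A small numerical sketch of these formulas (the moments below are hypothetical, chosen only to exercise the single-observation case):

```python
import numpy as np

# Hypothetical first- and second-order moments of (Theta, X).
mean_T, mean_X = 1.0, 2.0
var_T, var_X, cov_TX = 4.0, 9.0, 3.0

rho = cov_TX / np.sqrt(var_T * var_X)   # correlation coefficient, here 0.5
a = cov_TX / var_X                      # a = cov(Theta, X) / sigma_X^2
b = mean_T - a * mean_X                 # b = E[Theta] - a E[X]

x = 5.0                                 # an observed value of X
theta_hat = a * x + b                   # linear LMS estimate, here 2.0
mse = (1 - rho**2) * var_T              # mean squared error (1 - rho^2) sigma_Theta^2, here 3.0
print(theta_hat, mse)
```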

Properties of LMS estimation.

  • Let $\Theta$ and $X$ be two random variables with positive variances. Let $\hat\Theta_L$ be the linear LMS estimator of $\Theta$ based on $X$, and let $\tilde\Theta_L = \hat\Theta_L-\Theta$ be the associated error. Similarly, let $\hat\Theta$ be the LMS estimator $E[\Theta|X]$ of $\Theta$ based on $X$, and let $\tilde\Theta = \hat\Theta-\Theta$ be the associated error. It can be shown that (a simulation check of the first three properties appears after the proof note below):
    • $E[\tilde\Theta_L]=0$
    • $cov(\tilde\Theta_L,X)=0$, i.e., the estimation error $\tilde\Theta_L$ is uncorrelated with the observation $X$.
    • $var(\Theta) = var(\hat\Theta_L) + var(\tilde\Theta_L)$
    • The LMS estimation error $\tilde\Theta$ is uncorrelated with any function $h(X)$ of the observation $X$.

The proof can be found in Problem 23.
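As a complement to the proof, the first three properties are easy to check by simulation: draw many $(\Theta, X)$ pairs, form the linear LMS estimator from the moments, and look at the empirical error. A rough sketch (the distributions and the seed are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

theta = rng.normal(1.0, 2.0, size=n)          # hypothetical Theta with E[Theta] = 1, var = 4
x = theta + rng.normal(0.0, 3.0, size=n)      # X = Theta + noise, so cov(Theta, X) = var(Theta)

a = np.cov(theta, x)[0, 1] / np.var(x)        # a = cov(Theta, X) / var(X)
b = theta.mean() - a * x.mean()               # b = E[Theta] - a E[X]
err = (a * x + b) - theta                     # estimation error of the linear LMS estimator

print(err.mean())                             # ~ 0 : zero-mean error
print(np.cov(err, x)[0, 1])                   # ~ 0 : error uncorrelated with X
print(np.var(a * x + b) + np.var(err))        # ~ var(Theta) = 4 : variance decomposition
```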


Example 8.16. Linear LMS Estimation of the Bias of a Coin.
We revisit the coin tossing problem, and derive the linear LMS estimator. Here, the probability of heads of the coin is modeled as a random variable $\Theta$ whose prior distribution is uniform over the interval $[0, 1]$. The coin is tossed $n$ times, independently, resulting in a random number of heads, denoted by $X$. Thus, if $\Theta$ is equal to $\theta$, the random variable $X$ has a binomial distribution with parameters $n$ and $\theta$.

SOLUTION

  • We have $E[\Theta] = 1/2$, and
    $$E[X]=E[E[X|\Theta]]=E[n\Theta]=\frac{n}{2}$$
  • The variance of $\Theta$ is $1/12$, so that $\sigma_\Theta = 1/\sqrt{12}$. Also, $E[\Theta^2]= 1/3$. If $\Theta$ takes the value $\theta$, the (conditional) variance of $X$ is $n\theta(1 - \theta)$. Using the law of total variance, we obtain
    $$\begin{aligned}var(X)&=E[var(X|\Theta)]+var(E[X|\Theta]) \\&=E[n\Theta(1-\Theta)]+var(n\Theta) \\&=nE[\Theta]-nE[\Theta^2]+n^2var(\Theta) \\&=\frac{n}{2}-\frac{n}{3}+\frac{n^2}{12}=\frac{n(n+2)}{12}\end{aligned}$$
  • In order to find the covariance of $X$ and $\Theta$, we use the formula
    $$\begin{aligned}cov(\Theta,X)&=E[\Theta X]-E[\Theta]E[X] \\&=E[E[\Theta X|\Theta ]]-\frac{n}{4} \\&=E[\Theta E[X|\Theta ]]-\frac{n}{4} \\&=E[n\Theta^2]-\frac{n}{4}=\frac{n}{3}-\frac{n}{4}=\frac{n}{12}\end{aligned}$$
  • Putting everything together, we conclude that the linear LMS estimator takes the form
    $$\hat\Theta=\frac{1}{2}+\frac{n/12}{n(n+2)/12}\Big(X-\frac{n}{2}\Big)=\frac{X+1}{n+2}$$
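The closed-form answer $\hat\Theta=(X+1)/(n+2)$ can be sanity-checked by simulation; a rough Monte Carlo sketch (the value of $n$, the seed, and the sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 200_000

theta = rng.uniform(0.0, 1.0, size=trials)   # prior: Theta ~ Uniform[0, 1]
x = rng.binomial(n, theta)                   # X | Theta = theta ~ Binomial(n, theta)

# Empirical moments should give E[X] = n/2, var(X) = n(n+2)/12, cov(Theta, X) = n/12.
a = np.cov(theta, x)[0, 1] / np.var(x)
b = theta.mean() - a * x.mean()
print(a, b, 1 / (n + 2))                     # both coefficients should be close to 1/(n+2)
```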

Problem 16.
The joint PDF of random variables $X$ and $\Theta$ is of the form
$$f_{X,\Theta}(x,\theta)=\begin{cases}c, & (x,\theta)\in S,\\ 0, & \text{otherwise,}\end{cases}$$
where $c$ is a constant and $S$ is the set
$$S=\{(x,\theta)\mid 0\leq x\leq2,\ 0\leq\theta\leq2,\ x-1\leq\theta\leq x\}$$
We want to estimate $\Theta$ based on $X$.

  • (a) Find the LMS estimator $g(X)$ of $\Theta$.
  • (b) Calculate $E[(\Theta - g(X))^2\mid X =x]$, $E[g(X)]$, and $var(g(X))$.
  • (c) Calculate the mean squared error $E[(\Theta - g(X))^2]$. Is it the same as $E[var(\Theta|X)]$?
  • (d) Calculate $var(\Theta)$ using the law of total variance.
  • (e) Derive the linear LMS estimator of $\Theta$ based on $X$, and calculate its mean squared error.

SOLUTION

  • (a) The conditional PDF of $\Theta$ given $X=x$ is uniform over $[0,x]$ if $x\in[0,1]$, and uniform over $[x-1,x]$ if $x\in[1,2]$, so the LMS estimator is
    $$g(x)=E[\Theta\mid X=x]=\begin{cases}x/2, & 0\leq x\leq 1,\\ x-\dfrac{1}{2}, & 1< x\leq 2.\end{cases}$$
  • (b)
    • We first derive the conditional variance $E[(\Theta - g(X))^2\mid X =x]$.
      • If $x\in [0, 1]$, the conditional PDF of $\Theta$ is uniform over the interval $[0, x]$, and
        $$E[(\Theta - g(X))^2\mid X =x]=\frac{x^2}{12}$$
      • Similarly, if $x \in [1, 2]$, the conditional PDF of $\Theta$ is uniform over the interval $[x-1,x]$, and
        $$E[(\Theta - g(X))^2\mid X =x]=\frac{1}{12}$$
    • We now evaluate the expectation and variance of $g(X)$. Note that $(\Theta,X)$ is uniform over a region with area $3/2$, so that the constant $c$ must be equal to $2/3$. We have
      $$\begin{aligned}E[g(X)]&=E[E[\Theta|X]]=E[\Theta] \\&=\int\!\!\int\theta f_{X,\Theta}(x,\theta)\,d\theta\, dx \\&=\int_0^1\int_0^x\frac{2}{3}\theta\, d\theta\, dx+\int_1^2\int_{x-1}^x\frac{2}{3}\theta\, d\theta\, dx \\&=\frac{7}{9}\end{aligned}$$
    • Furthermore,
      $$\begin{aligned}var(g(X))&=var(E[\Theta|X]) \\&=E[(E[\Theta|X])^2]-(E[E[\Theta|X]])^2 \\&=\int_0^2(E[\Theta|X=x])^2f_X(x)\,dx-(E[\Theta])^2 \\&=\int_0^1\Big(\frac{x}{2}\Big)^2\cdot\frac{2}{3}x\,dx+\int_1^2\Big(x-\frac{1}{2}\Big)^2\cdot\frac{2}{3}\,dx-\Big(\frac{7}{9}\Big)^2 \\&=\frac{103}{648} \end{aligned}$$
  • (c)
    $$\begin{aligned}E[var(\Theta|X)]&=E[E[(\Theta-E[\Theta|X])^2|X]]=E[(\Theta - g(X))^2] \\&=\int_0^1\frac{x^2}{12}\cdot\frac{2}{3}x\,dx+\int_1^2\frac{1}{12}\cdot\frac{2}{3}\,dx=\frac{5}{72} \end{aligned}$$
  • (d)
    $$var(\Theta)=E[var(\Theta|X)]+var(E[\Theta|X])=\frac{5}{72}+\frac{103}{648}=\frac{37}{162}$$
  • (e) The linear LMS estimator is
    $$\hat\Theta=E[\Theta]+\frac{cov(\Theta,X)}{\sigma_X^2}(X-E[X])$$
    We have
    $$E[X]=\int_0^1\int_0^x\frac{2}{3}x\,d\theta\, dx+\int_1^2\int_{x-1}^x\frac{2}{3}x\,d\theta\, dx=\frac{2}{9}+1=\frac{11}{9}$$
    $$E[X^2]=\int_0^1\int_0^x\frac{2}{3}x^2\,d\theta\, dx+\int_1^2\int_{x-1}^x\frac{2}{3}x^2\,d\theta\, dx=\frac{1}{6}+\frac{14}{9}=\frac{31}{18}$$
    $$var(X)=E[X^2]-(E[X])^2=\frac{31}{18}-\frac{121}{81}=\frac{37}{162}$$
    $$E[\Theta]=\int_0^1\int_0^x\frac{2}{3}\theta\, d\theta\, dx+\int_1^2\int_{x-1}^x\frac{2}{3}\theta\, d\theta\, dx=\frac{1}{9}+\frac{2}{3}=\frac{7}{9}$$
    $$E[X\Theta]=\int_0^1\int_0^x\frac{2}{3}x\theta\, d\theta\, dx+\int_1^2\int_{x-1}^x\frac{2}{3}x\theta\, d\theta\, dx=\frac{1}{12}+\frac{19}{18}=\frac{41}{36}$$
    $$cov(\Theta,X)=E[X\Theta]-E[X]E[\Theta]=\frac{41}{36}-\frac{11}{9}\cdot\frac{7}{9}=\frac{61}{324}$$
    Thus, the linear LMS estimator is
    $$\hat\Theta=\frac{7}{9}+\frac{61/324}{37/162}\Big(X-\frac{11}{9}\Big)=\frac{7}{9}+\frac{61}{74}\Big(X-\frac{11}{9}\Big)\approx 0.824X-0.230$$
    Its mean squared error is
    $$E[(\Theta-\hat\Theta)^2]=\sigma_\Theta^2-\frac{cov(\Theta,X)^2}{\sigma_X^2}=\frac{37}{162}-\frac{(61/324)^2}{37/162}=\frac{65}{888}\approx 0.073$$
    (Note that this exceeds the LMS mean squared error $E[var(\Theta|X)]=5/72\approx 0.069$ from part (c), as it must.)
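The moments and the resulting estimator above (slope $61/74\approx0.824$, intercept $-17/74\approx-0.230$, mean squared error $\approx0.073$) can be checked by rejection sampling from the uniform density on $S$; a rough sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 2, size=1_000_000)
t = rng.uniform(0, 2, size=1_000_000)
keep = (t <= x) & (t >= x - 1)            # keep only points in S: the kept pairs are uniform on S
x, t = x[keep], t[keep]

a = np.cov(x, t)[0, 1] / np.var(x)        # cov(X, Theta) / var(X)
b = t.mean() - a * x.mean()
print(a, b)                               # ~ 0.824 and ~ -0.230
print(np.mean((t - (a * x + b))**2))      # ~ 0.073, the mean squared error
```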

The Case of Multiple Observations and Multiple Parameters

  • If there are multiple parameters $\Theta_i$ to be estimated, we may consider the criterion
    $$E[(\Theta_1-\hat\Theta_1)^2]+\cdots+E[(\Theta_m-\hat\Theta_m)^2]$$
    and minimize it over all estimators $\hat\Theta_1,\ldots,\hat\Theta_m$ that are linear functions of the observations. This is equivalent to finding, for each $i$, a linear estimator $\hat\Theta_i$ that minimizes $E[(\Theta_i-\hat\Theta_i)^2]$, so that we are essentially dealing with $m$ decoupled linear estimation problems, one for each unknown parameter.

  • In the case where there are multiple observations with a certain independence property, the formula for the linear LMS estimator simplifies, as we will now describe.
  • Let $\Theta$ be a random variable with mean $\mu$ and variance $\sigma_0^2$, and let $X_1, \ldots , X_n$ be observations of the form
    $$X_i =\Theta + W_i$$
    where the $W_i$ are random variables with mean 0 and variance $\sigma_i^2$, which represent observation errors. Under the assumption that the random variables $\Theta, W_1, \ldots , W_n$ are uncorrelated, the linear LMS estimator of $\Theta$, based on the observations $X_1,\ldots, X_n$, turns out to be
    $$\hat\Theta=\frac{\mu/\sigma_0^2+\sum_{i=1}^nX_i/\sigma_i^2}{\sum_{i=0}^n1/\sigma_i^2}$$
    The derivation involves forming the function
    $$h(a_1,\ldots,a_n,b)=\frac{1}{2}E[(\Theta-a_1X_1-\cdots-a_nX_n-b)^2]$$
    and minimizing it by setting to zero its partial derivatives with respect to $a_1, \ldots , a_n , b$. We will show that the minimizing values of $a_1, \ldots , a_n, b$ are
    $$b^*=\frac{\mu/\sigma_0^2}{\sum_{i=0}^n1/\sigma_i^2},\qquad a_j^*=\frac{1/\sigma_j^2}{\sum_{i=0}^n1/\sigma_i^2},\qquad j=1,\ldots,n$$
    from which the formula for the linear LMS estimator given earlier follows.

PROOF

  • To this end, it is sufficient to show that the partial derivatives of h h h, with respect to a 1 , . . . , a n , b a_1, ... , a_n , b a1,...,an,b, are all equal to 0 when evaluated at a 1 ∗ , . . . , a n ∗ , b ∗ a_1^*, ... , a_n^*, b^* a1,...,an,b. (Because the quadratic function h h h is nonnegative, it can be shown that any point at which its derivatives are zero must be a minimum.)
  • By differentiating $h$, and using $X_j=\Theta+W_j$, we obtain
    $$\frac{\partial h}{\partial b}\bigg|_{a^*,b^*}=E\bigg[\bigg(\sum_{j=1}^na_j^*-1\bigg)\Theta+\sum_{j=1}^na_j^*W_j+b^*\bigg]$$
    $$\frac{\partial h}{\partial a_i}\bigg|_{a^*,b^*}=E\bigg[X_i\bigg(\bigg(\sum_{j=1}^na_j^*-1\bigg)\Theta+\sum_{j=1}^na_j^*W_j+b^*\bigg)\bigg]$$
  • From the expressions for $b^*$ and $a_j^*$, we see that
    $$\sum_{j=1}^na_j^*-1=\frac{\sum_{j=1}^n1/\sigma_j^2-\sum_{i=0}^n1/\sigma_i^2}{\sum_{i=0}^n1/\sigma_i^2}=-\frac{b^*}{\mu}$$
    It follows that
    $$\frac{\partial h}{\partial b}\bigg|_{a^*,b^*}=E\bigg[-\frac{b^*}{\mu}\Theta+\sum_{j=1}^na_j^*W_j+b^*\bigg]=-\frac{b^*}{\mu}\cdot\mu+0+b^*=0$$
  • Using, in addition, the equations
    $$E[X_i(\mu-\Theta)]=E[(\Theta + W_i)(\mu-\Theta)]=\mu E[\Theta]-E[\Theta^2]+E[W_i(\mu-\Theta)]=-\sigma_0^2$$
    $$E[X_iW_i]=E[(\Theta + W_i)W_i]=\sigma_i^2,\qquad \text{for all } i$$
    $$E[X_iW_j]=E[(\Theta + W_i)W_j]=0,\qquad \text{for all } i \text{ and } j \text{ with } i\neq j$$
    we obtain
    $$\begin{aligned}\frac{\partial h}{\partial a_i}\bigg|_{a^*,b^*}&=E\bigg[X_i\bigg(-\frac{b^*}{\mu}\Theta+\sum_{j=1}^na_j^*W_j+b^*\bigg)\bigg] \\&=E\bigg[X_i\bigg((\mu-\Theta)\frac{b^*}{\mu}+\sum_{j=1}^na_j^*W_j\bigg)\bigg] \\&=\frac{b^*}{\mu}E\big[X_i(\mu-\Theta)\big]+\sum_{j=1}^na_j^*E\big[X_iW_j\big] \\&=-\sigma_0^2\frac{b^*}{\mu}+a_i^*\sigma_i^2 \\&=0\end{aligned}$$
    where the last equality holds because $\sigma_0^2\,b^*/\mu=a_i^*\sigma_i^2=1\big/\sum_{k=0}^n1/\sigma_k^2$.
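A minimal NumPy sketch of the estimator derived above (the function name and the sample numbers are illustrative):

```python
import numpy as np

def linear_lms_estimate(x, prior_mean, prior_var, noise_vars):
    """Linear LMS estimate of Theta from X_i = Theta + W_i with uncorrelated W_i.

    x          : observed values x_1, ..., x_n
    prior_mean : mu = E[Theta]
    prior_var  : sigma_0^2 = var(Theta)
    noise_vars : sigma_i^2 = var(W_i), one per observation
    """
    x = np.asarray(x, dtype=float)
    w = 1.0 / np.asarray(noise_vars, dtype=float)      # weights 1 / sigma_i^2
    return (prior_mean / prior_var + np.sum(w * x)) / (1.0 / prior_var + np.sum(w))

# Hypothetical numbers: prior mean 0 with variance 4, two measurements with noise variances 1 and 2.
print(linear_lms_estimate([1.2, 0.8], prior_mean=0.0, prior_var=4.0, noise_vars=[1.0, 2.0]))  # ~ 0.914
```

Note how the prior mean $\mu$ enters the formula as if it were one more observation, with weight $1/\sigma_0^2$.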

Problem 24. Properties of linear LMS estimation based on multiple observations.
Let $\Theta, X_1, \ldots , X_n$ be random variables with given variances and covariances. Let $\hat\Theta_L$ be the linear LMS estimator of $\Theta$ based on $X_1 , \ldots , X_n$, and let $\tilde\Theta_L=\hat\Theta_L-\Theta$ be the associated error. Show that $E[\tilde\Theta_L] = 0$ and that $\tilde\Theta_L$ is uncorrelated with $X_i$ for every $i$.

SOLUTION

  • We start by showing that $E[\tilde\Theta_LX_i] = 0$, for all $i$. Consider a new linear estimator of the form $\hat\Theta_L+aX_i$, where $a$ is a scalar parameter. Since $\hat\Theta_L$ is a linear LMS estimator, its mean squared error $E[(\hat\Theta_L-\Theta)^2]$ is no larger than the mean squared error $h(a)=E[(\hat\Theta_L+aX_i-\Theta)^2]$ of the new estimator. Therefore, the function $h(a)$ attains its minimum value when $a= 0$. Note that
    $$h(a)=E[(\hat\Theta_L+aX_i-\Theta)^2]=E[(\tilde\Theta_L+aX_i)^2]=E[\tilde\Theta_L^2]+a^2E[X_i^2]+2aE[\tilde\Theta_LX_i]$$
    The condition $(dh/da)(0) = 0$ yields $E[\tilde\Theta_LX_i]= 0$.
  • Let us now repeat the above argument, but with the constant 1 replacing the random variable $X_i$. Following the same steps, we obtain $E[\tilde\Theta_L] = 0$. Finally, note that
    $$cov(\tilde\Theta_L, X_i)= E[\tilde\Theta_LX_i] - E[\tilde\Theta_L] E[X_i] = 0 - 0\cdot E[X_i ] = 0$$
    so that $\tilde\Theta_L$ and $X_i$ are uncorrelated.
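A quick numerical illustration of these properties with two observations $X_i=\Theta+W_i$, using the estimator from the previous subsection (the distributions and variances below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
m = 500_000
mu, s0, s1, s2 = 2.0, 1.0, 0.5, 2.0                  # E[Theta], var(Theta), var(W_1), var(W_2)

theta = rng.normal(mu, np.sqrt(s0), size=m)           # any uncorrelated Theta, W_1, W_2 would do
x1 = theta + rng.normal(0.0, np.sqrt(s1), size=m)     # X_1 = Theta + W_1
x2 = theta + rng.normal(0.0, np.sqrt(s2), size=m)     # X_2 = Theta + W_2

den = 1 / s0 + 1 / s1 + 1 / s2
theta_hat = (mu / s0 + x1 / s1 + x2 / s2) / den       # linear LMS estimator
err = theta_hat - theta

print(err.mean())                                     # ~ 0 : zero-mean error
print(np.cov(err, x1)[0, 1], np.cov(err, x2)[0, 1])   # both ~ 0 : uncorrelated with each X_i
```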

Linear Estimation and Normal Models

  • The linear LMS estimator is generally inferior to the LMS estimator $E[\Theta|X_1,\ldots,X_n]$. However, if the LMS estimator happens to be linear in the observations $X_1, \ldots , X_n$, then it is also the linear LMS estimator, i.e., the two estimators coincide.

  • An important example where this occurs is the estimation of a normal random variable $\Theta$ on the basis of observations $X_i = \Theta + W_i$, where the $W_i$ are independent zero-mean normal noise terms, independent of $\Theta$.
  • This is a manifestation of a property that can be shown to hold more generally: if $\Theta, X_1, \ldots , X_n$ are all linear functions of a collection of independent normal random variables, then the LMS and the linear LMS estimators coincide. They also coincide with the MAP estimator, since the normal distribution is symmetric and unimodal.
  • The above discussion leads to an interesting interpretation of linear LMS estimation: the estimator is the same as the one that would have been obtained if we were to pretend that the random variables involved were normal, with the given means, variances, and covariances. Thus, there are two alternative perspectives on linear LMS estimation: either as a computational shortcut (avoid the evaluation of a possibly complicated formula for $E[\Theta|X]$), or as a model simplification (replace less tractable distributions by normal ones).
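  • As a simple single-observation illustration of this coincidence (using the standard normal posterior calculation, stated here without re-deriving it): if $\Theta\sim N(\mu,\sigma_0^2)$ and $X=\Theta+W$ with $W\sim N(0,\sigma_1^2)$ independent of $\Theta$, then the posterior of $\Theta$ given $X=x$ is normal with mean
    $$E[\Theta|X=x]=\frac{\mu/\sigma_0^2+x/\sigma_1^2}{1/\sigma_0^2+1/\sigma_1^2}=\mu+\frac{\sigma_0^2}{\sigma_0^2+\sigma_1^2}(x-\mu)$$
    which is exactly the linear LMS formula $E[\Theta]+\dfrac{cov(\Theta,X)}{\sigma_X^2}(x-E[X])$, since $cov(\Theta,X)=\sigma_0^2$, $\sigma_X^2=\sigma_0^2+\sigma_1^2$, and $E[X]=\mu$.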

Problem 20. Estimation with spherically invariant PDFs.
Let $\Theta$ and $X$ be continuous random variables with joint PDF of the form
$$f_{\Theta,X}(\theta,x)=h(q(\theta,x))$$
where $h$ is a nonnegative scalar function, and $q(\theta, x)$ is a quadratic function of the form
$$q(\theta, x)=a(\theta-\overline\theta)^2+b(x-\overline x)^2-2c(\theta-\overline\theta)(x-\overline x)$$
Here $a, b, c, \overline\theta,\overline x$ are some scalars with $a \neq 0$. Derive the LMS and linear LMS estimates, for any $x$ such that $E[\Theta |X = x]$ is well-defined and finite. Assuming that $q(\theta, x)\geq0$ for all $x,\theta$, and that $h$ is monotonically decreasing, derive the MAP estimate and show that it coincides with the LMS and linear LMS estimates.

SOLUTION

  • The posterior is given by
    $$f_{\Theta|X}(\theta|x)=\frac{h(q(\theta,x))}{f_X(x)}$$
    To motivate the derivation of the LMS and linear LMS estimates, consider first the MAP estimate, assuming that $q(\theta, x)\geq0$ for all $x, \theta$, and that $h$ is monotonically decreasing. The MAP estimate maximizes $h(q(\theta,x))$ and, since $h$ is a decreasing function, it minimizes $q(\theta,x)$ over $\theta$. By setting to 0 the derivative of $q(\theta,x)$ with respect to $\theta$, we obtain
    $$\hat\theta=\overline\theta+\frac{c}{a}(x-\overline x)$$
  • We will now show that $\hat\theta$ is equal to the LMS and linear LMS estimates [without the assumption that $q(\theta, x)\geq0$ for all $x, \theta$, and that $h$ is monotonically decreasing]. We write
    $$\theta-\overline\theta=\theta-\hat\theta+\frac{c}{a}(x-\overline x)$$
    and substitute in the formula for $q(\theta, x)$ to obtain, after some algebra,
    $$q(\theta, x)=a(\theta-\hat\theta)^2+\Big(b-\frac{c^2}{a}\Big)(x-\overline x)^2$$
    Thus, for any given $x$, the posterior is a function of $\theta$ that is symmetric around $\hat\theta$. This implies that $\hat\theta$ is equal to the conditional mean $E[\Theta |X = x]$, whenever $E[\Theta |X = x]$ is well-defined and finite. Furthermore, we have
    $$E[\Theta |X]=\overline\theta+\frac{c}{a}(X-\overline x)$$
    Since $E[\Theta|X]$ is linear in $X$, it is also the linear LMS estimator.
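The "some algebra" in the completion-of-squares step is easy to verify symbolically; a quick SymPy check (assuming SymPy is available):

```python
import sympy as sp

theta, x, a, b, c, t_bar, x_bar = sp.symbols('theta x a b c theta_bar x_bar')

q = a*(theta - t_bar)**2 + b*(x - x_bar)**2 - 2*c*(theta - t_bar)*(x - x_bar)
theta_hat = t_bar + (c/a)*(x - x_bar)                  # the candidate estimate

# q should equal a*(theta - theta_hat)^2 + (b - c^2/a)*(x - x_bar)^2 identically.
residual = sp.simplify(q - a*(theta - theta_hat)**2 - (b - c**2/a)*(x - x_bar)**2)
print(residual)                                        # prints 0
```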

The Choice of Variables in Linear Estimation

  • Consider an unknown random variable $\Theta$, observations $X_1, \ldots ,X_n$, and transformed observations $Y_i= h(X_i)$, $i = 1, \ldots , n$, where the function $h$ is one-to-one. The transformed observations $Y_i$ convey the same information as the original observations $X_i$, and therefore the LMS estimator based on $Y_1, \ldots , Y_n$ is the same as the one based on $X_1, \ldots , X_n$:
    $$E[\Theta|h(X_1),\ldots,h(X_n)]=E[\Theta|X_1,\ldots,X_n]$$
  • On the other hand, linear LMS estimation is based on the premise that the class of linear functions of the observations $X_1, \ldots , X_n$ contains reasonably good estimators of $\Theta$; this may not always be the case. For example, suppose that $\Theta$ is the unknown variance of some distribution and $X_1, \ldots , X_n$ represent independent random variables drawn from that distribution. Then, it would be unreasonable to expect that a good estimator of $\Theta$ can be obtained with a linear function of $X_1, \ldots , X_n$. This suggests that it may be helpful to transform the observations so that good estimators of $\Theta$ can be found within the class of linear functions of the transformed observations.

Problem 17.
Let $\Theta$ be a positive random variable, with known mean $\mu$ and variance $\sigma^2$, to be estimated on the basis of a measurement $X$ of the form $X =\sqrt\Theta\, W$. We assume that $W$ is independent of $\Theta$, with zero mean, unit variance, and known fourth moment $E[W^4]$. Thus, the conditional mean and variance of $X$ given $\Theta$ are 0 and $\Theta$, respectively, so we are essentially trying to estimate the variance of $X$ given an observed value. Find the linear LMS estimator of $\Theta$ based on $X$, and the linear LMS estimator of $\Theta$ based on $X^2$.

SOLUTION

  • We have
    $$cov(\Theta,X) = E[\Theta^{3/2}W] -E[\Theta]E[X] = E[\Theta^{3/2}]E[W] - E[\Theta]E[\sqrt\Theta]E[W] = 0$$
    since $E[W]=0$, so the linear LMS estimator of $\Theta$ based on $X$ is simply $\hat\Theta = \mu$, and does not make use of the available observation.
  • Let us now consider the transformed observation $Y = X^2 =\Theta W^2$, and linear estimators of the form $\hat\Theta = aY + b$. We have
    $$E[Y]=E[\Theta W^2]=E[\Theta]E[W^2]=\mu$$
    $$E[\Theta Y]=E[\Theta^2W^2]=E[\Theta^2]E[W^2]=\mu^2+\sigma^2$$
    $$cov(\Theta,Y)=E[\Theta Y]-E[\Theta]E[Y]=\sigma^2$$
    $$var(Y)=E[\Theta^2W^4]-(E[\Theta W^2])^2=(\mu^2+\sigma^2)E[W^4]-\mu^2$$
    Thus, the linear LMS estimator of $\Theta$ based on $Y$ is of the form
    $$\hat\Theta=\mu+\frac{\sigma^2}{(\mu^2+\sigma^2)E[W^4]-\mu^2}(Y-\mu)$$
    and makes effective use of the observation: the estimate of $\Theta$ (which is the conditional variance of $X$) becomes large whenever a large value of $X^2$ is observed.
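A Monte Carlo sketch of this comparison, with a hypothetical exponential prior for $\Theta$ (so that $\sigma^2=\mu^2$) and standard normal $W$ (so that $E[W^4]=3$):

```python
import numpy as np

rng = np.random.default_rng(4)
m = 1_000_000

mu = 2.0                                       # E[Theta]
theta = rng.exponential(mu, size=m)            # positive, with var(Theta) = sigma^2 = mu^2 = 4
w = rng.normal(0.0, 1.0, size=m)               # E[W] = 0, E[W^2] = 1, E[W^4] = 3
x = np.sqrt(theta) * w                         # X = sqrt(Theta) W

print(np.cov(theta, x)[0, 1])                  # ~ 0 : X alone is useless to a linear estimator

y = x**2                                       # transformed observation Y = X^2 = Theta W^2
sigma2 = mu**2                                 # var(Theta) for this prior
a = sigma2 / ((mu**2 + sigma2) * 3.0 - mu**2)  # coefficient from the formula, with E[W^4] = 3
theta_hat = mu + a * (y - mu)                  # linear LMS estimator based on Y

print(np.mean((theta_hat - theta)**2))         # ~ 3.2 ...
print(np.mean((mu - theta)**2))                # ... versus ~ 4.0 = var(Theta) for the constant estimate mu
```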

Reprinted from blog.csdn.net/weixin_42437114/article/details/114182391