Chapter 8 (Bayesian Statistical Inference): Bayesian Inference and the Posterior Distribution

This post contains my reading notes for Introduction to Probability.

Statistical Inference

  • Statistical inference is the process of extracting information about an unknown variable or an unknown model from available data. We aim to:
    • (a) Develop an appreciation of the two main approaches (Bayesian and classical), their differences, and their similarities.
    • (b) Present the main categories of inference problems (parameter estimation, hypothesis testing, and significance testing).
    • (c) Discuss the most important methodologies (maximum a posteriori probability (MAP) rule, least mean squares (LMS) estimation, maximum likelihood estimation, regression, likelihood ratio tests, etc.).

Bayesian versus Classical Statistics

  • Their fundamental difference relates to the nature of the unknown models or variables.
    • In the Bayesian view, they are treated as random variables with known prior distributions.
      • In particular, when trying to infer the nature of an unknown model, the Bayesian approach views the model as chosen randomly from a given model class. This is done by introducing a random variable $\Theta$ that characterizes the model, and by postulating a prior probability distribution $p_\Theta(\theta)$. Given the observed data $x$, one can, in principle, use Bayes' rule to derive a posterior probability distribution $p_{\Theta|X}(\theta|x)$. This captures all the information that $x$ can provide about $\theta$.
    • By contrast, the classical approach views the unknown quantity $\theta$ as a constant that happens to be unknown. It then strives to develop an estimate of $\theta$ that has some performance guarantees. This introduces an important conceptual difference from the Bayesian approach: we are not dealing with a single probabilistic model, but rather with multiple candidate probabilistic models, one for each possible value of $\theta$.

Model versus Variable Inference

  • Applications of statistical inference tend to be of two different types.
    • In model inference, the object of study is a real phenomenon or process for which we wish to construct or validate a model on the basis of available data (e.g., do planets follow elliptical trajectories?). Such a model can then be used to make predictions about the future, or to infer some hidden underlying causes.
    • In variable inference, we wish to estimate the value of one or more unknown variables by using some related, possibly noisy information (e.g., what is my current position, given a few GPS readings?).
  • The distinction between model and variable inference is not sharp; for example, by describing a model in terms of a set of variables, we can cast a model inference problem as a variable inference problem.
    • In any case, we will not emphasize this distinction in the sequel, because the same methodological principles apply to both types of inference.
    • In some applications, both model and variable inference issues may arise. For example, we may collect some initial data, use them to build a model, and then use the model to make inferences about the values of certain variables.

Example 8.1. A Noisy Channel.

  • A transmitter sends a sequence of binary messages $s_i \in \{0, 1\}$, and a receiver observes
    $$X_i = a s_i + W_i, \qquad i = 1, \ldots, n,$$
    where the $W_i$ are zero-mean normal random variables that model channel noise, and $a$ is a scalar that represents the channel attenuation.
  • In a model inference setting, $a$ is unknown. The transmitter sends a pilot signal consisting of a sequence of messages $s_1, \ldots, s_n$, whose values are known by the receiver. On the basis of the observations $X_1, \ldots, X_n$, the receiver wishes to estimate the value of $a$, that is, build a model of the channel.
  • Alternatively, in a variable inference setting, $a$ is assumed to be known (possibly because it has already been inferred using a pilot signal, as above). The receiver observes $X_1, \ldots, X_n$, and wishes to infer the values of $s_1, \ldots, s_n$.
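
A small simulation may help make this channel model concrete. The sketch below is my own illustration (not from the text); the attenuation $a = 0.8$ and the noise standard deviation $0.5$ are arbitrary assumed values.

```python
import numpy as np

# Simulate the noisy channel X_i = a * s_i + W_i of Example 8.1.
# The attenuation a and the noise level sigma_w are illustrative assumptions.
rng = np.random.default_rng(0)
a = 0.8                                    # channel attenuation
sigma_w = 0.5                              # std. deviation of the zero-mean noise W_i
s = rng.integers(0, 2, size=10)            # binary pilot messages s_1, ..., s_n
x = a * s + rng.normal(0.0, sigma_w, size=s.size)  # received observations X_i

print("sent:    ", s)
print("received:", np.round(x, 2))
```

In the model inference setting the receiver would use the pairs $(s_i, X_i)$ to estimate $a$; in the variable inference setting it would use the known $a$ and the observed $X_i$ to guess the $s_i$.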

A Rough Classification of Statistical Inference Problems

  • In an estimation problem, a model is fully specified, except for an unknown, possibly multidimensional, parameter $\theta$, which we wish to estimate. This parameter can be viewed either as a random variable (Bayesian approach) or as an unknown constant (classical approach). The usual objective is to arrive at an estimate of $\theta$ that is close to the true value in some sense. For example:
    • (a) In the noisy transmission problem of Example 8.1, use the knowledge of the pilot sequence and the observations to estimate $a$.
    • (b) On the basis of historical stock market data, estimate the mean and variance of the daily movement in the price of a particular stock.
  • In a binary hypothesis testing problem, we start with two hypotheses and use the available data to decide which of the two is true. For example:
    • (a) In the noisy transmission problem of Example 8.1, use the knowledge of $a$ and $X_i$ to decide whether $s_i$ was 0 or 1.
    • (b) Given a noisy picture, decide whether there is a person in the picture or not.
    • (c) Given a set of trials with two alternative medical treatments, decide which of the two treatments is more effective.
  • More generally, in an $m$-ary hypothesis testing problem, there is a finite number $m$ of competing hypotheses. The performance of a particular method is typically judged by the probability that it makes an erroneous decision.

Bayesian Inference and the Posterior Distribution


  • In Bayesian inference, the unknown quantity of interest, which we denote by $\Theta$, is modeled as a random variable or as a finite collection of random variables.

For simplicity, unless the contrary is explicitly stated, we view $\Theta$ as a single random variable.

  • We aim to extract information about $\Theta$, based on observing a collection $X = (X_1, \ldots, X_n)$ of related random variables, called observations, measurements, or an observation vector. For this, we assume that we know the joint distribution of $\Theta$ and $X$. Equivalently, we assume that we know:
    • (a) A prior distribution $p_\Theta$ or $f_\Theta$, depending on whether $\Theta$ is discrete or continuous.
    • (b) A conditional distribution $p_{X|\Theta}$ or $f_{X|\Theta}$, depending on whether $X$ is discrete or continuous.
  • Once a particular value $x$ of $X$ has been observed, a complete answer to the Bayesian inference problem is provided by the posterior distribution $p_{\Theta|X}(\theta|x)$ or $f_{\Theta|X}(\theta|x)$ of $\Theta$. This distribution is determined by the appropriate form of Bayes' rule. It encapsulates everything there is to know about $\Theta$, given the available information, and it is the starting point for further analysis.
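
When no closed-form posterior is available, the Bayes' rule computation can be carried out numerically. The following sketch is my own illustration (not from the text): it discretizes a continuous $\Theta$ on a grid, applies Bayes' rule, and checks the result against the closed-form posterior derived in Example 8.2 below, using an assumed observation $x = 0.25$.

```python
import numpy as np

def grid_posterior(theta_grid, prior, likelihood):
    """Bayes' rule on a uniform grid: posterior is proportional to prior * likelihood,
    normalized so that it integrates to one over theta_grid."""
    unnormalized = prior * likelihood
    dtheta = theta_grid[1] - theta_grid[0]
    return unnormalized / (unnormalized.sum() * dtheta)

theta = np.linspace(1e-3, 1.0, 2000)
prior = np.ones_like(theta)                           # f_Theta(theta) = 1 on [0, 1]
x = 0.25                                              # observed lateness (assumed value)
likelihood = np.where(theta >= x, 1.0 / theta, 0.0)   # f_{X|Theta}(x|theta) = 1/theta for theta >= x
post = grid_posterior(theta, prior, likelihood)

i = np.searchsorted(theta, 0.5)
print(post[i], 1.0 / (theta[i] * abs(np.log(x))))     # grid value vs. closed form 1/(theta |ln x|)
```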

Example 8.2.
Romeo and Juliet start dating, but Juliet will be late on any date by a random amount $X$, uniformly distributed over the interval $[0, \theta]$. The parameter $\theta$ is unknown and is modeled as the value of a random variable $\Theta$, uniformly distributed between zero and one hour. Assuming that Juliet was late by an amount $x$ on their first date, how should Romeo use this information to update the distribution of $\Theta$?

SOLUTION

  • Here the prior PDF is
    $$f_\Theta(\theta) = \begin{cases} 1, & \text{if } 0 \leq \theta \leq 1, \\ 0, & \text{otherwise}, \end{cases}$$
    and the conditional PDF of the observation is
    $$f_{X|\Theta}(x|\theta) = \begin{cases} 1/\theta, & \text{if } 0 \leq x \leq \theta, \\ 0, & \text{otherwise}. \end{cases}$$
  • Using Bayes' rule, and taking into account that $f_\Theta(\theta) f_{X|\Theta}(x|\theta)$ is nonzero only if $0 \leq x \leq \theta \leq 1$, we find that for any $x \in [0, 1]$, the posterior PDF is
    $$f_{\Theta|X}(\theta|x) = \frac{f_\Theta(\theta) f_{X|\Theta}(x|\theta)}{\int_0^1 f_\Theta(\theta') f_{X|\Theta}(x|\theta')\, d\theta'} = \frac{1/\theta}{\int_x^1 \frac{1}{\theta'}\, d\theta'} = \frac{1}{\theta \cdot |\ln x|}, \qquad \text{if } 0 \leq x \leq \theta \leq 1,$$
    and $f_{\Theta|X}(\theta|x) = 0$ if $\theta < x$ or $\theta > 1$.
  • Consider now a variation involving the first $n$ dates. Assume that Juliet is late by random amounts $X_1, \ldots, X_n$, which, given $\Theta = \theta$, are uniformly distributed in the interval $[0, \theta]$ and conditionally independent. Let $X = (X_1, \ldots, X_n)$ and $x = (x_1, \ldots, x_n)$. Similar to the case where $n = 1$, we have
    $$f_{X|\Theta}(x|\theta) = \begin{cases} 1/\theta^n, & \text{if } \overline{x} \leq \theta \text{ and } x_i \geq 0 \text{ for all } i, \\ 0, & \text{otherwise}, \end{cases}$$
    where
    $$\overline{x} = \max\{x_1, \ldots, x_n\}.$$
    The posterior PDF is
    $$f_{\Theta|X}(\theta|x) = \begin{cases} \dfrac{c(\overline{x})}{\theta^n}, & \text{if } \overline{x} \leq \theta \leq 1, \\ 0, & \text{otherwise}, \end{cases}$$
    where $c(\overline{x})$ is a normalizing constant that depends only on $\overline{x}$:
    $$c(\overline{x}) = \frac{1}{\int_{\overline{x}}^1 \frac{1}{(\theta')^n}\, d\theta'}.$$
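
For the $n$-date version, the integral in the denominator of $c(\overline{x})$ evaluates to $-\ln\overline{x}$ for $n = 1$ and to $(\overline{x}^{\,1-n} - 1)/(n - 1)$ for $n \geq 2$. The sketch below is my own illustration with made-up lateness values; it returns the posterior PDF as a function of $\theta$.

```python
import numpy as np

def juliet_posterior(x_obs):
    """Posterior PDF of Theta for Example 8.2 with n observations:
    f(theta | x) = c(x_bar) / theta**n on [x_bar, 1], and 0 elsewhere."""
    x_obs = np.asarray(x_obs, dtype=float)
    n = x_obs.size
    x_bar = x_obs.max()
    # c(x_bar) = 1 / integral_{x_bar}^{1} theta'**(-n) d theta'
    if n == 1:
        integral = -np.log(x_bar)
    else:
        integral = (x_bar ** (1 - n) - 1.0) / (n - 1.0)
    c = 1.0 / integral
    return lambda theta: np.where((theta >= x_bar) & (theta <= 1.0), c / theta ** n, 0.0)

post = juliet_posterior([0.3, 0.7, 0.45])   # hypothetical lateness amounts (hours)
print(post(np.array([0.5, 0.8, 1.0])))      # zero below x_bar = 0.7, positive above it
```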

Example 8.3. Inference of a Common Mean of Normal Random Variables.
We observe a collection $X = (X_1, \ldots, X_n)$ of random variables with an unknown common mean, whose value we wish to infer. We assume that, given the value of the common mean, the $X_i$ are normal and independent, with known variances $\sigma_1^2, \ldots, \sigma_n^2$. In a Bayesian approach to this problem, we model the common mean as a random variable $\Theta$, with a given prior. For concreteness, we assume a normal prior, with known mean $x_0$ and known variance $\sigma_0^2$.

  • Let us note, for future reference, that our model is equivalent to one of the form
    $$X_i = \Theta + W_i, \qquad i = 1, \ldots, n,$$
    where the random variables $\Theta, W_1, \ldots, W_n$ are independent and normal, with known means and variances. In particular, for any value $\theta$,
    $$E[W_i] = E[W_i \mid \Theta = \theta] = 0, \qquad \mathrm{var}(W_i) = \mathrm{var}(X_i \mid \Theta = \theta) = \sigma_i^2.$$
    A model of this type is common in many engineering applications involving several independent measurements of an unknown quantity.
  • According to our assumptions, we have
    $$f_\Theta(\theta) = c_1 \cdot \exp\bigg\{-\frac{(\theta - x_0)^2}{2\sigma_0^2}\bigg\}$$
    and
    $$f_{X|\Theta}(x|\theta) = c_2 \cdot \exp\bigg\{-\frac{(x_1 - \theta)^2}{2\sigma_1^2}\bigg\} \cdots \exp\bigg\{-\frac{(x_n - \theta)^2}{2\sigma_n^2}\bigg\},$$
    where $c_1$ and $c_2$ are normalizing constants that do not depend on $\theta$. We invoke Bayes' rule,
    $$f_{\Theta|X}(\theta|x) = \frac{f_\Theta(\theta) f_{X|\Theta}(x|\theta)}{\int f_\Theta(\theta') f_{X|\Theta}(x|\theta')\, d\theta'},$$
    and note that the numerator term, $f_\Theta(\theta) f_{X|\Theta}(x|\theta)$, is of the form
    $$c_1 c_2 \cdot \exp\bigg\{-\sum_{i=0}^n \frac{(x_i - \theta)^2}{2\sigma_i^2}\bigg\}.$$
    After some algebra, which involves completing the square inside the exponent, we find that the numerator is of the form
    $$d \cdot \exp\bigg\{-\frac{(\theta - m)^2}{2v}\bigg\},$$
    where
    $$m = \frac{\sum_{i=0}^n x_i/\sigma_i^2}{\sum_{i=0}^n 1/\sigma_i^2}, \qquad v = \frac{1}{\sum_{i=0}^n 1/\sigma_i^2},$$
    and $d$ is a constant that depends on the $x_i$ but does not depend on $\theta$. Since the denominator term in Bayes' rule does not depend on $\theta$ either, we conclude that the posterior PDF has the form
    $$f_{\Theta|X}(\theta|x) = a \cdot \exp\bigg\{-\frac{(\theta - m)^2}{2v}\bigg\}$$
    for some normalizing constant $a$, which depends on the $x_i$ but not on $\theta$. We recognize this as a normal PDF, and we conclude that the posterior PDF is normal with mean $m$ and variance $v$.
  • As a special case, suppose that $\sigma_0^2, \sigma_1^2, \ldots, \sigma_n^2$ have a common value $\sigma^2$. Then, the posterior PDF of $\Theta$ is normal with mean and variance
    $$m = \frac{x_0 + \cdots + x_n}{n+1}, \qquad v = \frac{\sigma^2}{n+1},$$
    respectively. In this case, the prior mean $x_0$ acts just as another observation, and contributes equally to determining the posterior mean $m$ of $\Theta$. Notice also that the standard deviation of the posterior PDF of $\Theta$ tends to zero, at the rough rate of $1/\sqrt{n}$, as the number of observations increases.
  • If the variances $\sigma_i^2$ are different, the formula for the posterior mean $m$ is still a weighted average of the $x_i$, but with larger weights on the observations with smaller variances.
  • The preceding example has the remarkable property that the posterior distribution of $\Theta$ is in the same family as the prior distribution, namely, the family of normal distributions. This is appealing for two reasons:
    • (a) The posterior can be characterized in terms of only two numbers, the mean and the variance.
    • (b) The form of the solution opens up the possibility of efficient recursive inference. Suppose that after $X_1, \ldots, X_n$ are observed, an additional observation $X_{n+1}$ is obtained. Instead of solving the inference problem from scratch, we can view $f_{\Theta|X_1,\ldots,X_n}$ as our prior, and use the new observation to obtain the new posterior $f_{\Theta|X_1,\ldots,X_n,X_{n+1}}$. It is then plausible (and can be formally established) that the new posterior of $\Theta$ will have mean
      $$\frac{(m/v) + (x_{n+1}/\sigma_{n+1}^2)}{(1/v) + (1/\sigma_{n+1}^2)}$$
      and variance
      $$\frac{1}{(1/v) + (1/\sigma_{n+1}^2)},$$
      where $m$ and $v$ are the mean and variance of the old posterior $f_{\Theta|X_1,\ldots,X_n}$.
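
These formulas are easy to check numerically. The sketch below is my own illustration (the prior mean, prior variance, observations, and observation variances are made-up values): it computes the posterior mean and variance in batch form, treating the prior as observation $i = 0$, and verifies that the recursive one-observation update gives the same answer.

```python
import numpy as np

def normal_posterior(x0, var0, x, variances):
    """Posterior mean m and variance v for Example 8.3; the prior N(x0, var0)
    enters the formulas exactly like one additional observation (index i = 0)."""
    xs = np.concatenate(([x0], np.asarray(x, dtype=float)))
    vs = np.concatenate(([var0], np.asarray(variances, dtype=float)))
    precision = np.sum(1.0 / vs)               # sum_{i=0}^{n} 1 / sigma_i^2
    return np.sum(xs / vs) / precision, 1.0 / precision

def recursive_update(m, v, x_new, var_new):
    """Fold one more observation into an existing normal posterior with mean m, variance v."""
    new_precision = 1.0 / v + 1.0 / var_new
    return (m / v + x_new / var_new) / new_precision, 1.0 / new_precision

m, v = normal_posterior(x0=0.0, var0=1.0, x=[1.2, 0.8], variances=[0.5, 0.5])
print(recursive_update(m, v, x_new=1.0, var_new=0.5))                # recursive answer
print(normal_posterior(0.0, 1.0, [1.2, 0.8, 1.0], [0.5, 0.5, 0.5]))  # batch answer (identical)
```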

Example 8.4. Beta Priors on the Bias of a Coin.
We wish to estimate the probability of heads, denoted by $\theta$, of a biased coin. We model $\theta$ as the value of a random variable $\Theta$ with known prior PDF $f_\Theta$. We consider $n$ independent tosses and let $X$ be the number of heads observed.

  • From Bayes' rule, the posterior PDF of $\Theta$ has the form, for $\theta \in [0, 1]$,
    $$f_{\Theta|X}(\theta|k) = c\, f_\Theta(\theta)\, p_{X|\Theta}(k|\theta) = d\, f_\Theta(\theta)\, \theta^k (1-\theta)^{n-k},$$
    where $c$ is a normalizing constant (independent of $\theta$), and $d = c \binom{n}{k}$.
  • Suppose now that the prior is a beta density with integer parameters $\alpha > 0$ and $\beta > 0$, of the form
    $$f_\Theta(\theta) = \begin{cases} \dfrac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)}, & \text{if } 0 \leq \theta \leq 1, \\ 0, & \text{otherwise}, \end{cases}$$
    where $B(\alpha, \beta)$ is a normalizing constant, known as the Beta function, given by
    $$B(\alpha, \beta) = \int_0^1 \theta^{\alpha-1}(1-\theta)^{\beta-1}\, d\theta = \frac{(\alpha-1)!\,(\beta-1)!}{(\alpha+\beta-1)!};$$
    the last equality can be obtained by integration by parts, or through a probabilistic argument (see Problem 30). Then, the posterior PDF of $\Theta$ is of the form
    $$f_{\Theta|X}(\theta|k) = \frac{d}{B(\alpha, \beta)}\, \theta^{k+\alpha-1}(1-\theta)^{n-k+\beta-1}, \qquad 0 \leq \theta \leq 1,$$
    and hence is a beta density with parameters
    $$\alpha' = k + \alpha, \qquad \beta' = n - k + \beta.$$
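
In other words, once a beta prior is adopted, the entire inference reduces to adding the observed counts to the prior parameters. A minimal sketch (the prior parameters and toss counts below are made-up values):

```python
def beta_posterior_params(alpha, beta, k, n):
    """Example 8.4 update: a Beta(alpha, beta) prior and k heads in n tosses
    give a Beta(alpha + k, beta + n - k) posterior."""
    return alpha + k, beta + n - k

# Hypothetical numbers: Beta(2, 2) prior, 7 heads observed in 10 tosses.
a_post, b_post = beta_posterior_params(2, 2, k=7, n=10)
print(a_post, b_post)              # posterior is Beta(9, 5)
print(a_post / (a_post + b_post))  # posterior mean of Theta: 9/14, approximately 0.643
```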


Reposted from blog.csdn.net/weixin_42437114/article/details/114039346