Statistical estimation

The probability distribution that generated the data at hand is unknown, so we can only estimate the underlying probability distribution from a set of samples.

Basics

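1. Estimator $\hat\mu$: a quantity computed from the samples. For example, the estimator of the expectation is defined as:
  $\hat\mu=\frac{1}{N}\sum_{i=1}^Nx_i \tag{1}$
  Strictly speaking, an estimator is a function of all the samples ${\{x_i\}}_{i=1}^N$ and is therefore a random variable.
1. Estimate: the value the estimator takes on one particular set of samples.
1. Two approaches to statistical estimation
  - Parametric model $g(x;\theta)$: a probability density (or mass) function together with a finite-dimensional parameter $\theta$.
  - Nonparametric method: a model with no parameters, or one with infinitely many parameters.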
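
The distinction between an estimator (a random variable) and an estimate (one realized value) can be sketched numerically. The snippet below is only an illustration: the Gaussian source, the values `true_mu`, `true_sigma`, `N`, the seed, and the helper name `mean_estimator` are assumptions made for this example, not part of the notes.

```python
import random
import statistics

def mean_estimator(samples):
    """Sample-mean estimator from equation (1): (1/N) * sum of x_i."""
    return sum(samples) / len(samples)

rng = random.Random(0)                 # fixed seed, chosen only for reproducibility
true_mu, true_sigma, N = 2.0, 1.0, 50  # hypothetical ground truth and sample size

# Every fresh sample set produces a different estimate of the same quantity,
# which is why the estimator itself is a random variable.
estimates = []
for _ in range(1000):
    samples = [rng.gauss(true_mu, true_sigma) for _ in range(N)]
    estimates.append(mean_estimator(samples))

print("mean of the estimates:", statistics.mean(estimates))
print("spread of the estimates:", statistics.stdev(estimates))
```

Under these assumptions the estimates scatter around `true_mu`, with a spread close to `true_sigma / sqrt(N)`.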

In what follows, assume the samples $\mathcal D={\{x_i\}}_{i=1}^N$ are drawn i.i.d. from $f(x)$.

Point estimation

Parametric estimation:

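Maximum likelihood estimation (MLE) chooses the parameter value under which the samples we actually observed are most likely to have been generated: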

$L(\theta)=\prod_{i=1}^Ng(x_i;\theta)\tag{2}$

$\hat\theta_{ML}=\arg\max_{\theta} L(\theta) \tag{3}$
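
As a rough illustration of the likelihood and its maximizer, the sketch below assumes a Bernoulli model $g(x;\theta)=\theta^x(1-\theta)^{1-x}$ and a made-up list of coin flips; neither comes from the notes. It maximizes the log-likelihood rather than $L(\theta)$ itself, which gives the same maximizer because the logarithm is monotone, and uses a plain grid search instead of an optimizer.

```python
import math

def log_likelihood(theta, samples):
    """log L(theta) = sum_i log g(x_i; theta) for a Bernoulli model."""
    return sum(math.log(theta) if x == 1 else math.log(1.0 - theta) for x in samples)

samples = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # made-up coin flips, 7 heads out of 10

# Candidate parameter values strictly inside (0, 1) so the logs stay finite.
grid = [k / 1000 for k in range(1, 1000)]
theta_ml = max(grid, key=lambda t: log_likelihood(t, samples))

print(theta_ml)                     # grid maximizer of the likelihood
print(sum(samples) / len(samples))  # closed-form Bernoulli MLE: the sample mean
```

For this model the grid maximizer should agree with the sample mean, which is the known closed-form MLE of a Bernoulli parameter.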

In MLE the parameter $\theta$ is treated as a deterministic variable, whereas in Bayesian inference $\theta$ is treated as a random variable, which brings in the following quantities:

$\text{Prior probability}: p(\theta) \tag{4}$

$\text{Likelihood}: p(\mathcal D \vert \theta ) \tag{5}$

$\text{Posterior probability}: p(\theta\vert \mathcal D) \tag{6}$
From the Bayesian viewpoint, maximizing the likelihood (5) can be rewritten using the definition of conditional probability:

$\arg\max_{\theta}p(\mathcal D \vert \theta )=\arg\max_{\theta}\frac{p(\theta, \mathcal D)}{p(\theta)}=\arg\max_{\theta}\frac{p(\theta \vert \mathcal D)p(\mathcal D)}{p(\theta)} \tag{7}$
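Because $p(\mathcal D)$ does not depend on $\theta$, MLE coincides with the posterior mode whenever the prior $p(\theta)$ is constant in $\theta$. More generally, Bayesian inference offers two point estimates, the posterior expectation and the posterior mode: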

$\text{Posterior expectation}: \int\theta p(\theta\vert \mathcal D)d\theta \tag{8}$

$\text{Posterior mode}: \arg\max_{\theta}p(\theta \vert \mathcal D) \tag{9}$

Equation (9) is also known as maximum a posteriori probability (MAP) estimation.

The posterior probability itself is computed from Bayes' theorem:

$p(\theta \vert \mathcal D)=\frac{p(\mathcal D \vert \theta)p(\theta)}{p(\mathcal D)}=\frac{p(\mathcal D \vert \theta)p(\theta)}{\int p(\mathcal D \vert \theta^{'})p(\theta^{'})d\theta^{'}} \tag{10}$
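
The posterior in (10) and the two point estimates (8) and (9) can be approximated on a grid. The following is a minimal sketch under stated assumptions: a Bernoulli likelihood, a Beta(2, 2) prior written without its normalizing constant, made-up coin flips, and a midpoint Riemann sum standing in for the integrals. None of these choices come from the notes.

```python
samples = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # made-up coin flips, 7 heads out of 10
heads = sum(samples)
tails = len(samples) - heads

def likelihood(theta):
    """p(D | theta) for i.i.d. Bernoulli samples."""
    return theta ** heads * (1.0 - theta) ** tails

def prior(theta):
    """Unnormalized Beta(2, 2) prior; an assumption made for this sketch."""
    return theta * (1.0 - theta)

# Midpoint grid on (0, 1); the integral in (10) becomes a Riemann sum.
M = 10000
width = 1.0 / M
grid = [(k + 0.5) * width for k in range(M)]

unnormalized = [likelihood(t) * prior(t) for t in grid]
# Approximates p(D) up to the prior's constant, which cancels in the ratio below.
evidence = sum(unnormalized) * width
posterior = [u / evidence for u in unnormalized]  # p(theta | D) on the grid

# Equation (8): posterior expectation; equation (9): posterior mode (MAP).
posterior_mean = sum(t * p for t, p in zip(grid, posterior)) * width
posterior_mode = grid[max(range(M), key=lambda k: posterior[k])]

print(posterior_mean, posterior_mode)
```

With this conjugate pair the exact posterior is Beta(9, 5), so the grid results can be checked against its mean 9/14 and mode 2/3. Both differ from the maximum likelihood value 0.7 because the prior pulls the estimate toward 0.5.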

Nonparametric estimation:

Kernel density estimation (KDE): approximate the density function $f(x)$ with kernel functions centered on the samples $\mathcal D = {\{x_i\}}_{i=1}^N$:

$\hat f_{KDE}(x)=\frac{1}{N}\sum_{i=1}^NK(x,x_i),\tag{11}$
where $K(x,x^{'})$ is the kernel function.
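
Equation (11) can be sketched directly once a kernel is chosen. The notes leave $K$ unspecified, so the Gaussian kernel, the bandwidth `h`, and the sample values below are assumptions made purely for illustration.

```python
import math

def gaussian_kernel(x, x_i, h):
    """Gaussian kernel K(x, x_i) with bandwidth h; integrates to 1 over x."""
    z = (x - x_i) / h
    return math.exp(-0.5 * z * z) / (h * math.sqrt(2.0 * math.pi))

def kde(x, samples, h):
    """Equation (11): average of the kernels centered at each sample."""
    return sum(gaussian_kernel(x, x_i, h) for x_i in samples) / len(samples)

samples = [-1.2, -0.4, 0.0, 0.3, 0.9, 1.5]  # made-up one-dimensional data
h = 0.5                                      # bandwidth, picked by hand

for x in (-1.0, 0.0, 1.0):
    print(x, kde(x, samples, h))
```

Because each kernel integrates to one, their average does too, so the estimate is itself a valid density. The bandwidth controls smoothness: a smaller `h` follows individual samples more closely, a larger one blurs them together.
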
Nearest neighbor density estimation (NNDE)