Bias-Variance+Noise Decomposition in Linear Regression

Model:
$$y = F(\mathbf{x}) + v$$
Here $F(\mathbf{x})$ can be regarded as the oracle model: it does not change when the training data changes.
where $v$ is additive *white* noise with variance $\sigma^2_v$. (Note: the noise does not have to be Gaussian, but it does have to be white.)
That means, for any $\mathbf{x}_0$, we have
$$E_{y|\mathbf{x}}[y_0|\mathbf{x}_0] = F(\mathbf{x}_0)$$
Here $(\mathbf{x}_0, y_0)$ can be regarded as a test data point.
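To make the setup concrete, here is a minimal simulation sketch of this data model; the particular `oracle_F`, the uniform noise, and the sample size are illustrative assumptions, not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)

def oracle_F(x):
    """Oracle model F(x): fixed, does not depend on any training data."""
    return 1.5 * x - 0.7

def sample_y(x, sigma_v=0.5):
    """y = F(x) + v with zero-mean white noise v of variance sigma_v^2.
    The noise is deliberately non-Gaussian (uniform) to stress that
    whiteness, not Gaussianity, is what the decomposition needs."""
    half_width = sigma_v * np.sqrt(3.0)   # Var(Uniform(-a, a)) = a^2 / 3
    v = rng.uniform(-half_width, half_width, size=np.shape(x))
    return oracle_F(x) + v

# E[y0 | x0] = F(x0): averaging many noisy y0 at a fixed x0 recovers F(x0)
x0 = 2.0
print(sample_y(np.full(100_000, x0)).mean(), oracle_F(x0))
```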

The expected loss of a predictor $\hat{f}$ is taken w.r.t. $\mathbf{x}_0$ and $y_0$ (it can be interpreted as an expectation w.r.t. the test data):
$$\begin{aligned} E_{\mathbf{x},y}[(y_0-\hat{f}(\mathbf{x}_0))^2] &=E_{\mathbf{x},y}[(y_0-F(\mathbf{x}_0) + F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))^2] \\ &=E_{\mathbf{x}, y}[(y_0 - F(\mathbf{x}_0))^2] \qquad (1)\ (=\sigma^2_v)\\ &\quad+E_{\mathbf{x},y}[(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))^2] \qquad (2)\ (\text{important})\\ &\quad+2E_{\mathbf{x},y}[(y_0-F(\mathbf{x}_0))(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))] \qquad (3)\ (=0) \end{aligned}$$
The cross term (3) can be written as:
$$\begin{aligned} & E_{\mathbf{x},y}[(y_0-F(\mathbf{x}_0))(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))] \\ &=\int\!\!\int(y_0-F(\mathbf{x}_0))(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))\,p(y_0|\mathbf{x}_0)p(\mathbf{x}_0)\,dy_0\,d\mathbf{x}_0\\ &=\int\{E_{y|\mathbf{x}}[y_0-F(\mathbf{x}_0)]\}(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))\,p(\mathbf{x}_0)\,d\mathbf{x}_0 \\ &=0 \end{aligned}$$
The point that may be confusing here is: once $\mathbf{x}_0$ is fixed, why is $\hat{f}(\mathbf{x}_0)$ a constant with respect to $E_{y|\mathbf{x}}$? Because $\hat{f}$ is the model obtained from the training data, it depends only on the training set $X$. Once $\hat{f}$ has been obtained, $\hat{f}(\mathbf{x}_0)$ depends only on the input $\mathbf{x}_0$ you plug in, so it is independent of $y_0$. With this relationship understood, the remaining formulas follow easily. The derivation below makes the same point more directly.

Another way to think about the above equation:
$$\begin{aligned} & E_{\mathbf{x},y}[(y_0-F(\mathbf{x}_0))(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))] \\ &=E_{\mathbf{x}}\{E_{y|\mathbf{x}}[(y_0-F(\mathbf{x}_0))(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))|\mathbf{x}_0]\} \\ &=E_{\mathbf{x}}\{E_{y|\mathbf{x}}[(y_0-F(\mathbf{x}_0))|\mathbf{x}_0]\,(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))\} \\ &= E_{\mathbf{x}}\{(E_{y|\mathbf{x}}[y_0|\mathbf{x}_0]-F(\mathbf{x}_0))(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))\} \\ &=0 \qquad (\text{since } E_{y|\mathbf{x}}[y_0|\mathbf{x}_0] = F(\mathbf{x}_0)) \end{aligned}$$
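A quick Monte Carlo sanity check of this argument: for any predictor $\hat{f}$ that is held fixed once training is done, the cross term (3) averages to (approximately) zero. The oracle, the predictor, and the input distribution below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

def F(x):          # oracle model (illustrative choice)
    return 1.5 * x - 0.7

def f_hat(x):      # some fixed predictor, e.g. one already fit on training data
    return 1.2 * x - 0.3

sigma_v, n = 0.5, 200_000
x0 = rng.normal(size=n)                                   # test inputs drawn from p(x)
half_width = sigma_v * np.sqrt(3.0)                       # uniform white noise with variance sigma_v^2
y0 = F(x0) + rng.uniform(-half_width, half_width, size=n)

cross = np.mean((y0 - F(x0)) * (F(x0) - f_hat(x0)))       # term (3)
print(cross)   # close to 0: the noise y0 - F(x0) does not depend on F(x0) - f_hat(x0)
```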

We will now analyze term (2). To be precise, $\hat{f}(\mathbf{x}_0)=\hat{f}(\mathbf{x}_0,X)$, which **depends** on $X$. Let us define $\bar{f}(\mathbf{x}_0)=E_X[\hat{f}(\mathbf{x}_0)]$, which does **not depend** on $X$. Then the term inside (2) can be rewritten as:

$$\begin{aligned} & (F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))^2 \qquad (\ast)\\ &= (F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0) + \bar{f}(\mathbf{x}_0) -\hat{f}(\mathbf{x}_0))^2 \\ &= (F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0))^2 \qquad (4)\\ &\quad+(\bar{f}(\mathbf{x}_0) -\hat{f}(\mathbf{x}_0))^2 \qquad (5)\\ &\quad+2(F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0))(\bar{f}(\mathbf{x}_0) -\hat{f}(\mathbf{x}_0)) \qquad (6) \end{aligned}$$
The **expectation** will now be taken w.r.t. the random **training** data set $X$. The cross term (6) can be written as:
$$\begin{aligned} E_X[2(F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0))(\bar{f}(\mathbf{x}_0) -\hat{f}(\mathbf{x}_0))] &= 2(F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0))\,E_X[\bar{f}(\mathbf{x}_0) -\hat{f}(\mathbf{x}_0)] \\ &=0 \end{aligned}$$
since $F(\mathbf{x}_0)-\bar{f}(\mathbf{x}_0)$ does not depend on $X$ and $E_X[\hat{f}(\mathbf{x}_0)] = \bar{f}(\mathbf{x}_0)$ by definition.

Then we take the expectation of $(\ast)$ w.r.t. $X$:
(Note: this part is very similar to the familiar expression $E[(\hat{\theta}-\theta)^2]$, and the analysis below helps in understanding that formula as well. Remember that $\hat{\theta}$ is a random quantity that depends on the training data, so the expectation in $E[(\hat{\theta}-\theta)^2]$ is taken w.r.t. the training data.)
$$\begin{aligned} E_X[(F(\mathbf{x}_0) - \hat{f}(\mathbf{x}_0))^2]&=(F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0))^2 + E_X[(\bar{f}(\mathbf{x}_0) -\hat{f}(\mathbf{x}_0))^2]\\ &=\mathrm{Bias}^2 + \mathrm{Variance} \end{aligned}$$
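A sketch of how this identity can be checked numerically for a single fixed test input $\mathbf{x}_0$: retrain $\hat{f}$ on many independently drawn training sets, estimate $\bar{f}(\mathbf{x}_0)$ by averaging the predictions, and compare $\mathrm{Bias}^2 + \mathrm{Variance}$ against $E_X[(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))^2]$. The sine oracle and the least-squares line are illustrative assumptions (the line is deliberately misspecified so the bias is nonzero):

```python
import numpy as np

rng = np.random.default_rng(2)

def F(x):                                   # illustrative oracle; nonlinear, so a line is biased
    return np.sin(2 * x)

sigma_v, n_train, n_repeats = 0.3, 30, 2000
x0 = 1.0                                    # fixed test input

preds = np.empty(n_repeats)
for r in range(n_repeats):
    x = rng.uniform(-2, 2, n_train)         # fresh training set X each repetition
    y = F(x) + rng.normal(0, sigma_v, n_train)
    w = np.polyfit(x, y, deg=1)             # f_hat: least-squares line, depends on X
    preds[r] = np.polyval(w, x0)            # f_hat(x0, X)

f_bar = preds.mean()                        # estimate of f_bar(x0) = E_X[f_hat(x0)]
bias2 = (F(x0) - f_bar) ** 2
variance = preds.var()
print(bias2 + variance, np.mean((F(x0) - preds) ** 2))   # both sides of the identity agree
```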

Putting (1), (2), and (3) all together, we have the decomposition of the expected error:
$$\begin{aligned} E_{X, \mathbf{x}_0, y_0}[(y_0 - \hat{f}(\mathbf{x}_0,X))^2] &= \sigma^2_v \qquad \text{(noise variance)} \\ &\quad+\int(F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0))^2\,p(\mathbf{x}_0)\,d\mathbf{x}_0 \qquad \text{(expected squared bias)} \\ &\quad+ \int E_X[(\bar{f}(\mathbf{x}_0) -\hat{f}(\mathbf{x}_0))^2]\,p(\mathbf{x}_0)\,d\mathbf{x}_0 \qquad \text{(expected variance)} \end{aligned}$$
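Finally, a Monte Carlo sketch of the full decomposition, now averaging over test points drawn from $p(\mathbf{x}_0)$ as well as over training sets. The oracle, the input distribution, and the linear least-squares fit are the same illustrative assumptions as in the previous snippet:

```python
import numpy as np

rng = np.random.default_rng(3)

def F(x):                                    # illustrative oracle
    return np.sin(2 * x)

sigma_v, n_train, n_repeats, n_test = 0.3, 30, 1000, 400
x_test = rng.uniform(-2, 2, n_test)          # test inputs, i.e. draws from p(x0)

preds = np.empty((n_repeats, n_test))
errs = np.empty(n_repeats)
for r in range(n_repeats):
    x = rng.uniform(-2, 2, n_train)          # fresh training set X
    y = F(x) + rng.normal(0, sigma_v, n_train)
    w = np.polyfit(x, y, deg=1)              # f_hat trained on this set
    preds[r] = np.polyval(w, x_test)
    y_test = F(x_test) + rng.normal(0, sigma_v, n_test)   # fresh test labels
    errs[r] = np.mean((y_test - preds[r]) ** 2)           # squared error for this training set

f_bar = preds.mean(axis=0)                                # E_X[f_hat(x0)] per test point
noise = sigma_v ** 2
exp_bias2 = np.mean((F(x_test) - f_bar) ** 2)             # expected squared bias
exp_var = np.mean(preds.var(axis=0))                      # expected variance
print(errs.mean(), noise + exp_bias2 + exp_var)           # the two sides match closely
```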

Short summary: the data is divided into two parts, training and testing. The expected squared error above is the true (prediction) error on a fresh test point; the expectation runs over both the training data $X$ and the test point $(\mathbf{x}_0, y_0)$, and it decomposes into noise, squared bias, and variance.

Reference: https://web.archive.org/web/20140821063842/http://ttic.uchicago.edu/~gregory/courses/wis-ml2012/lectures/biasVarDecom.pdf
