Bias-Variance+Noise Decomposition in Linear Regression

Model:
$$y = F(\mathbf{x}) + v$$
Here $F(\mathbf{x})$ can be regarded as the oracle model: it does not change when the training data changes.
where $v$ is additive *white* noise with variance $\sigma^2_v$. (Note: the noise does not have to be Gaussian, but it does have to be white.)
That means, for any $\mathbf{x}_0$, we have
$$E_{y|\mathbf{x}}[y_0|\mathbf{x}_0] = F(\mathbf{x}_0)$$
Here $(\mathbf{x}_0, y_0)$ can be regarded as a test data point.
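To make the setup concrete, here is a minimal simulation sketch of this data model; the particular `oracle_F`, the uniform noise, and the sample size are illustrative assumptions, not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)

def oracle_F(x):
    """Oracle model F(x): fixed, does not depend on any training data."""
    return 1.5 * x - 0.7

def sample_y(x, sigma_v=0.5):
    """y = F(x) + v with zero-mean white noise v of variance sigma_v^2.
    The noise is deliberately non-Gaussian (uniform) to stress that
    whiteness, not Gaussianity, is what the decomposition needs."""
    half_width = sigma_v * np.sqrt(3.0)   # Var(Uniform(-a, a)) = a^2 / 3
    v = rng.uniform(-half_width, half_width, size=np.shape(x))
    return oracle_F(x) + v

# E[y0 | x0] = F(x0): averaging many noisy y0 at a fixed x0 recovers F(x0)
x0 = 2.0
print(sample_y(np.full(100_000, x0)).mean(), oracle_F(x0))
```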

The expected loss of a predictor $\hat{f}$ is taken w.r.t. $\mathbf{x}_0$ and $y_0$ (it can be interpreted as an expectation w.r.t. the test data):
$$\begin{aligned} E_{\mathbf{x},y}[(y_0-\hat{f}(\mathbf{x}_0))^2] &=E_{\mathbf{x},y}[(y_0-F(\mathbf{x}_0) + F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))^2] \\ &=E_{\mathbf{x}, y}[(y_0 - F(\mathbf{x}_0))^2] \qquad (1)\ (=\sigma^2_v)\\ &\quad+E_{\mathbf{x},y}[(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))^2] \qquad (2)\ (\text{important})\\ &\quad+2E_{\mathbf{x},y}[(y_0-F(\mathbf{x}_0))(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))] \qquad (3)\ (=0) \end{aligned}$$
The cross term (3) can be written as:
$$\begin{aligned} & E_{\mathbf{x},y}[(y_0-F(\mathbf{x}_0))(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))] \\ &=\int\!\!\int(y_0-F(\mathbf{x}_0))(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))\,p(y_0|\mathbf{x}_0)p(\mathbf{x}_0)\,dy_0\,d\mathbf{x}_0\\ &=\int\{E_{y|\mathbf{x}}[y_0-F(\mathbf{x}_0)]\}(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))\,p(\mathbf{x}_0)\,d\mathbf{x}_0 \\ &=0 \end{aligned}$$
The point that may be confusing here is: once $\mathbf{x}_0$ is fixed, why is $\hat{f}(\mathbf{x}_0)$ a constant with respect to $E_{y|\mathbf{x}}$? Because $\hat{f}$ is the model obtained from the training data, it depends only on the training set $X$. Once $\hat{f}$ has been obtained, $\hat{f}(\mathbf{x}_0)$ depends only on the input $\mathbf{x}_0$ you plug in, so it is independent of $y_0$. With this relationship understood, the remaining formulas follow easily. The derivation below makes the same point more directly.

Another way to think about the above equation:
$$\begin{aligned} & E_{\mathbf{x},y}[(y_0-F(\mathbf{x}_0))(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))] \\ &=E_{\mathbf{x}}\{E_{y|\mathbf{x}}[(y_0-F(\mathbf{x}_0))(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))|\mathbf{x}_0]\} \\ &=E_{\mathbf{x}}\{E_{y|\mathbf{x}}[(y_0-F(\mathbf{x}_0))|\mathbf{x}_0]\,(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))\} \\ &= E_{\mathbf{x}}\{(E_{y|\mathbf{x}}[y_0|\mathbf{x}_0]-F(\mathbf{x}_0))(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))\} \\ &=0 \qquad (\text{since } E_{y|\mathbf{x}}[y_0|\mathbf{x}_0] = F(\mathbf{x}_0)) \end{aligned}$$
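A quick Monte Carlo sanity check of this argument: for any predictor $\hat{f}$ that is held fixed once training is done, the cross term (3) averages to (approximately) zero. The oracle, the predictor, and the input distribution below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

def F(x):          # oracle model (illustrative choice)
    return 1.5 * x - 0.7

def f_hat(x):      # some fixed predictor, e.g. one already fit on training data
    return 1.2 * x - 0.3

sigma_v, n = 0.5, 200_000
x0 = rng.normal(size=n)                                   # test inputs drawn from p(x)
half_width = sigma_v * np.sqrt(3.0)                       # uniform white noise with variance sigma_v^2
y0 = F(x0) + rng.uniform(-half_width, half_width, size=n)

cross = np.mean((y0 - F(x0)) * (F(x0) - f_hat(x0)))       # term (3)
print(cross)   # close to 0: the noise y0 - F(x0) does not depend on F(x0) - f_hat(x0)
```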

We will now analyze term (2). To be precise, $\hat{f}(\mathbf{x}_0)=\hat{f}(\mathbf{x}_0,X)$, which **depends** on $X$. Let us define $\bar{f}(\mathbf{x}_0)=E_X[\hat{f}(\mathbf{x}_0)]$, which does **not depend** on $X$. Then the term inside (2) can be rewritten as:

$$\begin{aligned} & (F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))^2 \qquad (\ast)\\ &= (F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0) + \bar{f}(\mathbf{x}_0) -\hat{f}(\mathbf{x}_0))^2 \\ &= (F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0))^2 \qquad (4)\\ &\quad+(\bar{f}(\mathbf{x}_0) -\hat{f}(\mathbf{x}_0))^2 \qquad (5)\\ &\quad+2(F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0))(\bar{f}(\mathbf{x}_0) -\hat{f}(\mathbf{x}_0)) \qquad (6) \end{aligned}$$
The **expectation** will now be taken w.r.t. the random **training** data set $X$. The cross term (6) can be written as:
$$\begin{aligned} E_X[2(F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0))(\bar{f}(\mathbf{x}_0) -\hat{f}(\mathbf{x}_0))] &= 2(F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0))\,E_X[\bar{f}(\mathbf{x}_0) -\hat{f}(\mathbf{x}_0)] \\ &=0 \end{aligned}$$
since $F(\mathbf{x}_0)-\bar{f}(\mathbf{x}_0)$ does not depend on $X$ and $E_X[\hat{f}(\mathbf{x}_0)] = \bar{f}(\mathbf{x}_0)$ by definition.

Then we take the expectation of $(\ast)$ w.r.t. $X$:
(Note: this part is very similar to the familiar expression $E[(\hat{\theta}-\theta)^2]$, and the analysis below helps in understanding that formula as well. Remember that $\hat{\theta}$ is a random quantity that depends on the training data, so the expectation in $E[(\hat{\theta}-\theta)^2]$ is taken w.r.t. the training data.)
$$\begin{aligned} E_X[(F(\mathbf{x}_0) - \hat{f}(\mathbf{x}_0))^2]&=(F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0))^2 + E_X[(\bar{f}(\mathbf{x}_0) -\hat{f}(\mathbf{x}_0))^2]\\ &=\mathrm{Bias}^2 + \mathrm{Variance} \end{aligned}$$
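A sketch of how this identity can be checked numerically for a single fixed test input $\mathbf{x}_0$: retrain $\hat{f}$ on many independently drawn training sets, estimate $\bar{f}(\mathbf{x}_0)$ by averaging the predictions, and compare $\mathrm{Bias}^2 + \mathrm{Variance}$ against $E_X[(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))^2]$. The sine oracle and the least-squares line are illustrative assumptions (the line is deliberately misspecified so the bias is nonzero):

```python
import numpy as np

rng = np.random.default_rng(2)

def F(x):                                   # illustrative oracle; nonlinear, so a line is biased
    return np.sin(2 * x)

sigma_v, n_train, n_repeats = 0.3, 30, 2000
x0 = 1.0                                    # fixed test input

preds = np.empty(n_repeats)
for r in range(n_repeats):
    x = rng.uniform(-2, 2, n_train)         # fresh training set X each repetition
    y = F(x) + rng.normal(0, sigma_v, n_train)
    w = np.polyfit(x, y, deg=1)             # f_hat: least-squares line, depends on X
    preds[r] = np.polyval(w, x0)            # f_hat(x0, X)

f_bar = preds.mean()                        # estimate of f_bar(x0) = E_X[f_hat(x0)]
bias2 = (F(x0) - f_bar) ** 2
variance = preds.var()
print(bias2 + variance, np.mean((F(x0) - preds) ** 2))   # both sides of the identity agree
```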

Putting (1), (2), and (3) all together, we have the decomposition of the expected error:
$$\begin{aligned} E_{X, \mathbf{x}_0, y_0}[(y_0 - \hat{f}(\mathbf{x}_0,X))^2] &= \sigma^2_v \qquad \text{(noise variance)} \\ &\quad+\int(F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0))^2\,p(\mathbf{x}_0)\,d\mathbf{x}_0 \qquad \text{(expected squared bias)} \\ &\quad+ \int E_X[(\bar{f}(\mathbf{x}_0) -\hat{f}(\mathbf{x}_0))^2]\,p(\mathbf{x}_0)\,d\mathbf{x}_0 \qquad \text{(expected variance)} \end{aligned}$$
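Finally, a Monte Carlo sketch of the full decomposition, now averaging over test points drawn from $p(\mathbf{x}_0)$ as well as over training sets. The oracle, the input distribution, and the linear least-squares fit are the same illustrative assumptions as in the previous snippet:

```python
import numpy as np

rng = np.random.default_rng(3)

def F(x):                                    # illustrative oracle
    return np.sin(2 * x)

sigma_v, n_train, n_repeats, n_test = 0.3, 30, 1000, 400
x_test = rng.uniform(-2, 2, n_test)          # test inputs, i.e. draws from p(x0)

preds = np.empty((n_repeats, n_test))
errs = np.empty(n_repeats)
for r in range(n_repeats):
    x = rng.uniform(-2, 2, n_train)          # fresh training set X
    y = F(x) + rng.normal(0, sigma_v, n_train)
    w = np.polyfit(x, y, deg=1)              # f_hat trained on this set
    preds[r] = np.polyval(w, x_test)
    y_test = F(x_test) + rng.normal(0, sigma_v, n_test)   # fresh test labels
    errs[r] = np.mean((y_test - preds[r]) ** 2)           # squared error for this training set

f_bar = preds.mean(axis=0)                                # E_X[f_hat(x0)] per test point
noise = sigma_v ** 2
exp_bias2 = np.mean((F(x_test) - f_bar) ** 2)             # expected squared bias
exp_var = np.mean(preds.var(axis=0))                      # expected variance
print(errs.mean(), noise + exp_bias2 + exp_var)           # the two sides match closely
```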

Short summary: the data is divided into two parts, training and testing. The expected squared error above is the true (prediction) error on a fresh test point; the expectation runs over both the training data $X$ and the test point $(\mathbf{x}_0, y_0)$, and it decomposes into noise, squared bias, and variance.

Reference: https://web.archive.org/web/20140821063842/http://ttic.uchicago.edu/~gregory/courses/wis-ml2012/lectures/biasVarDecom.pdf
