Model: $y = F(x) + v$, where $F(x)$ is the oracle model: it is fixed and does not change with the training data, and $v$ is additive white noise with variance $\sigma_v^2$. (Note: the noise does not have to be Gaussian, but it does have to be white.) That means, for any $x_0$, we have $E_{y|x}[y_0 \mid x_0] = F(x_0)$, where $(x_0, y_0)$ can be viewed as a test data point.
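As a quick sanity check of the model assumptions, the sketch below simulates $y_0 = F(x_0) + v$ at a fixed test input and verifies that the conditional mean is $F(x_0)$ and the conditional variance is $\sigma_v^2$. The oracle $F(x) = \sin(2\pi x)$ and the uniform noise are illustrative choices, not from the text; the uniform noise emphasizes that Gaussianity is not required.

```python
import numpy as np

# Toy instance of y = F(x) + v: F is a fixed "oracle" function and v is
# zero-mean white noise (uniform here, deliberately non-Gaussian).
rng = np.random.default_rng(0)

def F(x):
    return np.sin(2 * np.pi * x)   # hypothetical oracle, fixed once and for all

sigma_v = 0.3
x0 = 0.25

# Many draws of y0 at the same test input x0.
# Uniform on [-a, a] has variance a^2/3, so a = sqrt(3)*sigma_v gives sigma_v^2.
v = rng.uniform(-np.sqrt(3) * sigma_v, np.sqrt(3) * sigma_v, size=100_000)
y0 = F(x0) + v

print(abs(y0.mean() - F(x0)))        # E[y0 | x0] ~ F(x0), so this is near 0
print(abs(y0.var() - sigma_v**2))    # Var[y0 | x0] ~ sigma_v^2, also near 0
```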
The expected loss of a predictor $\hat f$ is taken w.r.t. $x_0$ and $y_0$ (it can be interpreted as an expectation w.r.t. the test data):

$$
\begin{aligned}
E_{x,y}[(y_0 - \hat f(x_0))^2]
&= E_{x,y}[(y_0 - F(x_0) + F(x_0) - \hat f(x_0))^2] \\
&= \underbrace{E_{x,y}[(y_0 - F(x_0))^2]}_{(1)\;=\;\sigma_v^2}
 + \underbrace{E_{x,y}[(F(x_0) - \hat f(x_0))^2]}_{(2)\;\text{(important)}}
 + \underbrace{2\,E_{x,y}[(y_0 - F(x_0))(F(x_0) - \hat f(x_0))]}_{(3)\;=\;0}
\end{aligned}
$$

The cross term (3) can be written as:

$$
\begin{aligned}
E_{x,y}[(y_0 - F(x_0))(F(x_0) - \hat f(x_0))]
&= \int\!\!\int (y_0 - F(x_0))(F(x_0) - \hat f(x_0))\, p(y_0 \mid x_0)\, p(x_0)\, dy_0\, dx_0 \\
&= \int \big\{ E_{y|x}[(y_0 - F(x_0))] \big\}\,(F(x_0) - \hat f(x_0))\, p(x_0)\, dx_0 \\
&= 0
\end{aligned}
$$

The question that confused me here: once $x_0$ is fixed, why is $\hat f(x_0)$ a constant with respect to $E_{y|x}$? Because $\hat f$ is the model obtained from the training data, it depends only on the training set $X$. Once $\hat f$ is fixed, $\hat f(x_0)$ depends only on the input $x_0$ you plug in, so it is independent of $y_0$. With this relationship understood, the formulas that follow come naturally. The derivation below makes this point more intuitive.
Another way to think about the above equation:

$$
\begin{aligned}
E_{x,y}[(y_0 - F(x_0))(F(x_0) - \hat f(x_0))]
&= E_x\big\{ E_{y|x}[(y_0 - F(x_0))(F(x_0) - \hat f(x_0)) \mid x_0] \big\} \\
&= E_x\big\{ E_{y|x}[(y_0 - F(x_0)) \mid x_0]\,(F(x_0) - \hat f(x_0)) \big\} \\
&= E_x\big\{ (E_{y|x}[y_0 \mid x_0] - F(x_0))(F(x_0) - \hat f(x_0)) \big\}
 \qquad \text{(Note: since } E_{y|x}[y_0 \mid x_0] = F(x_0)\text{)} \\
&= 0
\end{aligned}
$$
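The vanishing cross term can also be checked numerically. The sketch below fixes one hypothetical predictor $\hat f$ (a cubic polynomial fit on a single training set; the oracle $F$, noise level, and fit degree are illustrative assumptions), then averages $(y_0 - F(x_0))(F(x_0) - \hat f(x_0))$ over fresh test points. Because $\hat f$ is fixed before the test data are drawn, the product averages to zero.

```python
import numpy as np

# Monte Carlo check that E_{x,y}[(y0 - F(x0))(F(x0) - fhat(x0))] = 0.
# Key point: fhat is fit once on training data, so fhat(x0) does not
# depend on y0 -- exactly the independence argued in the text.
rng = np.random.default_rng(1)

def F(x):
    return np.sin(2 * np.pi * x)   # hypothetical oracle model

sigma_v = 0.3

# One training set X -> one fixed predictor fhat (cubic polynomial fit).
x_tr = rng.uniform(0, 1, 30)
y_tr = F(x_tr) + rng.normal(0, sigma_v, 30)
fhat = np.poly1d(np.polyfit(x_tr, y_tr, 3))

# Fresh test points (x0, y0):
x0 = rng.uniform(0, 1, 200_000)
y0 = F(x0) + rng.normal(0, sigma_v, 200_000)

cross = np.mean((y0 - F(x0)) * (F(x0) - fhat(x0)))
print(cross)   # close to 0
```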
We will now analyze (2). More precisely, $\hat f(x_0) = \hat f(x_0, X)$, which depends on the training set $X$. Let's define $\bar f(x_0) = E_X[\hat f(x_0)]$, which does not depend on $X$. Then the term inside (2) can be rewritten as:
$$
\begin{aligned}
\underbrace{(F(x_0) - \hat f(x_0))^2}_{(\ast\ast\ast\ast\ast\ast)}
&= (F(x_0) - \bar f(x_0) + \bar f(x_0) - \hat f(x_0))^2 \\
&= \underbrace{(F(x_0) - \bar f(x_0))^2}_{(4)}
 + \underbrace{(\bar f(x_0) - \hat f(x_0))^2}_{(5)}
 + \underbrace{2(F(x_0) - \bar f(x_0))(\bar f(x_0) - \hat f(x_0))}_{(6)}
\end{aligned}
$$

The expectation will be taken w.r.t. the random training data set $X$; the cross term (6) can be written as:

$$
E_X[2(F(x_0) - \bar f(x_0))(\bar f(x_0) - \hat f(x_0))]
= 2(F(x_0) - \bar f(x_0))\, E_X[\bar f(x_0) - \hat f(x_0)] = 0
$$
Then we take the expectation of $(\ast\ast\ast\ast\ast\ast)$ w.r.t. $X$. (Note: this is very similar to the familiar $E[(\hat\theta - \theta)^2]$, and the analysis below helps us understand that formula better. Remember that $\hat\theta$ is a function of the training data, so the expectation in $E[(\hat\theta - \theta)^2]$ is w.r.t. the training data.)

$$
E_X[(F(x_0) - \hat f(x_0))^2]
= (F(x_0) - \bar f(x_0))^2 + E_X[(\bar f(x_0) - \hat f(x_0))^2]
= \text{Bias}^2 + \text{Variance}
$$
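The analogy to $E[(\hat\theta - \theta)^2]$ can be made concrete. The sketch below uses a hypothetical scalar example (not from the text): $\theta$ is the variance of a standard normal and $\hat\theta$ is the biased $1/n$ sample-variance estimator, so both bias and variance are nonzero. The MSE over many training sets should split exactly into squared bias plus variance.

```python
import numpy as np

# E[(theta_hat - theta)^2] = Bias^2 + Variance, expectation over training data.
# theta: true variance of N(0, 1); theta_hat: 1/n sample variance (biased).
rng = np.random.default_rng(2)

theta = 1.0
n = 10
trials = 200_000

samples = rng.normal(0, 1, size=(trials, n))        # many training sets
theta_hat = samples.var(axis=1, ddof=0)             # divides by n -> biased

mse = np.mean((theta_hat - theta) ** 2)
bias_sq = (theta_hat.mean() - theta) ** 2           # bias is -theta/n here
variance = theta_hat.var()

print(mse, bias_sq + variance)   # the two agree
```

Note that with the empirical mean and empirical variance (`ddof=0`) the identity holds exactly, not just approximately; only the estimates of bias and variance themselves carry Monte Carlo error.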
Putting (1), (2) and (3) together, we have the decomposition of the expected error:

$$
E_{X, x_0, y_0}[(y_0 - \hat f(x_0, X))^2]
= \underbrace{\sigma_v^2}_{\text{noise variance}}
+ \underbrace{\int (F(x_0) - \bar f(x_0))^2\, p(x_0)\, dx_0}_{\text{expected squared bias}}
+ \underbrace{\int E_X[(\bar f(x_0) - \hat f(x_0))^2]\, p(x_0)\, dx_0}_{\text{expected variance}}
$$
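The full decomposition can be verified end-to-end by simulation. The setup below is a hypothetical one: the oracle is $F(x) = \sin(2\pi x)$, each $\hat f$ is a degree-1 polynomial fit (underfitting, so the bias term is clearly visible), and a grid over $[0,1]$ stands in for the integral over $x_0$. The directly measured test error should match $\sigma_v^2 + \text{bias}^2 + \text{variance}$.

```python
import numpy as np

# Monte Carlo check of
#   E[(y0 - fhat(x0))^2] = sigma_v^2 + expected squared bias + expected variance
# with many independent training sets X.
rng = np.random.default_rng(3)

def F(x):
    return np.sin(2 * np.pi * x)   # hypothetical oracle model

sigma_v, n_train, n_sets = 0.3, 30, 2000
x_grid = np.linspace(0, 1, 400)    # stands in for the integral over x0

# Fit one linear model per training set, record its predictions on the grid.
preds = np.empty((n_sets, x_grid.size))
for i in range(n_sets):
    x_tr = rng.uniform(0, 1, n_train)
    y_tr = F(x_tr) + rng.normal(0, sigma_v, n_train)
    preds[i] = np.poly1d(np.polyfit(x_tr, y_tr, 1))(x_grid)

fbar = preds.mean(axis=0)                      # fbar(x0) = E_X[fhat(x0)]
bias_sq = np.mean((F(x_grid) - fbar) ** 2)     # expected squared bias
variance = np.mean(preds.var(axis=0))          # expected variance

# Direct estimate of the expected test error, fresh noise per model:
y0 = F(x_grid) + rng.normal(0, sigma_v, (n_sets, x_grid.size))
err = np.mean((y0 - preds) ** 2)

print(err, sigma_v**2 + bias_sq + variance)    # the two should be close
```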
Short summary: the data is divided into two parts, training and testing. The expected squared error can be viewed as the true (prediction) error on test data, and it decomposes into irreducible noise variance, expected squared bias, and expected variance; the bias and variance terms both arise from how the model is fit to the training data, while the noise term cannot be reduced by any predictor.