Machine Learning Whiteboard Derivation Series 3: Linear Regression

(Series 3) Linear Regression 1 - Least Squares and Its Geometric Interpretation

Least squares: matrix formulation and geometric interpretation
Probabilistic view: least squares is equivalent to maximum likelihood estimation under Gaussian noise
With regularization: L1 -> Lasso, L2 -> Ridge regression

Assume the dataset is $D=\{(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N)\}$, where $x_{i} \in \mathbb{R}^{p}$, $y_{i} \in \mathbb{R}$, $i=1,2, \cdots, N$.
In matrix form the data is
$X=\left(x_{1}\ x_{2}\ \cdots\ x_{N}\right)^{\top}=\left[\begin{array}{c} x_{1}^{\top} \\ x_{2}^{\top} \\ \vdots \\ x_{N}^{\top} \end{array}\right]=\left[\begin{array}{cccc} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Np} \end{array}\right]_{N \times p}$
Each $y_i$ is a one-dimensional real value:
$Y=\left(\begin{array}{c} y_{1} \\ y_{2} \\ \vdots \\ y_{N} \end{array}\right)_{N \times 1}$


Define the fitted line $f(w)=w^{\top} x$, where $w$ is a $p$-dimensional column vector; the bias $b$ is implicitly absorbed into $w$ as a component $w_0$.

Least squares estimation:
Loss function:
$L(w)=\sum_{i=1}^{N}\left\|w^{\top} x_{i}-y_{i}\right\|^{2}=\sum_{i=1}^{N}\left(w^{\top} x_{i}-y_{i}\right)^{2}$
$=\left(\begin{array}{cccc}w^{\top} x_{1}-y_{1} & w^{\top} x_{2}-y_{2} & \cdots & w^{\top} x_{N}-y_{N}\end{array}\right)\left(\begin{array}{c} w^{\top} x_{1}-y_{1} \\ w^{\top} x_{2}-y_{2} \\ \vdots \\ w^{\top} x_{N}-y_{N} \end{array}\right)$
(that is, an inner product of two vectors)
The row vector on the left can be rewritten further:
$\begin{aligned} &\left(\begin{array}{cccc}w^{\top} x_{1} & w^{\top} x_{2} & \cdots & w^{\top} x_{N}\end{array}\right)-\left(\begin{array}{cccc}y_{1} & y_{2} & \cdots & y_{N}\end{array}\right) \\ =\ & w^{\top}\left(\begin{array}{cccc}x_{1} & x_{2} & \cdots & x_{N}\end{array}\right)-\left(\begin{array}{cccc}y_{1} & y_{2} & \cdots & y_{N}\end{array}\right) \\ =\ & w^{\top} X^{\top}-Y^{\top} \end{aligned}$
Likewise, the column vector on the right is the transpose of this row vector, which is $Xw-Y$; note how transposing swaps the order so that $X$ comes before $w$.

Hence $L(w)=\left(w^{\top} X^{\top}-Y^{\top}\right)(X w-Y)$

Expanding further to make differentiation easier:
$=w^{\top} X^{\top} X w-Y^{\top} X w-w^{\top} X^{\top} Y+Y^{\top} Y$
Note that every term here is a scalar, so the two middle terms are equal (each is the transpose of the other):
$=w^{\top} X^{\top} X w-2 w^{\top} X^{\top} Y+Y^{\top} Y$

The estimate we want is $\hat{w}=\arg \min L(w)$.
Differentiate $L(w)$ with respect to $w$, using the rules of matrix calculus ($\frac{\partial}{\partial w} w^{\top} A w=2 A w$ for symmetric $A$, and $\frac{\partial}{\partial w} w^{\top} b=b$):
$\frac{\partial L(w)}{\partial w}=2 X^{\top} X w-2 X^{\top} Y \triangleq 0$
$\Rightarrow X^{\top} X w=X^{\top} Y$
$\Rightarrow w=\left(X^{\top} X\right)^{-1} X^{\top} Y$
This is the solution: the least squares estimate in matrix form. The matrix $\left(X^{\top} X\right)^{-1} X^{\top}$ is called the pseudo-inverse, written $X^{+}$.
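To make the closed form concrete, here is a minimal NumPy sketch (synthetic data; all variable names are hypothetical) that solves the normal equations and checks the result against NumPy's built-in pseudo-inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3                                # N samples, p features (synthetic)
X = rng.normal(size=(N, p))
w_true = np.array([1.5, -2.0, 0.5])
Y = X @ w_true + 0.1 * rng.normal(size=N)    # targets with small Gaussian noise

# Closed-form least squares: w = (X^T X)^{-1} X^T Y
w_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Same solution via the pseudo-inverse X^+ = (X^T X)^{-1} X^T
w_pinv = np.linalg.pinv(X) @ Y

print(w_hat)                                 # close to w_true
print(np.allclose(w_hat, w_pinv))            # True
```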

Geometric interpretation 1: the total error is the sum of the distances (the red segments in the original figure) from each data point to the fitted line.

Geometric interpretation 2:

The data matrix $X$ is $N \times p$, and its $p$ columns span a $p$-dimensional subspace. $Y$ usually does not lie in this subspace, so least squares looks for the vector $f(w)$ within the subspace whose distance to $Y$ is smallest; this makes $f(w)$ the projection of $Y$ onto the subspace.
It follows that $Y-f(w)$ is perpendicular to the basis vectors of the subspace:
$X^{\top}(Y-f(w))=0$
$X^{\top}(Y-X w)=0$
$X^{\top} Y=X^{\top} X w$
$\Rightarrow w=\left(X^{\top} X\right)^{-1} X^{\top} Y$
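This projection view can be checked numerically. A small sketch, again with synthetic data: after solving for $w$, the residual $Y - Xw$ should be orthogonal to every column of $X$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                # synthetic design matrix
Y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w = np.linalg.solve(X.T @ X, X.T @ Y)        # least squares solution
residual = Y - X @ w                         # Y minus its projection f(w) = Xw

# X^T (Y - Xw) should be numerically zero: the residual is perpendicular
# to the p-dimensional column space of X.
print(X.T @ residual)                        # entries on the order of 1e-13
```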

(Series 3) Linear Regression 2 - Least Squares - Probabilistic View - Gaussian Noise - MLE

The dataset $D=\{(x_1, y_1), \cdots, (x_N, y_N)\}$ and the matrices $X_{N \times p}$ and $Y_{N \times 1}$ are as defined in Part 1.

Recall the least squares loss function:
$L(w)=\sum_{i=1}^{N}\left\|w^{\top} x_{i}-y_{i}\right\|^{2}=\sum_{i=1}^{N}\left(w^{\top} x_{i}-y_{i}\right)^{2}$

and its estimate $\hat{w}=\arg \min L(w)=\left(X^{\top} X\right)^{-1} X^{\top} Y$.

Assume the noise in the data is Gaussian: $\varepsilon \sim N\left(0, \sigma^{2}\right)$.

$y=f(w)+\varepsilon=w^{\top} x+\varepsilon$
Since $\varepsilon$ is normally distributed, $y \mid x ; w \sim N\left(w^{\top} x, \sigma^{2}\right)$, so
$P(y \mid x ; w)=\frac{1}{\sqrt{2 \pi} \sigma} \exp \left\{-\frac{\left(y-w^{\top} x\right)^{2}}{2 \sigma^{2}}\right\}$

Maximum likelihood estimation (MLE), writing the log-likelihood as $\mathcal{L}$ to avoid clashing with the loss $L$:
$\mathcal{L}(w)=\log P(Y \mid X ; w)=\log \prod_{i=1}^{N} P\left(y_{i} \mid x_{i} ; w\right)$
$=\sum_{i=1}^{N} \log P\left(y_{i} \mid x_{i} ; w\right)$
$=\sum_{i=1}^{N}\left(\log \frac{1}{\sqrt{2 \pi} \sigma}+\log \exp \left\{-\frac{\left(y_{i}-w^{\top} x_{i}\right)^{2}}{2 \sigma^{2}}\right\}\right)$
$=\sum_{i=1}^{N}\left(\log \frac{1}{\sqrt{2 \pi} \sigma}-\frac{1}{2 \sigma^{2}}\left(y_{i}-w^{\top} x_{i}\right)^{2}\right)$

By maximum likelihood estimation:
$\hat{w}=\underset{w}{\arg \max }\ \mathcal{L}(w)$
$=\underset{w}{\arg \max } \sum_{i=1}^{N}-\frac{1}{2 \sigma^{2}}\left(y_{i}-w^{\top} x_{i}\right)^{2}$
$=\underset{w}{\arg \min } \sum_{i=1}^{N}\left(y_{i}-w^{\top} x_{i}\right)^{2}$
This objective is identical to the least squares loss function:
$L(w)=\sum_{i=1}^{N}\left\|w^{\top} x_{i}-y_{i}\right\|^{2}=\sum_{i=1}^{N}\left(w^{\top} x_{i}-y_{i}\right)^{2}$

So least squares estimation implicitly assumes that the noise follows a normal distribution.
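A minimal sketch of this equivalence, assuming synthetic data with a known noise level $\sigma$: numerically minimizing the negative log-likelihood (here with scipy.optimize.minimize) recovers the same $w$ as the closed-form least squares solution:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
sigma = 0.3                                  # assumed known noise std
X = rng.normal(size=(200, 3))
Y = X @ np.array([2.0, -1.0, 0.5]) + sigma * rng.normal(size=200)

def neg_log_likelihood(w):
    # -log P(Y|X;w) under y_i ~ N(w^T x_i, sigma^2), dropping constants
    return np.sum((Y - X @ w) ** 2) / (2 * sigma ** 2)

w_mle = minimize(neg_log_likelihood, x0=np.zeros(3)).x
w_lse = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.allclose(w_mle, w_lse, atol=1e-4))  # True (up to optimizer tolerance)
```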

(Series 3) Linear Regression 3 - Regularization - Ridge Regression - Frequentist View

The least squares loss function:
$L(w)=\sum_{i=1}^{N}\left\|w^{\top} x_{i}-y_{i}\right\|^{2}=\sum_{i=1}^{N}\left(w^{\top} x_{i}-y_{i}\right)^{2}$

with estimate $\hat{w}=\arg \min L(w)=\left(X^{\top} X\right)^{-1} X^{\top} Y$.

The data is $X_{N \times p}$: $N$ samples with $x_{i} \in \mathbb{R}^{p}$. Normally $N$ should be much larger than $p$.

In practice, however, there may not be many samples. In that case $X^{\top} X$ may not be invertible, which easily leads to overfitting.
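A small sketch of this failure mode, with hypothetical dimensions: when $N < p$, the $p \times p$ matrix $X^{\top} X$ has rank at most $N$ and is therefore singular:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 5, 10                                 # fewer samples than features
X = rng.normal(size=(N, p))

XtX = X.T @ X                                # p x p, but rank <= N < p
print(np.linalg.matrix_rank(XtX))            # 5, not 10: singular
print(np.linalg.cond(XtX))                   # astronomically large condition number
```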

Three ways to deal with overfitting:
1. Get more data
2. Feature selection / feature extraction (e.g., PCA)
3. Regularization

Regularization framework:
$\underset{w}{\arg \min }[L(w)+\lambda P(w)]$
where $P(w)$ is the penalty term.

L1: Lasso, $P(w)=\|w\|_{1}$
L2: Ridge regression (also known as weight decay), $P(w)=\|w\|_{2}^{2}=w^{\top} w$
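As a rough illustration of the difference (using scikit-learn's Ridge and Lasso with arbitrarily chosen alpha values), the L1 penalty tends to drive weak weights exactly to zero, while the L2 penalty only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 10))
# Only the first two features actually matter
Y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=50)

ridge = Ridge(alpha=1.0).fit(X, Y)           # L2 penalty: shrinks all weights
lasso = Lasso(alpha=0.1).fit(X, Y)           # L1 penalty: zeroes out weak weights

print(ridge.coef_.round(3))                  # every weight shrunk, generally nonzero
print(lasso.coef_.round(3))                  # irrelevant weights driven exactly to 0.0
```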

The objective function with the L2 penalty:
$J(w)=\sum_{i=1}^{N}\left\|w^{\top} x_{i}-y_{i}\right\|^{2}+\lambda w^{\top} w$
$=\left(w^{\top} X^{\top}-Y^{\top}\right)(X w-Y)+\lambda w^{\top} w$
$=w^{\top} X^{\top} X w-2 w^{\top} X^{\top} Y+Y^{\top} Y+\lambda w^{\top} w$
The first and last terms can be merged, where $I$ is the identity matrix:
$=w^{\top}\left(X^{\top} X+\lambda I\right) w-2 w^{\top} X^{\top} Y+Y^{\top} Y$

$\hat{w}=\arg \min J(w)$
$\frac{\partial J(w)}{\partial w}=2\left(X^{\top} X+\lambda I\right) w-2 X^{\top} Y=0$
which gives $\hat{w}=\left(X^{\top} X+\lambda I\right)^{-1} X^{\top} Y$
Compared with the unregularized solution there is an extra $\lambda I$: since $X^{\top} X$ is positive semi-definite, $X^{\top} X+\lambda I$ is positive definite for $\lambda>0$, so it is guaranteed to be invertible.
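A minimal NumPy sketch, reusing the hypothetical $N < p$ setup from above: $X^{\top} X$ is singular on its own, but $X^{\top} X+\lambda I$ is invertible and the ridge solution is well defined:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, lam = 5, 10, 0.5                       # fewer samples than features; lambda > 0
X = rng.normal(size=(N, p))
Y = rng.normal(size=N)

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))            # 5: singular on its own

# w = (X^T X + lambda I)^{-1} X^T Y always exists for lambda > 0, because
# X^T X is positive semi-definite and adding lambda*I makes it positive definite.
w_ridge = np.linalg.solve(XtX + lam * np.eye(p), X.T @ Y)
print(w_ridge.shape)                         # (10,)
```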

(Series 3) Linear Regression 4 - Regularization - Ridge Regression - Bayesian View

Geometric interpretation of regularization: (figure in the original post)

Noise: $\varepsilon \sim N\left(0, \sigma^{2}\right)$

Bayesian view:
Put a prior on the weights: $w \sim N\left(0, \sigma_{0}^{2}\right)$
Bayes' theorem: $P(w \mid y)=\frac{P(y \mid w) \cdot P(w)}{P(y)}$

Maximum a posteriori estimation (MAP):
$\hat{w}=\underset{w}{\arg \max }\ P(w \mid y)=\underset{w}{\arg \max }\ P(y \mid w) \cdot P(w)$
This step is valid because $P(y)$ is a constant that does not depend on $w$.
Since $P(w)=\frac{1}{\sqrt{2 \pi} \sigma_{0}} \exp \left\{-\frac{\|w\|^{2}}{2 \sigma_{0}^{2}}\right\}$
and $P(y \mid w)=\frac{1}{\sqrt{2 \pi} \sigma} \exp \left\{-\frac{\left(y-w^{\top} x\right)^{2}}{2 \sigma^{2}}\right\}$,
$P(y \mid w) \cdot P(w)=\frac{1}{\sqrt{2 \pi} \sigma} \cdot \frac{1}{\sqrt{2 \pi} \sigma_{0}} \exp \left\{-\frac{\left(y-w^{\top} x\right)^{2}}{2 \sigma^{2}}-\frac{\|w\|^{2}}{2 \sigma_{0}^{2}}\right\}$

Taking the log, the objective becomes
$=\underset{w}{\arg \max }\ \log [P(y \mid w) \cdot P(w)]$

$=\underset{w}{\arg \max }\ \log \left(\frac{1}{\sqrt{2 \pi} \sigma} \cdot \frac{1}{\sqrt{2 \pi} \sigma_{0}}\right)+\log \exp \left\{-\frac{\left(y-w^{\top} x\right)^{2}}{2 \sigma^{2}}-\frac{\|w\|^{2}}{2 \sigma_{0}^{2}}\right\}$
Since the first term is a constant, it can be dropped:
$=\underset{w}{\arg \min }\ \frac{\left(y-w^{\top} x\right)^{2}}{2 \sigma^{2}}+\frac{\|w\|^{2}}{2 \sigma_{0}^{2}}$

$=\underset{w}{\arg \min }\left(y-w^{\top} x\right)^{2}+\frac{\sigma^{2}}{\sigma_{0}^{2}}\|w\|^{2}$
The expressions above omit the sum $\sum_{i=1}^{N}$ over samples; written out in full:
$\hat{w}_{MAP}=\underset{w}{\arg \min } \sum_{i=1}^{N}\left(y_{i}-w^{\top} x_{i}\right)^{2}+\frac{\sigma^{2}}{\sigma_{0}^{2}}\|w\|^{2}$
The first term is the loss function and the second is the penalty, where $\frac{\sigma^{2}}{\sigma_{0}^{2}}$ plays the role of $\lambda$.
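A minimal numerical sketch of this equivalence (synthetic data; the values of $\sigma$ and $\sigma_0$ are arbitrary assumptions): minimizing the negative log-posterior gives the same $w$ as the ridge closed form with $\lambda=\sigma^{2} / \sigma_{0}^{2}$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
sigma, sigma0 = 0.5, 1.0                     # noise std and prior std (assumed)
X = rng.normal(size=(100, 4))
Y = X @ np.array([1.0, -0.5, 2.0, 0.0]) + sigma * rng.normal(size=100)

def neg_log_posterior(w):
    # -log[P(Y|w) P(w)] up to constants: squared error plus Gaussian prior term
    return (np.sum((Y - X @ w) ** 2) / (2 * sigma ** 2)
            + np.sum(w ** 2) / (2 * sigma0 ** 2))

w_map = minimize(neg_log_posterior, x0=np.zeros(4)).x

lam = sigma ** 2 / sigma0 ** 2               # lambda = sigma^2 / sigma_0^2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ Y)
print(np.allclose(w_map, w_ridge, atol=1e-4))  # True: MAP == regularized LSE
```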

Conclusions:
1. Least squares estimation (LSE) ⟺ maximum likelihood estimation (MLE) with Gaussian noise
2. Regularized LSE ⟺ maximum a posteriori (MAP) estimation with a Gaussian prior and Gaussian noise


Reposted from blog.csdn.net/u011703187/article/details/104588814