Machine Learning: Principles and Derivation of Ordinary Least Squares Linear Regression

(Ordinary) Least Squares Linear Regression

I. Assumption

The samples are (approximately) linearly distributed.

II. Principle

Fit the sample set with a hyperplane (a straight line in the one-dimensional case) so that the sum of squared differences between each sample's label and its predicted value is minimized.

Note: this is not the same as minimizing the perpendicular distance from each sample point to the line; the residual is measured along the label axis.

$h(x_{i1},x_{i2},\cdots,x_{id}) = \sum\limits_{j=1}^d w_j x_{ij} - \theta$ , $i=1,2,\cdots,n$

Absorb the threshold into the weight vector: let $x_{i0}=1$ and $w_0=-\theta$ for $i = 1,2,\cdots,n$, and define $\vec x_i = \begin{bmatrix} 1&x_{i1}&x_{i2}&\cdots&x_{id} \end{bmatrix}^T$ , $\vec w = \begin{bmatrix} -\theta&w_1&w_2&\cdots&w_d \end{bmatrix}^T$ , so that $h(\vec x_i)=\vec w^T\cdot\vec x_i$ .
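For concreteness, here is a minimal NumPy sketch of the absorbed-bias form above; the feature values, weights, and threshold are made-up illustrative numbers, not from the original post.

```python
import numpy as np

# Hypothetical sample with d = 3 features; all numbers are illustrative.
x_raw = np.array([2.0, -1.0, 0.5])      # (x_i1, x_i2, x_i3)
w_raw = np.array([0.4, 1.2, -0.7])      # (w_1, w_2, w_3)
theta = 0.3                             # threshold

# Original form: h = sum_j w_j * x_ij - theta
h_original = w_raw @ x_raw - theta

# Absorbed-bias form: prepend x_i0 = 1 and w_0 = -theta
x_i = np.concatenate(([1.0], x_raw))    # vec x_i = [1, x_i1, ..., x_id]^T
w = np.concatenate(([-theta], w_raw))   # vec w   = [-theta, w_1, ..., w_d]^T
h_absorbed = w @ x_i                    # h(vec x_i) = w^T . x_i

assert np.isclose(h_original, h_absorbed)
print(h_original, h_absorbed)
```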

  1. Construct the loss function

    $L(h) = \frac{1}{n}\sum\limits_{i=1}^n(h(\vec x_i)-y_i)^2$ , i.e. the mean squared error (MSE)

  2. Find the hypothesis $h$ that minimizes the loss function

    The hypothesis $h$ is determined by $\vec w$ , so rewrite $L(h)$ as a function of $\vec w$ :

    $L(\vec w) = \frac{1}{n}\sum\limits_{i=1}^n(\vec w^T\cdot \vec x_i-y_i)^2$

    Stack the samples into the design matrix and label vector: $\mathbf X=\begin{bmatrix} \vec x_1^T\\ \vec x_2^T\\ \vdots\\ \vec x_n^T \end{bmatrix} \in \mathbb R^{n\times(d+1)}$ , $\vec y = \begin{bmatrix} y_1&y_2&\cdots&y_n \end{bmatrix}^T$

    $L(\vec w) = \frac{1}{n}(\mathbf X\cdot\vec w- \vec y)^T\cdot(\mathbf X\cdot\vec w- \vec y)$

    $=\frac{1}{n}(\vec w^T\mathbf X^T\mathbf X\vec w-\vec w^T\mathbf X^T\vec y-\vec y^T\mathbf X\vec w+\vec y^T\vec y)$

    $=\frac{1}{n}(\vec w^T\mathbf X^T\mathbf X\vec w-2\vec w^T\mathbf X^T\vec y+\vec y^T\vec y)$ , since $\vec w^T\mathbf X^T\vec y$ and $\vec y^T\mathbf X\vec w$ are both $1\times1$ matrices (scalars) and transposes of each other, hence equal
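As a quick sanity check of the vectorized form, the following NumPy sketch compares it against the per-sample sum; the random data and variable names are illustrative assumptions, not from the original post.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3

# Design matrix with a leading column of ones (absorbed bias), labels, weights.
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])   # shape (n, d+1)
y = rng.normal(size=n)
w = rng.normal(size=d + 1)

# Per-sample form: (1/n) * sum_i (w^T x_i - y_i)^2
loss_sum = np.mean([(w @ X[i] - y[i]) ** 2 for i in range(n)])

# Vectorized form: (1/n) * (Xw - y)^T (Xw - y)
r = X @ w - y
loss_vec = (r @ r) / n

assert np.isclose(loss_sum, loss_vec)
print(loss_sum, loss_vec)
```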

    1. Gradient descent: iteratively move $\vec w$ along the negative gradient (see the sketch after the closed-form derivation below)

    2. Closed-form (analytic) solution

      If $\vec w^*$ satisfies $\frac{\partial}{\partial \vec w}L(\vec w^*) = \vec 0$ , then $\vec w^*$ is the global minimizer of $L(\vec w)$ (this is a convex optimization problem)

      $\frac{\partial}{\partial \vec w}L(\vec w) = \frac{2}{n}(\mathbf X^T\mathbf X\vec w-\mathbf X^T\vec y)$

      Setting the gradient to zero: $\mathbf X^T\mathbf X\vec w^*-\mathbf X^T\vec y = \vec 0$

      $\vec w^*=(\mathbf X^T\mathbf X)^{-1}\mathbf X^T\vec y$ , provided $\mathbf X^T\mathbf X$ is invertible
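To illustrate both methods listed above, here is a minimal NumPy sketch on synthetic data: the normal-equation solution just derived, and a plain gradient-descent loop using the gradient $\frac{2}{n}(\mathbf X^T\mathbf X\vec w-\mathbf X^T\vec y)$. The data, learning rate, and iteration count are illustrative assumptions, not from the original post.

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 200, 3

# Synthetic, noisily linear data with an absorbed bias column of ones.
X_raw = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5, 3.0])       # [bias, w_1, w_2, w_3]
X = np.hstack([np.ones((n, 1)), X_raw])        # shape (n, d+1)
y = X @ w_true + 0.1 * rng.normal(size=n)

# 2. Closed-form solution: w* = (X^T X)^{-1} X^T y.
#    Solve the linear system instead of forming the explicit inverse.
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# 1. Gradient descent on L(w) = (1/n) ||Xw - y||^2,
#    using grad L = (2/n) (X^T X w - X^T y).
w_gd = np.zeros(d + 1)
lr = 0.05
for _ in range(5000):
    grad = (2.0 / n) * (X.T @ (X @ w_gd) - X.T @ y)
    w_gd -= lr * grad

print("closed form:     ", w_closed)
print("gradient descent:", w_gd)                              # close to w_closed
print("numpy lstsq:     ", np.linalg.lstsq(X, y, rcond=None)[0])
```

Solving the normal equations with `np.linalg.solve` (or using `np.linalg.lstsq`) is numerically preferable to computing $(\mathbf X^T\mathbf X)^{-1}$ explicitly, though the result is the same when $\mathbf X^T\mathbf X$ is well conditioned.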


Reposted from blog.csdn.net/qq_52554169/article/details/130888741