CS229 Machine Learning Notes

Lecture 2: Applications of Supervised Learning, Gradient Descent

The Normal Equations for Linear Regression

Stack all training examples into the design matrix

\[X=\begin{pmatrix}(x^{(1)})^T\\\vdots\\(x^{(m)})^T\end{pmatrix}\]

with the corresponding vector of target values

\[y=\begin{pmatrix}y^{(1)}\\\vdots\\y^{(m)}\end{pmatrix}\]

and the parameter vector

\[\theta=(\theta_0,\theta_1,\cdots,\theta_n)^T\]

Then we have

\[X\theta=\begin{pmatrix}(x^{(1)})^T\theta\\\vdots\\(x^{(m)})^T\theta\end{pmatrix} =\begin{pmatrix}h_\theta(x^{(1)})\\\vdots\\h_\theta(x^{(m)})\end{pmatrix}\]

\[\frac 1 2 \|X\theta-y\|^2=\frac 1 2 (X\theta-y)^T(X\theta-y)=\frac 1 2 \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2=J(\theta)\]
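The equivalence of the matrix form and the summation form of \(J(\theta)\) can be checked numerically; a minimal sketch with made-up synthetic data:

```python
import numpy as np

# Hypothetical data: m = 4 examples, n + 1 = 3 features
# (including the intercept column of ones).
rng = np.random.default_rng(0)
X = np.hstack([np.ones((4, 1)), rng.standard_normal((4, 2))])
y = rng.standard_normal(4)
theta = rng.standard_normal(3)

# Matrix form: J(theta) = 1/2 * ||X theta - y||^2
J_matrix = 0.5 * np.dot(X @ theta - y, X @ theta - y)

# Summation form: J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2
J_sum = 0.5 * sum((x_i @ theta - y_i) ** 2 for x_i, y_i in zip(X, y))

assert np.isclose(J_matrix, J_sum)
```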

Since \(J(\theta)\) is differentiable, any point that minimizes it must be a stationary point, i.e. \(\frac {\partial J(\theta)}{\partial \theta_i}=0,\ \ i=0,\cdots,n\).

Solving for the Parameters by Matrix Calculus

    1. (Properties of the trace) Let \(A,B,C\) be \(n\times n\) matrices; then \(\mathrm{tr}(AB)=\mathrm{tr}(BA)\), \(\mathrm{tr}(ABC)=\mathrm{tr}(CAB)=\mathrm{tr}(BCA)\), and so on — the trace is invariant under cyclic permutations of a matrix product.

    2. (Gradient with respect to a matrix) For \(A\in \mathbb{R}^{m\times n}\) and \(f:\mathbb{R}^{m\times n}\to \mathbb{R}\), the gradient of \(f(A)\) with respect to \(A\) is:

\[\nabla_A f(A)= \begin {pmatrix} \frac{\partial f}{\partial A_{1,1}}&\cdots&\frac{\partial f}{\partial A_{1,n}}\\ \vdots & \ddots & \vdots\\ \frac{\partial f}{\partial A_{m,1}}&\cdots & \frac{\partial f}{\partial A_{m,n}} \end{pmatrix}\]
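The cyclic trace identities in property 1 can be verified on random matrices — a quick numerical sanity check, not a proof:

```python
import numpy as np

# Numerical check of the cyclic property of the trace on random 3x3 matrices.
rng = np.random.default_rng(1)
A, B, C = (rng.standard_normal((3, 3)) for _ in range(3))

assert np.isclose(np.trace(A @ B), np.trace(B @ A))
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))
assert np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A))
# Note: tr(ABC) != tr(ACB) in general -- only cyclic shifts are allowed.
```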

The following identities are stated without proof:

  • Formula 1.\[\nabla_A\mathrm{tr}(AB)=B^T\]

  • Formula 2.\[\nabla_A\mathrm{tr}(ABA^TC)=CAB+C^TAB^T\]

  • Formula 3.\[\nabla_{A^T}f(A)=(\nabla_{A}f(A))^T\]
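Formulas 1 and 2 can be sanity-checked against a finite-difference gradient; the helper `num_grad` below is an illustrative name introduced here, not something from the lecture:

```python
import numpy as np

def num_grad(f, A, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at matrix A."""
    G = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        E = np.zeros_like(A)
        E[idx] = eps
        G[idx] = (f(A + E) - f(A - E)) / (2 * eps)
    return G

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 3))

# Formula 1: grad_A tr(AB) = B^T
g1 = num_grad(lambda M: np.trace(M @ B), A)
assert np.allclose(g1, B.T, atol=1e-4)

# Formula 2: grad_A tr(A B A^T C) = C A B + C^T A B^T
g2 = num_grad(lambda M: np.trace(M @ B @ M.T @ C), A)
assert np.allclose(g2, C @ A @ B + C.T @ A @ B.T, atol=1e-4)
```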

\[\nabla_\theta J(\theta)=\nabla_\theta [\frac 1 2 (X\theta-y)^T(X\theta-y)]\]

\[=\frac 1 2 \nabla_\theta [(X\theta-y)^T(X\theta-y)]\]

\[=\frac 1 2 \nabla_\theta [(\theta^T X^T-y^T)(X\theta-y)]\]

\[=\frac 1 2 \nabla_\theta (\theta^T X^T X\theta-\theta^T X^T y-y^TX\theta+y^Ty)\]

(Note that the expression following the nabla operator is a scalar, and \(\mathrm{tr}\,a=a\) for any scalar \(a\), so we may wrap the whole expression in a trace.)

\[=\frac 1 2 \nabla_\theta \mathrm{tr}(\theta^T X^T X\theta-\theta^T X^T y-y^TX\theta+y^Ty)\]

\[=\frac 1 2 \nabla_\theta [\mathrm{tr}(\theta^T X^T X\theta)-\mathrm{tr}(\theta^T X^T y)-\mathrm{tr}(y^TX\theta)]\]

(Since \(y^Ty\) does not depend on \(\theta\), it drops out of the gradient.)

\[=\frac 1 2 \nabla_\theta [\mathrm{tr}(\theta^T X^T X\theta)-\mathrm{tr}(\theta^T X^T y)-\mathrm{tr}(\theta y^T X)]\]

\[=\frac 1 2 \nabla_\theta [\mathrm{tr}(\theta^T X^T X\theta)-2\mathrm{tr}(\theta y^T X)]\]

\[=\frac 1 2 \nabla_\theta [\mathrm{tr}(\theta \theta^T X^T X)-2\,\mathrm{tr}(\theta y^T X)]\]

(by the cyclic property applied to the first term; note that \(\theta y^T X\) is an \((n+1)\times(n+1)\) matrix, not a scalar, so the trace must be kept)

\[=\frac 1 2 \nabla_\theta [\mathrm{tr}(\theta I \theta^T X^T X)-2\,\mathrm{tr}(\theta y^T X)]\]

\[=\frac 1 2 [X^TX\theta+X^TX\theta-2\nabla_\theta\,\mathrm{tr}(\theta y^T X)]\]

(using Formula 2 with \(A=\theta,B=I,C=X^TX\); since \(X^TX\) is symmetric, \(C^T=C\))

\[=X^TX\theta-\nabla_\theta\,\mathrm{tr}(\theta y^T X)\]

\[=X^TX\theta-X^Ty\]

(using Formula 1 with \(A=\theta,B=y^TX\): \(\nabla_\theta\,\mathrm{tr}(\theta y^TX)=(y^TX)^T=X^Ty\))
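As a sanity check on the whole derivation, the closed-form gradient \(X^TX\theta-X^Ty\) can be compared with a finite-difference gradient of \(J(\theta)\) at a random point:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((10, 3))
y = rng.standard_normal(10)
theta = rng.standard_normal(3)

# J(theta) = 1/2 * ||X theta - y||^2
J = lambda t: 0.5 * np.sum((X @ t - y) ** 2)

# Closed-form gradient derived above: X^T X theta - X^T y
grad_closed = X.T @ X @ theta - X.T @ y

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_fd = np.array([
    (J(theta + eps * e) - J(theta - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
assert np.allclose(grad_closed, grad_fd, atol=1e-4)
```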

Setting \(\nabla_\theta J(\theta)=0\) yields the Normal Equations:

\[X^TX\theta=X^Ty\]

\[\theta=(X^TX)^{-1}X^Ty\]

(assuming \(X^TX\) is invertible, which holds exactly when the columns of \(X\) are linearly independent)
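Putting it together, a minimal sketch that fits a linear model via the Normal Equations and cross-checks against NumPy's built-in least-squares solver (all data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 50, 3
# Design matrix with an intercept column of ones, plus random features.
X = np.hstack([np.ones((m, 1)), rng.standard_normal((m, n))])
true_theta = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_theta + 0.01 * rng.standard_normal(m)

# theta = (X^T X)^{-1} X^T y; numerically, prefer solve() over forming
# the explicit inverse, which is slower and less stable.
theta = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(theta, theta_lstsq)
```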


Reposted from www.cnblogs.com/qpswwww/p/9298394.html