Gradient Descent Vectorization

  • We assume $X$ is the data set, where each row $x^{(i)}$ is one sample and the number of columns is the number of features;
  • $Y$ holds the true values, with each row $y^{(i)}$ corresponding to one input sample;
  • $\Theta$ holds the weight of each feature;
$$X = \left[ \begin{matrix} x_0^{(1)} & x_1^{(1)} & \cdots & x_n^{(1)} \\ x_0^{(2)} & x_1^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ x_0^{(m)} & x_1^{(m)} & \cdots & x_n^{(m)} \end{matrix} \right] = \left[ \begin{matrix} x^{(1)} \\ x^{(2)} \\ \vdots \\ x^{(m)} \end{matrix} \right]$$

$$Y = \left[ \begin{matrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{matrix} \right]$$

$$\Theta = \left[ \begin{matrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{matrix} \right]$$
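As a concrete illustration, here is a minimal NumPy sketch of how $X$, $Y$, and $\Theta$ might be laid out. The shapes and random data are made up for this example, and the assumption that the first column of $X$ is a constant bias feature $x_0 = 1$ is a common convention rather than something stated in the derivation:

```python
import numpy as np

m, n = 100, 3                                 # m samples, n "real" features (x_0 is a bias feature)
rng = np.random.default_rng(0)

X = np.hstack([np.ones((m, 1)),               # x_0 = 1 for every sample (assumed bias column)
               rng.normal(size=(m, n))])      # shape (m, n+1): one row per sample x^{(i)}
Y = rng.normal(size=(m, 1))                   # shape (m, 1): one true value y^{(i)} per sample
Theta = np.zeros((n + 1, 1))                  # shape (n+1, 1): one weight theta_j per feature

print(X.shape, Y.shape, Theta.shape)          # (100, 4) (100, 1) (4, 1)
```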

Network structure
Consider a simple neural network consisting only of an input layer and an output layer, with no hidden layers. For a single sample $x^{(i)}$, the predicted output $\hat y^{(i)}$ is:
$$\hat y^{(i)} = \theta_0 x_0^{(i)} + \theta_1 x_1^{(i)} + \cdots + \theta_n x_n^{(i)} = x^{(i)}\Theta \tag{1}$$
The predictions $\hat Y$ for all samples are then:
$$\hat Y = \left[ \begin{matrix} \hat y^{(1)} \\ \hat y^{(2)} \\ \vdots \\ \hat y^{(m)} \end{matrix} \right]$$
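A small sketch of Equation (1) in NumPy (with hypothetical random data): computing each $\hat y^{(i)} = x^{(i)}\Theta$ in a loop and computing the whole vector $\hat Y = X\Theta$ in one product give the same result:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_plus_1 = 5, 4
X = rng.normal(size=(m, n_plus_1))            # each row is a sample x^{(i)}
Theta = rng.normal(size=(n_plus_1, 1))        # weight column vector

# Per-sample prediction, Equation (1): y_hat^{(i)} = x^{(i)} Theta
y_hat_loop = np.array([X[i] @ Theta for i in range(m)])   # shape (m, 1)

# Vectorized prediction for all samples at once: Y_hat = X Theta
Y_hat = X @ Theta

assert np.allclose(y_hat_loop, Y_hat)
```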
We take the squared error between the predictions and the true values, averaged over the samples, as the loss function $L$:
$$L = \frac{1}{m}\sum_{i=1}^m \frac{1}{2}\left(\hat y^{(i)} - y^{(i)}\right)^2 = \frac{1}{2m}\sum_{i=1}^m \left( x^{(i)}\Theta - y^{(i)} \right)^2 \tag{2}$$
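Equation (2) translates directly into one NumPy line. A sketch with made-up data, checking the explicit sum against the vectorized form built from the residual vector $X\Theta - Y$:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5
X = rng.normal(size=(m, 4))
Y = rng.normal(size=(m, 1))
Theta = rng.normal(size=(4, 1))

# Loop form of Equation (2): L = (1/2m) * sum_i (x^{(i)} Theta - y^{(i)})^2
L_loop = sum((X[i] @ Theta.ravel() - Y[i, 0]) ** 2 for i in range(m)) / (2 * m)

# Vectorized form: same quantity computed from the residual vector X Theta - Y
L_vec = np.sum((X @ Theta - Y) ** 2) / (2 * m)

assert np.isclose(L_loop, L_vec)
```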
Suppose we now want to compute the updated value of $\theta_j$ after one gradient-descent step, where $\alpha$ is the learning rate:
$$\theta_j = \theta_j - \alpha \frac{\partial L}{\partial \theta_j} \tag{3}$$
Differentiating the loss function, i.e. Equation (2), with respect to $\theta_j$:
$$\frac{\partial L}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^m \left( x^{(i)}\Theta - y^{(i)} \right) \frac{\partial \left( x^{(i)}\Theta \right)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^m \left( x^{(i)}\Theta - y^{(i)} \right) x_j^{(i)} \tag{4}$$
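Computed naively, Equation (4) requires one pass over all samples for every weight $\theta_j$. A sketch of that double loop (hypothetical data), which the vectorization below replaces:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_plus_1 = 5, 4
X = rng.normal(size=(m, n_plus_1))
Y = rng.normal(size=(m, 1))
Theta = rng.normal(size=(n_plus_1, 1))

# Equation (4): dL/dtheta_j = (1/m) * sum_i (x^{(i)} Theta - y^{(i)}) * x_j^{(i)}
grad = np.zeros(n_plus_1)
for j in range(n_plus_1):                     # one pass over the data per weight
    for i in range(m):
        residual = X[i] @ Theta.ravel() - Y[i, 0]
        grad[j] += residual * X[i, j]
    grad[j] /= m

print(grad)                                   # one partial derivative per theta_j
```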
Let $e^{(i)} = x^{(i)}\Theta - y^{(i)}$, and let $E$ collect all of the $e^{(i)}$:
$$E = \left[ \begin{matrix} e^{(1)} \\ e^{(2)} \\ \vdots \\ e^{(m)} \end{matrix} \right] = \left[ \begin{matrix} x^{(1)}\Theta - y^{(1)} \\ x^{(2)}\Theta - y^{(2)} \\ \vdots \\ x^{(m)}\Theta - y^{(m)} \end{matrix} \right] = X\Theta - Y$$

Then Equation (4) can be written as:
$$\frac{\partial L}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^m e^{(i)} x_j^{(i)} = \frac{1}{m}\left( x_j^{(1)}, x_j^{(2)}, \ldots, x_j^{(m)} \right) E \tag{5}$$
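In Equation (5) the inner sum becomes a dot product between column $j$ of $X$ and the residual vector $E$. A small self-contained sketch (made-up data, arbitrary index $j$) checking this against the explicit sum of Equation (4):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_plus_1 = 5, 4
X = rng.normal(size=(m, n_plus_1))
Y = rng.normal(size=(m, 1))
Theta = rng.normal(size=(n_plus_1, 1))

E = X @ Theta - Y                             # residual vector, e^{(i)} = x^{(i)} Theta - y^{(i)}

j = 2                                         # any weight index
# Equation (5): dL/dtheta_j = (1/m) * (x_j^{(1)}, ..., x_j^{(m)}) . E
grad_j = (X[:, j] @ E.ravel()) / m

# Same partial derivative from the explicit sum of Equation (4)
grad_j_loop = sum((X[i] @ Theta.ravel() - Y[i, 0]) * X[i, j] for i in range(m)) / m

assert np.isclose(grad_j, grad_j_loop)
```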
Substituting Equation (5) into Equation (3) gives:
$$\theta_j = \theta_j - \alpha \frac{1}{m}\left( x_j^{(1)}, x_j^{(2)}, \ldots, x_j^{(m)} \right) E \tag{6}$$
Therefore, the gradient update for all weights can be written as:
$$\Theta = \left[ \begin{matrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{matrix} \right] = \left[ \begin{matrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{matrix} \right] - \frac{\alpha}{m} \left[ \begin{matrix} x_0^{(1)} & x_0^{(2)} & \cdots & x_0^{(m)} \\ x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(m)} \\ \vdots & \vdots & \ddots & \vdots \\ x_n^{(1)} & x_n^{(2)} & \cdots & x_n^{(m)} \end{matrix} \right] E = \Theta - \frac{\alpha}{m} X^T E = \Theta - \frac{\alpha}{m} X^T \left( X\Theta - Y \right)$$
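This final formula is the point of the whole exercise: one matrix product updates every weight at once, replacing the nested loops. A minimal gradient-descent loop using it, with synthetic data and hyperparameters chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_plus_1 = 200, 4
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n_plus_1 - 1))])  # bias column + random features
true_theta = np.array([[2.0], [-1.0], [0.5], [3.0]])                  # hypothetical target weights
Y = X @ true_theta + 0.01 * rng.normal(size=(m, 1))                   # noisy linear targets

alpha = 0.1                                   # learning rate
Theta = np.zeros((n_plus_1, 1))

for _ in range(1000):
    E = X @ Theta - Y                         # residual vector E = X Theta - Y
    Theta = Theta - (alpha / m) * (X.T @ E)   # Theta <- Theta - (alpha/m) X^T (X Theta - Y)

print(Theta.ravel())                          # converges close to [2.0, -1.0, 0.5, 3.0]
```

The update line is a direct transcription of the vectorized formula above; no per-weight or per-sample Python loop remains.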
