The Computation of Gradient Backpropagation in a Neural Network

Gradient descent is widely used to find the minima of multivariate functions. In a neural network, however, the relationships among the variables, parameters, and elementary functions that compose it are tangled, so computing the gradient is not straightforward and gradient descent cannot be applied directly. Researchers therefore devised error backpropagation: following the topology of the computation graph, the error at every layer can be derived. To help myself learn and understand it, I use a concrete network as an example to walk through the details of the algorithm's computation.

The network topology is shown below:

First, the forward pass:

\begin{bmatrix} z_1\\ z_2\\ z_3 \end{bmatrix} =\begin{bmatrix} w_{11}& w_{12}\\ w_{21}& w_{22}\\ w_{31}&w_{32} \end{bmatrix}\begin{bmatrix} x_1\\ x_2\end{bmatrix}

\begin{bmatrix} h_1\\ h_2\\ h_3 \end{bmatrix} =\sigma \bigg( \begin{bmatrix} z_{1}\\ z_{2}\\ z_{3} \end{bmatrix}\bigg)

\hat{y}= [w_{1} \ \ w_{2} \ \ w_{3}]\begin{bmatrix} h_{1}\\ h_{2}\\ h_{3} \end{bmatrix}

E=\frac{1}{2}(\hat{y}-y)^2
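The forward pass above can be sketched in numpy. All weights, inputs, and the target below are made-up illustrative values, not taken from the derivation:

```python
# Forward pass for the example network: 2 inputs -> 3 sigmoid hidden units -> 1 output.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0])            # input vector [x1, x2]
W1 = np.array([[0.1, 0.2],           # w_11 w_12
               [0.3, 0.4],           # w_21 w_22
               [0.5, 0.6]])          # w_31 w_32
w2 = np.array([0.7, 0.8, 0.9])       # output weights [w1, w2, w3]
y = 1.0                              # target

z = W1 @ x                           # z = W1 x
h = sigmoid(z)                       # h = sigmoid(z)
y_hat = w2 @ h                       # y_hat = w2 . h
E = 0.5 * (y_hat - y) ** 2           # squared error
```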

Now the backward pass:

\frac{\partial E}{\partial \hat{y}} = \hat{y}-y

 \frac{\partial \hat{y}}{\partial w_1} = h_1     \frac{\partial \hat{y}}{\partial w_2} = h_2     \frac{\partial \hat{y}}{\partial w_3} = h_3

\frac{\partial \hat{y}}{\partial h_1} = w_1      \frac{\partial \hat{y}}{\partial h_2} = w_2      \frac{\partial \hat{y}}{\partial h_3} = w_3

\begin{bmatrix} \frac{\partial E}{\partial w_1}\\ \\ \frac{\partial E}{\partial w_2}\\ \\ \frac{\partial E}{\partial w_3} \end{bmatrix} =\begin{bmatrix} \frac{\partial E}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial {w_1}}\\ \\ \frac{\partial E}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial {w_2}}\\ \\ \frac{\partial E}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial {w_3}} \end{bmatrix}=\begin{bmatrix} (\hat{y}-y)h_1\\ \\ (\hat{y}-y)h_2\\ \\ (\hat{y}-y)h_3 \end{bmatrix}

\begin{bmatrix} \frac{\partial E}{\partial h_1}\\ \\ \frac{\partial E}{\partial h_2}\\ \\ \frac{\partial E}{\partial h_3} \end{bmatrix} =\begin{bmatrix} \frac{\partial E}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial {h_1}}\\ \\ \frac{\partial E}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial {h_2}}\\ \\ \frac{\partial E}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial {h_3}} \end{bmatrix}=\begin{bmatrix} (\hat{y}-y)w_1\\ \\ (\hat{y}-y)w_2\\ \\ (\hat{y}-y)w_3 \end{bmatrix}

Since

h_n = \mathrm{sigmoid}(z_n)

we have

\frac{\partial h_n}{\partial z_n}=h_n(1-h_n)

that is:

\frac{\partial h_1}{\partial z_1}=h_1(1-h_1) \ \ \frac{\partial h_2}{\partial z_2}=h_2(1-h_2) \ \ \frac{\partial h_3}{\partial z_3}=h_3(1-h_3)
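The derivative identity can be checked numerically with a central difference (the grid of z values below is an arbitrary choice):

```python
# Verify the identity sigmoid'(z) = h (1 - h), where h = sigmoid(z).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4.0, 4.0, 9)
h = sigmoid(z)
analytic = h * (1.0 - h)                               # the closed-form derivative
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference
```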

Therefore:

\begin{bmatrix} \frac{\partial E}{\partial z_1}\\ \\ \frac{\partial E}{\partial z_2}\\ \\ \frac{\partial E}{\partial z_3} \end{bmatrix} =\begin{bmatrix} \frac{\partial E}{\partial h_1}\frac{\partial h_1}{\partial {z_1}}\\ \\ \frac{\partial E}{\partial h_2}\frac{\partial h_2}{\partial {z_2}}\\ \\ \frac{\partial E}{\partial h_3}\frac{\partial h_3}{\partial {z_3}} \end{bmatrix}=\begin{bmatrix} (\hat{y}-y)w_1 \cdot h_1(1-h_1)\\ \\ (\hat{y}-y)w_2\cdot h_2(1-h_2)\\ \\ (\hat{y}-y)w_3\cdot h_3(1-h_3) \end{bmatrix}

The last step is to find the derivatives at the source, i.e. with respect to the first-layer weights and the inputs:

z_1=w_{11}x_1 + w_{12}x_2

z_2=w_{21}x_1 + w_{22}x_2

z_3=w_{31}x_1 + w_{32}x_2

Therefore:

\frac{\partial z_1}{\partial w_{11}} = x_1    \frac{\partial z_1}{\partial w_{12}} = x_2

\frac{\partial z_2}{\partial w_{21}} = x_1   \frac{\partial z_2}{\partial w_{22}} = x_2  

\frac{\partial z_3}{\partial w_{31}} = x_1    \frac{\partial z_3}{\partial w_{32}} = x_2

\frac{\partial z_1}{\partial x_1} = w_{11}   \frac{\partial z_1}{\partial x_2} = w_{12}

\frac{\partial z_2}{\partial x_1} = w_{21}   \frac{\partial z_2}{\partial x_2 } = w_{22}

\frac{\partial z_3}{\partial x_1 } = w_{31}   \frac{\partial z_3}{\partial x_2} = w_{32}

Therefore:

\frac{\partial E}{\partial w_{11}} = \frac{\partial E}{\partial z_{1}}\cdot \frac{\partial z_1}{\partial w_{11}}=(\hat{y}-y)w_1h_1(1-h_1)\cdot x_1

\frac{\partial E}{\partial w_{12}} = \frac{\partial E}{\partial z_{1}}\cdot \frac{\partial z_1}{\partial w_{12}}=(\hat{y}-y)w_1h_1(1-h_1)\cdot x_2

\frac{\partial E}{\partial w_{21}} = \frac{\partial E}{\partial z_{2}}\cdot \frac{\partial z_2}{\partial w_{21}}=(\hat{y}-y)w_2h_2(1-h_2)\cdot x_1

\frac{\partial E}{\partial w_{22}} = \frac{\partial E}{\partial z_{2}}\cdot \frac{\partial z_2}{\partial w_{22}}=(\hat{y}-y)w_2h_2(1-h_2)\cdot x_2

\frac{\partial E}{\partial w_{31}} = \frac{\partial E}{\partial z_{3}}\cdot \frac{\partial z_3}{\partial w_{31}}=(\hat{y}-y)w_3h_3(1-h_3)\cdot x_1

\frac{\partial E}{\partial w_{32}} = \frac{\partial E}{\partial z_{3}}\cdot \frac{\partial z_3}{\partial w_{32}}=(\hat{y}-y)w_3h_3(1-h_3)\cdot x_2

Therefore:

\mathbf{\\ \begin{bmatrix} \frac{\partial \hat{y}}{\partial w_{11}} &\frac{\partial \hat{y}}{\partial w_{12}} \\ \frac{\partial \hat{y}}{\partial w_{21}}&\frac{\partial \hat{y}}{\partial w_{22}} \\ \frac{\partial \hat{y}}{\partial w_{31}}&\frac{\partial \hat{y}}{\partial w_{32}} \end{bmatrix} =\begin{bmatrix} w_1h_1(1-h_1)\cdot x_1 & w_1h_1(1-h_1)\cdot x_2\\ w_2h_2(1-h_2)\cdot x_1& w_2h_2(1-h_2)\cdot x_2\\ w_3h_3(1-h_3)\cdot x_1& w_3h_3(1-h_3)\cdot x_2\end{bmatrix}=\begin{bmatrix}w_1h_1(1-h_1) \\ w_2h_2(1-h_2)\\w_3h_3(1-h_3) \end{bmatrix}\cdot \begin{bmatrix} x_1 & x_2 \end{bmatrix}}


\mathbf{\\ \begin{bmatrix} \frac{\partial E}{\partial w_{11}} &\frac{\partial E}{\partial w_{12}} \\ \frac{\partial E}{\partial w_{21}}&\frac{\partial E}{\partial w_{22}} \\ \frac{\partial E}{\partial w_{31}}&\frac{\partial E}{\partial w_{32}} \end{bmatrix} =\begin{bmatrix} (\hat{y}-y)w_1h_1(1-h_1)\cdot x_1 & (\hat{y}-y)w_1h_1(1-h_1)\cdot x_2\\ (\hat{y}-y)w_2h_2(1-h_2)\cdot x_1& (\hat{y}-y)w_2h_2(1-h_2)\cdot x_2\\ (\hat{y}-y)w_3h_3(1-h_3)\cdot x_1& (\hat{y}-y)w_3h_3(1-h_3)\cdot x_2\end{bmatrix}=\begin{bmatrix}(\hat{y}-y)w_1h_1(1-h_1) \\ (\hat{y}-y)w_2h_2(1-h_2)\\(\hat{y}-y)w_3h_3(1-h_3) \end{bmatrix}\cdot \begin{bmatrix} x_1 & x_2 \end{bmatrix} = (\hat{y}-y) \cdot \begin{bmatrix}w_1h_1(1-h_1) \\ w_2h_2(1-h_2)\\w_3h_3(1-h_3) \end{bmatrix}\cdot \begin{bmatrix} x_1 & x_2 \end{bmatrix}}
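Putting the chain together, here is a sketch of the full gradient of E with respect to the first-layer weights, checked against a numerical gradient. All values are illustrative assumptions:

```python
# Backward pass for the single-output network above, following the chain rule,
# verified with a central-difference numerical gradient.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W1, w2, x, y):
    h = sigmoid(W1 @ x)
    return 0.5 * (w2 @ h - y) ** 2

x = np.array([0.5, -1.0])
W1 = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
w2 = np.array([0.7, 0.8, 0.9])
y = 1.0

# forward
z = W1 @ x
h = sigmoid(z)
y_hat = w2 @ h

# backward: dE/dz_i = (y_hat - y) * w_i * h_i (1 - h_i), then dE/dW1 = outer(dE/dz, x)
delta = (y_hat - y) * w2 * h * (1.0 - h)   # shape (3,)
grad_W1 = np.outer(delta, x)               # shape (3, 2)

# numerical check of every entry
num = np.zeros_like(W1)
eps = 1e-6
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num[i, j] = (loss(Wp, w2, x, y) - loss(Wm, w2, x, y)) / (2 * eps)
```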

Given the formulas above, what does the "backward" in backpropagation actually mean?

The three factors in the formula above correspond exactly to the three stages of the backward pass shown below.

Based on the figure above, what happens if we extend the network to multiple outputs?

With multiple outputs the problem suddenly becomes more complicated: every \hat{y_j} has its own opinion about how w_{mn} should change. What do we do then?

Meet each problem as it comes: arrange each output \hat{y_j}'s gradient with respect to the w_{mn} as a column vector, then add them all up as the total gradient and use it to update the weights.

\begin{bmatrix} \frac{\partial \hat{y_1}}{\partial w_{11}} \\ \frac{\partial \hat{y_1}}{\partial w_{12}} \\ \frac{\partial \hat{y_1}}{\partial w_{21}} \\ \frac{\partial \hat{y_1}}{\partial w_{22}} \\ \frac{\partial \hat{y_1}}{\partial w_{31}} \\ \frac{\partial \hat{y_1}}{\partial w_{32}} \end{bmatrix} = \begin{bmatrix} w'_{11}h_1(1-h_1)\cdot x_1 \\ w'_{11}h_1(1-h_1)\cdot x_2\\ w'_{12}h_2(1-h_2)\cdot x_1 \\ w'_{12}h_2(1-h_2)\cdot x_2\\ w'_{13}h_3(1-h_3)\cdot x_1 \\ w'_{13}h_3(1-h_3)\cdot x_2\end{bmatrix}

\vdots

\begin{bmatrix} \frac{\partial \hat{y_j}}{\partial w_{11}} \\ \frac{\partial \hat{y_j}}{\partial w_{12}} \\ \frac{\partial \hat{y_j}}{\partial w_{21}} \\ \frac{\partial \hat{y_j}}{\partial w_{22}} \\ \frac{\partial \hat{y_j}}{\partial w_{31}} \\ \frac{\partial \hat{y_j}}{\partial w_{32}} \end{bmatrix} = \begin{bmatrix} w'_{j1}h_1(1-h_1)\cdot x_1 \\ w'_{j1}h_1(1-h_1)\cdot x_2\\ w'_{j2}h_2(1-h_2)\cdot x_1 \\ w'_{j2}h_2(1-h_2)\cdot x_2\\ w'_{j3}h_3(1-h_3)\cdot x_1 \\ w'_{j3}h_3(1-h_3)\cdot x_2\end{bmatrix}

E=\frac{1}{2}\sum_{j=1}^{k} (\hat{y_j}-y_j)^2=\frac{1}{2}(\hat{y_1}-y_1)^2 + \frac{1}{2}(\hat{y_2}-y_2)^2 + \cdots + \frac{1}{2}(\hat{y_k}-y_k)^2

Therefore, by the total-derivative (multivariable chain) rule:

\frac{\partial E}{\partial w_{11}}= \frac{\partial E}{\partial \hat{y_1}} \frac{\partial \hat{y_1}}{\partial w_{11}} + \frac{\partial E}{\partial \hat{y_2}} \frac{\partial \hat{y_2}}{\partial w_{11}}+\cdots + \frac{\partial E}{\partial \hat{y_j}} \frac{\partial \hat{y_j}}{\partial w_{11}}+ \cdots + \frac{\partial E}{\partial \hat{y_k}} \frac{\partial \hat{y_k}}{\partial w_{11}}

\cdots

\frac{\partial E}{\partial w_{mn}}= \frac{\partial E}{\partial \hat{y_1}} \frac{\partial \hat{y_1}}{\partial w_{mn}} + \frac{\partial E}{\partial \hat{y_2}} \frac{\partial \hat{y_2}}{\partial w_{mn}}+\cdots + \frac{\partial E}{\partial \hat{y_j}} \frac{\partial \hat{y_j}}{\partial w_{mn}}+ \cdots + \frac{\partial E}{\partial \hat{y_k}} \frac{\partial \hat{y_k}}{\partial w_{mn}}
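The total-derivative rule can be sketched with k = 2 outputs: each output's contribution is accumulated through the transposed output-weight matrix. All numbers below are made-up illustrative values:

```python
# With k outputs, the first-layer gradient sums every output's contribution:
# dE/dh = W2^T (y_hat - y), then through the sigmoid to z, then outer with x.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W1, W2, x, y):
    h = sigmoid(W1 @ x)
    return 0.5 * np.sum((W2 @ h - y) ** 2)

x = np.array([0.5, -1.0])
W1 = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])   # 3x2 first layer
W2 = np.array([[0.7, 0.8, 0.9], [0.2, 0.1, 0.3]])     # 2x3 output layer (w'_{ji})
y = np.array([1.0, 0.0])

h = sigmoid(W1 @ x)
y_hat = W2 @ h
delta_z = (W2.T @ (y_hat - y)) * h * (1.0 - h)        # summed over all outputs j
grad_W1 = np.outer(delta_z, x)

# finite-difference check of one entry, dE/dw_11
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
num_11 = (loss(Wp, W2, x, y) - loss(Wm, W2, x, y)) / (2 * eps)
```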

The final update gradient is the sum of the gradient column vectors, one per \hat{y_j}:

\begin{bmatrix} \frac{\partial E}{\partial w_{11}} \\ \frac{\partial E}{\partial w_{12}} \\ \frac{\partial E}{\partial w_{21}} \\ \frac{\partial E}{\partial w_{22}} \\ \frac{\partial E}{\partial w_{31}} \\ \frac{\partial E}{\partial w_{32}} \end{bmatrix} =\begin{bmatrix} \frac{\partial E}{\partial \hat{y_1}} \frac{\partial \hat{y_1}}{\partial w_{11}} \\ \frac{\partial E}{\partial \hat{y_1}} \frac{\partial \hat{y_1}}{\partial w_{12}} \\ \frac{\partial E}{\partial \hat{y_1}} \frac{\partial \hat{y_1}}{\partial w_{21}} \\ \frac{\partial E}{\partial \hat{y_1}} \frac{\partial \hat{y_1}}{\partial w_{22}} \\ \frac{\partial E}{\partial \hat{y_1}} \frac{\partial \hat{y_1}}{\partial w_{31}} \\ \frac{\partial E}{\partial \hat{y_1}} \frac{\partial \hat{y_1}}{\partial w_{32}} \end{bmatrix} + \cdots + \begin{bmatrix} \frac{\partial E}{\partial \hat{y_j}} \frac{\partial \hat{y_j}}{\partial w_{11}} \\ \frac{\partial E}{\partial \hat{y_j}} \frac{\partial \hat{y_j}}{\partial w_{12}} \\ \frac{\partial E}{\partial \hat{y_j}} \frac{\partial \hat{y_j}}{\partial w_{21}} \\ \frac{\partial E}{\partial \hat{y_j}} \frac{\partial \hat{y_j}}{\partial w_{22}} \\ \frac{\partial E}{\partial \hat{y_j}} \frac{\partial \hat{y_j}}{\partial w_{31}} \\ \frac{\partial E}{\partial \hat{y_j}} \frac{\partial \hat{y_j}}{\partial w_{32}} \end{bmatrix} + \cdots + \begin{bmatrix} \frac{\partial E}{\partial \hat{y_k}} \frac{\partial \hat{y_k}}{\partial w_{11}} \\ \frac{\partial E}{\partial \hat{y_k}} \frac{\partial \hat{y_k}}{\partial w_{12}} \\ \frac{\partial E}{\partial \hat{y_k}} \frac{\partial \hat{y_k}}{\partial w_{21}} \\ \frac{\partial E}{\partial \hat{y_k}} \frac{\partial \hat{y_k}}{\partial w_{22}} \\ \frac{\partial E}{\partial \hat{y_k}} \frac{\partial \hat{y_k}}{\partial w_{31}} \\ \frac{\partial E}{\partial \hat{y_k}} \frac{\partial \hat{y_k}}{\partial w_{32}} \end{bmatrix}

\\ \begin{bmatrix} \frac{\partial E}{\partial w_{11}} \\ \frac{\partial E}{\partial w_{12}} \\ \frac{\partial E}{\partial w_{21}} \\ \frac{\partial E}{\partial w_{22}} \\ \frac{\partial E}{\partial w_{31}} \\ \frac{\partial E}{\partial w_{32}} \end{bmatrix} =\begin{bmatrix} \frac{\partial E}{\partial \hat{y_1}} \frac{\partial \hat{y_1}}{\partial w_{11}} \\ \frac{\partial E}{\partial \hat{y_1}} \frac{\partial \hat{y_1}}{\partial w_{12}} \\ \frac{\partial E}{\partial \hat{y_1}} \frac{\partial \hat{y_1}}{\partial w_{21}} \\ \frac{\partial E}{\partial \hat{y_1}} \frac{\partial \hat{y_1}}{\partial w_{22}} \\ \frac{\partial E}{\partial \hat{y_1}} \frac{\partial \hat{y_1}}{\partial w_{31}} \\ \frac{\partial E}{\partial \hat{y_1}} \frac{\partial \hat{y_1}}{\partial w_{32}} \end{bmatrix} + \cdots + \vec{0}+\cdots +\vec{0} =\ \ \ \begin{bmatrix} \frac{\partial E}{\partial \hat{y_1}} \frac{\partial \hat{y_1}}{\partial w_{11}} \\ \frac{\partial E}{\partial \hat{y_1}} \frac{\partial \hat{y_1}}{\partial w_{12}} \\ \frac{\partial E}{\partial \hat{y_1}} \frac{\partial \hat{y_1}}{\partial w_{21}} \\ \frac{\partial E}{\partial \hat{y_1}} \frac{\partial \hat{y_1}}{\partial w_{22}} \\ \frac{\partial E}{\partial \hat{y_1}} \frac{\partial \hat{y_1}}{\partial w_{31}} \\ \frac{\partial E}{\partial \hat{y_1}} \frac{\partial \hat{y_1}}{\partial w_{32}} \end{bmatrix}

Now let's look at the matrix-transpose rule in backpropagation:

E=\frac{1}{2}\sum_{j=1}^{k} (\hat{y_j}-y_j)^2=\frac{1}{2}(\hat{y_1}-y_1)^2 + \cdots + \frac{1}{2}(\hat{y_j}-y_j)^2

\begin{bmatrix} \hat{y_1}\\ \vdots \\ \hat{y_j}\\ \vdots \end{bmatrix} = \begin{bmatrix} w'_{11} &w'_{12} &w'_{13} \\ \cdots & \cdots & \cdots \\ w'_{j1} &w'_{j2} &w'_{j3} \\ \cdots &\cdots & \cdots \end{bmatrix}\begin{bmatrix} h_1\\ h_2\\ h_3 \end{bmatrix}

\frac{\partial E}{\partial h_1} = \frac{\partial E}{\partial \hat{y_1}}\frac{\partial {\hat{y_1}}}{\partial h_1} + \cdots + \frac{\partial E}{\partial \hat{y_j}}\frac{\partial {\hat{y_j}}}{\partial h_1} + \cdots

\frac{\partial E}{\partial h_2} = \frac{\partial E}{\partial \hat{y_1}}\frac{\partial {\hat{y_1}}}{\partial h_2} + \cdots + \frac{\partial E}{\partial \hat{y_j}}\frac{\partial {\hat{y_j}}}{\partial h_2} + \cdots

\frac{\partial E}{\partial h_3} = \frac{\partial E}{\partial \hat{y_1}}\frac{\partial {\hat{y_1}}}{\partial h_3} + \cdots + \frac{\partial E}{\partial \hat{y_j}}\frac{\partial {\hat{y_j}}}{\partial h_3} + \cdots

\frac{\partial {\hat{y_1}}}{\partial h_1}=w'_{11} \ \ \frac{\partial {\hat{y_j}}}{\partial h_1}=w'_{j1}

\frac{\partial {\hat{y_1}}}{\partial h_2}=w'_{12} \ \ \frac{\partial {\hat{y_j}}}{\partial h_2}=w'_{j2}

\frac{\partial {\hat{y_1}}}{\partial h_3}=w'_{13} \ \ \frac{\partial {\hat{y_j}}}{\partial h_3}=w'_{j3}

Therefore:

\begin{bmatrix} \frac{\partial E}{\partial h_1}\\ \frac{\partial E}{\partial h_2}\\ \frac{\partial E}{\partial h_3} \end{bmatrix}=\begin{bmatrix} w'_{11}& \cdots & w'_{j1} & \cdots \\ w'_{12}& \cdots & w'_{j2}& \cdots \\ w'_{13}& \cdots & w'_{j3}& \cdots \end{bmatrix}\begin{bmatrix} \frac{\partial E}{\partial \hat{y_1}} \\ \cdots \\ \frac{\partial E}{\partial \hat{y_j}}\\ \cdots \end{bmatrix}

Next, the derivatives with respect to the output-layer weights w'_{ji}:

\frac{\partial E}{\partial w'_{11}} = \frac{\partial E}{\partial \hat{y_1}}\frac{\partial {\hat{y_1}}}{\partial w'_{11}} + \cdots + \frac{\partial E}{\partial \hat{y_j}}\frac{\partial {\hat{y_j}}}{\partial w'_{11}} + \cdots=\frac{\partial E}{\partial \hat{y_1}}\frac{\partial {\hat{y_1}}}{\partial w'_{11}}=\frac{\partial E}{\partial \hat{y_1}}h_1

\frac{\partial E}{\partial w'_{12}} = \frac{\partial E}{\partial \hat{y_1}}\frac{\partial {\hat{y_1}}}{\partial w'_{12}} + \cdots + \frac{\partial E}{\partial \hat{y_j}}\frac{\partial {\hat{y_j}}}{\partial w'_{12}} + \cdots=\frac{\partial E}{\partial \hat{y_1}}\frac{\partial {\hat{y_1}}}{\partial w'_{12}}=\frac{\partial E}{\partial \hat{y_1}}h_2

\frac{\partial E}{\partial w'_{13}} = \frac{\partial E}{\partial \hat{y_1}}\frac{\partial {\hat{y_1}}}{\partial w'_{13}} + \cdots + \frac{\partial E}{\partial \hat{y_j}}\frac{\partial {\hat{y_j}}}{\partial w'_{13}} + \cdots=\frac{\partial E}{\partial \hat{y_1}}\frac{\partial {\hat{y_1}}}{\partial w'_{13}}=\frac{\partial E}{\partial \hat{y_1}}h_3

\cdots\cdots\cdots

\frac{\partial E}{\partial w'_{j1}} = \frac{\partial E}{\partial \hat{y_1}}\frac{\partial {\hat{y_1}}}{\partial w'_{j1}} + \cdots + \frac{\partial E}{\partial \hat{y_j}}\frac{\partial {\hat{y_j}}}{\partial w'_{j1}} + \cdots=\frac{\partial E}{\partial \hat{y_j}}\frac{\partial {\hat{y_j}}}{\partial w'_{j1}}=\frac{\partial E}{\partial \hat{y_j}}h_1

\frac{\partial E}{\partial w'_{j2}} = \frac{\partial E}{\partial \hat{y_1}}\frac{\partial {\hat{y_1}}}{\partial w'_{j2}} + \cdots + \frac{\partial E}{\partial \hat{y_j}}\frac{\partial {\hat{y_j}}}{\partial w'_{j2}} + \cdots=\frac{\partial E}{\partial \hat{y_j}}\frac{\partial {\hat{y_j}}}{\partial w'_{j2}}=\frac{\partial E}{\partial \hat{y_j}}h_2

\frac{\partial E}{\partial w'_{j3}} = \frac{\partial E}{\partial \hat{y_1}}\frac{\partial {\hat{y_1}}}{\partial w'_{j3}} + \cdots + \frac{\partial E}{\partial \hat{y_j}}\frac{\partial {\hat{y_j}}}{\partial w'_{j3}} + \cdots=\frac{\partial E}{\partial \hat{y_j}}\frac{\partial {\hat{y_j}}}{\partial w'_{j3}}=\frac{\partial E}{\partial \hat{y_j}}h_3

\cdots\cdots\cdots

Therefore:

\begin{bmatrix} \frac{\partial E}{\partial w'_{11}} &\frac{\partial E}{\partial w'_{12}}&\frac{\partial E}{\partial w'_{13}} \\ \cdots & \cdots & \cdots \\ \frac{\partial E}{\partial w'_{j1}} &\frac{\partial E}{\partial w'_{j2}} &\frac{\partial E}{\partial w'_{j3}} \\ \cdots &\cdots & \cdots \end{bmatrix}=\begin{bmatrix} \frac{\partial E}{\partial \hat{y_1}}h_1 &\frac{\partial E}{\partial \hat{y_1}}h_2&\frac{\partial E}{\partial \hat{y_1}}h_3 \\ \cdots & \cdots & \cdots \\\frac{\partial E}{\partial \hat{y_j}}h_1 &\frac{\partial E}{\partial \hat{y_j}}h_2 &\frac{\partial E}{\partial \hat{y_j}}h_3\\ \cdots &\cdots & \cdots \end{bmatrix}=\begin{bmatrix} \frac{\partial E}{\partial \hat{y_1}}\\ \cdots \\ \frac{\partial E}{\partial \hat{y_j}} \\ \cdots \end{bmatrix}\begin{bmatrix} h_1 & h_2 & h_3 \end{bmatrix}

Combining this with the earlier result:

\begin{bmatrix} \frac{\partial E}{\partial h_1}\\ \frac{\partial E}{\partial h_2}\\ \frac{\partial E}{\partial h_3} \end{bmatrix}=\begin{bmatrix} w'_{11}& \cdots & w'_{j1} & \cdots \\ w'_{12}& \cdots & w'_{j2}& \cdots \\ w'_{13}& \cdots & w'_{j3}& \cdots \end{bmatrix}\begin{bmatrix} \frac{\partial E}{\partial \hat{y_1}} \\ \cdots \\ \frac{\partial E}{\partial \hat{y_j}}\\ \cdots \end{bmatrix}

we can state the two backpropagation formulas. They are essential for understanding backpropagation code, because they replace tedious derivative bookkeeping with a layer-to-layer recurrence (here x stands in for h):

\frac{\partial \boldsymbol{Loss}}{\partial \boldsymbol{x}}=\boldsymbol{W^T}\cdot \frac{\partial \boldsymbol{Loss}}{\partial \boldsymbol{\hat{y}}}

\frac{\partial \boldsymbol{Loss}}{\partial \boldsymbol{W}}=\frac{\partial \boldsymbol{Loss}}{\partial \boldsymbol{\hat{y}}}\cdot\boldsymbol{ x^T}
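The two formulas can be written directly as code for a linear layer \hat{y} = W x. This is a minimal sketch: the sizes are arbitrary and the squared-error loss is chosen only so the result can be checked numerically:

```python
# dLoss/dx = W^T dLoss/dy_hat   and   dLoss/dW = dLoss/dy_hat x^T
# Shapes: W is (m, n), x is (n, 1), so grad_x is (n, 1) and grad_W is (m, n).
import numpy as np

def loss(W, x, t):
    return 0.5 * np.sum((W @ x - t) ** 2)

rng = np.random.default_rng(0)
m, n = 4, 3
W = rng.standard_normal((m, n))
x = rng.standard_normal((n, 1))
t = rng.standard_normal((m, 1))           # target

grad_y = W @ x - t                        # dLoss/dy_hat, shape (m, 1)
grad_x = W.T @ grad_y                     # error passed backward through W^T
grad_W = grad_y @ x.T                     # outer product: output error times input

# finite-difference check of dLoss/dx_1
eps = 1e-6
xp, xm = x.copy(), x.copy()
xp[0, 0] += eps
xm[0, 0] -= eps
num_x0 = (loss(W, xp, t) - loss(W, xm, t)) / (2 * eps)
```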

In this recurrence, the pivot that links the two formulas is

\frac{\partial Loss}{\partial \hat{y}}

In some books this quantity is written as

\boldsymbol{\delta^{l}_j=\frac{\partial Loss}{\partial y^l_j}}

and is called the neuron error.

Substituting the neuron error into the formulas above gives:

\mathbf{\\\frac{\partial Loss}{\partial w^l_{ji}}=\delta^l_jx^{l-1}_i, \ \ \frac{\partial Loss}{\partial b^l_j}=\delta^l_j}

So the backpropagation computation reduces to matrix multiplication, which gives the software implementation of neural-network gradients a uniform form: computers are poor at symbolic differentiation but excellent at numerical computation and recurrences, and the formulas above can be evaluated recursively, layer by layer.
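The recurrence can be sketched for a small multi-layer network: starting from the output \delta, each layer's \delta comes from the next layer's via W^T, and each weight gradient is an outer product. Sizes and values below are made-up, and sigmoid is assumed on every layer:

```python
# Layer-by-layer backprop recurrence, checked against a finite difference.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def total_loss(W_list, x, y):
    a = x
    for W in W_list:
        a = sigmoid(W @ a)
    return 0.5 * np.sum((a - y) ** 2)

rng = np.random.default_rng(1)
sizes = [2, 3, 3, 1]                                  # input, two hidden layers, output
Ws = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]
x = rng.standard_normal((2, 1))
y = np.ones((1, 1))

# forward pass, caching every activation
acts = [x]
for W in Ws:
    acts.append(sigmoid(W @ acts[-1]))

# backward pass: delta^L at the output, then recurse through W^T
delta = (acts[-1] - y) * acts[-1] * (1 - acts[-1])
grads = [None] * len(Ws)
for l in range(len(Ws) - 1, -1, -1):
    grads[l] = delta @ acts[l].T                      # dLoss/dW^l = delta^l (x^{l-1})^T
    if l > 0:
        delta = (Ws[l].T @ delta) * acts[l] * (1 - acts[l])

# finite-difference check of one first-layer entry
eps = 1e-6
Wp = [W.copy() for W in Ws]
Wm = [W.copy() for W in Ws]
Wp[0][0, 0] += eps
Wm[0][0, 0] -= eps
num_00 = (total_loss(Wp, x, y) - total_loss(Wm, x, y)) / (2 * eps)
```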

Notice that the weight matrices used in the forward and backward passes are transposes of each other, and the weight-matrix gradient can be expressed in terms of the layer's output error and input:

For example, an MxN matrix times an Nx1 vector yields an Mx1 vector, and that Mx1 vector times the transposed 1xN input vector returns an MxN gradient matrix. In other words, no dimension information is lost.

The derivation below is a record of a mistake. Sometimes understanding why you made a mistake matters more than understanding the problem itself.

The error is visible from the form alone: since the last layer is fully connected, the partial derivative of E with respect to w_{11} should depend not only on \hat{y_1} but also on all the remaining outputs \hat{y_2}, ..., \hat{y_j}, ..., \hat{y_k}. I don't know why I thought otherwise at first.


\begin{bmatrix} \frac{\partial E}{\partial w_{11}} \\ \frac{\partial E}{\partial w_{12}} \\ \frac{\partial E}{\partial w_{21}} \\ \frac{\partial E}{\partial w_{22}} \\ \frac{\partial E}{\partial w_{31}} \\ \frac{\partial E}{\partial w_{32}} \end{bmatrix} = \begin{bmatrix} \frac{\partial E}{\partial \hat{y_1}} \frac{\partial \hat{y_1}}{\partial w_{11}} \\ \frac{\partial E}{\partial \hat{y_1}} \frac{\partial \hat{y_1}}{\partial w_{12}} \\ \frac{\partial E}{\partial \hat{y_1}} \frac{\partial \hat{y_1}}{\partial w_{21}} \\ \frac{\partial E}{\partial \hat{y_1}} \frac{\partial \hat{y_1}}{\partial w_{22}} \\ \frac{\partial E}{\partial \hat{y_1}} \frac{\partial \hat{y_1}}{\partial w_{31}} \\ \frac{\partial E}{\partial \hat{y_1}} \frac{\partial \hat{y_1}}{\partial w_{32}} \end{bmatrix} = \begin{bmatrix} (\hat{y_1}-y_1)w'_{11}h_1(1-h_1)\cdot x_1 \\(\hat{y_1}-y_1) w'_{11}h_1(1-h_1)\cdot x_2\\ (\hat{y_1}-y_1) w'_{12}h_2(1-h_2)\cdot x_1 \\ (\hat{y_1}-y_1)w'_{12}h_2(1-h_2)\cdot x_2\\ (\hat{y_1}-y_1)w'_{13}h_3(1-h_3)\cdot x_1 \\ (\hat{y_1}-y_1) w'_{13}h_3(1-h_3)\cdot x_2\end{bmatrix}

\vdots

\begin{bmatrix} \frac{\partial E}{\partial w_{11}} \\ \frac{\partial E}{\partial w_{12}} \\ \frac{\partial E}{\partial w_{21}} \\ \frac{\partial E}{\partial w_{22}} \\ \frac{\partial E}{\partial w_{31}} \\ \frac{\partial E}{\partial w_{32}} \end{bmatrix} = \begin{bmatrix} \frac{\partial E}{\partial \hat{y_j}} \frac{\partial \hat{y_j}}{\partial w_{11}} \\ \frac{\partial E}{\partial \hat{y_j}} \frac{\partial \hat{y_j}}{\partial w_{12}} \\ \frac{\partial E}{\partial \hat{y_j}} \frac{\partial \hat{y_j}}{\partial w_{21}} \\ \frac{\partial E}{\partial \hat{y_j}} \frac{\partial \hat{y_j}}{\partial w_{22}} \\ \frac{\partial E}{\partial \hat{y_j}} \frac{\partial \hat{y_j}}{\partial w_{31}} \\ \frac{\partial E}{\partial \hat{y_j}} \frac{\partial \hat{y_j}}{\partial w_{32}} \end{bmatrix} = \begin{bmatrix} (\hat{y_j}-y_j)w'_{j1}h_1(1-h_1)\cdot x_1 \\(\hat{y_j}-y_j) w'_{j1}h_1(1-h_1)\cdot x_2\\ (\hat{y_j}-y_j) w'_{j2}h_2(1-h_2)\cdot x_1 \\ (\hat{y_j}-y_j)w'_{j2}h_2(1-h_2)\cdot x_2\\ (\hat{y_j}-y_j)w'_{j3}h_3(1-h_3)\cdot x_1 \\ (\hat{y_j}-y_j) w'_{j3}h_3(1-h_3)\cdot x_2\end{bmatrix}


In this way, after computing the derivative of each \hat{y_j} with respect to w_{mn}, all of these gradients are added together as the final gradient used to adjust w_{mn}. The adjusted weights are then closer to what the training objective wants the network to do.

Reference: 反向传播 Back Propagation (手把手推导), bilibili

The end!


Reposted from blog.csdn.net/tugouxp/article/details/120519277