Neural Network Learning

Foreword

        This chapter follows directly from the previous one, where we introduced why we need neural networks, the structure of the neural network model, and its applications in daily life. In this chapter I will explain in detail how a neural network learns, combining forward propagation and backpropagation so that it learns better.

        Finally, if anything is unclear, please let me know. Thank you!

Chapter 7 Neural Network Learning

 7.1 Cost Function

      When we introduced linear regression and logistic regression earlier, we defined a cost function for each, and we explained its role: it determines whether a learning algorithm is effective. Here we also need to define a cost function for the neural network. In this section we focus on classification, with the model shown in Figure 1.

                                                                               Figure 1 neural network (classification) model

     In this model, when we have only one output unit, the result is binary: 0 or 1. When we have more than one output, i.e., more than one class, the result is y\in R^{K} (K classes in total). Taking four classes as an example, \begin{bmatrix} 1\\0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0\\1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0\\0 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0\\0 \\ 0 \\ 1 \end{bmatrix} each represent a different class. We use L to denote the total number of layers in the model; for the network in Figure 1, L=4. We use s_{l} to denote the number of units in layer l, not counting the bias unit; for example, the second layer in Figure 1 has 5 units, so s_{2}=5.
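As a quick illustration, converting a class label into one of the vectors above can be sketched in Python with numpy (the function name one_hot is my own, not from the course):

```python
import numpy as np

def one_hot(label, K):
    """Return a length-K vector with a 1 at position `label` (0-indexed)."""
    y = np.zeros(K)
    y[label] = 1.0
    return y

# With K = 4 classes, class index 2 becomes the third basis vector:
print(one_hot(2, 4))  # [0. 0. 1. 0.]
```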

      Recall the basic structure of neural networks from earlier. Since this is a classification problem, recall also the cost function we derived for logistic regression: J(\theta )=-\frac{1}{m}\sum_{i=1}^{m}[y^{(i)}logh(x^{(i)})+(1-y^{(i)})log(1-h(x^{(i)}))]+\frac{\lambda }{2m}\sum_{j=1}^{n}\theta _{j}^{2}. In a neural network the output has more than one unit, so the cost function becomes: J(\Theta )=-\frac{1}{m}[\sum_{i=1}^{m}\sum_{k=1}^{K}y_{k}^{(i)}log(h_{\Theta }(x^{(i)}))_{k}+(1-y_{k}^{(i)})log(1-(h_{\Theta }(x^{(i)}))_{k})]+\frac{\lambda }{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_{l}}\sum_{j=1}^{s_{l+1}}(\Theta _{ji}^{(l)})^2

 This again includes a regularization term. As before, what we need to do is minimize J(\Theta ). Our earlier approach was to take the derivative of J(\Theta ) with respect to \Theta, set it to 0, and solve for the corresponding \Theta.
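To make the formula concrete, here is a minimal sketch of this cost function in Python with numpy (the document's own snippets are in Octave; the function name nn_cost and its calling convention are my own assumptions):

```python
import numpy as np

def nn_cost(H, Y, Thetas, lam):
    """Regularized neural-network cost J(Theta).

    H      -- (m, K) outputs h_Theta(x^(i)) for each example
    Y      -- (m, K) one-hot labels y^(i)
    Thetas -- list of weight matrices Theta^(l); column 0 multiplies the bias unit
    lam    -- regularization parameter lambda
    """
    m = Y.shape[0]
    # Double sum over examples i and classes k of the log-loss terms.
    data_term = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # Regularization: squared weights over all layers, skipping the bias columns.
    reg_term = lam / (2 * m) * sum(np.sum(T[:, 1:] ** 2) for T in Thetas)
    return data_term + reg_term
```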

7.2 Backpropagation

       Here I will introduce a new algorithm to compute the partial derivatives of J(\Theta ) with respect to \Theta: the backpropagation algorithm, the focus of this section. In the previous chapter, working step by step from the input layer to the output layer was called forward propagation; backpropagation, as the name suggests, is the process of working from the output back to the input. We use \delta _{j}^{(l)} to denote the error of the j-th unit in layer l. Taking the model of Figure 1 as an example, the error of the output layer (layer 4) is \delta _{j}^{(4)}=a_{j}^{(4)}-y_{j}, i.e., the difference between the output we obtain by propagating forward and the actual output; in vector form, \delta ^{(4)}=a^{(4)}-y. Then \delta ^{(3)}=(\Theta ^{(3)})^{\top }\delta ^{(4)}.*g'(z^{(3)}) (where g'(z^{(3)})=a^{(3)}.*(1-a^{(3)})), and likewise \delta ^{(2)}=(\Theta ^{(2)})^{\top }\delta ^{(3)}.*g'(z^{(2)}) (where g'(z^{(2)})=a^{(2)}.*(1-a^{(2)})). There is no \delta ^{(1)}, because the input layer is just the data we are given, and it carries no error. As for the mathematical derivation behind these error formulas, it does not need to be studied in depth here; to be honest, I do not fully understand it myself. The computation proceeds step by step from the output back to the input, and in the end we obtain the derivative we wanted: \frac{\partial }{\partial \Theta _{ij}^{(l)}}J(\Theta )=a_{j}^{(l)}\delta _{i}^{(l+1)} (ignoring regularization, i.e., \lambda =0).

    Let me now summarize the backpropagation algorithm:

We have a training set \{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),...,(x^{(m)},y^{(m)})\}

Set \Delta _{ij}^{(l)}=0 for all l, i, j (used to accumulate \frac{\partial }{\partial \Theta _{ij}^{(l)}}J(\Theta ))

for i=1 to m   \leftarrow (x^{(i)},y^{(i)})

    set a^{(1)}=x^{(i)}

    perform forward propagation to compute a^{(l)} (l=2,3,...,L)

    using y^{(i)}, compute \delta ^{(L)}=a^{(L)}-y^{(i)}

    compute \delta ^{(L-1)},\delta ^{(L-2)},...,\delta ^{(2)}

    \Delta _{ij}^{(l)}:=\Delta _{ij}^{(l)}+a_{j}^{(l)}\delta _{i}^{(l+1)}\Rightarrow \Delta ^{(l)}:=\Delta ^{(l)}+\delta ^{(l+1)}(a^{(l)})^\top

D_{ij}^{(l)}:=\frac{1}{m}\Delta _{ij}^{(l)}+\lambda \Theta _{ij}^{(l)}    if j\neq 0

D_{ij}^{(l)}:=\frac{1}{m}\Delta _{ij}^{(l)}                  if j=0

Finally, \frac{\partial }{\partial \Theta _{ij}^{(l)}}J(\Theta )=D_{ij}^{(l)}
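The loop above can be sketched in Python with numpy (the document's code is in Octave; here I assume sigmoid activations throughout, a 0-indexed list of weight matrices, and helper names of my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(Thetas, X, Y, lam):
    """Accumulate Delta over all m examples, then form D = dJ/dTheta.

    Thetas[l] maps layer l+1 to layer l+2 (0-indexed list);
    column 0 of each Theta multiplies the bias unit."""
    m = X.shape[0]
    Deltas = [np.zeros_like(T) for T in Thetas]
    for i in range(m):
        # forward propagation: a^(1) = x^(i)
        activations = []
        a = X[i]
        for T in Thetas:
            a = np.concatenate(([1.0], a))   # prepend the bias unit
            activations.append(a)
            a = sigmoid(T @ a)
        # delta^(L) = a^(L) - y^(i)
        delta = a - Y[i]
        # propagate the error backwards and accumulate Delta
        for l in range(len(Thetas) - 1, -1, -1):
            Deltas[l] += np.outer(delta, activations[l])
            if l > 0:
                a_l = activations[l]
                delta = (Thetas[l].T @ delta) * a_l * (1 - a_l)
                delta = delta[1:]            # drop the bias-unit error
    # D = (1/m)*Delta, plus lambda*Theta for the non-bias columns (j != 0)
    Ds = []
    for T, Delta in zip(Thetas, Deltas):
        D = Delta / m
        D[:, 1:] += lam * T[:, 1:]
        Ds.append(D)
    return Ds
```

A quick sanity check is to compare these derivatives against the numerical approximation described under gradient checking later in this chapter.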

7.3 Forward Propagation and Backpropagation Summary

     After the explanation of backpropagation above, you may still not fully grasp what backpropagation actually does. In this section I will help you understand it intuitively with figures, and compare the whole process with forward propagation.

1) Forward Propagation

                                                                        Figure 2 Forward Propagation

In Figure 2 we see a 4-layer neural network. Taking the third layer as an example, z_{1}^{(3)}=\Theta _{10}^{(2)}+\Theta _{11}^{(2)}a_{1}^{(2)}+\Theta _{12}^{(2)}a_{2}^{(2)}, and a_{1}^{(3)}=g(z_{1}^{(3)}); in this way the signal is passed from the input layer through to the output layer.
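The computation for this single unit can be sketched as follows (the weight and activation values here are made-up numbers for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights Theta_10, Theta_11, Theta_12 of Theta^(2) feeding unit 1
# of layer 3, and activations [a_0, a_1, a_2] of layer 2 (a_0 = 1 is the bias unit).
theta = np.array([0.5, -1.0, 2.0])
a2 = np.array([1.0, 0.3, 0.8])

z = theta @ a2    # z_1^(3) = Theta_10 + Theta_11 * a_1^(2) + Theta_12 * a_2^(2)
a = sigmoid(z)    # a_1^(3) = g(z_1^(3))
```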

2) Backpropagation

      So what exactly does backpropagation do?

For the cost function J(\Theta )=-\frac{1}{m}\sum_{i=1}^{m}[y^{(i)}logh_{\Theta }(x^{(i)})+(1-y^{(i)})log(1-h_{\Theta }(x^{(i)}))]+\frac{\lambda }{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_{l}}\sum_{j=1}^{s_{l+1}}(\Theta _{ij}^{(l)})^2, if we ignore the regularization term, the equation reduces to the logistic regression cost function from the beginning, and we write cost(i)=y^{(i)}logh_{\Theta }(x^{(i)})+(1-y^{(i)})log(1-h_{\Theta }(x^{(i)})).

You can think of cost(i)\approx (h_{\Theta }(x^{(i)})-y^{(i)})^2, so what we are still doing is making the predicted values closer to the actual data, i.e., minimizing the sum of all these distances. The whole computation flow of backpropagation is shown in Figure 3.

                                                                                  Figure 3 Backpropagation

In Figure 3, note first that the bias units carry no error. Here \delta _{j}^{(l)} denotes the error associated with a_{j}^{(l)}. The formal definition of \delta _{j}^{(l)} is \delta _{j}^{(l)}=\frac{\partial }{\partial z_{j}^{(l)}}cost(i) (with cost(i)=y^{(i)}logh_{\Theta }(x^{(i)})+(1-y^{(i)})log(1-h_{\Theta }(x^{(i)}))). For example, here \delta _{1}^{(4)}=a_{1}^{(4)}-y^{(i)}, while \delta _{2}^{(3)}=\Theta _{12}^{(3)}\delta _{1}^{(4)} and \delta _{2}^{(2)}=\Theta _{12}^{(2)}\delta _{1}^{(3)}+\Theta _{22}^{(2)}\delta _{2}^{(3)}, and so on. It works just like forward propagation, only the direction and the quantities being propagated differ.

7.4 Preparing for the Implementation of the Algorithm

      When we wrote costFunction earlier, the form we gave the function was:

function [jVal,gradient]=costFunction(theta)

...

optTheta=fminunc(@costFunction,initialTheta,options)

Here both theta and initialTheta are vectors, not matrices, whereas for a neural network with L=4 we need the matrices corresponding to \Theta ^{(1)},\Theta ^{(2)},\Theta ^{(3)} (Theta1, Theta2, Theta3), and for D^{(1)},D^{(2)},D^{(3)} we likewise need the corresponding matrices (D1, D2, D3).

So, for example, with s_{1}=10, s_{2}=10, s_{3}=1:

\Theta ^{(1)}\in R^{10\times 11}, \Theta ^{(2)}\in R^{10\times 11}, \Theta ^{(3)}\in R^{1\times 11}

D^{(1)}\in R^{10\times 11}, D^{(2)}\in R^{10\times 11}, D^{(3)}\in R^{1\times 11}

thetaVec=[Theta1(:);Theta2(:);Theta3(:)];         % unroll the three matrices into one vector

DVec=[D1(:);D2(:);D3(:)];

Theta1=reshape(thetaVec(1:110),10,11);           % split the vector back into the three matrices

Theta2 = reshape(thetaVec(111:220),10,11);

Theta3 = reshape(thetaVec(221:231),1,11);

So, given the original parameters \Theta ^{(1)},\Theta ^{(2)},\Theta ^{(3)}, we need to unroll them into a vector initialTheta and then pass it to fminunc(@costFunction, initialTheta, options). When we construct the cost function as function [jval, gradientVec] = costFunction(thetaVec), the parameter passed in is thetaVec, so inside the function we must recover \Theta ^{(1)},\Theta ^{(2)},\Theta ^{(3)} from thetaVec, use them to compute D^{(1)},D^{(2)},D^{(3)}, and then unroll these into the vector gradientVec.
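The same unroll-and-reshape round trip can be sketched in Python with numpy; note that Octave's Theta1(:) reads the matrix column by column, which order='F' reproduces (the example values here are arbitrary):

```python
import numpy as np

# Arbitrary stand-ins for the matrices with s1=10, s2=10, s3=1.
Theta1 = np.arange(110, dtype=float).reshape(10, 11)
Theta2 = np.arange(110, dtype=float).reshape(10, 11) * 2.0
Theta3 = np.arange(11, dtype=float).reshape(1, 11)

# thetaVec = [Theta1(:); Theta2(:); Theta3(:)]  (column-major, like Octave)
thetaVec = np.concatenate([T.flatten(order='F') for T in (Theta1, Theta2, Theta3)])

# reshape(thetaVec(1:110), 10, 11) etc.: recover the matrices from the vector
T1 = thetaVec[0:110].reshape(10, 11, order='F')
T2 = thetaVec[110:220].reshape(10, 11, order='F')
T3 = thetaVec[220:231].reshape(1, 11, order='F')
```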

7.5 Gradient Checking

      This gradient algorithm looks good and is very efficient, but we cannot be sure it is correct, so here I introduce another way to compute the gradient, which likewise follows from the definition of the gradient, as shown in Figure 4. In the figure, we pick an arbitrary point \Theta on the curve and draw the tangent line at that point; the slope of the tangent is the true gradient there. Then we pick one point on each side, \Theta -\varepsilon and \Theta +\varepsilon; the slope of the line through these two points is the difference of their vertical coordinates divided by the difference of their horizontal coordinates: \frac{J(\Theta +\varepsilon )-J(\Theta -\varepsilon )}{2\varepsilon }. When \varepsilon is small, this slope approximates the gradient at \Theta; usually taking \varepsilon around 10^{-4} is good enough.

                                                                                       Figure 4 Definition of the gradient

So when \theta \in R^{n} (\theta being the vector obtained by unrolling \Theta ^{(1)},\Theta ^{(2)},\Theta ^{(3)}), with \theta =\theta _{1},\theta _{2},...,\theta _{n}:

\frac{\partial }{\partial \theta _{1}}J(\theta )\approx \frac{J(\theta _{1}+\varepsilon ,\theta _{2},...,\theta _{n})-J(\theta _{1}-\varepsilon ,\theta _{2},...,\theta _{n})}{2\varepsilon }

...

\frac{\partial }{\partial \theta _{n}}J(\theta )\approx \frac{J(\theta _{1} ,\theta _{2},...,\theta _{n}+\varepsilon)-J(\theta _{1} ,\theta _{2},...,\theta _{n}-\varepsilon)}{2\varepsilon }

    One thing worth noting is that gradient checking is computationally expensive and not very efficient, so once we have verified that the backpropagation gradients are correct, be sure to turn gradient checking off; otherwise the whole algorithm will run very slowly.
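The two-sided approximation above can be sketched as a small checker in Python (the function name numerical_gradient is my own; in practice you would call it with your unrolled cost function):

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    """Approximate each dJ/dtheta_i by (J(theta+eps*e_i) - J(theta-eps*e_i)) / (2*eps)."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_plus[i] += eps
        theta_minus = theta.copy()
        theta_minus[i] -= eps
        grad[i] = (J(theta_plus) - J(theta_minus)) / (2 * eps)
    return grad

# Sanity check on a function with a known gradient: J = sum(theta^2), dJ/dtheta = 2*theta.
theta = np.array([1.0, -2.0, 0.5])
approx = numerical_gradient(lambda t: np.sum(t ** 2), theta)
```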

7.6 Random Initialization

       This is the last topic of the chapter. When we introduced optTheta = fminunc(@costFunction, initialTheta, options) earlier, we always initialized initialTheta to zero, i.e., initialTheta=zeros(n,1). Do we need to do the same here? First let us see what happens if we initialize to zero, using the neural network model shown in Figure 5.

                                                                    Figure 5 A 3-layer neural network

In this model, if we initialize \Theta _{ij}^{(l)}=0, then we get a_{1}^{(2)}=a_{2}^{(2)} and likewise \delta _{1}^{(2)}=\delta _{2}^{(2)}, so \frac{\partial }{\partial \Theta _{01}^{(1)}}J(\Theta )=\frac{\partial }{\partial \Theta _{02}^{(1)}}J(\Theta ), and after each update \Theta _{01}^{(1)}=\Theta _{02}^{(1)}. If this goes on, the two remain equal forever, which is not what we want. So here we randomize \Theta _{ij}^{(l)} so that each value falls within [-\varepsilon ,\varepsilon ]:

Theta1 = rand(10,11)*(2*INIT_EPSILON) - INIT_EPSILON;

Theta2 = rand(1,11)*(2*INIT_EPSILON) - INIT_EPSILON;

rand by itself gives results in (0,1); scaled and shifted like this, the values fall in (-INIT_EPSILON, INIT_EPSILON).
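The two Octave lines above translate directly to Python with numpy (the value of INIT_EPSILON is whatever small epsilon you choose; 0.12 here is just an example):

```python
import numpy as np

INIT_EPSILON = 0.12   # an example choice; any small epsilon works

# rand gives values in (0, 1); scale and shift into (-INIT_EPSILON, INIT_EPSILON)
Theta1 = np.random.rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON
Theta2 = np.random.rand(1, 11) * (2 * INIT_EPSILON) - INIT_EPSILON
```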

7.7 Summary

      I have covered a lot above, and your head may be spinning, so finally let me summarize the whole algorithm:

1) Randomly initialize the weights initialTheta

2) Use forward propagation on each x^{(i)} to obtain h_{\Theta }(x^{(i)})

3) Compute the cost function J(\Theta )

4) Use backpropagation to obtain the partial derivatives \frac{\partial }{\partial \Theta _{jk}^{(l)}}J(\Theta )

5) Use gradient checking to verify that the gradients just computed are correct, then turn the check off

6) Use gradient descent or a more advanced optimization algorithm to obtain the \Theta that minimizes J(\Theta ) (note that J(\Theta ) is not a convex function, so the \Theta we obtain may only be a local minimum)

 

 


Origin blog.csdn.net/qq_36417014/article/details/84025336