NetEase course DeepLearning.ai Andrew Ng deep learning course notes, Week 3 ④: Gradient descent for neural networks, random initialization

Gradient descent for neural networks

In this video, I will give you the equations you need to implement back-propagation, or gradient descent, for your neural network. In the next video, we will discuss why these particular equations are the correct ones.

Your neural network with a single hidden layer has parameters W[1], b[1], W[2], b[2], where n_x (= n[0]) denotes the number of input features, n[1] the number of hidden units, and n[2] the number of output units.

In our example, we consider only this case; the parameters are:

The matrix W[1] has dimension (n[1], n[0]); b[1] is an n[1]-dimensional vector, which can be written as (n[1], 1), i.e. a column vector. The matrix W[2] has dimension (n[2], n[1]), and b[2] has dimension (n[2], 1).
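As a quick sanity check of these dimensions, here is a minimal numpy sketch (the concrete sizes n_x = 3, n_h = 4, n_y = 1 are hypothetical, chosen only for illustration; zeros are used here purely to show shapes, since random initialization is discussed later):

```python
import numpy as np

# Hypothetical layer sizes: n[0] inputs, n[1] hidden units, n[2] outputs
n_x, n_h, n_y = 3, 4, 1

W1 = np.zeros((n_h, n_x))   # W[1]: (n[1], n[0])
b1 = np.zeros((n_h, 1))     # b[1]: (n[1], 1), a column vector
W2 = np.zeros((n_y, n_h))   # W[2]: (n[2], n[1])
b2 = np.zeros((n_y, 1))     # b[2]: (n[2], 1)

print(W1.shape, b1.shape, W2.shape, b2.shape)   # (4, 3) (4, 1) (1, 4) (1, 1)
```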

Your neural network also has a cost function. Assuming you are doing binary classification, the cost function is:

Cost function:

Formula: J(W[1], b[1], W[2], b[2]) = (1/m) Σ_{i=1}^{m} L(ŷ^(i), y^(i))

The loss function L here is exactly the same as the one used for logistic regression.
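As a concrete illustration, here is a minimal numpy sketch of this cost computation; the function name compute_cost and the variables A2 (the predictions ŷ, shape (1, m)) and Y (the labels, shape (1, m)) are my own choices, not notation from the course:

```python
import numpy as np

def compute_cost(A2, Y):
    """Cross-entropy cost averaged over the m examples (same loss as logistic regression)."""
    m = Y.shape[1]
    # L(yhat, y) = -(y*log(yhat) + (1 - y)*log(1 - yhat)), summed over examples, divided by m
    return -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m

# Tiny made-up example
A2 = np.array([[0.9, 0.2, 0.7]])   # predictions yhat
Y  = np.array([[1,   0,   1  ]])   # labels y
print(compute_cost(A2, Y))
```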

To train the parameters we need to run gradient descent. When training a neural network, it is important to initialize the parameters randomly rather than to all zeros. Once the parameters have been initialized to some values, each iteration of gradient descent computes the following predictions.

The forward-propagation equations are as follows (covered previously):

Forward propagation:
Z[1] = W[1]X + b[1]
A[1] = g[1](Z[1])
Z[2] = W[2]A[1] + b[2]
A[2] = g[2](Z[2]) = σ(Z[2])
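A minimal vectorized sketch of these four equations in numpy, assuming X has shape (n[0], m), tanh as the hidden activation g[1], and sigmoid as the output activation g[2]; the function and variable names are my own:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_propagation(X, W1, b1, W2, b2):
    Z1 = np.dot(W1, X) + b1      # Z[1] = W[1] X + b[1]
    A1 = np.tanh(Z1)             # A[1] = g[1](Z[1]), tanh hidden activation
    Z2 = np.dot(W2, A1) + b2     # Z[2] = W[2] A[1] + b[2]
    A2 = sigmoid(Z2)             # A[2] = g[2](Z[2]) = sigmoid(Z[2])
    return Z1, A1, Z2, A2
```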

Back-propagation equations are as follows:

Back propagation (Formula 3.35):
dZ[2] = A[2] − Y
dW[2] = (1/m) dZ[2] A[1]^T
db[2] = (1/m) np.sum(dZ[2], axis=1, keepdims=True)
dZ[1] = W[2]^T dZ[2] * g[1]'(Z[1])   (element-wise product)
dW[1] = (1/m) dZ[1] X^T
db[1] = (1/m) np.sum(dZ[1], axis=1, keepdims=True)

The above are the back-propagation steps. Note that these are vectorized over all the training examples, so Y is a 1 × m matrix; np.sum here is a Python numpy command, axis=1 means summing horizontally (across the columns), and keepdims prevents Python from outputting one of those odd rank-1 arrays of shape (n, ); adding it ensures that the output db[2] is a vector of dimension (n, 1), in the standard form.
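Here is a sketch of those six back-propagation equations in numpy, under the same assumptions as the forward-propagation sketch above (tanh hidden activation, so g[1]'(Z[1]) = 1 − A[1]²); the function name is my own:

```python
import numpy as np

def backward_propagation(X, Y, A1, A2, W2):
    m = X.shape[1]
    dZ2 = A2 - Y                                   # dZ[2] = A[2] - Y
    dW2 = np.dot(dZ2, A1.T) / m                    # dW[2] = (1/m) dZ[2] A[1]^T
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m   # db[2]
    dZ1 = np.dot(W2.T, dZ2) * (1 - A1 ** 2)        # dZ[1] = W[2]^T dZ[2] * g[1]'(Z[1])
    dW1 = np.dot(dZ1, X.T) / m                     # dW[1] = (1/m) dZ[1] X^T
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m   # db[1]
    return dW1, db1, dW2, db2
```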

So far, what we have computed is very similar to logistic regression. But when you start the back-propagation computation, you need to compute g[1]'(Z[1]), the derivative of the hidden-layer activation function (the output layer uses the sigmoid function for binary classification). The product here is element-wise, because W[2]^T dZ[2] and g[1]'(Z[1]) are both (n[1], m) matrices.

Another way to prevent Python from producing a rank-1 array is to explicitly call reshape on the output of np.sum to put it into matrix form.
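A small demonstration of the shape issue (the array values are arbitrary):

```python
import numpy as np

dZ2 = np.ones((1, 5))                            # pretend (n[2], m) = (1, 5)
print(np.sum(dZ2, axis=1).shape)                 # (1,)   -- the odd rank-1 array
print(np.sum(dZ2, axis=1, keepdims=True).shape)  # (1, 1) -- standard column form
print(np.sum(dZ2, axis=1).reshape(1, 1).shape)   # (1, 1) -- explicit reshape alternative
```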

These are the four forward-propagation equations and the six back-propagation equations. Here I have simply stated them; in the next video I will show how the six back-propagation equations are derived. To implement this algorithm you must carry out forward and back propagation correctly and be able to compute all the derivatives needed for gradient descent to learn the neural network's parameters; like many successful deep learning practitioners, you can implement the algorithm directly even before fully understanding those derivations.
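Putting the pieces together, here is a minimal training-loop sketch that reuses the forward_propagation, backward_propagation and compute_cost sketches above; the dataset, learning rate and iteration count are made-up illustration values, not anything from the course:

```python
import numpy as np

# Tiny made-up dataset: n[0] = 2 features, m = 4 examples
X = np.array([[0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])
Y = np.array([[0, 1, 1, 0]])

np.random.seed(1)
W1 = np.random.randn(2, 2) * 0.01; b1 = np.zeros((2, 1))
W2 = np.random.randn(1, 2) * 0.01; b2 = np.zeros((1, 1))

learning_rate = 0.1   # arbitrary choice
for i in range(1000):
    Z1, A1, Z2, A2 = forward_propagation(X, W1, b1, W2, b2)
    dW1, db1, dW2, db2 = backward_propagation(X, Y, A1, A2, W2)
    # Gradient-descent parameter update
    W1 -= learning_rate * dW1; b1 -= learning_rate * db1
    W2 -= learning_rate * dW2; b2 -= learning_rate * db2
    if i % 200 == 0:
        print(i, compute_cost(A2, Y))
```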

Random initialization

When you train a neural network, random weight initialization is very important. For logistic regression it is fine to initialize the weights to zero, but for a neural network, if you initialize the weight parameters to all zeros, gradient descent will not work.

Let's see why. Suppose there are two input features, n[0] = 2, and two hidden units, so n[1] = 2. Then the matrix associated with the hidden layer, W[1], is a 2 × 2 matrix. Assume this 2 × 2 matrix is initialized to all zeros, and b[1] is also [0 0]^T. Initializing the bias term b to 0 is actually reasonable, but initializing w to zero causes a problem. With this initialization you will find that a1[1] and a2[1] are always equal: these two activation units will be identical. Because the two hidden units compute exactly the same function, when you do the back-propagation computation, dz1[1] and dz2[1] will also be the same; the hidden units are initialized symmetrically and stay that way, so the outgoing weights will also be identical, and hence W[2] equals [0 0];

Figure 3.11.1: But if you initialize the neural network this way, these two hidden units are completely identical, hence perfectly symmetric, which means they compute the same function; and it is certain that after every training iteration the two hidden units still compute the same function, which is confusing. dW will be a matrix in which every row has the same values, so when we do the weight update W[1] := W[1] − α·dW, after each iteration the first row of W[1] equals the second row.

From this we can deduce that if you initialize all the weights to 0, then because the hidden units start out computing the same function, all the hidden units have the same influence on the output unit. After one iteration the same argument applies and the result is still identical, i.e. the hidden units are still symmetric. By induction, after two, three, or any number of iterations, no matter how long you train the network, the hidden units still compute the same function. In that case there is no point in having more than one hidden unit, because they all compute the same thing. The same holds, of course, for larger networks, say with 3 input features and many more hidden units.
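As a small numeric check of this symmetry argument, the sketch below (reusing the forward/backward sketches above, with made-up data) initializes everything to zero and verifies that the two rows of W[1] remain identical after every update:

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # made-up inputs, n[0] = 2, m = 3
Y = np.array([[0, 1, 1]])

W1 = np.zeros((2, 2)); b1 = np.zeros((2, 1))
W2 = np.zeros((1, 2)); b2 = np.zeros((1, 1))

for i in range(5):
    Z1, A1, Z2, A2 = forward_propagation(X, W1, b1, W2, b2)
    dW1, db1, dW2, db2 = backward_propagation(X, Y, A1, A2, W2)
    W1 -= 0.1 * dW1; b1 -= 0.1 * db1
    W2 -= 0.1 * dW2; b2 -= 0.1 * db2
    # Both rows of dW1 (and hence of W1) stay identical at every iteration
    print(i, W1[0], W1[1], np.allclose(W1[0], W1[1]))
```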

If you initialize to 0, then because all the hidden units are symmetric, they keep computing the same function no matter how long you run gradient descent. That is no help at all, because you want different hidden units to compute different functions. The solution is to initialize the parameters randomly. You should do the following: set W[1] to np.random.randn(2,2) (which samples from a Gaussian distribution), and usually multiply it by a small number such as 0.01, so the weights are initialized to very small random values. Then b does not have this symmetry problem (called the symmetry breaking problem), so b can be initialized to 0; as long as W is initialized randomly, the different hidden units compute different things, and the symmetry breaking problem goes away. Similarly, W[2] can be initialized randomly and b[2] can be initialized to 0.

W[1] = np.random.randn(2,2) * 0.01,

b[1]=np.zeros((2,1))  

W[2] = np.random.randn(1,2) * 0.01, b[2] = 0
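Generalizing these three lines, here is a minimal initialization sketch for arbitrary layer sizes; the function name initialize_parameters is my own:

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y):
    # Small random weights break the symmetry; biases can safely start at zero
    W1 = np.random.randn(n_h, n_x) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))
    return W1, b1, W2, b2

W1, b1, W2, b2 = initialize_parameters(2, 2, 1)   # the n[0]=2, n[1]=2, n[2]=1 example above
```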

You may wonder where this constant comes from, and why 0.01 rather than 100 or 1000. We usually prefer to initialize to very small random values, because if you use a tanh or sigmoid activation function, or have a sigmoid only in the output layer, and the values are too large, then when you compute the activations z[1] = W[1]x + b[1], a[1] = σ(z[1]) = g[1](z[1]), a large W makes z large. For some values of z, a will then be very large or very small, and you are likely to end up on the flat parts of the tanh/sigmoid curve (see Figure 3.8.2), where the gradient is very small; that means gradient descent will be very slow, and so learning will be very slow.

To recap: if w is large, you are likely to end up (even right at the start of training) with large values of z, which saturates the tanh/sigmoid activation function and leaves learning crawling along. If there are no sigmoid/tanh activation functions anywhere in your network, this is not a problem; but if you are doing binary classification and your output unit is a sigmoid, you don't want the initial parameters to be too large. That is why multiplying by 0.01 or some other small number is a reasonable thing to try. The same goes for W[2]: it would be np.random.randn(1,2), multiplied, I would guess, by 0.01.
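A small sketch of the saturation effect: the derivative of tanh is 1 − tanh(z)², which is close to 1 for small z but collapses toward 0 when |z| is large; the weight scales 0.01 and 100 below are illustration values only:

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(2, 1)                 # a made-up input with 2 features

for scale in (0.01, 100.0):
    W1 = np.random.randn(2, 2) * scale    # small vs. huge initial weights
    z1 = np.dot(W1, x)                    # z[1] = W[1] x (bias omitted, it starts at 0)
    grad = 1 - np.tanh(z1) ** 2           # tanh'(z): near 1 for small z, near 0 when saturated
    print(scale, grad.ravel())
```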

In fact, there are sometimes better constants than 0.01. When you train a network with only one hidden layer (a relatively shallow network without many hidden layers), setting it to 0.01 is probably fine. But when you train a very, very deep neural network, you may want to choose a constant other than 0.01. In the next lesson we will discuss how and when to choose a different constant, but in any case it will usually be a relatively small number.

That's it for this week's videos. You now know how to set up a neural network with one hidden layer: initialize the parameters, make predictions with forward propagation, compute the derivatives, and use back propagation together with gradient descent.

Origin: blog.csdn.net/qq_36552489/article/details/93780788