Improving Deep Neural Networks

1.10 Vanishing/Exploding Gradients

This article explains it particularly well:
详解机器学习中的梯度消失、爆炸原因及其解决方法 (a detailed explanation of the causes of vanishing/exploding gradients in machine learning and how to address them)

One of the problems of training neural networks, especially very deep neural networks, is vanishing and exploding gradients. What that means is that when you're training a very deep network, your derivatives or your slopes can sometimes get either very, very big or very, very small, maybe even exponentially small, and this makes training difficult.

Assume the activation function is g(z) = z, i.e., the activation is linear, and there is no bias term b. Then:

$$
\begin{aligned}
z^{[1]} &= W^{[1]} x \\
a^{[1]} &= z^{[1]} \\
z^{[2]} &= W^{[2]} a^{[1]} = W^{[2]} W^{[1]} x \\
&\;\vdots \\
\hat{y} &= W^{[L]} W^{[L-1]} W^{[L-2]} \cdots W^{[2]} W^{[1]} x
\end{aligned}
$$

  • The weights W, if they're all just a little bit bigger than one, or just a little bit bigger than the identity matrix, then with a very deep network the activations can explode.
  • And if W is just a little bit less than the identity, say 0.9 on the diagonal, then with a very deep network the activations will decrease exponentially.
  • And even though I went through this argument in terms of activations increasing or decreasing exponentially as a function of L, a similar argument can be used to show that the derivatives or the gradients the computer is going to compute will also increase exponentially or decrease exponentially as a function of the number of layers (a quick numerical sketch follows this list).
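
A minimal numerical sketch of that argument (my own illustration, not from the lecture; the depth of 50 layers, the 4-unit width, and the 1.1 / 0.9 scale factors are assumptions): with linear activations, multiplying by a matrix slightly larger or smaller than the identity at every layer makes the activation norm grow or shrink exponentially with depth.

```python
# Minimal sketch: a deep "network" with linear activations g(z) = z and no
# bias, so a[L] = W[L] ... W[1] x. Depth, width, and scale factors are
# illustrative assumptions, not values from the lecture.
import numpy as np

L = 50                       # number of layers
x = np.ones((4, 1))          # toy input with 4 features

for scale in (1.1, 0.9):     # a bit above vs. a bit below the identity
    W = scale * np.eye(4)    # every layer reuses W = scale * I
    a = x
    for _ in range(L):
        a = W @ a            # linear activation: a[l] = W[l] a[l-1]
    print(f"scale={scale}: ||a|| = {np.linalg.norm(a):.3g}")

# scale=1.1 grows by a factor of 1.1**50 (about 117x): the activations explode.
# scale=0.9 shrinks by a factor of 0.9**50 (about 0.005x): they vanish.
```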

1.11 Weight initialization for deep networks


It turns out that a partial solution to vanishing/exploding gradients, which doesn’t solve it entirely but helps a lot, is a better or more careful choice of the random initialization for your neural network.

So in order to make z not blow up and not become too small, you notice that the larger n is, the smaller you want $w_i$ to be, right? Because z is the sum of the $w_i x_i$, and so if you’re adding up a lot of these terms you want each of these terms to be smaller. One reasonable thing to do would be to set the variance of $w_i$ to be equal to $\frac{1}{n}$, where n is the number of input features that’s going into a neuron.
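
To make that reasoning concrete (a short derivation I’m adding here, assuming the $w_i$ and $x_i$ are independent and zero-mean):

$$
z = \sum_{i=1}^{n} w_i x_i
\quad\Longrightarrow\quad
\mathrm{Var}(z) = n \cdot \mathrm{Var}(w_i) \cdot \mathrm{Var}(x_i),
$$

so choosing $\mathrm{Var}(w_i) = \frac{1}{n}$ keeps $\mathrm{Var}(z) \approx \mathrm{Var}(x_i)$, no matter how many inputs n the neuron has.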

$$
\text{ReLU:}\quad \mathrm{Var}(W) = \frac{2}{n^{[l-1]}},\qquad
W^{[l]} = \text{np.random.randn(shape)} \times \sqrt{\frac{2}{n^{[l-1]}}}
$$

$$
\tanh:\quad \mathrm{Var}(W) = \frac{1}{n^{[l-1]}}\quad\text{(Xavier initialization)}
$$

where $n^{[l-1]}$ denotes the number of units in layer $l-1$.
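
A short NumPy sketch of the two rules above (the helper name `initialize_weights` and the layer sizes are my own, added for illustration):

```python
# Sketch of He (ReLU) and Xavier (tanh) initialization following the formulas
# above; the helper name and layer sizes are illustrative assumptions.
import numpy as np

def initialize_weights(layer_dims, activation="relu"):
    """layer_dims = [n[0], n[1], ..., n[L]], the number of units per layer."""
    params = {}
    for l in range(1, len(layer_dims)):
        fan_in = layer_dims[l - 1]              # n[l-1]
        if activation == "relu":
            scale = np.sqrt(2.0 / fan_in)       # var(W) = 2 / n[l-1] (He)
        else:
            scale = np.sqrt(1.0 / fan_in)       # var(W) = 1 / n[l-1] (Xavier, tanh)
        params["W" + str(l)] = np.random.randn(layer_dims[l], fan_in) * scale
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params

params = initialize_weights([784, 256, 10], activation="relu")
print(params["W1"].std())   # roughly sqrt(2/784), i.e. about 0.05
```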

So in practice, what you can do is set the weight matrix W for a certain layer to be np.random.randn(shape), with whatever the shape of the matrix is for that layer, and then times the square root of 1 over the number of features fed into each neuron, which is $n^{[l-1]}$, because that’s the number of units feeding into each of the units in layer l.

So if the input features or activations are roughly mean 0 and variance 1, then this would cause z to also take on a similar scale. This doesn’t solve the problem entirely, but it definitely helps reduce the vanishing/exploding gradients problem, because it tries to set each of the weight matrices W so that it’s not too much bigger than 1 and not too much less than 1, so it doesn’t explode or vanish too quickly. I’ve just mentioned some other variants.
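
A quick check of that claim (my own illustration; the width of 256 units, the depth of 30 ReLU layers, and the 0.01 comparison scale are assumptions): with inputs of roughly zero mean and unit variance, weights drawn with variance $2/n^{[l-1]}$ keep the activations on the same scale across many layers, while an arbitrary fixed scale makes them shrink exponentially.

```python
# Compare He-scaled initialization against an arbitrary fixed scale across
# 30 ReLU layers. Width, depth, and the 0.01 scale are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, depth = 256, 30
a_he = rng.standard_normal((n, 1000))    # 1000 toy examples, n features each
a_fixed = a_he.copy()

for _ in range(depth):
    W_he = rng.standard_normal((n, n)) * np.sqrt(2.0 / n)   # var(W) = 2/n[l-1]
    W_fixed = rng.standard_normal((n, n)) * 0.01            # arbitrary fixed scale
    a_he = np.maximum(0, W_he @ a_he)                       # ReLU layer
    a_fixed = np.maximum(0, W_fixed @ a_fixed)

print(np.sqrt((a_he ** 2).mean()))      # stays on the order of 1
print(np.sqrt((a_fixed ** 2).mean()))   # shrinks toward 0 as depth grows
```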


Reposted from blog.csdn.net/XindiOntheWay/article/details/82287312