Vanishing Gradients and ReLU

Two aspects: (1) when the network is deep, the chain rule multiplies many factors together (each typically less than 1), which produces vanishing gradients; (2) the derivatives of the sigmoid and tanh functions tend toward 0, which also leads to vanishing gradients. Of course, weight initialization also has a large influence.
This is the fourth entry in my journey to expand my knowledge of artificial intelligence in 2016. You can find more about my motivation in this introductory article.
This post assumes a solid knowledge of neural networks and the backpropagation algorithm. If that doesn't describe your current understanding, don't worry; you can catch up here.

It's summertime, and you've recently read my post on the backpropagation algorithm. Excited to hack together your own implementation, you create a deep, multi-layer neural network and start running your program. But to your dismay, it either takes forever to train or doesn't perform accurately. You thought neural networks were state of the art, didn't you? Why isn't this working‽

I was, in fact, stuck on this exact problem for a while.

Here's what we know about iterative optimization algorithms: they slowly make their way to a local optimum by perturbing the weights in a direction inferred from the gradient, such that the output of the cost function decreases. The gradient descent algorithm, specifically, updates the weights by the negative of the gradient multiplied by some small (between 0 and 1) scalar value.

 

As you can see, we "repeat" until convergence. In reality, though, we actually set a hyperparameter for the maximum number of iterations. If the number of iterations is too small for a given deep neural net, we will get inaccurate results; if it is too large, the training duration becomes infeasibly long. It's an uneasy tradeoff between training time and accuracy.

So why is this the case? Simply put, if the gradient at each step is too small, the weights don't change enough at each iteration, so more iterations are needed before convergence. Alternatively, for a set number of iterations, the weights won't move as close to the minimum as they would with a larger gradient. With really small gradients, this becomes a real problem: training the neural network becomes infeasible, and it starts making bad predictions.
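To make that loop concrete, here is a minimal gradient descent sketch in Python with a maximum-iteration hyperparameter; the one-dimensional quadratic cost and its gradient are made-up stand-ins for illustration, not anything from the original post.

```python
# Minimal gradient descent sketch (hypothetical 1-D cost J(w) = (w - 3)^2).
def grad_J(w):
    return 2.0 * (w - 3.0)       # derivative of the made-up cost

w = 0.0                          # initial weight
alpha = 0.1                      # small scalar between 0 and 1 (learning rate)
max_iterations = 1000            # the hyperparameter discussed above

for i in range(max_iterations):
    g = grad_J(w)
    w = w - alpha * g            # update by the negative gradient times alpha
    if abs(g) < 1e-8:            # "convergence" check
        break

print(w)                         # close to the minimum at w = 3
```

If `grad_J` returned values close to zero everywhere, the loop would exhaust `max_iterations` long before `w` got anywhere near the minimum, which is exactly the tradeoff described above.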

You'd end up with an elongated cost function like the following:

 
An elongated cost function: a visual representation of the many extra updates needed when the gradients are flat

Compare its shape, for example, to the more optimal one below:

 

Because the latter has larger gradients, gradient descent can converge much more quickly.

This isn’t the only issue. When gradients become too small, it also becomes difficult to represent the numerical value in computers. We call this underflow — the opposite of overflow.
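As a quick illustration of underflow with ordinary double-precision floats (my own toy example, not from the original post):

```python
# Two perfectly representable numbers whose true product is too small to represent:
tiny = 1e-200
print(tiny * tiny)      # prints 0.0 -- the exact value 1e-400 underflows
print(1e200 * 1e200)    # prints inf -- the opposite problem, overflow
```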

Okay. Small gradients = bad news, got it. The question, then, is: does this problem exist? In many cases it indeed does, and we call it the vanishing gradient problem.

Recall the sigmoid function, one that was almost always used as an activation function for ANNs in a classification context:

 
 

The sigmoid function is useful because it “squeezes” any input value into an output range of (0, 1) (where it asymptotes). This is perfect for representing probabilities and for classification. The sigmoid function, along with the tanh function, has nonetheless lost popularity in recent years. Why? Because they suffer from the vanishing gradient problem!

Let's take a look at the derivative of the sigmoid function. I've pre-computed it for you:

 
In other words, simply S(1 − S).

Now, let's graph it:

Looks decent, you say. Look closer. The maximum of the function is 1/4, and the function asymptotes horizontally to 0, so the output of the sigmoid's derivative is always between 0 and 1/4. In mathematical terms, its range is (0, 1/4]. Keep this in mind.
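If you want to verify that range numerically, here is a small NumPy check (my own sketch, not part of the original post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # the S(1 - S) form given above

z = np.linspace(-10.0, 10.0, 10001)
print(sigmoid_prime(z).max())     # 0.25, reached at z = 0
print(sigmoid_prime(z).min())     # ~4.5e-05 here, approaching 0 in the tails
```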

Now, let's move on to the structure of neural networks and backpropagation, and what that structure implies for the magnitude of the gradient.

 

Recall this general structure for a simple, univariate neural network. Each neuron or “activity” is derived from the previous: it is the previous activity multiplied by some weight and then fed through an activation function. The input, of course, is the notable exception. The error box J at the end returns the aggregate error of our system. We then perform backpropagation to modify the weights through gradient descent such that the output of J is minimized.

To calculate the derivative with respect to the first weight, we use the chain rule to “backpropagate” like so:

 
We then use these derivatives to iteratively make our way to the minimum point using gradient descent.
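To make the chain rule concrete, here is a hand-rolled sketch of that backward pass for the little univariate network above. The specific numbers, the weight names (w1, w2, w3), and the squared-error cost are my own assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, target = 0.5, 1.0
w1, w2, w3 = 0.1, 0.2, 0.3        # hypothetical small weights

# Forward pass: input -> hidden1 -> hidden2 -> output -> error J
h1  = sigmoid(w1 * x)
h2  = sigmoid(w2 * h1)
out = sigmoid(w3 * h2)
J   = 0.5 * (out - target) ** 2

# Backward pass: each factor contains a sigmoid derivative (at most 1/4) and a weight
dJ_dout  = out - target
dout_dh2 = out * (1.0 - out) * w3
dh2_dh1  = h2 * (1.0 - h2) * w2
dh1_dw1  = h1 * (1.0 - h1) * x

dJ_dw1 = dJ_dout * dout_dh2 * dh2_dh1 * dh1_dw1
print(dJ_dw1)   # already tiny after only two hidden activations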

Let’s focus on these individual derivatives:

 

With regard to the first derivative: since the output is the activation of the 2nd hidden unit, and we are using the sigmoid function as our activation function, the derivative of the output will contain the derivative of the sigmoid function. Specifically, the resulting expression will be:

 
From the output to hidden2

The same applies for the second:

 
From hidden2 to hidden1

In both cases, the derivative contains the derivative of the sigmoid function. Now, let’s put those together:

 

Recall that the derivative of the sigmoid function outputs values between 0 and 1/4. By multiplying these two derivatives together, we are multiplying two values in the range (0, 1/4]. Any two numbers between 0 and 1 multiplied with each other will simply result in a smaller value. For example, 1/3 × 1/3 is 1/9.

The standard approach to weight initialization in a typical neural network is to choose weights using a Gaussian with mean 0 and standard deviation 1. Hence, the weights in a neural network will also usually be between -1 and 1.
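In NumPy, that standard scheme would look something like the following (the layer sizes are made up for illustration):

```python
import numpy as np

n_in, n_out = 4, 3                                             # hypothetical layer sizes
W = np.random.normal(loc=0.0, scale=1.0, size=(n_out, n_in))   # mean 0, std dev 1
print(W)
# Roughly 68% of the entries fall between -1 and 1 (within one standard deviation).
```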

Now, look at the magnitude of the terms in our expression:

 

At this point, we are multiplying four values, each of which lies between 0 and 1; the product becomes small very fast. And even if this weight initialization technique is not employed, the vanishing gradient problem will most likely still occur: enough of these sigmoid derivatives multiplied together will be small enough to counteract any larger weights, and during training the weights may well drift into a range below 1 anyway.

This neural network isn't that deep. But imagine a deeper one used in an industrial application. As we backpropagate further back, we'd have many more small numbers partaking in the product, creating an even tinier gradient! Thus, with deep neural nets, the vanishing gradient problem becomes a major concern. Sidenote: we can also get exploding gradients, if the individual factors in the product happen to be bigger than 1 (a bunch of numbers bigger than 1 multiplied together gives a huge result!). Those aren't good either: the updates wildly overshoot.
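Here is a rough simulation of that effect (entirely my own toy setup): multiply one sigmoid-derivative-times-weight factor per layer and watch the magnitude collapse as depth grows.

```python
import numpy as np

rng = np.random.default_rng(0)

for depth in (2, 5, 10, 20, 50):
    factor = 1.0
    for _ in range(depth):
        weight = rng.normal(0.0, 1.0)   # standard Gaussian initialization, as above
        sigmoid_deriv = 0.25            # the best case; real values are usually smaller
        factor *= weight * sigmoid_deriv
    print(depth, abs(factor))
# The product shrinks toward zero as the network gets deeper. If every |weight|
# were larger than 4, the product would instead blow up: the exploding gradient case.
```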

Now, let’s look at a typical ANN:

 

As you can see, the first layer is the furthest from the error, so its derivative (via the chain rule) is a longer expression and hence contains more sigmoid derivatives, ending up smaller. Because of this, the first layers are the slowest to train. But there's another issue: since the latter layers (and most notably the output) are functionally dependent on the earlier layers, inaccurate early layers cause the latter layers to simply build on that inaccuracy, corrupting the entire neural net. Take convolutional neural nets as an example: their early layers perform the more basic feature detection (edges and simple shapes) so that the latter layers can analyze those features further and make a decision. Also, because of the small steps, gradient descent may get stuck in a poor local minimum.

This is why neural networks (especially deep ones), at first, failed to become popular. Training the earlier layers correctly was the basis for the entire network, but that proved too difficult and infeasible because of the commonly used activation functions and available hardware.

How do we solve this? Well, it's pretty clear that the root cause is the nature of the sigmoid activation function's derivative, and the same issue afflicts the most popular alternative, the tanh function. Until recently, not many other activation functions were considered or used. But now, the sigmoid and tanh functions have been declining in popularity in light of the ReLU activation function.

What is the ReLU — Rectified Linear Unit — function anyways? It is a piecewise function that corresponds to:

 

Another way of writing the ReLU function is like so:

 

In other words, when the input is smaller than zero, the function will output zero. Else, the function will mimic the identity function. It’s very fast to compute the ReLU function.

 
The “Rectified Linear Unit” activation function: http://i.stack.imgur.com/8CGlM.png

It doesn't take a genius to calculate the derivative of this function. When the input is smaller than zero, the output is always zero, so the rate of change of the function is zero. When the input is greater than or equal to zero, the output is simply the input, and hence the derivative is equal to one:

 
The function is not differentiable at zero, though

If we were to graph this derivative, it would look exactly like a typical step function:

 
Hey… it’s a step function!
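Here is a minimal NumPy version of both the function and the derivative just described (following the text's convention of treating the derivative at exactly zero as 1):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)             # 0 for negative inputs, identity otherwise

def relu_prime(x):
    return np.where(x < 0.0, 0.0, 1.0)    # the step function from the graph above

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))         # [0.   0.   0.   0.5  2. ]
print(relu_prime(x))   # [0.   0.   1.   1.   1. ]
```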

So, it's solved! Our derivatives will no longer vanish, because the activation function's derivative is no longer squashed into a range below 1 (like the sigmoid's (0, 1/4]): it is exactly 1 whenever the input is positive.

ReLUs have one caveat, though: they “die” (output zero) when their input is negative. In many cases this completely blocks backpropagation, because the gradients are simply zero once a negative value has been fed into the ReLU. This can happen if a large negative bias term is learned: the weighted sum fed into the neuron ends up negative because the positive weights cannot compensate for the magnitude of the bias. Negative weights, or negative inputs (or some combination that yields a negative weighted sum), can cause the same thing. A dead ReLU will then output the same value, zero, for almost all of your activities. ReLUs cannot “recover” from this, because they no longer modify their weights in any way: for any negative input, not only is the output zero, the derivative is too. No updates will ever be made to, say, shrink the negative bias so that the network could escape this corruption. The weighted sum doesn't end up negative all that often, though; and if we are concerned about this happening, we can initialize the weights to be only positive and/or normalize the input to lie between 0 and 1.

EDIT: As it turns out, there are some extremely useful properties of gradients dying off. In particular, the idea of “sparsity”. So, many times, backprop being blocked can actually be an advantage. More on that in this StackOverflow answer:

A “leaky” ReLU solves this problem. Leaky Rectified Linear Units are ones that have a very small gradient instead of a zero gradient when the input is negative, giving the chance for the net to continue its learning.

 

EDIT: Looks like values in the range of 0.2–0.3 are more common than something like 0.01.

Instead of outputting zero when the input is negative, the function outputs a very flat line with gradient ε. A common value for ε is 0.01. The resulting function is shown in the following diagram:

 

As you can see, the learning signal for negative inputs is small, but it exists nonetheless. In this sense, leaky ReLUs do not die.
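A sketch of the leaky variant, with the small slope as a parameter (0.01 here, though, as noted above, some people reportedly prefer 0.2 to 0.3):

```python
import numpy as np

def leaky_relu(x, eps=0.01):
    return np.where(x < 0.0, eps * x, x)      # a very flat line instead of zero

def leaky_relu_prime(x, eps=0.01):
    return np.where(x < 0.0, eps, 1.0)        # gradient is eps, never exactly zero

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(leaky_relu(x))         # [-0.03 -0.01  0.    1.    3.  ]
print(leaky_relu_prime(x))   # [0.01  0.01  1.    1.    1.  ]
```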

However, ReLUs and leaky ReLUs aren't necessarily always optimal; results with them have been inconsistent (perhaps because the small constant can, in certain cases, reintroduce vanishing gradients, although, again, dead units don't happen all that often). Another notable issue is that, because the output of the ReLU isn't bounded between 0 and 1 or -1 and 1 the way tanh and sigmoid are, the activations (the values in the neurons, not the gradients) can in fact explode in extremely deep networks such as recurrent neural networks. During training, the whole network becomes fragile and unstable: if you update the weights in the wrong direction even slightly, the activations can blow up. Finally, even though the ReLU derivative is either 0 or 1, the overall derivative expression still contains the weights as factors; since the weights are generally initialized to be < 1, this too can contribute to vanishing gradients.

So, overall, it's not a black and white problem. ReLUs still face the vanishing gradient problem; they just tend to face it to a lesser degree. In my opinion, the vanishing gradient problem isn't binary: you can't solve it with one technique, but you can delay its impact, in the sense of how deep we can go before we start suffering again.

You may also be wondering: ReLUs don't squeeze values into a probability, so why are they so commonly used? One can simply stick a sigmoid/logistic activation at the end of the neural net for a binary classification scenario. For multiple outputs, one can use the softmax function to create a probability distribution that sums to 1.
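For example (a sketch with made-up logits): a final sigmoid gives a single probability for a binary decision, and a softmax turns several raw outputs into a distribution that sums to 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - np.max(z)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

print(sigmoid(0.7))                          # single probability, binary case
print(softmax(np.array([2.0, 1.0, 0.1])))    # class probabilities summing to 1
```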

Conclusion

This article was a joy to write because it was my first step into the intricacies and practice of Machine Learning, rather than just looking at theoretical algorithms and calculations. The vanishing gradient problem was a major obstacle for the success of deep learning, but now that we’ve overcome it through multiple different techniques in weight initialization (which I talked less about today), feature preparation (through batch normalization — centering all input feature values to zero), and activation functions, the field is going to thrive — and we’ve already seen it with recent innovations in game AI (AlphaGo, of course!). Here’s to the future of multi-layered neural networks! 


転載: www.cnblogs.com/heixialee/p/11099383.html