1.5 Why Regularization Helps Prevent Overfitting - Deep Learning Course "Improving Deep Neural Networks", Andrew Ng

Why does regularization help prevent overfitting? (Why regularization reduces overfitting)

Why does regularization help prevent overfitting? Why does it reduce variance? Let's look at a couple of examples to build some intuition.


The left picture shows high bias, the right shows high variance, and the middle one is "just right"; we saw these pictures in earlier lessons.


Now look at this large, deep, overfitting neural network. I know the drawing is not big or deep enough, but you can imagine it is an overfitting network. Our cost function is $J$, with parameters $w$ and $b$. We add the regularization term, which keeps the weight matrices from becoming too large; this is the Frobenius norm. Why does shrinking the $L2$ norm, that is, the Frobenius norm of the parameters, reduce overfitting?
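
For reference, the L2-regularized cost function from the course has this standard form (a sketch; the layer-size notation follows the lectures):

$$
J\big(w^{[1]}, b^{[1]}, \dots, w^{[L]}, b^{[L]}\big)
= \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big)
+ \frac{\lambda}{2m}\sum_{l=1}^{L} \big\lVert w^{[l]} \big\rVert_F^2,
\qquad
\big\lVert w^{[l]} \big\rVert_F^2 = \sum_{i=1}^{n^{[l]}}\sum_{j=1}^{n^{[l-1]}} \big(w_{ij}^{[l]}\big)^2
$$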

One intuition is that if the regularization parameter $\lambda$ is set large enough, the weight matrices $w$ are pushed toward values close to zero. Intuitively, the weights of many hidden units are set close to zero, which essentially eliminates the influence of those hidden units. If that were the case, the greatly simplified neural network would become a small network, as small as a logistic regression unit, yet still very deep, and this would push the network from the overfitting (high-variance) state on the right toward the high-bias state on the left.

But there is an intermediate value of $\lambda$ that gives a state close to "just right" in the middle.

The intuition, then, is that as $\lambda$ grows large enough, $w$ moves close to zero. In practice that does not literally happen; rather, we eliminate, or at least reduce, the influence of many hidden units, and the network becomes simpler and closer to logistic regression. Our intuition says that a large number of hidden units are completely eliminated; in fact, all the hidden units of the network are still there, but their influence becomes much smaller. The network becomes simpler, and so it seems less prone to overfitting. I'm not sure how useful this intuition is as an explanation, but when you implement regularization in the programming exercise you will actually see some reduction in variance.
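
One concrete way to see why a larger $\lambda$ shrinks $w$: with the L2 term added, the gradient descent update becomes a "weight decay" step (a standard result, written here with learning rate $\alpha$):

$$
w^{[l]} := w^{[l]} - \alpha\left(\big(\text{term from backprop}\big) + \frac{\lambda}{m} w^{[l]}\right)
= \left(1 - \frac{\alpha\lambda}{m}\right) w^{[l]} - \alpha\big(\text{term from backprop}\big)
$$

Since $1 - \alpha\lambda/m < 1$, every update multiplies the weights by a factor slightly less than one, and a larger $\lambda$ shrinks them more aggressively.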


Here is another way to get a feel for why regularization prevents overfitting. Suppose we use the tanh (hyperbolic tangent) activation function.


Using $g(z)$ to denote $\tanh(z)$, notice that as long as $z$ stays small, we are only using the roughly linear region of the hyperbolic tangent; only when $z$ is allowed to wander to larger or smaller values does the activation function become noticeably nonlinear.


Now, setting that earlier intuition aside: if the regularization parameter $\lambda$ is large, the parameters will be relatively small, because large parameters are penalized in the cost function.


If $w$ is small then, since $z^{[l]} = w^{[l]} a^{[l-1]} + b^{[l]}$, $z$ will be relatively small too.


In particular, if $z$ ends up taking values only in this small range, then $g(z)$ is roughly linear, so every layer is approximately linear, much like linear regression.
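
A quick numerical check of that near-linearity (a hypothetical NumPy sketch, not part of the course assignments):

```python
import numpy as np

z_small = np.linspace(-0.1, 0.1, 5)   # small z, as when regularization keeps w small
z_large = np.linspace(-3.0, 3.0, 5)   # large z, where tanh saturates

# In the small-z regime, tanh(z) is almost exactly z (linear behaviour).
print(np.max(np.abs(np.tanh(z_small) - z_small)))   # about 3e-4
# In the large-z regime, tanh(z) is far from z (strongly nonlinear).
print(np.max(np.abs(np.tanh(z_large) - z_large)))   # about 2.0
```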


As we discussed in the first course, if every layer is linear, then the whole network is a linear network; even a very deep network with linear activations can only compute a linear function. So it cannot fit the very complicated decisions and highly nonlinear decision boundaries that would let it overfit the data set, as in the high-variance, overfitting picture we saw on the slide.
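
To see why an all-linear network collapses like this, here is a small hypothetical NumPy sketch showing that stacking linear layers is equivalent to a single linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # layer-1 weights
W2 = rng.standard_normal((2, 4))   # layer-2 weights
x = rng.standard_normal((3, 1))    # an input column vector

# Two layers with linear activations (no tanh in between)...
deep_output = W2 @ (W1 @ x)
# ...compute exactly the same function as one layer with weights W2 @ W1.
single_output = (W2 @ W1) @ x

print(np.allclose(deep_output, single_output))   # True
```

The same argument applies with any number of layers: the product of the weight matrices is still just one matrix, so the network cannot express a nonlinear decision boundary.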


To summarize: if the regularization parameter becomes large, the parameters $w$ become small, so $z$ will be relatively small (ignoring the effect of $b$ here). When $z$ is restricted to a small range of values, the activation function, i.e. the $\tanh$ curve, is roughly linear, and the whole network computes something not far from a linear function. A linear function is very simple, far from a highly complicated nonlinear function, so overfitting does not happen.

You will see these results for yourself when you implement regularization in the programming assignment. Before wrapping up regularization, I want to give you one implementation tip: when adding the regularization term, we modified the definition of the cost function $J$ by adding an extra term aimed at preventing the weights from becoming too large.


If you are using gradient descent, one way to debug it is to plot the cost function $J$ as a function of the number of gradient descent iterations; you should see that $J$ decreases monotonically after every step of gradient descent. If you implement regularization, keep in mind that $J$ now has a new definition. If you plot the old definition of $J$, that is, only the first term without the regularization term, you may not see a monotonic decrease. So to debug gradient descent, be sure to plot the newly defined $J$, which includes the second (regularization) term; otherwise $J$ may not decrease monotonically over the whole range of iterations.
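
A minimal sketch of that in code, assuming NumPy and a dictionary of weight matrices named `parameters` (the function name here is just illustrative, not the assignment's exact API):

```python
import numpy as np

def cost_with_regularization(cross_entropy_cost, parameters, lambd, m):
    """Add the L2 (Frobenius-norm) penalty to the unregularized cost.

    cross_entropy_cost -- the original cost, (1/m) * sum of the losses
    parameters         -- dict such as {"W1": ..., "b1": ..., "W2": ...}
    lambd              -- the regularization parameter lambda
    m                  -- number of training examples
    """
    l2_penalty = sum(np.sum(np.square(W))
                     for name, W in parameters.items() if name.startswith("W"))
    return cross_entropy_cost + (lambd / (2 * m)) * l2_penalty

# When plotting cost versus iteration number to debug gradient descent,
# plot this regularized cost; the unregularized term alone may not
# decrease monotonically.
```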

That is $L2$ regularization, the regularization method I use most often when training deep learning models. In deep learning there is another widely used regularization technique, called dropout regularization, which we will cover in the next lesson.

Course PPT

[Course slides]


Source: blog.csdn.net/weixin_36815313/article/details/105389842