Why regularization reduces overfitting
Why does regularization help prevent overfitting? Why does it reduce variance? Let's look at two examples to build some intuition.
Recall the pictures we saw in previous lessons: the one on the left shows high bias, the one on the right shows high variance, and the middle one is "just right".
Now consider this large, deep neural network that is overfitting. I know this picture is not big enough or deep enough, but you can imagine it is an overfitting neural network. This is our cost function J(W, b), with parameters W and b. We add the regularization term (λ/2m) Σ_l ||W^[l]||²_F, which penalizes the weight matrices for becoming too large; ||·||_F is the Frobenius norm. Why does shrinking the Frobenius norm of the weights reduce overfitting?
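As a minimal sketch of the cost function just described (the helper name and the numbers are made up for illustration), the regularized cost is the cross-entropy cost plus λ/(2m) times the sum of the squared Frobenius norms of all the weight matrices:

```python
import numpy as np

def l2_regularized_cost(cross_entropy_cost, weights, lambd, m):
    """Add the L2 (squared Frobenius norm) penalty to an unregularized cost.

    weights: list of weight matrices W[l], one per layer
    lambd:   regularization strength lambda
    m:       number of training examples
    """
    # sum of ||W[l]||_F^2 over all layers, scaled by lambda / (2m)
    l2_penalty = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy_cost + l2_penalty

# Tiny illustration with made-up numbers
W1 = np.array([[1.0, -2.0], [0.5, 0.0]])
W2 = np.array([[0.3, 0.4]])
cost = l2_regularized_cost(0.8, [W1, W2], lambd=0.7, m=10)
```

Note that only the weight matrices W enter the penalty; the bias terms b are conventionally left unregularized.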
One intuition is this: if the regularization parameter λ is set large enough, the weight matrices W^[l] are driven to values close to zero. You can think of this as setting the weights of many hidden units close to zero, largely eliminating the influence of those hidden units. If that were the case, the network would be greatly simplified into a much smaller network, as small as a logistic regression unit, though still just as deep, and this would move the network from the overfitting, high-variance state on the right toward the high-bias state on the left.
But there will be some intermediate value of λ that gives a "just right" state in between.
So the intuition is that as λ increases, W approaches zero. In practice this is not quite what happens: we are not actually eliminating hidden units completely. All the hidden units of the neural network still exist, but their influence becomes much smaller. The network becomes simpler, and a simpler network is seemingly less prone to overfitting. I'm not sure how reliable this intuition is, but when you implement regularization in the programming exercises, you will actually see some reduction in variance.
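To see why a large λ drives the weights toward zero, note that the L2 term adds (λ/m)·W to the gradient, so each gradient-descent update multiplies W by a factor (1 − αλ/m) before taking the usual data-driven step. A minimal sketch with made-up numbers (the data gradient is set to zero here purely to isolate the decay effect):

```python
import numpy as np

alpha, m = 0.1, 100
W = np.array([[2.0, -1.5]])
dW_data = np.zeros_like(W)  # pretend the data gradient is zero to isolate decay

for lambd in (0.0, 50.0):
    W_t = W.copy()
    for _ in range(100):
        # L2 regularization adds (lambd / m) * W to the gradient
        W_t = W_t - alpha * (dW_data + (lambd / m) * W_t)
    print(lambd, np.abs(W_t).max())
```

With λ = 0 the weights are untouched; with λ = 50 each step shrinks them by the factor 1 − 0.1·0.5 = 0.95, so after 100 steps they are close to zero. This multiplicative shrinkage is why L2 regularization is also called "weight decay".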
Here is a second intuition for why regularization prevents overfitting. Suppose we use the hyperbolic tangent activation function, g(z) = tanh(z).
Using g(z) to denote tanh(z), notice that if z is very small, that is, if z takes values in only a small range around zero, then we are using the nearly linear part of the hyperbolic tangent function; only when z extends to larger or smaller values does the activation function become nonlinear.
The intuition to take away is this: if the regularization parameter λ is large, the parameters W will be relatively small, because large weights are penalized in the cost function. And since z = Wa + b, if W is small then, relatively speaking, z will also be small (here we can ignore the effect of b).
In particular, if z ends up taking values in this small range around zero, then g(z) is roughly a linear function, so every layer is roughly linear, like a linear regression function.
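The claim that tanh is nearly linear around zero is easy to check numerically: near z = 0, tanh(z) is almost exactly equal to z (its tangent line at the origin), while for large |z| it saturates and is clearly nonlinear. The specific test values below are arbitrary:

```python
import numpy as np

# For small z, tanh(z) is nearly equal to z, so a unit whose inputs
# stay in this range behaves almost linearly.
small = np.array([-0.1, -0.01, 0.01, 0.1])
large = np.array([-3.0, 3.0])

err_small = np.max(np.abs(np.tanh(small) - small))  # tiny: ~linear regime
err_large = np.max(np.abs(np.tanh(large) - large))  # large: nonlinear regime
print(err_small, err_large)
```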
As we discussed in the first course, if every layer is linear, then the whole network is just one big linear function: even a very deep network, if its activation functions are effectively linear, can only compute a linear function. So it cannot fit the very complicated decisions, the highly nonlinear decision boundaries that overfit the data set, like the high-variance case we saw on the slide.
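This point, that stacking linear layers yields nothing more than a single linear map, can be verified directly: composing three weight matrices with identity (linear) activations gives exactly the same output as one collapsed matrix product. A small sketch with random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
# Three "layers" with linear (identity) activations
W1, W2, W3 = (rng.standard_normal((4, 4)) for _ in range(3))
x = rng.standard_normal((4, 1))

deep_output = W3 @ (W2 @ (W1 @ x))  # applied layer by layer
collapsed = (W3 @ W2 @ W1) @ x      # one equivalent linear map
print(np.allclose(deep_output, collapsed))
```

So when regularization pushes every tanh unit into its linear regime, the deep network loses its expressive power and behaves much like this collapsed linear model.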
To summarize: if the regularization parameter λ becomes large, the parameters W become small, and so z will be relatively small (again ignoring the effect of b). With z confined to a small range, the activation function, here tanh, behaves almost linearly, so the whole network computes something close to a linear function. A linear function is very simple, not a highly complicated nonlinear function, so overfitting does not happen.
You will see these results for yourself when you implement regularization in the programming assignments. Before wrapping up regularization, let me give you one implementation tip. When adding regularization, we modified the definition of the cost function J by adding the term (λ/2m) Σ_l ||W^[l]||²_F, aimed at penalizing the weights for becoming too large.
If you use gradient descent, one step in debugging it is to plot the cost function J as a function of the number of gradient descent iterations. You should see that J decreases monotonically after every iteration of gradient descent. If you implement regularization, keep in mind that J now has a new definition. If you plot the original definition of J, just the first term without the regularization term, you may not see a monotonic decrease. To debug gradient descent, be sure to plot the newly defined function J, which includes the second, regularization term; otherwise J may not decrease monotonically on every iteration.
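The debugging tip above can be sketched end to end. This is a toy example (made-up data, arbitrary hyperparameters), using L2-regularized logistic regression as the simplest stand-in for a network: the quantity tracked each iteration is the new J, cross-entropy plus the λ/(2m)·||w||² term, and with a small enough learning rate it decreases monotonically:

```python
import numpy as np

rng = np.random.default_rng(1)
m, lambd, alpha = 50, 0.5, 0.1
X = rng.standard_normal((2, m))                    # toy inputs, 2 features
y = (X[0] + X[1] > 0).astype(float).reshape(1, m)  # toy labels
w, b = np.zeros((2, 1)), 0.0

costs = []
for _ in range(200):
    z = w.T @ X + b
    a = 1 / (1 + np.exp(-z))                       # sigmoid activation
    cross_entropy = -np.mean(y * np.log(a + 1e-12)
                             + (1 - y) * np.log(1 - a + 1e-12))
    # Track the NEW cost J, including the regularization term
    J = cross_entropy + (lambd / (2 * m)) * np.sum(w ** 2)
    costs.append(J)
    dz = a - y
    dw = X @ dz.T / m + (lambd / m) * w            # gradient includes L2 term
    db = np.mean(dz)
    w -= alpha * dw
    b -= alpha * db

print(costs[0], costs[-1])  # the regularized J shrinks over iterations
```

Plotting `costs` against the iteration number is exactly the monotonically decreasing curve you should look for when debugging.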
That's L2 regularization, the regularization technique I use most often when training deep learning models. Deep learning has another commonly used regularization technique called dropout regularization, which we will cover in the next lesson.