Gradient penalty: input gradient penalty, parameter gradient penalty, and the relationship between the two

Input gradient penalty: [perturbing the input samples] [virtual adversarial training]
Parameter gradient penalty: [Flooding]

Gradient penalty with respect to the input: ‖∇_x f(x;θ)‖²

Reference from: Talking about adversarial training: meaning, method and thinking (with Keras implementation)
Applying an adversarial perturbation ϵ∇_x L(x, y; θ) to the input samples is, to a certain extent, equivalent to adding a "gradient penalty" term to the loss.
The gradient penalty says that "samples of the same class must not only be put into the same pit, but also lie at the bottom of the pit".
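As a concrete illustration (not the referenced post's Keras code), here is a minimal PyTorch-style sketch of an explicit input gradient penalty; `model`, `x`, `y` and the weight `lambda_gp` are placeholder names:

```python
import torch
import torch.nn.functional as F

def input_gradient_penalty_loss(model, x, y, lambda_gp=0.01):
    """Cross-entropy plus lambda_gp * ||grad_x L||^2 (a sketch; names are placeholders)."""
    x = x.clone().requires_grad_(True)                  # track gradients w.r.t. the input
    loss = F.cross_entropy(model(x), y)
    grad_x, = torch.autograd.grad(loss, x, create_graph=True)  # keep the graph so the penalty is trainable
    penalty = grad_x.pow(2).flatten(1).sum(dim=1).mean()        # batch mean of ||grad_x L||^2
    return loss + lambda_gp * penalty
```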

Reference from: Random Talk on Generalization: From Random Noise and Gradient Penalty to Virtual Adversarial Training

Gradient penalty with respect to the parameters: ‖∇_θ f(x;θ)‖²

Reference from:
Do we really need to reduce the loss of the training set to zero?
Looking at optimization algorithms from the perspective of dynamics (5): Why shouldn't the learning rate be too small?

Is too small a learning rate advisable?
Google's recent arXiv paper "Implicit Gradient Regularization" attempts to answer this question. It points out that a finite learning rate implicitly adds a gradient penalty term to the optimization process, and that this gradient penalty term helps improve generalization, so even if computing power and time were not a concern, the learning rate should not be too small.
An appropriate, rather than overly small, learning rate brings an implicit gradient penalty term to the optimization process, helping convergence to a more stable region.
Proof sketch:
The discretized iteration implicitly introduces a gradient penalty term, which helps the model generalize. As γ→0, this implicit penalty becomes weaker and eventually disappears.
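Schematically (using γ for the learning rate; the γ/4 coefficient is the one usually quoted for plain gradient descent and should be read as indicative rather than exact here), the backward-error-analysis argument says that the discrete update θ_{k+1} = θ_k − γ∇_θ L(θ_k) follows, to second order in γ, the gradient flow of a modified loss:

```latex
\tilde{L}(\theta) \;=\; L(\theta) \;+\; \frac{\gamma}{4}\,\bigl\|\nabla_\theta L(\theta)\bigr\|^{2},
```

so the implicit penalty term is proportional to γ and vanishes as γ → 0.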
Therefore, the conclusion is that the learning rate should not be too small: a larger learning rate not only speeds up convergence but also improves the generalization ability of the model. Of course, some readers may ask: if I add the gradient penalty to the loss directly, can I then use an arbitrarily small learning rate? Theoretically yes; adding the gradient penalty to the loss directly is what the original paper calls the "explicit gradient penalty".
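A hedged sketch of such an explicit parameter gradient penalty in PyTorch (placeholder names again; the γ/4 weight simply mirrors the implicit term above and is not prescribed here):

```python
import torch
import torch.nn.functional as F

def explicit_parameter_gradient_penalty_loss(model, x, y, gamma=0.1):
    """Cross-entropy plus (gamma / 4) * ||grad_theta L||^2 (a sketch, not the paper's exact recipe)."""
    loss = F.cross_entropy(model(x), y)
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)  # differentiable parameter gradients
    penalty = sum(g.pow(2).sum() for g in grads)                   # ||grad_theta L||^2 over all parameters
    return loss + (gamma / 4) * penalty                            # note: double backprop, so this is costly
```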

When training a model, do we need to drive the loss all the way to 0?
Obviously not. Normally, once the training loss drops below a certain value, the validation loss starts to rise, so there is no need to push the training loss down to 0.
That being the case, once that threshold is reached, can we do something else to further improve the model's performance?
The ICML 2020 paper "Do We Need Zero Training Loss After Achieving Zero Training Error?" answers this question, although its answer stays at the level of "what" and does not explain "why" very well.
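Concretely, the paper's Flooding method replaces the loss L(θ) with |L(θ) − b| + b, where b > 0 is a chosen "flood level". A minimal PyTorch-style sketch (placeholder names, b set arbitrarily):

```python
import torch.nn.functional as F

def flooding_loss(model, x, y, b=0.05):
    """Flooding: |L - b| + b descends while L > b and ascends while L < b."""
    loss = F.cross_entropy(model(x), y)
    return (loss - b).abs() + b   # gradients match L above the flood level and flip sign below it
```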
Once the loss drops to around b, training essentially alternates between gradient descent and gradient ascent.
Intuitively it feels as if one step down and one step up simply cancel each other out. Is that really the case?
Let's do the math. Suppose we first take one gradient descent step and then one gradient ascent step, each with learning rate ε. Approximating the composition with a Taylor expansion, the net result is equivalent to a single gradient descent step, with learning rate ε²/2, on the gradient penalty ‖g(θ)‖² = ‖∇_θ L(θ)‖².
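Spelled out (a schematic derivation with g(θ) = ∇_θ L(θ) and a first-order Taylor expansion in ε):

```latex
\theta_{n}   = \theta_{n-1} - \varepsilon\, g(\theta_{n-1}) \quad \text{(descent step)} \\
\theta_{n+1} = \theta_{n}   + \varepsilon\, g(\theta_{n})   \quad \text{(ascent step)} \\
\theta_{n+1} \approx \theta_{n-1} - \varepsilon^{2}\,\nabla_{\theta} g(\theta_{n-1})\, g(\theta_{n-1})
             = \theta_{n-1} - \tfrac{\varepsilon^{2}}{2}\,\nabla_{\theta}\bigl\|g(\theta_{n-1})\bigr\|^{2}.
```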
Even better, if the order is changed to "ascend first, then descend", the resulting expression is still the same.
Therefore, on average, Flooding's change to the loss function is equivalent to minimizing ‖∇_θ L(θ)‖² once the loss is already small enough, i.e. pushing the parameters toward a more stable region, which usually improves generalization (better robustness to perturbations). To a certain extent this explains why Flooding works.

The relationship between the two gradient penalties:

Reference from: An inequality between input gradient penalty and parameter gradient penalty

Google's recent paper "The Geometric Occam's Razor Implicit in Deep Learning" is a partial answer to this question.
According to the gradient penalty of the above parameters, it is pointed out that: SGD implicitly includes the gradient penalty for the parameters,
and the formula (2 ) indicates that the gradient penalty on the parameters implicitly includes the gradient penalty on the input, and the gradient penalty on the input is related to the Dirichlet energy, which can be used as a representation of the model complexity.
So after a series of reasoning, the conclusion is: SGD itself will tend to choose a model with a relatively small complexity.
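For reference, the Dirichlet energy mentioned here is essentially the integrated (or data-averaged) squared input gradient; up to normalization conventions, which may differ from the paper's, it has the form:

```latex
E(f) \;=\; \frac{1}{2}\int \bigl\|\nabla_{x} f(x;\theta)\bigr\|^{2}\, dx
\;\;\approx\;\; \frac{1}{2N}\sum_{i=1}^{N} \bigl\|\nabla_{x} f(x_{i};\theta)\bigr\|^{2},
```

which is why a small input gradient penalty can be read as low model complexity.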

However, the original paper makes a small mistake when interpreting formula (2). It argues that ‖W(t)‖ will be very close to 0 in the initial stage, so the bracketed term in formula (2) will be very large; hence, to reduce the parameter gradient penalty on the right-hand side of formula (2), the input gradient penalty on its left-hand side must be made small enough. However, from "Understanding the Initialization Strategy of Model Parameters from a Geometric Perspective" we know that commonly used initialization methods are actually close to orthogonal initialization, and the spectral norm of an orthogonal matrix is 1; if the activation function is taken into account, the spectral norm at initialization is actually greater than 1. So it is not true that ‖W(t)‖ is very close to 0 at initialization.
In fact, even for a network that has not yet been trained, the model parameters and each layer's inputs and outputs basically stay in a stable state, so throughout the entire training process the fluctuations of ‖h(t)‖, ‖W(t)‖ and ‖∇_x h(t)‖ are not large. The gradient penalty on the parameters on the right-hand side is therefore approximately equivalent, up to a roughly constant factor, to the gradient penalty on the input on the left-hand side. This is this author's understanding, and it does not require the assumption that "‖W(t)‖ will be very close to 0".



Origin blog.csdn.net/weixin_36378508/article/details/127278357