This article introduces a method for dynamically adjusting the learning rate of a neural network in the tensorflow library: the meaning of the parameters of the exponential decay strategy ExponentialDecay() and its specific usage.
When training neural networks, we often need a dynamically changing learning rate, and the exponential decay strategy ExponentialDecay() is a commonly used one. In the tensorflow library, its complete name is tf.keras.optimizers.schedules.ExponentialDecay(), and its parameters are as follows.
tf.keras.optimizers.schedules.ExponentialDecay(
initial_learning_rate, decay_steps, decay_rate, staircase=False, name=None
)
First of all, we need to know that once the ExponentialDecay() strategy is applied, the program dynamically adjusts the learning rate during neural network training, and this adjustment depends on the current training step. For a detailed explanation of step, you can refer to the article Introduction to the specific meanings of neural network epoch, batch, batch size, step and iteration (https://blog.csdn.net/zhebushibiaoshifu/article/details/131086145); this article will not repeat it.
As shown in the following code, once the ExponentialDecay() strategy is applied, the program computes the current learning rate from the current training step and several parameters we set ourselves, according to the following rule. The return value of the function is the current learning rate.
def decayed_learning_rate(step):
return initial_learning_rate * decay_rate ^ (step / decay_steps)
Here, initial_learning_rate * decay_rate ^ (step / decay_steps) is the formula for the current learning rate, and initial_learning_rate, decay_rate and decay_steps are the first 3 parameters of the ExponentialDecay() function mentioned earlier: initial_learning_rate is the initial learning rate, decay_rate is the rate at which the learning rate drops, and decay_steps controls where the learning rate drops (the specific meaning will be introduced later). In addition, the strategy has a staircase parameter, which indicates whether the result of (step / decay_steps) is rounded down to an integer or kept as a decimal; the default is False, that is, the decimal result is kept (the specific meaning will be introduced later). The last parameter, name, simply gives a name to the current learning rate decay strategy; it is generally not used, so we will not introduce it further.
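To make the rule concrete, here is a minimal pure-Python sketch of the decay rule described above (this is my own illustration of the formula, not TensorFlow's actual implementation; the function signature is my own):

```python
import math

def decayed_learning_rate(step, initial_learning_rate,
                          decay_rate, decay_steps, staircase=False):
    """Sketch of the ExponentialDecay() rule described in the text."""
    exponent = step / decay_steps
    if staircase:
        # staircase=True rounds the exponent down to an integer
        exponent = math.floor(exponent)
    return initial_learning_rate * decay_rate ** exponent
```

For example, with initial_learning_rate=0.1, decay_rate=0.95 and decay_steps=95, calling decayed_learning_rate(0, 0.1, 0.95, 95) returns 0.1, while with staircase=True the result stays at 0.1 for every step below 95.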
From this, we can preliminarily see that the first 4 parameters of the ExponentialDecay() function are used to compute the current learning rate; and combining this with the formula initial_learning_rate * decay_rate ^ (step / decay_steps), we can see that since decay_rate is less than 1, the factor decay_rate ^ (step / decay_steps) decreases as the current step increases.
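We can check this decreasing behavior numerically (the values of decay_rate and decay_steps below are just illustrative assumptions):

```python
decay_rate, decay_steps = 0.95, 95  # illustrative values

# The decay factor at a few increasing steps: it shrinks monotonically.
factors = [decay_rate ** (step / decay_steps) for step in (0, 95, 190, 285)]
print(factors)
```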
Next, let's plug in some concrete numbers to see exactly what these parameters do.
As shown in the figure below, we have a training dataset containing a total of 193608 samples.
At the same time, I set the batch size of the neural network to 2048. Based on the article mentioned above, Introduction to the specific meanings of neural network epoch, batch, batch size, step and iteration (https://blog.csdn.net/zhebushibiaoshifu/article/details/131086145), within 1 epoch we train on all 193608 samples, so the total number of batches required is 193608 / 2048, that is 94.54, rounded up to 95; in other words, 95 steps are required per epoch. In addition, I set initial_learning_rate, decay_rate and decay_steps to 0.1, 0.95 and 95 respectively, and set staircase to True, as shown below.
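The steps-per-epoch arithmetic above can be verified directly (the sample count and batch size are the ones used in this article):

```python
import math

n_samples, batch_size = 193608, 2048
# One epoch visits all samples once, so the number of batches (= steps)
# per epoch is the sample count divided by the batch size, rounded up.
steps_per_epoch = math.ceil(n_samples / batch_size)
print(n_samples / batch_size)  # about 94.54
print(steps_per_epoch)         # 95
```

This is exactly why decay_steps is set to 95 here: it matches the number of steps in one epoch.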
At this point, we can introduce the specific meaning and function of each parameter. First, we start training the neural network, so step gradually increases from 0. However, because I set staircase to True, as long as the exponent (step / decay_steps) is less than 1, it is treated as 0 (the current setting rounds the result down); and since any nonzero number raised to the power 0 equals 1, the formula initial_learning_rate * decay_rate ^ (step / decay_steps) always equals initial_learning_rate at this stage, that is, the learning rate stays at 0.1. Only when step reaches the value we set for decay_steps does the exponent (step / decay_steps) become 1, so that decay_rate finally takes effect. Since I deliberately set decay_steps to 95, it stands to reason that the learning rate will drop after exactly 1 epoch, because, as we calculated earlier, 1 epoch requires 95 steps. At that point, the learning rate becomes 0.1 * 0.95.
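This staircase behavior can be simulated with plain Python (again, a sketch of the formula from earlier in the article, not TensorFlow itself):

```python
import math

initial_learning_rate, decay_rate, decay_steps = 0.1, 0.95, 95

def lr_at(step):
    # staircase=True: round the exponent down to an integer
    return initial_learning_rate * decay_rate ** math.floor(step / decay_steps)

# The learning rate holds at 0.1 through step 94, then drops at step 95.
for step in (0, 94, 95, 189, 190):
    print(step, lr_at(step))
```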
Next, we run the above code and train for 6 epochs to verify whether the learning rate changes as we imagined.
The figure below shows, in TensorBoard, how the learning rate changes with epoch. Note that I turned on the smoothing option when taking the screenshot, so the light-colored line is the one to follow.
The picture above is not very clear, so we directly export the learning rate values, as shown in the figure below. Note that although the horizontal axis in the figure is labeled step, what it actually shows is epoch. It can be seen that during epoch 0 (that is, the first epoch), the learning rate stays at 0.1; during the second epoch, the training step starts from 95 but has not yet reached 190, so (step / decay_steps) is always 1, and the learning rate is exactly 0.1 * 0.95 = 0.095 (the slight discrepancy in the figure is due to the data format's precision); then, during the third epoch, step starts from 190 but has not yet reached 285, so (step / decay_steps) is always 2, and the learning rate is 0.1 * 0.95 * 0.95 = 0.09025.
From this we can see that if I multiplied decay_steps by 10, making it 950, then the learning rate would not change during the first 10 epochs, and would only begin to decay from the 11th epoch.
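With the same sketch of the formula, setting decay_steps to 950 (an illustrative change, not what was actually trained above) shows this delayed decay:

```python
import math

def lr_at(step, decay_steps):
    # staircase=True behavior: round the exponent down to an integer
    return 0.1 * 0.95 ** math.floor(step / decay_steps)

print(lr_at(949, 950))  # still the initial 0.1 (throughout the first 10 epochs)
print(lr_at(950, 950))  # first decayed value, at the start of the 11th epoch
```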
Here my staircase parameter is set to True, which is why the results above appear; conversely, if it is set to False, then (step / decay_steps) is computed as a decimal. In other words, as long as step changes, the current learning rate also changes, but the magnitude of each change is much smaller.
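The difference can again be sketched in plain Python: with the decimal exponent, every single step changes the learning rate, but only slightly:

```python
initial_learning_rate, decay_rate, decay_steps = 0.1, 0.95, 95

# staircase=False: the exponent keeps its decimal part, so the learning
# rate shrinks a little at every step instead of in epoch-sized jumps.
lrs = [initial_learning_rate * decay_rate ** (step / decay_steps)
       for step in range(4)]
print(lrs)
```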
From this we can see that the changes in the learning rate described above are in line with our expectations. Admittedly, the learning rates corresponding to the last two epochs in the figure did not change, and I have not figured out the specific reason for this; however, learning rate decay is only an auxiliary strategy, and the code above still meets our need for dynamically adjusting the learning rate.
And with that, we're done.
Welcome to follow: Crazy Learning GIS