Neural network learning rates: a detailed explanation of the parameters and usage of the ExponentialDecay exponential decay strategy

  This article introduces a method in the tensorflow library for dynamically adjusting the learning rate of a neural network - the ExponentialDecay() exponential decay strategy - explaining the meaning of its parameters and how to use it.

  When training neural networks, we often need a dynamically changing learning rate, and the exponential decay ExponentialDecay() strategy is one of the most commonly used. In the tensorflow library, its full name is tf.keras.optimizers.schedules.ExponentialDecay(), and its parameters are as follows.

tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate, decay_steps, decay_rate, staircase=False, name=None
)
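
  As a quick usage sketch (the parameter values here are placeholders, not the settings used later in this article), the schedule object built this way can be passed straight to a Keras optimizer through its learning_rate argument:

import tensorflow as tf

# Build the exponential decay schedule (placeholder values, for illustration only).
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,   # learning rate at step 0
    decay_steps=1000,            # length of one decay period, in steps
    decay_rate=0.95,             # factor applied once per decay period
    staircase=False              # False: decay smoothly at every step
)

# The optimizer queries the schedule with the current training step
# to obtain the learning rate for that step.
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)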

  First of all, we need to know that once the ExponentialDecay() strategy is applied, the program dynamically adjusts the learning rate during training, and this adjustment depends on the current training step. For a detailed explanation of what a step is, you can refer to the article Introduction to the specific meanings of neural network epoch, batch, batch size, step and iteration (https://blog.csdn.net/zhebushibiaoshifu/article/details/131086145); this article will not repeat it.

  As shown in the following code, after the ExponentialDecay() strategy is applied, the program calculates the current learning rate from the current training step and the several parameters we set, according to the rule below. The return value of the function is the current learning rate.

def decayed_learning_rate(step):
  # Exponentiation written as ** so this is valid Python rather than pseudocode.
  return initial_learning_rate * decay_rate ** (step / decay_steps)

  Here, initial_learning_rate * decay_rate ^ (step / decay_steps) is the formula for the current learning rate. initial_learning_rate, decay_rate and decay_steps are the first 3 parameters of the ExponentialDecay() function mentioned earlier: initial_learning_rate is our initial learning rate, decay_rate is the rate at which the learning rate drops, and decay_steps determines where the learning rate drops (the specific meaning will be introduced later). In addition, the ExponentialDecay() strategy has two more parameters: staircase indicates whether (step / decay_steps) is rounded down to an integer or kept as a decimal, and its default value False keeps the decimal result (the specific meaning will be introduced later); the last parameter, name, only gives the current learning rate decay strategy a name and is generally not used, so we will not introduce it further.

  From this, we can see that the first 4 parameters of the ExponentialDecay() function are all used to calculate the current learning rate; and combined with the formula initial_learning_rate * decay_rate ^ (step / decay_steps), we can see that as the current step keeps increasing, decay_rate ^ (step / decay_steps) keeps decreasing (since decay_rate is smaller than 1), so the learning rate keeps decreasing.
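
  To make this concrete, here is a small worked example in plain Python (the values 0.1, 0.95 and 95 are simply the settings used later in this article), evaluating the formula at a few steps to show the learning rate shrinking as step grows:

initial_learning_rate = 0.1
decay_rate = 0.95
decay_steps = 95

def decayed_learning_rate(step):
    # Plain-Python version of the decay rule, without the staircase flooring.
    return initial_learning_rate * decay_rate ** (step / decay_steps)

for step in (0, 95, 190, 285):
    print(step, decayed_learning_rate(step))
# Prints approximately 0.1, 0.095, 0.09025 and 0.0857375.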

  Next, let's plug in some concrete numbers to see what each of these parameters actually does.

  As shown in the figure below, we have a training data set here with a total of 193608 samples.

  At the same time, I set the batch size of the neural network to 2048; so based on the article mentioned above, Introduction to the specific meanings of neural network epoch, batch, batch size, step and iteration (https://blog.csdn.net/zhebushibiaoshifu/article/details/131086145), we can see that within 1 epoch, training these 193608 samples requires 193608 / 2048 batches in total, that is 94.54, which rounds up to 95; this is equivalent to 95 steps. In addition, I set initial_learning_rate, decay_rate and decay_steps to 0.1, 0.95 and 95 respectively, and set staircase to True, as shown below.
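
  The steps-per-epoch arithmetic can be checked in a couple of lines (the sample count and batch size are the ones given above):

import math

num_samples = 193608
batch_size = 2048

steps_per_epoch = math.ceil(num_samples / batch_size)   # 193608 / 2048 = 94.54... -> 95
print(steps_per_epoch)   # 95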

  At this point, we can introduce the specific meaning and function of each parameter. First, we start training the neural network model, that is, step starts from 0 and gradually increases; but because I set staircase to True, as long as the exponent (step / decay_steps) is less than 1, it is treated as 0 (because this setting rounds (step / decay_steps) down to an integer); and since any nonzero number raised to the power 0 is 1, the formula initial_learning_rate * decay_rate ^ (step / decay_steps) always equals initial_learning_rate during this phase, that is, the learning rate stays at 0.1. Only when step reaches the decay_steps value we set, 95, does the exponent become 1, so that decay_rate finally takes effect. And here, because I deliberately set decay_steps to 95, it stands to reason that the learning rate will drop after exactly 1 epoch - we calculated earlier that 1 epoch needs 95 steps. At that point, the learning rate becomes 0.1 * 0.95.
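
  A quick way to see this behavior is to query the schedule object directly: a Keras learning-rate schedule is callable with the current step and returns the learning rate for that step. The sketch below assumes the settings described above (0.1, 95, 0.95, staircase=True):

import tensorflow as tf

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,
    decay_steps=95,        # one epoch = 95 steps in this example
    decay_rate=0.95,
    staircase=True         # floor(step / decay_steps): the rate drops in jumps
)

for step in (0, 50, 94, 95, 189, 190):
    print(step, float(lr_schedule(step)))
# Steps 0-94   -> 0.1      (floor(step / 95) = 0)
# Steps 95-189 -> 0.095    (floor(step / 95) = 1)
# Step  190    -> 0.09025  (floor(step / 95) = 2)
# (The printed values may differ slightly in the last digits due to float32 precision.)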

  Next, we run the above code and train for 6 epochs to verify whether the learning rate changes as we imagined.

  The figure below is the TensorBoard plot of the learning rate changing with epoch. Note that I turned on the smoothing option of the plot when taking the screenshot, so the light-colored line is the one to look at.

  The picture above is not very informative on its own, so we directly export the learning rate values, as shown in the figure below.

  Note that the horizontal axis in the figure is actually labeled step, but you can simply read it as epoch here. It can be seen that at epoch 0 (that is, during the first epoch), the learning rate stays at 0.1; during the second epoch, step starts from 95 but has not yet reached 190, so floor(step / decay_steps) is always 1 and the learning rate is exactly 0.1 * 0.95 = 0.095 (with a slight difference in precision due to the data format); then, during the third epoch, step starts from 190 but has not yet reached 285, so floor(step / decay_steps) is always 2, and the learning rate is already 0.1 * 0.95 * 0.95 = 0.09025.

  From this it can be seen that if I increase decay_steps by a factor of 10, so that it becomes 950, then the learning rate will not change during the first 10 epochs, and will only start to decay from the 11th epoch.
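
  This can again be sketched by querying a schedule built with decay_steps=950 (the other settings unchanged) at the step that begins each epoch:

import tensorflow as tf

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,
    decay_steps=950,       # 10 epochs * 95 steps per epoch
    decay_rate=0.95,
    staircase=True
)

for epoch in (1, 10, 11):
    first_step = (epoch - 1) * 95            # step at the start of this epoch
    print(epoch, float(lr_schedule(first_step)))
# Epochs 1-10 -> 0.1    (step < 950, so floor(step / 950) = 0)
# Epoch 11    -> 0.095  (step = 950, so floor(step / 950) = 1)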

  Here my staircase parameter is set to True, which is why the results above appear; on the contrary, if it is set to False, then (step / decay_steps) is kept as a decimal. In other words, as long as step changes, the current learning rate also changes, but the magnitude of each change is smaller.
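
  The difference between the two settings can be seen by evaluating both variants at an intermediate step, say step 50 of the first epoch (all other settings as before):

import tensorflow as tf

common = dict(initial_learning_rate=0.1, decay_steps=95, decay_rate=0.95)

stepped = tf.keras.optimizers.schedules.ExponentialDecay(staircase=True, **common)
smooth = tf.keras.optimizers.schedules.ExponentialDecay(staircase=False, **common)

print(float(stepped(50)))   # 0.1: floor(50 / 95) = 0, so no decay yet
print(float(smooth(50)))    # about 0.0973: 0.1 * 0.95 ** (50 / 95), already slightly decayed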

  From this, it can be seen that the changes in the learning rate described above are in line with our expectations. Admittedly, the learning rates of the last two epochs in the figure above did not change, and I have not figured out the specific reason for this; but learning rate reduction is, after all, just a strategy, and with the code above we have met the need to adjust the learning rate dynamically.

  And with that, we're done.

Welcome to follow: Crazy learning GIS


Origin blog.csdn.net/zhebushibiaoshifu/article/details/131092719