When the loss suddenly becomes NaN partway through TensorFlow training (a tfdbg use case)

Disclaimer: This article is original content, all rights reserved. https://blog.csdn.net/weixin_41864878/article/details/89317760

Regarding the problem of the loss suddenly becoming NaN, most of what you find online attributes it to gradient explosion, but the discussion really needs to be split into cases.
First, be clear about when the NaN appears during training:
(1) The loss is NaN from the very first iterations: in this case the loss stays NaN throughout, and it is caused by gradient explosion.
(2) The loss suddenly becomes NaN late in training (after n normal iteration steps): this is not gradient explosion; it is usually related to a log introduced somewhere in the loss computation. This post focuses on solving this case.

Solutions for NaN loss caused by gradient explosion

These methods have already been described in many places online:
(1) Lower the learning rate: 1e-4 ~ 1e-6 is usually appropriate; if NaN still appears below 1e-6, you should consider adjusting the network structure instead.
(2) Reduce the batch size: personally, when memory is sufficient, I do not put much hope in this parameter...
(3) Introduce a regularization term: for example, add L2 regularization through each layer's kernel_regularizer parameter; with that in place you can skip gradient clipping (more on this later).
(4) Add BN layers: I usually add one after the activation function in every layer. There are good discussions online about whether BN works better before or after the activation; you can also run your own experiment and decide.
(5) Normalize the input data (a combined sketch of (3)-(5) appears after the clipping snippets below).
(6) Clip the gradients. There are two ways to do this.
The first is local, per-gradient clipping (my own understanding is that each gradient produced by the network structure enters the clipping computation on its own):

import tensorflow as tf

optimizer = tf.train.AdamOptimizer(learning_rate=0.001, beta1=0.5)
grads = optimizer.compute_gradients(loss)
for i, (g, v) in enumerate(grads):
    if g is not None:
        grads[i] = (tf.clip_by_norm(g, 5), v)  # clip each gradient's norm; the threshold is set to 5 here
train_op = optimizer.apply_gradients(grads)

The second clips by the global norm after all gradients have been computed:

optimizer = tf.train.AdamOptimizer(learning_rate=0.001, beta1=0.5)
grads, variables = zip(*optimizer.compute_gradients(loss))
grads, global_norm = tf.clip_by_global_norm(grads, 5)  # scale all gradients so their global norm is at most 5
train_op = optimizer.apply_gradients(zip(grads, variables))

As for the difference between the two, I have seen other blogs say it lies in computation time, with the second being more time-consuming. I have not run targeted experiments, so I cannot draw a conclusion myself.
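Since points (3)-(5) above combine naturally, here is a minimal sketch of what they might look like together in a TF 1.x graph; the placeholder names (x, labels, is_training), the layer sizes, and the 1e-4 regularization weight are my own illustrative assumptions, not values from this post:

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 64])   # raw input, hypothetical shape
labels = tf.placeholder(tf.int64, [None])
is_training = tf.placeholder(tf.bool, [])

# (5) normalize the input data
mean, var = tf.nn.moments(x, axes=[0])
x_norm = (x - mean) / tf.sqrt(var + 1e-8)

# (3) L2 regularization on every layer's kernel via kernel_regularizer
l2 = tf.keras.regularizers.l2(1e-4)
h = tf.layers.dense(x_norm, 128, kernel_regularizer=l2)

# (4) BN added after the activation (my habit; before the activation is also common)
h = tf.nn.relu(h)
h = tf.layers.batch_normalization(h, training=is_training)

logits = tf.layers.dense(h, 10, kernel_regularizer=l2)
data_loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
# the kernel_regularizer terms end up in the REGULARIZATION_LOSSES collection
loss = data_loss + tf.add_n(tf.losses.get_regularization_losses())

Remember that with batch_normalization the update ops in tf.GraphKeys.UPDATE_OPS have to be run together with the train op, otherwise the moving statistics never get updated.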
Now a word about the relationship between gradient clipping and the regularization term: gradient clipping is essentially an L2 normalization applied to gradients greater than 1 so that they end up below 1 (I do not want to typeset the formulas by hand; the image comes from https://blog.csdn.net/guolindonggld/article/details/79547284).
[Figure: the gradient clipping formula, from the blog linked above]
So once L2 regularization has been introduced through the kernel_regularizer parameter in the network definition, gradient clipping does not bring any extra benefit (I only discovered this after trying gradient clipping and finding it useless).
To see this for yourself, you can use tf.gradients() to look at the network's gradients before and after adding gradient clipping.
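A minimal sketch of that check, assuming the loss tensor and the clipping threshold of 5 from the snippets above; fetching the two norms during training shows whether clipping actually changes anything:

import tensorflow as tf

params = tf.trainable_variables()
raw_grads = tf.gradients(loss, params)                 # gradients before clipping
clipped_grads, _ = tf.clip_by_global_norm(raw_grads, 5)

raw_norm = tf.global_norm(raw_grads)                   # global norm before clipping
clipped_norm = tf.global_norm(clipped_grads)           # global norm after clipping

# e.g. sess.run([raw_norm, clipped_norm], feed_dict=...) every few steps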

Solution for NaN loss caused by log(0)

In many cases the log is invisible: it is wrapped inside a TensorFlow API, so we call the function without realizing that it actually performs a log operation.
(1) Using cross-entropy
The cross-entropy computation introduces the term y_{truth} * log(y_{predict}), so when the softmax output predicts a probability of 0, the result is NaN.
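A tiny numpy check makes the failure mode concrete (just an illustration, not code from this post):

import numpy as np

print(np.log(0.0))          # -inf: log(0) diverges
print(1.0 * np.log(0.0))    # -inf: the loss blows up when the true class gets probability 0
print(0.0 * np.log(0.0))    # nan:  0 * -inf is undefined, and the NaN then propagates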
Improvements:
Rewrite the cross-entropy: add a small positive number to the predicted value, cross_entropy = y_{truth} * log(y_{pre} + 1e-10), then take a reduce_mean to get the final loss.
Rewrite the log: clip the value that goes into the log, e.g. tf.log(tf.clip_by_value(tf.sigmoid(y_pre), 1e-8, tf.reduce_max(y_pre)))
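A minimal sketch of both fixes, assuming y_pre already holds the softmax probabilities; the names y_truth and y_pre and the shapes are placeholders, and the clip range [1e-8, 1.0] is a simplification of the sigmoid / reduce_max variant above:

import tensorflow as tf

y_truth = tf.placeholder(tf.float32, [None, 10])   # one-hot labels, hypothetical shape
y_pre = tf.placeholder(tf.float32, [None, 10])     # predicted probabilities from the softmax

# Fix 1: add a small positive constant before the log
cross_entropy = -tf.reduce_mean(
    tf.reduce_sum(y_truth * tf.log(y_pre + 1e-10), axis=1))

# Fix 2: clip the value fed into the log instead
safe_log = tf.log(tf.clip_by_value(y_pre, 1e-8, 1.0))
cross_entropy_clipped = -tf.reduce_mean(tf.reduce_sum(y_truth * safe_log, axis=1))

In practice, computing the loss directly from the logits with tf.nn.softmax_cross_entropy_with_logits also sidesteps the problem, since it handles the log internally in a numerically stable way.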
(2) TensorFlow's built-in debug tool
If adjusting the network structure does not help either, congratulations, it is time to bring in the tfdbg toolkit:

import tensorflow as tf
from tensorflow.python import debug as tf_debug

# Wrap the session with tfdbg and register a filter that checks for inf/nan values
sess = tf.Session()
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)

Then run the code as usual. In the tfdbg command window, enter run -f has_inf_or_nan; the program will run until a NaN value appears and then drop into the debugger. The price is that execution becomes roughly 5 times slower.
After it stops, you can locate the NaN from the node it reports and investigate from there.
For details see https://zhuanlan.zhihu.com/p/33264569

Updated on 2019/6/11
tfdbg is really handy for finding NaN values, provided the NaN shows up fairly early in your run.
The workflow is as follows:
after running python test.py as usual, the tfdbg session window appears as soon as the session is created,
and you enter the filter command there:
[Screenshot: the command entered in the tfdbg window]
After pressing Enter, the program drops back to the normal running interface. Then you wait for the NaN to occur; tfdbg automatically jumps back in, listing the names of the tensors containing NaN in the chronological order in which they appeared, and you can start investigating. The problem is most likely caused by the first tensor, and in fact just reading the tensor name should roughly tell you where the problem lies.
In my case the first tensor listed was the data tensor, so I roughly knew the data itself was the problem.
[Screenshot: tfdbg listing the tensors that contain NaN values]
Here you can use pt IteratorGetNext:1 to view that tensor's values; pt -a prints all of them (not shown here). There was indeed a NaN value in the data.
[Screenshot: the pt output showing the NaN value in the data]
References:
https://blog.csdn.net/accumulate_zhang/article/details/79890624
https://blog.csdn.net/shwan_ma/article/details/80472996
https://blog.csdn.net/leadai/article/details/79143002
https://zhuanlan.zhihu.com/p/33264569
