1. The initialization of weights and biases matters a great deal. Do not initialize directly from the standard normal distribution.
When the amount of data is small, scale the random initialization by sqrt(1/input_size). Note that the scaling factor must go inside tf.Variable; in the original snippet it was applied to the Variable itself, which produces a tensor rather than a trainable variable:
weights = tf.Variable(tf.random_normal([input_size, out_size]) * np.sqrt(1.0 / input_size))
bias = tf.Variable(tf.random_normal([1, out_size]) * np.sqrt(1.0 / input_size))
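A quick NumPy sketch (independent of TensorFlow, sizes chosen only for illustration) of why the sqrt(1/input_size) factor matters: with unscaled standard-normal weights, the variance of each pre-activation grows in proportion to the layer width, while the scaled weights keep it near 1.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, out_size = 512, 256

x = rng.standard_normal(input_size)            # unit-variance inputs

# Standard-normal weights: pre-activation variance grows with input_size.
w_raw = rng.standard_normal((input_size, out_size))
# Scaled weights: pre-activation variance stays near 1.
w_scaled = w_raw * np.sqrt(1.0 / input_size)

print(np.var(x @ w_raw))     # on the order of input_size (hundreds)
print(np.var(x @ w_scaled))  # close to 1
```

Large pre-activation variance saturates activations such as sigmoid and tanh, which is one reason unscaled initialization trains poorly.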
7. During training, the loss becomes NaN as the number of training steps increases.
Reasons: ① The learning rate is too large; try reducing it.
The following is Wang Yun's answer on Zhihu:
The most common cause is a learning rate that is too high. In classification problems, an overly high learning rate can make the model "stubbornly" assign some samples to the wrong class, driving the predicted probability of the correct class to 0 (actually floating-point underflow). Cross-entropy then evaluates to an infinite loss; once that happens, the infinite gradient turns the parameters into NaN, and the NaN spreads through the entire network.
The fix is to lower the learning rate, or even set it to 0, and check whether the problem persists. If it disappears, it was indeed a learning-rate problem. If it still occurs, the freshly initialized network is already dead, and there is most likely a bug in the implementation.
Author: Wang Yun Maigo
Link: https://www.zhihu.com/question/62441748/answer/232522878
Source: Zhihu
Copyright belongs to the author. For commercial reprints, please contact the author for authorization, and for non-commercial reprints, please indicate the source.
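The underflow mechanism this answer describes can be reproduced in a few lines of NumPy (a sketch, not TensorFlow code): once the predicted probability of the true class underflows to 0, log(0) yields infinity, and the loss and gradients become non-finite. Clipping the probabilities away from 0 is the usual guard.

```python
import numpy as np

p_true = np.float32(1e-50)           # below float32's smallest value: underflows to 0.0
loss = -np.log(p_true)               # -log(0) = inf (NumPy emits a divide warning)
print(p_true, loss)

# Guard: clip probabilities away from 0 before taking the log.
eps = 1e-7
safe_loss = -np.log(np.clip(p_true, eps, 1.0))
print(safe_loss)                     # -log(1e-7) ≈ 16.12, finite
```

TensorFlow's built-in `tf.nn.softmax_cross_entropy_with_logits` applies this kind of numerical protection internally, which is one reason to prefer it over computing softmax and log separately.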
② Other situations (quoting another Zhihu answer):
Which loss function are you using? For classification problems, use categorical cross-entropy. For regression problems, the loss may involve a division by 0, which can be fixed by adding a small epsilon to the denominator. The data itself may contain NaN: check both the inputs and the targets with numpy.any(numpy.isnan(x)). The targets must also be values the loss function can handle, e.g. with a sigmoid activation the targets should be greater than 0; check this for every dataset used.
Author: Pig Go
Link: https://www.zhihu.com/question/62441748/answer/232520044
Source: Zhihu
The copyright belongs to the author. For commercial reprints, please contact the author for authorization, and for non-commercial reprints, please indicate the source.
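The checks above can be sketched as small helpers (the function names are illustrative, not from any library): scan a batch for NaN/inf before feeding it, and add an epsilon to denominators in division-based regression losses.

```python
import numpy as np

def check_batch(x, name="input"):
    """Raise early if a batch contains NaN or inf values."""
    if np.any(np.isnan(x)):
        raise ValueError(f"{name} contains NaN")
    if np.any(np.isinf(x)):
        raise ValueError(f"{name} contains inf")

def relative_error(pred, target, eps=1e-8):
    """Division-based regression loss with an epsilon guard against /0."""
    return np.mean(np.abs(pred - target) / (np.abs(target) + eps))

clean = np.array([1.0, 2.0, 3.0])
check_batch(clean)                       # passes silently

dirty = np.array([1.0, np.nan, 3.0])
try:
    check_batch(dirty, "features")
except ValueError as e:
    print(e)                             # features contains NaN

# eps keeps the loss finite even when a target is exactly 0.
print(relative_error(np.array([0.1]), np.array([0.0])))
```

Running checks like these on both inputs and targets before training usually narrows a NaN loss down to either the data or the learning rate.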
8. (More to be added over time...