In this loop, with batch_size=64, the batches are taken as
i=0-->[0, 63],
i=64-->[64, 127]
....
Training is normal at first. Then, in the iteration i=320-->[320, 383], the loss becomes NaN.
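As a rough sketch of this kind of loop (a PyTorch-style setup with an illustrative tiny model and random data, not the original model from this post), a NaN check makes the failing batch easy to spot:

```python
import math
import torch
import torch.nn as nn

# illustrative placeholders: a tiny linear model on random data
train_x = torch.randn(1000, 10)
train_y = torch.randn(1000, 1)
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

batch_size = 64  # later changed to 32

for i in range(0, len(train_x), batch_size):
    # one batch per step: i=0 -> rows [0, 63], i=64 -> [64, 127], ...
    batch_x = train_x[i:i + batch_size]
    batch_y = train_y[i:i + batch_size]

    optimizer.zero_grad()
    loss = criterion(model(batch_x), batch_y)

    # stop as soon as the loss becomes NaN (in this post it happened at i=320)
    if math.isnan(loss.item()):
        print(f"loss became NaN at i={i}")
        break

    loss.backward()
    optimizer.step()
```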
Reason: gradient explosion
The referenced article explains it more clearly; here I only summarize it at a surface level.
Inspecting the gradients shows that by i=256 the parameter gradients have already become very large, on the order of 1e+16 to 1e+17.
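A minimal way to observe this (assuming the PyTorch-style loop sketched above) is to print each parameter's largest gradient magnitude right after `loss.backward()`:

```python
# inside the training loop, immediately after loss.backward():
# print the largest gradient magnitude per parameter; in this post these
# values had already jumped to around 1e+16 - 1e+17 by i=256
for name, param in model.named_parameters():
    if param.grad is not None:
        print(i, name, param.grad.abs().max().item())
```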
Method
batch_size=64 ---> changed to 32
Note: I have only tried this one method. You could also lower the learning rate, normalize or standardize the dataset, and so on (see the sketch below). If changing batch_size stops working, I will try other methods and add them here later.
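A sketch of those adjustments, reusing the placeholder `train_x`, `train_y`, and `model` from the loop above; the exact values are illustrative, not the ones from this post:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 1) smaller batches: batch_size 64 -> 32
train_loader = DataLoader(TensorDataset(train_x, train_y),
                          batch_size=32, shuffle=True)

# 2) lower learning rate, e.g. 1e-2 -> 1e-3 (illustrative values)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# 3) standardize the inputs to zero mean / unit variance
mean = train_x.mean(dim=0)
std = train_x.std(dim=0) + 1e-8  # avoid division by zero
train_x = (train_x - mean) / std
```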