1 Introduction
For sequence mapping, we currently plan to use the Transformer architecture;
2 Acknowledgements
Thanks to the public account "Artificial Intelligence Technology Dry Goods" for its article "How to use PyTorch's built-in torch.nn.CTCLoss elegantly",
which taught me a great deal!
3 Rules for building the vocabulary
We use "," as a unified separator;
3.1 Dictionary class-Vocab
We use a set as the underlying data structure of the dictionary;
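The two rules above can be sketched as a small class. This is a minimal illustration, not the author's actual implementation; the class and method names (`Vocab`, `encode`, `itos`, `stoi`) are assumptions:

```python
# A minimal sketch of a Vocab class built on a set, assuming labels are
# comma-separated token strings and ID 0 is reserved for the CTC <blank>.

class Vocab:
    """Dictionary built from comma-separated label strings, backed by a set."""

    def __init__(self, labels):
        tokens = set()  # a set collects the unique tokens
        for label in labels:
            tokens.update(label.split(","))  # "," is the unified separator
        # Reserve ID 0 for the CTC <blank> token (see the loss section below).
        self.itos = ["<blank>"] + sorted(tokens)
        self.stoi = {t: i for i, t in enumerate(self.itos)}

    def encode(self, label):
        """Map a comma-separated label string to a list of token IDs."""
        return [self.stoi[t] for t in label.split(",")]

    def __len__(self):
        return len(self.itos)


vocab = Vocab(["a,p,p,l,e", "p,e,a"])
ids = vocab.encode("a,p,e")
```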
4 Model design
4.1 Backbone——rec_r34_vd
The design of the backbone network is based on the recognition model of PaddleOCR,
The link to the code is as follows:
https://github.com/PaddlePaddle/PaddleOCR/blob/develop/ppocr/modeling/backbones/rec_resnet_vd.py
4.2 Loss function-CTC Loss
The mathematical derivation of the CTC Loss function:
https://zhuanlan.zhihu.com/p/43534801
For the dynamic-programming calculation inside CTC Loss, you can refer to the PPT slides by Deep Systems:
4.2.1 Code writing——torch.nn.CTCLoss
We use PyTorch's built-in CTCLoss;
CTCLoss accepts targets in two formats:
padded and un-padded;
Here we recommend the un-padded format: it skips the padding step, which is more convenient and also easier to understand;
nn.CTCLoss() parameter description:
blank——blank ID:
The blank token acts like a placeholder that separates repeated characters in CTC Loss: for example, in the alignment "ap-ple", the blank between the two p's keeps the repeated characters from being collapsed into one;
We also add a <blank> token to the dictionary, and pass its ID in the dictionary as the blank parameter here;
ctc_loss() parameter description:
The reference code is
loss = ctc_loss(output.log_softmax(2), target, input_lengths, target_lengths)
input——prediction tensor:
The model's predictions must be passed through log_softmax(2) (log-softmax over the class dimension) before being fed to the loss;
targets——a long 1-D tensor of target IDs:
It is called a long array because it is formed by concatenating the target sequences of the whole batch, in the spirit of "long_array += array";
For a visual explanation, please refer to the blog post I published earlier;
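Putting the pieces above together, here is a runnable sketch of the un-padded usage; the shapes and label IDs are made up for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical shapes: T = input length, N = batch size, C = classes (blank = 0)
T, N, C = 50, 2, 20
ctc_loss = nn.CTCLoss(blank=0)  # blank ID matches the <blank> entry in the dictionary

# Raw model outputs, e.g. from the backbone + sequence head
output = torch.randn(T, N, C)

# Un-padded targets: concatenate every sample's label IDs into one 1-D tensor
target_a = torch.tensor([3, 5, 5, 7], dtype=torch.long)
target_b = torch.tensor([2, 9], dtype=torch.long)
targets = torch.cat([target_a, target_b])              # "long_array += array"
target_lengths = torch.tensor([4, 2], dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)

# log_softmax over the class dimension (dim 2) before computing the loss
loss = ctc_loss(output.log_softmax(2), targets, input_lengths, target_lengths)
```

Because the targets tensor is 1-D, CTCLoss uses target_lengths to split it back into per-sample sequences, which is why no padding is needed.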
5 Model debugging
5.1 Loss becomes "nan"-the model overflows numerically during calculation
5.1.1 "nan" caused by gradient explosion-not by an excessive learning rate lr
Gradient explosion causes the loss to become "nan": the gradient overflows numerically during backpropagation (such problems are not necessarily caused by bugs in the code);
Observation 1: after reducing the learning rate, the gradient explosion disappears, which indicates the problem lies in gradient backpropagation, not in the loss calculation;
Observation 2: with the initial large learning rate, "loss is nan" occurs only occasionally, not on every run, which indicates the forward computation of the loss is fine;
Observation 3: with the initial large learning rate, the model still has a fair chance (roughly 50%) of converging very well, which indicates the learning-rate setting itself is reasonable;
In summary, the likely cause is the many fully connected layers in the model: repeated multiplications during backpropagation amplify the gradient until it explodes;
Conjecture: gradient clipping may avoid the numerical overflow during gradient backpropagation;
Experimental result: gradient clipping works;
6 Questions and notes
6.1 Why can't the Transformer specify the size of its output encoding?
While coding today, I wondered: why can't the Transformer specify the size of its output encoding?
I checked the PyTorch interface documentation, and indeed there is no such option.
I then asked Teacher Liangliang, who explained that PyTorch's implementation simply does not provide this feature.
He suggested I take a look at OpenNMT-py. Thank you, Teacher Liangliang!