Autonomous driving study notes: Image Captioning

1 Introduction

For the sequence mapping, we are currently planning to use the Transformer architecture;

2 Acknowledgements

Thanks to the public account "Artificial Intelligence Technology Dry Goods" for its article "How to use pytorch's built-in torch.nn.CTCLoss elegantly";

It taught me a lot!

3 Rules for building the vocabulary

We use "," as a unified separator;

3.1 Dictionary class-Vocab

We use a set as the underlying data structure of the dictionary;
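As an illustration of the two rules above, here is a minimal sketch of such a dictionary class; the class name, the <blank> token, and the itos/stoi helpers are my own hypothetical choices, not the project's actual code:

    # A minimal Vocab sketch: collect unique tokens with a set,
    # then freeze them into index mappings.
    class Vocab:
        def __init__(self, labels):
            tokens = set()                       # a set guarantees uniqueness
            for label in labels:
                tokens.update(label.split(","))  # "," is the unified separator
            # <blank> is reserved for CTC; keep it at a fixed index (0 here)
            self.itos = ["<blank>"] + sorted(tokens)
            self.stoi = {t: i for i, t in enumerate(self.itos)}

        def encode(self, label):
            return [self.stoi[t] for t in label.split(",")]

    vocab = Vocab(["a,p,p,l,e", "c,a,r"])
    print(vocab.encode("c,a,p"))  # -> [2, 1, 5] with this toy data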

4 Model design

4.1 Backbone - rec_r34_vd

The design of the backbone network is based on the text recognition model of PaddleOCR;

The link to the code is as follows:

https://github.com/PaddlePaddle/PaddleOCR/blob/develop/ppocr/modeling/backbones/rec_resnet_vd.py

4.2 Loss function - CTC Loss

The mathematical derivation of the CTC Loss function:

https://zhuanlan.zhihu.com/p/43534801

For the dynamic-programming procedure used in the CTC Loss calculation, you can refer to the slide deck by Deep Systems:

https://docs.google.com/presentation/d/12gYcPft9_4cxk2AD6Z6ZlJNa3wvZCW1ms31nhq51vMk/pub?start=false&loop=false&delayms=3000#slide=id.g24e9f0de4f_0_332

4.2.1 Code - torch.nn.CTCLoss

We use PyTorch's built-in CTCLoss;

CTCLoss accepts its targets in two formats: padded and un-padded;

Here we recommend the un-padded format: it saves the padding step in advance, is more convenient, and is easier to understand;


nn.CTCLoss() parameter description:

blank - the blank ID:

It acts like a placeholder: in CTC Loss it separates two occurrences of the same character, e.g. "ap-ple", where the blank keeps the two identical "p" elements in the string apart;

We also add a <blank> token to the dictionary and pass its dictionary ID as the blank parameter here;
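A minimal sketch of this wiring, assuming the hypothetical Vocab above puts <blank> at index 0:

    import torch.nn as nn

    # blank must match the ID of the <blank> token in the dictionary
    blank_id = vocab.stoi["<blank>"]   # 0 with the Vocab sketch above
    ctc_loss = nn.CTCLoss(blank=blank_id)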

ctc_loss() parameter description:

The reference code is:

loss = ctc_loss(output.log_softmax(2), target, input_lengths, target_lengths)

input - the prediction array:

The model's predictions must go through log_softmax(2), i.e. log-softmax over the class dimension, before being passed to the loss;

targets - a long array of concatenated targets:

It is called a long array because it is spliced together from the individual target sequences; conceptually, the splicing works like "long_array += array";
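A short sketch of this splicing, reusing the hypothetical vocab and ctc_loss from above; the tensor names and the dummy model output are illustrative:

    import torch

    # Encode two sample labels into ID sequences
    batch_ids = [vocab.encode("c,a,r"), vocab.encode("a,p,p,l,e")]

    # Un-padded style: concatenate all targets into one long 1-D tensor
    # (the "long_array += array" splicing) and record each length
    targets = torch.tensor([i for ids in batch_ids for i in ids])   # shape (8,)
    target_lengths = torch.tensor([len(ids) for ids in batch_ids])  # [3, 5]

    # input_lengths is the output time dimension T for each sample
    T, N, C = 32, len(batch_ids), len(vocab.itos)
    output = torch.randn(T, N, C)        # dummy model output of shape (T, N, C)
    input_lengths = torch.full((N,), T)

    loss = ctc_loss(output.log_softmax(2), targets, input_lengths, target_lengths)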

For a visual explanation, please refer to the blog post I wrote.
5 Model debugging

5.1 Loss becomes "nan" - numerical overflow during the model's calculations

5.1.1 "nan" caused by gradient explosion - not by an excessively large learning rate lr

Gradient explosion can make the loss become "nan": the gradients overflow numerically during backpropagation (such problems are not necessarily caused by bugs in the code);

Observation point 1: after reducing the learning rate, the gradient explosion disappears, which indicates the problem is not in the loss calculation but in gradient backpropagation;

Observation point 2: with the initial large learning rate, "loss is nan" occurs only occasionally rather than on every training run, which indicates the forward calculation of the loss is fine;

Observation point 3: with the initial large learning rate, the model still has a fair probability (for example, 50%) of converging very well, which indicates the learning-rate setting itself is not the problem;

To sum up, this is probably because the model contains many fully connected structures, so the backward pass chains many multiplications, which can cause a gradient explosion;

Conjecture: gradient clipping may avoid the numerical overflow during gradient backpropagation (see the sketch below);

Experimental result: using gradient clipping is indeed feasible;
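A sketch of the corresponding training step, continuing from the snippet above; model, optimizer, and the max_norm value are placeholders to adapt:

    # One training step with gradient clipping to guard against overflow
    optimizer.zero_grad()
    loss = ctc_loss(output.log_softmax(2), targets, input_lengths, target_lengths)
    loss.backward()
    # Clip the global gradient norm before the update; max_norm is a
    # hyperparameter to tune (5.0 is only a placeholder)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()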

6 Questions and notes

6.1 Why can't the Transformer specify the size of the output encoding?

While coding today, this question came to mind: why can't the Transformer specify the size of its output encoding?

I looked at the PyTorch interface documentation, and indeed there is no such option.

Then I asked teacher Liangliang, who said it is because PyTorch's implementation does not provide this feature.

He suggested that I take a look at OpenNMT-py. Thank you, teacher Liangliang!
