BERT (1) - Detailed explanation of BERT, Transformer, and attention

Posting the links first; I will organize them properly when I have time...

Reference link:

https://blog.csdn.net/jiaowoshouzi/article/details/89073944 BERT principles, explained very clearly; worth revisiting

Notes and thoughts on some problems of the BERT model

How to evaluate the BERT model?

Transformer questions, organized (based on Zhihu content)

 

attention https://zhuanlan.zhihu.com/p/43493999

https://zhuanlan.zhihu.com/p/27769667 attention code

https://www.zhihu.com/question/68482809 attention principles
https://zhuanlan.zhihu.com/p/31547842  √
https://zhuanlan.zhihu.com/p/53682800  attention +transformer

BERT development history https://blog.csdn.net/jiaowoshouzi/article/details/89073944

 

https://www.cnblogs.com/huangyc/p/9898852.html BERT principles   https://blog.csdn.net/u012526436/article/details/87637150

https://www.jianshu.com/p/63943ffe2bab   Some things you need to understand about BERT

http://blog.itpub.net/69942346/viewspace-2658642/ BERT pre-training model evolution process 

 

attention: https://zhuanlan.zhihu.com/p/150294471   https://www.zhihu.com/question/68482809  https://blog.csdn.net/guofei_fly/article/details/105516732 

Structures of soft attention, hard attention, and local attention
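To make the distinction concrete, here is a minimal numpy sketch of soft (scaled dot-product) attention for a single query; the inputs are toy numbers chosen for illustration, not from any of the linked posts. Soft attention weights every value and is differentiable; hard attention would instead select a single position; local attention restricts the softmax to a window around a predicted position.

```python
import numpy as np

def soft_attention(query, keys, values):
    """Soft (scaled dot-product) attention: every value receives a
    softmax weight, so the output is a differentiable weighted average."""
    d_k = query.shape[-1]
    scores = keys @ query / np.sqrt(d_k)      # one score per position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over all positions
    return weights @ values                   # weighted sum of values

# Hard attention (non-differentiable) would instead pick one position,
# e.g. values[np.argmax(scores)]; local attention applies the softmax
# only to a window of positions around an alignment point.

query = np.array([1.0, 0.0])
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([[1.0], [2.0], [3.0]])
out = soft_attention(query, keys, values)
```

Because the first and third keys score equally here, their weights are symmetric and the output lands exactly on the mean of the values.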

 

BERT_MRC https://blog.csdn.net/eagleuniversityeye/article/details/109601547

Loss function

The loss function of the classification model in the official BERT code is the negative log-likelihood (it is minimized, which is equivalent to maximizing the log-likelihood). In its standard cross-entropy form, for one-hot label y and predicted probabilities p over the classes, it is L = -Σ_i y_i log p_i.

As for why this loss function is used: in practice, logistic regression is commonly used for classification, and when logistic regression is paired with squared loss, the loss function is not convex in the parameters. So it is not that squared loss can never be used for classification problems; rather, logistic regression does not use squared loss. Since log_probs in the code is already a log-probability, squared loss is not used; the negative log-likelihood loss is used instead. See the reference link.
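The computation above can be sketched in numpy. This mirrors the structure of the loss in BERT's official classifier code (log-softmax, one-hot labels, negative log-likelihood averaged over the batch); the logits and labels below are made-up numbers for illustration.

```python
import numpy as np

def classifier_loss(logits, labels, num_labels):
    """Negative log-likelihood classification loss, numpy sketch of the
    log_softmax -> one-hot -> -sum(...) pattern used in BERT's classifier."""
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    one_hot = np.eye(num_labels)[labels]
    # Per-example NLL: pick out the log-probability of the true class.
    per_example_loss = -(one_hot * log_probs).sum(axis=-1)
    return per_example_loss.mean()

logits = np.array([[2.0, 0.5],    # example 1, true class 0
                   [0.1, 3.0]])   # example 2, true class 1
labels = np.array([0, 1])
loss = classifier_loss(logits, labels, num_labels=2)
```

Note the convexity point from the text: with log-probabilities this loss is convex in the logits, whereas squaring the softmax outputs would not be.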

The model has two losses: one for the Masked Language Model task and one for Next Sentence Prediction.
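During pre-training the two losses are simply summed into a single training objective. A minimal sketch, using made-up logits over a tiny vocabulary for one masked position and binary logits for the IsNext decision:

```python
import numpy as np

def nll(logits, label):
    """Per-example negative log-likelihood from raw logits."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

# Hypothetical numbers for illustration only.
mlm_loss = nll(np.array([0.2, 2.5, -1.0]), label=1)  # Masked LM loss
nsp_loss = nll(np.array([1.8, -0.4]), label=0)       # Next Sentence Prediction loss
total_loss = mlm_loss + nsp_loss                     # pre-training sums the two
```

In the real model the MLM loss is averaged over all masked positions in the batch before being added to the NSP loss.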

 


Origin: blog.csdn.net/katrina1rani/article/details/108759047