Posting the links first; I'll sort them out later when I have time...
Reference links:
Notes and thoughts on some problems with the BERT model
How to evaluate the BERT model?
Transformer questions, organized (drawing on content from Zhihu)
attention: https://zhuanlan.zhihu.com/p/43493999
attention code: https://zhuanlan.zhihu.com/p/27769667
attention principles: https://www.zhihu.com/question/68482809
https://zhuanlan.zhihu.com/p/31547842 √
attention + transformer: https://zhuanlan.zhihu.com/p/53682800
BERT development history: https://blog.csdn.net/jiaowoshouzi/article/details/89073944
BERT principles: https://www.cnblogs.com/huangyc/p/9898852.html and https://blog.csdn.net/u012526436/article/details/87637150
Things to understand about BERT: https://www.jianshu.com/p/63943ffe2bab
Evolution of BERT pre-training models: http://blog.itpub.net/69942346/viewspace-2658642/
attention: https://zhuanlan.zhihu.com/p/150294471 https://www.zhihu.com/question/68482809 https://blog.csdn.net/guofei_fly/article/details/105516732
soft attention, hard attention, and local attention structures
BERT_MRC: https://blog.csdn.net/eagleuniversityeye/article/details/109601547
Loss function
The loss function of the classification model in the official BERT code is the negative log likelihood, which is minimized (equivalent to maximizing the log likelihood). Its mathematical expression is:

loss = -(1/N) * Σ_i log p(y_i | x_i)
As for why this loss function is used: in practice, classification problems are usually solved with logistic-regression-style models, and when logistic regression is paired with the squared loss, the loss function is no longer convex in the parameters. So it is not that squared loss can never be used for classification; rather, logistic regression does not use squared loss. In the code, log_probs takes the logarithm of the predicted probabilities, so the squared loss is not used and the negative log likelihood loss is used instead. Refer to the reference links above.
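As a minimal sketch (not the official BERT implementation, which is in TensorFlow), the negative log likelihood described above can be computed from raw logits like this; the function name and shapes are illustrative assumptions:

```python
import numpy as np

# Minimal sketch of the negative log likelihood loss for classification:
# log-softmax the logits, then take the negative log probability of the
# true class for each example and average over the batch.
def nll_loss(logits, labels):
    # logits: (batch, num_classes); labels: (batch,) integer class ids
    logits = logits - logits.max(axis=-1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    one_hot = np.eye(logits.shape[-1])[labels]
    per_example_loss = -(one_hot * log_probs).sum(axis=-1)      # -log p(y_i | x_i)
    return per_example_loss.mean()                               # value to minimize

# Tiny example: 2 samples, 3 classes
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.5,  0.3]])
labels = np.array([0, 2])
print(nll_loss(logits, labels))
```

Minimizing this quantity is exactly maximizing the log likelihood of the true labels, which is why the two statements above are equivalent.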
The pre-training model has two losses: one from the Masked Language Model task and the other from Next Sentence Prediction.
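A hedged sketch of how the two pre-training losses combine, assuming log probabilities have already been computed; all function and argument names here are illustrative, not the official code:

```python
import numpy as np

def bert_pretrain_loss(mlm_log_probs, mlm_labels, mlm_mask,
                       nsp_log_probs, nsp_labels):
    """Total pre-training loss = MLM loss + NSP loss (illustrative sketch)."""
    # Masked Language Model: negative log likelihood averaged over
    # the masked positions only (mlm_mask is 1 at masked positions).
    vocab_size = mlm_log_probs.shape[-1]
    mlm_one_hot = np.eye(vocab_size)[mlm_labels]                  # (positions, vocab)
    mlm_per_pos = -(mlm_one_hot * mlm_log_probs).sum(axis=-1)
    mlm_loss = (mlm_per_pos * mlm_mask).sum() / max(mlm_mask.sum(), 1e-5)

    # Next Sentence Prediction: binary NLL over IsNext / NotNext.
    nsp_one_hot = np.eye(2)[nsp_labels]                           # (batch, 2)
    nsp_loss = -(nsp_one_hot * nsp_log_probs).sum(axis=-1).mean()

    # The two losses are simply summed during pre-training.
    return mlm_loss + nsp_loss
```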