Disclaimer: the article links below are collected purely as personal study notes.
Basic knowledge
1: Zero-based analysis tutorial [recommended]
https://zhuanlan.zhihu.com/p/609271490
2: Transformer detailed explanation [recommended]
https://wmathor.com/index.php/archives/1438/
3: How to understand transformer from shallow to deep?
https://www.zhihu.com/question/471328838/answer/3011638037
4: Detailed explanation of Transformer model (the most complete version with illustrations) [Recommended]
Detailed explanation of the Transformer model (the most complete version with illustrations) - Zhihu
5: Interpretation of the Transformer model and Attention mechanism in a 10,000-word long article [Recommended]
Common questions
1: Why does the cross-attention in the Transformer decoder use the K and V produced by the encoder?
https://www.zhihu.com/question/458687952
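A minimal NumPy sketch of the point behind this question (my own illustration, not taken from the linked answer): in cross-attention the decoder supplies the queries, while both K and V come from the encoder output, so the result has the decoder's length but draws its content from the source sequence.

```python
import numpy as np

def attention(Q, K, V):
    # scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
enc_out = rng.normal(size=(5, 8))     # encoder output: 5 source tokens, d = 8
dec_hidden = rng.normal(size=(3, 8))  # decoder states: 3 target tokens

# Cross-attention: the decoder *queries* the encoder memory, so the output
# length follows the decoder (3 rows) while the information mixed into each
# row comes from the source sequence via K = V = enc_out.
out = attention(dec_hidden, enc_out, enc_out)
print(out.shape)  # (3, 8)
```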
2: Explanation of Teacher Forcing, Autoregression, and Exposure Bias
Thoughts about Teacher Forcing and Exposure Bias - Zhihu
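A toy sketch of the distinction discussed in the linked post (the miscounting "model" is my own invented example): under teacher forcing each step is conditioned on the ground-truth prefix, while free-running decoding conditions on the model's own outputs, so a single mistake can propagate, which is the root of exposure bias.

```python
def toy_model(prev_token):
    # A deliberately imperfect "model": predicts prev + 1, but miscounts at 3.
    return prev_token + 2 if prev_token == 3 else prev_token + 1

target = [1, 2, 3, 4, 5]

# Teacher forcing: every step sees the *ground-truth* previous token,
# so the error made at input 3 does not contaminate the next input.
tf_preds = [toy_model(t) for t in target[:-1]]

# Free running (autoregressive): every step sees the model's *own*
# previous output, so the error at 3 propagates down the sequence.
fr_preds, prev = [], target[0]
for _ in range(len(target) - 1):
    prev = toy_model(prev)
    fr_preds.append(prev)

print(tf_preds)  # [2, 3, 5, 5] -- one wrong position
print(fr_preds)  # [2, 3, 5, 6] -- the mistake cascades
```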
3: How is the decoder parallelized during training?
A brief analysis of parallel issues during Transformer training - Zhihu
Understanding masked attention in Transformer decoder_Sili LZS's Blog-CSDN Blog
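A minimal NumPy sketch of the mechanism these two posts describe (assumptions mine): during training the whole shifted target sequence is fed in at once, and a lower-triangular causal mask makes each position attend only to its prefix, so one matrix product replaces a sequential loop.

```python
import numpy as np

def causal_mask(n):
    # True where attention is allowed: position i may see positions <= i
    return np.tril(np.ones((n, n), dtype=bool))

def masked_attention(Q, K, V, mask):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(mask, scores, -1e9)  # block future positions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # the whole (shifted) target sequence at once

# All 4 positions are computed in one parallel pass, yet each row only
# saw its own prefix -- equivalent to 4 sequential decoding steps.
out = masked_attention(x, x, x, causal_mask(4))
print(out.shape)  # (4, 8)
```

Note that row 0 can only attend to itself, so its output equals its own value vector, exactly as it would in step-by-step decoding.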
4: During testing or prediction, why does the decoder in the Transformer still need the seq mask?
During testing or prediction, why does the decoder in Transformer still need seq mask? - Zhihu
In-depth understanding of transformer source code_Team Zhao’s blog-CSDN blog