[Deep Learning] Transformer/ViT/Conformer/DSSM Model Structure Analysis

Table of contents

Transformer model

Foreword:

The source of the paper:

Recommended article:

Vision Transformer model

Foreword:

The role of the cls token:

Code analysis:

Recommended article:

Conformer model

Foreword:

The source of the paper:

Recommended article:

Transformer DSSM model

Recommended article:


Transformer model 

Foreword:

Recently, the Transformer has also become very popular in the CV field. The Transformer was published by Google in 2017 (on arXiv under Computation and Language) and was originally proposed for natural language processing, mainly for machine translation. Before that, the models used for such tasks were sequential networks such as the RNN and LSTM, but these models inevitably suffer from a limited memory length, i.e. a limit on how much of the sentence can actually be used, as well as from gradient explosion and gradient vanishing as the sequence length grows. Although the LSTM alleviates these problems to some extent compared with the plain RNN, another serious limitation of RNN-based models is that they cannot be parallelized: to compute the output at time step n, the output at time step n-1 must be computed first, which makes computation extremely inefficient.

In response to these problems, the Google team proposed the Transformer. Today the Transformer is regarded as the fourth major family of basic models after the MLP, CNN, and RNN; perhaps that is what makes "Attention Is All You Need" so valuable. At the core of the Transformer is a model built on the attention mechanism.
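Since the foreword centers on the attention mechanism, here is a minimal sketch of scaled dot-product attention in PyTorch. The function name, tensor shapes, and the toy tensors are illustrative assumptions rather than code from the paper or from any particular library.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Minimal scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.

    q, k, v: tensors of shape (batch, seq_len, d_k); mask (optional) is
    broadcastable to (batch, seq_len, seq_len) with 0 at masked positions.
    """
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5   # (batch, seq, seq)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)        # attention weights over the keys
    return torch.matmul(weights, v)            # weighted sum of the values

# Toy usage: 2 sequences, 5 tokens each, 64-dimensional projections
q = k = v = torch.randn(2, 5, 64)
out = scaled_dot_product_attention(q, k, v)    # -> shape (2, 5, 64)
```

Because every query attends to every key in a single matrix multiplication, all positions of the sequence are processed in parallel, which is exactly the property the RNN-based models above lack.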

The source of the paper:

[1706.03762] Attention Is All You Need (arxiv.org)

Recommended article:

An Illustrated Transformer That Beginners Can Understand (qq.com)

[First released on Bilibili] Detailed Explanation of the Transformer Model (the most complete illustrated version) - 哔哩哔哩 (bilibili.com)

Line-by-line Detailed Explanation of the Transformer Source Code (PyTorch version) - Queen_sy's Blog - CSDN Blog

Detailed Explanation of the Transformer Model - Zhihu (zhihu.com)

Vision Transformer model

Foreword:

Since the rise of deep learning, CNNs have been the mainstream models in the CV field and have achieved good results. In contrast, the Transformer, built on the self-attention structure, shines in the NLP field. Although the Transformer structure has become the standard in NLP, its application in computer vision was still very limited.

ViT (Vision Transformer) is a model proposed by Google in 2020 that directly applies the Transformer to image classification. According to the experiments in the paper, the best model reaches an accuracy of 88.55% on ImageNet-1K (after first being pre-trained on Google's own JFT dataset), showing that the Transformer is indeed effective in the CV field, and the result is quite impressive. ViT pushes NLP and CV toward a unified modeling approach and promotes the development of the multimodal field.
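As a rough illustration of how ViT works (and of the cls token discussed in the next subsection), here is a minimal ViT-style classifier in PyTorch: the image is split into patches, a learnable [cls] token is prepended, positional embeddings are added, and the classification head reads the [cls] output. It uses PyTorch's built-in TransformerEncoder rather than the exact encoder block from the paper, and all dimensions and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT-style model: patchify, prepend a learnable [cls] token,
    add positional embeddings, encode, and classify from the [cls] output."""

    def __init__(self, img_size=224, patch_size=16, dim=192, depth=4, heads=3, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patch embedding via a strided conv: (B, 3, H, W) -> (B, dim, H/p, W/p)
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))           # learnable [cls] token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x).flatten(2).transpose(1, 2)              # (B, N, dim) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)                  # one [cls] token per image
        x = torch.cat([cls, x], dim=1) + self.pos_embed                 # prepend [cls], add positions
        x = self.encoder(x)
        return self.head(x[:, 0])                                       # classify from the [cls] output

logits = TinyViT()(torch.randn(2, 3, 224, 224))                         # -> (2, 1000)
```

The [cls] token carries no image content of its own; it aggregates information from all patch tokens through self-attention, so its final state can serve as a global image representation for classification.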

The role of the cls token:

ViT: What is the role of the cls token in the Vision Transformer? - MengYa_DreamZ's Blog - CSDN Blog

Code analysis:

ViT Code Analysis - Zhihu (zhihu.com)

Building a PyTorch Model from Scratch, Tutorial (3): Building a Transformer Network - CV Technical Guide (Official Account) - CSDN Blog

Recommended article:

Neural Network Study Notes 3 - Transformer, ViT and BoTNet Networks - RanceGru's Blog - CSDN Blog

Neural Network Learning Record 67 - Detailed Explanation of Reproducing the Vision Transformer (ViT) Model in PyTorch - Bubbliiiiing's Blog - CSDN Blog

Frontier Trends | Vision Transformer in the Past Two Years (qq.com)

Conformer model 

Foreword:

Models based on the Transformer and on convolutional neural networks (CNNs) have achieved good results on ASR, better than RNN-based models. The Transformer can capture long-range dependencies and content-based global interactions, while the CNN can exploit local features effectively. This paper therefore combines the Transformer and the CNN to model both the local and the global dependencies of an audio sequence, proposing a convolution-augmented Transformer for speech recognition called Conformer. The model outperforms both the Transformer and the CNN and sets a new SOTA: on the LibriSpeech benchmark, the Conformer reaches a WER of 2.1%/4.3% without a language model and 1.9%/3.9% with an external language model.
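To make the "Transformer + CNN" idea concrete, below is a simplified Conformer-style block in PyTorch following the sandwich structure described in the paper: a half-step feed-forward module, multi-head self-attention, a convolution module, another half-step feed-forward module, and a final layer norm. The relative positional self-attention and dropout used in the paper are omitted here, and all dimensions and kernel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Transpose(nn.Module):
    """Swap the time and channel axes so Conv1d/BatchNorm1d can be used inside nn.Sequential."""
    def forward(self, x):
        return x.transpose(1, 2)

class ConformerBlock(nn.Module):
    """Simplified Conformer block: FFN(1/2) -> MHSA -> Conv module -> FFN(1/2) -> LayerNorm,
    with each sub-module wrapped in a residual connection (pre-norm style)."""

    def __init__(self, dim=144, heads=4, ff_mult=4, conv_kernel=31):
        super().__init__()
        self.ff1 = self._feed_forward(dim, ff_mult)
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Sequential(                       # simplified convolution module
            nn.LayerNorm(dim),
            Transpose(),                                 # (B, T, dim) -> (B, dim, T)
            nn.Conv1d(dim, 2 * dim, kernel_size=1),      # pointwise conv
            nn.GLU(dim=1),                               # gated linear unit
            nn.Conv1d(dim, dim, conv_kernel,
                      padding=conv_kernel // 2, groups=dim),  # depthwise conv (local features)
            nn.BatchNorm1d(dim),
            nn.SiLU(),                                   # Swish activation
            nn.Conv1d(dim, dim, kernel_size=1),          # pointwise conv
            Transpose(),                                 # back to (B, T, dim)
        )
        self.ff2 = self._feed_forward(dim, ff_mult)
        self.final_norm = nn.LayerNorm(dim)

    @staticmethod
    def _feed_forward(dim, mult):
        return nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, mult * dim),
                             nn.SiLU(), nn.Linear(mult * dim, dim))

    def forward(self, x):                                   # x: (B, T, dim) audio frames
        x = x + 0.5 * self.ff1(x)                            # first half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]    # global modeling via self-attention
        x = x + self.conv(x)                                 # local modeling via convolution
        x = x + 0.5 * self.ff2(x)                            # second half-step feed-forward
        return self.final_norm(x)

out = ConformerBlock()(torch.randn(2, 100, 144))             # -> (2, 100, 144)
```

The self-attention sub-module captures the global interactions mentioned above, while the depthwise convolution captures local patterns; stacking such blocks gives the encoder its combined local-and-global modeling ability.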

The source of the paper: 

[2005.08100] Conformer: Convolution-augmented Transformer for Speech Recognition (arxiv.org)

Recommended article:

Conformer Reading Notes - 44070509's Blog - CSDN Blog

Conformer (understanding and analysis as used in WeNet) - 雨雨子speech's Blog - CSDN Blog

Transformer DSSM model

Recommended article:

Illustrated Transformer + DSSM - A Flying Bird's Blog - CSDN Blog


Reprinted from: blog.csdn.net/Next_SummerAgain/article/details/130027901