Understanding Encoder and Decoder Language Models

The advent of the Transformer architecture marked the beginning of the era of modern large language models. Since 2018, new large language models have emerged in rapid succession.


Looking at the LLM evolutionary tree (github.com/Mooler0410/LLMsPracticalGuide), these language models fall into three main categories. The first is "encoder-only" models, which excel at text understanding because information can flow in both directions across the text. The second is "decoder-only" models, which excel at text generation because information can only flow from left to right, generating new tokens autoregressively. The third is "encoder-decoder" models, which combine the two and are used for tasks that require understanding an input and generating an output, such as translation.


Sebastian Raschka, the author of the original article, explains in detail how these three types of language models work. He is an LLM researcher at the AI platform Lightning AI and the author of Machine Learning Q and AI.


(The following content is compiled and published by OneFlow. Please contact us for authorization before reprinting. Original text: https://magazine.sebastianraschka.com/p/understanding-encoder-and-decoder)


Source | Ahead of AI

OneFlow compilation

Translation | Yang Ting, Wan Zilin


I was asked to provide an in-depth introduction to large language model (LLM) terminology and explain some of the more technical terms we now take for granted, including "encoder-style" and "decoder-style" LLMs. What do these terms mean?


Both encoder and decoder architectures use essentially the same self-attention layers to encode word tokens. The difference is that the encoder is designed to learn embeddings that can be used for various predictive modeling tasks (such as classification), while the decoder is designed to generate new text, for example to answer user queries.


1

Original Transformer

 

The original Transformer architecture ("Attention Is All You Need", 2017), developed for English-French and English-German translation, used both an encoder and a decoder, as shown in the figure below.



In the figure above, the input text (that is, the sentence to be translated) is first segmented into individual word tokens, which are encoded through an embedding layer before entering the encoder. A positional encoding vector is added to each word embedding, and the resulting embeddings then pass through a multi-head self-attention layer. That layer is followed by an "Add & normalize" step, which applies layer normalization and adds back the original embedding through a skip connection (also called a residual or shortcut connection). The output then enters a fully connected module, a small multi-layer perceptron consisting of two fully connected layers with a nonlinear activation function between them, followed once more by residual connection and layer normalization before the result is passed to the multi-head attention layer of the decoder module.
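
To make this data flow concrete, below is a minimal PyTorch sketch of a single encoder block (embedding and positional-encoding layers omitted). Names such as TransformerEncoderBlock, d_model, and d_ff are illustrative choices, not code from the paper.

```python
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    """Minimal sketch: self-attention -> Add & Norm -> feed-forward MLP -> Add & Norm."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(          # two fully connected layers with a nonlinearity in between
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                  # x: (batch, seq_len, d_model), embeddings + positional encodings
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)       # skip (residual) connection followed by layer normalization
        x = self.norm2(x + self.ffn(x))    # same pattern around the feed-forward sub-layer
        return x
```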

The overall structure of the decoder part in the figure above is very similar to that of the encoder part. The key difference is their input and output content. The encoder receives the input text for translation, and the decoder is responsible for generating the translated text.


2

Encoder

 

The encoder part of the original Transformer architecture shown above is responsible for understanding and extracting relevant information from the input text. It outputs a continuous representation (embedding) of the input text, which is then passed to the decoder. Finally, the decoder generates translated text (in the target language) based on the continuous representation received from the encoder.

 

Over the years, several encoder-only architectures have been developed based on the encoder module of the original Transformer model. Two of the most representative examples are BERT (Pre-training of Deep Bidirectional Transformers for Language Understanding, 2018) and RoBERTa (A Robustly Optimized BERT Pretraining Approach, 2019).

 

BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only architecture based on the Transformer encoder module. It is pre-trained on large text corpora using masked language modeling (shown in the figure below) and a next-sentence prediction task.


Illustration of the masked language modeling pre-training target used in the BERT-style Transformer.



The main idea of masked language modeling is to randomly mask (or replace) some word tokens in the input sequence and train the model to predict the original masked tokens from the surrounding context.
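
As a rough, illustrative sketch of this idea (not BERT's actual preprocessing pipeline, which also replaces some selected tokens with random or unchanged tokens), the function below masks about 15% of the token IDs in a batch and keeps labels only at the masked positions; mask_token_id and the 15% rate are assumptions.

```python
import torch

def mask_tokens(input_ids, mask_token_id, mask_prob=0.15):
    """Randomly replace ~15% of tokens with [MASK]; loss is computed only on masked positions."""
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < mask_prob   # choose positions to mask at random
    labels[~mask] = -100                             # -100 is PyTorch's default ignore_index for the loss
    masked_inputs = input_ids.clone()
    masked_inputs[mask] = mask_token_id              # replace chosen tokens with the [MASK] id
    return masked_inputs, labels
```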

 

In addition to the masked language modeling objective shown above, the next-sentence prediction task asks the model to predict whether two sentences, presented in a possibly shuffled order, appear in the correct order in the original document. For example, consider two sentences separated by a [SEP] token:


  • [CLS] Toast is a simple yet delicious food [SEP] It’s often served with butter, jam, or honey.

  • [CLS] It’s often served with butter, jam, or honey. [SEP] Toast is a simple yet delicious food.

 

Here, the [CLS] token is a placeholder for the model: it prompts the model to return a True or False label indicating whether the two sentences appear in the correct order.
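
Below is a hypothetical sketch of how such sentence-pair training examples could be assembled, with swapped order serving as the negative case; this is purely illustrative and not BERT's original data pipeline.

```python
import random

def make_nsp_pair(sent_a, sent_b, swap_prob=0.5):
    """Return a '[CLS] ... [SEP] ...' input string and a label: True if the order is correct."""
    if random.random() < swap_prob:
        sent_a, sent_b = sent_b, sent_a   # wrong order -> negative example
        label = False
    else:
        label = True
    return f"[CLS] {sent_a} [SEP] {sent_b}", label

text, label = make_nsp_pair("Toast is a simple yet delicious food.",
                            "It's often served with butter, jam, or honey.")
```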

 

The masked language modeling and next-sentence prediction objectives allow BERT to learn rich contextual representations of input text, which can then be fine-tuned for a variety of downstream tasks such as sentiment analysis, question answering, and named entity recognition.

 

RoBERTa (Robustly optimized BERT approach) is an optimized version of BERT. It keeps the same overall architecture but improves the training and optimization recipe, using larger batch sizes and more training data and removing the next-sentence prediction task. These changes give RoBERTa noticeably better performance than BERT on a variety of natural language understanding tasks.


3

Decoder

 

Returning to the original Transformer architecture introduced at the beginning of this section: the multi-head self-attention mechanism in the decoder is similar to the one in the encoder, but it is masked to prevent the model from attending to future positions, ensuring that the prediction for position i depends only on the known outputs at positions less than i. The figure below shows the decoder generating the output word by word.


Schematic diagram of the decoder generating the output one word at a time in the original Transformer.


This masking (drawn explicitly in the figure above, although it actually happens inside the decoder's multi-head self-attention mechanism) is critical to preserving the autoregressive property of the Transformer during training and inference. The autoregressive property ensures that the model generates output tokens one at a time, using the previously generated tokens as context for the next one.
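
A minimal sketch of how such a causal (look-ahead) mask can be built and applied to the attention scores before the softmax; the shapes and names are illustrative.

```python
import torch

def causal_mask(seq_len):
    """Upper-triangular boolean mask: position i may only attend to positions <= i."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

scores = torch.randn(1, 5, 5)                               # (batch, query positions, key positions)
scores = scores.masked_fill(causal_mask(5), float("-inf"))  # future positions get -inf before softmax
weights = torch.softmax(scores, dim=-1)                     # so they receive zero attention weight
```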

 

Over the years, researchers have built on the original encoder-decoder Transformer architecture and developed several decoder-only models that efficiently handle a wide variety of natural language tasks, the most famous being the GPT (Generative Pre-trained Transformer) series.

 

The GPT models are decoder-only models pre-trained on large-scale unsupervised text data and then fine-tuned for specific tasks such as text classification, sentiment analysis, question answering, and summarization. The series includes GPT-2, GPT-3 (released in 2020 with few-shot learning capabilities), and the more recent GPT-4. These models have demonstrated excellent performance on a wide range of benchmarks and are currently the most popular architecture for natural language processing.


One of the most striking properties of GPT models is their emergent behavior: capabilities and skills that the model develops during next-word-prediction pre-training without being explicitly trained for them. Although these models are trained only to predict the next word, the pre-trained models can perform tasks such as summarization, translation, question answering, and classification. In addition, they can learn new tasks from examples given in the context (in-context learning) without any update to the model parameters.
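
As an illustration of in-context (few-shot) learning, a handful of solved examples can simply be placed in the prompt, and a decoder-only model is expected to continue the pattern; the prompt below is invented purely for illustration.

```python
prompt = (
    "Classify the sentiment of each review as positive or negative.\n"
    "Review: The food was wonderful. Sentiment: positive\n"
    "Review: I will never come back here. Sentiment: negative\n"
    "Review: The service exceeded my expectations. Sentiment:"
)
# A decoder-only model completes the text, e.g. " positive", with no parameter updates.
```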


4

Encoder-Decoder Hybrid Models



 

In addition to the traditional encoder-only and decoder-only architectures, new encoder-decoder models have been developed that leverage the strengths of both components. These models often incorporate novel techniques, pre-training objectives, or architectural modifications to improve performance on a variety of natural language processing tasks. Some notable encoder-decoder models include:

 

  • BART (Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, released in 2019)

 

  • T5 (Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, released in 2019).

 

Encoder-decoder models are typically used for natural language processing tasks that involve understanding an input sequence and generating a corresponding output sequence, where the two sequences often differ in length and structure. Such models excel at tasks that require complex mappings and capturing the relationships between the elements of the input and output sequences, and they are commonly applied to tasks such as text translation and summarization.
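
As a toy sketch of this setup, PyTorch's built-in nn.Transformer can map a source sequence to a target sequence of a different length; the random tensors below stand in for embedded token sequences, and the layer sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# Toy encoder-decoder: source and target sequences may have different lengths.
model = nn.Transformer(d_model=512, nhead=8, num_encoder_layers=2,
                       num_decoder_layers=2, batch_first=True)
src = torch.randn(1, 10, 512)   # e.g. embedded source sentence, 10 tokens
tgt = torch.randn(1, 7, 512)    # e.g. embedded (shifted) target sentence, 7 tokens
tgt_mask = model.generate_square_subsequent_mask(7)   # causal mask for the decoder side
out = model(src, tgt, tgt_mask=tgt_mask)              # (1, 7, 512): one representation per target position
```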


5

Terminology and Jargon


All of these models, whether encoder-only, decoder-only, or encoder-decoder, are sequence-to-sequence models (often abbreviated "seq2seq"). It is worth noting that although we refer to BERT-style models as encoder-only, the description can be misleading, because these models also decode embeddings into output tokens or text during pre-training.

 

In other words, both encoder-only and decoder-only architectures perform "decoding" of some kind. However, unlike decoder-only and encoder-decoder architectures, encoder-only architectures do not decode autoregressively. Autoregressive decoding generates an output sequence one token at a time, with each token conditioned on the previously generated tokens. Encoder-only models do not generate coherent output sequences in this way; instead, they focus on understanding the input text and producing task-specific outputs, such as label predictions or token-level predictions.
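
Below is a minimal sketch of greedy autoregressive decoding, assuming a hypothetical model that maps a sequence of token IDs to next-token logits; it illustrates the token-by-token loop rather than any particular library's API.

```python
import torch

@torch.no_grad()
def greedy_generate(model, input_ids, max_new_tokens=20, eos_id=None):
    """Generate tokens one at a time, feeding each prediction back in as context."""
    tokens = input_ids
    for _ in range(max_new_tokens):
        logits = model(tokens)                                   # (batch, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
        tokens = torch.cat([tokens, next_id], dim=1)             # append and continue
        if eos_id is not None and (next_id == eos_id).all():
            break
    return tokens
```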


6

Conclusion


In short, encoder-only models are popular for learning embedding representations for classification tasks, encoder-decoder models are used for generative tasks whose output depends on an input (such as translation and summarization), and decoder-only models are used for other kinds of generative tasks, including question answering.


Since the advent of the first Transformer architecture, hundreds of encoder-only, decoder-only, and encoder-decoder hybrid models have been developed. An overview is shown in the figure below:


Some of the most popular large-scale language Transformers by architecture type and developer.


Although encoder-only models have gradually become less popular, decoder-only models such as GPT-3, ChatGPT, and GPT-4 have made significant breakthroughs in text generation and are now widely used. Nevertheless, encoder-only models remain very useful for training predictive models on top of text embeddings, which offer advantages that generated text does not.

