Generating text with transformers

So far, you've seen a high-level overview of some of the major components inside the transformer architecture. But you haven't seen how the overall prediction process works from start to finish. Let's walk through a simple example. In this example, you'll look at a translation task, or sequence-to-sequence task, which was the original objective of the transformer architecture's designers.

You will use a transformer model to translate the French phrase "J'aime l'apprentissage automatique" into English.

First, you'll tokenize the input words using the same tokenizer that was used to train the network.
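To make tokenization concrete, here is a toy word-level tokenizer. This is a sketch only: real tokenizers are usually subword-based (e.g. BPE), and the vocabulary below is invented for illustration.

```python
# Toy word-level tokenizer: an illustrative sketch, not the tokenizer of
# any real model. The vocabulary and its IDs are invented.
VOCAB = {"<sos>": 0, "<eos>": 1, "J'aime": 2, "l'apprentissage": 3, "automatique": 4}
ID_TO_WORD = {i: w for w, i in VOCAB.items()}

def tokenize(text):
    """Map each whitespace-separated word to its integer token ID."""
    return [VOCAB[word] for word in text.split()]

def detokenize(ids):
    """Map token IDs back to words."""
    return " ".join(ID_TO_WORD[i] for i in ids)

tokens = tokenize("J'aime l'apprentissage automatique")
print(tokens)  # [2, 3, 4]
```

The same mapping is used in reverse at the very end of the process, when the output token IDs are turned back into words.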

These tokens are then passed to the input on the encoder side of the network,

through the embedding layer, and then fed into the multi-head attention layer.
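The core computation inside each attention head is scaled dot-product attention. Here is a minimal pure-Python sketch for a single query vector; real implementations are batched tensor operations, and the tiny vectors below are invented.

```python
# Scaled dot-product attention for one query: a minimal sketch.
# score_i = (query . key_i) / sqrt(d_k); output = sum_i softmax(score)_i * value_i
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    """One query attending over a sequence of key/value vectors."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    weights = softmax(scores)  # attention weights sum to 1
    # Output is the attention-weighted average of the value vectors.
    out = [sum(w * v[j] for w, v in zip(weights, values))
           for j in range(len(values[0]))]
    return out, weights

# Query aligned with the first key -> most weight on the first value.
out, weights = attention([1.0, 0.0],
                         [[1.0, 0.0], [0.0, 1.0]],
                         [[10.0, 0.0], [0.0, 10.0]])
```

A multi-head layer simply runs several of these computations in parallel with different learned projections and concatenates the results.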

The output of the multi-head attention layer is passed through a feed-forward network to the output of the encoder.

At this point, the data leaving the encoder is a deep representation of the structure and meaning of the input sequence. This representation is inserted in the middle of the decoder to influence the decoder's self-attention mechanism.

Next, a sequence start token is added to the input of the decoder.

This triggers the decoder to predict the next token, which it does based on the contextual understanding provided by the encoder.

The output of the decoder's self-attention layer passes through the decoder's feed-forward network and a final softmax output layer.

At this point, we have our first token.
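Here is a sketch of how that final softmax layer yields a token. The logits and tiny vocabulary are invented; a real model scores tens of thousands of vocabulary entries at once.

```python
# From logits to the first output token: an illustrative sketch.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["I", "love", "machine", "learning", "<eos>"]
logits = [3.2, 0.1, -1.0, 0.4, -2.0]   # one score per vocabulary entry (invented)
probs = softmax(logits)                # probabilities summing to 1

# Greedy decoding: take the most probable token as the output token.
first_token = vocab[probs.index(max(probs))]
print(first_token)  # I
```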

You would continue this loop, passing the output token back to the input to trigger the generation of the next token,

until the model predicts a sequence end token.
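The loop just described can be sketched as follows. A lookup table stands in for the real decoder, which would run a full forward pass (with encoder context) at every step; the table and tokens are invented for illustration.

```python
# Autoregressive decoding loop: a sketch with a stand-in "decoder".
def fake_decoder(tokens):
    """Stand-in for the decoder: returns the next token given the
    sequence so far. A real decoder would compute this from the
    encoder's contextual representation."""
    next_token = {"<sos>": "I", "I": "love", "love": "machine",
                  "machine": "learning", "learning": "<eos>"}
    return next_token[tokens[-1]]

def generate(max_steps=10):
    tokens = ["<sos>"]              # sequence-start token triggers decoding
    for _ in range(max_steps):
        nxt = fake_decoder(tokens)
        tokens.append(nxt)          # feed the new token back as input
        if nxt == "<eos>":          # stop at the sequence-end token
            break
    return tokens

print(generate())  # ['<sos>', 'I', 'love', 'machine', 'learning', '<eos>']
```

The `max_steps` cap mirrors the practical limit models place on output length even when no end token is predicted.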

At this point, the final sequence of tokens can be detokenized into words, and you have your output. In this case: I love machine learning.

There are various ways to use the output of the softmax layer to predict the next token. These can affect the creativity of the text you generate. You'll learn about these in more detail later this week.
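Two of those strategies can be sketched quickly: greedy decoding (always pick the most likely token) and temperature sampling (pick randomly, weighted by probability). The vocabulary and logits below are invented, and this is only a sketch of the idea.

```python
# Greedy decoding vs. temperature sampling: an illustrative sketch.
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["cat", "dog", "bird"]
logits = [2.0, 1.0, 0.1]           # invented scores for illustration

# Greedy: deterministic, always the top-scoring token.
greedy = vocab[logits.index(max(logits))]

# Sampling with temperature: higher temperature flattens the
# distribution, making less likely tokens more probable ("creative").
random.seed(0)                     # seeded only to make the sketch reproducible
probs = softmax(logits, temperature=1.5)
sampled = random.choices(vocab, weights=probs, k=1)[0]
```

You'll see these settings (temperature, top-k, top-p) in detail later this week.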

Let's summarize what you've seen so far. A complete Transformers architecture consists of encoder and decoder components. An encoder encodes an input sequence into a deep representation of the structure and meaning of the input. The decoder works from an input token trigger, using the encoder's contextual understanding to generate new tokens. It keeps doing this until a certain stop condition is reached.

While the translation example you explored here used both the encoder and decoder parts of the transformer, you can split these components apart for variations of the architecture.

Encoder-only models can also work as sequence-to-sequence models, but without further modification, the input and output sequences are the same length. Their use is less common these days, but by adding extra layers to the architecture, you can train encoder-only models to perform classification tasks such as sentiment analysis. BERT is an example of an encoder-only model.

Encoder-decoder models, as you can see, perform well for sequence-to-sequence tasks such as translation, where the input and output sequences can be of different lengths. You can also extend and train this type of model to perform general text generation tasks. Examples of encoder-decoder models include BART (as opposed to BERT) and T5, which is the model you will use in the labs in this course.

Finally, decoder-only models are the most commonly used today. As they have scaled, their capabilities have grown, and these models can now generalize to most tasks. Popular decoder-only models include the GPT family of models, BLOOM, Jurassic, LLaMA, and many more. You'll learn more about the different kinds of transformers and how they are trained later this week. That's quite a lot to take in.

The main goal of this overview of Transformers models is to give you enough background to understand the differences between the various models used in the world and to be able to read the model documentation.

I want to stress that you don't need to worry about remembering all the details you see here, as you can come back to this explanation as many times as you want.

Remember that you'll be interacting with Transformers models through natural language, creating prompts using written words rather than code.

You don't need to know all the details of the underlying architecture to do this. This is known as prompt engineering, and it's what you'll explore in the next part of this course. Let's move on to the next video to learn more.

Reference

https://www.coursera.org/learn/generative-ai-with-llms/lecture/R0xbD/generating-text-with-transformers
