LLM Architecture: The Transformer and Self-Attention (Attention Is All You Need)

Building large language models with the Transformer architecture dramatically improved performance on natural language tasks over the earlier generation of RNNs, and led to an explosion in generative capability.

The power of the Transformer architecture lies in its ability to learn the relevance and context of all of the words in a sentence: not just each word's relationship to its immediate neighbors, as you see here, but its relationship to every other word in the sentence. The model applies attention weights to those relationships, so that it learns how each word relates to every other word in the input, no matter where they are.

This enables the algorithm to learn who owns the book, who is likely to own it, and whether it is relevant to the wider context of the document. These attention weights are learned during LLM training, which you will learn more about later this week.

This diagram is called an attention map, and it can be used to illustrate the attention weights between each word and every other word. In this stylized example, you can see that the word "book" is strongly connected to, or paying attention to, the words "teacher" and "student".
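If you want to plot an attention map yourself, a heatmap is the usual way to do it. Here is a minimal sketch using matplotlib; the tokens and attention weights below are made up purely for illustration and do not come from any real model.

```python
import numpy as np
import matplotlib.pyplot as plt

tokens = ["the", "teacher", "taught", "the", "student"]
# Made-up attention weights: one row per query token, each row sums to 1.
weights = np.array([
    [0.60, 0.15, 0.10, 0.10, 0.05],
    [0.10, 0.50, 0.20, 0.05, 0.15],
    [0.05, 0.30, 0.35, 0.05, 0.25],
    [0.55, 0.10, 0.10, 0.20, 0.05],
    [0.05, 0.30, 0.25, 0.05, 0.35],
])

fig, ax = plt.subplots()
ax.imshow(weights, cmap="viridis")
ax.set_xticks(range(len(tokens)))
ax.set_xticklabels(tokens)
ax.set_yticks(range(len(tokens)))
ax.set_yticklabels(tokens)
ax.set_xlabel("token attended to")
ax.set_ylabel("query token")
ax.set_title("Stylized attention map")
plt.show()
```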

This is called self-attention, and this ability to learn attention across the entire input significantly improves the model's ability to encode language.

Now that you've seen a key attribute of the Transformer architecture, self-attention, let's take a high-level look at how the model works. Here is a simplified diagram of the Transformer architecture so that you can focus, at a high level, on where these processes take place. The Transformer architecture is split into two distinct parts, the encoder and the decoder.

These components work in conjunction with each other, and they share a number of similarities. Note also that the diagram you see here is derived from the original "Attention Is All You Need" paper. Notice that the inputs to the model are at the bottom and the outputs are at the top; where possible, we will try to keep this convention throughout the course.
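As a rough, hands-on counterpart to that diagram, the sketch below instantiates PyTorch's built-in `nn.Transformer` module with the default sizes from the original paper. Treat it only as a stand-in for the architecture being described: it operates on already-embedded vectors rather than raw text, and real LLMs differ in size and in whether they use the encoder, the decoder, or both.

```python
import torch
import torch.nn as nn

# PyTorch's built-in module mirrors the encoder-decoder layout of the original paper:
# 6 encoder layers, 6 decoder layers, model dimension 512, 8 attention heads.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)

src = torch.randn(1, 10, 512)  # already-embedded input sequence for the encoder
tgt = torch.randn(1, 7, 512)   # already-embedded (shifted) output sequence for the decoder
out = model(src, tgt)          # shape (1, 7, 512)
print(out.shape)
```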

Now, machine learning models are just big statistical calculators, and they work with numbers, not words. Therefore, you must first tokenize the words before passing the text to the model for processing. Simply put, tokenization converts the words into numbers, with each number representing a position in a dictionary of all the possible words the model can work with. You can choose from several tokenization methods.

For example, token IDs could match complete words,

or token IDs could represent parts of words.
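To make the difference concrete, here is a toy Python sketch. Both vocabularies and the greedy `subword_tokenize` helper are invented for illustration; real tokenizers such as BPE or WordPiece are trained on large corpora and behave differently in detail.

```python
# Toy illustration of word-level vs. subword token IDs.
# Both vocabularies below are invented purely for this example.
word_vocab = {"the": 100, "teacher": 221, "taught": 332, "student": 415}

def word_tokenize(text):
    # One token ID per complete word.
    return [word_vocab[w] for w in text.lower().split()]

subword_vocab = {"the": 10, "teach": 50, "##er": 51, "taught": 60, "student": 70}

def subword_tokenize(word):
    # Greedy longest-prefix match: split a word into known pieces.
    pieces, rest = [], word
    while rest:
        for end in range(len(rest), 0, -1):
            candidate = rest[:end] if not pieces else "##" + rest[:end]
            if candidate in subword_vocab:
                pieces.append(subword_vocab[candidate])
                rest = rest[end:]
                break
        else:
            raise ValueError(f"cannot tokenize {word!r}")
    return pieces

print(word_tokenize("The teacher taught the student"))  # [100, 221, 332, 100, 415]
print(subword_tokenize("teacher"))                      # [50, 51] -> "teach" + "##er"
```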

Importantly, once you've chosen a tokenizer to train the model, you must use the same tokenizer when you generate text. Now that your input is represented as numbers, you can pass it to the embedding layer. This layer is a trainable vector embedding space: a high-dimensional space where each token is represented as a vector and occupies a unique location within that space.

Each token ID in the vocabulary is matched to a multi-dimensional vector, and the intuition is that these vectors learn to encode the meaning and context of individual tokens in the input sequence. Embedding vector spaces have been used in natural language processing for some time; previous-generation language algorithms such as Word2vec used this concept. If you're not familiar with it, don't worry. You'll see examples of it throughout the course, and there are links to additional resources in this week's reading exercises.

Looking back at the sample sequence, you can see that in this simple case each word has been matched to a token ID, and each token is mapped to a vector. In the original Transformer paper, the vector size was actually 512, so much bigger than we can fit onto this image.
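A minimal sketch of such an embedding layer, assuming PyTorch: the vocabulary size below is made up, and the vectors are random until the layer is trained, but the 512-dimensional size matches the figure mentioned above.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10_000, 512              # d_model = 512, as in the original Transformer paper
embedding = nn.Embedding(vocab_size, d_model)  # a trainable lookup table: one vector per token ID

token_ids = torch.tensor([[100, 221, 332, 100, 415]])  # a batch containing one tokenized sentence
vectors = embedding(token_ids)                         # shape (1, 5, 512): one 512-d vector per token
print(vectors.shape)
```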

For simplicity, if you imagine a vector size of just three, you could plot the words into a three-dimensional space and see the relationships between those words. You can see now how words that are located close to each other in the embedding space are related,

and how you can calculate the distance between the words as an angle.
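One common way to measure that angle is cosine similarity. In the sketch below, the three-dimensional vectors are invented stand-ins for learned embeddings; only the comparison itself is the point.

```python
import torch
import torch.nn.functional as F

# Invented 3-D vectors standing in for learned word embeddings.
book   = torch.tensor([0.9, 0.2, 0.1])
novel  = torch.tensor([0.8, 0.3, 0.1])
banana = torch.tensor([0.1, 0.9, 0.4])

# Cosine similarity is the cosine of the angle between two vectors:
# values near 1.0 mean a small angle (closely related words), lower values mean a larger angle.
print(F.cosine_similarity(book, novel, dim=0))   # high
print(F.cosine_similarity(book, banana, dim=0))  # lower
```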

This gives the model the ability to understand language mathematically. When you add a token vector to the base of an encoder or decoder, you also add positional encoding.

The model processes each of the input tokens in parallel, so by adding positional encoding, you preserve the information about word order and don't lose the relevance of the position of the word in the sentence. Once you've summed the input tokens and the positional encodings, you pass the resulting vectors to the self-attention layer.
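The original paper used fixed sinusoidal positional encodings, which is what the sketch below implements; many later models learn their positional embeddings instead, so treat this as one possible scheme rather than the only one.

```python
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)           # even dimension indices
    angles = pos / (10000 ** (i / d_model))                        # (seq_len, d_model / 2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

token_vectors = torch.randn(5, 512)                         # stand-in for 5 embedded tokens
x = token_vectors + sinusoidal_positional_encoding(5, 512)  # added element-wise before self-attention
print(x.shape)
```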


Here, the model analyzes the relationship between tokens in the input sequence. As you saw earlier, this enables the model to focus on different parts of the input sequence to better capture contextual dependencies between words. The self-attention weights learned during training and stored in these layers reflect the importance of each word in the input sequence compared to all other words in the sequence.
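At its core, this is scaled dot-product attention. The sketch below shows the calculation, assuming the queries, keys, and values have already been produced by learned linear projections of the embedded input; random tensors stand in for them here.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # Attention(Q, K, V) = softmax(Q @ K^T / sqrt(d_k)) @ V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # how strongly each token attends to every other token
    weights = F.softmax(scores, dim=-1)                # attention weights; each row sums to 1
    return weights @ v, weights

# Random stand-ins for the projected queries, keys, and values of a 5-token sequence.
q = k = v = torch.randn(5, 64)
output, attn_weights = scaled_dot_product_attention(q, k, v)
print(attn_weights.shape)  # (5, 5): one weight for every pair of tokens
```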


But this doesn't happen just once; the Transformer architecture actually has multi-headed self-attention. This means that multiple sets of self-attention weights, or heads, are learned in parallel, independently of each other. The number of attention heads included in an attention layer varies from model to model, but numbers in the range of 12 to 100 are common.
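For a concrete, if simplified, example, PyTorch ships an `nn.MultiheadAttention` module. The sketch below uses 8 heads, the number from the original paper, rather than the larger head counts mentioned above.

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8   # 8 heads is the number used in the original paper
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

x = torch.randn(1, 5, d_model)   # one sequence of 5 embedded tokens
out, weights = mha(x, x, x)      # self-attention: queries, keys, and values all come from x
print(out.shape, weights.shape)  # (1, 5, 512) and (1, 5, 5); weights are averaged over heads by default
```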

The intuition here is that each self-attention head will learn a different aspect of language. For example, one head may see the relationship between the people entities in our sentence.

And another head might focus on the activity of the sentence.

While another head might focus on other attributes, such as whether a word rhymes or not.

It's important to note that you don't dictate ahead of time which aspects of language the attention heads will learn. The weights of each head are randomly initialized, and given sufficient training data and time, each head will learn a different aspect of language. While some attention maps are easy to interpret, like the examples discussed here, others may not be.

Now that all attention weights have been applied to your input data, the output is processed through a fully connected feed-forward network.

The output of this layer is a vector of logits proportional to the probability score of each token in the tokenizer dictionary.
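Here is a rough sketch of these last two steps, assuming PyTorch and made-up sizes: the feed-forward width of 2048 is the value used in the original paper, while the vocabulary size is arbitrary.

```python
import torch
import torch.nn as nn

d_model, d_ff, vocab_size = 512, 2048, 10_000  # d_ff = 2048 in the original paper; vocab size is made up

# Position-wise feed-forward network, applied to each token's vector independently.
feed_forward = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)
# Final projection from the model dimension to one logit per token in the vocabulary.
to_logits = nn.Linear(d_model, vocab_size)

x = torch.randn(1, 5, d_model)       # attention output for a 5-token sequence
logits = to_logits(feed_forward(x))  # shape (1, 5, 10000): one score per vocabulary entry per position
print(logits.shape)
```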

You can then pass these logits to a final softmax layer where they are normalized to a probability score for each word. This output includes probabilities for each word in the vocabulary, so there could be thousands of scores here.

A single token will have a score higher than all of the rest; this is the most likely predicted token. However, as you'll see later in the course, there are a number of methods you can use to vary the final selection from this vector of probabilities.
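As a small illustration of two such methods, greedy decoding and random sampling, here is a Python sketch; the logits are random stand-ins for real model output.

```python
import torch
import torch.nn.functional as F

vocab_size = 10_000
logits = torch.randn(vocab_size)   # random stand-in for the logits of the final sequence position

probs = F.softmax(logits, dim=-1)  # normalize into a probability for every token in the vocabulary

greedy_token = torch.argmax(probs)                       # greedy decoding: take the single most likely token
sampled_token = torch.multinomial(probs, num_samples=1)  # or sample from the distribution instead
print(greedy_token.item(), sampled_token.item())
```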

Reference

https://www.coursera.org/learn/generative-ai-with-llms/lecture/3AqWI/transformers-architecture
