Pre-training large language models

In the previous video, you were introduced to the lifecycle of a generative AI project.

As you can see, there are several steps to complete before you can start the fun parts of your generative AI application. Once you have scoped your use case and determined how you need the LLM to work in your application, your next step is to choose a model to use.

Your first choice will be to use an existing model or train your own from scratch. There are certain situations in which it may be advantageous to train your own model from scratch, which you will learn about later in this course.

Typically, however, you will use an existing base model to start developing your application. Many open-source models are available for AI community members like you to use in your applications. The developers of major frameworks for building generative AI applications, such as Hugging Face and PyTorch, have curated hubs where you can browse these models.

A very useful feature of these hubs is the inclusion of model cards, which describe important details such as each model's best use cases, how it was trained, and its known limitations. You'll find links to these model hubs in the readings at the end of the week.
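
For example, here is a minimal sketch of pulling an existing base model and its tokenizer from the Hugging Face Hub, assuming the transformers library is installed. The checkpoint name is just an illustrative choice; in practice you would pick one based on its model card.

```python
# A minimal sketch: load an existing base model and its tokenizer from the
# Hugging Face Hub. "bert-base-uncased" is only an illustrative choice; browse
# the hub and read the model card to pick one that fits your use case.
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

print(model.config.model_type)   # e.g. "bert"
print(model.num_parameters())    # rough sense of the model's size
```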

The exact model you choose will depend on the specifics of the tasks you need to perform. Variants of the Transformer architecture are suited to different language tasks, largely because of differences in how the models are trained. To help you better understand these differences and develop intuition about which model to use for a particular task, let's take a closer look at how large language models are trained. Armed with this knowledge, it will be easier for you to browse the model hubs and find the best model for your use case.

First, let's look at the initial training process of LLMs at a high level. This stage is often called pre-training.

As you saw in Lesson 1, LLMs encode a deep statistical representation of language. This understanding is developed during the pre-training phase of the model, when the model learns from large amounts of unstructured text data. This can be gigabytes, terabytes, or even petabytes of unstructured text. This data comes from a number of sources, including data scraped from the internet and text corpora specially assembled for training language models.

During this self-supervised learning step, the model internalizes the patterns and structures present in the language. These patterns then enable the model to accomplish its training objective, which depends on the model's architecture, as you will see shortly. During pre-training, the model weights are updated to minimize the loss of the training objective. The encoder generates an embedding or vector representation for each token. Pre-training also requires a large amount of compute and the use of GPUs.
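
As a rough illustration of what "updating the weights to minimize the loss" looks like in code, here is a simplified sketch of a single self-supervised training step, assuming PyTorch and a small causal language model from the Hugging Face transformers library. Real pre-training runs differ enormously in scale and tooling; this is only meant to show the mechanics.

```python
# Simplified sketch of one self-supervised pre-training step. The model's own
# tokens serve as labels, so no human annotation is required. "gpt2" and the
# example text are illustrative choices, not part of the lecture.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

batch = tokenizer("Unstructured text scraped from the internet ...",
                  return_tensors="pt")
outputs = model(**batch, labels=batch["input_ids"])  # next-token prediction loss
outputs.loss.backward()   # gradients of the training objective
optimizer.step()          # update the weights to reduce the loss
optimizer.zero_grad()
```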

Note that when you scrape training data from publicly available sources such as the internet, you often need to process the data to improve quality, address bias, and remove other harmful content. As a result of this data-quality curation, typically only 1-3% of the tokens you collect end up being used for pre-training. You should take this into account when estimating how much data you need to collect if you decide to pre-train your own model.
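
As a quick back-of-the-envelope sketch of that estimate: only the 1-3% yield figure comes from the paragraph above, and the target token count below is a made-up example.

```python
# Rough estimate of how much raw scraped text you would need if only 1-3% of
# tokens survive quality filtering. The 1 trillion token target is illustrative.
target_training_tokens = 1e12          # tokens you want after curation
for usable_fraction in (0.01, 0.03):   # 1% and 3% yield
    raw_tokens_needed = target_training_tokens / usable_fraction
    print(f"yield {usable_fraction:.0%}: collect ~{raw_tokens_needed:.1e} raw tokens")
```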

Earlier this week, you saw that Transformer models come in three variants: encoder-only, encoder-decoder, and decoder-only.

Each is trained on a different objective and thus learns to perform different tasks.

Encoder-only models are also known as autoencoding models, and they are pre-trained using masked language modeling.

Here, tokens in the input sequence are randomly masked, and the training objective is to predict the masked tokens in order to reconstruct the original sentence.

This is also known as a denoising objective.

Autoencoding models produce a bidirectional representation of the input sequence, meaning the model has knowledge of the entire context of a token, not just the words that come before it. Encoder-only models are well suited to tasks that benefit from this bidirectional context.

You can use them to perform sentence classification tasks such as sentiment analysis or token-level tasks such as named entity recognition or word classification. Some well-known examples of autoencoding models are BERT and RoBERTa.
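
As a concrete illustration of masked language modeling at inference time, here is a minimal sketch using the Hugging Face fill-mask pipeline with a BERT checkpoint. The checkpoint name and the example sentence are assumptions for this sketch, not part of the lecture.

```python
# Minimal sketch of masked-token prediction with an autoencoding model.
# The pipeline fills in the [MASK] token using bidirectional context.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The teacher [MASK] the student."):
    print(prediction["token_str"], round(prediction["score"], 3))
```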

Now, let's look at decoder-only or autoregressive models, which are pretrained using causal language modeling. Here, the training goal is to predict the next token based on the previous sequence of tokens.
Predicting the next token is sometimes referred to by researchers as full language modeling. Decoder-based autoregressive models mask the input sequence and can only see the input tokens leading up to the token in question.

The model has no knowledge of the end of the sentence. The model then iterates over the input sequence, one token at a time, to predict the next token.

In contrast to the encoder architecture, this means that the context is unidirectional.

By learning to predict the next token from a large number of examples, the model builds a statistical representation of the language. This type of model uses the decoder component of the original architecture without the encoder.

Decoder-only models are often used for text generation, although larger decoder-only models show strong zero-shot inference capabilities and generally perform well for a range of tasks. GPT and BLOOM are some well-known examples of decoder-based autoregressive models.
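
To make next-token prediction concrete, here is a minimal sketch using a small GPT-2 checkpoint from the Hugging Face transformers library. The checkpoint name and the prompt are illustrative choices for this sketch.

```python
# Minimal sketch of causal language modeling: the model only sees the tokens
# up to the current position and predicts the most likely next token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The Transformer architecture was introduced in",
                      return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(input_ids).logits       # shape: [batch, seq_len, vocab]
next_token_id = logits[0, -1].argmax()     # most likely next token
print(tokenizer.decode(next_token_id.item()))
```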

The final variant of the Transformer model is the sequence-to-sequence model, which uses both the encoder and decoder parts of the original Transformer architecture. The exact details of the pre-training objective vary from model to model. A popular sequence-to-sequence model, T5, pre-trains the encoder using span corruption, which masks random sequences of input tokens. Those masked sequences are then replaced with a unique sentinel token, shown here as an x. Sentinel tokens are special tokens added to the vocabulary that do not correspond to any actual word in the input text.

The decoder is then tasked with reconstructing the masked token sequences auto-regressively. The output is the sentinel token followed by the predicted tokens.

You can use sequence-to-sequence models for translation, summarization, and question answering. They are generally useful whenever you have a body of text as both input and output. Besides T5, which you will use in the labs of this course, another well-known encoder-decoder model is BART.
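
To make the sentinel-token idea concrete, here is a minimal sketch using a small T5 checkpoint from the Hugging Face transformers library. The checkpoint name and the example sentence are illustrative choices; T5 marks masked spans with special tokens such as <extra_id_0>.

```python
# Minimal sketch of T5-style span corruption: masked spans in the encoder
# input are replaced by sentinel tokens, and the decoder reconstructs them.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Encoder input: masked spans replaced by sentinel tokens.
corrupted = "The <extra_id_0> walks in <extra_id_1> park."
input_ids = tokenizer(corrupted, return_tensors="pt").input_ids

# The decoder emits each sentinel token followed by its predicted span.
output_ids = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=False))
```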

In summary, here is a quick comparison of the different model architectures and their pre-training objectives. Autoencoding models are pre-trained using masked language modeling. They correspond to the encoder part of the original Transformer architecture and are often used for sentence classification or token classification.

Autoregressive models are pre-trained using causal language modeling. This type of model uses the decoder component of the original Transformer architecture and is often used for text generation.

Sequence-to-sequence models use both the encoder and decoder parts of the original Transformer architecture. The exact details of the pre-training objective vary from model to model. The T5 model is pre-trained using span corruption. Sequence-to-sequence models are commonly used for translation, summarization, and question answering.
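
If you work with the Hugging Face transformers library, one convenient way to keep this mapping in mind is that each pre-training objective has its own Auto class. The checkpoint names below are just illustrative examples of each architecture.

```python
# Illustrative mapping from architecture / pre-training objective to the
# corresponding transformers Auto class (checkpoint names are examples only).
from transformers import (AutoModelForMaskedLM,    # autoencoding (encoder-only)
                          AutoModelForCausalLM,    # autoregressive (decoder-only)
                          AutoModelForSeq2SeqLM)   # sequence-to-sequence

encoder_only = AutoModelForMaskedLM.from_pretrained("roberta-base")
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
```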

Now that you have seen how these different model architectures are trained and the specific tasks they are suited to, you can choose the type of model that best fits your use case. One more thing to keep in mind is that larger models of any architecture are generally more capable of performing their tasks well. Researchers have found that the larger the model, the more likely it is to work the way you want without additional in-context learning or further training. This observed trend of capability increasing with model size has driven the development of larger and larger models in recent years.

This growth has been driven by inflection points in research, such as the introduction of the highly scalable Transformer architecture, access to massive amounts of data for training, and the development of more powerful compute resources.

This steady increase in model size has actually led some researchers to speculate that there is a new Moore's Law for LLMs. Like them, you might ask, can we just keep adding parameters to increase performance and make the model smarter? What might this model growth lead to?

While this sounds great, it turns out that training these huge models is difficult and so expensive that continuously training bigger and bigger models may not be feasible. Let's take a closer look at some of the challenges associated with training large models in the next video.

Reference

https://www.coursera.org/learn/generative-ai-with-llms/lecture/2T3Au/pre-training-large-language-models
