In the previous video, you were introduced to the lifecycle of a generative AI project.
As you can see, there are several steps to complete before you can start the fun parts of your generative AI application. Once you have scoped your use case and determined how you need the LLM to work in your application, your next step is to choose a model to use.
Your first decision will be whether to use an existing model or to train your own from scratch. There are certain situations in which it may be advantageous to train your own model from scratch, and you will learn about these later in this course.
Typically, however, you will use an existing base model to start developing your application. Many open source models are available for AI community members like you to use in your applications. The developers of some of the major frameworks for building generative AI applications, such as Hugging Face and PyTorch, have curated hubs where you can browse these models.
A very useful feature of these hubs is the inclusion of model cards, which describe important details such as each model's best use cases, how it was trained, and its known limitations. You'll find some links to these model hubs in the readings at the end of the week.
The exact model you choose will depend on the specifics of the tasks you need to perform. Variants of the transformer model architecture are suited to different language tasks, largely because of differences in how the models are trained. To help you better understand these differences, and develop intuition about which model to use for a particular task, let's take a closer look at how large language models are trained. Armed with this knowledge, you will find it easier to browse the model hubs and find the best model for your use case.
First, let's look at the initial training process of LLMs at a high level. This stage is often called pre-training.
As you saw in Lesson 1, LLMs encode a deep statistical representation of language. This understanding is developed during the pre-training phase of the model, when the model learns from large amounts of unstructured text data. This can be gigabytes, terabytes, or even petabytes of unstructured text. This data comes from a number of sources, including data scraped from the internet and text corpora specially assembled for training language models.
During this self-supervised learning step, the model internalizes the patterns and structures present in the language. These patterns then enable the model to complete its training objective, which depends on the model's architecture, as you will see shortly. During pre-training, the model weights are updated to minimize the loss of the training objective. The encoder generates an embedding or vector representation for each token. Pre-training is also computationally intensive and often requires the use of GPUs.
Note that when you scrape training data from public sites on the internet, you often need to process the data to improve quality, address bias, and remove other harmful content. Because of this data quality curation, typically only 1-3% of the tokens you collect end up being used for pre-training. You should take this into account when estimating how much data you need to collect if you decide to pretrain your own model.
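The 1-3% token yield has a direct consequence for data collection, which a rough back-of-the-envelope sketch can make concrete. The function name and figures below are illustrative, not from the lecture:

```python
def raw_tokens_needed(target_tokens, yield_rate):
    """Rough estimate of raw scraped tokens required, given the
    fraction that survives quality filtering (1-3% per the lecture)."""
    return round(target_tokens / yield_rate)

# e.g., to end up with 1 trillion curated training tokens:
target = 1_000_000_000_000
print(raw_tokens_needed(target, 0.01))  # at a 1% yield: 100 trillion raw tokens
print(raw_tokens_needed(target, 0.03))  # at a 3% yield: roughly 33 trillion
```

In other words, the raw corpus you scrape may need to be one to two orders of magnitude larger than the curated dataset you actually train on.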
Earlier this week, you saw that transformer models come in three variants: encoder-only, encoder-decoder, and decoder-only.
Each is trained on a different goal and thus learns to perform different tasks.
Encoder-only models are also known as autoencoding models, and they are pretrained using masked language modeling.
Here, the tokens in the input sequence are randomly masked, and the training goal is to predict the masked tokens to reconstruct the original sentence.
This is also known as a denoising objective.
Autoencoding models produce a bidirectional representation of the input sequence, meaning the model has knowledge of the full context of a token, not just of the words that come before it. Encoder-only models are well suited to tasks that benefit from this bidirectional context.
You can use them to perform sentence classification tasks such as sentiment analysis or token-level tasks such as named entity recognition or word classification. Some well-known examples of autoencoding models are BERT and RoBERTa.
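To make the masked language modeling objective concrete, here is a minimal, framework-free sketch in plain Python. It only illustrates the idea of randomly hiding tokens and keeping the originals as training targets; real implementations such as BERT's operate on subword tokens and apply additional masking rules:

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace tokens with [MASK]; the training targets
    are the original tokens at the masked positions."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok  # the model must reconstruct this token
        else:
            masked.append(tok)
    return masked, targets

tokens = "the teacher teaches the student".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)   # some tokens replaced by [MASK]
print(targets)  # position -> original token to predict
```

Because the model sees the unmasked tokens on both sides of each `[MASK]`, training on this objective naturally produces the bidirectional context described above.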
Now, let's look at decoder-only or autoregressive models, which are pretrained using causal language modeling. Here, the training goal is to predict the next token based on the previous sequence of tokens.
Predicting the next token is sometimes referred to by researchers as full language modeling. Decoder-based autoregressive models mask the input sequence and can only see the input tokens leading up to the token in question.
The model has no knowledge of the end of the sentence. The model then iterates over the input sequence one token at a time to predict the next token.
In contrast to the encoder architecture, this means that the context is unidirectional.
By learning to predict the next token from a large number of examples, the model builds a statistical representation of the language. This type of model uses the decoder component of the original architecture without the encoder.
Decoder-only models are often used for text generation, although larger decoder-only models show strong zero-shot inference capabilities and generally perform well for a range of tasks. GPT and BLOOM are some well-known examples of decoder-based autoregressive models.
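The causal language modeling objective can be sketched in the same toy style: each training example pairs a unidirectional context (only the preceding tokens) with the next token to predict. This is an illustration of the objective, not of how training batches are actually constructed in practice:

```python
def causal_lm_examples(tokens):
    """For causal language modeling, each position's target is the
    next token; the model may only attend to the tokens before it."""
    examples = []
    for i in range(1, len(tokens)):
        context = tokens[:i]  # unidirectional: only preceding tokens
        target = tokens[i]    # the token to predict
        examples.append((context, target))
    return examples

tokens = "the student asks a question".split()
for context, target in causal_lm_examples(tokens):
    print(context, "->", target)
```

Note how no example ever contains tokens to the right of its target, which is exactly the unidirectional context described above.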
The final variant of the transformer model is the sequence-to-sequence model, which uses both the encoder and decoder parts of the original transformer architecture. The exact details of the pre-training objective vary from model to model. A popular sequence-to-sequence model, T5, pretrains the encoder using span corruption, which masks random sequences of input tokens. Those masked sequences are then replaced with a unique sentinel token, shown here as x. Sentinel tokens are special tokens added to the vocabulary that do not correspond to any actual word in the input text.
The decoder is then tasked with reconstructing the masked token sequences autoregressively. The output is the sentinel token followed by the predicted tokens.
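A minimal sketch of the span corruption idea, with hypothetical sentinel names like `<x0>` standing in for T5's actual sentinel tokens:

```python
def span_corrupt(tokens, spans):
    """Toy T5-style span corruption: each (start, length) span is
    collapsed to a unique sentinel in the input, and the target is
    each sentinel followed by the tokens it replaced."""
    spans = sorted(spans)
    corrupted, target = [], []
    i, n = 0, 0
    while i < len(tokens):
        if n < len(spans) and i == spans[n][0]:
            start, length = spans[n]
            sentinel = f"<x{n}>"       # special token, not a real word
            corrupted.append(sentinel)  # whole span becomes one sentinel
            target.append(sentinel)
            target.extend(tokens[start:start + length])
            i += length
            n += 1
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted, target

tokens = "the teacher teaches the student after the lecture".split()
corrupted, target = span_corrupt(tokens, [(1, 1), (4, 2)])
print(corrupted)  # ['the', '<x0>', 'teaches', 'the', '<x1>', 'the', 'lecture']
print(target)     # ['<x0>', 'teacher', '<x1>', 'student', 'after']
```

The encoder sees the corrupted sequence, while the decoder learns to emit the target sequence: each sentinel followed by the tokens it hid.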
You can use sequence-to-sequence models for translation, summarization, and question answering. They are generally useful when both your input and your output are bodies of text. Besides T5, which you will use in the labs of this course, another well-known encoder-decoder model is BART, not to be confused with BERT.
To summarize, here is a quick comparison of the different model architectures and their pre-training objectives. Autoencoding models are pretrained using masked language modeling. They correspond to the encoder part of the original transformer architecture and are often used for sentence classification or token classification.
Autoregressive models are pretrained using causal language modeling. This type of model uses the decoder component of the original Transformers architecture and is often used for text generation.
Sequence-to-sequence models use both the encoder and decoder parts of the original transformer architecture. The exact details of the pre-training objective vary from model to model. The T5 model is pretrained using span corruption. Sequence-to-sequence models are commonly used for translation, summarization, and question answering.
Now that you have seen how these different model architectures are trained, and the specific tasks they are suited to, you can choose the model type that best fits your use case. One more thing to keep in mind is that larger models of any architecture are generally more capable of performing their tasks well. Researchers have found that the larger a model is, the more likely it is to work the way you want without additional in-context learning or further training. This observed trend of model capability increasing with size has driven the development of larger and larger models in recent years.
This growth is driven by inflection points in research, such as the introduction of highly scalable Transformers architectures, access to large amounts of data for training, and the development of more powerful computing resources.
This steady increase in model size has actually led some researchers to speculate that there is a new Moore's Law for LLMs. Like them, you might ask, can we just keep adding parameters to increase performance and make the model smarter? What might this model growth lead to?
While this sounds great, it turns out that training these huge models is difficult and so expensive that continuously training bigger and bigger models may not be feasible. Let's take a closer look at some of the challenges associated with training large models in the next video.
Reference:
https://www.coursera.org/learn/generative-ai-with-llms/lecture/2T3Au/pre-training-large-language-models