[NLP] How the GPT model works

Introduction

In 2021, I wrote my first few lines of code using a GPT model, and that was when I realized text generation had reached an inflection point. I asked GPT-3 to summarize a long document and experimented with several prompts. The results were far more advanced than those of previous models, which made me excited about the technology and eager to learn how it is implemented. Now that the follow-up GPT-3.5, ChatGPT, and GPT-4 models are rapidly gaining widespread adoption, more people in the field are curious about how they work. While the details of their inner workings are proprietary and complex, all GPT models share some basic ideas that are not difficult to understand.

How generative language models work

Let’s first explore how generative language models work. The basic idea is as follows: they take n tokens as input and produce one token as output.

This seems like a fairly simple concept, but in order to truly understand it, we need to know what a token is.

A token is a piece of text. In the OpenAI GPT models, common and short words typically correspond to a single token, such as the word "We" in the example below. Long, uncommon words are generally split into several tokens. For example, the word "anthropomorphizing" in the example below is broken into three tokens. An abbreviation like "ChatGPT" may be represented by a single token or split into several, depending on how common it is for those letters to appear together. You can go to OpenAI's Tokenizer page, enter your text, and see how it gets split into tokens. You can choose between "GPT-3" tokenization, designed for text, and "Codex" tokenization, designed for code. We'll keep the default "GPT-3" setting.

You can also use OpenAI's open-source tiktoken library to tokenize text from Python code. OpenAI provides several different tokenizers, each of which behaves slightly differently. In the code below, we use the tokenizer for "davinci" (a GPT-3 model) to match the behavior you see in the UI.

import tiktoken

# Get the encoding for the davinci GPT3 model, which is the "r50k_base" encoding.
encoding = tiktoken.encoding_for_model("davinci")

text = "We need to stop anthropomorphizing ChatGPT."
print(f"text: {text}")

token_integers = encoding.encode(text)
print(f"total number of tokens: {encoding.n_vocab}")

print(f"token integers: {token_integers}")
token_strings = [encoding.decode_single_token_bytes(token) for token in token_integers]
print(f"token strings: {token_strings}")
print(f"number of tokens in text: {len(token_integers)}")

encoded_decoded_text = encoding.decode(token_integers)
print(f"encoded-decoded text: {encoded_decoded_text}")
text: We need to stop anthropomorphizing ChatGPT.
total number of tokens: 50257
token integers: [1135, 761, 284, 2245, 17911, 25831, 2890, 24101, 38, 11571, 13]
token strings: [b'We', b' need', b' to', b' stop', b' anthrop', b'omorph', b'izing', b' Chat', b'G', b'PT', b'.']
number of tokens in text: 11
encoded-decoded text: We need to stop anthropomorphizing ChatGPT.

You can see from the output that this tokenizer contains 50,257 distinct tokens, and that each token is internally mapped to an integer index. Given a string, we can split it into integer tokens, and we can convert those integers back into the corresponding sequences of characters. Encoding and then decoding a string should always give back the original string.

This gives you a good intuition for how OpenAI's tokenizer works, but you may be wondering why they chose those token lengths. Let's consider some other options for tokenization. Suppose we try the simplest possible implementation, where each letter is a token. That makes it easy to break the text into tokens and keeps the total number of distinct tokens small. However, we can't encode nearly as much information as in OpenAI's approach. If we used letter-based tokens in the example above, 11 tokens could only encode "We need to ", while 11 of OpenAI's tokens encode the entire sentence. It turns out that current language models have a limit on the maximum number of tokens they can receive, so we want to pack as much information as possible into each token.

Now let's consider the scenario where each word is a token. Compared to OpenAI's approach, we would only need 7 tokens to represent the same sentence, which seems more efficient. Splitting by word is also easy to implement. However, language models need a complete list of the tokens they might encounter, and that isn't feasible for whole words - not only because the dictionary contains so many words, but also because it would be hard to keep up with domain-specific terminology and any newly invented words.
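
To make that trade-off concrete, here is a small sketch (not part of the original tokenizer example) that counts how many tokens the same sentence needs under character-level splitting, word-level splitting, and OpenAI's subword tokenization. The regex word splitter is only a rough stand-in for a real word tokenizer.

import re
import tiktoken

text = "We need to stop anthropomorphizing ChatGPT."

# Character-level tokens: tiny vocabulary, but each token carries very little information.
char_tokens = list(text)

# Word-level tokens (a rough approximation: split into words and punctuation): each token
# carries a lot of information, but the vocabulary would be enormous and never complete.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Subword (BPE) tokens, as produced by the GPT-3 "davinci" tokenizer used earlier.
bpe_tokens = tiktoken.encoding_for_model("davinci").encode(text)

print(f"characters: {len(char_tokens)}, words: {len(word_tokens)}, BPE tokens: {len(bpe_tokens)}")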

It is therefore not surprising that OpenAI settled on a solution somewhere between those two extremes. Other companies have released tokenizers that follow a similar approach, such as Google's SentencePiece.

Now that we have a better understanding of tokens, let's return to the original diagram and see whether we can make better sense of it. A generative model takes n tokens as input, which could be a few words, a few paragraphs, or a few pages. It produces a single token as output, which could be a short word or a piece of a word.

This makes more sense now.

But if you have used OpenAI's ChatGPT, you know that it produces many tokens, not just a single one. That's because this basic idea is applied in an expanding-window pattern. You give it n tokens, it produces one output token, then it appends that output token to the input for the next iteration, produces a new output token, and so on. This pattern repeats until a stopping condition is reached, indicating that it has finished generating all the text you need.

For example, if I give the model "We need to" as input, the algorithm might produce results like this:
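
In code, this expanding-window loop might look like the following minimal sketch, where predict_next_token is a hypothetical stand-in for the model rather than a real API:

def generate(prompt_tokens, predict_next_token, stop_token, max_new_tokens=100):
    # Start with the prompt, then repeatedly feed the growing sequence back into the model.
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next_token(tokens)  # n tokens in, one token out
        tokens.append(next_token)                # the output becomes part of the next input
        if next_token == stop_token:             # stopping condition
            break
    return tokens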

When using ChatGPT, you may also have noticed that the model is not deterministic: if you ask it exactly the same question twice, you may get two different answers. That's because the model doesn't actually produce a single predicted token; instead, it returns a probability distribution over all possible tokens. In other words, it returns a vector in which each entry expresses the probability of a particular token being chosen. The model then samples from that distribution to generate the output token.
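
Here is a minimal sketch of that sampling step, using made-up probabilities for a handful of candidate tokens rather than a real model output:

import random

# Hypothetical probability distribution over a few candidate next tokens.
next_token_probs = {" lazy": 0.6, " big": 0.25, " sleepy": 0.1, " old": 0.05}

# Sample according to the probabilities; running this twice can give different tokens,
# which is why the model's output is not deterministic.
tokens = list(next_token_probs.keys())
weights = list(next_token_probs.values())
print(random.choices(tokens, weights=weights)[0])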

How does the model come up with that probability distribution? That's what the training phase is for. During training, the model is exposed to a large amount of text, and its weights are adjusted so that it predicts good probability distributions given sequences of input tokens. GPT models are trained on a large portion of the internet, so their predictions reflect a mix of the information they have seen.

You now have a good understanding of the idea behind generative models. Note that I've only explained the idea, without giving you an algorithm yet. It turns out this idea has been around for many decades, and it has been implemented with several different algorithms over the years. Next we'll look at some of those algorithms.

A brief history of generative language models

Hidden Markov Models (HMMs) became popular in the 1970s. Their internal representation encodes the grammatical structure of sentences (nouns, verbs, and so on), and they use that knowledge when predicting new words. However, because they are Markov processes, they only take the most recent token into account when generating a new token. So they implement a very simple version of the "n tokens in, one token out" idea, with n = 1. As a result, they don't generate very sophisticated output. Consider the following example:

If we input "The Quick Brown Fox Jumps Over the" into a language model, we would expect it to return "Lazy". However, the Hidden Markov Model will only see the last token "the" and with so little information it is unlikely to give the predictions we expect. When one tries HMMs, it becomes obvious that the language model needs to support multiple input tokens in order to produce good output. When one tries HMMs, it becomes obvious that the language model needs to support multiple input tokens in order to produce good output.

N-grams became popular in the 1990s because they fixed the main limitation of HMMs by taking more than one token as input. For the previous example, an n-gram model would probably do a good job of predicting the word "lazy".

The simplest implementation of an n-gram is a bigram with character-based tokens, which, given a single character, is able to predict the next character in the sequence. You can create one of these in just a few lines of code, and I encourage you to give it a try. First, count the number of distinct characters in your training text (let's call it n) and create an n x n two-dimensional matrix initialized with zeros. Each pair of input characters can be used to locate a specific entry in this matrix, by choosing the row corresponding to the first character and the column corresponding to the second character. As you parse the training data, for every pair of characters you add one to the corresponding matrix cell. For example, if your training data contains the word "car", you would add one to the cell in the "c" row and "a" column, and then add one to the cell in the "a" row and "r" column. Once you have accumulated the counts for all your training data, convert each row into a probability distribution by dividing each cell by the total across that row.

Then, to make a prediction, you need to give it a single character to start with, for example "c". You look up the probability distribution in the "c" row and sample from it to produce the next character. You then take the character you produced and repeat the process, until a stopping condition is reached. Higher-order n-grams follow the same basic idea, but they are able to look at a longer sequence of input tokens by using n-dimensional tensors.
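
Here is a minimal sketch of that bigram model, following the matrix-of-counts recipe described above; the training text is just a toy example.

import numpy as np

def train_bigram(text):
    # Count the distinct characters and build an n x n matrix of pair counts.
    chars = sorted(set(text))
    index = {c: i for i, c in enumerate(chars)}
    counts = np.zeros((len(chars), len(chars)))
    for first, second in zip(text, text[1:]):
        counts[index[first], index[second]] += 1
    # Turn each row into a probability distribution (this assumes every character
    # in the training text is followed by at least one other character).
    probs = counts / counts.sum(axis=1, keepdims=True)
    return chars, index, probs

def generate(chars, index, probs, start, length=30):
    # Repeatedly sample the next character from the row of the current character.
    out = start
    for _ in range(length):
        row = probs[index[out[-1]]]
        out += np.random.choice(chars, p=row)
    return out

chars, index, probs = train_bigram("the quick brown fox jumps over the lazy dog. the car. ")
print(generate(chars, index, probs, start="t"))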

N-grams are easy to implement. However, because the size of the matrix grows exponentially with the number of input tokens, they don't scale well to larger numbers of tokens. And with only a few input tokens, they aren't able to produce good results. A new technique was needed to keep making progress in this area.

In the 2000s, Recurrent Neural Networks (RNNs) became quite popular because they can accept a much larger number of input tokens than previous techniques. In particular, LSTMs and GRUs, which are types of RNNs, became widely used and proved capable of producing fairly good results.

RNNs are a type of neural network, but unlike traditional feed-forward neural networks, their architecture can adapt to accept any number of inputs and produce any number of outputs. For example, if we feed an RNN the input tokens "We", "need", and "to" and want it to generate a few more tokens until a stopping point is reached, the RNN might have the following structure:

Each of the nodes in the structure above has the same weights. You can think of it as a single node that connects to itself and executes repeatedly (hence the name "recurrent"), or you can think of it in the expanded form shown in the image above. One key capability that LSTMs and GRUs add over basic RNNs is an internal memory cell that gets passed from one node to the next. This enables later nodes to remember certain aspects of earlier ones, which is essential for making good text predictions.
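
The following numpy sketch shows that unrolled structure with a plain RNN cell: the same weight matrices are reused at every step, and the hidden state is what gets passed from node to node. The weights here are random, just to illustrate the shape of the computation; LSTMs and GRUs add gated memory cells on top of this.

import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 50, 16

# The same three weight matrices are shared by every node in the unrolled diagram.
W_xh = rng.normal(scale=0.1, size=(hidden_size, vocab_size))   # input  -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden
W_hy = rng.normal(scale=0.1, size=(vocab_size, hidden_size))   # hidden -> output

def rnn_step(token_id, h):
    # One unrolled node: combine the new input token with the previous hidden state.
    x = np.zeros(vocab_size)
    x[token_id] = 1.0                     # one-hot encoding of the input token
    h = np.tanh(W_xh @ x + W_hh @ h)      # updated hidden state, passed on to the next node
    logits = W_hy @ h                     # scores over the vocabulary for the next token
    return logits, h

h = np.zeros(hidden_size)
for token_id in [3, 17, 42]:              # e.g. the ids of "We", "need", "to"
    logits, h = rnn_step(token_id, h)
# After the last input token, the logits would be converted into a probability
# distribution (e.g. with a softmax) and sampled to produce the next token.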

However, RNNs suffer from instability problems with very long sequences of text. The gradients in the model tend to grow exponentially (called "exploding gradients") or shrink to zero (called "vanishing gradients"), which prevents the model from continuing to learn from the training data. LSTMs and GRUs mitigate the vanishing-gradient problem, but don't prevent it completely. So, even though in theory their architecture allows inputs of any length, in practice there is a limit to that length. Once again, the quality of the generated text was limited by the number of input tokens the algorithm could support, and a new breakthrough was needed.

In 2017, Google released the paper that introduced Transformers, and we entered a new era of text generation. The architecture used in Transformers allows a huge increase in the number of input tokens, eliminates the gradient-instability issues seen in RNNs, and is highly parallelizable, which means it can take advantage of the power of GPUs. Transformers are widely used today, and they're what OpenAI chose for its latest GPT text-generation models.

Transformers are based on the "attention mechanism", which allows the model to pay more attention to some inputs than others, regardless of where they appear in the input sequence. For example, consider a sentence in which the verb "went" appears early on, followed by "and", and the model needs to predict the verb that comes next.

In this case, when the model predicts the verb "bought", it needs to match the past tense of "went". To do that, it has to pay close attention to the token "went". In fact, it may pay more attention to the token "went" than to the token "and", even though "went" appears much earlier in the input sequence.

This selective-attention behavior in GPT models is enabled by a novel idea from that 2017 paper: the use of a "masked multi-head attention" layer. Let's break this term down and look at each of its parts (a short code sketch follows the list):

Attention: The "attention" layer contains a weight matrix that represents the strength of the relationship between all pairs of token positions in the input sentence. These weights are learned during training. If the weight corresponding to a pair of positions is large, then the two tokens in these positions have a strong influence on each other. This mechanism allows Transfomer to focus more on certain tokens than others, regardless of where in the sentence they appear.

Masked: The attention layer is "masked" if the matrix is restricted to the relationship between each token position and the earlier positions in the input. This is what GPT models use for text generation, because an output token can only depend on the tokens that come before it.

Multi-head: The Transformer uses a masked "multi-head" attention layer because it contains several masked attention layers ("heads") that operate in parallel.
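
To make those three terms concrete, here is a small numpy sketch of masked multi-head self-attention. It follows the standard scaled dot-product formulation from the 2017 paper rather than any particular GPT implementation, uses random untrained weights, and omits the final output projection for brevity.

import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def masked_multi_head_attention(X, num_heads, rng):
    # X has shape (seq_len, d_model): one embedding vector per input token.
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    # "Masked": position i may only attend to positions <= i.
    mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
    heads = []
    for _ in range(num_heads):                       # "multi-head": several heads run in parallel
        Wq = rng.normal(size=(d_model, d_head))
        Wk = rng.normal(size=(d_model, d_head))
        Wv = rng.normal(size=(d_model, d_head))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(d_head) + mask    # "attention": strengths between all pairs of positions
        weights = softmax(scores)                    # each row is a distribution over earlier tokens
        heads.append(weights @ V)                    # mix the values of the tokens being attended to
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                          # 5 tokens with embedding size 8
print(masked_multi_head_attention(X, num_heads=2, rng=rng).shape)   # (5, 8)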

The memory cells of LSTMs and GRUs also enable later tokens to remember some aspects of earlier tokens. However, if two related tokens are very far apart, the gradient problems can get in the way. Transformers don't have that problem, because every token has a direct connection to every other token that precedes it.

Now that you understand the main ideas of the Transformer architecture used in GPT models, let's look at the differences between the various GPT models currently available.

How different GPT models are implemented

As of this writing, the three most recent text-generation models released by OpenAI are GPT-3.5, ChatGPT, and GPT-4, and they are all based on the Transformer architecture. In fact, "GPT" stands for "Generative Pre-trained Transformer".

GPT-3.5 is a Transformer trained as a completion-style model, which means that if we give it a few words as input, it can generate a few more words that are likely to follow them in the training data.

ChatGPT, on the other hand, is trained as a conversation-style model, which means it performs best when we communicate with it as if we were having a conversation. It's based on the same Transformer base model as GPT-3.5, but it's fine-tuned with conversation data. It's then further fine-tuned using Reinforcement Learning with Human Feedback (RLHF), a technique OpenAI introduced in its 2022 InstructGPT paper. In this technique, we give the model the same input twice, get back two different outputs, and ask a human ranker which output they prefer. That choice is then used to improve the model through fine-tuning. This technique brings the model's output into alignment with human expectations, and it is critical to the success of OpenAI's latest models.

GPT-4, for its part, can be used both for completions and for conversations, and it has its own entirely new base model. This base model is also fine-tuned with RLHF for better alignment with human expectations.

Writing code that uses GPT models

You can choose to use these models through OpenAI's API directly or through Microsoft's Azure OpenAI Service. The main difference between the two is that Azure provides the following additional features:

  • Automated, responsible AI filters reduce unethical use of APIs
  • Azure security features such as private networking
  • Regional availability for optimal performance when interacting with the API

If you're writing code that uses these models, you'll need to pick a specific version to use. Here's a quick cheat sheet of the versions currently available in the Azure OpenAI Service:

  • GPT-3.5: text-davinci-002, text-davinci-003
  • ChatGPT: gpt-35-turbo
  • GPT-4: gpt-4, gpt-4-32k

The two GPT-4 versions differ mainly in the number of tokens they support: gpt-4 supports 8,000 tokens, and gpt-4-32k supports 32,000 tokens. In contrast, the GPT-3.5 models only support 4,000 tokens.
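
As an example of how you might call one of these versions from Python, here is a minimal sketch using the openai package (v1+ client) against the Azure OpenAI Service. The endpoint, key, API version, and deployment name are placeholders to replace with your own values, and the exact syntax depends on the library version you install.

import os
from openai import AzureOpenAI

# Placeholders: supply your own Azure OpenAI endpoint, key, API version, and deployment name.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="gpt-35-turbo",  # the name of your deployment (a ChatGPT-style model in this case)
    messages=[{"role": "user", "content": "Explain in one sentence how GPT models generate text."}],
)
print(response.choices[0].message.content)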

Since GPT-4 is currently the most expensive option, it's a good idea to start with one of the other models and upgrade only if needed. For more details about these models, check out the documentation.

Conclusion

In this article, we covered the basic principles common to all generative language models, and in particular the aspects that make OpenAI's latest GPT models unique.

Along the way, we emphasized the core idea of a language model: "n tokens in, one token out". We looked at how tokens are broken down and why they are broken down that way. We traced the decades-long evolution of language models, from the early Hidden Markov Models to the recent Transformer-based models. Finally, we described OpenAI's three latest Transformer-based GPT models, how each of them is implemented, and how you can write code that makes use of them.

By now, you should be well prepared to have informed conversations about GPT models and start using them in your own coding projects. I plan to write more explanations of language models, so follow me and let me know what topics you'd like to see covered! Thank you for reading!
