【Deep Learning】ChatGPT

        This article is based on a talk given by Andrej Karpathy (co-founder of OpenAI and former director of AI and Autopilot Vision at Tesla) at Microsoft Build 2023, and covers two main parts:

        1. How to train GPT (which can be understood as training an AI assistant)

        2. How to use GPT

        The training of a GPT assistant can be divided into four stages: "pre-training, supervised fine-tuning, reward modeling and reinforcement learning". In the "pre-training" stage, a transformer-based neural network, often referred to as the base model, is trained for months on large-scale Internet datasets using thousands of GPUs.

        The model is then further "supervised fine-tuned" on high-quality question-answer data. This data is usually manually labeled, and the number of examples ranges from tens of thousands to hundreds of thousands. Next comes "reward modeling", which teaches a model to evaluate the quality of different responses: human feedback comparing different responses is used to build a reward model that can score newly generated text. Finally, the reward model is used in reinforcement learning to obtain the final assistant model. In the reinforcement-learning stage, the scores produced by the reward model are used to further improve the quality of the generated replies so that they better match human preferences.

        What is the difference between a base model, a supervised fine-tuned model, and a reinforcement-learning fine-tuned model? The base model is more creative and generates more diverse results, which suits generative creation; the supervised fine-tuned model is better at solving specific problems but less creative; the reinforcement-learning fine-tuned model has the highest generation quality but is the hardest to train.

        How do you use the GPT assistant model effectively through prompt engineering? The input prompt should provide sufficient context and a precise description of the intent, clearly state the expected result, and can also include worked examples to guide the model to think step by step. In addition, you can exploit self-consistency (sampling multiple times), ask the model to reflect on its answers, and use external tools to compensate for the model's limitations. What prompt engineering does is bridge the cognitive gap between the model and the human brain.

        The current best practice is to try prompt-engineering techniques first and, only if that is not enough, to consider fine-tuning, including SFT and RLHF. Fine-tuning has a higher barrier to entry and requires more expertise.

        The current models still have limitations, so for now they are best applied to low-risk scenarios where they assist humans in completing their work. LLMs represented by GPT-4 are a remarkable achievement, and the amount of knowledge they contain is astonishing, but we also need to recognize their limitations and use them well in scenarios that conform to the core socialist values.

1. How to train GPT

        The training of the GPT model is divided into four stages: "pre-training, supervised fine-tuning, reward modeling and reinforcement learning", as shown in the following figure:

        The above four stages correspond to three types of models: the base model, the supervised fine-tuning (SFT) model, and the reinforcement learning from human feedback (RLHF) model. These four stages are described in detail below.

1.1 Pre-training

        The base model has only been pre-trained, without any subsequent tuning. It has general language-modeling capabilities and can generate coherent, grammatically correct, and diverse text, but it handles specific tasks poorly.

1.1.1 Data

        The pre-training phase requires a large Internet corpus, such as web pages, books, and papers. The scale of this data is very large, on the order of billions or even tens of billions. On such a large dataset, unsupervised language-model pre-training is carried out with transformer-style architectures to learn the statistical regularities of language. This stage requires massive computing resources, for example thousands of GPUs running in parallel, and can last from several weeks to several months. The purpose of pre-training is to learn general-purpose language representations that can be applied to downstream tasks.

        First, a large amount of data needs to be collected. The following is the data mixture used to pre-train the LLaMA base model released by Meta:

        You can roughly see the types of datasets involved, including Common Crawl (web-crawl data), C4 (also web-crawl data), and then some high-quality datasets such as GitHub, Wikipedia, Books, ArXiv, and StackExchange. These are mixed together and sampled according to certain proportions, forming the training set of the pre-trained GPT model.

1.1.2 Tokenization

        Before this data can be used for pre-training, a pre-processing step is required: tokenization. In essence, it converts text into a sequence of integers that the model can process.
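
        For illustration, here is a minimal sketch of this step using the open-source tiktoken library (the choice of library is an assumption for the example; the talk does not prescribe a particular tokenizer). It shows how raw text becomes the integer sequence that the transformer actually consumes:

```python
# Minimal tokenization sketch using tiktoken (illustrative choice of library).
import tiktoken

enc = tiktoken.get_encoding("gpt2")    # BPE vocabulary of roughly 50k tokens
tokens = enc.encode("California's population is 53 times that of Alaska.")
print(tokens)              # a list of integers, one per token
print(enc.decode(tokens))  # decoding round-trips back to the original text
```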

1.1.3 Parameters and hyperparameters

        The above shows the rough order of magnitude of the relevant parameters and hyperparameters in the pre-training phase. The vocabulary size is usually tens of thousands of tokens. The context length is usually 2,000 or 4,000, and nowadays even 100k; it is the maximum number of tokens GPT will look at when predicting the next token in the sequence. You can also see the approximate number of parameters, for example LLaMA has 65 billion parameters. Although LLaMA has only 65 billion parameters compared with GPT-3's 175 billion, LLaMA is the more powerful model, intuitively because it was trained for significantly longer: on 1.4 trillion tokens rather than 300 billion tokens. Therefore, the strength of a model cannot be judged solely by its parameter count.

        In addition, the table also shows some hyperparameters of several transformer-based models, including the number of attention heads, the hidden dimension, the number of layers, and so on. For example, to train the LLaMA 65B model, Meta used 2,000 GPUs, trained for about 21 days, and spent several million dollars. That is roughly the order of magnitude of the pre-training phase.

1.1.4 How to pre-train

        During pre-training, the tokenized integer sequences are packed into batches and fed into the transformer. A batch is an array of size B x T, where B is the batch size and T is the maximum context length. In the example below, the training samples are stacked as rows.

        The context length of 10 above is just an example; in real scenarios this number may be 2,000, 4,000, and so on. When packing documents into a batch, special tokens are needed to separate them. These special tokens act as end-of-text markers that tell the transformer where a new document starts. In the example below:

        A green cell can see all the tokens before it, here all the yellow tokens. The entire context is fed into the transformer network, and the transformer tries to predict the next token in the sequence, in this case the token at the red cell. For each position, the model outputs a probability distribution over the vocabulary; for example, if the vocabulary has 50,257 tokens, there will be that many probabilities, one per possible next token. In the example above, the correct next token happens to be the integer 513 (the red cell in the figure). Because we know from the corpus which token should appear in the red cell, we can use it as the supervision signal (the ground truth) to update the parameters of the transformer. Within a batch, each row is one sample, and every position in every row performs this next-token prediction in parallel; every batch is processed the same way, training the transformer to correctly predict the next token in the sequence.
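
        The following PyTorch sketch shows the shape of this training step under toy assumptions (a stand-in embedding plus linear head instead of a real deep transformer): rows of B x T token ids are shifted by one position, and a cross-entropy loss over the vocabulary supervises the prediction of every next token in parallel.

```python
# Toy next-token-prediction step (illustrative only; a real GPT puts a deep
# causal transformer between the embedding and the output head).
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, vocab_size, d_model = 4, 10, 50257, 128      # toy sizes

embed = nn.Embedding(vocab_size, d_model)          # stand-in "model"
head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (B, T + 1))  # packed rows of token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]    # shift by one position

logits = head(embed(inputs))                       # (B, T, vocab_size)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                    # gradients update the parameters
```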

1.1.5 How to use the pre-trained model

        Pre-trained models learn very powerful general representations in the process of language modeling and are very effective when fine-tuned on the downstream tasks you care about. For example, if you are interested in sentiment classification, the classic approach was to collect a pile of positive and negative examples and train an NLP model from scratch. The approach of the BERT era, however, is to ignore sentiment classification at first and do large-scale language-model pre-training; after pre-training a large transformer, only a small number of sentiment-classification examples are needed to fine-tune the model very effectively for that task. The reason is that, in order to predict the next token during pre-training, a transformer-based model has to understand a great deal about the structure of text and all the concepts within it. This was the GPT-1 era.

        In the GPT-2 era, people noticed that, even better than fine-tuning, these models can be prompted very effectively. Because they are language models with a single goal, completing or continuing a document, they can effectively be tricked into performing tasks by arranging suitable fake documents.

        For example, you can lay out a paragraph followed by a pattern like "question, answer, question, answer"; this is called a few-shot prompt. For instance, "Q: How old is Catherine? A: 54" in the figure above, and then we ask our own question, such as "Where does she live?". As the transformer tries to complete the document according to its language-modeling nature, it is actually answering our question. This is an example of prompt engineering on the base model: making it think it is imitating a document, and tricking it into performing a task. In effect, it treats the task as text completion or continuation.

        This ushered in a new era of "prompting over fine-tuning". In practice, this method turned out to be very effective on many problems, even without training or fine-tuning any neural network.

        It should be pointed out that the base model obtained at this point is not yet an assistant model (see the figure below). It does not really want to answer your questions; it only wants to complete the document. So if you ask the model to "write a poem about bread and cheese", it may just answer your question with more questions: it is simply continuing what it thinks is the rest of the document. However, the base model can be prompted in specific ways to accomplish such tasks, as shown in the right sub-figure below; we can even trick the base model into acting as an assistant.

        For example, a specific few-shot prompt can be created so that the document looks like an exchange between a human and an assistant:

        Then put your real query at the end, and the base model will condition itself to act like a helpful assistant and answer it. While this works to some extent, it is not very reliable and not particularly effective in practice.
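
        A hypothetical template of this kind might look as follows (the wording is an illustration, not the exact prompt from the talk); "continuing the document" then coincides with answering the final query:

```python
# Hypothetical few-shot template that nudges a base model to act like an assistant.
fake_dialog = """Below is a conversation between a human and a helpful AI assistant.

Human: What is the capital of France?
Assistant: The capital of France is Paris.

Human: How many legs does a spider have?
Assistant: A spider has eight legs.

Human: {query}
Assistant:"""

prompt = fake_dialog.format(query="Write a poem about bread and cheese.")
# Feeding `prompt` to a base model often yields an assistant-style answer,
# but as noted above this trick is not fully reliable.
```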

        Therefore, there is a different path to building a true GPT assistant rather than just treating the base model as a document completer. This leads to the supervised fine-tuning stage.

1.2 Supervised fine-tuning

        The supervised fine-tuned model starts from the pre-trained base model and is further fine-tuned on additional labeled data to adapt it to specific downstream tasks, such as question answering, so it can generate answers to specific questions.

        In the supervised fine-tuning stage, a small (tens of thousands of examples) but high-quality dataset is collected, consisting of prompts and high-quality responses. Language modeling is performed on this data; the algorithm itself does not change. After training you get an SFT model (supervised fine-tuned model), which can actually be deployed and is a real, somewhat useful assistant. Here is an example of SFT training data:

        Humans come up with random prompts, such as "Can you write a short introduction to 'monopoly' in economics?", and then, following the labeling instructions, manually write a high-quality response that meets the requirements. When writing these responses, labelers follow the annotation documentation, which for example asks for helpful, truthful, and harmless responses.
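
        The shape of a single SFT record might look like the following sketch (the field names are assumptions; the talk only describes prompt/response pairs written against the labeling guidelines):

```python
# Hypothetical SFT training record; field names are illustrative.
sft_example = {
    "prompt": 'Can you write a short introduction to "monopoly" in economics?',
    "response": "In economics, a monopoly is a market with a single seller ...",
}
# During SFT the model is trained with ordinary language modeling on
# prompt + response; only the data changes, not the algorithm.
```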

1.3 Reward Modeling

        After SFT comes "RLHF", which is "Reinforcement Learning from Human Feedback" and consists of "reward modeling and reinforcement learning". Reward modeling builds a model that automatically evaluates the quality of responses by learning human judgments about the relative quality of different responses. Reinforcement learning then uses this reward model to improve the overall quality of the responses by increasing the probability of generating high-reward text. As a result, the RLHF model can generate higher-quality and more human-aligned responses.

        In the "reward modeling" step, what we do is "turn the data into a comparative form". Here is an example:

        For example, given the same prompt, the model is asked to write a program or function. The example above asks it to check whether a given string is a palindrome (a sequence of words, letters, or numbers that reads the same forward and backward). The trained SFT model is used to create multiple completions, i.e. multiple replies; in this example it creates three completions, which are then ranked manually:

        In fact, this kind of comparison is quite difficult; it may take hours of human labor to compare the pairs under a single prompt. After comparing all possible pairs among these completions, we end up with a ranking of all completions.

        The three rows of prompts in the schematic above are all the same, but the completions differ. The yellow tokens are generated by the SFT model, and a special token (green) is appended at the end, at which the reward is read out. Essentially only the transformer's output at the green token is supervised: the transformer predicts a score for each completion under the same prompt. This gives an estimate of the quality of each completion, and we also have the ground-truth ranking of the completions from the human labelers. Mis-rankings are corrected by designing a loss function that trains the model to make reward predictions consistent with the human comparison data. The reward model is what allows us to evaluate how good a completion is for a given prompt.
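
        A common way to write such a loss is the pairwise ranking form below (a standard Bradley-Terry style sketch; the talk does not give the exact equation used):

```python
# Pairwise ranking loss for a reward model (illustrative formulation).
import torch
import torch.nn.functional as F

def reward_ranking_loss(r_preferred: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Push the scalar reward of the human-preferred completion above the
    # reward of the rejected one; both scores are read off the green token.
    return -F.logsigmoid(r_preferred - r_rejected).mean()

# Toy scores for two comparison pairs produced by the reward head:
loss = reward_ranking_loss(torch.tensor([0.8, 1.1]), torch.tensor([-0.3, 0.5]))
print(loss.item())
```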

1.4 Reinforcement Learning

        The trained "reward model" cannot be deployed directly, because by itself it is not very useful as an assistant. It is, however, very useful for the subsequent reinforcement-learning stage, since it can score arbitrary completions for any prompt. What the reinforcement-learning stage does is collect a large number of prompts and then run reinforcement learning against the reward model.

        How is it done? Take the following figure as an example.

        Lay out the same prompts in rows, use the SFT model to generate completions (yellow tokens), then append the special reward token and obtain a reward for each completion from the reward model. Note that the parameters of the reward model are now frozen and no longer change. The reward model scores each completion under each prompt, and then the same language-modeling loss is used to train the yellow tokens, adjusting the generated tokens toward completions that the reward model scores highly. In essence, the language-modeling objective is weighted by the reward that the reward model assigns to the completion.

        For example, in the first row above, the reward model considers this a fairly high-scoring completion, so all the tokens sampled in the first row are reinforced and will get higher probability in the future. In contrast, the reward model strongly dislikes the completion in the second row, giving it a score of -1.2, so every token sampled in the second row will get a slightly lower probability in the future. Repeating this over many prompts and many batches eventually yields a policy that produces yellow tokens, i.e. completions, that score highly according to the reward model trained in the previous stage. That is the RLHF process.
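
        In highly simplified form, the update looks like the REINFORCE-style sketch below (production systems such as InstructGPT use PPO with a KL penalty against the SFT model, which is omitted here):

```python
# Simplified policy-gradient step for RLHF (illustrative only).
import torch

def rl_loss(token_logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """token_logprobs: (B, T) log-probabilities of the sampled completion tokens.
    rewards: (B,) scalar scores from the frozen reward model."""
    # High-reward completions get their tokens reinforced; low-reward ones
    # (e.g. the -1.2 row above) are pushed toward lower probability.
    return -(rewards.unsqueeze(1) * token_logprobs).mean()

logprobs = torch.randn(3, 8, requires_grad=True)   # stand-in for policy outputs
loss = rl_loss(logprobs, torch.tensor([1.0, -1.2, 0.2]))
loss.backward()
```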

        After RLHF you get a model that can be deployed. For example, ChatGPT is an RLHF model, while Vicuna-13B (the "vicuna" model) and similar models are SFT models. In summary, the whole pipeline goes through three kinds of models: the base model, the SFT model, and the RLHF model.

        The reason for doing RLHF is that it further improves quality. Studies have shown that humans generally prefer outputs from the RLHF model over those from the SFT model and the base model.

        So why does RLHF work better? The jury is still out, but one possible reason is that comparison is easier than generation: the two tasks have different levels of difficulty. Take writing classical poetry as an example. Suppose you are a labeling contractor collecting SFT data and are asked to write a seven-character quatrain about spring. You may simply not be good at writing one, but if you are given a few ready-made quatrains, you can probably tell which one you like best. Judging which one is better is a simpler task. This asymmetry means that comparison may be a better way to leverage human judgment to produce a slightly better model.

        Of course, the RLHF model is not better than the base model in every respect. RLHF loses some entropy, which means its outputs vary less than the base model's. The base model has high entropy and will give diverse, more creative outputs. The base model is like a person who has read everything in the world but whose mind is still childlike, whimsical and unconstrained; the SFT and RLHF models have been through all kinds of exam-oriented education, and their thinking is more easily constrained by rules.

 2. How to use GPT

        For large language models such as GPT, carefully designing input prompts to obtain high-quality output is a current research hotspot. This requires taking the model's own cognitive characteristics into account and using techniques such as step-by-step reasoning, providing clear context and examples, and guiding the use of tools. Unlike human thinking, GPT is essentially sequence generation conditioned on a prompt. Prompt engineering is easier and cheaper than model fine-tuning, but fine-tuning works better in some scenarios. GPT has problems such as bias, errors, and knowledge limitations and cannot be relied on completely; it should be used as a creative assistant under supervision, in collaboration with humans. Overall, GPT is suitable for low-risk applications and can serve as a source of knowledge, but human judgment is still required for key decisions. We should regard it as a partner in writing and thinking, playing to the strengths of both sides to improve efficiency together.

2.1 Thinking about differences

        Let's illustrate with a concrete scenario. Say you are writing an article or blog post and want to end it with the sentence: "California's population is 53 times that of Alaska." When you write a sentence like this, your inner monologue does a lot of work behind the scenes, such as looking up population figures, comparing the numbers, and checking the arithmetic, before you finally write down the sentence. But what does such a sentence look like when training GPT?

        From GPT's perspective, this is just a sequence of tokens. When GPT reads or generates these tokens, it processes them one by one, and each token gets roughly the same amount of computation. These transformers are not shallow networks (they may have, say, 80 layers). The transformer will do its best to imitate the text, but this process is obviously very different from your thought process. In particular, in the final product, all the internal monologue has been completely stripped out of the datasets we create and feed into the LLMs. GPT looks at every token and spends the same amount of compute on each, so we cannot expect it to do too much work per token. These transformers are like token simulators: they do not know what they do not know, they do not know what they are good or bad at, they just try their best to imitate the next token. They lack a reflective loop and do not perform sanity checks. By default they do not correct their own mistakes along the way; they just sample token sequences. They do not have a separate internal monologue running in their heads; they are just evaluating what is in front of them at the moment.

        Of course, this approach also has some cognitive advantages. Through tens of billions of parameters, the model stores a very large base of facts covering a huge number of domains. It also has a fairly large and, in a sense, perfect working memory: through the self-attention mechanism, anything within the context window is immediately accessible. From this point of view its memory seems ideal, but the length of context it can use is limited in practice; up to that length the transformer has direct, lossless access to anything in its context window. Prompting is about bridging the cognitive gap between these two architectures, the human brain and the large language model.

2.2 Chain of thought

        For tasks involving reasoning, the transformer cannot be expected to do too much reasoning per token, so the reasoning has to be spread across many more tokens. Often you cannot give the transformer a very complex question and expect it to find the answer within a single token; there simply is not enough computation for that. These transformers need more tokens to "think". Break the task into multiple steps and use the prompt to elicit an internal monologue, so that more tokens take part in the reasoning process.

        For example, a few-shot prompt can show the transformer that it should display its working when answering a question. Given some examples, the transformer will imitate that template and end up performing better on evaluation. Furthermore, the model can be nudged into this behavior simply by saying "let's think step by step". This puts the transformer into a mode where it shows its work, doing less computation per token but spreading it over more tokens, and succeeding more often.
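
        A zero-shot chain-of-thought prompt can be as simple as the following sketch (the exact wording is an assumption):

```python
# Illustrative zero-shot chain-of-thought prompt.
question = "A store sells pens for 3 dollars each. How much do 17 pens cost?"
prompt = question + "\nLet's think step by step."
# The appended sentence spreads the reasoning over many tokens instead of
# forcing the model to produce the answer in one step.
```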

2.3 Try more

        If the first attempt is not successful, you can sample multiple times and then pick the best result or take a majority vote. While predicting the next token, the transformer may sample a poor token and go down a dead-end line of reasoning. Unlike humans, transformer models cannot recover from this: they are stuck with every token they have sampled, so they continue the sequence even if it is not going to succeed.
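
        A minimal self-consistency sketch is shown below; `generate` and `extract_answer` are hypothetical hooks for your model API and answer parser, supplied by the caller:

```python
# Self-consistency: sample several completions and keep the most common answer.
from collections import Counter
from typing import Callable

def self_consistent_answer(prompt: str,
                           generate: Callable[[str], str],
                           extract_answer: Callable[[str], str],
                           n_samples: int = 5) -> str:
    answers = [extract_answer(generate(prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]   # majority vote
```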

2.4 Thinking, fast and slow

        In fact, LLMs often know when they have screwed up. Suppose you ask the model to generate a poem that does not rhyme, and it gives you a poem that does rhyme. It turns out that, especially with larger models like GPT-4, you can simply ask it: did you complete the task?

        GPT-4 knows very well that it did not complete the task, and it will tell you: "No, I actually did not complete the task. Let me try again." Therefore this needs to be compensated for in the prompt: you have to ask it to check. If you do not ask it to check, it will not check itself; it is just a token simulator. More generally, many of these techniques fall into the category of rebuilding a "slow thinking" system. Daniel Kahneman's "Thinking, Fast and Slow" describes two human modes of thinking:

        (1) Fast thinking is a fast, automatic process, similar to an LLM sampling tokens. Fast thinking (System 1) refers to the intuitive and automatic way of thinking that we use most of the time. It is a quick, unconscious, almost automatic way of thinking that makes rapid judgments and decisions. Fast thinking relies on the patterns, intuitions, and heuristics we have accumulated through experience, which help us respond quickly in daily life without deliberation.

        (2) Slow thinking is the slower, more deliberate planning part of the brain. Slow thinking (System 2) is a deeper, conscious way of thinking that requires more cognitive effort. Slow thinking involves higher cognitive processes such as logical reasoning, analysis, comparison, and controlling attention. It requires our conscious focus, deliberation, and complex problem-solving and decision-making.

        According to Kahneman, fast thinking and slow thinking play different roles in our daily lives. Thinking fast helps us make quick decisions in familiar situations, but is also sometimes susceptible to cognitive biases and errors. Slow thinking, on the other hand, is more applicable to complex problems and unfamiliar situations. It can help us think more deeply, avoid mistakes and biases, and make more informed decisions.

        By understanding the difference between fast thinking and slow thinking, we can better recognize our thinking styles and flexibly use them when needed to improve the quality of our decision-making and thinking skills.

        In the "Tree of Thoughts" paper above, the authors propose maintaining multiple completions for any given prompt, scoring them along the way, and keeping the ones that are progressing well. Many people experimenting with prompt engineering are basically trying to restore, in the LLM, some of the abilities the human brain has. For example, AlphaGo has a policy for placing the next stone when playing Go; this policy was originally trained by imitating humans. But in addition to this policy, it also runs a Monte Carlo tree search: it plays out a number of possibilities in its head, evaluates them, and keeps only the ones that work well. Tree of Thoughts is like a text version of AlphaGo.

2.5 Chain/Agent

        The right sub-figure below is from the paper called ReAct, in which the answer to a prompt is structured as a sequence of thought, action, observation, thought, action, observation. It is a full rollout, a thinking process for answering the query, and during the actions the model is also allowed to use tools.

        On the left is Auto-GPT, a project that lets an LLM keep a task list and recursively decompose tasks. While the project is inspiring, Karpathy does not think it currently works very well and does not recommend it for real applications; he sees it mainly as a source of inspiration, something that will likely improve over time. It is like giving the model a slow-thinking mode.

2.6 Demand good performance

        The transformer's training set contains a whole range of performance qualities. For example, there might be a physics problem prompt followed by a completely wrong student solution, and also an expert answer that is perfectly correct. Transformers cannot tell low-quality from high-quality solutions on their own; they see both, and by default they are trained as language models to imitate all of it. At test time, you actually have to ask the model to perform well.

        In the example above, after trying various prompts, "let's think step by step" turns out to be powerful because it spreads reasoning over many tokens. But an even better prompt is something like: "Let's work this out step by step to be sure we have the right answer." It is like conditioning on getting the right answer, and it actually makes the transformer perform better, because the transformer no longer has to spread its probability mass over low-quality solutions. You can also use prompts such as "You are a leading expert on this topic" or "Pretend you have an IQ above 120". But do not ask for too high an IQ, because that may fall outside the data distribution, or worse, inside the distribution of science-fiction content, and the model may start doing sci-fi role-play. You have to find the right level.
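
        Such a competence-conditioning prompt might look like the sketch below (a paraphrase of the ideas above, not an exact quote):

```python
# Hypothetical prompt that conditions the model on a competent persona.
prompt = (
    "You are a leading expert on this topic.\n"
    "Let's work this out step by step to be sure we have the right answer.\n\n"
    "Question: Why does ice float on water?"
)
```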

2.7 Tools and Plugins

        When we solve problems, we rely on tools for the parts we are not good at, and the same applies to LLMs: give them a calculator, a code interpreter, the ability to search, and so on. Again, transformers may not know what they do not know by default. You may even want to tell the transformer in the prompt something like: "You are not very good at mental arithmetic. Whenever you need to add, multiply, or otherwise handle large numbers, use the calculator. Here is how to use the calculator: use this token combination, and so on." You actually have to spell it out, because by default the model does not know what it is or is not good at. We have also "moved from a retrieval-only world to one that relies entirely on LLM memory", but in between the two there is retrieval-augmented generation (RAG), which works very well in practice.

        As mentioned earlier, a transformer's context window is its working memory. The model performs very well if you can load any information relevant to the task into its working memory, where it has immediate access to it. That is why many people are very interested in retrieval-augmented generation.

        At the bottom of the image above is an example using LlamaIndex, which has data connectors for many types of data. All of this data can be indexed and made accessible to the LLM. The approximate process is as follows (a minimal code sketch follows the list):

        (1) Take the relevant documents and split them into chunks

        (2) Embed all of the chunks

        (3) Store the embeddings in a vector database

        (4) At query time, query the vector database to retrieve the chunks relevant to the task

        (5) Stuff them into the prompt and generate.
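
        The sketch below follows these five steps under toy assumptions: `embed` is a hypothetical embedding function (for example, a call to an embedding-model API), and an in-memory list stands in for a real vector database:

```python
# Minimal retrieval-augmented generation sketch (illustrative only).
import numpy as np
from typing import Callable, List, Tuple

def build_index(chunks: List[str],
                embed: Callable[[str], np.ndarray]) -> List[Tuple[str, np.ndarray]]:
    return [(chunk, embed(chunk)) for chunk in chunks]        # steps 1-3

def retrieve(query: str, index, embed: Callable[[str], np.ndarray], k: int = 3) -> List[str]:
    q = embed(query)                                          # step 4
    def score(item):
        chunk, vec = item
        return float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec) + 1e-8))
    return [chunk for chunk, _ in sorted(index, key=score, reverse=True)[:k]]

def make_prompt(query: str, retrieved: List[str]) -> str:     # step 5
    context = "\n\n".join(retrieved)
    return f"Use the following context to answer.\n\nContext:\n{context}\n\nQuestion: {query}"
```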

2.8 Fine-tuning

        While the desired results can often be achieved with prompt engineering, you can also consider fine-tuning the model, which means actually changing its weights. This is becoming easier in practice thanks to recent parameter-efficient fine-tuning techniques such as LoRA, which ensure that only small, sparse pieces of the model need to be trained. Most of the model is kept frozen at the base state and only a part is allowed to change, which lowers the computational cost. Also, because most of the model is fixed and not updated by gradient descent, those parts can be computed in very low precision, making the whole process even more efficient.
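
        One possible way to do this in practice is with LoRA via the Hugging Face peft library, sketched below (the model name and hyperparameters are illustrative assumptions; the talk only names LoRA as a technique):

```python
# Parameter-efficient fine-tuning sketch with LoRA (illustrative configuration).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")        # small stand-in base model
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                    target_modules=["c_attn"])             # GPT-2's attention projection
model = get_peft_model(base, config)
model.print_trainable_parameters()   # only a small fraction of weights is trainable
```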

        SFT (supervised fine-tuning) is relatively recommendable, because it just continues the language-modeling task, which is simple and well understood. RLHF (reinforcement learning from human feedback), on the other hand, is very much a research area and is hard to get working: it is very unstable, difficult to train, and currently not suitable for beginners, so rolling your own RLHF implementation is not recommended. Of course, this may still change quickly.

2.9 Suggestions for use

        1. The best performance currently comes from the GPT-4 model. It is by far the most capable model, so use it.

        2. Use very detailed prompts that include task context, relevant information, and instructions. Think of it this way: what would you tell a task contractor if they could not email you back for clarification? But also remember that a contractor is a human with inner thoughts and a lot of common sense; LLMs do not have these qualities, so take the "psychology" of LLMs into account and design your prompts accordingly.

        3. Refer extensively to prompt-engineering techniques, and retrieve and add any relevant context and information into the prompt. Some of this is highlighted in the slide above, but it is also a very large area; it is worth searching online for prompt-engineering guides, as there is a lot of material out there.

        4. Try few-shot prompts. Do not just tell the model what you want; show it as much as possible. Give it examples so it really understands what you mean.

        5. Use tools and plugins to offload tasks that are difficult for the LLM itself.

        6. Consider not only single prompts and answers, but also chains and reflection, how to glue them together, and how to use multiple samples, etc.

        7. If you think you have squeezed the most out of prompt engineering, try fine-tuning the model for your application, but expect this to be slower and more involved.

        8. Finally, there is an expert-level research area: RLHF, if you can make it work. It currently does work somewhat better than SFT, but again, it is very complicated. To optimize costs, explore lower-capacity models, shorter prompts, and so on.

2.10 Limitations

        Current LLMs have many limitations.

        1. The models may be biased. They can make things up, hallucinate information, and make reasoning errors; these issues haunt an entire category of applications.

        2. Knowledge cutoff. For example, ChatGPT may not know any information after September 2021.

        3. They are vulnerable to a wide range of attacks, new ones of which are published on Twitter every day, including prompt injection, jailbreaks, and data-poisoning attacks.

        The advice is to use LLMs in lower-risk applications, always combine them with human supervision, use them as a source of inspiration and suggestions, and think in terms of "copilot" usage rather than fully autonomous agents executing tasks.

3. Summary

        This article introduced how the GPT model is trained, covering the four stages of pre-training, supervised fine-tuning, reward modeling, and reinforcement learning. Models fine-tuned in different ways have different characteristics: the base model is more creative, while reinforcement-learning fine-tuning yields the highest-quality responses. When using GPT, apply prompt-engineering techniques such as providing sufficient context and guiding the model to think step by step. Current models still have various limitations, and their potential risks should not be underestimated, so they should be applied carefully and effectively to low-risk collaborative scenarios. Overall, this article systematically walked through the GPT training pipeline and the key techniques for applying it.

