Paper Reading: A Survey of Large Language Models (Part 2)

pre-training

Pre-training is fundamental to the abilities of LLMs. Through pre-training on large-scale corpora, LLMs acquire basic language understanding and generation capabilities. For pre-training, the size and quality of the corpus are crucial; in addition, effectively pre-training LLMs requires well-designed model architectures, acceleration methods, and optimization techniques. Specifically, this article discusses data collection and processing in Section 4.1, introduces commonly used model architectures in Section 4.2, and finally presents training techniques for stably and efficiently optimizing LLMs in Section 4.3.

data collection

Compared with small-scale language models, LLMs have a greater need for high-quality data for pre-training, and their model capacity largely depends on the pre-training corpus and how it is pre-processed. This article discusses the collection and processing of pre-training data, including data sources, pre-processing methods, and an analysis of how pre-training data affects the performance of LLMs.

data source

In order to develop a capable LLM, collecting a large amount of natural language corpus from various data sources is key. Existing LLMs mainly use a variety of public text datasets as the pre-training corpus. Figure 2 shows the distribution of pre-training data sources for several existing LLMs. The sources of pre-training corpora can be broadly divided into two categories: general data and specialized data. Because of its large scale, diversity, and easy accessibility, most LLMs use general-purpose data, such as web pages, books, and dialogue text, which can enhance the language modeling and generalization capabilities of LLMs. Considering the excellent generalization ability shown by LLMs, there are also studies that extend the pre-training corpus to more specialized datasets, such as multilingual data, scientific data, and code, which endow LLMs with specific task-solving abilities. Next, we describe these two types of pre-training data and their impact on LLMs. For a detailed introduction to commonly used corpora, see Section 3.2.

General data. General-purpose data used by LLMs includes web pages, dialogue text, and books. Among these, web pages are the most common source. Although web data covers a wide range of sources, it mixes high-quality and low-quality text, so it must be filtered and processed before use. Dialogue data can enhance the conversational ability of LLMs and improve their performance on question answering tasks; online dialogue data often involves multiple participants, so it typically needs to be transformed into a tree structure. Books are an important source of formal long-form text that can help LLMs learn linguistic knowledge, model long-term dependencies, and generate coherent text. Currently, open-source book data usually comes from datasets such as Books3 and Bookcorpus2.

Specialized data is used to improve the specific capabilities of LLMs on downstream tasks. The first type is multilingual text, including text in the target language and multilingual corpora, which can enhance multilingual understanding and generation. Scientific text is intended to enhance LLMs' understanding of scientific knowledge; by training on a large amount of scientific text, LLMs can achieve impressive performance on scientific and reasoning tasks. Code data is another kind of training data that can improve the quality and accuracy of generated programs. Unlike natural language text, code is written in programming languages, with long-range dependencies and precise execution logic. By training on code data, LLMs can acquire more complex reasoning capabilities.
[Figure 2: distribution of pre-training data sources of existing LLMs]

data preprocessing

After collecting a large amount of text data and before building the pre-training corpus, the data must be preprocessed, especially to remove noisy, redundant, irrelevant, and potentially harmful data, which could otherwise greatly affect the capability and performance of LLMs. This article reviews detailed data preprocessing strategies for improving the quality of the collected data. A typical preprocessing pipeline for pre-training data is illustrated in Figure 3.

[Figure 3: a typical preprocessing pipeline for pre-training data]

Quality filtering. Existing work generally employs classifier-based and heuristic approaches. The classifier-based method trains a binary classifier with high-quality text as positive examples and sampled candidate data as negative examples, and predicts a quality score for each instance. However, this method may accidentally remove high-quality text written in dialects, colloquial language, or sociolectal styles, leading to bias and reduced diversity in the pre-training corpus. Heuristic methods eliminate low-quality text through a set of well-designed rules, including language-based filtering, metric-based filtering, statistics-based filtering, and keyword-based filtering. Language-based filtering removes text in languages not required by the task; metric-based filtering uses evaluation metrics on generated text to detect and remove unnatural sentences; statistics-based filtering measures statistical properties of the text to filter out low-quality data; and keyword-based filtering identifies and removes useless or noisy elements according to a specific keyword set.
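As a rough illustration of the heuristic rules described above, the following Python sketch applies a few simple filters (minimum length, symbol ratio, and a keyword blocklist). The thresholds and keywords are illustrative assumptions, not values used by any particular LLM.

```python
def passes_heuristic_filters(text: str,
                             min_words: int = 50,
                             max_symbol_ratio: float = 0.1,
                             blocklist: tuple = ("lorem ipsum", "click here")) -> bool:
    """Return True if a document survives a few illustrative quality rules."""
    words = text.split()
    if len(words) < min_words:                              # statistics-based: too short
        return False
    symbols = sum(not (c.isalnum() or c.isspace()) for c in text)
    if symbols / max(len(text), 1) > max_symbol_ratio:      # statistics-based: too many symbols
        return False
    lowered = text.lower()
    if any(kw in lowered for kw in blocklist):              # keyword-based filtering
        return False
    return True

corpus = ["A long, well-formed paragraph of natural language text ...", "click here!!!"]
cleaned = [doc for doc in corpus if passes_heuristic_filters(doc)]
```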

Deduplication. Repeated data reduces the diversity seen by the language model, may make the training process unstable, and can hurt model performance. Deduplication can be performed at the sentence level, document level, and dataset level. Concretely, this includes removing low-quality sentences that contain repeated words and phrases, and using the overlap ratio of surface features to detect and delete documents with similar content. In addition, overlap between the training set and the evaluation set must be avoided, which is handled by removing possible duplicate text from the training set. Deduplication at all three levels can be used together to improve the training of pre-trained language models.
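The document-level idea of measuring surface-feature overlap can be sketched as follows. The n-gram length and overlap threshold are illustrative assumptions, and a real pipeline would use scalable techniques such as MinHash rather than this quadratic comparison.

```python
def ngram_set(text: str, n: int = 8) -> set:
    """Represent a document by its set of word n-grams (a simple surface feature)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / max(len(a | b), 1)

def deduplicate(docs: list, threshold: float = 0.8) -> list:
    """Keep a document only if its overlap with every kept document stays below threshold."""
    kept, signatures = [], []
    for doc in docs:
        sig = ngram_set(doc)
        if all(jaccard(sig, other) < threshold for other in signatures):
            kept.append(doc)
            signatures.append(sig)
    return kept
```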

Privacy redaction. In pre-training text data, most of the content comes from the web, including user-generated content with sensitive or personal information, which may increase the risk of privacy leakage. Therefore, personally identifiable information (PII) should be removed from the pre-training corpus. A straightforward and effective approach is rule-based, such as keyword spotting, to detect and remove PII such as names, addresses, and phone numbers. In addition, researchers have found that the vulnerability of LLMs to privacy attacks can be attributed to duplicated PII in the pre-training corpus, so deduplication can also reduce privacy risks to some extent.
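A minimal rule-based redaction sketch along these lines uses regular expressions for a few common PII categories; the patterns are illustrative and far from exhaustive.

```python
import re

# Illustrative PII patterns; a production pipeline would use far more robust detectors.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "IP":    re.compile(r"\b\d{1,3}(\.\d{1,3}){3}\b"),
}

def redact_pii(text: str) -> str:
    """Replace detected PII spans with a category placeholder."""
    for tag, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{tag}>", text)
    return text

print(redact_pii("Contact me at jane.doe@example.com or +1 415-555-0123."))
```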

Tokenization. The purpose of tokenization is to split raw text into sequences of individual tokens, which are then used as the input of LLMs. While it is convenient to reuse an existing tokenizer (e.g., OPT and GPT-3 reuse GPT-2's tokenizer), a tokenizer specially designed for the pre-training corpus (e.g., built with SentencePiece) can bring great benefits, especially for corpora consisting of multiple domains, languages, and formats. Some recent LLMs therefore train custom tokenizers with the Byte Pair Encoding (BPE) algorithm to ensure that no information is lost after tokenization. Note, however, that normalization techniques in BPE (such as NFKC) may slightly degrade tokenization performance.
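For example, a custom BPE tokenizer can be trained on the cleaned corpus with the SentencePiece library; the file paths, vocabulary size, and coverage setting below are illustrative assumptions.

```python
import sentencepiece as spm

# Train a BPE tokenizer on a plain-text corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="corpus.txt",          # hypothetical path to the cleaned pre-training text
    model_prefix="bpe_32k",
    vocab_size=32000,
    model_type="bpe",
    character_coverage=0.9995,   # high coverage helps multilingual corpora
)

sp = spm.SentencePieceProcessor(model_file="bpe_32k.model")
print(sp.encode("Large language models are trained on massive corpora.", out_type=str))
```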

Effect of pre-training data on LLMs

Unlike small-scale PLMs, it is often not possible to iterate the pre-training of LLMs multiple times due to the large computational resources required. Therefore, it is very important to construct a well-prepared pre-training corpus before training LLMs. In this section, we discuss how the quality and distribution of pre-trained corpora may affect the performance of LLMs.

Mixture of sources. Text data from different sources has different linguistic features and covers different semantic knowledge. By mixing text data from different sources for pre-training, LLMs acquire a broad scope of knowledge and exhibit strong generalization ability. When mixing sources, the distribution of pre-training data needs to be set carefully, as it can also affect the performance of LLMs on downstream tasks. Gopher conducted ablation experiments on the data distribution to study the impact of mixed sources on downstream tasks. The results show that increasing the proportion of book data improves the ability of LLMs to capture long-term dependencies in text, while increasing the proportion of C4 data improves performance on the C4 validation set. However, training on too much data from a single domain degrades the generalization ability of LLMs in other domains. Researchers are therefore advised to carefully determine the proportion of data from different domains when developing LLMs that meet their specific needs. Readers can refer to Figure 2 for a comparison of the data sources of different LLMs.
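A minimal sketch of source mixing: the source of each document in a batch is drawn according to preset proportions. The weights below are hypothetical and not those of any specific LLM.

```python
import random

# Hypothetical per-source sampling weights, loosely echoing the kinds of mixtures in Figure 2.
MIXTURE = {"web": 0.60, "books": 0.15, "code": 0.10, "dialogue": 0.10, "scientific": 0.05}
rng = random.Random(0)

def sample_sources(batch_size: int) -> list:
    """Draw the data source for each document in a batch according to the mixture."""
    sources, weights = zip(*MIXTURE.items())
    return rng.choices(sources, weights=weights, k=batch_size)

print(sample_sources(8))
```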

The amount of pre-training data. The amount of pre-training data is important for training an effective LLM. It has been found that as the number of model parameters increases, the amount of data required for training also increases: there is a scaling law for data size, similar to the one for model size, that relates it to model performance. Some existing LLMs suffer from suboptimal training due to insufficient pre-training data. Extensive experiments have shown that scaling the model parameters and the number of training tokens in equal proportion is necessary to adequately train a model under a given compute budget. Recent studies have also shown that smaller models can achieve good performance by increasing the amount of data and extending the training time. Therefore, researchers are advised to pay attention to the amount of high-quality data needed to adequately train a model of a given parameter scale.

The importance of pre-training data quality. Studies have shown that pre-training on corpora containing noisy, toxic, or duplicated data may degrade model performance. Therefore, not only the quantity but also the quality of the training data must be considered. Recent studies comparing models trained on filtered versus unfiltered corpora conclude that training language models on cleaned data improves performance. In addition, data duplication may lead to the "double descent" phenomenon (performance first deteriorating and then improving) and may even harm the generalization ability of the model. Therefore, the pre-training corpus should be preprocessed carefully and prudently to improve the stability of the training process and avoid degrading model performance.

model architecture

This section introduces the architectural design of LLMs, including the mainstream architectures, pre-training objectives, and detailed configurations. Table 3 presents the model cards of several representative LLMs with publicly available details.

mainstream architecture

Because the Transformer architecture has excellent parallelism and capacity, it has become the de facto standard backbone for developing various language models, enabling them to scale to tens or hundreds of billions of parameters. Overall, the mainstream architectures of existing language models can be roughly classified into three main types: the encoder-decoder architecture, the causal decoder architecture, and the prefix decoder architecture.

The encoder-decoder architecture is that of the vanilla Transformer, which consists of two stacks of Transformer blocks serving as the encoder and the decoder. The encoder encodes the input sequence with stacked multi-head self-attention layers to generate its latent representations, while the decoder applies cross-attention to these representations and autoregressively generates the target sequence. Encoder-decoder PLMs (such as T5 and BART) have shown effectiveness on various NLP tasks, but only a few LLMs are built on the encoder-decoder architecture, such as Flan-T5. A detailed discussion of architectural choices is given in Section 4.2.4.

The causal decoder architecture uses a unidirectional attention mask, which ensures that each input token can only attend to past tokens and itself. Input and output tokens are processed in the same way by the decoder. This architecture is the basis of some representative language models (such as the GPT-series models). GPT-3 successfully demonstrated the effectiveness of this architecture while also exhibiting surprising in-context learning capabilities. Interestingly, GPT-1 and GPT-2 did not exhibit the remarkable abilities of GPT-3, so scaling appears to play an important role in increasing the capacity of this architecture. So far, the causal decoder has been widely adopted in existing LLM designs. Note that both the causal decoder and the prefix decoder discussed below belong to the decoder-only architecture; in the existing literature, "decoder-only architecture" mainly refers to the causal decoder architecture unless otherwise specified.

The prefix decoder architecture employs a modified masking mechanism to enable bidirectional attention over prefix tokens and unidirectional attention over generated tokens only. Like the encoder-decoder architecture, it can bidirectionally encode the prefix sequence and autoregressively predict the output tokens one by one, sharing the same parameters for encoding and decoding. Instead of pre-training from scratch, a practical suggestion is to continually train a causal decoder and then convert it into a prefix decoder to accelerate convergence. Existing representative LLMs based on prefix decoders include GLM-130B and U-PaLM.
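The difference between the causal decoder and the prefix decoder comes down to the attention mask. A small PyTorch sketch of the two masks, where True marks positions that may be attended to:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Each position may attend only to itself and earlier positions."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def prefix_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """Prefix tokens attend bidirectionally; generated tokens attend causally."""
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True   # full attention inside the prefix
    return mask

print(causal_mask(4).int())
print(prefix_mask(4, prefix_len=2).int())
```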

All three architectures above can be extended via mixture-of-experts (MoE) scaling, in which a subset of the neural network weights is sparsely activated for each input, as in Switch Transformer and GLaM. Studies have shown that significant performance gains can be observed by increasing either the number of experts or the total parameter size.

detailed configuration

Since the Transformer was released, various improvements have been proposed to enhance its training stability, performance, and computational efficiency. This article discusses the corresponding configurations of four main parts of the Transformer: normalization, position embeddings, activation functions, and attention and bias.

Training instability is a common problem when pre-training LLMs, and **Layer Normalization (Layer Norm, LN)** is widely used in the Transformer architecture to alleviate it. The position of LN matters greatly to the performance of LLMs. Although the original Transformer uses post-LN, most LLMs adopt pre-LN for more stable training, despite a slight decrease in performance. Building on pre-LN, Sandwich-LN adds an extra LN before the residual connections to avoid value explosion. However, Sandwich-LN sometimes fails to stabilize the training of LLMs and may lead to training collapse. Recently, several advanced normalization techniques have been proposed as alternatives to LN. RMSNorm is adopted in Gopher and Chinchilla because of its advantages in training speed and performance. DeepNorm shows a better ability to stabilize training and has been adopted by GLM-130B together with post-normalization. Furthermore, adding an extra LN after the embedding layer can also stabilize the training of LLMs, but it often leads to a significant performance drop, so it has been removed in several recent LLMs.
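A minimal PyTorch sketch of RMSNorm, which, unlike standard LayerNorm, omits mean-centering and the bias term:

```python
import torch
from torch import nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescale activations by their RMS, with a learned gain."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * x * inv_rms

x = torch.randn(2, 5, 64)
print(RMSNorm(64)(x).shape)   # torch.Size([2, 5, 64])
```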

Activation functions also play a crucial role in the performance of the feed-forward networks. The most commonly used activation function is GeLU. In recent LLMs such as PaLM and LaMDA, variants of GLU, especially SwiGLU and GeGLU, are also used. These variants achieve better performance in practice, but they require about 50% more parameters in the feed-forward network than GeLU.
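A sketch of a SwiGLU feed-forward block; the extra gate projection is where the roughly 50% additional parameters come from (the hidden size shown is arbitrary):

```python
import torch
from torch import nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Gated feed-forward block: SiLU(x W_gate) * (x W_up), followed by a down projection."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

print(SwiGLUFeedForward(64, 172)(torch.randn(2, 5, 64)).shape)
```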

Position embeddings. Because the self-attention modules in the Transformer are permutation-equivariant, position embeddings are used to inject absolute or relative position information for modeling sequences. The vanilla Transformer has two variants of absolute position embeddings, sinusoidal functions and learned position embeddings, of which the latter is commonly used in LLMs. Unlike absolute position embeddings, relative position encodings generate embeddings based on the offsets between keys and queries, so they can perform well on sequences longer than those seen during training, i.e., they extrapolate well. ALiBi biases attention scores with a penalty based on the distance between keys and queries, and its zero-shot generalization has been shown to be better than that of other position encodings. Furthermore, by setting specific rotation matrices based on absolute positions, RoPE computes the scores between keys and queries with relative position information, which is useful for modeling long sequences. RoPE has therefore been widely adopted in state-of-the-art LLMs.
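A simplified sketch of RoPE in the rotate-half formulation: each pair of channels is rotated by an angle proportional to the absolute position, so query-key dot products depend only on the relative offset. This is an illustrative implementation, not the exact code of any particular LLM.

```python
import torch

def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to a (seq_len, dim) tensor; dim must be even."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)      # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs   # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(8, 64)               # 8 positions, head dimension 64
print(rotary_embedding(q).shape)     # torch.Size([8, 64])
```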

Attention and bias. Beyond the full self-attention of the original Transformer, GPT-3 adopts sparse attention to reduce computational complexity. To better handle long sequences, special attention patterns or GPU-memory-aware attention mechanisms have been introduced. In addition, most language models keep the biases in each dense kernel and in layer normalization, but some models (such as PaLM and Galactica) remove the biases, showing that removing them can enhance the training stability of LLMs.

To summarize, for stronger generalization and training stability, it is suggested to use pre-RMSNorm for layer normalization and SwiGLU or GeGLU as the activation function. However, LN should not be used immediately after embedding layers, as this may cause performance degradation. In addition, for position embeddings, RoPE or ALiBi is a better choice, since they perform better on long sequences.

pre-training tasks

Pre-training is the key to encoding general knowledge from large-scale corpora into massive model parameters. For training LLMs, there are two commonly used pre-training tasks: language modeling and denoising autoencoding.

Language modeling is the most common task for pre-training decoder-only LLMs such as GPT-3 and PaLM. The goal of this task is to predict the next token based on the preceding tokens. Since many natural language processing tasks can be cast as prediction problems conditioned on the input, these decoder-only LLMs may implicitly learn how to accomplish such tasks in a unified LM fashion. An important variant is the prefix language modeling task, designed for pre-training models with the prefix-decoder architecture; tokens within a randomly selected prefix are not used to compute the loss. For the same number of tokens seen during pre-training, prefix language modeling performs slightly worse than standard language modeling, because fewer tokens in each sequence contribute to the loss.
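The language modeling objective reduces to shifting the targets by one position and applying cross-entropy, as in this sketch:

```python
import torch
import torch.nn.functional as F

def language_modeling_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Causal LM objective: predict token t+1 from tokens up to t.
    logits: (batch, seq_len, vocab); input_ids: (batch, seq_len)."""
    shift_logits = logits[:, :-1, :]    # predictions for positions 0..T-2
    shift_labels = input_ids[:, 1:]     # targets are the next tokens
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1))

logits = torch.randn(2, 16, 1000)
ids = torch.randint(0, 1000, (2, 16))
print(language_modeling_loss(logits, ids))
```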

Denoising autoencoding (DAE) is another task for pre-training language models: some spans of the text are randomly replaced, and the model is trained to recover the replaced parts. Compared with conventional language modeling, the DAE task is more complicated to implement, so it has not been widely used for pre-training LLMs; existing models that use DAE as the pre-training objective include T5 and GLM-130B.

Summary Discussion

Pre-training with the LM objective appears to achieve superior zero-shot and few-shot generalization, and the performance of causal decoders can be greatly improved by scaling the model size, dataset size, and total compute. Since detailed investigations of encoder-decoder models are still lacking, more research is needed to analyze how the choice of architecture and pre-training task affects the capabilities of LLMs, especially for encoder-decoder architectures. Beyond the main architecture, the detailed configuration of an LLM also deserves attention.

model training

optimization settings

To optimize the parameters of LLMs, we introduce commonly used settings for batch training, learning rate, optimizer, and training stability.

batch training. For language model pre-training, existing work usually sets the batch size to a large number (e.g., 8,196 examples or 1.6M tokens) to improve training stability and throughput. LLMs such as GPT-3 and PaLM introduce a strategy that dynamically increases the batch size during training, eventually reaching the million-token scale. Specifically, the batch size of GPT-3 gradually increases from 32K to 3.2M tokens. Empirical results show that dynamically scheduling the batch size can effectively stabilize the training process of LLMs [56].

learning rate. Existing LLMs usually adopt a similar learning rate schedule during pre-training, with warm-up and decay strategies. Specifically, during the initial 0.1% to 0.5% of training steps, a linear warm-up schedule gradually increases the learning rate to a maximum value in the range of roughly 5×10^-5 to 1×10^-4 (e.g., 6×10^-5 for GPT-3). Then, a cosine decay strategy is employed in subsequent steps to gradually reduce the learning rate to about 10% of its maximum value, until the training loss converges.
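A sketch of the warm-up plus cosine-decay schedule described above; the peak learning rate and warm-up fraction are illustrative GPT-3-style values.

```python
import math

def lr_at_step(step: int, total_steps: int, max_lr: float = 6e-5,
               warmup_frac: float = 0.003, min_ratio: float = 0.1) -> float:
    """Linear warm-up to max_lr, then cosine decay down to ~10% of the maximum."""
    warmup_steps = max(int(total_steps * warmup_frac), 1)
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return max_lr * (min_ratio + (1 - min_ratio) * cosine)

print([round(lr_at_step(s, 10000), 7) for s in (0, 30, 5000, 9999)])
```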

optimizer. The Adam optimizer [168] and the AdamW optimizer [169] are widely used for training LLMs (e.g., GPT-3); they are first-order optimization methods based on adaptive estimates of lower-order moments. Their hyperparameters are usually set as β1 = 0.9, β2 = 0.95, and ε = 10^-8. Meanwhile, the Adafactor optimizer [170] is also used to train LLMs (e.g., PaLM and T5); it is a variant of Adam specifically designed to save GPU memory during training. Its hyperparameters are set as β1 = 0.9 and β2 = 1.0 − k^-0.8, where k denotes the number of training steps.

training stability. Training instability often occurs during the pre-training of LLMs and may cause the model to collapse. To address this problem, weight decay and gradient clipping are widely used; existing studies commonly set the gradient clipping threshold to 1.0 and the weight decay rate to 0.1. However, as LLMs are scaled up, spikes in the training loss become more likely, leading to unstable training. To alleviate this, PaLM and OPT use a simple strategy of restarting training from an earlier checkpoint before the spike and skipping over the data that may have caused the problem. Furthermore, GLM finds that abnormal gradients of the embedding layer often lead to loss spikes and proposes shrinking the embedding-layer gradients to alleviate the problem.
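Putting the optimizer and stability settings together, a minimal PyTorch sketch with AdamW (β1 = 0.9, β2 = 0.95, ε = 10^-8), weight decay 0.1, and gradient clipping at 1.0; the tiny linear model stands in for an actual LLM.

```python
import torch

model = torch.nn.Linear(512, 512)   # stand-in for an LLM

# AdamW with commonly reported LLM hyperparameters, plus weight decay and clipping.
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-5,
                              betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1)

loss = model(torch.randn(4, 512)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # clip gradients at 1.0
optimizer.step()
optimizer.zero_grad()
```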


scalable training techniques

As model and data sizes increase, it becomes challenging to efficiently train LLMs with limited computing resources. In particular, two major technical issues need to be addressed: increasing training throughput and loading larger models into GPU memory. This section reviews several approaches widely used in existing work to address these two challenges, including 3D parallelism, ZeRO, and mixed-precision training, and gives general suggestions for leveraging them in training. 3D parallelism is a combination of three commonly used parallel training techniques: data parallelism, pipeline parallelism, and tensor parallelism. ZeRO addresses the memory redundancy problem in data parallelism. Mixed-precision training uses lower-precision floating-point numbers to reduce memory usage and communication overhead. These techniques can be combined to improve training throughput and enable loading large models. As for improving inference speed, quantization techniques can reduce the time and space costs of LLMs.
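Of these techniques, mixed-precision training is the easiest to show in a few lines; the sketch below uses PyTorch's automatic mixed precision with fp16 and a loss scaler (a CUDA device is assumed, and the linear layer is a stand-in for a real model).

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()   # loss scaling keeps fp16 gradients from underflowing

for _ in range(10):
    x = torch.randn(8, 1024, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):   # forward pass in half precision
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```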

Adaptation tuning of LLMs

After pre-training, LLMs acquire the general ability to solve various tasks. However, a growing body of research shows that their capabilities can be further adapted to specific goals. This section presents two main approaches to adapting pre-trained LLMs: instruction tuning and alignment tuning. The former mainly aims to enhance (or unlock) the abilities of LLMs, while the latter aims to align the behavior of LLMs with human values or preferences. Next, we introduce these two approaches in detail.

instruction tuning

formatted instance construction

The two main approaches to constructing formatted instances are formatting existing datasets and formatting real human needs, and in both cases a task description written in natural language is the key factor that guides LLMs to understand the task. Increasing the number and diversity of tasks improves the generalization ability of LLMs, and using a variety of task descriptions, including those reflecting what people actually need, further improves performance. Formatting design also has a large influence: for example, the instructions may include demonstrations and problem-solving steps, while other components of the instructions should be kept to a minimum. When annotating instances of real human needs, a balance between diversity and quantity should be considered. An example instance is sketched below.
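A formatted instance typically pairs a natural-language task description with an optional input and the desired output, for example (the field names follow common open instruction datasets and are not prescribed by the survey):

```python
import json

# A hypothetical instruction-formatted instance.
instance = {
    "instruction": "Translate the following sentence into French.",
    "input": "The weather is nice today.",
    "output": "Il fait beau aujourd'hui.",
}

# Flatten it into the prompt/response pair actually used for instruction tuning.
prompt = f"{instance['instruction']}\n{instance['input']}\n"
target = instance["output"]
print(json.dumps(instance, ensure_ascii=False, indent=2))
```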

instruction tuning strategies

Compared with pre-training, instruction tuning is more efficient because it only uses a moderate number of instances for training. The data distribution across tasks needs to be balanced during instruction tuning; a common approach is to mix all datasets and sample instances from the mixture equally, and increasing the sampling ratio of high-quality collections can further improve performance. In addition, instruction tuning can be combined with pre-training by using both plain-text data and formatted data for multi-task learning, which combines the advantages of pre-training and instruction tuning.

the effect of instruction tuning

This part introduces the effect of instruction tuning on language models. Instruction tuning enables a language model to understand natural-language instructions for completing tasks, and even to follow instructions on unseen tasks with excellent performance. It can not only improve model performance but also alleviate some weaknesses of the model and improve its ability to solve practical problems. At the same time, instruction tuning can give the model the ability to generalize across languages, for example completing multilingual tasks using only English instructions, which reduces the instruction-engineering workload. The effect of instruction tuning has been confirmed by many studies; it is a general method for improving the capabilities of language models and brings significant improvements to models of different sizes, architectures, and pre-training objectives, while being far cheaper than pre-training.


Alignment Tuning

This section first introduces the background, definition, and criteria of alignment, then focuses on collecting human feedback data for aligning LLMs, and finally discusses the key technique for alignment tuning: reinforcement learning from human feedback.

background

LLMs have shown remarkable abilities on NLP tasks, but they sometimes exhibit unintended behaviors, and the concept of human alignment has been proposed to make their behavior conform to human expectations. However, unlike the original pre-training and adaptation tuning (e.g., instruction tuning), alignment requires considering very different criteria. In recent years, increasing attention has been paid to developing diverse criteria to regulate the behavior of LLMs. The article gives three representative alignment criteria, helpful, honest, and harmless, and explains their meanings. These criteria are quite subjective and grounded in human cognition, so it is difficult to formulate them directly as optimization objectives for LLMs. A promising technique is red teaming, which probes LLMs in an adversarial manner to elicit harmful outputs and then updates the LLMs to prevent such outputs.

collecting human feedback

In the pre-training stage, LLMs are trained on a large-scale corpus with the language modeling objective. However, this objective does not take into account humans' subjective and qualitative evaluations of LLM outputs (referred to as human feedback in this survey). High-quality human feedback is important for aligning LLMs with human preferences and values. This section discusses how to select a team of human labelers for collecting feedback data, and introduces three main approaches to collecting feedback and preference data: ranking-based, question-based, and rule-based approaches. In addition, reinforcement learning from human feedback (RLHF), which is widely used in recent powerful LLMs, is introduced.

reinforcement learning from human feedback

To align LLMs with human values, reinforcement learning from human feedback (RLHF) has been proposed. RLHF uses reinforcement learning algorithms to adapt the LM to human feedback by learning a reward model. The RLHF system mainly consists of three key components: a pre-trained LM, a reward model learned from human feedback, and a reinforcement learning algorithm for training the LM. RLHF follows a three-step process: supervised fine-tuning, reward model training, and RL fine-tuning. This process can be iterated multiple times to better align the LM.
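At the core of the reward-model training step is a pairwise ranking loss: the reward assigned to the human-preferred response should exceed that of the rejected one. A minimal sketch, with illustrative scalar rewards:

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards: torch.Tensor,
                         rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Encourage the reward model to score preferred responses above rejected ones."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

chosen = torch.tensor([2.1, 0.7])     # rewards for human-preferred responses
rejected = torch.tensor([0.3, 1.0])   # rewards for dispreferred responses
print(pairwise_reward_loss(chosen, rejected))
```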

utilization

After pre-training or adaptation tuning, LLMs are used to solve various tasks by designing suitable prompting strategies. A typical prompting method is in-context learning, which formulates the task description and/or demonstrations in natural language text. Furthermore, in-context learning can be enhanced with chain-of-thought prompting, which includes a sequence of intermediate reasoning steps in the prompt. Next, we elaborate on the details of these two techniques.

in-context learning

This part introduces in-context learning (ICL) as a special form of prompting. ICL uses a formatted natural-language prompt to elicit LLMs to recognize and perform new tasks, without explicit gradient updates. The effectiveness of ICL depends heavily on the demonstrations, so designing them properly in the prompt is an important issue. This article focuses on how to apply ICL to LLMs, mainly covering demonstration design and the underlying mechanism of ICL. In addition, ICL is closely related to instruction tuning, but instruction tuning adapts LLMs to tasks by fine-tuning, whereas ICL only prompts LLMs at inference time.

The effectiveness of ICL is strongly affected by demonstration design, which can be considered from three main aspects: demonstration selection, format, and order. Demonstration selection can be heuristic or LLM-based; the latter can directly assess the informativeness of candidate examples. The demonstration format can enhance the reasoning ability of LLMs by adding task descriptions or using chain-of-thought prompts. The demonstration order needs to take into account the recency bias of LLMs, which can be addressed by similarity-based or information-theoretic methods.
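A small sketch of how an ICL prompt is assembled from a task description, the selected and ordered demonstrations, and the new query; the demonstrations and wording are illustrative.

```python
def build_icl_prompt(task_description: str, demonstrations: list, query: str) -> str:
    """Format a few-shot prompt: description, demonstrations, then the new query.
    Which demonstrations are chosen and in what order is where the design choices matter."""
    lines = [task_description]
    for question, answer in demonstrations:
        lines.append(f"Q: {question}\nA: {answer}")
    lines.append(f"Q: {query}\nA:")
    return "\n\n".join(lines)

demos = [("What is 2 + 3?", "5"), ("What is 7 + 6?", "13")]
print(build_icl_prompt("Answer the arithmetic question.", demos, "What is 9 + 4?"))
```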

This part discusses the ICL capability of LLMs. After pre-training, LLMs can exhibit intriguing ICL ability without being updated. Two key questions are explored: how pre-training affects the ICL ability, and how LLMs perform ICL at inference time. The design of the pre-training tasks and the source of the pre-training corpus are important factors affecting the ICL ability of LLMs. At the inference stage, LLMs can be viewed as generating meta-gradients through the attention mechanism and implicitly performing gradient descent, which produces the effect of ICL. Results also show that certain attention heads of LLMs can perform task-agnostic atomic operations that are closely related to ICL ability, so this process can be abstracted as a learning algorithm. It has also been shown that LLMs can effectively learn some complex functions, such as decision trees, through ICL without parameter updates.

chain-of-thought prompting

This part introduces an improved prompting strategy, Chain-of-Thought (CoT) prompting, which improves the performance of LLMs on complex reasoning tasks such as arithmetic reasoning, commonsense reasoning, and symbolic reasoning. Unlike plain ICL prompts, CoT incorporates intermediate reasoning steps that lead to the final output into the prompt. The following elaborates on how CoT is used with ICL and when and why CoT prompting is effective.

Combining Chain-of-Thought (CoT) prompting with in-context learning (ICL) can effectively improve the reasoning ability of LLMs, especially on tasks that require step-by-step reasoning, such as arithmetic reasoning, commonsense reasoning, and symbolic reasoning. CoT prompting can be used in Few-shot CoT and Zero-shot CoT settings, both of which can effectively improve model performance. In addition, the article discusses the mechanism behind the CoT ability and the impact of the components of CoT prompts. It should be noted that the effect of CoT prompting is related to model size and task type, and the design and use of CoT prompts needs to consider multiple factors.
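The two settings differ only in the prompt text: few-shot CoT prepends worked examples with reasoning steps, while zero-shot CoT appends a trigger phrase such as "Let's think step by step." A sketch with illustrative questions:

```python
FEW_SHOT_COT = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls does he have now?
A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.

Q: {question}
A:"""

ZERO_SHOT_COT = "Q: {question}\nA: Let's think step by step."

question = "A cafeteria had 23 apples. It used 20 and bought 6 more. How many apples are left?"
print(FEW_SHOT_COT.format(question=question))
print(ZERO_SHOT_COT.format(question=question))
```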


Origin blog.csdn.net/u010095372/article/details/129966234