Overview of Large Language Models (6): Model Use



After pre-training or adaptation tuning, a major approach to using large language models is to design prompting strategies suited to solving various tasks. A typical prompting method is in-context learning, which formulates the task description and/or demonstrations in the form of natural language text. In addition, chain-of-thought prompting can be used to enhance in-context learning by incorporating a series of intermediate reasoning steps into the prompt. Next, we elaborate on the details of these two techniques.

In-Context Learning

As a special form of prompting, in-context learning (ICL) was first proposed together with GPT-3, which has become a typical method to exploit large language models.

Prompt Formulation

As described in "Language models are few-shot learners ", ICL uses formatted natural language prompts , including task descriptions or some task examples as demonstrations . Figure shows a schematic representation of the ICL. First, starting with the task description, some examples are selected from the task dataset as demonstrations. They are then combined in a specific order to form natural language prompts with specially designed templates. Finally, the test instance is attached to the demo as input to the output generated by the large language model. Based on task demonstrations, large language models can recognize and perform new tasks without explicit gradient updates.

Figure: a comparative illustration of in-context learning (ICL) and chain-of-thought (CoT) prompting. ICL prompts the large language model with a natural language task description, several demonstrations, and a test query, whereas CoT prompting additionally includes a series of intermediate reasoning steps in the prompt.

Formally, let $D_k = \{f(x_1, y_1), \ldots, f(x_k, y_k)\}$ denote a set of demonstrations with $k$ examples, where $f(x_k, y_k)$ is the prompt function that converts the $k$-th task example into a natural language prompt. Given the task description $I$, the demonstration $D_k$, and a new input query $x_{k+1}$, the prediction of the output $\hat{y}_{k+1}$ generated by the large language model can be expressed as follows:

$$\mathrm{LLM}\big(I,\; f(x_1, y_1), \ldots, f(x_k, y_k),\; f(x_{k+1},\ \_\_)\big) \rightarrow \hat{y}_{k+1},$$

where the actual answer $y_{k+1}$ is left blank and is predicted by the large language model. Since the performance of ICL depends heavily on the demonstrations, designing them properly in the prompt is an important issue. Following the construction process in the formula above, we focus on three aspects of formatting demonstrations in prompts: how to select the examples that form the demonstration, how to format each example into a prompt with the function $f(\cdot)$, and how to arrange the demonstrations in a reasonable order.
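To make the formulation concrete, here is a minimal Python sketch that assembles an ICL prompt from a task description, k demonstrations, and a test query in exactly this order. The sentiment-classification template, the example data, and the commented-out generate() call standing in for a large language model are illustrative assumptions, not part of the original formulation.

```python
def format_example(x, y=""):
    # Prompt function f(.): turn one task example into natural language text.
    # The query's answer slot is left blank for the model to fill in.
    return f"Review: {x}\nSentiment: {y}".rstrip()

def build_icl_prompt(task_description, demonstrations, query):
    # Concatenate I, f(x1, y1), ..., f(xk, yk), and f(x_{k+1}, _) in order.
    parts = [task_description]
    parts += [format_example(x, y) for x, y in demonstrations]
    parts.append(format_example(query))
    return "\n\n".join(parts)

demos = [("The film was a delight.", "positive"),
         ("I wanted my money back.", "negative")]
prompt = build_icl_prompt(
    "Classify the sentiment of each review as positive or negative.",
    demos,
    "A quiet, moving story.")
# answer = generate(prompt)  # assumed wrapper around any LLM completion API
print(prompt)
```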

ICL is comprehensively covered in the survey paper "A survey for in-context learning", to which we refer readers for a broader and more detailed discussion of this topic. Compared with that survey, this section focuses on applying ICL to large language models, in particular on two aspects: demonstration design and the underlying mechanism of ICL. Furthermore, ICL is closely related to instruction tuning, since both use natural language to format tasks or instances.

Demonstration Design

Multiple studies have shown that the effectiveness of ICL is largely influenced by the design of the demonstrations. We introduce the demonstration design of ICL from three main aspects, namely demonstration selection, demonstration format, and demonstration order.

Demonstration selection. The performance of ICL tends to vary widely with different demonstration examples, so it is important to choose a subset of examples that can effectively exploit the ICL capability of large language models. There are two main demonstration selection approaches, namely heuristic methods and methods based on large language models:

Heuristic methods. Existing work widely employs heuristics to select demonstrations due to their simplicity and low cost. Several studies use k-NN-based retrievers to select examples that are semantically relevant to the query. However, they perform the selection for each example individually, rather than evaluating the example set as a whole. To address this issue, a diversity-based selection strategy has been proposed to choose the most representative set of examples for a specific task. Furthermore, in "Complementary explanations for effective in-context learning", both relevance and diversity are taken into account when selecting demonstrations.
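The k-NN retrieval step described above can be sketched as follows; the hashed bag-of-words embed() function is only a toy stand-in for a real sentence encoder, and all names and data are illustrative assumptions.

```python
import numpy as np

def embed(text, dim=256):
    # Toy stand-in for a real sentence encoder: a hashed bag-of-words vector,
    # good enough to illustrate the retrieval step.
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

def select_knn_demos(query, pool, k=4):
    # k-NN selection: score every candidate example independently by
    # cosine similarity to the query and keep the top k.
    q = embed(query)
    scores = [float(embed(x) @ q) for x, _ in pool]
    top = np.argsort(scores)[::-1][:k]
    return [pool[i] for i in top]

pool = [("The plot dragged on.", "negative"),
        ("A delightful surprise.", "positive"),
        ("Server crashed twice.", "negative"),
        ("Best concert this year.", "positive")]
print(select_knn_demos("The show was a pleasant surprise.", pool, k=2))
```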

Methods based on large language models. Another line of work selects demonstrations by using large language models themselves. For example, a large language model can be used to directly measure the informativeness of each example in terms of the performance gain obtained after adding it. In addition, EPR proposes a two-stage retrieval method that first recalls similar examples with an unsupervised method (e.g., BM25) and then ranks them with a dense retriever (trained with positive and negative examples labeled by a large language model). As an alternative, demonstration selection can be formulated as an RL problem, where a large language model acts as the reward function to provide feedback for training a policy model. Since large language models perform well at text annotation, some recent studies use large language models themselves as demonstration generators without human intervention.
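A rough sketch of the two-stage recall-then-rerank idea behind EPR, under heavy simplification: the token-overlap recall stands in for BM25, and toy_score stands in for the dense retriever trained with LLM-labeled examples. Every function name and scoring rule here is an assumption made for illustration.

```python
import numpy as np

def recall_stage(query, pool, top_n=20):
    # Stage 1 (stands in for BM25): cheap unsupervised recall by token overlap.
    q_toks = set(query.lower().split())
    overlap = [len(q_toks & set(x.lower().split())) for x, _ in pool]
    order = np.argsort(overlap)[::-1][:top_n]
    return [pool[i] for i in order]

def rerank_stage(query, candidates, score_fn, k=4):
    # Stage 2 (stands in for the trained dense retriever): rerank the recalled
    # candidates with a finer-grained relevance score and keep the top k.
    ranked = sorted(candidates, key=lambda ex: score_fn(query, ex[0]), reverse=True)
    return ranked[:k]

def toy_score(query, example):
    # Illustrative scorer; in EPR this would be a dense model trained on
    # positives and negatives labeled with LLM feedback.
    return len(set(query.lower().split()) & set(example.lower().split()))

pool = [("Paris is the capital of France.", "geography"),
        ("2 plus 2 equals 4.", "arithmetic"),
        ("The Seine flows through Paris.", "geography")]
recalled = recall_stage("What is the capital of France?", pool)
print(rerank_stage("What is the capital of France?", recalled, toy_score, k=2))
```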

In summary, as discussed in "An explanation of in-context learning as implicit bayesian inference", for both selection methods above, the demonstration examples selected for ICL should contain enough information about the task to be solved and be relevant to the test query.

Demonstration format. After selecting task examples, the next step is to integrate and format them into a natural language prompt for the large language model. A straightforward approach is to instantiate a predefined template with the corresponding input-output pairs. To construct more informative templates, recent studies consider adding task descriptions or using chain-of-thought prompts to enhance the reasoning ability of large language models. For example, in "Cross-task generalization via natural language crowdsourcing instructions", the authors collected a large dataset containing task descriptions written by humans. Tuning on this dataset improves performance on seen tasks, and the large language model can also generalize to unseen tasks to a certain extent. To reduce annotation costs, "Self-instruct: Aligning language model with self generated instructions" proposes a semi-automated approach that instructs a large language model to generate task descriptions for new tasks from a seed set of human-written task descriptions. Since manually annotating demonstration formats for different tasks is expensive, some works also investigate how to automatically generate high-quality demonstration formats. As two representative methods, Auto-CoT leverages the zero-shot prompt "Let's think step by step" to have the large language model generate intermediate reasoning steps, while least-to-most prompting first queries the large language model to perform problem decomposition and then has it solve the subproblems sequentially, using the intermediate answers to previously solved subproblems.
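As an illustration of the least-to-most idea, the following sketch first asks the model to decompose the problem and then solves the subproblems in sequence. The generate() stub is a placeholder for any LLM completion call, and the prompt wording is assumed for illustration rather than taken from the original paper.

```python
def generate(prompt):
    # Stand-in for a call to a large language model; returns canned text here
    # so the sketch runs. Replace with a real completion API.
    return "<model output>"

def least_to_most(question):
    # Stage 1: ask the model to break the question into simpler subproblems.
    decomposition = generate(
        f"Break the following problem into simpler subproblems, one per line:\n{question}")
    subproblems = [s for s in decomposition.splitlines() if s.strip()]

    # Stage 2: solve the subproblems in order, feeding each intermediate
    # answer back into the context for the next one.
    context = f"Problem: {question}\n"
    for sub in subproblems:
        answer = generate(context + f"Subproblem: {sub}\nAnswer:")
        context += f"Subproblem: {sub}\nAnswer: {answer}\n"
    return generate(context + "Final answer:")

print(least_to_most("A store sells pens in packs of 4. How many packs are needed for 22 pens?"))
```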

Demonstration order. Large language models sometimes suffer from recency bias, i.e., they tend to repeat answers that appear near the end of the demonstrations. Therefore, it is important to arrange the demonstrations (i.e., the task examples) in a reasonable order. Early work proposed several heuristics to quickly find a good order. For example, demonstrations can be arranged directly according to their similarity to the query in the embedding space: the more similar, the closer to the end. Furthermore, global and local entropy metrics can be used to score different demonstration orders. To integrate more task information, some recent studies propose to minimize the code length needed to compress and transmit task labels, which is inspired by information theory. However, these methods require additional labeled data as a validation set to evaluate the performance of a specific demonstration order. To eliminate this need, the authors of "Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity" propose sampling the validation data from the large language model itself.
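The similarity-based ordering heuristic can be sketched as follows, placing the most query-similar demonstrations last; the toy embed() function again stands in for a real sentence encoder, and the data is illustrative.

```python
import numpy as np

def embed(text, dim=256):
    # Toy hashed bag-of-words vector standing in for a real sentence encoder.
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

def order_by_similarity(demos, query):
    # Heuristic from the text: sort demonstrations so that the ones most
    # similar to the query appear last, i.e. closest to the test input,
    # counteracting the model's recency bias.
    q = embed(query)
    return sorted(demos, key=lambda d: float(embed(d[0]) @ q))

demos = [("2 + 2 = ?", "4"), ("What is the capital of France?", "Paris")]
print(order_by_similarity(demos, "What is the capital of Italy?"))
```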

Underlying Mechanism

After pre-training, large language models can exhibit intriguing ICL capability without being updated. Next, we discuss two key questions about the ICL capability of large language models, namely "how does pre-training affect the ICL capability" and "how do large language models perform ICL during inference".

How does pre-training affect ICL? ICL was first proposed in GPT-3, which showed that the ICL ability becomes more significant as the model size increases. Meanwhile, some studies have shown that small PLMs can also exhibit strong ICL capability with specially designed training tasks (e.g., learning to predict labels given task examples and queries as input), possibly even surpassing larger models. This indicates that the design of training tasks is an important factor affecting the ICL capability of large language models. In addition to training tasks, recent studies also investigate the relationship between ICL and the pre-training corpus. It has been shown that the performance of ICL depends heavily on the source of the pre-training corpus rather than its size. Another study, "Data distributional properties drive emergent in-context learning in transformers", provides an in-depth analysis of the impact of the training data distribution. The authors found that ICL emerges when the training data can be clustered into many infrequent classes, rather than being uniformly distributed. Furthermore, the authors of "An explanation of in-context learning as implicit bayesian inference" theoretically explain ICL as the product of pre-training on documents that exhibit long-range coherence.

How do large language models perform ICL? During the inference phase, researchers focus on analyzing how the ICL capability operates based on a given demonstration, since no explicit learning or updating is involved. They usually analyze it from the perspective of gradient descent and regard ICL as implicit fine-tuning. Under this framework, the ICL process can be explained as follows: via forward computation, the large language model generates meta-gradients with respect to the demonstrations and implicitly performs gradient descent through the attention mechanism. Experiments also show that certain attention heads in large language models are able to perform task-agnostic atomic operations (e.g., copying and prefix matching), which are closely related to the ICL capability. To further explore the working mechanism of ICL, some studies abstract ICL as a process of algorithm learning. Specifically, the authors of "What learning algorithm is in-context learning? investigations with linear models" found that large language models essentially encode implicit models through their parameters during pre-training. With the examples provided in ICL, large language models can implement learning algorithms such as gradient descent, or directly compute closed-form solutions, to update these models during forward computation. Under this explanatory framework, it has been shown that large language models can effectively learn simple linear functions and even some complex functions such as decision trees via ICL.
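The kind of probe used in such studies can be sketched as follows: a purely numeric prompt sampled from a linear function y = wx + b, with the model asked to complete the next output and its prediction compared against the ground truth. The generate() stub is a placeholder for an actual LLM call, and the prompt format is an assumption for illustration.

```python
import random

def generate(prompt):
    # Stand-in for an LLM completion call; returns a placeholder so the sketch runs.
    return "<model output>"

def linear_icl_probe(w=3, b=2, k=8, seed=0):
    # Build a purely numeric ICL prompt sampled from y = w*x + b. The probe asks
    # whether the model can recover the underlying linear rule from context alone.
    rng = random.Random(seed)
    xs = [rng.randint(-10, 10) for _ in range(k)]
    lines = [f"input: {x}  output: {w * x + b}" for x in xs]
    x_test = rng.randint(-10, 10)
    prompt = "\n".join(lines) + f"\ninput: {x_test}  output:"
    return prompt, w * x_test + b, generate(prompt)

prompt, target, prediction = linear_icl_probe()
print(prompt)
print("ground truth:", target, "| model prediction:", prediction)
```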

Chain-of-Thought Prompting

Chain-of-thought (CoT) prompting is an improved prompting strategy that boosts the performance of large language models on complex reasoning tasks, such as arithmetic reasoning, commonsense reasoning, and symbolic reasoning. Rather than simply constructing prompts from input-output pairs as in ICL, CoT incorporates into the prompt the intermediate reasoning steps that lead to the final output. In the following, we detail how CoT is used with ICL and discuss when and why CoT prompting works.

In-Context Learning with CoT

Generally, CoT can be used with ICL in two main settings, the few-shot and zero-shot settings, as described below.

Few-shot CoT. Few-shot CoT is a special case of ICL, which augments each ⟨input, output⟩ demonstration into an ⟨input, CoT, output⟩ triple by inserting the CoT reasoning steps. To apply this strategy, we next discuss two key issues, namely how to design appropriate CoT prompts and how to use the generated CoTs to obtain the final answer.

CoT prompt design. Designing proper CoT prompts is crucial for effectively eliciting the complex reasoning ability of large language models. As a straightforward approach, it has been shown that using diverse CoTs (i.e., multiple reasoning paths per question) can effectively improve performance. Another intuitive idea is that prompts with more complex reasoning paths are more likely to elicit the reasoning capabilities of large language models, which can lead to higher accuracy in generating correct answers. However, both methods rely on annotated CoT datasets, which limits their use in practice. To overcome this limitation, Auto-CoT proposes to utilize Zero-shot-CoT to generate CoT reasoning paths by specially prompting large language models, thereby eliminating manual effort. To improve performance, Auto-CoT further divides the questions in the training set into different clusters and then selects the question closest to the centroid of each cluster, which should represent the questions in the training set well. Although few-shot CoT can be regarded as a special case of prompting in ICL, the ordering of the demonstrations seems to have a relatively small impact compared with standard prompting in ICL: reordering the demonstrations only leads to a performance change of less than 2% on most tasks.
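For concreteness, here is a minimal sketch of a few-shot CoT prompt, where each demonstration is an ⟨input, CoT, output⟩ triple and the rationale precedes the final answer in every exemplar. The arithmetic examples and the Q/A template are illustrative assumptions.

```python
def build_cot_prompt(demos, question):
    # Few-shot CoT: each demonstration is an <input, CoT, output> triple,
    # so the reasoning steps appear before the final answer in every exemplar.
    blocks = [f"Q: {q}\nA: {cot} The answer is {a}." for q, cot, a in demos]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

demos = [
    ("Tom has 3 apples and buys 2 more. How many apples does he have?",
     "He starts with 3 apples and gains 2, so 3 + 2 = 5.", "5"),
    ("A book costs 8 dollars. How much do 3 books cost?",
     "Each book is 8 dollars, so 3 books cost 3 * 8 = 24 dollars.", "24"),
]
print(build_cot_prompt(demos, "Sara has 12 pens and gives away 5. How many remain?"))
```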

Enhanced CoT strategies. In addition to enriching the contextual information, CoT prompting offers more options for inferring the answer to a given question. Existing research mainly focuses on generating multiple reasoning paths and trying to find a consensus among the derived answers. For example, self-consistency is proposed as a new decoding strategy for generating CoTs and final answers. It first generates several reasoning paths and then performs an ensemble over all the answers (e.g., by voting for the most consistent answer among these paths). Self-consistency greatly improves the performance of CoT reasoning and can even improve tasks on which CoT prompting is usually worse than standard prompting (e.g., closed-book question answering and natural language inference). Furthermore, the authors of "Rationale-augmented ensembles in language models" extend the self-consistency strategy to a more general ensemble framework (extended to ensembles over prompts), and they find that diverse reasoning paths are the key to the improved reasoning performance of CoT. The above methods can be easily integrated into CoT prompting to improve performance without additional training. In contrast, other studies train a scoring model to measure the reliability of the generated reasoning paths, or continually train large language models on self-generated reasoning paths to improve performance.
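A hedged sketch of the self-consistency idea: sample several reasoning paths, extract a final answer from each, and return the majority vote. The generate() stub, the sampling temperature, and the answer-extraction rule (take the last number) are all assumptions made for illustration.

```python
import re
from collections import Counter

def generate(prompt, temperature=0.7):
    # Stand-in for a sampling LLM call; returns a canned path so the sketch runs.
    return "... so 6 * 7 = 42. The answer is 42."

def extract_answer(completion):
    # Pull the final answer out of a generated reasoning path
    # (here: the last number; the parsing rule is an illustrative assumption).
    numbers = re.findall(r"-?\d+", completion)
    return numbers[-1] if numbers else None

def self_consistency(prompt, n_paths=10):
    # Sample several diverse reasoning paths, then vote: the most frequent
    # final answer across paths is returned as the consensus prediction.
    answers = [extract_answer(generate(prompt)) for _ in range(n_paths)]
    answers = [a for a in answers if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None

print(self_consistency("Q: What is 6 times 7?\nA: Let's think step by step."))
```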

Zero-shot CoT. Unlike few-shot CoT, zero-shot CoT does not include human-annotated task demonstrations in the prompt. Instead, it directly generates reasoning steps and then uses the generated CoT to derive the answer. Zero-shot CoT was first proposed in "Large language models are zero-shot reasoners": the large language model first generates reasoning steps from the prompt "Let's think step by step" and then derives the final answer from the prompt "Therefore, the answer is". The authors found that this strategy greatly improves performance when the model size exceeds a certain scale, but is ineffective for small models, showing a significant pattern of emergent abilities. To unlock the CoT capability on more tasks, Flan-T5 and Flan-PaLM further perform instruction tuning on CoT annotations, which improves the zero-shot performance on unseen tasks.
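The two-stage zero-shot CoT procedure can be sketched as follows, with generate() standing in for any LLM completion call; the trigger phrases follow the description above, and the prompt layout is otherwise an assumption.

```python
def generate(prompt):
    # Stand-in for an LLM completion call; returns a placeholder so the sketch runs.
    return "<model output>"

def zero_shot_cot(question):
    # Stage 1: elicit the reasoning steps with the zero-shot trigger phrase.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = generate(reasoning_prompt)

    # Stage 2: append the answer-extraction phrase and ask for the final answer.
    answer_prompt = f"{reasoning_prompt} {reasoning}\nTherefore, the answer is"
    return generate(answer_prompt)

print(zero_shot_cot("If there are 3 cars and each car has 4 wheels, how many wheels are there?"))
```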

Further Discussion on CoT

In this part, we discuss two fundamental questions related to CoT, namely "when does CoT work for large language models" and "why can large language models perform CoT reasoning".

When does CoT work for large language models? Since CoT is an emergent ability, it only has a positive effect on sufficiently large models (e.g., typically containing 10B or more parameters) and has no effect on small models. Moreover, since CoT augments standard prompting with intermediate reasoning steps, it is mainly effective for tasks that require step-by-step reasoning, such as arithmetic reasoning, commonsense reasoning, and symbolic reasoning. However, it may perform worse than standard prompting on tasks that do not rely on complex reasoning, such as MNLI-m/mm, SST-2, and QQP from GLUE. Interestingly, the performance gain from CoT prompting seems to be significant only when standard prompting yields poor results.

Why can large language models perform CoT reasoning? As the second question, we discuss the underlying mechanism of CoT from the following two aspects.

The source of the CoT ability. Regarding the source of the CoT ability, it is widely hypothesized that it can be attributed to training on code, since models trained on code show strong reasoning ability. Intuitively, code data is well organized with algorithmic logic and programming flow, which may help improve the reasoning performance of large language models. However, this hypothesis still lacks publicly reported evidence from ablation experiments (with and without training on code). Furthermore, instruction tuning does not appear to be the key reason for the CoT ability, since experiments show that instruction tuning on non-CoT data does not improve performance on held-out CoT benchmarks.

The effect of prompt components. The main distinction between CoT prompts and standard prompts is the incorporation of reasoning paths before the final answer. Therefore, some researchers have investigated the effect of the different components in the reasoning path. Specifically, a recent study identifies three key components in CoT prompts, namely symbols (e.g., numerical quantities in arithmetic reasoning), patterns (e.g., equations in arithmetic reasoning), and text (i.e., the rest of the tokens that are neither symbols nor patterns). It is shown that the latter two parts (i.e., patterns and text) are essential to model performance, and removing either one leads to a significant performance drop. However, the correctness of the symbols and patterns does not seem to matter. Furthermore, there is a symbiotic relationship between text and patterns: the text helps large language models generate useful patterns, and the patterns help large language models understand the tasks and generate text that helps solve the problems.

In conclusion, CoT prompting provides a general and flexible approach to eliciting the reasoning ability of large language models. There have also been some initial attempts to extend this technique to multimodal and multilingual tasks. In addition to directly using large language models with ICL and CoT, some recent studies have explored how to specialize the ability of large language models towards specific tasks, which is referred to as model specialization. For example, researchers have specialized the mathematical reasoning ability of large language models by fine-tuning the small-scale Flan-T5 on CoT reasoning paths generated by large language models. Model specialization can also be applied to solve a variety of tasks such as question answering, code synthesis, and information retrieval.


Origin blog.csdn.net/sinat_37574187/article/details/132161372