Interpretation of LIMA and QLoRA papers

《LIMA: Less Is More for Alignment》

paper: https://arxiv.org/abs/2305.11206

Meta released the paper "LIMA: Less Is More for Alignment" in May 2023, obtaining the LIMA model by fine-tuning LLaMa-65B and reporting strong results.

The training of a large language model is divided into two stages:

1. Unsupervised pre-training, which learns general-purpose representations from raw text;

2. Large-scale instruction tuning and reinforcement learning to better adapt to end tasks and user preferences.

The authors fine-tuned the LIMA model on 1,000 carefully selected instruction examples, without any reinforcement learning or human preference modeling. LIMA showed remarkably strong performance: it learns to follow specific response formats from only a handful of examples in the training data, including for complex tasks ranging from planning travel itineraries to speculating about alternate history. Moreover, the model generalizes to new tasks that do not appear in the training set.
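The paragraph above describes the method: plain supervised fine-tuning on a small, curated set of prompt-response pairs, with no RLHF stage. The sketch below shows what such a setup could look like with Hugging Face transformers; the model name, data file, and hyperparameters are illustrative assumptions rather than the authors' exact configuration (the paper uses LLaMa-65B and 1,000 curated examples).

```python
# Minimal sketch of LIMA-style supervised fine-tuning: a small curated
# instruction dataset and a standard language-model loss, no RLHF.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

model_name = "huggyllama/llama-7b"        # stand-in; LIMA fine-tunes LLaMa-65B
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical local file with ~1,000 {"prompt": ..., "response": ...} pairs.
data = load_dataset("json", data_files="lima_style_1k.json")["train"]

def format_example(ex):
    # Concatenate prompt and response into one training sequence.
    text = ex["prompt"] + "\n" + ex["response"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=2048)

data = data.map(format_example, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="lima-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=15,              # LIMA trains for relatively many epochs
        learning_rate=1e-5,
        bf16=True,
    ),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```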

As shown in the figure below, in the evaluation, 43% of LIMA's responses were equal to or better than GPT-4, 58% were equal to or better than Bard, and 65% were equal to or better than DaVinci003.

Figure 1 shows the human preference evaluation results, and Figure 2 shows the GPT-4 preference evaluation results.

The first observation of the study is that, despite being trained on 52 times more data, Alpaca 65B tends to produce less preferable outputs than LIMA; the same holds for DaVinci003, which was trained with RLHF, a presumably superior alignment method.

While Claude and GPT-4 generally performed better than LIMA, LIMA did produce better answers in many cases. Notably, even GPT-4 prefers the output of LIMA 19% of the time.

The evaluation suggests that almost all of the knowledge in a large language model is learned during pre-training, and only a limited amount of instruction fine-tuning data is needed to teach the model to produce high-quality output.

Why "Less More"?

The authors explore the impact of the diversity, quality, and quantity of training data through ablation experiments. They observe that, for alignment purposes, scaling up input diversity and output quality has a measurable positive effect, while scaling up quantity alone may not.

Diversity

To test the effect of prompt diversity while controlling for quality and quantity, the researchers compared models trained on quality-filtered Stack Exchange data, whose prompts are heterogeneous, and wikiHow data, whose prompts are largely homogeneous. They drew 2,000 training samples from each source; as shown above, the more diverse Stack Exchange data yielded significantly better performance.

Quality

To test the effect of response quality, the researchers took 2,000 samples from Stack Exchange without any quality or style filtering and compared a model trained on this dataset with the model trained on the filtered dataset. As shown in Figure 5, there is a significant 0.5-point gap between models trained on the filtered and unfiltered data sources.

Quantity

Increasing the number of samples is a common way to improve performance in machine learning. To test its impact, the researchers sampled exponentially larger training sets from Stack Exchange. As shown in the figure below, doubling the size of the training set did not improve response quality. This suggests that alignment is not governed by the number of training samples alone, but rather depends on prompt diversity.

Summary

Fine-tuning a powerful pretrained language model (LLaMa-65B) on 1000 well-curated examples yields remarkable, competitive results across a wide range of prompts.

However, this approach has limitations: first, the mental effort involved in constructing such samples is enormous and difficult to scale up. Second, LIMA is not as robust as production-level models, and while LIMA usually produces good responses, adversarial prompts may generate wrong responses.

Nonetheless, this work shows the potential of solving complex alignment problems in simple ways.


《QLORA: Efficient Finetuning of Quantized LLMs》

Paper: https://arxiv.org/pdf/2305.14314.pdf

Code: https://github.com/artidoro/qlora

In May 2023, researchers at the University of Washington proposed QLoRA, an efficient fine-tuning method. By sharply reducing GPU memory usage, it makes it possible to fine-tune a 65B-parameter model on a single 48GB GPU; roughly 12 hours of fine-tuning is enough to reach about 97% of ChatGPT's level. At the same time, 4-bit quantization preserves the quality of fp16-precision fine-tuning.

  • Motivation: reduce GPU memory usage so that a 65B-parameter model can be fine-tuned on a single 48GB GPU while preserving full fp16 fine-tuning performance.
  • Method: QLoRA quantizes the pre-trained model to 4 bits and fine-tunes it with LoRA, yielding an efficient fine-tuning method that maintains performance with a small memory footprint.
  • Advantages: QLoRA introduces several innovations aimed at reducing memory usage without sacrificing performance, including the 4-bit NormalFloat data type (NF4), Double Quantization, and Paged Optimizers, while retaining full fp16 inference performance (a minimal code sketch follows this list).
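As a concrete illustration, the sketch below shows how QLoRA-style fine-tuning is typically set up with the Hugging Face stack (transformers + peft + bitsandbytes). This is not the paper's exact training code; the model name and LoRA hyperparameters are illustrative assumptions.

```python
# Minimal QLoRA-style setup: 4-bit NF4 base model + trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4 bit
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat (NF4)
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16, weights stay 4-bit
)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                  # stand-in; the paper scales up to 65B
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # all linear layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the LoRA adapters are trained
```

In recent versions of transformers, the Paged Optimizer from the paper corresponds to setting `optim="paged_adamw_32bit"` in `TrainingArguments`.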

The four key components of QLoRA are explained below:

  1. 4-bit Quantization (NF4): Imagine you have a huge box of crayons, but you find that you can paint almost the same picture with only 16 colors. This is what quantization does: it reduces the number of distinct "colors" (numbers) the model uses to represent its weights, saving a lot of space. Here the authors use a special 4-bit quantization, meaning only 16 distinct values are used instead of the tens of thousands a 16-bit model can represent (a toy numeric sketch follows this list).
  2. LoRA: This is a way to change the knowledge of the model without adjusting all of its parts. Imagine you have a huge, complex Lego structure that you want to change: instead of tearing down the whole structure, you just add or swap a few pieces here and there. That is what LoRA does; it lets researchers fine-tune models without using large amounts of GPU memory.
  3. Double Quantization: This is another trick to save GPU memory. It is as if the notes you keep about which crayons you used are themselves compressed: the per-block quantization constants (the scales) are quantized a second time, saving additional memory.
  4. Paged Optimizers: This handles moments when the model suddenly needs a lot of GPU memory, like having a small table but occasionally taking on a big project: instead of buying a bigger table, you clear and reuse parts of it as needed. Optimizer states are paged between GPU memory and CPU memory to absorb memory spikes.
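To make the "crayons" analogy concrete, here is a toy sketch of what 4-bit quantization does numerically. The 16 levels below are evenly spaced for simplicity; NF4 instead places its 16 levels at quantiles of a normal distribution, which matches the typical distribution of pretrained weights more closely. This is an illustration, not the bitsandbytes implementation.

```python
# Toy illustration of 4-bit quantization: map each weight to the nearest of
# 16 fixed levels and store only its 4-bit index plus a per-block scale.
import torch

def quantize_4bit(weights: torch.Tensor, levels: torch.Tensor):
    scale = weights.abs().max()                  # per-block absmax scale
    normalized = weights / scale                 # bring weights into [-1, 1]
    idx = (normalized[:, None] - levels[None, :]).abs().argmin(dim=1)
    return idx.to(torch.uint8), scale            # 4-bit indices + one fp scale

def dequantize_4bit(idx, scale, levels):
    return levels[idx.long()] * scale            # look up level, undo the scale

levels = torch.linspace(-1, 1, 16)               # 16 values = 4 bits
w = torch.randn(8)
idx, scale = quantize_4bit(w, levels)
w_hat = dequantize_4bit(idx, scale, levels)
print(w, w_hat, sep="\n")                        # reconstruction is approximate
```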

Using NF4 4-bit quantization and LoRA training, the authors obtained the Guanaco-65B model, which outperforms all previously released open models, reaching 99.3% of ChatGPT's performance level on the Vicuna benchmark while requiring only 24 hours of fine-tuning on a single GPU.

The author fine-tuned more than 1,000 models using QLoRA and came to the following conclusions:

  1. Data quality is far more important than dataset size, which is consistent with the conclusion of the LIMA paper above. Tuning on Open Assistant's roughly 9,000 examples for 12 hours achieves good results, compared with FLAN v2, which contains more than one million instruction examples. Fine-tuning may not require a very large dataset; a small amount of high-quality data can give better results.
  2. For a given task, the suitability of the dataset matters more than its size. Instruction fine-tuning data, which consists only of instruction-style tasks, does not perform well for chatbots; a chatbot is better fine-tuned on the Open Assistant dataset. OASST1, with 9k samples, beats a 450k-sample instruction fine-tuning dataset on chatbot performance. Instruction datasets improve the reasoning ability of large models, but they were not built for chat.
  3. The default LoRA hyperparameters are insufficient to achieve the best performance for large models. The study found that the number of LoRA adapters used is a key hyperparameter, and that LoRA must be applied to all linear layers to match the performance of full-parameter fine-tuning (see the sketch after this list).
  4. QLoRA uses NF4 to reproduce the performance of fp16 full parameter fine-tuning and fp16 LoRA fine-tuning, and NF4 is superior to FP4 in terms of quantization accuracy.
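Conclusion 3 above suggests attaching LoRA to every linear layer rather than only the default attention projections. Below is a small helper, similar in spirit to one in the qlora repository, that collects all linear-layer names from a model loaded in 4-bit with bitsandbytes; the details are illustrative, not a definitive implementation.

```python
# Sketch: find every linear-layer leaf name in a 4-bit-loaded model so that
# LoRA can be attached to all of them via LoraConfig(target_modules=...).
import bitsandbytes as bnb
import torch.nn as nn

def find_all_linear_names(model):
    target_cls = (nn.Linear, bnb.nn.Linear4bit)
    names = set()
    for name, module in model.named_modules():
        if isinstance(module, target_cls):
            names.add(name.split(".")[-1])       # keep the leaf name, e.g. "q_proj"
    names.discard("lm_head")                     # the output head is usually excluded
    return sorted(names)

# lora_config = LoraConfig(target_modules=find_all_linear_names(model), ...)
```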

The emergence of QLoRA does give people something new to think about: both fine-tuning and deploying large models yourself become easier. Anyone can quickly fine-tune on their own private data and, at the same time, easily deploy large models for inference. Later, I will use QLoRA to build a private QA bot and see whether it can also achieve good results on Chinese reasoning.

 
