Defeating the entire alpaca family: Meta AI's new self-alignment method needs very little manually labeled data


Xifeng, from Aofeisi | QbitAI public account

Running short on manually labeled data?

Meta's new method builds a high-quality instruction-following language model from only a small amount of seed data.

That is, large language models normally need large amounts of human-labeled instruction data for fine-tuning, but with this method the model can automatically infer instructions from unlabeled text in a web corpus.

It then trains on the instruction data it generated itself, a genuinely self-produced, self-consumed pipeline.

And the model trained this way surpasses the open-source Alpaca and its family of derivative models on the Alpaca leaderboard.

LeCun tweeted that the study was a sensational piece of work on model self-alignment:


One netizen summed it up in a sentence:

The alpaca began to train itself.


And in two sentences:

You used to need an instruction→response dataset (which requires manual labeling); now you only need to train a simple "reverse model" that maps response→instruction, and any text can be freely converted into an instruction dataset.
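To make the trick concrete, here is a minimal sketch of how the seed pairs can be reversed so a backward model learns to map outputs to instructions. The field names are illustrative, not the paper's code:

```python
# Minimal sketch: reverse the seed pairs so a "backward" model learns to
# predict an instruction from an output. Field names are illustrative.

seed_pairs = [
    {"instruction": "Summarize the text below.", "output": "The article argues that..."},
    # ...roughly 3,200 Open Assistant examples in the actual setup
]

def to_backward_example(pair):
    """Swap the roles: the backward model trains on output -> instruction."""
    return {
        "prompt": pair["output"],           # what was originally the response
        "completion": pair["instruction"],  # what was originally the instruction
    }

backward_training_set = [to_backward_example(p) for p in seed_pairs]
```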


Another netizen posed a soul-searching question:

Am I the only one who thinks this looks like the path to superintelligence? If you can get LLMs that get smarter and smarter without additional high-quality external data, then this is a self-improving closed system.

Maybe just a reinforcement learning system is needed to provide the signal, and then the iteration of the LLM itself does the rest.


Alpaca: I used data to train a whale

This scalable new method is called instruction backtranslation, and Meta named the model trained with it Humpback (as in the humpback whale).

(The researchers say the name was chosen for its link to camels' backs, a nod to the Alpaca line, with the whale's far larger size corresponding to the larger scale of the model.)


Training Humpback, in brief: start from a small amount of labeled data, use the language model to generate instructions for unlabeled text and form candidate training data, then use the model itself to assess data quality and keep only the high-quality samples for retraining. The process is then repeated to further improve the model.
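The whole loop fits in a short structural sketch. All the callables here (fine_tune, generate_instruction, score) stand in for a real training and inference stack and are assumptions, not the authors' code; only the loop structure follows the description above:

```python
# Structural sketch of the self-training loop, assuming the caller supplies
# callables backed by a real training/inference stack. None of these names
# come from the paper; only the loop structure does.

def instruction_backtranslation(
    fine_tune,             # (training_pairs) -> model
    generate_instruction,  # (backward_model, text) -> candidate instruction
    score,                 # (model, pair) -> quality score on a 1-5 scale
    seed_data,             # list of {"instruction": ..., "output": ...}
    unlabeled_texts,       # list of web passages
    iterations=2,
    threshold=4,
):
    # Backward model: learns output -> instruction from the reversed seed pairs.
    reversed_seed = [
        {"instruction": p["output"], "output": p["instruction"]} for p in seed_data
    ]
    backward = fine_tune(reversed_seed)
    # Forward model: the usual instruction -> output direction.
    forward = fine_tune(seed_data)

    for _ in range(iterations):
        # Self-augmentation: pair each web text with an inferred instruction.
        candidates = [
            {"instruction": generate_instruction(backward, t), "output": t}
            for t in unlabeled_texts
        ]
        # Self-curation: keep only pairs the current model rates highly.
        curated = [c for c in candidates if score(forward, c) >= threshold]
        # Retrain on seed data plus the curated augmented data.
        forward = fine_tune(seed_data + curated)

    return forward
```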

(Figure: the two-stage pipeline, self-augmentation followed by self-curation)

As shown in the figure above, the "materials" to prepare are:

  • A base model: LLaMA

  • Seed data: 3,200 examples from the Open Assistant dataset, each consisting of an instruction and its corresponding output

  • Unlabeled data: 502K texts from the ClueWeb corpus, deduplicated, filtered, and with potentially low-quality paragraphs removed

With labeled examples and a corpus source in hand, the next step is the self-augmentation stage.

The researchers fine-tuned the base LLaMA model on the seed data to obtain an instruction prediction model, then used it to infer candidate instructions for the unlabeled texts. Each candidate instruction is combined with its text into an instruction-output pair, forming the candidate augmented training data, Augmented Data A in the figure above.
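Illustratively, this is one way the backward model could be prompted to infer a candidate instruction for an unlabeled web passage. The template wording and build_backward_prompt are assumptions, not quoted from the paper:

```python
# Illustrative only: prompting the backward model to infer a candidate
# instruction for one web passage. The template wording is an assumption.

BACKWARD_TEMPLATE = (
    "Below is the output of an instruction. "
    "Write the instruction it most plausibly answers.\n\n"
    "Output:\n{output}\n\n"
    "Instruction:"
)

def build_backward_prompt(web_text: str) -> str:
    """Format one web passage into a prompt for the backward model."""
    return BACKWARD_TEMPLATE.format(output=web_text)

# The completion the backward model produces for this prompt becomes the
# candidate instruction paired with `web_text` in Augmented Data A.
print(build_backward_prompt("Humpback whales can migrate up to 8,000 km each year."))
```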

However, the data in A cannot be used for training directly: the quality of the unlabeled text itself is uneven, and the generated candidate instructions are also noisy.

Therefore a key self-curation step is needed, in which the model itself predicts data quality and high-quality samples are selected for training.


Specifically, the researchers scored the candidate data with an instruction model fine-tuned only on the seed data, on a five-point scale; samples with high scores are selected as the candidate data for the next round.
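A sketch of that self-curation filter is below; the rubric wording, parse_score, and the rate_with_model callable are all assumptions rather than the paper's exact prompt:

```python
import re

# Sketch of the self-curation filter: a seed-only instruction model rates each
# candidate pair on a 1-5 scale and only high scorers survive.

RATING_TEMPLATE = (
    "Rate the following instruction-output pair on a scale of 1 to 5, where 5 "
    "means the output is an excellent answer to the instruction.\n\n"
    "Instruction: {instruction}\nOutput: {output}\n\nScore:"
)

def parse_score(model_reply: str) -> int:
    """Extract the first digit 1-5 from the model's reply; 0 if none is found."""
    match = re.search(r"[1-5]", model_reply)
    return int(match.group()) if match else 0

def curate(candidates, rate_with_model, threshold=4):
    """Keep only candidate pairs whose predicted quality meets the threshold."""
    kept = []
    for pair in candidates:
        reply = rate_with_model(RATING_TEMPLATE.format(**pair))
        if parse_score(reply) >= threshold:
            kept.append(pair)
    return kept
```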

To improve the model's instruction prediction quality, the researchers trained the model iteratively on the candidate data, and the data quality keeps improving over the iterations.

In addition, when combining seed data and augmented data to fine-tune the model, they used different system prompts to distinguish the two data sources (see the sketch after the list):

  • For seed data, the prompt "Answer in the style of an AI Assistant."

  • For filtered augmented data, the prompt "Answer with knowledge from web search."
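A minimal sketch of that tagging step, assuming a simple dict layout for examples; the two prompt strings come from the article, while the function names and structure are illustrative:

```python
# Tag the two data sources with different system prompts before the joint
# fine-tune. Prompt strings are from the article; the rest is illustrative.

SEED_PROMPT = "Answer in the style of an AI Assistant."      # seed data
AUGMENTED_PROMPT = "Answer with knowledge from web search."  # curated web data

def tag_example(pair, source):
    """Attach the system prompt that matches the example's data source."""
    system = SEED_PROMPT if source == "seed" else AUGMENTED_PROMPT
    return {"system": system, **pair}

def build_joint_training_set(seed_pairs, curated_pairs):
    """Combine both sources into one fine-tuning set, each example tagged."""
    return (
        [tag_example(p, "seed") for p in seed_pairs]
        + [tag_example(p, "augmented") for p in curated_pairs]
    )
```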

After two iterations, the final model is fresh out of the oven.

Merging the two training sets: 1 + 1 > 2

Let's take a look at the results of the researchers' analysis:

Instruction diversity of the seed data and the augmented data. The inner circle shows common root verbs and the outer circle the common nouns that follow them.

The figure above shows the instruction diversity statistics; the plotted combinations cover 8% of the seed data and 13% of the augmented data.

It is intuitively clear that the augmented data is more diverse in the long tail, and that it complements the existing human-labeled seed data, filling in types that do not appear there.

Second, the researchers compared three variants of the augmented dataset: all augmented data without self-curation, and the self-curated subsets A₄ and A₅, which contain less data but of higher quality.


The experiments show that even though the dataset becomes smaller, model performance improves as the quality of the training data improves.

Evaluating self-augmented data of different sizes and qualities under self-curation. The y-axis shows the win rate against text-davinci-003 when fine-tuning LLaMA 7B on data of the given size and quality.

(text-davinci-003: a GPT-3-based instruction-following model fine-tuned with reinforcement learning on human-written instruction data, outputs, model responses, and human preferences.)

Finally, the results on the Alpaca leaderboard: Humpback significantly outperforms all other methods that do not rely on distilled data, and narrows the gap to proprietary models.

Non-distilled refers to models that do not rely on any external model as any form of supervision during training; distilled refers to models that introduce a stronger external model during training, for example by using data distilled from it; proprietary refers to models trained with proprietary data and techniques.

Win rate relative to text-davinci-003

In comparisons with the open-source models LIMA 65B, Guanaco 65B, and Falcon-Instruct 40B and the proprietary models davinci-003 and Claude, Humpback's performance also aligns better with human preferences.


Additionally, the researchers noted limitations of the method:

Since the text used for training comes from a web corpus, the fine-tuned model may amplify biases present in web data. The fine-tuned model detects bias more accurately than the base model does, but that does not mean the problem is completely solved.

Paper link: https://arxiv.org/abs/2308.06259

Reference link:
[1] https://twitter.com/jaseweston/status/1690888779878330368/retweets/with_comments
[2] https://twitter.com/swayducky/status/1690989046749868032
[3] https://twitter.com/ylecun/status/1691149506165747720
