Platypus: fast, cheap and powerful large models

A series of fine-tuned and merged models topped the Open LLM leaderboard. How did they do it?

In recent years, the number of model parameters has exploded (PaLM has 540B parameters). This raises the question of whether so many parameters are really necessary.

According to OpenAI's scaling laws, performance improves as the model grows. In addition, emergent properties (properties that only appear beyond a certain scale) arise.

This view has been challenged: data actually matters more than sheer size, so scaling is limited by the number of tokens required to train the model optimally. Furthermore, even these emergent properties may not exist.


The Chinchilla scaling law: as the number of parameters increases, more data is needed to train the model optimally.
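For reference, the Chinchilla work models pre-training loss as a joint function of the parameter count N and the number of training tokens D (the exponents below are the published fits, quoted here from memory, so treat them as approximate):

```latex
% Chinchilla parametric loss fit (Hoffmann et al., 2022):
% N = number of parameters, D = number of training tokens, E = irreducible loss.
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad \alpha \approx 0.34, \quad \beta \approx 0.28
```

Minimizing this loss under a fixed compute budget leads to the well-known rule of thumb of roughly 20 training tokens per parameter, which is why bigger models need proportionally more data.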

Moreover, these proprietary models cannot be freely analyzed or used by the scientific community. So, first with BLOOM and then with Meta's LLaMA, the community has moved toward open-source models. LLaMA also showed that paying more attention to the data allows smaller models to compete with larger ones.

On the other hand, small models cannot generalize as well as large ones. This has led to a search for techniques that reduce the cost of large models, such as knowledge distillation (a teacher model teaching a student model). Later methods tried to cut costs further by distilling datasets: starting from a large training set and reducing it to a smaller but equally effective one.
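As a rough illustration of the teacher-student idea, here is a generic, Hinton-style distillation loss in PyTorch; it is not the recipe of any specific paper cited here, and the temperature and mixing weight are illustrative:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend the usual cross-entropy with a KL term that pulls the student's
    softened distribution toward the teacher's."""
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```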


Another idea for reducing computational cost is mixture of experts, where different parts of the network are activated depending on the input. For example, in the Switch Transformer, a different set of parameters is chosen for each example (and even for each token).
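Here is a minimal sketch of this token-level, top-1 routing in the spirit of the Switch Transformer; the layer sizes, expert count, and dense per-expert loop are illustrative, and real implementations add load-balancing losses and capacity limits:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchFFN(nn.Module):
    """Top-1 mixture-of-experts feed-forward layer: each token is routed to a
    single expert, so only a fraction of the parameters is active per token."""
    def __init__(self, d_model=512, d_ff=2048, num_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                      # x: (num_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)
        weight, expert_idx = gate.max(dim=-1)  # pick one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                # Scale by the gate value so the router stays differentiable.
                out[mask] = weight[mask, None] * expert(x[mask])
        return out
```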


On the other hand, techniques for efficiently fine-tuning large language models (LLMs) have been developed in recent months, first LoRA and then Quantized LoRA (QLoRA). These make training more efficient and have led to specialized models for specific tasks or domains (models dedicated to coding, the biomedical field, and so on).

Still, training models remains an expensive and time-consuming process. So why not take all these elements and combine them?


In a recently published article, Platypus attempts to bring these elements together.

Specifically:

  • They release Open-Platypus, a carefully curated dataset with no redundancy and no contamination of the benchmark test sets.
  • An analysis of the effects of redundancy.
  • A description of the method, code, and other resources.

Open-Platypus, a (mostly) human-generated dataset

The authors decided to fine-tune LLaMA-2 as the base model. They are motivated by three ideas: a model learns most of its knowledge during pre-training, and alignment lets it exploit that knowledge; the base model has not yet reached saturation, so it can still be improved; and data quality is critical to the model's performance.

The authors therefore aimed to maximize the quality of the dataset while minimizing its size for computational efficiency. To do this, they took open datasets and selected high-quality examples (with a special focus on STEM).

The authors selected 11 datasets, consisting mainly of human-written questions (only ~10% are generated by an LLM).


Since they gathered questions from different sources, the authors checked for and excluded questions that were identical or too similar, to prevent the model from simply memorizing answers:

  • They remove duplicate questions.
  • They use SentenceTransformers to embed the questions and then remove questions with a cosine similarity above 80% (a sketch of this kind of filtering is shown below).
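A minimal sketch of this embedding-based filtering; the 80% threshold comes from the paper's description, but the specific embedding model and the code itself are illustrative assumptions, not the authors' implementation:

```python
import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def deduplicate(questions, threshold=0.8):
    """Keep a question only if it is not too similar to one already kept."""
    embeddings = model.encode(questions, convert_to_tensor=True, normalize_embeddings=True)
    kept, kept_embs = [], []
    for question, emb in zip(questions, embeddings):
        if kept_embs and util.cos_sim(emb, torch.stack(kept_embs)).max() >= threshold:
            continue  # near-duplicate of a kept question: drop it
        kept.append(question)
        kept_embs.append(emb)
    return kept
```

The same similarity check can be run against the benchmark test questions to catch potential contamination, which is what the next section is about.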

Avoiding test set contamination

The authors took care to ensure that questions from the benchmark datasets do not leak into the training set (one of the most common mistakes).

This is not an easy task, since questions may be similar and the same query can be phrased in many ways. The authors therefore filtered out all similar queries. After this analysis, they flagged questions considered potential leaks and divided them into three groups:

  • Duplicates. Exact copies of a test question, or near-copies with reordered sentences or a few added words.
  • Gray area. Questions that are not exact duplicates and fall into the realm of common knowledge. These need to be evaluated by domain experts, because they contain synonyms, very similar instructions, or paraphrasing.
  • Similar but different. Questions with high cosine similarity but different answers, because the structure of the question has changed.


Fine-tuning

The authors used Low-Rank Adaptation (LoRA), since QLoRA was released after their work began, but they plan to use it in the future. They further improved training efficiency with the Parameter-Efficient Fine-Tuning (PEFT) library. They report being able to fine-tune the smaller 13B model in 5 hours on a single A100 80GB GPU. They also took special care in selecting the hyperparameters.
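A minimal sketch of what LoRA fine-tuning with the PEFT library looks like; the checkpoint name, rank, and target modules below are illustrative choices, not the exact settings reported in the paper:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "meta-llama/Llama-2-13b-hf"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")

# LoRA freezes the base weights and trains small low-rank update matrices.
config = LoraConfig(
    r=16,                                  # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections in LLaMA
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only a small fraction of the 13B weights
# ...then train on Open-Platypus with the usual transformers Trainer / SFT loop.
```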


Another interesting approach is to merge the adapters, once trained, with different models.
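With PEFT, folding a trained adapter back into its base model can look like the following sketch (the adapter path is a placeholder); the resulting full checkpoint can then be merged or ensembled with other fine-tuned LLaMA-2 models:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf", device_map="auto")
# Load the trained LoRA adapter on top of the frozen base weights...
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")  # placeholder path
# ...and fold the low-rank updates into the base weights, producing a plain model.
model = model.merge_and_unload()
model.save_pretrained("platypus-13b-merged")
```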

The code for each step has been published and is available here.

The authors also provide an IPython notebook and detailed online documentation.

Results: how did it perform?


The authors used the Hugging Face Open LLM Leaderboard to compare their models' results, and they note that their model reached the top spot on the leaderboard in August.


The authors note that their method improves the performance of the base model (LLaMA-2) on different benchmarks. Furthermore, especially for smaller models, merging can produce interesting results (according to the authors, merging can give the model access to information it did not previously have). Merging can therefore be considered a low-cost strategy for improving model performance. Of course, the technique also has limitations: its effectiveness depends on the domain, and in algebra, for example, it has less impact. The model and the application domain must therefore be chosen carefully before merging.

The authors also note that this model is the first of its kind in open source.


Platypus has since been overtaken on the Hugging Face leaderboard, but its performance nevertheless proved quite impressive.

Limitations


Of course, the model is not without limitations. Some of these are inherited from LLaMA-2, since Platypus is fine-tuned from that base model. In particular, its knowledge is frozen after pre-training (no continual updates), it can hallucinate, and it can generate biased or harmful content.

LLaMA-2 was trained primarily on English text, so it is less proficient in other languages. Research has also shown that large models can be used for malicious purposes (spreading misinformation or probing sensitive topics), and the same applies to Platypus.

Although Platypus has been trained on STEM data, it may struggle with topics outside its primary area of expertise.

Finally, although the authors have taken care to avoid contamination, there may still be questions that were not filtered out.

Conclusion

Training models is expensive, but as the leaderboards show, small models can succeed at certain tasks. LoRA and related techniques have democratized access to large models. This work further demonstrates that domain specialization is a viable approach, how adapters can be merged, and how to build a high-quality dataset.


Origin blog.csdn.net/qq_41929396/article/details/132804476