Platypus: a fast, cheap, and powerful LLM that beats the competition with only one GPU and 5 hours of LLaMA2 fine-tuning

How to reduce model cost?

In recent years, model parameter counts have exploded (540B for PaLM). This raised the question of whether so many parameters are actually necessary.

According to OpenAI's scaling laws, performance grows with model size. In addition, larger models exhibit emergent properties: abilities that are not observable below a certain scale.

This view has since been challenged: performance depends on data as well as parameters, so scaling is limited by the number of tokens required to train the model optimally. Furthermore, even these emergent properties may not actually exist.

Chinchilla scaling law: as the number of parameters increases, more data is needed to train the model optimally.
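
As a rough illustration of the Chinchilla finding, here is a minimal sketch using the commonly cited heuristic of about 20 training tokens per parameter (an approximate rule of thumb, not the paper's exact fitted law):

```python
# Rough Chinchilla-style heuristic: compute-optimal training uses
# roughly 20 tokens per model parameter (approximate rule of thumb,
# not the paper's exact fitted scaling law).
TOKENS_PER_PARAM = 20

def optimal_tokens(n_params: float) -> float:
    """Approximate number of training tokens for compute-optimal training."""
    return TOKENS_PER_PARAM * n_params

for n_params in [7e9, 70e9, 540e9]:
    print(f"{n_params/1e9:.0f}B params -> ~{optimal_tokens(n_params)/1e12:.1f}T tokens")
```

Under this heuristic, a 70B-parameter model wants about 1.4T training tokens, which matches the Chinchilla setup and far exceeds what most earlier large models were trained on.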
In addition, these proprietary models cannot be freely analyzed or used by the scientific community. So, first with BLOOM and then with Meta's LLaMA, the community has moved toward open-source models. LLaMA also showed that paying closer attention to the data enables smaller models to compete with larger ones.

On the other hand, small models cannot generalize as well as large ones. This has led to a search for techniques to reduce the cost of large models, such as knowledge distillation, where a large teacher model teaches a smaller student model (see the sketch below). Later approaches try to reduce the cost further by distilling datasets, compressing large training sets into smaller but similarly effective ones.
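
To make the idea concrete, here is a minimal sketch of the classic soft-target distillation loss; the temperature and loss weighting below are illustrative assumptions, not values from any specific paper:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Classic knowledge-distillation loss: the student matches the
    teacher's softened output distribution while also fitting the
    ground-truth labels. temperature and alpha are illustrative."""
    # Soft targets: KL divergence between temperature-softened distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2  # the T^2 factor keeps gradient magnitudes comparable
    # Hard targets: usual cross-entropy on the labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Usage: the teacher is frozen; only the student receives gradients.
student_logits = torch.randn(8, 1000, requires_grad=True)
teacher_logits = torch.randn(8, 1000)
labels = torch.randint(0, 1000, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```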

Another idea for reducing computational cost is a mixture of experts (MoE), where different parts of the network are activated depending on the input. For example, in the Switch Transformer a router selects a different expert (a different set of feed-forward parameters) for each token, so only a fraction of the model's parameters is active at a time.
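
Below is a minimal sketch of switch-style top-1 routing. The dimensions and expert MLPs are illustrative assumptions; a real Switch Transformer additionally uses expert capacity limits and a load-balancing auxiliary loss:

```python
import torch
import torch.nn as nn

class SwitchFFN(nn.Module):
    """Top-1 mixture-of-experts feed-forward layer, in the spirit of the
    Switch Transformer: each token is routed to exactly one expert, so
    only a fraction of the parameters is active per token."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):  # x: (n_tokens, d_model)
        gate = torch.softmax(self.router(x), dim=-1)  # routing probabilities
        weight, expert_idx = gate.max(dim=-1)         # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i                    # tokens routed to expert i
            if mask.any():
                # Scale by the gate value so the router receives gradients.
                out[mask] = weight[mask, None] * expert(x[mask])
        return out

tokens = torch.randn(10, 512)
print(SwitchFFN()(tokens).shape)  # torch.Size([10, 512])
```

Each token passes through exactly one expert MLP, so the per-token compute stays roughly constant even as adding experts grows the total parameter count.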

Source: blog.csdn.net/iCloudEnd/article/details/132663428