Train your own Llama 2! Introduction to large model fine-tuning technology

Trend Cloud

Trend Cloud is a development platform for AI algorithm engineers, providing model development, model training, and data and code management.

Recently, many readers have asked whether Trend Cloud supports training large models. Of course it does!

As soon as the recently released Llama 2 appeared, readers started asking about it. This article introduces Llama 2 and the fine-tuning techniques used for large models. Trend Cloud's GPU pooling technology, together with the related software stack, makes it a strong choice for large-model research and development.


Llama 2: The Android of the Large Model Era

A few days ago, Meta open-sourced Llama 2. This is considered a major event in the large-model field because of the model's strong performance and its open license, which permits both research and commercial use. Some regard it as an open-source alternative to products like ChatGPT and call it the Android of the large-model field.

  • Usage guide: https://ai.meta.com/llama/

  • Open source code: https://github.com/facebookresearch/llama

Llama 2 is a family of pretrained and fine-tuned large language models (LLMs). It is the second generation of large language models released by Meta AI, with parameter counts ranging from 7 billion to 70 billion.

In addition, Llama 2-Chat is a fine-tuned version of Llama 2 optimized for dialogue scenarios. It outperforms open-source chat models on most benchmarks, and according to human evaluation results, Llama 2 performs well in terms of helpfulness and safety.

The Llama 2 models were trained on 2 trillion tokens and have twice the context length of Llama 1. The Llama 2-Chat models were additionally trained on more than 1 million new human annotations.

[Figure: Llama 2 model introduction]

Llama 2 outperforms other open-source language models on many external benchmarks, including tests of reasoning, coding, proficiency, and knowledge.

[Figure: Llama 2 performance on benchmarks]

Although Meta has greatly lowered the barrier to using large models by open-sourcing Llama 2, retraining or fine-tuning Llama 2 on a custom dataset is still not easy. Trend Cloud's powerful computing resources, efficient scheduling, and friendly pricing make it a strong choice for large-model research.

The open-source authors of Llama 2 provide technical guidance on fine-tuning the model; let's take a look.

Large model fine-tuning technology

Parameter-efficient model fine-tuning

Representative methods of this type include LoRA (https://arxiv.org/pdf/2106.09685.pdf), LLaMA-Adapter, and prefix-tuning. They freeze the entire model during fine-tuning and add a small number of learnable parameters or small networks to the model's layers; only this added part is updated during training.

This approach essentially uses the large model as a feature extractor. Because the model's own huge number of parameters does not need to be updated, the compute cost is low, and fine-tuning can even be done on a single consumer-grade graphics card.

If your use case is not far from what an already trained large model can do, you should try this approach first.

The benefits of this fine-tuning strategy are obvious:

1. Low fine-tuning cost: there is no need to handle large-scale computation and communication, and it can even run on consumer-grade graphics cards;

2. Low deployment cost: the deployed base model can be reused, so there is no need to deploy a separate full model for each new task;

3. It mitigates catastrophic forgetting: the large model does not lose its ability on previously learned tasks after learning a new one.

The open-source authors of Llama 2 state that they use the PEFT library to fine-tune the model. For an introduction to the library and usage instructions, see:

https://github.com/huggingface/peft

https://huggingface.co/blog/peft
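
Below is a minimal sketch of parameter-efficient fine-tuning with LoRA via the PEFT library. It is not the authors' exact recipe: it assumes access to the gated meta-llama/Llama-2-7b-hf checkpoint on Hugging Face, and the LoRA hyperparameters are illustrative.

```python
# Minimal LoRA sketch with Hugging Face PEFT (assumed setup, not the official recipe).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

model_name = "meta-llama/Llama-2-7b-hf"  # gated checkpoint; requires access approval
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# LoRA adds small low-rank matrices to the attention projections;
# the original 7B parameters stay frozen.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # rank of the low-rank update (illustrative)
    lora_alpha=16,                        # scaling factor (illustrative)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # Llama attention projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of all parameters
```

After wrapping, the model can be trained with the usual transformers Trainer or a plain PyTorch loop, and only the small adapter weights need to be saved and deployed.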

Fine-tuning all or some of the model's parameters

Adjusting all or part of the parameters of the model itself also has its own advantages. Common approaches include:

1. Freeze the backbone of the pre-trained model, and only fine-tune the task layer, such as the classifier part;

2. Freeze the backbone of the pre-trained model, add a fully connected layer, and fine-tune the new part;

3. Fine-tune all layers of the model.

[Figure: Fine-tune the task layer]

[Figure: Fine-tune the new layer]

[Figure: Fine-tune all layers]

You can, of course, also choose to fine-tune only a small subset of the model's layers, and there is some published guidance on how to make that choice, but the approaches above are more common.
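
As a concrete illustration of the first two options, the sketch below freezes a pretrained backbone and trains only a newly added classification head. The backbone name and number of classes are placeholders chosen for brevity.

```python
# Sketch: freeze the pretrained backbone and train only a new task head.
# Backbone name and class count are illustrative placeholders.
import torch
import torch.nn as nn
from transformers import AutoModel

class FrozenBackboneClassifier(nn.Module):
    def __init__(self, backbone_name: str = "bert-base-uncased", num_classes: int = 2):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        for p in self.backbone.parameters():
            p.requires_grad = False          # freeze all backbone weights
        hidden = self.backbone.config.hidden_size
        self.classifier = nn.Linear(hidden, num_classes)  # only this layer is trained

    def forward(self, input_ids, attention_mask=None):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]  # use the first token's representation
        return self.classifier(pooled)

model = FrozenBackboneClassifier()
# Pass only the trainable parameters to the optimizer.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
```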

At this point a single GPU is often not enough. Fully fine-tuning Llama 2 7B, for example, needs far more GPU memory than one consumer card provides; how much you need depends on the number of trainable parameters, the training strategy, and the parameter precision.
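
For a rough sense of scale, the back-of-the-envelope estimate below counts only weights, gradients, and Adam optimizer state for a 7B-parameter model (activations excluded). The byte counts assume bf16 weights and gradients with fp32 optimizer moments; they are approximations, not figures from the source.

```python
# Back-of-the-envelope memory estimate for full fine-tuning of a 7B model.
# Assumes bf16 weights and gradients plus fp32 Adam moments; activations are ignored.
params = 7e9
weights_gb = params * 2 / 1e9        # bf16: 2 bytes per parameter   ~ 14 GB
grads_gb   = params * 2 / 1e9        # bf16 gradients                ~ 14 GB
adam_gb    = params * (4 + 4) / 1e9  # fp32 first and second moments ~ 56 GB
print(f"~{weights_gb + grads_gb + adam_gb:.0f} GB before activations")  # ~84 GB
```

Even before activations, this exceeds a single 80 GB card, which is why sharding the training state across GPUs becomes necessary.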

The open-source authors of Llama 2 note that PyTorch's FSDP (Fully Sharded Data Parallel) package can help here: it makes it possible to train, across multiple GPUs, models that cannot be trained on a single GPU.

FSDP shards not only the data but also the model parameters, gradients, and optimizer state. Each GPU holds only one shard of the model, which saves a large amount of memory and makes it possible to fit larger models across multiple GPUs.

In addition, several FSDP features can be used to further improve fine-tuning performance:

1. Mixed precision: FSDP provides a flexible way to set the precision of model parameters, buffers, and gradients;

2. Activation checkpointing: saves memory by discarding intermediate activations during the forward pass and recomputing them during the backward pass;

3. auto_wrap_policy: lets users specify how FSDP shards the model, including built-in support for Transformer layers, which helps FSDP form finer-grained communication units and reduce communication cost.

In short, FSDP is a proven and effective tool for full-parameter fine-tuning of large models.
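
The snippet below is a condensed sketch of wrapping a model with PyTorch FSDP while enabling the three features just listed: mixed precision, activation checkpointing, and a Transformer auto-wrap policy. It is not the exact configuration from llama-recipes; the layer class, dtypes, and launch setup are illustrative and assume a recent PyTorch (2.0 or later) run under torchrun.

```python
# Condensed FSDP sketch: shard parameters, gradients, and optimizer state across GPUs.
# Assumes torchrun provides the process-group environment; settings are illustrative.
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing, checkpoint_wrapper, CheckpointImpl,
)
from transformers import AutoModelForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# 1. Mixed precision: keep parameters, reductions, and buffers in bf16.
mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

# 2. auto_wrap_policy: shard at the granularity of Transformer decoder layers.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={LlamaDecoderLayer},
)

model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    mixed_precision=mp_policy,
    device_id=torch.cuda.current_device(),
)

# 3. Activation checkpointing: recompute activations during the backward pass.
apply_activation_checkpointing(
    model,
    checkpoint_wrapper_fn=functools.partial(
        checkpoint_wrapper, checkpoint_impl=CheckpointImpl.NO_REENTRANT
    ),
    check_fn=lambda m: isinstance(m, LlamaDecoderLayer),
)
```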

Trend Cloud's large-memory GPUs help you train large models with ease

Trend Cloud offers large-memory GPU products with up to 80 GB of video memory:

[Figure: The 80 GB GPU costs only 8.49 yuan per hour]

Paired with a complete development and training platform built on GPU pooling technology, it can cut costs by up to 75% and raise R&D efficiency by 55%, making it a strong choice for training and deploying large models. Register now to get computing power!

References: https://github.com/facebookresearch/llama-recipes/blob/main/docs/LLM_finetuning.md


Source: https://blog.csdn.net/sinat_37574187/article/details/132000474