In the era of AIGC, how can fine-tuning of large models play its biggest role?

The rapid development of artificial intelligence has promoted the widespread application of large models, and their results in language, vision, and speech keep improving. However, training a large model requires enormous computing resources and time, and to avoid repeating that expense, fine-tuning has become a popular technique. Fine-tuning means adapting a pre-trained model to a new task by training it on a small dataset. The rise of AIGC (AI-generated content) has further accelerated the adoption of large models and the demand for adapting them to specific applications. This article introduces methods for fine-tuning large models in the AIGC setting, including fine-tuning all layers, fine-tuning the top layers, freezing the bottom layers, layer-by-layer fine-tuning, and transfer learning. We will use the open-source PaddlePaddle framework to illustrate the principles and implementation steps of these methods, taking natural language processing and computer vision as examples.

Among AIGC large models, the one we are currently most familiar with is ChatGPT. Researchers abroad are also exploring the computer vision side: upload a picture or a video, tell the model which visual task to perform, and it can meet the corresponding visual requirements.

For models of this scale, open-source versions are not yet available to us, and the parameter counts are so large that training one from scratch would exhaust our machines. So, when facing different business scenarios, our usual approach is to fine-tune an existing large model to build our application.

Fine-tuning methods

In deep learning, fine-tuning is an important technique for improving the performance of pre-trained models. Besides ChatGPT, there are many other pre-trained models that can be fine-tuned. Here are some common ways to fine-tune a pre-trained model; a compact sketch of these strategies follows the list:

  • Fine-tune all layers: fine-tune every layer of the pre-trained model to adapt it to the new task.
  • Fine-tune the top layers: only fine-tune the top layers of the pre-trained model for the new task.
  • Freeze the bottom layers: keep the bottom layers of the pre-trained model fixed and fine-tune only the top layers.
  • Layer-by-layer fine-tuning: starting from the bottom layer, fine-tune the pre-trained model one layer at a time until all layers have been fine-tuned.
  • Transfer learning: transfer the knowledge of the pre-trained model to the new task to improve performance; this usually combines fine-tuning the top layers with freezing the bottom layers.
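These five strategies mostly differ in which parameters are left trainable. The following is only a rough sketch, using a toy two-part network that is not tied to any real pre-trained model, of how the strategies translate into Paddle's param.trainable flag; concrete examples with real models follow later in this article.

import paddle
from paddle import nn

# A toy two-part model: a "bottom" backbone and a "top" head (purely illustrative)
backbone = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
head = nn.Linear(64, 10)
model = nn.Sequential(backbone, head)

def set_trainable(layer, flag):
    """Enable or disable gradient updates for every parameter of a sub-layer."""
    for param in layer.parameters():
        param.trainable = flag

# 1. Fine-tune all layers: everything is trainable
set_trainable(model, True)

# 2./3. Fine-tune only the top layer while freezing the bottom layers
set_trainable(backbone, False)
set_trainable(head, True)

# 4. Layer-by-layer fine-tuning: unfreeze and train one part at a time, from the bottom up
set_trainable(model, False)
for part in [backbone, head]:
    set_trainable(part, True)
    # ... train for a few epochs here before unfreezing the next part ...

# 5. Transfer learning usually combines 2. and 3.: reuse the frozen backbone and train a new head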

Parameter fine-tuning of a model can be understood this way. Take the original ChatGPT as an example: it is like a general-purpose large model.

It is like a college student who has learned all the professional knowledge of their major. Based on past study and life experience, they already have their own set of learning methods and thinking logic (these are the model's parameters).

Now suppose this student graduates and goes to work in a particular industry. They must learn the content of the job in order to produce results. In that process, can they apply the learning methods they picked up in college? Can they use the same methods to learn the material at work? (This is fine-tuning.)

Fine-tuning, then, is applying what has been learned in the past to new content in order to produce new results.

Going back to fine-tuning different layers: how do you choose which layers need to be fine-tuned? You need to know what experience each layer of the model has learned from the original dataset, and whether that experience can be reused on another dataset.

So, in a neural network, which layers learn which kinds of experience?

In computer vision, Convolutional Neural Networks (CNNs) usually learn the following experiences:

  • Local perception: CNNs can learn local features, such as edges and textures, through convolution operations and pooling operations, thereby achieving local perception of images.
  • Translation invariance: CNNs can learn the invariance of features to translation, so that for different parts of the same object, CNNs can generate similar feature representations.
  • Hierarchical abstraction: CNNs can learn more and more abstract features through multiple layers of convolution, from low-level features such as edges up to high-level features such as object parts or whole objects (see the sketch after this list).
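To see this hierarchy concretely, one simple check (shown only as a sketch) is to print the top-level sub-layers of a CNN such as ResNet50 in Paddle: the early conv/bn layers correspond to low-level edge and texture features, the deeper residual stages to increasingly abstract part- and object-level features, and the final fc layer to the task head.

import numpy as np
import paddle

# Build a ResNet50 backbone (set pretrained=True to download ImageNet weights)
model = paddle.vision.models.resnet50(pretrained=False)

# Walk the top-level sub-layers: conv1/bn1 capture low-level features,
# layer1..layer4 capture increasingly abstract features, fc is the classification head
for name, sublayer in model.named_children():
    num_params = sum(int(np.prod(p.shape)) for p in sublayer.parameters())
    print(f'{name:10s} {type(sublayer).__name__:15s} {num_params} parameters')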

In addition, attention-based models have emerged in recent years, such as the self-attention mechanism and the Transformer module. They compute a weighted sum of image features using learned attention weights, which allows more refined feature extraction.
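To make "weighting and summing features with learned attention weights" concrete, here is a minimal, self-contained sketch of scaled dot-product self-attention over a toy batch of feature vectors. No pre-trained weights are involved and all sizes are arbitrary.

import paddle
import paddle.nn.functional as F

# A toy batch of 16 feature vectors (e.g. flattened image patches), 64-dim each
features = paddle.randn([1, 16, 64])

# In self-attention, queries, keys and values are linear projections of the features
q_proj = paddle.nn.Linear(64, 64)
k_proj = paddle.nn.Linear(64, 64)
v_proj = paddle.nn.Linear(64, 64)
q, k, v = q_proj(features), k_proj(features), v_proj(features)

# Attention weights: similarity of every position to every other position
scores = paddle.matmul(q, k, transpose_y=True) / (64 ** 0.5)   # shape [1, 16, 16]
weights = F.softmax(scores, axis=-1)

# Each output is a weighted sum of the value vectors -> refined features
refined = paddle.matmul(weights, v)                            # shape [1, 16, 64]
print(refined.shape)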

In natural language processing, Recurrent Neural Networks (RNNs) and Transformer networks usually learn the following experiences:

  • Sequential dependencies: RNNs and Transformer networks can learn order (timing) dependencies in text sequences, modeling text as an ordered sequence (see the sketch after this list).
  • Hierarchical abstraction: RNNs and Transformer networks can implement hierarchical abstraction of text through multi-layer neural networks, thereby learning higher-level text feature representations.
  • Context dependence: RNNs and Transformer networks can learn context dependencies, so that context-based text feature representations can be generated, thereby improving the performance of the model.
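As a small illustration of the sequential/context-dependence point (a sketch only, with arbitrary numbers), GPT-style Transformers enforce it with a causal attention mask, so that each position may only attend to earlier positions and a token's representation is tied to its left context:

import paddle
import paddle.nn.functional as F

seq_len = 5
scores = paddle.randn([seq_len, seq_len])        # raw attention scores for a toy sequence

# Causal mask: position i may only attend to positions <= i
mask = paddle.triu(paddle.full([seq_len, seq_len], -1e9), diagonal=1)
weights = F.softmax(scores + mask, axis=-1)

# Row i of `weights` is (almost) zero for every future position j > i
print(weights.numpy().round(2))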

It can be said that, for both computer vision and natural language processing tasks, it is the upstream part of the model that accumulates this learned experience.

However, when fine-tuning models, computer vision and natural language processing differ in one respect:

In computer vision, different image datasets may require completely different learned experience. In natural language processing, by contrast, the experience learned from different texts is often the same, because text data and features depend more on context and linguistic order, and those properties carry over to texts with different content. (For example, writing a thesis and writing an essay have a great deal in common as writing tasks.)

Model fine-tuning with Paddle

The following is the sample code for the above five fine-tuning methods using the PaddlePaddle framework:

import paddle
from paddle import nn

# Load a pre-trained Transformer model
# NOTE: this line is only a schematic placeholder for "some pre-trained model with an
# encoder and a decoder"; substitute the pre-trained model you actually use.
pretrained_model = paddle.vision.models.Transformer()

# 1. Fine-tune all layers
for param in pretrained_model.parameters():
    param.trainable = True

# 2. Fine-tune the top layers (in practice, the remaining layers are frozen first, as in 3.)
for param in pretrained_model.decoder.parameters():
    param.trainable = True

# 3. Freeze the bottom layers
for param in pretrained_model.encoder.parameters():
    param.trainable = False

# 4. Layer-by-layer fine-tuning
for i, layer in enumerate(pretrained_model.encoder.layers):
    if i >= 6:  # only fine-tune layer 6 and above
        for param in layer.parameters():
            param.trainable = True
    else:
        for param in layer.parameters():
            param.trainable = False

# 5. Transfer learning
# Load a pre-trained ResNet50 backbone
pretrained_model = paddle.vision.models.resnet50(pretrained=True)

# Freeze all layers of the pre-trained model
for param in pretrained_model.parameters():
    param.trainable = False

# Build a new classification head (ResNet50's final fc layer takes 2048-dimensional features)
num_classes = 10
classifier = nn.Linear(2048, num_classes)

# Swap the new head in for the original fc layer; only its parameters are fine-tuned
pretrained_model.fc = classifier
for param in classifier.parameters():
    param.trainable = True

# The resulting model: frozen backbone + trainable classifier head
model = pretrained_model

In the above code, we first load a pre-trained Transformer model (used here only as a placeholder) and then fine-tune or freeze different groups of its layers according to each fine-tuning method. Finally, for the transfer-learning case, we freeze a pre-trained ResNet50 backbone and attach a new classification head, which is the only part that is trained.
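Whichever strategy is chosen, it is worth verifying which parameters will actually be updated before training starts; a quick check that works for any Paddle model is:

# Count trainable vs. frozen parameter tensors to confirm the freezing logic did what we expect
trainable, frozen = 0, 0
for name, param in model.named_parameters():
    if param.trainable:
        trainable += 1
    else:
        frozen += 1
print(f'trainable parameter tensors: {trainable}, frozen: {frozen}')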

Five fine-tuning methods for ChatGPT-style GPT models using Paddle

Fine-tune all layers

import paddle
# NOTE: depending on the PaddleNLP version, these classes may be named GPT... rather than GPT2...
from paddlenlp.transformers import GPT2ForPretraining, GPT2PretrainingCriterion, GPT2Tokenizer

# Load the pre-trained model and tokenizer
model = GPT2ForPretraining.from_pretrained('gpt2-medium-en')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium-en')

# Define a new classification head
class_num = 2
cls = paddle.nn.Linear(model.config["hidden_size"], class_num)

# Attach the new classification head to the model
model.cls = cls

# Adapt to the new task by fine-tuning all layers (all parameters go into the optimizer)
optimizer = paddle.optimizer.Adam(learning_rate=1e-5, parameters=model.parameters())
criterion = GPT2PretrainingCriterion()

Fine-tune the top layer

import paddle
# NOTE: depending on the PaddleNLP version, these classes may be named GPT... rather than GPT2...
from paddlenlp.transformers import GPT2ForPretraining, GPT2Tokenizer

# Load the pre-trained model and tokenizer
model = GPT2ForPretraining.from_pretrained('gpt2-medium-en')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium-en')

# Freeze the lower layers of the model; only the top layer will be fine-tuned
for param in model.parameters():
    param.trainable = False

# Define a new classification head
class_num = 2
cls = paddle.nn.Linear(model.config["hidden_size"], class_num)

# Attach the new classification head to the model
model.cls = cls

# Adapt to the new task by fine-tuning only the top (classification) layer
for param in model.cls.parameters():
    param.trainable = True
optimizer = paddle.optimizer.Adam(learning_rate=1e-5, parameters=model.cls.parameters())
criterion = paddle.nn.CrossEntropyLoss()
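To actually train the new classification head, a single step could look roughly like the following. This is only a sketch: input_ids and labels are assumed to come from your own data pipeline, and model.gpt2 is a hypothetical attribute name for the Transformer backbone; check the model definition in your PaddleNLP version for the real attribute name and output format.

# Hypothetical single training step for the top-layer fine-tuning setup above.
# `input_ids`: [batch, seq_len] token ids; `labels`: [batch] class ids (both assumed to exist).
hidden_states = model.gpt2(input_ids)   # hypothetical backbone call returning [batch, seq_len, hidden]
pooled = hidden_states[:, -1, :]        # use the last token's hidden state as a sequence summary
logits = model.cls(pooled)              # the new classification head added above
loss = criterion(logits, labels)
loss.backward()
optimizer.step()                        # only model.cls parameters are in the optimizer
optimizer.clear_grad()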

Freeze bottom layer

import paddle
import paddle.nn.functional as F
from paddlenlp.transformers import GPTForPretraining, GPTChineseTokenizer

# Load the pre-trained model and tokenizer
model = GPTForPretraining.from_pretrained('gpt-cpm-large-cn')
tokenizer = GPTChineseTokenizer.from_pretrained('gpt-cpm-large-cn')

# Build a toy dataset (a few Chinese sentences) as a list of {'text': ...} examples
train_texts = ['今天天气不错', '明天要下雨', '这个季节很适合旅游']
train_ds = [{'text': text} for text in train_texts]

# Simple manual batching helper (used instead of paddle.io.DataLoader, whose default
# collation is not designed for raw-text dictionaries)
def batch_iter(data, batch_size):
    num_batches = len(data) // batch_size
    if len(data) % batch_size != 0:
        num_batches += 1
    for i in range(num_batches):
        batch = data[i * batch_size: (i + 1) * batch_size]
        yield batch

batch_size = 2

# Build the optimizer and loss function
optimizer = paddle.optimizer.AdamW(parameters=model.parameters(), learning_rate=1e-4)
criterion = F.cross_entropy

# Freeze the bottom layers
# NOTE: the attribute that exposes the Transformer blocks depends on the PaddleNLP version
# (it may be e.g. model.gpt.decoder.layers); adjust the path to your installation.
for layer in model.layers[:6]:
    layer.eval()
    for param in layer.parameters():
        param.trainable = False

# Fine-tune the model
for epoch in range(3):
    for batch in batch_iter(train_ds, batch_size):
        texts = [example['text'] for example in batch]
        encoded_inputs = tokenizer(texts, return_attention_mask=True, return_length=True, padding=True)
        input_ids = paddle.to_tensor(encoded_inputs['input_ids'])
        attention_mask = paddle.to_tensor(encoded_inputs['attention_mask'])
        logits = model(input_ids, attention_mask=attention_mask)[0]
        # Language-model loss; for simplicity the inputs are reused as labels without shifting
        loss = criterion(logits.reshape([-1, logits.shape[-1]]), input_ids.reshape([-1]))
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()
    print(f'Epoch {epoch + 1}: loss={float(loss):.4f}')

# Save the fine-tuned model
paddle.save(model.state_dict(), 'gpt-cpm-large-cn-finetuned.pdparams')

Fine-tuning layer by layer

import paddle
import paddle.nn.functional as F
from paddlenlp.transformers import GPTForPretraining, GPTChineseTokenizer

# Load the pre-trained model and tokenizer
model = GPTForPretraining.from_pretrained('gpt-cpm-large-cn')
tokenizer = GPTChineseTokenizer.from_pretrained('gpt-cpm-large-cn')

# Build a toy dataset as a list of {'text': ...} examples
train_texts = ['今天天气不错', '明天要下雨', '这个季节很适合旅游']
train_ds = [{'text': text} for text in train_texts]

# Simple manual batching helper
def batch_iter(data, batch_size):
    num_batches = len(data) // batch_size
    if len(data) % batch_size != 0:
        num_batches += 1
    for i in range(num_batches):
        batch = data[i * batch_size: (i + 1) * batch_size]
        yield batch

batch_size = 2

# Build the optimizer and loss function
optimizer = paddle.optimizer.AdamW(parameters=model.parameters(), learning_rate=1e-4)
criterion = F.cross_entropy

# Fine-tune the model (transfer-learning style: the whole model is trained directly)
for epoch in range(3):
    for batch in batch_iter(train_ds, batch_size):
        texts = [example['text'] for example in batch]
        encoded_inputs = tokenizer(texts, return_attention_mask=True, return_length=True, padding=True)
        input_ids = paddle.to_tensor(encoded_inputs['input_ids'])
        attention_mask = paddle.to_tensor(encoded_inputs['attention_mask'])
        logits = model(input_ids, attention_mask=attention_mask)[0]
        loss = criterion(logits.reshape([-1, logits.shape[-1]]), input_ids.reshape([-1]))
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()
    print(f'Epoch {epoch + 1}: loss={float(loss):.4f}')

# Save the fine-tuned model
paddle.save(model.state_dict(), 'gpt-cpm-large-cn-finetuned-transfer-learning.pdparams')

In the above code, the fine-tuning approach has effectively changed from layer-by-layer fine-tuning to transfer-learning-style fine-tuning. Specifically, the per-layer hidden-state computation and the related code that a true layer-by-layer scheme would need are omitted; the input and attention mask are passed directly into the model, the output of the last layer is taken, and the loss is computed from it for backpropagation and optimization.

At the same time, the file name used when saving the model is changed from gpt-cpm-large-cn-finetuned-layer-wise.pdparams to gpt-cpm-large-cn-finetuned-transfer-learning.pdparams to distinguish layer-by-layer fine-tuning from transfer-learning fine-tuning.
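For reference, a true layer-by-layer (gradual unfreezing) scheme would unfreeze the Transformer blocks in stages and train a little between stages. The sketch below assumes the blocks are reachable as model.gpt.decoder.layers, which may differ across PaddleNLP versions:

# Gradual unfreezing: start with everything frozen, then unfreeze the blocks one stage at a time.
for param in model.parameters():
    param.trainable = False

layers = model.gpt.decoder.layers   # assumed attribute path to the Transformer blocks; adjust if needed
for stage, layer in enumerate(layers):
    for param in layer.parameters():
        param.trainable = True
    # Rebuild the optimizer so it only updates the currently trainable parameters
    optimizer = paddle.optimizer.AdamW(
        parameters=[p for p in model.parameters() if p.trainable],
        learning_rate=1e-4)
    # ... run one or more epochs of the training loop above before unfreezing the next block ...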

Transfer learning

import paddle
import paddle.nn.functional as F
from paddlenlp.transformers import GPTForPretraining, GPTChineseTokenizer

# Load the pre-trained model and tokenizer
model = GPTForPretraining.from_pretrained('gpt-cpm-large-cn')
tokenizer = GPTChineseTokenizer.from_pretrained('gpt-cpm-large-cn')

# Build a toy dataset as a list of {'text': ...} examples
train_texts = ['今天天气不错', '明天要下雨', '这个季节很适合旅游']
train_ds = [{'text': text} for text in train_texts]

# Simple manual batching helper
def batch_iter(data, batch_size):
    num_batches = len(data) // batch_size
    if len(data) % batch_size != 0:
        num_batches += 1
    for i in range(num_batches):
        batch = data[i * batch_size: (i + 1) * batch_size]
        yield batch

batch_size = 2

# Build the optimizer and loss function
optimizer = paddle.optimizer.AdamW(parameters=model.parameters(), learning_rate=1e-4)
criterion = F.cross_entropy

# Train the model
epochs = 3
for epoch in range(epochs):
    for batch in batch_iter(train_ds, batch_size):
        texts = [example['text'] for example in batch]
        encoded_inputs = tokenizer(texts, return_attention_mask=True, return_length=True, padding=True)
        input_ids = paddle.to_tensor(encoded_inputs['input_ids'])
        attention_mask = paddle.to_tensor(encoded_inputs['attention_mask'])
        logits = model(input_ids, attention_mask=attention_mask)[0]
        loss = criterion(logits.reshape([-1, logits.shape[-1]]), input_ids.reshape([-1]))
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()
    print(f'Epoch {epoch + 1}: loss={float(loss):.4f}')

# Save the fine-tuned model
paddle.save(model.state_dict(), 'gpt-cpm-large-cn-finetuned.pdparams')

In the above code, we first load the pre-trained GPT model and tokenizer, then construct a simple dataset and batching helper. Next, we train the model with the AdamW optimizer and the cross-entropy loss function, and save the fine-tuned weights after training.
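After training, the saved weights can be loaded back into a freshly constructed model for inference or further fine-tuning, for example:

# Rebuild the model skeleton and load the fine-tuned weights
model = GPTForPretraining.from_pretrained('gpt-cpm-large-cn')
state_dict = paddle.load('gpt-cpm-large-cn-finetuned.pdparams')
model.set_state_dict(state_dict)
model.eval()   # switch to inference mode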

