Fine-tuning ChatGLM2-6B to solve a binary text classification task

ChatGLM2-6B is the second-generation version of the open-source Chinese-English bilingual conversation model ChatGLM-6B. While retaining many of the first-generation model's strengths, such as smooth conversation and a low deployment threshold, ChatGLM2-6B introduces the following new features:

More powerful performance: hybrid objective function + 1.4T Chinese and English tokens. Building on the development experience of the first-generation ChatGLM model, the base model of ChatGLM2-6B has been comprehensively upgraded. ChatGLM2-6B uses the hybrid objective function of GLM and has undergone pre-training on 1.4T Chinese and English tokens as well as human preference alignment training. Evaluation results show that, compared with the first-generation model, ChatGLM2-6B achieves large improvements on datasets such as MMLU (+23%), C-Eval (+33%), GSM8K (+571%), and BBH (+60%), making it highly competitive among open-source models of the same size.
Longer context: FlashAttention + context length extended to 32K + 8K dialogue training + more rounds of dialogue. Based on FlashAttention, the context length of the base model has been extended from 2K in ChatGLM-6B to 32K, and the dialogue stage is trained with a context length of 8K, allowing more rounds of dialogue. However, the current version of ChatGLM2-6B has limited ability to understand a single round of very long documents; this will be a focus of optimization in subsequent iterations.
More efficient inference: Multi-Query Attention + INT4 quantization. Based on Multi-Query Attention, ChatGLM2-6B has faster inference and lower GPU memory usage: with the official model implementation, inference speed is 42% higher than the first generation, and under INT4 quantization the dialogue length supported by 6 GB of GPU memory increases from 1K to 8K.
A more open license: the ChatGLM2-6B weights are fully open to academic research, and commercial use is also permitted with official written permission. If you find the open-source model useful for your business, donations towards the development of the next-generation model, ChatGLM3, are welcome.

Recently, I tried using the large model ChatGLM2-6B to solve a text classification task. I ran into a few points that need attention during fine-tuning and inference, so this article gives a fairly detailed summary of the experience.

1. Prepare the data set

My dataset contains fields such as title, author, and abstract. I first read the data in CSV format and then convert it into a format the model can handle:

import pandas as pd

train_df = pd.read_csv('./csv_data/train.csv')
test_df = pd.read_csv('./csv_data/test.csv')

# Build the instruction dataset
res = []

for i in range(len(train_df)):
    paper_item = train_df.loc[i]
    tmp = {
        "instruction": "Please judge...",
        # columns are accessed by position: 1 = title, 3 = abstract, 5 = label
        "input": f"title:{paper_item[1]},abstract:{paper_item[3]}",
        "output": str(paper_item[5])
    }
    res.append(tmp)

import json
with open('paper_label.json', mode='w', encoding='utf-8') as f:
    json.dump(res, f, ensure_ascii=False, indent=4)

Note that ensure_ascii=False must be set when storing Chinese text in JSON. This is a Unicode encoding detail: without it, Chinese characters are written as \uXXXX escape sequences and look garbled when the file is read as plain text.
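As a minimal illustration of the difference (this snippet is my own addition, not from the fine-tuning repository):

import json

print(json.dumps({"title": "论文标题"}))                      # {"title": "\u8bba\u6587\u6807\u9898"}
print(json.dumps({"title": "论文标题"}, ensure_ascii=False))  # {"title": "论文标题"}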

2. Fine-tuning ChatGLM

The steps to fine-tune the large model are as follows:

  • First, clone the fine-tuning script: git clone https://github.com/KMnO4-zx/huanhuan-chat.git
  • Enter the directory and install the environment: cd ./huanhuan-chat; pip install -r requirements.txt
  • Replace model_name_or_path in the script with your local chatglm2-6b model path, then run the script: sh xfg_train.sh
  • Fine-tuning takes about two hours (I ran it on an Alibaba Cloud A10-24G for roughly two hours) and requires about 18 GB of GPU memory; a card with 24 GB of GPU memory, such as a 3090 or 4090, is recommended.
  • Of course, the trained LoRA weights are already provided in the repository, so you can also run the code below directly.
  • More fine-tuning details will continue to be updated in this repository; feel free to follow and star it! https://github.com/KMnO4-zx/huanhuan-chat.git

Use the ChatGLM fine-tuning script to fine-tune the ChatGLM2-6B model on the text consisting of titles and abstracts. The points to watch out for here are:

  • Fine-tuning requires a lot of GPU compute; prepare a high-end GPU with at least 24 GB of memory.

  • Fine-tuning requires specifying the correct model path, otherwise errors will occur.

  • If you run into out-of-memory errors, reduce the batch size appropriately (see the sketch after this list).
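As a minimal sketch of what reducing the batch size usually looks like in a Hugging Face Trainer-based fine-tuning script (the fields below are standard transformers.TrainingArguments parameters; the actual option names and values used inside xfg_train.sh may differ, and the output directory here is hypothetical):

from transformers import TrainingArguments

# Halving the per-device batch size and compensating with gradient accumulation
# keeps the effective batch size the same while lowering peak GPU memory.
training_args = TrainingArguments(
    output_dir="./output/label_xfg",      # hypothetical output directory
    per_device_train_batch_size=1,        # reduce this first when memory is tight
    gradient_accumulation_steps=16,       # effective batch size = 1 * 16
    learning_rate=1e-4,
    num_train_epochs=1,
    fp16=True,                            # half precision also reduces memory usage
    logging_steps=10,
)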

3. Load fine-tuned weights for prediction

Fine-tuning a pre-trained language model is a typical application of transfer learning: we want the model to learn a specific downstream task instead of training it from scratch. For fine-tuning I chose a technique called LoRA. The basic idea is to freeze the pre-trained weights and inject small trainable low-rank matrices into selected weight matrices (typically the attention projections), so that only a small fraction of the parameters is updated on the downstream task's dataset. This approach adapts the pre-trained model to the downstream task at low cost and has worked well in many NLP competitions.
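For reference, here is a minimal sketch of how a LoRA adapter is typically attached with the peft library. The rank, alpha, dropout, and target module names below are illustrative assumptions, not the settings used by the repository's training script:

from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModel

base = AutoModel.from_pretrained("chatglm2-6b", trust_remote_code=True).half().cuda()

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the LoRA update
    lora_dropout=0.1,
    target_modules=["query_key_value"],   # ChatGLM-style fused attention projection
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()        # only the LoRA matrices are trainable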

Use peft to load the LoRA weights obtained from fine-tuning and construct a prediction function. The code is as follows:

from peft import PeftModel
from transformers import AutoTokenizer, AutoModel

model_path = "chatglm2-6b"
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).half().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Load the LoRA weights
model = PeftModel.from_pretrained(model, 'huanhuan-chat/output/label_xfg').half()
model = model.eval()

# Quick sanity check that the fine-tuned model still chats normally
response, history = model.chat(tokenizer, "你好", history=[])
print(response)

4. Prediction

When loading the fine-tuned language model for prediction, remember to switch the model to eval mode. This is a PyTorch detail: eval mode fixes the behavior of BatchNorm and Dropout layers, which makes predictions stable. In addition, to get an essentially deterministic prediction, you can set a very low temperature such as temperature=0.01, so the model effectively takes the argmax token.

# Prediction function
def predict(text):
    response, history = model.chat(
        tokenizer,
        f"Please judge whether it is a medical field paper according to the given paper title and abstract, output 1 or 0, the following is the paper title, author and abstract -->{text}",
        history=[],
        temperature=0.01
    )
    return response

predict('title:Seizure Detection and Prediction by Parallel Memristive Convolutional Neural Networks,author:Li, Chenqi; Lammie, Corey; Dong, Xuening; Amirsoleimani, Amirali; Azghadi, Mostafa Rahimi; Genov, Roman,abstract:During the past two decades, epileptic seizure detection and prediction algorithms have evolved rapidly. However, despite significant performance improvements, their hardware implementation using conventional technologies, such as Complementary Metal-Oxide-Semiconductor (CMOS), in power and areaconstrained settings remains a challenging task; especially when many recording channels are used. In this paper, we propose a novel low-latency parallel Convolutional Neural Network (CNN) architecture that has between 2-2,800x fewer network parameters compared to State-Of-The-Art (SOTA) CNN architectures and achieves 5-fold cross validation accuracy of 99.84% for epileptic seizure detection, and 99.01% and 97.54% for epileptic seizure prediction, when evaluated using the University of Bonn Electroencephalogram (EEG), CHB-MIT and SWEC-ETHZ seizure datasets, respectively. We subsequently implement our network onto analog crossbar arrays comprising Resistive Random-Access Memory (RRAM) devices, and provide a comprehensive benchmark by simulating, laying out, and determining hardware requirements of theCNNcomponent of our system. We parallelize the execution of convolution layer kernels on separate analog crossbars to enable 2 orders of magnitude reduction in latency compared to SOTA hybrid Memristive-CMOS Deep Learning (DL) accelerators. Furthermore, we investigate the effects of non-idealities on our system and investigate Quantization Aware Training (QAT) to mitigate the performance degradation due to lowAnalog-to-Digital Converter (ADC)/Digital-to-Analog Converter (DAC) resolution. Finally, we propose a stuck weight offsetting methodology to mitigate performance degradation due to stuck RON/ROFF memristor weights, recovering up to 32% accuracy, without requiring retraining. The CNN component of our platform is estimated to consume approximately 2.791Wof power while occupying an area of 31.255 mm(2) in a 22 nm FDSOI CMOS process.')

# Predict on the test set

from tqdm import tqdm

label = []

for i in tqdm(range(len(test_df))):
    test_item = test_df.loc[i]
    # columns are accessed by position: 1 = title, 2 = author, 3 = abstract
    test_input = f"title:{test_item[1]},author:{test_item[2]},abstract:{test_item[3]}"
    label.append(int(predict(test_input)))

test_df['label'] = label

submit = test_df[['uuid', 'Keywords', 'label']]

submit.to_csv('submit.csv', index=False)
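One practical caveat: int(predict(test_input)) assumes the model returns exactly "0" or "1". A generative model can occasionally emit extra text around the label, so a defensive parsing wrapper (my own addition, not part of the original code) avoids crashing the whole loop:

def predict_label(text, default=0):
    """Run predict() and extract a 0/1 label, falling back to a default."""
    response = predict(text)
    # Keep the first '0' or '1' that appears in the response, if any.
    for ch in response:
        if ch in ("0", "1"):
            return int(ch)
    return default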

5. Some lessons I have learned from fine-tuning Transformer-family models in NLP competitions:

  • Make full use of pre-trained models: today's pre-trained language models already extract powerful semantic features, and direct fine-tuning often achieves good results.
  • Try more fine-tuning techniques: for example, parameter-efficient fine-tuning with LoRA can improve results while updating far fewer parameters.
  • Carefully design the prompt: an appropriate prompt (instruction template) for the task helps the model capture the characteristics of the downstream task.
  • Run multiple validation experiments: sweep combinations of hyperparameters such as model size, prompt length, and batch size to find the best configuration.
  • Watch out for overfitting: large models overfit easily; use strategies such as early stopping, or augment the training data.
  • Ensembling tricks: combining statistical features or stacking the outputs of multiple models can further improve the score.
